1–6 Jun 2025
Taipei International Convention Center (TICC)
Asia/Taipei timezone

The journey towards a specialized text embedding model for accelerator physics

THPM023
5 Jun 2025, 15:30
2h
Exhibition Hall A _Magpie (TWTC)

Poster Presentation MC6.D13 Machine Learning Thursday Poster Session

Speaker

Thorsten Hellert (Lawrence Berkeley National Laboratory)

Description

We present PhysBERT and AccPhysBERT, specialized sentence-embedding models trained on 1.2 million arXiv physics papers and fine-tuned for accelerator physics, respectively. Evaluation across retrieval, clustering, and similarity tasks shows gains of up to 12% over general-purpose models for physics corpora and 18% for accelerator-specific tasks. Applications include semantic reviewer–paper matching, Retrieval-Augmented Generation for control-room logbooks, and rapid sub-domain adaptation. We analyze key design choices (data curation, masking objectives, and contrastive fine-tuning) and outline strategies for continual adaptation, providing a blueprint for domain-specific embeddings in the physical sciences.
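
As a rough illustration of the semantic reviewer–paper matching mentioned above, the following minimal sketch ranks reviewers by cosine similarity between embedding vectors. It is not the authors' implementation: the function names, the reviewer labels, and the three-dimensional vectors are hypothetical placeholders standing in for embeddings that a sentence-embedding model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_reviewers(paper_vec, reviewer_vecs):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [
        (name, cosine_similarity(paper_vec, vec))
        for name, vec in reviewer_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors (assumed, not real model output): in practice each vector
# would be the embedding of an abstract or of a reviewer's publications.
paper = [0.9, 0.1, 0.0]
reviewers = {
    "reviewer_a": [0.8, 0.2, 0.1],
    "reviewer_b": [0.0, 0.1, 0.9],
    "reviewer_c": [0.5, 0.5, 0.0],
}

ranking = rank_reviewers(paper, reviewers)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, reviewer_a ranks first, followed by reviewer_c and reviewer_b. The same ranking step applies to retrieval over logbook entries, with the query embedding in place of the paper embedding.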

Region represented: America
Paper preparation format: LaTeX

Author

Thorsten Hellert (Lawrence Berkeley National Laboratory)

Co-authors

Andrea Pollastro (Lawrence Berkeley National Laboratory)
Mr João Montenegro (Lawrence Berkeley National Laboratory)
Marco Venturini (Lawrence Berkeley National Laboratory)
