Speaker
Thorsten Hellert
(Lawrence Berkeley National Laboratory)
Description
We present PhysBERT, a specialized sentence-embedding model trained on 1.2 million arXiv physics papers, and AccPhysBERT, its variant fine-tuned for accelerator physics. Evaluation across retrieval, clustering, and similarity tasks shows gains of up to 12% over general-purpose models on physics corpora and 18% on accelerator-specific tasks. Applications include semantic reviewer–paper matching, Retrieval-Augmented Generation for control-room logbooks, and rapid sub-domain adaptation. We analyze key design choices (data curation, masking objectives, and contrastive fine-tuning) and outline strategies for continual adaptation, providing a blueprint for domain-specific embeddings in the physical sciences.
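As a minimal sketch of the retrieval and reviewer–paper matching use cases described above, the snippet below embeds candidate papers and a query with a sentence-embedding model and ranks them by cosine similarity. The checkpoint name `thellert/physbert_cased` is an assumption about how the model might be published; substitute the actual released identifier.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint name -- the abstract does not state the published
# model identifier; replace with the actual PhysBERT/AccPhysBERT release.
model = SentenceTransformer("thellert/physbert_cased")

# Illustrative candidate abstracts and a query (hypothetical examples).
papers = [
    "Beam dynamics simulations for a fourth-generation storage ring.",
    "Topological phases in two-dimensional quantum materials.",
]
query = "orbit correction in synchrotron light sources"

# Encode both sides into a shared embedding space.
paper_emb = model.encode(papers, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks candidates against the query -- the core
# operation behind semantic retrieval and reviewer-paper matching.
scores = util.cos_sim(query_emb, paper_emb)
print(scores)
```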
Region represented
America
Paper preparation format
LaTeX
Author
Thorsten Hellert
(Lawrence Berkeley National Laboratory)
Co-authors
Andrea Pollastro
(Lawrence Berkeley National Laboratory)
Mr João Montenegro
(Lawrence Berkeley National Laboratory)
Marco Venturini
(Lawrence Berkeley National Laboratory)