1–6 Jun 2025
Taipei International Convention Center (TICC)
Asia/Taipei timezone

Developing specialized text embedding models for accelerator physics

THPM023
5 Jun 2025, 15:30
2h
Exhibition Hall A _Magpie (TWTC)
Poster Presentation · MC6.D13 Machine Learning · Thursday Poster Session

Speaker

Thorsten Hellert (Lawrence Berkeley National Laboratory)

Description

The specialized terminology and complex concepts inherent in physics present significant challenges for Natural Language Processing (NLP), particularly when relying on general-purpose models. In this talk, I will discuss the development of physics-specific text embedding models designed to overcome these obstacles, beginning with PhysBERT—the first model pre-trained exclusively on a curated corpus of 1.2 million arXiv physics papers. Building upon this foundation, we turn our attention to accelerator physics, a subfield with even more intricate language and concepts. To effectively capture the nuances of this domain, we developed AccPhysBERT, a sentence embedding model fine-tuned specifically for accelerator physics literature. A key aspect of this development involved leveraging Large Language Models (LLMs) extensively to generate annotated training data, enabling AccPhysBERT to facilitate advanced NLP applications such as semantic paper-reviewer matching and integration into Retrieval-Augmented Generation systems.
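The semantic paper-reviewer matching mentioned above can be illustrated with a minimal sketch: embed each paper abstract and each reviewer profile as a vector, then rank reviewers by cosine similarity. The toy vectors and function names below are illustrative assumptions; in practice the embeddings would come from a model such as AccPhysBERT.

```python
# Minimal sketch of embedding-based paper-reviewer matching.
# The embeddings here are hand-made toy vectors, not real model output.
import numpy as np

def cosine_similarity_matrix(papers: np.ndarray, reviewers: np.ndarray) -> np.ndarray:
    """Cosine similarity between each paper row and each reviewer row."""
    p = papers / np.linalg.norm(papers, axis=1, keepdims=True)
    r = reviewers / np.linalg.norm(reviewers, axis=1, keepdims=True)
    return p @ r.T

def match_reviewers(papers: np.ndarray, reviewers: np.ndarray, top_k: int = 1) -> np.ndarray:
    """For each paper, return indices of the top_k most similar reviewer profiles."""
    sims = cosine_similarity_matrix(papers, reviewers)
    return np.argsort(-sims, axis=1)[:, :top_k]

# Toy example: 2 papers and 3 reviewer profiles in a 4-dimensional embedding space.
papers = np.array([[1.0, 0.0, 0.0, 0.1],
                   [0.0, 1.0, 0.2, 0.0]])
reviewers = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 0.8, 0.3, 0.0],
                      [0.1, 0.1, 0.9, 0.1]])
print(match_reviewers(papers, reviewers, top_k=1))  # best reviewer index per paper
```

The same similarity scores can feed a Retrieval-Augmented Generation pipeline, where the top-ranked documents (rather than reviewers) are passed to an LLM as context.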

Region represented: America
Paper preparation format: LaTeX

Author

Thorsten Hellert (Lawrence Berkeley National Laboratory)

Co-authors

Andrea Pollastro (Lawrence Berkeley National Laboratory)
Marco Venturini (Lawrence Berkeley National Laboratory)

Presentation materials

There are no materials yet.