
Enhancing Chemical Entity Recognition and Contextual Understanding from Unstructured Text Using Regular Expressions and Large Language Models Based Multi-Model Pipelines  

 

This paper demonstrates how a pre-trained named entity recognition (NER) model, regular expressions, and a large language model (LLM) can be combined to identify and extract key information from unstructured text, such as starting materials, suppliers, and supplier-related details, without extensive ground-truth datasets. The framework serves as an alternative to methods such as fine-tuning, retrieval-augmented generation, or rule-based extraction, improving both the efficiency and the comprehensiveness of extraction while reducing cost and reliance on large-scale labeled data.
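To make the three-stage idea concrete, the following is a minimal sketch of how such a pipeline might be wired together. It is not the paper's implementation. Every name in it (`SUPPLIER_RE`, `find_suppliers`, `build_prompt`, `extract`, `lexicon_ner`, `canned_llm`) is hypothetical, as are the supplier pattern and the prompt wording. The NER model and the LLM are passed in as plain callables, and the two stand-ins at the bottom exist only so the sketch runs without external services. In practice they would be replaced by a pre-trained chemical NER model and a real LLM client.

```python
import json
import re
from typing import Callable, Dict, List

# Hypothetical pattern: capitalized words followed by a common legal suffix.
SUPPLIER_RE = re.compile(r"\b(?:[A-Z][A-Za-z&-]*\s+)+(?:Inc|Ltd|LLC|GmbH|Corp)\b\.?")


def find_suppliers(passage: str) -> List[str]:
    """Regex stage: collect supplier-like names, dropping a trailing period."""
    return [m.group(0).rstrip(".") for m in SUPPLIER_RE.finditer(passage)]


def build_prompt(passage: str, chemicals: List[str], suppliers: List[str]) -> str:
    """LLM stage input: the passage plus the candidates found by the earlier stages."""
    return (
        "Using only the passage, link each starting material to its supplier. "
        'Reply as JSON: {"pairs": [{"material": "...", "supplier": "..."}]}\n'
        f"Passage: {passage}\n"
        f"Candidate chemicals: {chemicals}\n"
        f"Candidate suppliers: {suppliers}"
    )


def extract(
    passage: str,
    ner: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run NER, then regex, then the LLM, and keep only grounded pairs."""
    chemicals = ner(passage)
    suppliers = find_suppliers(passage)
    if not chemicals or not suppliers:
        return []  # nothing to link, so skip the LLM call (a design assumption)
    pairs = json.loads(llm(build_prompt(passage, chemicals, suppliers))).get("pairs", [])
    # Discard any pair that mentions an entity the earlier stages did not find.
    return [
        p for p in pairs
        if p.get("material") in chemicals and p.get("supplier") in suppliers
    ]


def lexicon_ner(passage: str) -> List[str]:
    """Stand-in for a pre-trained NER model: a tiny lookup list (assumption)."""
    known = ["4-bromoaniline", "toluene"]
    return [name for name in known if name in passage]


def canned_llm(prompt: str) -> str:
    """Stand-in for an LLM call: returns a fixed JSON answer (assumption)."""
    return json.dumps(
        {"pairs": [{"material": "4-bromoaniline", "supplier": "Acme Chemicals Ltd"}]}
    )


if __name__ == "__main__":
    text = "The starting material 4-bromoaniline was purchased from Acme Chemicals Ltd. in Basel."
    print(extract(text, lexicon_ner, canned_llm))
```

In this sketch the LLM only sees candidates already surfaced by the NER and regex stages, and its answer is filtered against them. This is one plausible way to limit unsupported output without any labeled training data.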

Cover Page SM Manuscript_DIA White Paper_03302026