KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING

Nasrin Mohammadi; Max Dreger; Diego Collarana; Mohammad J. Eslamibidgoli; Kourosh Malek; M. Eikerling

doi:10.26434/chemrxiv.10001546/v1

KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING

Date

2026

Authors

Nasrin Mohammadi

Max Dreger

Diego Collarana

Mohammad J. Eslamibidgoli

Kourosh Malek

M. Eikerling

Abstract

In this work, we present a pipeline for automated knowledge graph construction from materials science literature using large language models (LLMs). The proposed method performs entity and relationship extraction guided by a data model based on the logic of the Elementary Multiperspective Material Ontology (EMMO), structuring the output into a machine-interpretable graph format. The pipeline integrates several key components, including prompt-based extraction, a hierarchical chunking strategy that leverages document structure and section headers, and post-processing steps such as normalization, LLM-assisted deduplication, and alignment of node identifiers. A central focus of this study is the evaluation of different chunking strategies. Specifically we compare fixed-size splitting with a hierarchical chunking approach that incorporates document structure and header information. Our results show that hierarchical chunking consistently outperforms fixed-size chunking across both entity and relationship extraction tasks, achieving higher precision, recall, and F1 scores through more context-aware segmentation. Extracted entities and relationships are aligned with a curated ground truth dataset through manual verification to ensure semantic correctness. Overall, these findings indicate that LLMs, when combined with domain-specific ontological guidance and well-designed pre-and post-processing, can effectively extract high quality knowledge graphs from complex materials science literature. This benefits materials scientists and researchers by reducing manual curation effort and accelerating data-driven materials discovery.