KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING

dc.contributor.author Nasrin Mohammadi
dc.contributor.author Max Dreger
dc.contributor.author Diego Collarana
dc.contributor.author Mohammad J. Eslamibidgoli
dc.contributor.author Kourosh Malek
dc.contributor.author M. Eikerling
dc.coverage.spatial Bolivia
dc.date.accessioned 2026-03-22T20:00:19Z
dc.date.available 2026-03-22T20:00:19Z
dc.date.issued 2026
dc.description.abstract In this work, we present a pipeline for automated knowledge graph construction from materials science literature using large language models (LLMs). The proposed method performs entity and relationship extraction guided by a data model based on the logic of the Elementary Multiperspective Material Ontology (EMMO), structuring the output into a machine-interpretable graph format. The pipeline integrates several key components, including prompt-based extraction, a hierarchical chunking strategy that leverages document structure and section headers, and post-processing steps such as normalization, LLM-assisted deduplication, and alignment of node identifiers. A central focus of this study is the evaluation of different chunking strategies. Specifically, we compare fixed-size splitting with a hierarchical chunking approach that incorporates document structure and header information. Our results show that hierarchical chunking consistently outperforms fixed-size chunking across both entity and relationship extraction tasks, achieving higher precision, recall, and F1 scores through more context-aware segmentation. Extracted entities and relationships are aligned with a curated ground truth dataset through manual verification to ensure semantic correctness. Overall, these findings indicate that LLMs, when combined with domain-specific ontological guidance and well-designed pre- and post-processing, can effectively extract high-quality knowledge graphs from complex materials science literature. This benefits materials scientists and researchers by reducing manual curation effort and accelerating data-driven materials discovery.
dc.identifier.doi 10.26434/chemrxiv.10001546/v1
dc.identifier.uri https://doi.org/10.26434/chemrxiv.10001546/v1
dc.identifier.uri https://andeanlibrary.org/handle/123456789/79417
dc.source RWTH Aachen University
dc.subject Computer science
dc.subject Chunking (psychology)
dc.subject Natural language processing
dc.subject Pipeline (software)
dc.subject Artificial intelligence
dc.subject Graph
dc.subject Preprocessor
dc.subject Information retrieval
dc.subject Knowledge extraction
dc.subject Knowledge base
dc.title KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING
dc.type article
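
The abstract contrasts fixed-size splitting with hierarchical, header-aware chunking and reports precision, recall, and F1. The following is a minimal sketch of that contrast, not the authors' implementation. It assumes that section headers appear as Markdown-style `#` lines, and the names `Chunk`, `fixed_size_chunks`, `hierarchical_chunks`, and `prf1` are hypothetical helpers introduced here for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    headers: tuple  # path of section titles leading to this chunk
    text: str       # body text found under that header path


# Assumption: headers look like "# Title" or "## Subtitle" (Markdown style).
HEADER_RE = re.compile(r"^(#+)\s+(.*)$")


def fixed_size_chunks(text, size):
    """Baseline: cut the text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def hierarchical_chunks(text):
    """Split at headers and attach the full header path to each chunk."""
    chunks = []
    path = []  # stack of (level, title) for the currently open sections
    body = []  # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append(Chunk(tuple(title for _, title in path), content))
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close sections at the same or a deeper level before opening this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


def prf1(predicted, gold):
    """Set-based precision, recall, and F1 against a ground-truth set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

In this sketch each hierarchical chunk carries its header path, so a prompt built from it could state which section the text came from, whereas a fixed-size window may cut through a sentence or a section boundary. Calling `prf1` on the entity sets extracted under each strategy would give one way to compare them against a curated ground truth.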
