KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING

Nasrin Mohammadi; Max Dreger; Diego Collarana; Mohammad J. Eslamibidgoli; Kourosh Malek; M. Eikerling

doi:10.26434/chemrxiv.10001546/v1

dc.contributor.author	Nasrin Mohammadi
dc.contributor.author	Max Dreger
dc.contributor.author	Diego Collarana
dc.contributor.author	Mohammad J. Eslamibidgoli
dc.contributor.author	Kourosh Malek
dc.contributor.author	M. Eikerling
dc.coverage.spatial	Bolivia
dc.date.accessioned	2026-03-22T20:00:19Z
dc.date.available	2026-03-22T20:00:19Z
dc.date.issued	2026
dc.description.abstract	In this work, we present a pipeline for automated knowledge graph construction from materials science literature using large language models (LLMs). The proposed method performs entity and relationship extraction guided by a data model based on the logic of the Elementary Multiperspective Material Ontology (EMMO), structuring the output into a machine-interpretable graph format. The pipeline integrates several key components, including prompt-based extraction, a hierarchical chunking strategy that leverages document structure and section headers, and post-processing steps such as normalization, LLM-assisted deduplication, and alignment of node identifiers. A central focus of this study is the evaluation of different chunking strategies. Specifically, we compare fixed-size splitting with the hierarchical chunking approach. Our results show that hierarchical chunking consistently outperforms fixed-size chunking across both entity and relationship extraction tasks, achieving higher precision, recall, and F1 scores through more context-aware segmentation. Extracted entities and relationships are aligned with a curated ground truth dataset through manual verification to ensure semantic correctness. Overall, these findings indicate that LLMs, when combined with domain-specific ontological guidance and well-designed pre- and post-processing, can effectively extract high-quality knowledge graphs from complex materials science literature. This benefits materials scientists and researchers by reducing manual curation effort and accelerating data-driven materials discovery.
dc.identifier.doi	10.26434/chemrxiv.10001546/v1
dc.identifier.uri	https://doi.org/10.26434/chemrxiv.10001546/v1
dc.identifier.uri	https://andeanlibrary.org/handle/123456789/79417
dc.source	RWTH Aachen University
dc.subject	Computer science
dc.subject	Chunking (psychology)
dc.subject	Natural language processing
dc.subject	Pipeline (software)
dc.subject	Artificial intelligence
dc.subject	Graph
dc.subject	Preprocessor
dc.subject	Information retrieval
dc.subject	Knowledge extraction
dc.subject	Knowledge base
dc.title	KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING
dc.type	article
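
The abstract contrasts fixed-size splitting with hierarchical chunking that follows document structure and section headers. The sketch below illustrates that difference only; it is not the authors' pipeline. The `#`-prefixed header syntax, the function names, and the sample text are all assumptions made for illustration, since the record does not describe how headers are detected.

```python
import re
from typing import List, Tuple

# Assumption: section headers look like "# Title" / "## Subtitle".
HEADER = re.compile(r"^(#+)\s+(.*)$")


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def hierarchical_chunks(text: str) -> List[Tuple[str, str]]:
    """Return (header_path, body) pairs, one per section with a non-empty body."""
    path: List[str] = []   # current stack of headers, outermost first
    body: List[str] = []   # lines collected since the last header
    chunks: List[Tuple[str, str]] = []

    def flush() -> None:
        content = " ".join(line.strip() for line in body if line.strip())
        if content:
            chunks.append((" > ".join(path), content))
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section under its own header path
            level = len(match.group(1))
            del path[level - 1:]  # drop headers at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical sample document, not taken from the paper.
doc = (
    "# Methods\n"
    "## Synthesis\n"
    "TiO2 was annealed at 450 C.\n"
    "## Characterization\n"
    "XRD confirmed the anatase phase.\n"
)

sections = hierarchical_chunks(doc)
windows = fixed_size_chunks(doc, 40)
```

Each hierarchical chunk carries its header path (for example `Methods > Synthesis`), so an extraction prompt can see which section a sentence came from, whereas the fixed-size windows can cut through headers and sentences.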

Collections

Published Scientific Article
