KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING
| dc.contributor.author | Nasrin Mohammadi | |
| dc.contributor.author | Max Dreger | |
| dc.contributor.author | Diego Collarana | |
| dc.contributor.author | Mohammad J. Eslamibidgoli | |
| dc.contributor.author | Kourosh Malek | |
| dc.contributor.author | M. Eikerling | |
| dc.coverage.spatial | Bolivia | |
| dc.date.accessioned | 2026-03-22T20:00:19Z | |
| dc.date.available | 2026-03-22T20:00:19Z | |
| dc.date.issued | 2026 | |
| dc.description.abstract | In this work, we present a pipeline for automated knowledge graph construction from materials science literature using large language models (LLMs). The proposed method performs entity and relationship extraction guided by a data model based on the logic of the Elementary Multiperspective Material Ontology (EMMO), structuring the output into a machine-interpretable graph format. The pipeline integrates several key components, including prompt-based extraction, a hierarchical chunking strategy that leverages document structure and section headers, and post-processing steps such as normalization, LLM-assisted deduplication, and alignment of node identifiers. A central focus of this study is the evaluation of different chunking strategies. Specifically, we compare fixed-size splitting with a hierarchical chunking approach that incorporates document structure and header information. Our results show that hierarchical chunking consistently outperforms fixed-size chunking across both entity and relationship extraction tasks, achieving higher precision, recall, and F1 scores through more context-aware segmentation. Extracted entities and relationships are aligned with a curated ground-truth dataset through manual verification to ensure semantic correctness. Overall, these findings indicate that LLMs, when combined with domain-specific ontological guidance and well-designed pre- and post-processing, can effectively extract high-quality knowledge graphs from complex materials science literature. This benefits materials scientists and researchers by reducing manual curation effort and accelerating data-driven materials discovery. | |
| dc.identifier.doi | 10.26434/chemrxiv.10001546/v1 | |
| dc.identifier.uri | https://doi.org/10.26434/chemrxiv.10001546/v1 | |
| dc.identifier.uri | https://andeanlibrary.org/handle/123456789/79417 | |
| dc.source | RWTH Aachen University | |
| dc.subject | Computer science | |
| dc.subject | Chunking (psychology) | |
| dc.subject | Natural language processing | |
| dc.subject | Pipeline (software) | |
| dc.subject | Artificial intelligence | |
| dc.subject | Graph | |
| dc.subject | Preprocessor | |
| dc.subject | Information retrieval | |
| dc.subject | Knowledge extraction | |
| dc.subject | Knowledge base | |
| dc.title | KNOWLEDGE GRAPH CONSTRUCTION FROM MATERIALS SCIENCE LITERATURE USING LARGE LANGUAGE MODELS AND ADVANCED DATA PREPROCESSING | |
| dc.type | article |
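
The abstract contrasts fixed-size splitting with hierarchical chunking that follows section headers. The following is a minimal sketch of that contrast, not the authors' implementation: it assumes the document text is available with markdown-style `#` headers, and the names `fixed_size_chunks`, `hierarchical_chunks`, and `HEADER` are hypothetical, introduced only for illustration.

```python
import re
from typing import List, Tuple

# Assumption: sections are marked with markdown-style headers ("# Title").
HEADER = re.compile(r"^(#+)\s+(.*)$")


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def hierarchical_chunks(text: str) -> List[Tuple[str, str]]:
    """Return (header path, section body) pairs, one per non-empty section."""
    path: List[str] = []   # current stack of headers, outermost first
    body: List[str] = []   # lines collected since the last header
    chunks: List[Tuple[str, str]] = []

    def flush() -> None:
        # Emit the collected body under the header path that was active for it.
        content = "\n".join(body).strip()
        if content:
            chunks.append((" > ".join(path), content))
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]          # drop headers at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

In this sketch each hierarchical chunk carries its header path (for example `Methods > Synthesis`), so text passed to an extraction prompt keeps its section context, whereas a fixed-size window can cut a sentence or a section in two.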