Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Markus Stricker

From Word2Vec to Transformers: Text-Derived Composition Embeddings for Filtering Combinatorial Electrocatalysts

Mar 09, 2026

Lei Zhang, Markus Stricker

Abstract:Compositionally complex solid solution electrocatalysts span vast composition spaces, and even one materials system can contain more candidate compositions than can be measured exhaustively. Here we evaluate a label-free screening strategy that represents each composition using embeddings derived from scientific texts and prioritizes candidates based on similarity to two property concepts. We compare a corpus-trained Word2Vec baseline with transformer-based embeddings, where compositions are encoded either by linear element-wise mixing or by short composition prompts. Similarities to `concept directions', the terms conductivity and dielectric, define a 2-dimensional descriptor space, and a symmetric Pareto-front selection is used to filter candidate subsets without using electrochemical labels. Performance is assessed on 15 materials libraries including noble metal alloys and multicomponent oxides. In this setting, the lightweight Word2Vec baseline, which uses a simple linear combination of element embeddings, often achieves the highest number of reductions of possible candidate compositions while staying close to the best measured performance.

* 15 pages, 3 figures

Via

Access Paper or Ask Questions

Ontology-aligned structuring and reuse of multimodal materials data and workflows towards automatic reproduction

Jan 18, 2026

Sepideh Baghaee Ravari, Abril Azocar Guzman, Sarath Menon, Stefan Sandfeld, Tilmann Hickel, Markus Stricker

Abstract:Reproducibility of computational results remains a challenge in materials science, as simulation workflows and parameters are often reported only in unstructured text and tables. While literature data are valuable for validation and reuse, the lack of machine-readable workflow descriptions prevents large-scale curation and systematic comparison. Existing text-mining approaches are insufficient to extract complete computational workflows with their associated parameters. An ontology-driven, large language model (LLM)-assisted framework is introduced for the automated extraction and structuring of computational workflows from the literature. The approach focuses on density functional theory-based stacking fault energy (SFE) calculations in hexagonal close-packed magnesium and its binary alloys, and uses a multi-stage filtering strategy together with prompt-engineered LLM extraction applied to method sections and tables. Extracted information is unified into a canonical schema and aligned with established materials ontologies (CMSO, ASMO, and PLDO), enabling the construction of a knowledge graph using atomRDF. The resulting knowledge graph enables systematic comparison of reported SFE values and supports the structured reuse of computational protocols. While full computational reproducibility is still constrained by missing or implicit metadata, the framework provides a foundation for organizing and contextualizing published results in a semantically interoperable form, thereby improving transparency and reusability of computational materials data.

* 39 pages, 7 figures

Via

Access Paper or Ask Questions

Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts

May 27, 2025

Lei Zhang, Markus Stricker

Abstract:The discovery and optimization of materials for specific applications is hampered by the practically infinite number of possible elemental combinations and associated properties, also known as the `combinatorial explosion'. By nature of the problem, data are scarce and all possible data sources should be used. In addition to simulations and experimental results, the latent knowledge in scientific texts is not yet used to its full potential. We present an iterative framework that refines a given scientific corpus by strategic selection of the most diverse documents, training Word2Vec models, and monitoring the convergence of composition-property correlations in embedding space. Our approach is applied to predict high-performing materials for oxygen reduction (ORR), hydrogen evolution (HER), and oxygen evolution (OER) reactions for a large number of possible candidate compositions. Our method successfully predicts the highest performing compositions among a large pool of candidates, validated by experimental measurements of the electrocatalytic performance in the lab. This work demonstrates and validates the potential of iterative corpus refinement to accelerate materials discovery and optimization, offering a scalable and efficient tool for screening large compositional spaces where reliable data are scarce or non-existent.

* 13 pages, 5 figures, 2 tables, accepted at ECMLPKDD 2025

Via

Access Paper or Ask Questions

Dislocation cartography: Representations and unsupervised classification of dislocation networks with unique fingerprints

Jun 21, 2024

Benjamin Udofia, Tushar Jogi, Markus Stricker

Abstract:Detecting structure in data is the first step to arrive at meaningful representations for systems. This is particularly challenging for dislocation networks evolving as a consequence of plastic deformation of crystalline systems. Our study employs Isomap, a manifold learning technique, to unveil the intrinsic structure of high-dimensional density field data of dislocation structures from different compression axis. The resulting maps provide a systematic framework for quantitatively comparing dislocation structures, offering unique fingerprints based on density fields. Our novel, unbiased approach contributes to the quantitative classification of dislocation structures which can be systematically extended.

* 26 pages, 7 figures

Via

Access Paper or Ask Questions

MatNexus: A Comprehensive Text Mining and Analysis Suite for Materials Discover

Nov 07, 2023

Lei Zhang, Markus Stricker

Abstract:MatNexus is a specialized software for the automated collection, processing, and analysis of text from scientific articles. Through an integrated suite of modules, the MatNexus facilitates the retrieval of scientific articles, processes textual data for insights, generates vector representations suitable for machine learning, and offers visualization capabilities for word embeddings. With the vast volume of scientific publications, MatNexus stands out as an end-to-end tool for researchers aiming to gain insights from scientific literature in material science, making the exploration of materials, such as the electrocatalyst examples we show here, efficient and insightful.

* 15 pages, 6 figures, submission to SoftwareX

Via

Access Paper or Ask Questions