Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Seoin Back

Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

Jun 16, 2026

Dong Hyeon Mok, Jonggeol Na, Seoin Back

Abstract:Inverse design of heterogeneous catalysts remains challenging because catalyst surfaces exhibit substantial structural complexity with coupled surface-adsorbate interactions across a vast chemical space that is difficult to explore efficiently through conventional screening alone. Although machine learning-based high-throughput screening has accelerated catalyst discovery, its efficiency inevitably declines as the search space grows, motivating the development of generative models that can directly construct catalysts with target properties. Here, we present a conditional catalyst generative model based on the Generative Pretrained Transformer architecture with a numerical embedding layer that enables the generation of catalyst structures conditioned on both categorical and continuous properties within a single autoregressive framework. The model was pretrained on 133 million catalyst structures and subsequently fine-tuned on approximately 460,000 optimized structures with associated categorical properties and binding energies for conditional generation. The resulting model achieved 98% structural validity, 95% optimization validity, and high categorical condition fidelity, with a 93 % joint match rate for adsorbate type and composition. For binding energy conditioning, the match rate of approximately 20% represents a four-fold improvement over the baseline training distribution, and the generated distributions shift systematically toward the target values, enabling a 1.5 to 4-fold improvement in screening efficiency for reaction-targeted catalyst discovery without additional fine-tuning. These results show that large-scale autoregressive pre-training, combined with explicit property conditioning, provides a practical route toward controllable catalyst generation and accelerated catalysts discovery.

Via

Access Paper or Ask Questions

Reasoning-Driven Design of Single Atom Catalysts via a Multi-Agent Large Language Model Framework

Feb 25, 2026

Dong Hyeon Mok, Seoin Back, Victor Fung, Guoxiang Hu

Abstract:Large language models (LLMs) are becoming increasingly applied beyond natural language processing, demonstrating strong capabilities in complex scientific tasks that traditionally require human expertise. This progress has extended into materials discovery, where LLMs introduce a new paradigm by leveraging reasoning and in-context learning, capabilities absent from conventional machine learning approaches. Here, we present a Multi-Agent-based Electrocatalyst Search Through Reasoning and Optimization (MAESTRO) framework in which multiple LLMs with specialized roles collaboratively discover high-performance single atom catalysts for the oxygen reduction reaction. Within an autonomous design loop, agents iteratively reason, propose modifications, reflect on results and accumulate design history. Through in-context learning enabled by this iterative process, MAESTRO identified design principles not explicitly encoded in the LLMs' background knowledge and successfully discovered catalysts that break conventional scaling relations between reaction intermediates. These results highlight the potential of multi-agent LLM frameworks as a powerful strategy to generate chemical insight and discover promising catalysts.

Via

Access Paper or Ask Questions

Generative Language Model for Catalyst Discovery

Jul 19, 2024

Dong Hyeon Mok, Seoin Back

Figure 1 for Generative Language Model for Catalyst Discovery

Figure 2 for Generative Language Model for Catalyst Discovery

Figure 3 for Generative Language Model for Catalyst Discovery

Figure 4 for Generative Language Model for Catalyst Discovery

Abstract:Discovery of novel and promising materials is a critical challenge in the field of chemistry and material science, traditionally approached through methodologies ranging from trial-and-error to machine learning-driven inverse design. Recent studies suggest that transformer-based language models can be utilized as material generative models to expand chemical space and explore materials with desired properties. In this work, we introduce the Catalyst Generative Pretrained Transformer (CatGPT), trained to generate string representations of inorganic catalyst structures from a vast chemical space. CatGPT not only demonstrates high performance in generating valid and accurate catalyst structures but also serves as a foundation model for generating desired types of catalysts by fine-tuning with sparse and specified datasets. As an example, we fine-tuned the pretrained CatGPT using a binary alloy catalyst dataset designed for screening two-electron oxygen reduction reaction (2e-ORR) catalyst and generate catalyst structures specialized for 2e-ORR. Our work demonstrates the potential of language models as generative tools for catalyst discovery.

Via

Access Paper or Ask Questions

Catalyst design using actively learned machine with non-ab initio input features towards CO2 reduction reactions

Sep 14, 2017

Juhwan Noh, Jaehoon Kim, Seoin Back, Yousung Jung

Figure 1 for Catalyst design using actively learned machine with non-ab initio input features towards CO2 reduction reactions

Figure 2 for Catalyst design using actively learned machine with non-ab initio input features towards CO2 reduction reactions

Figure 3 for Catalyst design using actively learned machine with non-ab initio input features towards CO2 reduction reactions

Figure 4 for Catalyst design using actively learned machine with non-ab initio input features towards CO2 reduction reactions

Abstract:In conventional chemisorption model, the d-band center theory (augmented sometimes with the upper edge of d-band for imporved accuarcy) plays a central role in predicting adsorption energies and catalytic activity as a function of d-band center of the solid surfaces, but it requires density functional calculations that can be quite costly for large scale screening purposes of materials. In this work, we propose to use the d-band width of the muffin-tin orbital theory (to account for local coordination environment) plus electronegativity (to account for adsorbate renormalization) as a simple set of alternative descriptors for chemisorption, which do not demand the ab initio calculations. This pair of descriptors are then combined with machine learning methods, namely, artificial neural network (ANN) and kernel ridge regression (KRR), to allow large scale materials screenings. We show, for a toy set of 263 alloy systems, that the CO adsorption energy can be predicted with a remarkably small mean absolute deviation error of 0.05 eV, a significantly improved result as compared to 0.13 eV obtained with descriptors including costly d-band center calculations in literature. We achieved this high accuracy by utilizing an active learning algorithm, without which the accuracy was 0.18 eV otherwise. As a practical application of this machine, we identified Cu3Y@Cu as a highly active and cost-effective electrochemical CO2 reduction catalyst to produce CO with the overpotential 0.37 V lower than Au catalyst.

* Under review and including Electronic Supplementary Information (ESI)

Via

Access Paper or Ask Questions