Abstract:Radiation-induced grafting (RIG) enables precise functionalization of polymer films for ion-exchange membranes, CO2-separation membranes, and battery electrolytes by generating radicals on robust substrates to graft desired monomers. However, reproducibility remains limited due to unreported variability in base-film morphology (crystallinity, grain orientation, free volume), which governs monomer diffusion, radical distribution, and the Trommsdorff effect, leading to spatial graft gradients and performance inconsistencies. We present a hierarchical stacking optimization framework with a Dirichlet's Process (SoDip), a hierarchical data-driven framework integrating: (1) a decoder-only Transformer (DeepSeek-R1) to encode textual process descriptors (irradiation source, grafting type, substrate manufacturer); (2) TabNet and XGBoost for modelling multimodal feature interactions; (3) Gaussian Process Regression (GPR) with Dirichlet Process Mixture Models (DPMM) for uncertainty quantification and heteroscedasticity; and (4) Bayesian Optimization for efficient exploration of high-dimensional synthesis space. A diverse dataset was curated using ChemDataExtractor 2.0 and WebPlotDigitizer, incorporating numerical and textual variables across hundreds of RIG studies. In cross-validation, SoDip achieved ~33% improvement over GPR while providing calibrated confidence intervals that identify low-reproducibility regimes. Its stacked architecture integrates sparse textual and numerical inputs of varying quality, outperforming prior models and establishing a foundation for reproducible, morphology-aware design in graft polymerization research.
Abstract:The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.