Abstract: The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemRxivQuest, a curated dataset of 970 high-quality question-answer (QA) pairs derived from 155 ChemRxiv preprints across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemRxivQuest was constructed using an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and a fuzzy matching technique for answer verification. The dataset emphasizes conceptual, mechanistic, applied, and experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemRxivQuest provides a foundational resource for chemistry NLP research, education, and tool development.
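The abstract does not say which fuzzy matching technique is used for answer verification. As a minimal sketch of the general idea, the following Python example slides a word window over the source segment and accepts an answer when its best similarity ratio reaches a threshold. The function names, the use of `difflib.SequenceMatcher`, and the 0.8 threshold are all assumptions made for illustration, not details of the ChemRxivQuest pipeline.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(answer: str, source: str) -> float:
    """Return the highest similarity ratio (0..1) between the answer and any
    window of the source text that has the same number of words.

    Hypothetical helper: the real pipeline's matching method is not specified.
    """
    answer_words = answer.lower().split()
    source_words = source.lower().split()
    if not answer_words or not source_words:
        return 0.0
    # The window cannot be longer than the source itself.
    size = min(len(answer_words), len(source_words))
    answer_text = " ".join(answer_words)
    best = 0.0
    for start in range(len(source_words) - size + 1):
        window = " ".join(source_words[start:start + size])
        ratio = SequenceMatcher(None, answer_text, window).ratio()
        best = max(best, ratio)
    return best


def is_verified(answer: str, source: str, threshold: float = 0.8) -> bool:
    """Accept the answer if some window of the source is similar enough.

    The 0.8 threshold is an assumed value chosen only for this sketch.
    """
    return best_fuzzy_match(answer, source) >= threshold


if __name__ == "__main__":
    segment = ("The catalyst was recovered by filtration and reused "
               "five times without loss of activity.")
    print(is_verified("recovered by filtration and reused five times", segment))
    print(is_verified("quantum dots emit green light", segment))
```

A character-level ratio tolerates small OCR errors in the source segment, which is one plausible reason to prefer fuzzy over exact matching in a pipeline that starts from OCR output.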
Abstract: Although Raman spectroscopy is widely used for the investigation of biomedical samples and has a high potential for use in clinical applications, it is not common in clinical routines. One of the factors that obstruct the integration of Raman spectroscopic tools into clinical routines is the complexity of the data processing workflow. Software tools that simplify spectroscopic data handling may facilitate such integration by familiarizing clinical experts with the advantages of Raman spectroscopy. Here, RAMANMETRIX is introduced as user-friendly software with an intuitive web-based graphical user interface (GUI) that incorporates a complete workflow for chemometric analysis of Raman spectra, from raw data pretreatment to a robust validation of machine learning models. The software can be used both for model training and for the application of pretrained models to new data sets. Users have full control of the parameters during model training, but the testing data flow is frozen and does not require additional user input. RAMANMETRIX is available in two versions: as standalone software and as a web application. Due to the modern software architecture, the computational backend can be executed separately from the GUI and accessed through an application programming interface (API) for applying a preconstructed model to the measured data. This opens up possibilities for using the software as a real-time data processing backend for measurement devices. The models preconstructed by more experienced users can be exported and reused for easy one-click data preprocessing and prediction, which requires minimal interaction between the user and the software. The results of such predictions and the graphical outputs of the different data processing steps can be exported and saved.
Abstract: In biospectroscopy, suitably annotated and statistically independent samples (e.g. patients, batches, etc.) for classifier training and testing are scarce and costly. Learning curves show the model performance as a function of the training sample size and can help to determine the sample size needed to train good classifiers. However, building a good model is not enough: the performance must also be proven. We discuss learning curves for typical small sample size situations with 5-25 independent samples per class. Although the classification models achieve acceptable performance, the learning curve can be completely masked by the random testing uncertainty due to the equally limited test sample size. Consequently, we determine the test sample sizes necessary to achieve reasonable precision in the validation and find that 75-100 samples will usually be needed to test a good but not perfect classifier. Such a data set will then allow refined sample size planning on the basis of the achieved performance. We also demonstrate how to calculate the sample sizes necessary to show the superiority of one classifier over another: this often requires hundreds of statistically independent test samples or is even theoretically impossible. We demonstrate our findings with a data set of ca. 2550 Raman spectra of single cells (five classes: erythrocytes, leukocytes and three tumour cell lines BT-20, MCF-7 and OCI-AML3) as well as by an extensive simulation that allows precise determination of the actual performance of the models in question.
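The link between test sample size and validation precision can be illustrated with a binomial confidence interval on an observed proportion such as sensitivity. The sketch below uses the Wilson score interval, which is an assumption: the abstract does not state which interval or criterion the authors apply, and the function names and target width are hypothetical. It is not meant to reproduce the figures quoted above.

```python
import math


def wilson_width(p: float, n: int, z: float = 1.96) -> float:
    """Width of the Wilson score interval for a proportion p observed on
    n independent test samples (z = 1.96 corresponds to about 95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    denom = 1.0 + z * z / n
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return 2.0 * half


def min_test_size(p_expected: float, max_width: float,
                  z: float = 1.96, n_max: int = 100_000):
    """Smallest n whose Wilson interval is no wider than max_width,
    or None if no n up to n_max is sufficient."""
    for n in range(1, n_max + 1):
        if wilson_width(p_expected, n, z) <= max_width:
            return n
    return None


if __name__ == "__main__":
    # Assumed scenario: a classifier expected to reach 90 % sensitivity.
    for n in (10, 25, 50, 100, 200):
        print(n, round(wilson_width(0.90, n), 3))
    print("n for width <= 0.15:", min_test_size(0.90, 0.15))
```

The interval narrows only roughly with the square root of the number of test samples, which is why small test sets can mask differences that a learning curve is supposed to reveal. Only statistically independent samples (e.g. patients or batches, not individual spectra) should be counted as `n`.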