Abstract: The design-make-test-analyze cycle in early-stage drug discovery remains constrained primarily by the "make" step: small-molecule synthesis is slow, costly, and difficult to scale or automate across diverse chemotypes. Enumerated chemical spaces aim to reduce this bottleneck by predefining synthesizable regions of chemical space from available building blocks and reliable reactions, yet existing commercial spaces are still limited by long turnaround times, narrow reaction scope, and substantial manual decision-making in route selection and execution. Here we present the first version of onepot CORE, an enumerated chemical space containing 3.4B molecules, together with a corresponding on-demand synthesis product enabled by an automated synthesis platform and an AI chemist, Phil, that designs, executes, and analyzes experiments. onepot CORE is constructed by (i) selecting a reaction set commonly used in medicinal chemistry, (ii) sourcing and curating building blocks from supplier catalogs, (iii) enumerating candidate products, and (iv) applying ML-based feasibility assessment to prioritize compounds for robust execution. In the current release, the space is supported by seven reactions. We describe an end-to-end workflow, from route selection and automated liquid handling through workup and purification. We further report validation across operational metrics (success rate, timelines, purity, and identity), including NMR confirmation for a representative set of synthesized compounds and assay suitability demonstrated using a series of DPP4 inhibitors. Collectively, onepot CORE illustrates a path toward faster, more reliable access to diverse small molecules, supporting accelerated discovery in pharmaceuticals and beyond.
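
The four construction steps (i) to (iv) named in the abstract can be illustrated with a toy sketch. This is not the onepot CORE implementation: the `BuildingBlock` and `Reaction` classes, the functional-group tags, and the `toy_feasibility` rule are all hypothetical stand-ins, and the scoring rule is an arbitrary placeholder for an ML-based feasibility model.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BuildingBlock:
    name: str
    group: str  # hypothetical functional-group tag, e.g. "amine"


@dataclass(frozen=True)
class Reaction:
    name: str
    left_group: str   # group required on the first reactant
    right_group: str  # group required on the second reactant


def toy_feasibility(left, right):
    """Placeholder score; a real system would query a trained model here."""
    # Illustrative rule only: each block flagged "hindered" lowers the score.
    penalty = 0.3 * (("hindered" in left.name) + ("hindered" in right.name))
    return max(0.0, 0.9 - penalty)


def enumerate_space(reactions, blocks, threshold=0.5):
    """Yield (reaction, left, right, score) for compatible pairs above threshold."""
    # Step (ii): index the curated building blocks by functional group.
    by_group = {}
    for block in blocks:
        by_group.setdefault(block.group, []).append(block)
    # Step (iii): enumerate every compatible pair for each reaction.
    for rxn in reactions:
        lefts = by_group.get(rxn.left_group, [])
        rights = by_group.get(rxn.right_group, [])
        for left, right in product(lefts, rights):
            # Step (iv): keep only candidates the scorer considers feasible.
            score = toy_feasibility(left, right)
            if score >= threshold:
                yield rxn.name, left.name, right.name, score


# Step (i): a one-reaction set, chosen here purely for illustration.
reactions = [Reaction("amide_coupling", "amine", "acid")]
blocks = [
    BuildingBlock("amine_A", "amine"),
    BuildingBlock("hindered_amine_B", "amine"),
    BuildingBlock("acid_C", "acid"),
    BuildingBlock("hindered_acid_D", "acid"),
]
space = list(enumerate_space(reactions, blocks))
```

In this toy setting the enumerated space grows as the product of the compatible building-block counts per reaction, which is why a feasibility filter is applied before any compound is offered for synthesis.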




Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.