Abstract:Recently, molecule generation using deep learning has been actively investigated in drug discovery. In this field, Transformer and VAE are widely used as powerful models, but they are rarely used in combination due to structural and performance mismatch of them. This study proposes a model that combines these two models through structural and parameter optimization in handling diverse molecules. The proposed model shows comparable performance to existing models in generating molecules, and showed by far superior performance in generating molecules with unseen structures. In addition, the proposed model successfully predicted molecular properties using the latent representation of VAE. Ablation studies suggested the advantage of VAE over other generative models like language model in generating novel molecules, and that the molecules can be described by ~32 dimensional variables, much smaller than existing descriptors and models. This study is expected to provide a virtual chemical library containing a wide variety of compounds for virtual screening and to enable efficient screening.
Abstract:Recent years have seen development of descriptor generation based on representation learning of extremely diverse molecules, especially those that apply natural language processing (NLP) models to SMILES, a literal representation of molecular structure. However, little research has been done on how these models understand chemical structure. To address this, we investigated the relationship between the learning progress of SMILES and chemical structure using a representative NLP model, the Transformer. The results suggest that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures. Consistently, the accuracy of molecular property predictions using descriptors generated from models at different learning steps was similar from the beginning to the end of training. Furthermore, we found that the Transformer requires particularly long training to learn chirality and sometimes stagnates with low translation accuracy due to misunderstanding of enantiomers. These findings are expected to deepen understanding of NLP models in chemistry.