Molecular generation, an essential method for identifying new drug structures, has been supported by advancements in machine learning and computational technology. However, challenges remain in multi-objective generation, model adaptability, and practical application in drug discovery. In this study, we developed a versatile 'plug-in' molecular generation model that incorporates multiple objectives related to target affinity, drug-likeness, and synthesizability, facilitating its application in various drug development contexts. We improved Particle Swarm Optimization (PSO) for drug discovery and, through comparative experiments, identified PSO-ENP as the optimal variant for multi-objective molecular generation and optimization. The model also incorporates a novel target-ligand affinity predictor, which enhances the model's utility by supporting three-dimensional information and improving synthetic feasibility. Case studies on generating and optimizing large, drug-like marine natural products underscored PSO-ENP's effectiveness and demonstrated its considerable potential for practical drug discovery applications.
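As a rough illustration of the multi-objective swarm search described above, the following sketch runs a plain, textbook PSO over a continuous vector and combines several objectives by a weighted sum. It is not PSO-ENP: the function names, the weighted-sum scalarization, the hyperparameters, and the toy objectives (stand-ins for affinity, drug-likeness, and synthesizability scores) are all assumptions made for the example.

```python
import random


def scalarize(scores, weights):
    """Weighted sum of objective scores (an assumed scalarization)."""
    return sum(w * s for w, s in zip(weights, scores))


def pso_maximize(objectives, weights, dim, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, bounds=(-1.0, 1.0), seed=0):
    """Maximize a weighted sum of objectives with a basic PSO."""
    rng = random.Random(seed)
    lo, hi = bounds

    def fitness(x):
        return scalarize([f(x) for f in objectives], weights)

    # Random initial positions, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to personal best + pull to swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


# Toy objectives (hypothetical): each peaks at 0 for one coordinate.
toy_affinity = lambda x: -(x[0] - 0.5) ** 2
toy_druglikeness = lambda x: -(x[1] + 0.25) ** 2
best, best_fit = pso_maximize([toy_affinity, toy_druglikeness],
                              [1.0, 1.0], dim=2)
```

In a molecular setting the position vector would typically be a latent or descriptor representation that is decoded to a molecule before scoring; that decoding step is outside the scope of this sketch.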
Anomaly detection is the task of identifying abnormal behavior of a system. Anomaly detection in computational workflows is of special interest because of its wide implications in various domains such as cybersecurity, finance, and social networks. However, anomaly detection in computational workflows~(often modeled as graphs) is a relatively unexplored problem and poses distinct challenges. For instance, when anomaly detection is performed on graph data, the complex interdependencies of nodes and edges, the heterogeneity of node attributes, and the edge types must all be accounted for. Although graph neural networks can help capture these complex interdependencies, the scarcity of labeled anomalous examples from workflow executions remains a significant challenge. To address this problem, we introduce an autoencoder-driven self-supervised learning~(SSL) approach that learns a summary statistic from unlabeled workflow data and estimates the normal behavior of the computational workflow in the latent space. In this approach, we combine generative and contrastive learning objectives to detect outliers in the summary statistics. We demonstrate that by estimating the distribution of normal behavior in the latent space, we can outperform state-of-the-art anomaly detection methods on our benchmark datasets.
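The idea of estimating normal behavior in a latent space and flagging points far from it can be sketched in a few lines. The example below assumes the latent vectors have already been produced by some encoder (not shown), and it models normal behavior with a diagonal Gaussian, which is a simplifying assumption of this sketch rather than the estimator used in the work above. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_normal_profile(latents):
    """Estimate per-dimension mean and std from latent vectors of normal runs."""
    dims = list(zip(*latents))
    mu = [mean(d) for d in dims]
    # Guard against zero variance so the score stays finite.
    sigma = [max(pstdev(d), 1e-8) for d in dims]
    return mu, sigma


def anomaly_score(z, mu, sigma):
    """Standardized Euclidean distance of z from the normal profile."""
    return math.sqrt(sum(((zi - m) / s) ** 2
                         for zi, m, s in zip(z, mu, sigma)))


# Hypothetical latent vectors of normal workflow executions.
normal_latents = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
mu, sigma = fit_normal_profile(normal_latents)
typical = anomaly_score((0.5, 0.5), mu, sigma)
outlier = anomaly_score((5.0, 5.0), mu, sigma)
```

A threshold on the score (for example, a high quantile of the scores on held-out normal data) would then separate normal from anomalous executions.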
Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation, and the existing lower bounds are not tight enough for practical use. To address this issue, we take inspiration from the connection with the quadratic assignment problem and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy as a surrogate of GW. It admits an efficient, closed-form lower bound with $\mathcal{O}(n^3)$ complexity, and directly extends to the fused Gromov-Wasserstein (FGW) distance, incorporating node features into the coupling. Extensive experiments on both synthetic and real-world datasets show the tightness of our lower bounds, and both OGW and its lower bounds efficiently deliver accurate predictions and satisfactory barycenters for graph sets.
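To make the object being bounded concrete, the sketch below evaluates the standard squared-loss GW objective, $\sum_{i,j,k,l} (C_1[i,k] - C_2[j,l])^2 \, T[i,j] \, T[k,l]$, for a fixed coupling $T$. It does not compute the OGW surrogate or its lower bound, and it does not search over couplings; the function name and the toy path graph are assumptions for illustration.

```python
def gw_objective(C1, C2, T):
    """Squared-loss GW cost of a fixed coupling T between two spaces.

    C1 (n x n) and C2 (m x m) are intra-space distance matrices and
    T (n x m) is a coupling. The naive loop costs O(n^2 m^2).
    """
    n, m = len(C1), len(C2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if T[i][j] == 0.0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    diff = C1[i][k] - C2[j][l]
                    total += diff * diff * T[i][j] * T[k][l]
    return total


# Shortest-path distances of a 3-node path graph, compared with itself.
C = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
w = 1.0 / 3.0
T_identity = [[w, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, w]]
# Coupling that swaps nodes 0 and 1, which is not a symmetry of the path.
T_swapped = [[0.0, w, 0.0], [w, 0.0, 0.0], [0.0, 0.0, w]]
cost_identity = gw_objective(C, C, T_identity)
cost_swapped = gw_objective(C, C, T_swapped)
```

The identity coupling aligns the graph with itself at zero cost, while the swapped coupling is penalized. The GW discrepancy is the minimum of this objective over all couplings with the prescribed marginals, which is the non-convex problem the abstract refers to.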
Learning the similarity between structured data, especially graphs, is an essential problem. Besides approaches such as graph kernels, the Gromov-Wasserstein (GW) distance has recently drawn considerable attention because of its flexibility in capturing both topological and feature characteristics and its ability to handle permutation invariance. However, structured data are widely distributed across different data mining and machine learning applications, and privacy concerns limit access to such decentralized data to individual clients or separate silos. To tackle these issues, we propose a privacy-preserving framework that analyzes the GW discrepancy of node embeddings learned locally by graph neural networks in a federated setting, and then explicitly applies local differential privacy (LDP) based on a Multi-bit Encoder to protect sensitive information. Our experiments show that, with the strong privacy protection guaranteed by the $\varepsilon$-LDP algorithm, the proposed framework not only preserves privacy in graph learning but also yields a noised structural metric under the GW distance, resulting in comparable and even better performance in classification and clustering tasks. Moreover, we explain the rationale behind the LDP-based GW distance both analytically and empirically.
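The $\varepsilon$-LDP guarantee mentioned above can be illustrated with a simple one-bit randomized mechanism for a single value in $[-1, 1]$. This is not the Multi-bit Encoder itself; it is a well-known simpler mechanism, used here only to show how a client can release a noisy but unbiased report. The function names are hypothetical, and `eps` must be positive.

```python
import math
import random


def prob_plus(x, eps):
    """Probability of reporting the positive bit for x in [-1, 1]."""
    return 0.5 + 0.5 * x * (math.exp(eps) - 1.0) / (math.exp(eps) + 1.0)


def one_bit_perturb(x, eps, rng=random):
    """Return an unbiased eps-LDP report of a value x in [-1, 1].

    The report is +scale or -scale; the scale factor undoes the bias
    introduced by the randomized bit.
    """
    if not -1.0 <= x <= 1.0:
        raise ValueError("x must lie in [-1, 1]")
    scale = (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)
    bit = 1.0 if rng.random() < prob_plus(x, eps) else -1.0
    return scale * bit


# Example: perturb one (hypothetical) normalized embedding coordinate.
report = one_bit_perturb(0.3, eps=1.0, rng=random.Random(0))
```

The privacy guarantee follows because the probability of either output changes by at most a factor of $e^{\varepsilon}$ between any two inputs, with the extreme case being $x = 1$ versus $x = -1$. Averaging many such reports recovers the mean of the true values, at the cost of added variance.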
Molecular fingerprints are the workhorse of ligand-based drug discovery. In recent years, an increasing number of research papers have reported fascinating results on using deep neural networks to learn 2D molecular representations as fingerprints. One may anticipate that the integration of deep learning would also benefit 3D fingerprints. Here, we present a new 3D small-molecule fingerprint, the three-dimensional force fields fingerprint (TF3P), learned by a deep capsular network whose training does not require labeled datasets for specific predictive tasks. TF3P encodes the 3D force field information of molecules and demonstrates a stronger ability to capture 3D structural changes, to recognize molecules that are alike in 3D but not in 2D, and, based only on ligand similarity, to recognize similar targets that are inaccessible to other fingerprints, including E3FP, the only existing 3D fingerprint. Furthermore, TF3P is compatible with both statistical models (e.g., the similarity ensemble approach) and machine learning models. Altogether, we report TF3P as a new 3D small-molecule fingerprint with a promising future in ligand-based drug discovery.
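Fingerprint-based comparison of ligands usually reduces to a similarity score between two fingerprints. The sketch below computes the Tanimoto coefficient for binary fingerprints represented as collections of on-bit indices. It assumes binary fingerprints; a learned fingerprint such as TF3P may be real-valued, in which case a different similarity measure would be needed. The function name and the example bit sets are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two binary fingerprints.

    Each fingerprint is given as an iterable of on-bit indices.
    Two empty fingerprints are assigned similarity 0.0 here,
    which is a convention chosen for this sketch.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Two hypothetical fingerprints sharing two of four distinct on-bits.
sim = tanimoto({1, 2, 3}, {2, 3, 4})
```

Pairwise scores of this kind between the ligand sets of two targets are the raw input that set-level statistical models aggregate.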