Deep generative modeling is an emerging field, driven by the ability to model and generate complex data for purposes such as image synthesis and molecular design. However, the advancement of deep generative models is limited by the difficulty of generating objects that possess multiple desired properties: 1) complex correlations among real-world properties are common but hard to identify; 2) controlling an individual property implicitly enforces partial control over its correlated properties, which is difficult to model; 3) controlling multiple properties simultaneously and in different manners is hard and under-explored. We address these challenges by proposing a novel deep generative framework that recovers the semantics and correlation of properties through disentangled latent vectors. The correlation is handled via an explainable mask pooling layer, and properties are precisely retained by generated objects via the mutual dependence between latent vectors and properties. Our generative model preserves properties of interest while handling correlation and conflicts of properties under a multi-objective optimization framework. The experiments demonstrate our model's superior performance in generating data with desired properties.
Detecting anomalies over real-world datasets remains a challenging task. Data annotation is labor-intensive, particularly in sequential datasets, where the start and end times of anomalies are not known. As a result, data collected from sequential real-world processes can be largely unlabeled or contain inaccurate labels. These characteristics challenge the application of anomaly detection techniques based on supervised learning. In contrast, Multiple Instance Learning (MIL) has been shown to be effective on problems with incomplete knowledge of labels in the training dataset, mainly due to the notion of bags. While largely under-leveraged for anomaly detection, MIL provides an appealing formulation for anomaly detection over real-world datasets, and this formulation is the primary contribution of this paper. We propose an MIL-based formulation and various algorithmic instantiations of this framework based on different design decisions for its key components. We evaluate the resulting algorithms over four datasets that capture different physical processes along different modalities. The experimental evaluation draws out several observations. The MIL-based formulation performs no worse than single-instance learning on easy to moderate datasets and outperforms single-instance learning on more challenging datasets. Altogether, the results show that the framework generalizes well over diverse datasets resulting from different real-world application domains.
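As a rough illustration of the notion of bags mentioned above, the following minimal sketch groups a sequence into fixed-size bags and flags a bag as anomalous when its highest instance score crosses a threshold. It assumes the standard MIL assumption (a bag is positive if at least one instance is positive), which the abstract does not spell out; the function names, the bag size, the scoring function, and the threshold are all hypothetical and are not taken from the paper.

```python
def make_bags(sequence, bag_size):
    """Split a sequence into consecutive, non-overlapping bags (hypothetical helper)."""
    return [sequence[i:i + bag_size] for i in range(0, len(sequence), bag_size)]


def bag_score(instance_scores):
    """Standard MIL assumption: a bag is as anomalous as its most anomalous instance."""
    return max(instance_scores)


def predict_bags(bags, score_fn, threshold):
    """Return one boolean per bag: True if the bag is flagged as anomalous."""
    return [bag_score([score_fn(x) for x in bag]) >= threshold for bag in bags]


# Toy readings; the instance score here is simply the reading itself (an assumption).
readings = [0.1, 0.2, 0.1, 0.9, 0.2, 0.1, 0.0, 0.3]
bags = make_bags(readings, bag_size=4)
flags = predict_bags(bags, score_fn=lambda x: x, threshold=0.5)
```

Only bag-level labels are needed to supervise such a model, which is why the formulation tolerates unknown anomaly start and end times within a bag.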
The increasing number of protein sequences decoded from genomes is opening up new avenues of research on linking protein sequence to function with transformer neural networks. Recent research has shown that the number of known protein sequences supports learning useful, task-agnostic sequence representations via transformers. In this paper, we posit that learning joint sequence-structure representations yields better representations for function-related prediction tasks. We propose a transformer neural network that attends to both sequence and tertiary structure. We show that such joint representations are more powerful than sequence-only representations, and that they yield better performance on superfamily membership prediction across various metrics.
Designing molecules with specific properties is a long-standing research problem and is central to advancing crucial domains such as drug discovery and material science. Recent advances in deep graph generative models treat molecule design as a graph generation problem, which provides new opportunities for a breakthrough on this long-standing problem. Existing models, however, have many shortcomings, including poor interpretability and controllability toward desired molecular properties. This paper focuses on new methodologies for molecule generation with interpretable and controllable deep generative models, by proposing new monotonically-regularized graph variational autoencoders. The proposed models learn to represent the molecules with latent variables and then learn the correspondence between them and molecule properties, parameterized by polynomial functions. To further improve the interpretability and controllability of molecule generation toward desired properties, we derive new objectives that further enforce monotonicity of the relation between some latent variables and target molecule properties such as toxicity and clogP. Extensive experimental evaluation demonstrates the superiority of the proposed framework on accuracy, novelty, disentanglement, and control toward desired molecular properties. The code is open-source at https://anonymous.4open.science/r/MDVAE-FD2C.
Forecasting traffic flows is a central task in intelligent transportation system management. Graph structures have shown promise as a modeling framework, with recent advances in spatio-temporal modeling via graph convolution neural networks, improving the performance or extending the prediction horizon on traffic flows. However, a key shortcoming of state-of-the-art methods is their inability to take into account information of various modalities, for instance the impact of maintenance downtime on traffic flows. This is the issue we address in this paper. Specifically, we propose a novel model to predict traffic speed under the impact of construction work. The model is based on the powerful attention-based spatio-temporal graph convolution architecture but utilizes various channels to integrate different sources of information, explicitly builds spatio-temporal dependencies among traffic states, captures the relationships between heterogeneous roadway networks, and then predicts changes in traffic flow resulting from maintenance downtime events. The model is evaluated on two benchmark datasets and a novel dataset we have collected over the bustling Tyson's corner region in Northern Virginia. Extensive comparative experiments and ablation studies show that the proposed model can capture complex and nonlinear spatio-temporal relationships across a transportation corridor, outperforming baseline models.
The mean shift (MS) algorithm is a nonparametric method used to cluster sample points and find the local modes of kernel density estimates, using an idea based on iterative gradient ascent. In this paper we develop a mean-shift-inspired algorithm to estimate the modes of regression functions and partition the sample points in the input space. We prove convergence of the sequences generated by the algorithm and derive the non-asymptotic rates of convergence of the estimated local modes for the underlying regression model. We also demonstrate the utility of the algorithm for data-enabled discovery through an application on biomolecular structure data. An extension to the subspace constrained mean shift (SCMS) algorithm, used to extract ridges of regression functions, is briefly discussed.
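For readers unfamiliar with the classical mean shift procedure referred to above, here is a minimal one-dimensional sketch with a Gaussian kernel. It illustrates only the standard density-mode version (each point is repeatedly replaced by the kernel-weighted mean of the samples, then points are grouped by the mode they reach); it is not the regression-mode algorithm developed in the paper. The function names, the bandwidth, and the tolerances are assumptions chosen for illustration.

```python
import math


def mean_shift_1d(samples, start, bandwidth=1.0, tol=1e-8, max_iter=500):
    """Move `start` uphill on a Gaussian kernel density estimate until it stops moving."""
    x = float(start)
    for _ in range(max_iter):
        # Gaussian kernel weight of every sample relative to the current point.
        weights = [math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for s in samples]
        # Mean shift update: the kernel-weighted mean of the samples.
        new_x = sum(w * s for w, s in zip(weights, samples)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


def cluster_by_mode(samples, bandwidth=1.0, merge_tol=1e-3):
    """Label each sample by the mode its mean shift sequence converges to."""
    modes, labels = [], []
    for s in samples:
        m = mean_shift_1d(samples, s, bandwidth)
        for i, known in enumerate(modes):
            if abs(known - m) < merge_tol:
                labels.append(i)
                break
        else:
            # No existing mode is close enough, so record a new one.
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Two well-separated groups of points; the bandwidth value is an assumption.
points = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
modes, labels = cluster_by_mode(points, bandwidth=0.5)
```

On this toy input the procedure finds one mode per group and assigns the first three points to one cluster and the last three to the other. The bandwidth controls how many modes survive: larger values smooth the density estimate and merge nearby modes.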
Identifying one or more biologically-active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme imbalance between positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection, despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, a lack of generalization over varied test cases remains a bottleneck for these methods. We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for decoy sets consisting mostly of low-quality decoys. ML-Select is a useful method for decoy selection. This work suggests further research into more effective ways to adopt machine learning frameworks to achieve robust performance for decoy selection in template-free protein structure prediction.
Disentangled representation learning has recently attracted a significant amount of attention, particularly in the field of image representation learning. However, learning the disentangled representations behind a graph remains largely unexplored, especially for attributed graphs with both node and edge features. Disentanglement learning for graph generation poses substantial new challenges, including 1) the lack of graph deconvolution operations to jointly decode node and edge attributes; and 2) the difficulty in enforcing the disentanglement among latent factors that respectively influence: i) only nodes, ii) only edges, and iii) joint patterns between them. To address these challenges, we propose a new disentanglement enhancement framework for deep generative models for attributed graphs. In particular, a novel variational objective is proposed to disentangle the above three types of latent factors, with a novel architecture for node and edge deconvolutions. Moreover, within each type, individual-factor-wise disentanglement is further enhanced, which is shown to be a generalization of the existing framework for images. Qualitative and quantitative experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed model and its extensions.
Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where an important goal is to determine the functionally-relevant forms/structures that a protein molecule employs to interact with molecular partners in the living cell. This goal is typically pursued under the umbrella of stochastic optimization, with algorithms that optimize a scoring function. Research repeatedly shows that current scoring functions, though steadily improving, correlate weakly with molecular activity. Inspired by recent momentum in generative deep learning, this paper proposes and evaluates an alternative approach to generating functionally-relevant three-dimensional structures of a protein. Though deep generative models typically struggle with highly-structured data, the work presented here circumvents this challenge via graph-generative models. A comprehensive evaluation of several deep architectures shows the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, as well as in highlighting axes/factors that carry structural meaning and open the black box often associated with deep models. The work presented here is a first step towards interpretative, deep generative models becoming viable and informative complementary approaches to protein structure prediction.