A biological system is a complex network of heterogeneous molecular entities and their interactions contributing to various biological characteristics of the system. However, current biological networks are noisy, sparse, and incomplete, limiting our ability to create a holistic view of the biological system and understand the biological phenomena. Experimental identification of such interactions is both time-consuming and expensive. With the recent advancements in high-throughput data generation and significant improvement in computational power, various computational methods have been developed to predict novel interactions in the noisy network. Recently, deep learning methods such as graph neural networks have shown their effectiveness in modeling graph-structured data and achieved good performance in biomedical interaction prediction. However, graph neural networks-based methods require human expertise and experimentation to design the appropriate complexity of the model and significantly impact the performance of the model. Furthermore, deep graph neural networks face overfitting problems and tend to be poorly calibrated with high confidence on incorrect predictions. To address these challenges, we propose Bayesian model selection for graph convolutional networks to jointly infer the most plausible number of graph convolution layers (depth) warranted by data and perform dropout regularization simultaneously. Experiments on four interaction datasets show that our proposed method achieves accurate and calibrated predictions. Our proposed method enables the graph convolutional networks to dynamically adapt their depths to accommodate an increasing number of interactions.
Gaussian process training decomposes into inference of the (approximate) posterior and learning of the hyperparameters. For non-Gaussian (non-conjugate) likelihoods, two common choices for approximate inference are Expectation Propagation (EP) and Variational Inference (VI), which have complementary strengths and weaknesses. While VI's lower bound to the marginal likelihood is a suitable objective for inferring the approximate posterior, it does not automatically imply it is a good learning objective for hyperparameter optimization. We design a hybrid training procedure where the inference leverages conjugate-computation VI and the learning uses an EP-like marginal likelihood approximation. We empirically demonstrate on binary classification that this provides a good learning objective and generalizes better.
This paper studies learning on text-attributed graphs (TAGs), where each node is associated with a text description. An ideal solution for such a problem would be integrating both the text and graph structure information with large language models and graph neural networks (GNNs). However, the problem becomes very challenging when graphs are large due to the high computational complexity brought by large language models and training GNNs on big graphs. In this paper, we propose an efficient and effective solution to learning on large text-attributed graphs by fusing graph structure and language learning with a variational Expectation-Maximization (EM) framework, called GLEM. Instead of simultaneously training large language models and GNNs on big graphs, GLEM proposes to alternatively update the two modules in the E-step and M-step. Such a procedure allows to separately train the two modules but at the same time allows the two modules to interact and mutually enhance each other. Extensive experiments on multiple data sets demonstrate the efficiency and effectiveness of the proposed approach.
Deep Learning has proliferated dramatically across consumer devices in less than a decade, but has been largely powered through the hardware acceleration within isolated devices. Nonetheless, clear signals exist that the next decade of consumer intelligence will require levels of resources, a mixing of modalities and a collaboration of devices that will demand a significant pivot beyond hardware alone. To accomplish this, we believe a new Edge-AI paradigm will be necessary for this transition to be possible in a sustainable manner, without trespassing user-privacy or hurting quality of experience.
Graph Neural Networks (GNNs) have made tremendous progress in the graph classification task. However, a performance gap between the training set and the test set has often been noticed. To bridge such gap, in this work we introduce the first test-time training framework for GNNs to enhance the model generalization capacity for the graph classification task. In particular, we design a novel test-time training strategy with self-supervised learning to adjust the GNN model for each test graph sample. Experiments on the benchmark datasets have demonstrated the effectiveness of the proposed framework, especially when there are distribution shifts between training set and test set. We have also conducted exploratory studies and theoretical analysis to gain deeper understandings on the rationality of the design of the proposed graph test time training framework (GT3).
SpecAugment is a very effective data augmentation method for both HMM and E2E-based automatic speech recognition (ASR) systems. Especially, it also works in low-resource scenarios. However, SpecAugment masks the spectrum of time or the frequency domain in a fixed augmentation policy, which may bring relatively less data diversity to the low-resource ASR. In this paper, we propose a policy-based SpecAugment (Policy-SpecAugment) method to alleviate the above problem. The idea is to use the augmentation-select policy and the augmentation-parameter changing policy to solve the fixed way. These policies are learned based on the loss of validation set, which is applied to the corresponding augmentation policies. It aims to encourage the model to learn more diverse data, which the model relatively requires. In experiments, we evaluate the effectiveness of our approach in low-resource scenarios, i.e., the 100 hours librispeech task. According to the results and analysis, we can see that the above issue can be obviously alleviated using our proposal. In addition, the experimental results show that, compared with the state-of-the-art SpecAugment, the proposed Policy-SpecAugment has a relative WER reduction of more than 10% on the Test/Dev-clean set, more than 5% on the Test/Dev-other set, and an absolute WER reduction of more than 1% on all test sets.
Partial separable(PS) model is a powerful model for dynamic magnetic resonance imaging (MRI). PS model explicitly reduces the degree of freedom in the reconstruction problem, which is beneficial for high temporal resolution applications. However, long acquisition time and even longer reconstruction time prohibit the acceptance of PS model in daily practice. In this work, we propose to fully exploit the dimension-reduction property to accelerate the PS model. We optimize the data consistency term, and use a Tikhonov regularization term based on Frobenius norm of temporal difference, resulting in a totally dimension-reduced optimization technique. The proposed method is used for accelerating the free-running cardiac MRI. We have performed both retrospective experiments on public dataset and prospective experiments on in-vivo data, and compared the proposed method with least-square method and another two popular regularized PS model methods. The results show that the proposed method has robust performance against shortened acquisition time or suboptimal hyper-parameter settings, and achieves superior image quality over all other competing algorithms. The proposed method is 20-fold faster than the widely accepted PS+Sparse method, enabling data acquisition and image reconstruction to be completed in just a few seconds.
Learning temporal correspondence from unlabeled videos is of vital importance in computer vision, and has been tackled by different kinds of self-supervised pretext tasks. For the self-supervised learning, recent studies suggest using large-scale video datasets despite the training cost. We propose a spatial-then-temporal pretext task to address the training data cost problem. The task consists of two steps. First, we use contrastive learning from unlabeled still image data to obtain appearance-sensitive features. Then we switch to unlabeled video data and learn motion-sensitive features by reconstructing frames. In the second step, we propose a global correlation distillation loss to retain the appearance sensitivity learned in the first step, as well as a local correlation distillation loss in a pyramid structure to combat temporal discontinuity. Experimental results demonstrate that our method surpasses the state-of-the-art self-supervised methods on a series of correspondence-based tasks. The conducted ablation studies verify the effectiveness of the proposed two-step task and loss functions.
Due to the difficulties of obtaining multimodal paired images in clinical practice, recent studies propose to train brain tumor segmentation models with unpaired images and capture complementary information through modality translation. However, these models cannot fully exploit the complementary information from different modalities. In this work, we thus present a novel two-step (intra-modality and inter-modality) curriculum disentanglement learning framework to effectively utilize privileged semi-paired images, i.e. limited paired images that are only available in training, for brain tumor segmentation. Specifically, in the first step, we propose to conduct reconstruction and segmentation with augmented intra-modality style-consistent images. In the second step, the model jointly performs reconstruction, unsupervised/supervised translation, and segmentation for both unpaired and paired inter-modality images. A content consistency loss and a supervised translation loss are proposed to leverage complementary information from different modalities in this step. Through these two steps, our method effectively extracts modality-specific style codes describing the attenuation of tissue features and image contrast, and modality-invariant content codes containing anatomical and functional information from the input images. Experiments on three brain tumor segmentation tasks show that our model outperforms competing segmentation models based on unpaired images.
People perceive the world with different senses, such as sight, hearing, smell, and touch. Processing and fusing information from multiple modalities enables Artificial Intelligence to understand the world around us more easily. However, when there are missing modalities, the number of available modalities is different in diverse situations, which leads to an N-to-One fusion problem. To solve this problem, we propose a transformer based fusion block called TFusion. Different from preset formulations or convolution based methods, the proposed block automatically learns to fuse available modalities without synthesizing or zero-padding missing ones. Specifically, the feature representations extracted from upstream processing model are projected as tokens and fed into transformer layers to generate latent multimodal correlations. Then, to reduce the dependence on particular modalities, a modal attention mechanism is introduced to build a shared representation, which can be applied by the downstream decision model. The proposed TFusion block can be easily integrated into existing multimodal analysis networks. In this work, we apply TFusion to different backbone networks for multimodal human activity recognition and brain tumor segmentation tasks. Extensive experimental results show that the TFusion block achieves better performance than the competing fusion strategies.