Image-text retrieval is one of the major tasks of cross-modal retrieval. Several approaches for this task map images and texts into a common space to create correspondences between the two modalities. However, due to the content (semantics) richness of an image, redundant secondary information in an image may cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), to assist the model in focusing on an image's main content. This approach is inspired by how people typically annotate the content of an image by describing its main content. Thus, we leverage the annotated texts corresponding to an image to assist the model in capturing the main content of the image, reducing the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at: https://github.com/ZhangXu0963/VSL.
Being able to create meaningful symbols and proficiently use them for higher cognitive functions such as communication, reasoning, planning, etc., is essential and unique for human intelligence. Current deep neural networks are still far behind human's ability to create symbols for such higher cognitive functions. Here we propose a solution, named SEA-net, to endow neural networks with ability of symbol creation, semantic understanding and communication. SEA-net generates symbols that dynamically configure the network to perform specific tasks. These symbols capture compositional semantic information that enables the system to acquire new functions purely by symbolic manipulation or communication. In addition, we found that these self-generated symbols exhibit an intrinsic structure resembling that of natural language, suggesting a common framework underlying the generation and understanding of symbols in both human brains and artificial neural networks. We hope that it will be instrumental in producing more capable systems in the future that can synergize the strengths of connectionist and symbolic approaches for AI.
We introduceDropDim, a structured dropout method designed for regularizing the self-attention mechanism, which is a key component of the transformer. In contrast to the general dropout method, which randomly drops neurons, DropDim drops part of the embedding dimensions. In this way, the semantic information can be completely discarded. Thus, the excessive coadapting between different embedding dimensions can be broken, and the self-attention is forced to encode meaningful featureswith a certain number of embedding dimensions erased. Experiments on a wide range of tasks executed on the MUST-C English-Germany dataset show that DropDim can effectively improve model performance, reduce over-fitting, and show complementary effects with other regularization methods. When combined with label smoothing, the WER can be reduced from 19.1% to 15.1% on the ASR task, and the BLEU value can be increased from26.90 to 28.38 on the MT task. On the ST task, the model can reach a BLEU score of 22.99, an increase by 1.86 BLEU points compared to the strong baseline.
Increasing number of COVID-19 research literatures cause new challenges in effective literature screening and COVID-19 domain knowledge aware Information Retrieval. To tackle the challenges, we demonstrate two tasks along withsolutions, COVID-19 literature retrieval, and question answering. COVID-19 literature retrieval task screens matching COVID-19 literature documents for textual user query, and COVID-19 question answering task predicts proper text fragments from text corpus as the answer of specific COVID-19 related questions. Based on transformer neural network, we provided solutions to implement the tasks on CORD-19 dataset, we display some examples to show the effectiveness of our proposed solutions.
Machine learning, and representation learning in particular, has the potential to facilitate drug discovery by screening billions of compounds. For example, a successful approach is representing the molecules as a graph and utilizing graph neural networks (GNN). Yet, these approaches still require experimental measurements of thousands of compounds to construct a proper training set. While in some domains it is easier to acquire experimental data, in others it might be more limited. For example, it is easier to test the compounds on bacteria than perform in-vivo experiments. Thus, a key question is how to utilize information from a large available dataset together with a small subset of compounds where both domains are measured to predict compounds' effect on the second, experimentally less available domain. Current transfer learning approaches for drug discovery, including training of pre-trained modules or meta-learning, have limited success. In this work, we develop a novel method, named Symbiotic Message Passing Neural Network (SMPNN), for merging graph-neural-network models from different domains. Using routing new message passing lanes between them, our approach resolves some of the potential conflicts between the different domains, and implicit constraints induced by the larger datasets. By collecting public data and performing additional high-throughput experiments, we demonstrate the advantage of our approach by predicting anti-fungal activity from anti-bacterial activity. We compare our method to the standard transfer learning approach and show that SMPNN provided better and less variable performances. Our approach is general and can be used to facilitate information transfer between any two domains such as different organisms, different organelles, or different environments.
With the deployment of GPS-enabled devices and data acquisition technology, the massively generated GPS trajectory data provide a core support for advancing spatial-temporal data mining research. Nonetheless, GPS trajectories comprise personal geo-location information, rendering inevitable privacy concerns on plain data. One promising solution to this problem is trajectory generation, replacing the original data with the generated privacy-free ones. However, owing to the complex and stochastic behavior of human activities, generating high-quality trajectories is still in its infancy. To achieve the objective, we propose a diffusion-based trajectory generation (Diff-Traj) framework, effectively integrating the generation capability of the diffusion model and learning from the spatial-temporal features of trajectories. Specifically, we gradually convert real trajectories to noise through a forward trajectory noising process. Then, Diff-Traj reconstructs forged trajectories from the noise by a reverse trajectory denoising process. In addition, we design a trajectory UNet (Traj-UNet) structure to extract trajectory features for noise level prediction during the reverse process. Experiments on two real-world datasets show that Diff-Traj can be intuitively applied to generate high-quality trajectories while retaining the original distribution.
Recent graph neural networks (GNNs) with the attention mechanism have historically been limited to small-scale homogeneous graphs (HoGs). However, GNNs handling heterogeneous graphs (HeGs), which contain several entity and relation types, all have shortcomings in handling attention. Most GNNs that learn graph attention for HeGs learn either node-level or relation-level attention, but not both, limiting their ability to predict both important entities and relations in the HeG. Even the best existing method that learns both levels of attention has the limitation of assuming graph relations are independent and that its learned attention disregards this dependency association. To effectively model both multi-relational and multi-entity large-scale HeGs, we present Bi-Level Attention Graph Neural Networks (BA-GNN), scalable neural networks (NNs) that use a novel bi-level graph attention mechanism. BA-GNN models both node-node and relation-relation interactions in a personalized way, by hierarchically attending to both types of information from local neighborhood contexts instead of the global graph context. Rigorous experiments on seven real-world HeGs show BA-GNN consistently outperforms all baselines, and demonstrate quality and transferability of its learned relation-level attention to improve performance of other GNNs.
Traffic congestion caused by non-recurring incidents such as vehicle crashes and debris is a key issue for Traffic Management Centers (TMCs). Clearing incidents in a timely manner is essential for improving safety and reducing delays and emissions for the traveling public. However, TMCs and other responders face a challenge in predicting the duration of incidents (until the roadway is clear), making decisions of what resources to deploy difficult. To address this problem, this research developed an analytical framework and end-to-end machine-learning solution for predicting incident duration based on information available as soon as an incident report is received. Quality predictions of incident duration can help TMCs and other responders take a proactive approach in deploying responder services such as tow trucks, maintenance crews or activating alternative routes. The predictions use a combination of classification and regression machine learning modules. The performance of the developed solution has been evaluated based on the Mean Absolute Error (MAE), or deviation from the actual incident duration as well as Area Under the Curve (AUC) and Mean Absolute Percentage Error (MAPE). The results showed that the framework significantly improved incident duration prediction compared to methods from previous research.
Skeleton-based action recognition has achieved remarkable results in human action recognition with the development of graph convolutional networks (GCNs). However, the recent works tend to construct complex learning mechanisms with redundant training and exist a bottleneck for long time-series. To solve these problems, we propose the Temporal-Spatio Graph ConvNeXt (TSGCNeXt) to explore efficient learning mechanism of long temporal skeleton sequences. Firstly, a new graph learning mechanism with simple structure, Dynamic-Static Separate Multi-graph Convolution (DS-SMG) is proposed to aggregate features of multiple independent topological graphs and avoid the node information being ignored during dynamic convolution. Next, we construct a graph convolution training acceleration mechanism to optimize the back-propagation computing of dynamic graph learning with 55.08\% speed-up. Finally, the TSGCNeXt restructure the overall structure of GCN with three Spatio-temporal learning modules,efficiently modeling long temporal features. In comparison with existing previous methods on large-scale datasets NTU RGB+D 60 and 120, TSGCNeXt outperforms on single-stream networks. In addition, with the ema model introduced into the multi-stream fusion, TSGCNeXt achieves SOTA levels. On the cross-subject and cross-set of the NTU 120, accuracies reach 90.22% and 91.74%.
The combination of LiDAR and camera modalities is proven to be necessary and typical for 3D object detection according to recent studies. Existing fusion strategies tend to overly rely on the LiDAR modal in essence, which exploits the abundant semantics from the camera sensor insufficiently. However, existing methods cannot rely on information from other modalities because the corruption of LiDAR features results in a large domain gap. Following this, we propose CrossFusion, a more robust and noise-resistant scheme that makes full use of the camera and LiDAR features with the designed cross-modal complementation strategy. Extensive experiments we conducted show that our method not only outperforms the state-of-the-art methods under the setting without introducing an extra depth estimation network but also demonstrates our model's noise resistance without re-training for the specific malfunction scenarios by increasing 5.2\% mAP and 2.4\% NDS.