In e-commerce search, relevance between the query and documents is an essential requirement for a satisfying user experience. Unlike on traditional e-commerce platforms that offer products, users on life service platforms such as Meituan search mainly for product providers, which usually have abundant structured information, e.g., name, address, category, and thousands of products. Modeling search relevance with these rich structured contents is challenging due to the following issues: (1) there is a language distribution discrepancy among the different fields of a structured document, making it difficult to directly adopt off-the-shelf methods based on pretrained language models such as BERT; (2) different fields usually have different importance and their lengths vary greatly, making it difficult to extract document information helpful for relevance matching. To tackle these issues, in this paper we propose a novel two-stage pretraining and matching architecture for relevance matching with rich structured documents. At the pretraining stage, we propose an effective pretraining method that employs both the query and multiple fields of the document as inputs, including an effective information compression method for lengthy fields. At the relevance matching stage, we propose a novel matching method that leverages domain knowledge in the search query to generate more effective document representations for relevance scoring. Extensive offline experiments and online A/B tests on millions of users verify that the proposed architectures effectively improve the performance of relevance modeling. The model has already been deployed online, serving the search traffic of Meituan for over a year.
The unmanned aerial vehicle (UAV) network has become popular in recent years due to its various applications. In a UAV network, routing is significantly affected by the distributed network topology, which leaves UAVs vulnerable to deliberate damage. Hence, this paper focuses on routing planning and recovery for UAV networks under attack. In detail, a deliberate attack model based on the importance of nodes is designed to represent enemy attacks. Then, a node importance ranking mechanism is presented that considers both node degree and link importance. However, the routing problem in UAV networks is intractable for traditional methods, since link connections change with UAV availability. Hence, an intelligent algorithm based on reinforcement learning is proposed to recover the routing path when UAVs are attacked. Simulations are conducted, and the numerical results verify that the proposed mechanism performs better than the other methods compared.
When robots retrieve specific objects from cluttered scenes, such as home and warehouse environments, the target objects are often partially occluded or completely hidden. Robots are thus required to search for, identify, and successfully grasp a target object. Previous works have relied on pre-trained object recognition or segmentation models to find the target object. However, such methods require laborious manual annotations to train the models and even fail to find novel target objects. In this paper, we propose an Image-driven Object Searching and Grasping (IOSG) approach where a robot is provided with the reference image of a novel target object and tasked to find and retrieve it. We design a Target Similarity Network that generates a probability map to infer the location of the novel target. IOSG learns a hierarchical policy; the high-level policy predicts the subtask type, whereas the low-level policies, explorer and coordinator, generate effective push and grasp actions. The explorer is responsible for searching for the target object when it is hidden or occluded by other objects. Once the target object is found, the coordinator conducts target-oriented pushing and grasping to retrieve the target from the clutter. The proposed pipeline is trained with full self-supervision in simulation and applied to a real environment. Our model achieves task success rates of 96.0% and 94.5% on coordination and exploration tasks in simulation, respectively, and an 85.0% success rate on a real robot for the search-and-grasp task.
Most existing image-text matching methods adopt triplet loss as the optimization objective, and choosing a proper negative sample for the triplet of <anchor, positive, negative> is important for effectively training the model, e.g., hard negatives make the model learn efficiently and effectively. However, we observe that existing methods mainly employ the most similar samples as hard negatives, which may not be true negatives. In other words, samples that have high similarity to the anchor but are not paired with it may retain positive semantic associations, and we call them false negatives. Repelling these false negatives in triplet loss would mislead the semantic representation learning and result in inferior retrieval performance. In this paper, we propose a novel False Negative Elimination (FNE) strategy to select negatives via sampling, which could alleviate the problem introduced by false negatives. Specifically, we first construct the distributions of positive and negative samples separately via their similarities with the anchor, based on the features extracted from the image and text encoders. Then we calculate the false negative probability of a given sample based on its similarity with the anchor and the above distributions via Bayes' rule, which is employed as the sampling weight during the negative sampling process. Since a small batch may not contain any false negatives, we design a memory module with momentum to retain a large negative buffer and apply our negative sampling strategy over the whole buffer. In addition, to make the model focus on hard negatives, we reassign the sampling weights for the simple negatives with a cut-down strategy. Extensive experiments are conducted on Flickr30K and MS-COCO, and the results demonstrate the superiority of our proposed false negative elimination strategy. The code is available at https://github.com/LuminosityX/FNE.
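The Bayes' rule step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian form of the two similarity distributions, the prior `prior_pos`, the function names, and the choice to weight each candidate by its estimated probability of being a true negative are all assumptions made here for illustration.

```python
import math
import statistics


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def false_negative_probability(sim, pos_sims, neg_sims, prior_pos=0.01):
    """Posterior probability that an unpaired sample with similarity `sim`
    is semantically positive (a false negative), via Bayes' rule.

    Assumptions (not stated in the abstract): both similarity distributions
    are modeled as Gaussians, and `prior_pos` is a hand-picked prior.
    """
    pos_mean, pos_std = statistics.mean(pos_sims), statistics.stdev(pos_sims)
    neg_mean, neg_std = statistics.mean(neg_sims), statistics.stdev(neg_sims)
    joint_pos = gaussian_pdf(sim, pos_mean, pos_std) * prior_pos
    joint_neg = gaussian_pdf(sim, neg_mean, neg_std) * (1.0 - prior_pos)
    return joint_pos / (joint_pos + joint_neg)


def sampling_weights(candidate_sims, pos_sims, neg_sims, prior_pos=0.01):
    """Normalized sampling weights over candidate negatives.

    Hypothetical weighting: each candidate is weighted by its estimated
    probability of being a true negative, 1 - P(false negative | sim).
    """
    raw = [
        1.0 - false_negative_probability(s, pos_sims, neg_sims, prior_pos)
        for s in candidate_sims
    ]
    total = sum(raw)
    if total == 0.0:
        # Fall back to uniform weights if every candidate looks positive.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]
```

Under these assumptions, a candidate whose similarity falls inside the positive distribution receives a weight close to zero, so it is rarely sampled as a negative and rarely repelled from the anchor.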
Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, \textit{e.g.}, CNN for images and RNN/Transformer for texts. Such a discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, which in turn results in inferior alignment between them. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed \textbf{Hierarchical Alignment Transformers (HAT)}, which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, \textit{i.e.}, image-to-text and text-to-image retrieval, HAT achieves relative Recall@1 improvements of 7.6\% and 16.7\% on MSCOCO, and 4.4\% and 11.6\% on Flickr30K, respectively. The code is available at \url{https://github.com/LuminosityX/HAT}.
Energy theft detection (ETD) and energy consumption forecasting (ECF) are two interconnected challenges in smart grid systems, and addressing them jointly is crucial for ensuring system security. This paper proposes a solution that combines long short-term memory (LSTM) and a denoising diffusion probabilistic model (DDPM) to generate input reconstructions and forecasts. By leveraging the reconstruction and forecasting errors, the system identifies instances of energy theft; the methods based on reconstruction error and on forecasting error complement each other in detecting different types of attacks. In extensive experiments on real-world and synthetic datasets, the proposed scheme outperforms baseline methods on both the ETD and ECF problems. The ensemble method significantly enhances ETD performance, accurately detecting energy theft attacks that baseline methods fail to detect. The research offers a comprehensive and effective solution to the ETD and ECF challenges, demonstrating promising results and improved security in smart grid systems.
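The idea of combining the two error signals can be sketched as follows. This is an illustrative sketch only: the mean-absolute-error metric, the fixed thresholds, the OR-style ensemble rule, and all function and parameter names are assumptions, and the reconstruction and forecast are taken as given inputs rather than produced by an LSTM or DDPM.

```python
def mean_abs_error(actual, estimate):
    """Mean absolute error between two equal-length sequences of readings."""
    if len(actual) != len(estimate):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - e) for a, e in zip(actual, estimate)) / len(actual)


def detect_theft(readings, reconstruction, forecast,
                 recon_threshold, forecast_threshold):
    """Flag a consumption window as suspicious.

    Hypothetical ensemble rule: the window is flagged when either the
    reconstruction error or the forecasting error exceeds its threshold,
    so each detector can catch attacks that the other one misses.
    """
    recon_error = mean_abs_error(readings, reconstruction)
    forecast_error = mean_abs_error(readings, forecast)
    return recon_error > recon_threshold or forecast_error > forecast_threshold
```

For example, `detect_theft([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [3.0, 4.0, 5.0], 0.5, 0.5)` flags the window because the forecasting error alone exceeds its threshold, even though the reconstruction matches the readings exactly.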
Brain signals are important quantitative data for understanding the physiological activities and diseases of the human brain. Most existing studies focus on supervised learning methods, which, however, require high-cost clinical labels. In addition, the large difference in the clinical patterns of brain signals measured by invasive (e.g., SEEG) and non-invasive (e.g., EEG) methods has led to the lack of a unified method. To handle the above issues, we propose to study a self-supervised learning (SSL) framework for brain signals that can be applied to pre-train on either SEEG or EEG data. Intuitively, brain signals, generated by the firing of neurons, are transmitted among different connecting structures in the human brain. Inspired by this, we propose MBrain to learn implicit spatial and temporal correlations between different channels (i.e., contacts of the electrode, corresponding to different brain areas) as the cornerstone for uniformly modeling different types of brain signals. Specifically, we represent the spatial correlation by a graph structure, which is built with the proposed multi-channel CPC. We theoretically prove that optimizing the goal of multi-channel CPC can lead to a better predictive representation, and we apply the instantaneous-time-shift prediction task based on it. Then we capture the temporal correlation by designing the delayed-time-shift prediction task. Finally, a replace-discriminative-learning task is proposed to preserve the characteristics of each channel. Extensive experiments on seizure detection with large-scale real-world EEG and SEEG datasets demonstrate that our model outperforms several state-of-the-art time series SSL and unsupervised models, and has the potential to be deployed in clinical practice.