Detecting out-of-distribution (OOD) data is critical to building reliable machine learning systems in the open world. Among existing OOD detection methods, ReAct is well known for its simplicity and efficiency, and enjoys solid theoretical analysis. It enlarges the gap between in-distribution (ID) and OOD data by clipping abnormally high activation values. But is this operation optimal? Is there, in theory, a better way to widen the gap between ID and OOD samples? Driven by these questions, we study the optimal activation function modification from the perspective of functional extrema and propose the Variational Rectified Activations (VRA) method. To make our method easy to use in practice, we further propose several VRA variants. To verify the effectiveness of our method, we conduct experiments on multiple benchmark datasets. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches. Meanwhile, our method is easy to implement and requires neither additional OOD data nor a fine-tuning process; we can realize OOD detection in only one forward pass.
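The rectification that VRA builds on can be pictured as ReAct's element-wise clipping of penultimate-layer activations at an upper threshold, after which a standard score (e.g., the energy score) separates ID from OOD inputs. A minimal sketch follows; the feature values and threshold are illustrative, not taken from any real model:

```python
import math

def react_clip(features, threshold):
    """ReAct-style rectification: cap each activation at `threshold`.

    Clipping abnormally high activations shrinks OOD scores more than
    ID scores, widening the gap between the two distributions.
    """
    return [min(h, threshold) for h in features]

def energy_score(logits):
    """Energy-based OOD score: higher means more ID-like."""
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(z - m) for z in logits))

# Illustrative penultimate-layer activations (7.5 is an abnormally high unit)
features = [0.3, 1.2, 7.5, 0.8]
clipped = react_clip(features, 1.0)  # the outlier unit is capped at 1.0
```

In practice the threshold is typically set to a high percentile of activations observed on ID data; VRA asks whether a clipping rule of this simple form is optimal at all.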
Detecting out-of-distribution (OOD) data is critical to building reliable machine learning systems in the open world. Previous methods either require additional data or rely on information from the training data, while methods that use only the model's parameter information perform relatively poorly. We propose an efficient OOD detection method that uses only model parameter information. To verify its effectiveness, we conduct experiments on four benchmark datasets. Experimental results demonstrate that our RG outperforms existing state-of-the-art approaches by 4.57\% in average AUROC. Meanwhile, our method is easy to implement and requires neither additional OOD data nor a fine-tuning process. We can realize OOD detection in only one forward pass of any pretrained model.
Noisy partial label learning (noisy PLL) is an important branch of weakly supervised learning. Unlike PLL, where the ground-truth label must reside in the candidate set, noisy PLL relaxes this constraint and allows the ground-truth label to fall outside the candidate set. To address this problem, existing works attempt to detect noisy samples and estimate the ground-truth label for each of them. However, detection errors are inevitable; these errors accumulate during training and continuously affect model optimization. To address this challenge, we propose a novel framework for noisy PLL, called ``Dynamically Adjusted Label Importance (DALI)''. It reduces the negative impact of detection errors by trading off the initial candidate set against model outputs, with theoretical guarantees. Experimental results on multiple datasets demonstrate that our DALI succeeds over existing state-of-the-art approaches on noisy PLL. Our code will soon be publicly available.
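The trade-off between the initial candidate set and model outputs described above can be pictured as a convex combination of a candidate-set prior and the model's predicted distribution. The blending rule below is purely illustrative and is not DALI's actual update, but it shows the shape of such a trade-off:

```python
def adjusted_label_importance(candidate_set, probs, lam):
    """Blend a candidate-set prior with model predictions (illustrative).

    candidate_set: 0/1 indicator over classes (1 = label is a candidate)
    probs: model's predicted class distribution
    lam: trust placed on the initial candidate set (1.0 = ignore the model)
    """
    prior_mass = sum(candidate_set)
    prior = [c / prior_mass for c in candidate_set]  # uniform over candidates
    mixed = [lam * p + (1 - lam) * q for p, q in zip(prior, probs)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize to a distribution
```

Decaying `lam` over training would gradually shift trust from the (possibly noisy) candidate set toward the model's own predictions, which is the intuition behind dynamically adjusting label importance.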
Partial label learning (PLL) is a typical weakly supervised learning problem, where each sample is associated with a set of candidate labels. The basic assumption of PLL is that the ground-truth label must reside in the candidate set. However, this assumption may not hold due to unprofessional judgment by annotators, limiting the practical application of PLL. In this paper, we relax this assumption and focus on a more general problem, noisy PLL, where the ground-truth label may not exist in the candidate set. To address this challenging problem, we propose a novel framework called "Automatic Refinement Network (ARNet)". Our method consists of multiple rounds. In each round, we purify noisy samples through two key modules, i.e., noisy sample detection and label correction. To guarantee the performance of these modules, we start with warm-up training and automatically select the appropriate correction epoch. Meanwhile, we exploit data augmentation to further reduce prediction errors in ARNet. Through theoretical analysis, we prove that our method can reduce the noise level of the dataset and eventually approximate the Bayes optimal classifier. To verify the effectiveness of ARNet, we conduct experiments on multiple benchmark datasets. Experimental results demonstrate that our ARNet is superior to existing state-of-the-art approaches on noisy PLL. Our code will be made public soon.
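A toy version of the detect-then-correct round described above: flag a sample as noisy when even the full candidate set explains the model's prediction poorly, then fall back to the model's top class. Both the detection criterion and the threshold here are illustrative assumptions, not ARNet's exact modules:

```python
import math

def candidate_loss(probs, candidate_set):
    """Loss w.r.t. the candidate set: -log of the total candidate probability."""
    mass = sum(p for p, c in zip(probs, candidate_set) if c)
    return -math.log(max(mass, 1e-12))

def detect_and_correct(probs, candidate_set, loss_threshold=1.0):
    """One purification step: detect a noisy sample, then correct its label.

    Returns (is_noisy, corrected_candidate_set). When the candidate set
    carries little of the model's probability mass, the ground truth likely
    lies outside it, so we replace the set with the model's top prediction.
    """
    loss = candidate_loss(probs, candidate_set)
    if loss > loss_threshold:
        top = max(range(len(probs)), key=lambda k: probs[k])
        corrected = [1 if k == top else 0 for k in range(len(probs))]
        return True, corrected
    return False, candidate_set
```

Running such a step once per round (after warm-up, so the model's predictions are trustworthy) mirrors the multi-round purification loop the abstract describes.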
The Medical Relation Extraction (MRE) task aims to extract relations between entities in medical texts. Traditional relation extraction methods achieve impressive success by exploiting syntactic information, e.g., the dependency tree. However, the quality of the 1-best dependency tree produced for medical texts by an out-of-domain parser is relatively limited, so the performance of medical relation extraction methods may degrade. To this end, we propose a method that jointly models semantic and syntactic information from medical texts based on causal explanation theory. We generate dependency forests consisting of semantic-embedded 1-best dependency trees. A task-specific causal explainer is then adopted to prune the dependency forests, which are further fed into a dedicated graph convolutional network to learn representations for the downstream task. Empirically, extensive comparisons on benchmark medical datasets demonstrate the effectiveness of our model.
With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently. Despite significant progress, two major challenges remain on the way towards robust MSA: 1) inefficiency when modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to the random modality feature missing that typically occurs in realistic settings. In this paper, we propose a generic and unified framework to address both, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs utterance-level representations from each modality as the global multimodal context to interact with local unimodal features, allowing the two to promote each other. It not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also leads to better performance. To improve model robustness in the incomplete modality setting, DLFR, on the one hand, performs low-level feature reconstruction to implicitly encourage the model to learn semantic information from incomplete data. On the other hand, it innovatively regards complete and incomplete data as two different views of one sample and utilizes siamese representation learning to explicitly attract their high-level representations. Comprehensive experiments on three popular datasets demonstrate that our method achieves superior performance in both complete and incomplete modality settings.
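The "attract" step in DLFR can be illustrated as minimizing the negative cosine similarity between the high-level representations of the complete and incomplete views, in the spirit of siamese representation learning. The loss form and vectors below are a sketch, not the exact objective or stop-gradient scheme used in practice:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attraction_loss(z_complete, z_incomplete):
    """Siamese-style loss: minimizing it pulls the two views together.

    z_complete / z_incomplete are the high-level representations of the
    same sample seen with full vs. partially missing modality features.
    """
    return -cosine_similarity(z_complete, z_incomplete)
```

Identical views attain the minimum of -1.0, so gradient descent on this loss drives the incomplete-view representation toward the complete-view one.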
In this paper, we present our solution to the Multi-Task Learning (MTL) Challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition. The task is to predict three frame-level emotion descriptors from videos: discrete emotional state, valence and arousal, and action units. Although researchers have proposed several approaches and achieved promising results on ABAW, current works rarely consider the interactions between different emotion descriptors. To this end, we propose a novel end-to-end architecture that achieves full integration of the different types of information. Experimental results demonstrate the effectiveness of our proposed solution.
Speech emotion recognition (SER) is a crucial research topic in human-computer interaction. Existing works are mainly based on manually designed models. Despite their great success, these methods heavily rely on historical experience, which is time-consuming and cannot exhaust all possible structures. To address this problem, we propose a neural architecture search (NAS) based framework for SER, called "EmotionNAS". We take spectrogram and wav2vec features as inputs, followed by NAS to optimize the network structure for each feature separately. We further incorporate the complementary information in these features through decision-level fusion. Experimental results on IEMOCAP demonstrate that our method outperforms existing state-of-the-art strategies on SER.
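Decision-level fusion of the two branches can be as simple as a weighted average of the per-branch class posteriors, followed by an argmax. The weight below is illustrative, not a tuned value from the system:

```python
def decision_level_fusion(probs_spec, probs_w2v, w=0.5):
    """Fuse spectrogram- and wav2vec-branch posteriors by weighted averaging.

    probs_spec / probs_w2v: per-class probabilities from the two branches.
    w: weight on the spectrogram branch. Returns the predicted class index.
    """
    fused = [w * p + (1 - w) * q for p, q in zip(probs_spec, probs_w2v)]
    return max(range(len(fused)), key=lambda k: fused[k])
```

Because the two branches are trained on complementary features, a disagreement between them is resolved by whichever branch is more confident (scaled by its weight), which is the point of fusing at the decision level rather than the feature level.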
Conversations have become a critical data format on social media platforms. Understanding conversations in terms of emotion, content, and other aspects also attracts increasing attention from researchers due to its widespread application in human-computer interaction. In real-world environments, we often encounter the problem of incomplete modalities, which has become a core issue in conversation understanding. To address this problem, researchers have proposed various methods. However, existing approaches are mainly designed for individual utterances or medical images rather than conversational data, and thus cannot exploit temporal and speaker information in conversations. To this end, we propose a novel framework for incomplete multimodal learning in conversations, called "Graph Complete Network (GCNet)", filling the gap left by existing works. Our GCNet contains two well-designed graph neural network-based modules, "Speaker GNN" and "Temporal GNN", to capture speaker and temporal information in conversations. To make full use of complete and incomplete data in feature learning, we jointly optimize classification and reconstruction in an end-to-end manner. To verify the effectiveness of our method, we conduct experiments on three benchmark conversational datasets. Experimental results demonstrate that our GCNet is superior to existing state-of-the-art approaches in incomplete multimodal learning.
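The joint optimization of classification and reconstruction described above amounts to minimizing a weighted sum of the two losses, so the shared representation must both predict the label and retain enough information to recover the missing modality features. The MSE reconstruction term and the weight below are illustrative assumptions, not the exact objective:

```python
import math

def mse(x_hat, x):
    """Mean squared reconstruction error over a feature vector."""
    return sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)

def joint_loss(probs, label, x_hat, x, alpha=0.1):
    """Classification loss plus a weighted reconstruction penalty.

    probs: predicted class distribution; label: ground-truth class index.
    x_hat / x: reconstructed vs. original (complete) modality features.
    alpha: illustrative weight balancing the two objectives.
    """
    cls = -math.log(max(probs[label], 1e-12))  # cross-entropy term
    rec = mse(x_hat, x)
    return cls + alpha * rec
```

Training both terms end-to-end lets incomplete samples still contribute a reconstruction signal even when their classification evidence is weak.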