How do we know when the predictions made by a classifier can be trusted? This is a fundamental problem that also has immense practical applicability, especially in safety-critical areas such as medicine and autonomous driving. The de facto approach of using the classifier's softmax outputs as a proxy for trustworthiness suffers from the over-confidence issue; while the most recent works incur problems such as additional retraining cost and accuracy versus trustworthiness trade-off. In this work, we argue that the trustworthiness of a classifier's prediction for a sample is highly associated with two factors: the sample's neighborhood information and the classifier's output. To combine the best of both worlds, we design a model-agnostic post-hoc approach NeighborAgg to leverage the two essential information via an adaptive neighborhood aggregation. Theoretically, we show that NeighborAgg is a generalized version of a one-hop graph convolutional network, inheriting the powerful modeling ability to capture the varying similarity between samples within each class. We also extend our approach to the closely related task of mislabel detection and provide a theoretical coverage guarantee to bound the false negative. Empirically, extensive experiments on image and tabular benchmarks verify our theory and suggest that NeighborAgg outperforms other methods, achieving state-of-the-art trustworthiness performance.
Connected and automated vehicles (CAVs) are viewed as a special kind of robots that have the potential to significantly improve the safety and efficiency of traffic. In contrast to many swarm robotics studies that are demonstrated in labs by employing a small number of robots, CAV studies aims to achieve cooperative driving of unceasing robot swarm flows. However, how to get the optimal passing order of such robot swarm flows even for a signal-free intersection is an NP-hard problem (specifically, enumerating based algorithm takes days to find the optimal solution to a 20-CAV scenario). Here, we introduce a novel cooperative driving algorithm (AlphaOrder) that combines offline deep learning and online tree searching to find a near-optimal passing order in real-time. AlphaOrder builds a pointer network model from solved scenarios and generates near-optimal passing orders instantaneously for new scenarios. Furthermore, our approach provides a general approach to managing preemptive resource sharing between swarm robotics (e.g., scheduling multiple automated guided vehicles (AGVs) and unmanned aerial vehicles (UAVs) at conflicting areas
Of particular interest is to discover useful representations solely from observations in an unsupervised generative manner. However, the question of whether existing normalizing flows provide effective representations for downstream tasks remains mostly unanswered despite their strong ability for sample generation and density estimation. This paper investigates this problem for such a family of generative models that admits exact invertibility. We propose Neural Principal Component Analysis (Neural-PCA) that operates in full dimensionality while capturing principal components in \emph{descending} order. Without exploiting any label information, the principal components recovered store the most informative elements in their \emph{leading} dimensions and leave the negligible in the \emph{trailing} ones, allowing for clear performance improvements of $5\%$-$10\%$ in downstream tasks. Such improvements are empirically found consistent irrespective of the number of latent trailing dimensions dropped. Our work suggests that necessary inductive bias be introduced into generative modelling when representation quality is of interest.
Learning from demonstration (LfD) methods have shown promise for solving multi-step tasks; however, these approaches do not guarantee successful reproduction of the task given disturbances. In this work, we identify the roots of such a challenge as the failure of the learned continuous policy to satisfy the discrete plan implicit in the demonstration. By utilizing modes (rather than subgoals) as the discrete abstraction and motion policies with both mode invariance and goal reachability properties, we prove our learned continuous policy can simulate any discrete plan specified by a Linear Temporal Logic (LTL) formula. Consequently, the imitator is robust to both task- and motion-level disturbances and guaranteed to achieve task success. Project page: https://sites.google.com/view/ltl-ds
With the dramatically increased number of parameters in language models, sparsity methods have received ever-increasing research focus to compress and accelerate the models. While most research focuses on how to accurately retain appropriate weights while maintaining the performance of the compressed model, there are challenges in the computational overhead and memory footprint of sparse training when compressing large-scale language models. To address this problem, we propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training in downstream tasks. Specifically, we first combine the data-free and data-driven criteria to efficiently and accurately measure the importance of weights. Then we investigate the intrinsic redundancy of data-driven weight importance and derive two obvious characteristics i.e., low-rankness and structuredness. Based on that, two groups of small matrices are introduced to compute the data-driven importance of weights, instead of using the original large importance score matrix, which therefore makes the sparse training resource-efficient and parameter-efficient. Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) on dozens of datasets demonstrate PST performs on par or better than previous sparsity methods, despite only training a small number of parameters. For instance, compared with previous sparsity methods, our PST only requires 1.5% trainable parameters to achieve comparable performance on BERT.
We present Neighborhood Attention Transformer (NAT), an efficient, accurate and scalable hierarchical transformer that works well on both image classification and downstream vision tasks. It is built upon Neighborhood Attention (NA), a simple and flexible attention mechanism that localizes the receptive field for each query to its nearest neighboring pixels. NA is a localization of self-attention, and approaches it as the receptive field size increases. It is also equivalent in FLOPs and memory usage to Swin Transformer's shifted window attention given the same receptive field size, while being less constrained. Furthermore, NA includes local inductive biases, which eliminate the need for extra operations such as pixel shifts. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet with only 4.3 GFLOPs and 28M parameters, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20k. We will open-source our checkpoints, training script, configurations, and our CUDA kernel at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer .
Most of the existing methods for debaising in click-through rate (CTR) prediction depend on an oversimplified assumption, i.e., the click probability is the product of observation probability and relevance probability. However, since there is a complicated interplay between these two probabilities, these methods cannot be applied to other scenarios, e.g. query auto completion (QAC) and route recommendation. We propose a general debiasing framework without simplifying the relationships between variables, which can handle all scenarios in CTR prediction. Simulation experiments show that: under the simplest scenario, our method maintains a similar AUC with the state-of-the-art methods; in other scenarios, our method achieves considerable improvements compared with existing methods. Meanwhile, in online experiments, the framework also gains significant improvements consistently.
This paper proposes a probabilistic contrastive loss function for self-supervised learning. The well-known contrastive loss is deterministic and involves a temperature hyperparameter that scales the inner product between two normed feature embeddings. By reinterpreting the temperature hyperparameter as a quantity related to the radius of the hypersphere, we derive a new loss function that involves a confidence measure which quantifies uncertainty in a mathematically grounding manner. Some intriguing properties of the proposed loss function are empirically demonstrated, which agree with human-like predictions. We believe the present work brings up a new prospective to the area of contrastive learning.
Deep neural networks (DNN) have achieved great success in the recommender systems (RS) domain. However, to achieve remarkable performance, DNN-based recommender models often require numerous parameters, which inevitably bring redundant neurons and weights, a phenomenon referred to as over-parameterization. In this paper, we plan to exploit such redundancy phenomena to improve the performance of RS. Specifically, we propose PCRec, a top-N item \underline{rec}ommendation framework that leverages collaborative training of two DNN-based recommender models with the same network structure, termed \underline{p}eer \underline{c}ollaboration. PCRec can reactivate and strengthen the unimportant (redundant) weights during training, which achieves higher prediction accuracy but maintains its original inference efficiency. To realize this, we first introduce two criteria to identify the importance of weights of a given recommender model. Then, we rejuvenate the unimportant weights by transplanting outside information (i.e., weights) from its peer network. After such an operation and retraining, the original recommender model is endowed with more representation capacity by possessing more functional model parameters. To show its generality, we instantiate PCRec by using three well-known recommender models. We conduct extensive experiments on three real-world datasets, and show that PCRec yields significantly better recommendations than its counterpart with the same model (parameter) size.