We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. % in this paper. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves $48.3$AP in $12$ epochs and $51.0$AP in $36$ epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of $\textbf{+4.9}$\textbf{AP} and $\textbf{+2.4}$\textbf{AP}, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$\textbf{AP}) and \texttt{test-dev} (\textbf{$\textbf{63.3}$AP}). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}.
This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension. We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review key components of the model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero or few shot learning tasks. Finally, we discuss some potential future trends towards modality cooperation, unified representation, and knowledge incorporation. We believe that this review will be of help for researchers and practitioners of AI and ML, especially those interested in computer vision and natural language processing.
We present in this paper a novel denoising training method to speedup DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching which causes inconsistent optimization goals in early training stages. To address this issue, except for the Hungarian loss, our method additionally feeds ground-truth bounding boxes with noises into Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to a faster convergence. Our method is universal and can be easily plugged into any DETR-like methods by adding dozens of lines of code to achieve a remarkable improvement. As a result, our DN-DETR results in a remarkable improvement ($+1.9$AP) under the same setting and achieves the best result (AP $43.4$ and $48.6$ with $12$ and $50$ epochs of training respectively) among DETR-like methods with ResNet-$50$ backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with $50\%$ training epochs. Code is available at \url{https://github.com/FengLi-ust/DN-DETR}.
Triplet loss, one of the deep metric learning (DML) methods, is to learn the embeddings where examples from the same class are closer than examples from different classes. Motivated by DML, we propose an effective BP-Triplet Loss for unsupervised domain adaption (UDA) from the perspective of Bayesian learning and we name the model as BP-Triplet Net. In previous metric learning based methods for UDA, sample pairs across domains are treated equally, which is not appropriate due to the domain bias. In our work, considering the different importance of pair-wise samples for both feature learning and domain alignment, we deduce our BP-Triplet loss for effective UDA from the perspective of Bayesian learning. Our BP-Triplet loss adjusts the weights of pair-wise samples in intra domain and inter domain. Especially, it can self attend to the hard pairs (including hard positive pair and hard negative pair). Together with the commonly used adversarial loss for domain alignment, the quality of target pseudo labels is progressively improved. Our method achieved low joint error of the ideal source and target hypothesis. The expected target error can then be upper bounded following Ben-David s theorem. Comprehensive evaluations on five benchmark datasets, handwritten digits, Office31, ImageCLEF-DA, Office-Home and VisDA-2017 demonstrate the effectiveness of the proposed approach for UDA.
Winograd convolution is originally proposed to reduce the computing overhead by converting multiplication in neural network (NN) with addition via linear transformation. Other than the computing efficiency, we observe its great potential in improving NN fault tolerance and evaluate its fault tolerance comprehensively for the first time. Then, we explore the use of fault tolerance of winograd convolution for either fault-tolerant or energy-efficient NN processing. According to our experiments, winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49\% or energy consumption by 7.19\% without any accuracy loss compared to that without being aware of the fault tolerance
We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer. Using box coordinates not only helps using explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using the box width and height information. Such a design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer-by-layer in a cascade manner. As a result, it leads to the best performance on MS-COCO benchmark among the DETR-like detection models under the same setting, e.g., AP 45.7\% using ResNet50-DC5 as backbone trained in 50 epochs. We also conducted extensive experiments to confirm our analysis and verify the effectiveness of our methods. Code is available at \url{https://github.com/SlongLiu/DAB-DETR}.
Adversarial examples are inputs for machine learning models that have been designed by attackers to cause the model to make mistakes. In this paper, we demonstrate that adversarial examples can also be utilized for good to improve the performance of imbalanced learning. We provide a new perspective on how to deal with imbalanced data: adjust the biased decision boundary by training with Guiding Adversarial Examples (GAEs). Our method can effectively increase the accuracy of minority classes while sacrificing little accuracy on majority classes. We empirically show, on several benchmark datasets, our proposed method is comparable to the state-of-the-art method. To our best knowledge, we are the first to deal with imbalanced learning with adversarial examples.
Occlusion between different objects is a typical challenge in Multi-Object Tracking (MOT), which often leads to inferior tracking results due to the missing detected objects. The common practice in multi-object tracking is re-identifying the missed objects after their reappearance. Though tracking performance can be boosted by the re-identification, the annotation of identity is required to train the model. In addition, such practice of re-identification still can not track those highly occluded objects when they are missed by the detector. In this paper, we focus on online multi-object tracking and design two novel modules, the unsupervised re-identification learning module and the occlusion estimation module, to handle these problems. Specifically, the proposed unsupervised re-identification learning module does not require any (pseudo) identity information nor suffer from the scalability issue. The proposed occlusion estimation module tries to predict the locations where occlusions happen, which are used to estimate the positions of missed objects by the detector. Our study shows that, when applied to state-of-the-art MOT methods, the proposed unsupervised re-identification learning is comparable to supervised re-identification learning, and the tracking performance is further improved by the proposed occlusion estimation module.
A long standing problem in the modeling of non-Newtonian hydrodynamics is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics. The main complication arises from the long polymer relaxation time, the complex molecular structure, and heterogeneous interaction. DeePN$^2$, a deep learning-based non-Newtonian hydrodynamic model, has been proposed and has shown some success in systematically passing the micro-scale structural mechanics information to the macro-scale hydrodynamics for suspensions with simple polymer conformation and bond potential. The model retains a multi-scaled nature by mapping the polymer configurations into a set of symmetry-preserving macro-scale features. The extended constitutive laws for these macro-scale features can be directly learned from the kinetics of their micro-scale counterparts. In this paper, we carry out further study of DeePN$^2$ using more complex micro-structural models. We show that DeePN$^2$ can faithfully capture the broadly overlooked viscoelastic differences arising from the specific molecular structural mechanics without human intervention.
Multi-touch attribution (MTA), aiming to estimate the contribution of each advertisement touchpoint in conversion journeys, is essential for budget allocation and automatically advertising. Existing methods first train a model to predict the conversion probability of the advertisement journeys with historical data and calculate the attribution of each touchpoint using counterfactual predictions. An assumption of these works is the conversion prediction model is unbiased, i.e., it can give accurate predictions on any randomly assigned journey, including both the factual and counterfactual ones. Nevertheless, this assumption does not always hold as the exposed advertisements are recommended according to user preferences. This confounding bias of users would lead to an out-of-distribution (OOD) problem in the counterfactual prediction and cause concept drift in attribution. In this paper, we define the causal MTA task and propose CausalMTA to eliminate the influence of user preferences. It systemically eliminates the confounding bias from both static and dynamic preferences to learn the conversion prediction model using historical data. We also provide a theoretical analysis to prove CausalMTA can learn an unbiased prediction model with sufficient data. Extensive experiments on both public datasets and the impression data in an e-commerce company show that CausalMTA not only achieves better prediction performance than the state-of-the-art method but also generates meaningful attribution credits across different advertising channels.