Face recognition has achieved significant progress in deep-learning era due to the ultra-large-scale and well-labeled datasets. However, training on ultra-large-scale datasets is time-consuming and takes up a lot of hardware resource. Therefore, designing an efficient training approach is crucial and indispensable. The heavy computational and memory costs mainly result from the high dimensionality of the Fully-Connected (FC) layer. Specifically, the dimensionality is determined by the number of face identities, which can be million-level or even more. To this end, we propose a novel training approach for ultra-large-scale face datasets, termed Faster Face Classification (F$^2$C). In F$^2$C, we first define a Gallery Net and a Probe Net that are used to generate identities' centers and extract faces' features for face recognition, respectively. Gallery Net has the same structure as Probe Net and inherits the parameters from Probe Net with a moving average paradigm. After that, to reduce the training time and hardware costs of the FC layer, we propose a Dynamic Class Pool (DCP) that stores the features from Gallery Net and calculates the inner product (logits) with positive samples (whose identities are in the DCP) in each mini-batch. DCP can be regarded as a substitute for the FC layer but it is far smaller, thus greatly reducing the computational and memory costs. For negative samples (whose identities are not in DCP), we minimize the cosine similarities between negative samples and those in DCP. Then, to improve the update efficiency of DCP's parameters, we design a dual data-loader including identity-based and instance-based loaders to generate a certain of identities and samples in mini-batches.
Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to its ability in capturing locality and translation invariance. Recently, many vision transformer architectures have been proposed and they show promising performance. A key component in vision transformers is the fully-connected self-attention which is more powerful than CNNs in modelling long range dependencies. However, since the current dense self-attention uses all image patches (tokens) to compute attention matrix, it may neglect locality of images patches and involve noisy tokens (e.g., clutter background and occlusion), leading to a slow training process and potentially degradation of performance. To address these problems, we propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers. Specifically, instead of involving all the tokens for attention matrix calculation, we only select the top-k similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than others. In addition, the k-NN attention allows for the exploration of long range correlation and at the same time filter out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that $k$-NN attention is powerful in distilling noise from input tokens and in speeding up training. Extensive experiments are conducted by using ten different vision transformer architectures to verify that the proposed k-NN attention can work with any existing transformer architectures to improve its prediction performance.
Cluster discrimination is an effective pretext task for unsupervised representation learning, which often consists of two phases: clustering and discrimination. Clustering is to assign each instance a pseudo label that will be used to learn representations in discrimination. The main challenge resides in clustering since many prevalent clustering methods (e.g., k-means) have to run in a batch mode that goes multiple iterations over the whole data. Recently, a balanced online clustering method, i.e., SwAV, is proposed for representation learning. However, the assignment is optimized within only a small subset of data, which can be suboptimal. To address these challenges, we first investigate the objective of clustering-based representation learning from the perspective of distance metric learning. Based on this, we propose a novel clustering-based pretext task with online \textbf{Co}nstrained \textbf{K}-m\textbf{e}ans (\textbf{CoKe}) to learn representations and relations between instances simultaneously. Compared with the balanced clustering that each cluster has exactly the same size, we only constrain the minimum size of clusters to flexibly capture the inherent data structure. More importantly, our online assignment method has a theoretical guarantee to approach the global optimum. Finally, two variance reduction strategies are proposed to make the clustering robust for different augmentations. Without keeping representations of instances, the data is accessed in an online mode in CoKe while a single view of instances at each iteration is sufficient to demonstrate a better performance than contrastive learning methods relying on two views. Extensive experiments on ImageNet verify the efficacy of our proposal. Code will be released.
This paper introduces our solution for the Track2 in AI City Challenge 2021 (AICITY21). The Track2 is a vehicle re-identification (ReID) task with both the real-world data and synthetic data. We mainly focus on four points, i.e. training data, unsupervised domain-adaptive (UDA) training, post-processing, model ensembling in this challenge. (1) Both cropping training data and using synthetic data can help the model learn more discriminative features. (2) Since there is a new scenario in the test set that dose not appear in the training set, UDA methods perform well in the challenge. (3) Post-processing techniques including re-ranking, image-to-track retrieval, inter-camera fusion, etc, significantly improve final performance. (4) We ensemble CNN-based models and transformer-based models which provide different representation diversity. With aforementioned techniques, our method finally achieves 0.7445 mAP score, yielding the first place in the competition. Codes are available at https://github.com/michuanhaohao/AICITY2021_Track2_DMT.
Previous transfer methods for anomaly detection generally assume the availability of labeled data in source or target domains. However, such an assumption is not valid in most real applications where large-scale labeled data are too expensive. Therefore, this paper proposes an importance weighted adversarial autoencoder-based method to transfer anomaly detection knowledge in an unsupervised manner, particularly for a rarely studied scenario where a target domain has no labeled normal/abnormal data while only normal data from a related source domain exist. Specifically, the method learns to align the distributions of normal data in both source and target domains, but leave the distribution of abnormal data in the target domain unchanged. In this way, an obvious gap can be produced between the distributions of normal and abnormal data in the target domain, therefore enabling the anomaly detection in the domain. Extensive experiments on multiple synthetic datasets and the UCSD benchmark demonstrate the effectiveness of our approach.
Recently, maximizing mutual information has emerged as a powerful method for unsupervised graph representation learning. The existing methods are typically effective to capture information from the topology view but ignore the feature view. To circumvent this issue, we propose a novel approach by exploiting mutual information maximization across feature and topology views. Specifically, we first utilize a multi-view representation learning module to better capture both local and global information content across feature and topology views on graphs. To model the information shared by the feature and topology spaces, we then develop a common representation learning module using mutual information maximization and reconstruction loss minimization. To explicitly encourage diversity between graph representations from the same view, we also introduce a disagreement regularization to enlarge the distance between representations from the same view. Experiments on synthetic and real-world datasets demonstrate the effectiveness of integrating feature and topology views. In particular, compared with the previous supervised methods, our proposed method can achieve comparable or even better performance under the unsupervised representation and linear evaluation protocol.
Multi-Target Multi-Camera Tracking has a wide range of applications and is the basis for many advanced inferences and predictions. This paper describes our solution to the Track 3 multi-camera vehicle tracking task in 2021 AI City Challenge (AICITY21). This paper proposes a multi-target multi-camera vehicle tracking framework guided by the crossroad zones. The framework includes: (1) Use mature detection and vehicle re-identification models to extract targets and appearance features. (2) Use modified JDETracker (without detection module) to track single-camera vehicles and generate single-camera tracklets. (3) According to the characteristics of the crossroad, the Tracklet Filter Strategy and the Direction Based Temporal Mask are proposed. (4) Propose Sub-clustering in Adjacent Cameras for multi-camera tracklets matching. Through the above techniques, our method obtained an IDF1 score of 0.8095, ranking first on the leaderboard. The code have released: https://github.com/LCFractal/AIC21-MTMC.
Stochastic gradient descent (SGD) has become the most attractive optimization method in training large-scale deep neural networks due to its simplicity, low computational cost in each updating step, and good performance. Standard excess risk bounds show that SGD only needs to take one pass over the training data and more passes could not help to improve the performance. Empirically, it has been observed that SGD taking more than one pass over the training data (multi-pass SGD) has much better excess risk bound performance than the SGD only taking one pass over the training data (one-pass SGD). However, it is not very clear that how to explain this phenomenon in theory. In this paper, we provide some theoretical evidences for explaining why multiple passes over the training data can help improve performance under certain circumstance. Specifically, we consider smooth risk minimization problems whose objective function is non-convex least squared loss. Under Polyak-Lojasiewicz (PL) condition, we establish faster convergence rate of excess risk bound for multi-pass SGD than that for one-pass SGD.