Few-shot image classification aims to recognize new categories from limited labelled data. Recently, metric-learning-based approaches, which classify a query sample by finding the nearest prototype in the support set based on feature similarities, have been widely investigated. For few-shot classification, the calculated similarity of a query-support pair depends on both the query and the support. The network has different levels of confidence/uncertainty in the similarities calculated for different pairs, and the observed similarities are subject to observation noise. Understanding and modeling this uncertainty could promote better exploitation of the limited samples during optimization, yet it remains underexplored in few-shot learning. In this work, we propose Uncertainty-Aware Few-Shot (UAFS) image classification, which models the uncertainty of the similarities of query-support pairs and performs uncertainty-aware optimization. In particular, we design a graph-based model to jointly estimate the uncertainty of the similarities between a query and the prototypes in the support set. We optimize the network based on the modeled uncertainty by converting each observed similarity into a probabilistic similarity distribution, making the optimization robust to observation noise. Extensive experiments show that our proposed method brings significant improvements over a strong baseline and achieves state-of-the-art performance.
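To make the idea concrete, below is a minimal PyTorch sketch of uncertainty-aware optimization over similarities: each observed query-prototype similarity is treated as the mean of a Gaussian whose variance the network predicts, and the classification loss is averaged over similarities sampled with the reparameterization trick. This is an illustrative simplification, not the paper's graph-based uncertainty estimator; the function name and the Monte-Carlo averaging are our own assumptions.

```python
import torch
import torch.nn.functional as F

def uncertainty_aware_loss(sim, log_var, target, n_samples=10):
    """Treat each observed query-prototype similarity as the mean of a
    Gaussian with predicted variance, and average the cross-entropy over
    similarities sampled via the reparameterization trick (a sketch, not
    the paper's exact formulation).

    sim:     (B, K) observed similarities to the K class prototypes
    log_var: (B, K) predicted log-variance (uncertainty) per pair
    target:  (B,)   ground-truth class indices
    """
    std = torch.exp(0.5 * log_var)
    loss = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(sim)   # noise for reparameterization
        logits = sim + std * eps      # sampled similarity logits
        loss = loss + F.cross_entropy(logits, target)
    return loss / n_samples

# toy usage: 4 queries in a 5-way episode
sim = torch.randn(4, 5)
log_var = torch.zeros(4, 5, requires_grad=True)
target = torch.tensor([0, 1, 2, 3])
print(uncertainty_aware_loss(sim, log_var, target))
```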
For domain generalization (DG) and unsupervised domain adaptation (UDA), cross-domain feature alignment has been widely explored to pull together the feature distributions of different domains and thereby learn domain-invariant representations. However, feature alignment is in general task-ignorant: it can degrade the discrimination power of the feature representation and thus hinder performance. In this paper, we propose a unified framework termed Feature Alignment and Restoration (FAR) to simultaneously ensure high generalization and discrimination power of the network for effective DG and UDA. Specifically, we perform feature alignment (FA) across domains by aligning the moments of the distributions of attentively selected features to reduce their discrepancy. To preserve discrimination, we propose a Feature Restoration (FR) operation that distills task-relevant features from the residual information and uses them to compensate for the aligned features. For better disentanglement, we enforce a dual ranking entropy loss constraint in the FR step to encourage the separation of task-relevant and task-irrelevant features. Extensive experiments on multiple classification benchmarks demonstrate the high performance and strong generalization of our FAR framework for both domain generalization and unsupervised domain adaptation.
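As a rough illustration of the FA step, the sketch below aligns the first two moments (mean and variance) of source and target features. The attentive feature selection and the FR/restoration branch of FAR are omitted, and the function name is ours.

```python
import torch

def moment_alignment_loss(feat_s, feat_t):
    """Reduce cross-domain discrepancy by matching the first two moments
    (per-channel mean and variance) of the two feature batches.

    feat_s, feat_t: (N, C) features from the source and target domains.
    """
    mean_gap = (feat_s.mean(dim=0) - feat_t.mean(dim=0)).pow(2).mean()
    var_gap = (feat_s.var(dim=0) - feat_t.var(dim=0)).pow(2).mean()
    return mean_gap + var_gap

# toy usage: two batches of 128-dim features with shifted statistics
loss = moment_alignment_loss(torch.randn(32, 128), torch.randn(32, 128) + 0.5)
print(loss)
```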
Person re-identification (ReID) aims at matching a person of interest across images. In approaches based on convolutional neural networks (CNNs), loss design plays the role of metric learning, guiding the feature learning process to pull features of the same identity closer and push features of different identities apart. In recent years, the combination of classification loss and triplet loss has achieved superior performance and is predominant in ReID. In this paper, we rethink these loss functions within a generalized formulation and argue that triplet-based optimization can be viewed as two-class subsampling classification, which performs classification over two sampled categories based on instance similarities. Furthermore, we present a case study demonstrating that increasing the number of simultaneously considered instance classes significantly improves ReID performance, since it aligns better with the ReID test/inference process. With multi-class subsampling classification incorporated, we provide a strong baseline that achieves state-of-the-art performance on the benchmark person ReID datasets. Finally, we propose a new meta prototypical N-tuple loss for more efficient multi-class subsampling classification. We hope to inspire more new loss designs in the person ReID field.
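The sketch below illustrates the multi-class subsampling classification view with a prototypical N-tuple-style loss: each anchor is classified among the N identities sampled for the batch via cosine similarity to their prototypes. It is an approximation of the idea, not the exact meta prototypical N-tuple loss; the names and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def n_tuple_loss(anchor, prototypes, pos_index, temperature=0.1):
    """Multi-class subsampling classification: classify each anchor among
    the N identities sampled in the batch, using cosine similarity to each
    identity's prototype (e.g., the mean of its instance features).

    anchor:     (B, C) anchor features
    prototypes: (N, C) one prototype per sampled identity
    pos_index:  (B,)   index of each anchor's own identity in [0, N)
    """
    anchor = F.normalize(anchor, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = anchor @ prototypes.t() / temperature  # (B, N) similarities
    return F.cross_entropy(logits, pos_index)

# toy usage: 8 anchors classified among N = 16 sampled identities
anchor = torch.randn(8, 256)
prototypes = torch.randn(16, 256)
pos_index = torch.randint(0, 16, (8,))
print(n_tuple_loss(anchor, prototypes, pos_index))
```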
Supervised person re-identification (ReID) often has poor scalability and usability in real-world deployments due to domain gaps and the lack of annotations for the target domain data. Unsupervised person ReID through domain adaptation is attractive yet challenging. Existing unsupervised ReID approaches often fail to correctly identify positive and negative samples through distance-based matching/ranking: the two distributions of distances for positive sample pairs (Pos-distr) and negative sample pairs (Neg-distr) are often not well separated and have a large overlap. To address this problem, we introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage a clear separation of positive and negative samples from a global view. We model the two global distance distributions as Gaussian distributions and push the two distributions apart while encouraging their sharpness during unsupervised training. In particular, to model the distributions from a global view and facilitate timely updating of the distributions and the GDS-related losses, we leverage a momentum update mechanism to build and maintain the distribution parameters (mean and variance) and calculate the loss on the fly during training. Distribution-based hard mining is further proposed to promote the separation of the two distributions. We validate the effectiveness of the GDS constraint in unsupervised ReID networks. Extensive experiments on multiple ReID benchmark datasets show that our method yields significant improvements over the baselines and achieves state-of-the-art performance.
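The following sketch shows one plausible way to realize the GDS constraint: momentum-updated Gaussian statistics (mean, standard deviation) of the positive- and negative-pair distance distributions, a margin-based term that pushes the two means apart, and a sharpness term that shrinks the standard deviations. The paper's exact losses and its distribution-based hard mining differ; the class and parameter names here are our own.

```python
import torch

class GDSLoss(torch.nn.Module):
    """Sketch of a global distance-distributions separation constraint:
    keep momentum-updated estimates of the mean/std of positive-pair and
    negative-pair distances, and penalize the two Gaussians for
    overlapping (not the paper's exact formulation)."""

    def __init__(self, momentum=0.99, margin=1.0):
        super().__init__()
        self.m = momentum
        self.margin = margin
        # running [mean, std] for the two global distance distributions
        self.register_buffer("pos", torch.tensor([0.0, 1.0]))
        self.register_buffer("neg", torch.tensor([2.0, 1.0]))

    def forward(self, pos_dist, neg_dist):
        # blend batch statistics with the running global statistics so the
        # loss reflects a global view while gradients flow through the batch
        mu_p = self.m * self.pos[0] + (1 - self.m) * pos_dist.mean()
        sd_p = self.m * self.pos[1] + (1 - self.m) * pos_dist.std()
        mu_n = self.m * self.neg[0] + (1 - self.m) * neg_dist.mean()
        sd_n = self.m * self.neg[1] + (1 - self.m) * neg_dist.std()
        with torch.no_grad():  # update the stored global statistics
            self.pos.copy_(torch.stack([mu_p, sd_p]))
            self.neg.copy_(torch.stack([mu_n, sd_n]))
        # separation: the negative-distance mean should exceed the positive
        # one by a margin; sharpness: keep both standard deviations small
        sep = torch.clamp(self.margin - (mu_n - mu_p), min=0.0)
        return sep + sd_p + sd_n

# toy usage: positive distances around 0.25, negative distances around 1.5
gds = GDSLoss()
print(gds(torch.rand(32) * 0.5, torch.rand(32) + 1.0))
```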
Existing fully supervised person re-identification (ReID) methods usually suffer from poor generalization capability caused by domain gaps. The key to solving this problem lies in filtering out identity-irrelevant interference and learning domain-invariant person representations. In this paper, we aim to design a generalizable person ReID framework that trains a model on source domains yet generalizes well to target domains. To achieve this goal, we propose a simple yet effective Style Normalization and Restitution (SNR) module. Specifically, we filter out style variations (e.g., illumination, color contrast) by Instance Normalization (IN). However, such a process inevitably removes discriminative information. We propose to distill identity-relevant features from the removed information and restitute them to the network to ensure high discrimination. For better disentanglement, we enforce a dual causal loss constraint in SNR to encourage the separation of identity-relevant and identity-irrelevant features. Extensive experiments demonstrate the strong generalization capability of our framework. Our models empowered by the SNR modules significantly outperform the state-of-the-art domain generalization approaches on multiple widely used person ReID benchmarks, and also show superiority on unsupervised domain adaptation.
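A minimal sketch of an SNR-style block is given below: Instance Normalization strips style, an SE-style channel gate distills the identity-relevant part of the removed residual, and that part is restituted to the normalized features. The dual causal loss is omitted, and the gate design is an assumption rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SNRBlock(nn.Module):
    """Sketch of Style Normalization and Restitution: Instance
    Normalization removes style, then a channel gate distills the
    identity-relevant part of the removed residual and adds it back."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=True)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        normed = self.inorm(x)                      # style-normalized features
        residual = x - normed                       # removed information
        relevant = residual * self.gate(residual)   # identity-relevant part
        return normed + relevant                    # restitution

x = torch.randn(2, 64, 32, 16)
print(SNRBlock(64)(x).shape)  # torch.Size([2, 64, 32, 16])
```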
There has been remarkable progress in recent years on object detection and re-identification, the two core components of multi-object tracking. However, little attention has been paid to accomplishing the two tasks in a single network to improve inference speed. Initial attempts along this path ended up with degraded results, mainly because the re-identification branch is not appropriately learned. In this work, we study the essential reasons behind the failure and accordingly present a simple baseline that addresses the problems. It remarkably outperforms the state-of-the-art methods on the public datasets at $30$ fps. We hope this baseline can inspire and help evaluate new ideas in this field. The code and the pre-trained models are available at \url{https://github.com/ifzhang/FairMOT}.
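For intuition, here is a hedged sketch of the one-network design: a shared backbone feature map feeds two sibling branches, one predicting a detection (object-center) heatmap and one predicting per-location re-identification embeddings. The channel sizes and head layout are illustrative, not FairMOT's exact configuration.

```python
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """Sketch of a single-network tracker head: one branch predicts a
    detection heatmap and another predicts per-pixel re-identification
    embeddings, both from one shared backbone feature map."""

    def __init__(self, in_ch=64, emb_dim=128, num_classes=1):
        super().__init__()
        def branch(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_ch, 1))
        self.heatmap = branch(num_classes)  # object-center heatmap
        self.reid = branch(emb_dim)         # identity embedding per location

    def forward(self, feat):
        return self.heatmap(feat).sigmoid(), self.reid(feat)

# stride-4 feature map of a 608x1088 frame (illustrative sizes)
feat = torch.randn(1, 64, 152, 272)
hm, emb = JointHead()(feat)
print(hm.shape, emb.shape)
```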
We present an approach to estimate the 3D poses of multiple people from multiple camera views. In contrast to previous efforts, which require establishing cross-view correspondences based on noisy and incomplete 2D pose estimations, we present an end-to-end solution that operates directly in 3D space and therefore avoids making incorrect decisions in the 2D space. To achieve this goal, the features from all camera views are warped and aggregated in a common 3D space and fed into a Cuboid Proposal Network (CPN) to coarsely localize all people. A Pose Regression Network (PRN) then estimates a detailed 3D pose for each proposal. The approach is robust to the occlusions that occur frequently in practice. Without bells and whistles, it outperforms the state-of-the-art methods on the public datasets. Code will be released at https://github.com/microsoft/multiperson-pose-estimation-pytorch.
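The sketch below illustrates the feature-aggregation step that precedes the proposal network: per-view 2D heatmaps are warped into a shared voxel grid by projecting each voxel center into every camera and bilinearly sampling, then averaging over views. The camera projection `project_fn` is a user-supplied placeholder, and the code is a simplification of the actual pipeline.

```python
import torch
import torch.nn.functional as F

def aggregate_to_voxels(heatmaps, project_fn, grid):
    """Warp per-view 2D heatmaps into a shared 3D voxel grid by projecting
    each voxel center into every camera, sampling the heatmap there, and
    averaging over views (`project_fn(grid, view)` must return normalized
    [-1, 1] image coordinates; it is an assumed helper).

    heatmaps: list of V tensors, each (1, J, H, W)
    grid:     (N, 3) voxel-center coordinates in world space
    returns:  (1, J, N) aggregated per-joint volume features
    """
    volumes = []
    for view, hm in enumerate(heatmaps):
        xy = project_fn(grid, view)   # (N, 2) normalized image coordinates
        xy = xy.view(1, 1, -1, 2)     # grid_sample layout
        volumes.append(F.grid_sample(hm, xy, align_corners=False))
    return torch.stack(volumes).mean(0).squeeze(2)  # average over views

# toy usage with 3 views, 15 joints, and a dummy projection
hms = [torch.rand(1, 15, 64, 64) for _ in range(3)]
grid = torch.rand(1000, 3)
dummy_project = lambda g, v: g[:, :2] * 2 - 1  # placeholder, not a real camera
print(aggregate_to_voxels(hms, dummy_project, grid).shape)  # (1, 15, 1000)
```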
Despite the success in still image recognition, deep neural networks for spatiotemporal signal tasks (such as human action recognition in videos) have suffered from low efficacy and inefficiency over the past years. Recently, human experts have put more effort into analyzing the importance of different components in 3D convolutional neural networks (3D CNNs) in order to design more powerful spatiotemporal learning backbones. Among these components, spatiotemporal fusion is one of the essentials: it controls how spatial and temporal signals are extracted at each layer during inference. Previous attempts usually start from ad-hoc designs that empirically combine certain convolutions and then draw conclusions from the performance obtained by training the corresponding networks. These methods only support network-level analysis of a limited number of fusion strategies. In this paper, we propose to convert the spatiotemporal fusion strategies into a probability space, which allows us to perform network-level evaluations of various fusion strategies without training them separately. In addition, the probability space yields fine-grained numerical information such as the layer-level preference for spatiotemporal fusion. Our approach greatly improves the efficiency of analyzing spatiotemporal fusion. Based on the probability space, we further generate new fusion strategies that achieve state-of-the-art performance on four well-known action recognition datasets.
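As a sketch of the probability-space idea, the block below attaches a learnable categorical distribution over candidate fusion operators (spatial-only, joint 3D, and factorized (2+1)D convolutions) to a layer and mixes them with Gumbel-softmax samples, so different fusion strategies can be evaluated from one shared network. The operator set and the Gumbel-softmax relaxation are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionChoice(nn.Module):
    """Sketch: represent a layer's spatiotemporal fusion as a learnable
    categorical distribution over candidate operators, mixed with
    Gumbel-softmax weights so strategies share one set of trained weights."""

    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv3d(ch, ch, (1, 3, 3), padding=(0, 1, 1)),   # spatial only
            nn.Conv3d(ch, ch, (3, 3, 3), padding=1),           # joint 3D
            nn.Sequential(                                     # (2+1)D
                nn.Conv3d(ch, ch, (1, 3, 3), padding=(0, 1, 1)),
                nn.Conv3d(ch, ch, (3, 1, 1), padding=(1, 0, 0))),
        ])
        self.logits = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x, tau=1.0):
        w = F.gumbel_softmax(self.logits, tau=tau)  # sampled fusion weights
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

x = torch.randn(2, 16, 8, 56, 56)  # (batch, channels, time, height, width)
print(FusionChoice(16)(x).shape)
```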
We propose to estimate 3D human pose from multi-view images and a few IMUs attached to the person's limbs. The approach operates by first detecting 2D poses from the two signals and then lifting them to 3D space. We present a geometric approach that reinforces the visual features of each pair of joints based on the IMUs; this notably improves 2D pose estimation accuracy, especially when a joint is occluded. We call this approach the Orientation Regularized Network (ORN). We then lift the multi-view 2D poses to 3D space with an Orientation Regularized Pictorial Structure Model (ORPSM), which jointly minimizes the projection error between the 3D and 2D poses and the discrepancy between the 3D pose and the IMU orientations. This simple two-step approach reduces the error of the state of the art by a large margin on a public dataset. Our code will be released at https://github.com/CHUNYUWANG/imu-human-pose-pytorch.
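The sketch below shows the kind of orientation term the lifting step can minimize: for each limb with an attached IMU, it penalizes the angle between the limb direction implied by the hypothesized 3D joints and the direction measured by the IMU. It illustrates only the IMU-discrepancy term of ORPSM; the projection-error term and the pictorial-structure inference are omitted, and all names are ours.

```python
import torch

def imu_orientation_penalty(joints3d, limbs, imu_dirs):
    """For each limb with an attached IMU, penalize the angle between the
    limb direction implied by the 3D joint hypothesis and the direction
    measured by the IMU, using 1 - cos(angle) as the penalty.

    joints3d: (J, 3) hypothesized 3D joint positions
    limbs:    list of (parent, child) joint-index pairs with IMUs
    imu_dirs: (L, 3) unit direction vectors measured by the IMUs
    """
    penalty = 0.0
    for (p, c), d in zip(limbs, imu_dirs):
        limb = joints3d[c] - joints3d[p]
        limb = limb / limb.norm().clamp(min=1e-8)   # unit limb direction
        penalty = penalty + (1.0 - torch.dot(limb, d))
    return penalty

# toy usage: 17 joints, IMUs on two limbs (indices are illustrative)
joints = torch.randn(17, 3)
dirs = torch.nn.functional.normalize(torch.randn(2, 3), dim=1)
print(imu_orientation_penalty(joints, [(11, 12), (14, 15)], dirs))
```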