Abstract:Deep learning-based person Re-IDentification (ReID) often requires a large amount of training data to achieve good performance. Thus it appears that collecting more training data from diverse environments tends to improve the ReID performance. This paper re-examines this common belief and makes a somehow surprising observation: using more samples, i.e., training with samples from multiple datasets, does not necessarily lead to better performance by using the popular ReID models. In some cases, training with more samples may even hurt the performance of the evaluation is carried out in one of those datasets. We postulate that this phenomenon is due to the incapability of the standard network in adapting to diverse environments. To overcome this issue, we propose an approach called Domain-Camera-Sample Dynamic network (DCSD) whose parameters can be adaptive to various factors. Specifically, we consider the internal domain-related factor that can be identified from the input features, and external domain-related factors, such as domain information or camera information. Our discovery is that training with such an adaptive model can better benefit from more training samples. Experimental results show that our DCSD can greatly boost the performance (up to 12.3%) while joint training in multiple datasets.
Abstract:3D human pose and shape recovery from a monocular RGB image is a challenging task. Existing learning based methods highly depend on weak supervision signals, e.g. 2D and 3D joint location, due to the lack of in-the-wild paired 3D supervision. However, considering the 2D-to-3D ambiguities existed in these weak supervision labels, the network is easy to get stuck in local optima when trained with such labels. In this paper, we reduce the ambituity by optimizing multiple initializations. Specifically, we propose a three-stage framework named Multi-Initialization Optimization Network (MION). In the first stage, we strategically select different coarse 3D reconstruction candidates which are compatible with the 2D keypoints of input sample. Each coarse reconstruction can be regarded as an initialization leads to one optimization branch. In the second stage, we design a mesh refinement transformer (MRT) to respectively refine each coarse reconstruction result via a self-attention mechanism. Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from mutiple candidates by evaluating if the visual evidence in RGB image matches a given 3D reconstruction. Experiments demonstrate that our Multi-Initialization Optimization Network outperforms existing 3D mesh based methods on multiple public benchmarks.
Abstract:This is a very short technical report, which introduces the solution of the Team BUPT-CASIA for Short-video Face Parsing Track of The 3rd Person in Context (PIC) Workshop and Challenge at CVPR 2021. Face parsing has recently attracted increasing interest due to its numerous application potentials. Generally speaking, it has a lot in common with human parsing, such as task setting, data characteristics, number of categories and so on. Therefore, this work applies state-of-the-art human parsing method to face parsing task to explore the similarities and differences between them. Our submission achieves 86.84% score and wins the 2nd place in the challenge.
Abstract:The training loss function that enforces certain training sample distribution patterns plays a critical role in building a re-identification (ReID) system. Besides the basic requirement of discrimination, i.e., the features corresponding to different identities should not be mixed, additional intra-class distribution constraints, such as features from the same identities should be close to their centers, have been adopted to construct losses. Despite the advances of various new loss functions, it is still challenging to strike the balance between the need of reducing the intra-class variation and allowing certain distribution freedom. In this paper, we propose a new loss based on center predictivity, that is, a sample must be positioned in a location of the feature space such that from it we can roughly predict the location of the center of same-class samples. The prediction error is then regarded as a loss called Center Prediction Loss (CPL). We show that, without introducing additional hyper-parameters, this new loss leads to a more flexible intra-class distribution constraint while ensuring the between-class samples are well-separated. Extensive experiments on various real-world ReID datasets show that the proposed loss can achieve superior performance and can also be complementary to existing losses.
Abstract:Re-identification (ReID) is to identify the same instance across different cameras. Existing ReID methods mostly utilize alignment-based or attention-based strategies to generate effective feature representations. However, most of these methods only extract general feature by employing single input image itself, overlooking the exploration of relevance between comparing images. To fill this gap, we propose a novel end-to-end trainable dynamic convolution framework named Instance and Pair-Aware Dynamic Networks in this paper. The proposed model is composed of three main branches where a self-guided dynamic branch is constructed to strengthen instance-specific features, focusing on every single image. Furthermore, we also design a mutual-guided dynamic branch to generate pair-aware features for each pair of images to be compared. Extensive experiments are conducted in order to verify the effectiveness of our proposed algorithm. We evaluate our algorithm in several mainstream person and vehicle ReID datasets including CUHK03, DukeMTMCreID, Market-1501, VeRi776 and VehicleID. In some datasets our algorithm outperforms state-of-the-art methods and in others, our algorithm achieves a comparable performance.
Abstract:How to estimate the quality of the network output is an important issue, and currently there is no effective solution in the field of human parsing. In order to solve this problem, this work proposes a statistical method based on the output probability map to calculate the pixel quality information, which is called pixel score. In addition, the Quality-Aware Module (QAM) is proposed to fuse the different quality information, the purpose of which is to estimate the quality of human parsing results. We combine QAM with a concise and effective network design to propose Quality-Aware Network (QANet) for human parsing. Benefiting from the superiority of QAM and QANet, we achieve the best performance on three multiple and one single human parsing benchmarks, including CIHP, MHP-v2, Pascal-Person-Part and LIP. Without increasing the training and inference time, QAM improves the AP$^\text{r}$ criterion by more than 10 points in the multiple human parsing task. QAM can be extended to other tasks with good quality estimation, e.g. instance segmentation. Specifically, QAM improves Mask R-CNN by ~1% mAP on COCO and LVISv1.0 datasets. Based on the proposed QAM and QANet, our overall system wins 1st place in CVPR2019 COCO DensePose Challenge, and 1st place in Track 1 & 2 of CVPR2020 LIP Challenge. Code and models are available at https://github.com/soeaver/QANet.
Abstract:Learning cross-view consistent feature representation is the key for accurate vehicle Re-identification (ReID), since the visual appearance of vehicles changes significantly under different viewpoints. To this end, most existing approaches resort to the supervised cross-view learning using extensive extra viewpoints annotations, which however, is difficult to deploy in real applications due to the expensive labelling cost and the continous viewpoint variation that makes it hard to define discrete viewpoint labels. In this study, we present a pluggable Weakly-supervised Cross-View Learning (WCVL) module for vehicle ReID. Through hallucinating the cross-view samples as the hardest positive counterparts in feature domain, we can learn the consistent feature representation via minimizing the cross-view feature distance based on vehicle IDs only without using any viewpoint annotation. More importantly, the proposed method can be seamlessly plugged into most existing vehicle ReID baselines for cross-view learning without re-training the baselines. To demonstrate its efficacy, we plug the proposed method into a bunch of off-the-shelf baselines and obtain significant performance improvement on four public benchmark datasets, i.e., VeRi-776, VehicleID, VRIC and VRAI.
Abstract:Multiple human parsing aims to segment various human parts and associate each part with the corresponding instance simultaneously. This is a very challenging task due to the diverse human appearance, semantic ambiguity of different body parts, and complex background. Through analysis of multiple human parsing task, we observe that human-centric global perception and accurate instance-level parsing scoring are crucial for obtaining high-quality results. But the most state-of-the-art methods have not paid enough attention to these issues. To reverse this phenomenon, we present Renovating Parsing R-CNN (RP R-CNN), which introduces a global semantic enhanced feature pyramid network and a parsing re-scoring network into the existing high-performance pipeline. The proposed RP R-CNN adopts global semantic representation to enhance multi-scale features for generating human parsing maps, and regresses a confidence score to represent its quality. Extensive experiments show that RP R-CNN performs favorably against state-of-the-art methods on CIHP and MHP-v2 datasets. Code and models are available at https://github.com/soeaver/RP-R-CNN.
Abstract:In object detection, offset-guided and point-guided regression dominate anchor-based and anchor-free method separately. Recently, point-guided approach is introduced to anchor-based method. However, we observe points predicted by this way are misaligned with matched region of proposals and score of localization, causing a notable gap in performance. In this paper, we propose CPM R-CNN which contains three efficient modules to optimize anchor-based point-guided method. According to sufficient evaluations on the COCO dataset, CPM R-CNN is demonstrated efficient to improve the localization accuracy by calibrating mentioned misalignment. Compared with Faster R-CNN and Grid R-CNN based on ResNet-101 with FPN, our approach can substantially improve detection mAP by 3.3% and 1.5% respectively without whistles and bells. Moreover, our best model achieves improvement by a large margin to 49.9% on COCO test-dev. Code and models will be publicly available.
Abstract:Scene text detection has received attention for years and achieved an impressive performance across various benchmarks. In this work, we propose an efficient and accurate approach to detect multioriented text in scene images. The proposed feature fusion mechanism allows us to use a shallower network to reduce the computational complexity. A self-attention mechanism is adopted to suppress false positive detections. Experiments on public benchmarks including ICDAR 2013, ICDAR 2015 and MSRA-TD500 show that our proposed approach can achieve better or comparable performances with fewer parameters and less computational cost.