In recent years, vision transformers have been introduced into face recognition and analysis and have achieved performance breakthroughs. However, most previous methods generally train a single model or an ensemble of models to perform the desired task, which ignores the synergy among different tasks and fails to achieve improved prediction accuracy, increased data efficiency, and reduced training time. This paper presents a multi-purpose algorithm for simultaneous face recognition, facial expression recognition, age estimation, and face attribute estimation (40 attributes including gender) based on a single Swin Transformer. Our design, the SwinFace, consists of a single shared backbone together with a subnet for each set of related tasks. To address the conflicts among multiple tasks and meet the different demands of tasks, a Multi-Level Channel Attention (MLCA) module is integrated into each task-specific analysis subnet, which can adaptively select the features from optimal levels and channels to perform the desired tasks. Extensive experiments show that the proposed model has a better understanding of the face and achieves excellent performance for all tasks. Especially, it achieves 90.97% accuracy on RAF-DB and 0.22 $\epsilon$-error on CLAP2015, which are state-of-the-art results on facial expression recognition and age estimation respectively. The code and models will be made publicly available at https://github.com/lxq1000/SwinFace.
Deep neural networks are vulnerable to universal adversarial perturbation (UAP), an instance-agnostic perturbation capable of fooling the target model for most samples. Compared to instance-specific adversarial examples, UAP is more challenging as it needs to generalize across various samples and models. In this paper, we examine the serious dilemma of UAP generation methods from a generalization perspective -- the gradient vanishing problem using small-batch stochastic gradient optimization and the local optima problem using large-batch optimization. To address these problems, we propose a simple and effective method called Stochastic Gradient Aggregation (SGA), which alleviates the gradient vanishing and escapes from poor local optima at the same time. Specifically, SGA employs the small-batch training to perform multiple iterations of inner pre-search. Then, all the inner gradients are aggregated as a one-step gradient estimation to enhance the gradient stability and reduce quantization errors. Extensive experiments on the standard ImageNet dataset demonstrate that our method significantly enhances the generalization ability of UAP and outperforms other state-of-the-art methods. The code is available at https://github.com/liuxuannan/Stochastic-Gradient-Aggregation.
In many real-world applications, face recognition models often degenerate when training data (referred to as source domain) are different from testing data (referred to as target domain). To alleviate this mismatch caused by some factors like pose and skin tone, the utilization of pseudo-labels generated by clustering algorithms is an effective way in unsupervised domain adaptation. However, they always miss some hard positive samples. Supervision on pseudo-labeled samples attracts them towards their prototypes and would cause an intra-domain gap between pseudo-labeled samples and the remaining unlabeled samples within target domain, which results in the lack of discrimination in face recognition. In this paper, considering the particularity of face recognition, we propose a novel adversarial information network (AIN) to address it. First, a novel adversarial mutual information (MI) loss is proposed to alternately minimize MI with respect to the target classifier and maximize MI with respect to the feature extractor. By this min-max manner, the positions of target prototypes are adaptively modified which makes unlabeled images clustered more easily such that intra-domain gap can be mitigated. Second, to assist adversarial MI loss, we utilize a graph convolution network to predict linkage likelihoods between target data and generate pseudo-labels. It leverages valuable information in the context of nodes and can achieve more reliable results. The proposed method is evaluated under two scenarios, i.e., domain adaptation across poses and image conditions, and domain adaptation across faces with different skin tones. Extensive experiments show that AIN successfully improves cross-domain generalization and offers a new state-of-the-art on RFW dataset.
Structured reconstruction is a non-trivial dense prediction problem, which extracts structural information (\eg, building corners and edges) from a raster image, then reconstructs it to a 2D planar graph accordingly. Compared with common segmentation or detection problems, it significantly relays on the capability that leveraging holistic geometric information for structural reasoning. Current transformer-based approaches tackle this challenging problem in a two-stage manner, which detect corners in the first model and classify the proposed edges (corner-pairs) in the second model. However, they separate two-stage into different models and only share the backbone encoder. Unlike the existing modeling strategies, we present an enhanced corner representation method: 1) It fuses knowledge between the corner detection and edge prediction by sharing feature in different granularity; 2) Corner candidates are proposed in four heatmap channels w.r.t its direction. Both qualitative and quantitative evaluations demonstrate that our proposed method can better reconstruct fine-grained structures, such as adjacent corners and tiny edges. Consequently, it outperforms the state-of-the-art model by +1.9\%@F-1 on Corner and +3.0\%@F-1 on Edge.
Although face recognition has made impressive progress in recent years, we ignore the racial bias of the recognition system when we pursue a high level of accuracy. Previous work found that for different races, face recognition networks focus on different facial regions, and the sensitive regions of darker-skinned people are much smaller. Based on this discovery, we propose a new de-bias method based on gradient attention, called Gradient Attention Balance Network (GABN). Specifically, we use the gradient attention map (GAM) of the face recognition network to track the sensitive facial regions and make the GAMs of different races tend to be consistent through adversarial learning. This method mitigates the bias by making the network focus on similar facial regions. In addition, we also use masks to erase the Top-N sensitive facial regions, forcing the network to allocate its attention to a larger facial region. This method expands the sensitive region of darker-skinned people and further reduces the gap between GAM of darker-skinned people and GAM of Caucasians. Extensive experiments show that GABN successfully mitigates racial bias in face recognition and learns more balanced performance for people of different races.
In this paper, we delve into semi-supervised 2D human pose estimation. The previous method ignored two problems: (i) When conducting interactive training between large model and lightweight model, the pseudo label of lightweight model will be used to guide large models. (ii) The negative impact of noise pseudo labels on training. Moreover, the labels used for 2D human pose estimation are relatively complex: keypoint category and keypoint position. To solve the problems mentioned above, we propose a semi-supervised 2D human pose estimation framework driven by a position inconsistency pseudo label correction module (SSPCM). We introduce an additional auxiliary teacher and use the pseudo labels generated by the two teacher model in different periods to calculate the inconsistency score and remove outliers. Then, the two teacher models are updated through interactive training, and the student model is updated using the pseudo labels generated by two teachers. To further improve the performance of the student model, we use the semi-supervised Cut-Occlude based on pseudo keypoint perception to generate more hard and effective samples. In addition, we also proposed a new indoor overhead fisheye human keypoint dataset WEPDTOF-Pose. Extensive experiments demonstrate that our method outperforms the previous best semi-supervised 2D human pose estimation method. We will release the code and dataset at https://github.com/hlz0606/SSPCM.
Out-of-distribution (OOD) detection methods assume that they have test ground truths, i.e., whether individual test samples are in-distribution (IND) or OOD. However, in the real world, we do not always have such ground truths, and thus do not know which sample is correctly detected and cannot compute the metric like AUROC to evaluate the performance of different OOD detection methods. In this paper, we are the first to introduce the unsupervised evaluation problem in OOD detection, which aims to evaluate OOD detection methods in real-world changing environments without OOD labels. We propose three methods to compute Gscore as an unsupervised indicator of OOD detection performance. We further introduce a new benchmark Gbench, which has 200 real-world OOD datasets of various label spaces to train and evaluate our method. Through experiments, we find a strong quantitative correlation betwwen Gscore and the OOD detection performance. Extensive experiments demonstrate that our Gscore achieves state-of-the-art performance. Gscore also generalizes well with different IND/OOD datasets, OOD detection methods, backbones and dataset sizes. We further provide interesting analyses of the effects of backbones and IND/OOD datasets on OOD detection performance. The data and code will be available.
Although deep learning has significantly improved Face Recognition (FR), dramatic performance deterioration may occur when processing Low Resolution (LR) faces. To alleviate this, approaches based on unified feature space are proposed with the sacrifice under High Resolution (HR) circumstances. To deal with the huge domain gap between HR and LR domains and achieve the best on both domains, we first took a closer look at the impacts of several resolution augmentations and then analyzed the difficulty of LR samples from the perspective of the model gradient produced by different resolution samples. Besides, we also find that the introduction of some resolutions could help the learning of lower resolutions. Based on these, we divide the LR samples into three difficulties according to the resolution and propose a more effective Multi-Resolution Augmentation. Then, due to the rapidly increasing domain gap as the resolution decreases, we carefully design a novel and effective metric loss based on a LogExp distance function that provides decent gradients to prevent oscillation near the convergence point or tolerance to small distance errors; it could also dynamically adjust the penalty for errors in different dimensions, allowing for more optimization of dimensions with large errors. Combining these two insights, our model could learn more general knowledge in a wide resolution range of images and balanced results can be achieved by our extremely simple framework. Moreover, the augmentations and metrics are the cornerstones of LRFR, so our method could be considered a new baseline for the LRFR task. Experiments on the LRFR datasets: SCface, XQLFW, and large-scale LRFR dataset: TinyFace demonstrate the effectiveness of our methods, while the degradation on HRFR datasets is significantly reduced.
Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning technology. However, the existing methods usually require a large computational cost. Meanwhile, the activation function will cause some features of the intermediate layer to be lost. Therefore, it is a challenge to make the model lightweight while reducing the impact of intermediate feature loss on the reconstruction quality. In this paper, we propose a Feature Interaction Weighted Hybrid Network (FIWHN) to alleviate the above problem. Specifically, FIWHN consists of a series of novel Wide-residual Distillation Interaction Blocks (WDIB) as the backbone, where every third WDIBs form a Feature shuffle Weighted Group (FSWG) by mutual information mixing and fusion. In addition, to mitigate the adverse effects of intermediate feature loss on the reconstruction results, we introduced a well-designed Wide Convolutional Residual Weighting (WCRW) and Wide Identical Residual Weighting (WIRW) units in WDIB, and effectively cross-fused features of different finenesses through a Wide-residual Distillation Connection (WRDC) framework and a Self-Calibrating Fusion (SCF) unit. Finally, to complement the global features lacking in the CNN model, we introduced the Transformer into our model and explored a new way of combining the CNN and Transformer. Extensive quantitative and qualitative experiments on low-level and high-level tasks show that our proposed FIWHN can achieve a good balance between performance and efficiency, and is more conducive to downstream tasks to solve problems in low-pixel scenarios.