Heterogeneous face recognition is a challenging task due to the large modality discrepancy and insufficient cross-modal samples. Most existing works focus on discriminative feature transformation, metric learning and cross-modal face synthesis. However, the fact that cross-modal face images couple domain (modality) information with identity information has received little attention. Therefore, how to learn and utilize the domain-private and domain-agnostic features for modality-adaptive face recognition is the focus of this work. Specifically, this paper proposes a Feature Aggregation Network (FAN), which includes a disentangled representation module (DRM), a feature fusion module (FFM) and an adaptive penalty metric (APM) learning scheme. First, in DRM, two subnetworks, i.e., a domain-private network and a domain-agnostic network, are designed to learn modality features and identity features, respectively. Second, in FFM, the identity features are fused with the domain features to achieve bi-directional cross-modal identity feature transformation, which further disentangles the modality information from the identity information to a large extent. Third, since cross-modal datasets exhibit an imbalanced distribution of easy and hard pairs, which increases the risk of model bias, our FAN adopts identity-preserving metric learning with adaptive penalization of hard pairs. The proposed APM also enforces cross-modality intra-class compactness and inter-class separation. Extensive experiments on benchmark cross-modal face datasets show that our FAN outperforms SOTA methods.
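The abstract does not give the exact form of the APM loss. As an illustration only, here is a minimal sketch of one plausible instantiation: a cross-modal margin loss whose penalty weight grows with pair hardness (the function name and the exponential weighting scheme are our assumptions, not FAN's published formulation):

```python
import torch
import torch.nn.functional as F

def adaptive_penalty_metric(anchor, positive, negative,
                            base_margin=0.3, gamma=2.0):
    """Sketch of a cross-modal metric loss that penalizes hard pairs more.

    anchor:   (B, D) identity features from modality A
    positive: (B, D) same-identity features from modality B
    negative: (B, D) different-identity features from modality B
    """
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)

    pos_sim = (anchor * positive).sum(dim=1)  # intra-class compactness
    neg_sim = (anchor * negative).sum(dim=1)  # inter-class separation

    # Hardness = how strongly the negative violates the margin; harder
    # pairs receive an exponentially larger (gradient-stopped) weight.
    violation = F.relu(neg_sim - pos_sim + base_margin)
    weight = (gamma * violation).exp().detach()
    return (weight * violation).mean()
```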
Trackers based on Siamese networks have shown tremendous success owing to their balance between accuracy and speed. Nevertheless, as tracking scenarios become more and more sophisticated, most existing Siamese-based approaches neglect the problem of distinguishing the tracking target from hard negative samples during tracking. The features learned by these networks lack discrimination, which significantly weakens the robustness of Siamese-based trackers and leads to suboptimal performance. To address this issue, we propose a simple yet effective hard negative sample emphasis method, which constrains the Siamese network to learn features that are aware of hard negative samples and enhances the discrimination of the embedding features. Through a distance constraint, we shorten the distance between the exemplar vector and positive vectors while enlarging the distance between the exemplar vector and hard negative vectors. Furthermore, we explore a novel anchor-free tracking framework in a per-pixel prediction fashion, which significantly reduces the number of hyper-parameters and simplifies the tracking process by taking full advantage of the representation power of convolutional neural networks. Extensive experiments on six standard benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches.
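The closed form of this distance constraint is not spelled out in the abstract; a hedged sketch of one triplet-style reading, with hypothetical tensor shapes and function name, is:

```python
import torch
import torch.nn.functional as F

def hard_negative_emphasis_loss(exemplar, positives, hard_negatives, margin=0.5):
    """Pull the exemplar vector toward positive vectors while pushing it
    away from hard negative vectors, hinged on the worst-case pair.

    exemplar:       (D,)   template embedding
    positives:      (P, D) embeddings at target locations
    hard_negatives: (N, D) embeddings at distractor locations
    """
    d_pos = (positives - exemplar).pow(2).sum(dim=1)       # (P,)
    d_neg = (hard_negatives - exemplar).pow(2).sum(dim=1)  # (N,)
    # Farthest positive must still be closer than the nearest hard negative.
    return F.relu(d_pos.max() - d_neg.min() + margin)
```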
In CNN-based object detectors, feature pyramids are widely exploited to alleviate the problem of scale variation across object instances. These detectors, which strengthen features via a top-down pathway and lateral connections, mainly enrich the semantic information of low-level features but ignore the enhancement of high-level features. This leads to an imbalance between different levels of features, in particular a serious lack of detailed information in the high-level features, which makes it difficult to obtain accurate bounding boxes. In this paper, we introduce a novel two-pronged transductive idea to explore the relationship among different layers in both backward and forward directions, which can enrich the semantic information of low-level features and the detailed information of high-level features at the same time. Under the guidance of this two-pronged idea, we propose a Two-Pronged Network (TPNet) to achieve bidirectional transfer between high-level and low-level features, which is useful for accurately detecting objects at different scales. Furthermore, due to the distribution imbalance between hard and easy samples in single-stage detectors, the gradient of the localization loss is dominated by hard examples with poor localization accuracy, which biases the model toward hard samples. Therefore, in our TPNet, an adaptive IoU-based localization loss, named Rectified IoU (RIoU) loss, is proposed to rectify the gradients of each kind of sample. The Rectified IoU loss increases the gradients of examples with high IoU while suppressing the gradients of examples with low IoU, which improves the overall localization accuracy of the model. Extensive experiments demonstrate the superiority of our TPNet and RIoU loss.
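The abstract states only the desired gradient behavior, not the RIoU parameterization. A toy stand-in that reproduces that behavior, assuming per-sample IoU values are already computed, is shown below (this is our illustration, not the paper's exact loss):

```python
import torch

def rectified_iou_loss_sketch(iou, alpha=2.0):
    """Re-weighted IoU loss: because the weight is detached, the gradient
    magnitude w.r.t. each sample's IoU is proportional to iou**alpha, so
    high-IoU examples dominate and low-IoU outliers are suppressed.

    iou: (B,) IoU between predicted and ground-truth boxes, in [0, 1]
    """
    weight = iou.detach().pow(alpha)
    return (weight * (1.0 - iou)).mean()
```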
Previous Person Re-Identification (Re-ID) models aim to focus on the most discriminative region of an image, but their performance may be compromised when that region is missing due to camera viewpoint changes or occlusion. To solve this issue, we propose a novel model named Hierarchical Bi-directional Feature Perception Network (HBFP-Net), in which multi-level features are correlated and mutually reinforced. First, the correlation maps of cross-level feature pairs are modeled via low-rank bilinear pooling. Then, based on the correlation maps, a Bi-directional Feature Perception (BFP) module is employed to enrich the attention regions of the high-level features and to learn abstract and specific information in the low-level features. We then propose a novel end-to-end hierarchical network that integrates multi-level augmented features and feeds the augmented low- and middle-level features into the following layers to retrain a new, more powerful network. Moreover, we propose a novel trainable generalized pooling, which can dynamically select the values at any locations of the feature maps to be activated. Extensive experiments on the mainstream evaluation datasets, including Market-1501, CUHK03 and DukeMTMC-ReID, show that our method outperforms recent SOTA Re-ID models.
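The generalized pooling layer is not specified in the abstract; one common way to realize "dynamically selecting any location" is pooling with a learnable temperature that interpolates between average and max pooling. The sketch below follows that assumption (class name and initialization are hypothetical, not the paper's exact layer):

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Trainable pooling over spatial locations: a softmax with learnable
    temperature p weights every location; p -> 0 recovers average pooling
    and large p approaches max pooling."""
    def __init__(self, init_p=3.0):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(init_p))

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x.view(b, c, h * w)
        attn = torch.softmax(self.p * x, dim=2)  # weight per location
        return (attn * x).sum(dim=2)             # (B, C)
```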
Recent reference-based face restoration methods have received considerable attention due to their great capability in recovering high-frequency details on real low-quality images. However, most of these methods require a high-quality reference image of the same identity, making them applicable only in limited scenes. To address this issue, this paper suggests a deep face dictionary network (termed DFDNet) to guide the restoration process of degraded observations. To begin with, we use K-means to generate deep dictionaries for perceptually significant face components (\ie, left/right eyes, nose and mouth) from high-quality images. Next, with the degraded input, we match and select the most similar component features from their corresponding dictionaries and transfer the high-quality details to the input via the proposed dictionary feature transfer (DFT) block. In particular, component AdaIN is leveraged to eliminate the style diversity between the input and the dictionary features (\eg, illumination), and a confidence score is proposed to adaptively fuse the dictionary features into the input. Finally, multi-scale dictionaries are adopted in a progressive manner to enable coarse-to-fine restoration. Experiments show that our proposed method achieves plausible performance in both quantitative and qualitative evaluation and, more importantly, generates realistic and promising results on real degraded images without requiring a reference of the same identity. The source code and models are available at \url{https://github.com/csxmli2016/DFDNet}.
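Component AdaIN and the confidence-guided fusion can be sketched as follows, using the standard AdaIN formulation (per-channel mean/std matching); the fusion form and function names are our assumptions about how a score in [0, 1] would blend the features:

```python
import torch

def adain(content, style, eps=1e-5):
    """Re-normalize the dictionary (style) feature to the per-channel
    statistics of the degraded input (content) feature.
    content, style: (B, C, H, W)"""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return c_std * (style - s_mean) / s_std + c_mean

def fuse_with_confidence(input_feat, dict_feat, confidence):
    """Blend the style-normalized dictionary feature into the input
    feature, weighted by a predicted confidence score in [0, 1]."""
    dict_feat = adain(input_feat, dict_feat)
    return input_feat + confidence * (dict_feat - input_feat)
```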
Nowadays, as data becomes increasingly complex and distributed, data analyses often involve several related datasets that are stored on different servers and possibly owned by different stakeholders. While there is an emerging need to provide these stakeholders with a full picture of their data under a global context, conventional visual analytical methods, such as dimensionality reduction, could compromise data privacy when multi-party datasets are fused into a single site to build point-level relationships. In this paper, we reformulate the conventional t-SNE method from the single-site mode into a secure distributed infrastructure. We present a secure multi-party scheme for joint t-SNE computation, which minimizes the risk of data leakage. Aggregated visualization can optionally be employed to avoid disclosure of point-level relationships. We build a prototype system based on our method, SMAP, to support the organization, computation, and exploration of secure joint embeddings. We demonstrate the effectiveness of our approach with three case studies, one of which is based on the deployment of our system in real-world applications.
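As a minimal illustration of aggregated visualization (not the paper's exact scheme), point-level positions in the joint embedding can be replaced by per-cell counts before display:

```python
import numpy as np

def aggregate_embedding(points, bins=32):
    """Replace point-level 2-D embedding coordinates with a density grid,
    so no individual point-to-point relationship is disclosed.

    points: (N, 2) joint t-SNE coordinates
    returns: (bins, bins) count grid and the bin edges
    """
    hist, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    return hist, xedges, yedges
```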
Most salient object detection approaches use U-Net or feature pyramid networks (FPN) as their basic structure. These methods ignore two key problems when the encoder exchanges information with the decoder: one is the lack of interference control between them; the other is that the disparate contributions of different encoder blocks are not considered. In this work, we propose a simple gated network (GateNet) to solve both issues at once. With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder. We design a novel gated dual-branch structure to build cooperation among different levels of features and improve the discriminability of the whole network. Through the dual-branch design, more details of the saliency map can be restored. In addition, we adopt atrous spatial pyramid pooling based on the proposed "Fold" operation (Fold-ASPP) to accurately localize salient objects at various scales. Extensive experiments on five challenging datasets demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics.
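The abstract does not define the gate units precisely; a minimal sketch of one plausible form, a sigmoid gate predicted from the concatenated encoder and decoder features (assuming matching spatial sizes; all names are hypothetical), is:

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """Gate on an encoder skip connection: a 1x1 convolution over the
    concatenated encoder/decoder features predicts a spatial gate in
    (0, 1) that suppresses interfering encoder context."""
    def __init__(self, enc_ch, dec_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(enc_ch + dec_ch, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat, dec_feat):
        # enc_feat: (B, enc_ch, H, W); dec_feat: (B, dec_ch, H, W)
        g = self.gate(torch.cat([enc_feat, dec_feat], dim=1))
        return g * enc_feat  # gated encoder feature passed to the decoder
```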
In this paper, we propose an effective knowledge transfer framework to boost weakly supervised object detection accuracy with the help of an external fully-annotated source dataset whose categories may not overlap with the target domain. This setting is of great practical value due to the existence of many off-the-shelf detection datasets. To utilize the source dataset more effectively, we propose to iteratively transfer the knowledge from the source domain via a one-class universal detector and to learn the target-domain detector. The box-level pseudo ground truths mined by the target-domain detector in each iteration effectively improve the one-class universal detector, so the knowledge in the source dataset is exploited more thoroughly. Extensive experiments are conducted with Pascal VOC 2007 as the target weakly-annotated dataset and COCO/ImageNet as the source fully-annotated dataset. With the proposed solution, we achieve an mAP of $59.7\%$ on the VOC test set and an mAP of $60.2\%$ after retraining a fully supervised Faster RCNN with the mined pseudo ground truths. This is significantly better than any previously reported result and sets a new state of the art for weakly supervised object detection under the knowledge transfer setting. Code: \url{https://github.com/mikuhatsune/wsod_transfer}.
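The iterative transfer described above can be summarized in pseudocode; every function name below is a hypothetical placeholder for the corresponding training step, not the released API:

```python
def iterative_transfer(source_data, target_images, image_labels, iters=3):
    """High-level sketch of the iterative source-to-target transfer loop."""
    pseudo_gt = []  # box-level pseudo ground truths, empty in round one
    detector = None
    for _ in range(iters):
        # 1. Train the one-class universal detector on fully annotated
        #    source boxes plus pseudo boxes mined on the target domain.
        universal = train_one_class_detector(source_data, pseudo_gt)
        # 2. Train the weakly supervised target-domain detector from
        #    image-level labels, using the universal detector's proposals.
        proposals = universal.detect(target_images)
        detector = train_wsod(target_images, image_labels, proposals)
        # 3. Mine confident box-level pseudo ground truths for next round.
        pseudo_gt = mine_pseudo_boxes(detector, target_images)
    return detector, pseudo_gt
```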