Guodong Guo

Fully Transformer Networks for Semantic Image Segmentation

Jun 08, 2021

Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo

Abstract: Transformers have shown impressive performance in various natural language processing and computer vision tasks, due to their capability of modeling long-range dependencies. Recent work has shown that combining such transformers with CNN-based semantic image segmentation models is very promising. However, how well a pure transformer-based approach can perform on image segmentation has not yet been well studied. In this work, we explore a novel framework for semantic image segmentation, the encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, while reducing the computational complexity of the standard visual transformer (ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline can achieve new state-of-the-art results on multiple challenging semantic segmentation benchmarks, including PASCAL Context, ADE20K and COCO-Stuff. The source code will be released upon the publication of this work.
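
The abstract does not spell out how PGT lowers the cost of attention. Assuming, as the name suggests, that tokens are split into equal groups and attention is computed only within each group, the saving can be illustrated with a simple count of query-key pairs. The function name and the token and group sizes below are hypothetical and chosen only for illustration; this is a sketch of the general idea, not the authors' implementation.

```python
def attention_pairs(num_tokens: int, num_groups: int = 1) -> int:
    """Count query-key pairs when tokens are split into equal groups
    and attention is computed only within each group.

    num_groups=1 corresponds to global attention over all tokens.
    """
    if num_groups < 1 or num_tokens % num_groups != 0:
        raise ValueError("num_tokens must be divisible by a positive num_groups")
    tokens_per_group = num_tokens // num_groups
    # Each group contributes tokens_per_group ** 2 pairs.
    return num_groups * tokens_per_group ** 2


# Hypothetical sizes: a 64 x 64 grid of tokens, split into 16 groups.
n = 64 * 64
global_pairs = attention_pairs(n)
grouped_pairs = attention_pairs(n, num_groups=16)
print(global_pairs, grouped_pairs, global_pairs // grouped_pairs)
```

Under this assumption the number of pairs drops by a factor equal to the number of groups, which is why grouping is a common way to make attention affordable on high-resolution feature maps.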

Image-to-Video Generation via 3D Facial Dynamics

May 31, 2021

Xiaoguang Tu, Yingtian Zou, Jian Zhao, Wenjie Ai, Jian Dong, Yuan Yao, Zhikang Wang, Guodong Guo, Zhifeng Li, Wei Liu(+1 more)

Abstract: We present a versatile model, FaceAnime, for various video generation tasks from still images. Video generation from a single face image is an interesting problem and is usually tackled by utilizing Generative Adversarial Networks (GANs) to integrate information from the input face image and a sequence of sparse facial landmarks. However, the generated face images usually suffer from quality loss, image distortion, identity change, and expression mismatching due to the weak representation capacity of the facial landmarks. In this paper, we propose to "imagine" a face video from a single face image according to the reconstructed 3D face dynamics, aiming to generate a realistic and identity-preserving face video, with precisely predicted pose and facial expression. The 3D dynamics reveal changes of the facial expression and motion, and can serve as strong prior knowledge for guiding highly realistic face video generation. In particular, we explore face video prediction and exploit a well-designed 3D dynamic prediction network to predict a 3D dynamic sequence for a single face image. The 3D dynamics are then further rendered by the sparse texture mapping algorithm to recover structural details and sparse textures for generating face frames. Our model is versatile for various AR/VR and entertainment applications, such as face video retargeting and face video prediction. Superior experimental results demonstrate its effectiveness in generating high-fidelity, identity-preserving, and visually pleasant face video clips from a single source face image.

Joint Face Image Restoration and Frontalization for Recognition

May 12, 2021

Xiaoguang Tu, Jian Zhao, Qiankun Liu, Wenjie Ai, Guodong Guo, Zhifeng Li, Wei Liu, Jiashi Feng

Abstract: In real-world scenarios, many factors may harm face recognition performance, e.g., large pose, bad illumination, low resolution, blur and noise. To address these challenges, previous efforts usually first restore the low-quality faces to high-quality ones and then perform face recognition. However, most of these methods are stage-wise, which is sub-optimal and deviates from reality. In this paper, we address all these challenges jointly for unconstrained face recognition. We propose a Multi-Degradation Face Restoration (MDFR) model to restore frontalized high-quality faces from the given low-quality ones under arbitrary facial poses, with three distinct novelties. First, MDFR is a well-designed encoder-decoder architecture which extracts feature representation from an input face image with arbitrary low-quality factors and restores it to a high-quality counterpart. Second, MDFR introduces a pose residual learning strategy along with a 3D-based Pose Normalization Module (PNM), which can perceive the pose gap between the input initial pose and its real-frontal pose to guide the face frontalization. Finally, MDFR can generate frontalized high-quality face images by a single unified network, showing a strong capability of preserving face identity. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks demonstrate the superiority of MDFR over state-of-the-art methods on both face frontalization and face restoration.

* 14 pages, 9 figures

Contrastive Context-Aware Learning for 3D High-Fidelity Mask Face Presentation Attack Detection

Apr 13, 2021

Ajian Liu, Chenxu Zhao, Zitong Yu, Jun Wan, Anyang Su, Xing Liu, Zichang Tan, Sergio Escalera, Junliang Xing, Yanyan Liang(+4 more)

Abstract: Face presentation attack detection (PAD) is essential to secure face recognition systems primarily from high-fidelity mask attacks. Most existing 3D mask PAD benchmarks suffer from several drawbacks: 1) a limited number of mask identities, types of sensors, and total number of videos; 2) low-fidelity quality of facial masks. Basic deep models and remote photoplethysmography (rPPG) methods have achieved acceptable performance on these benchmarks but are still far from the needs of practical scenarios. To bridge the gap to real-world applications, we introduce a large-scale High-Fidelity Mask dataset, namely CASIA-SURF HiFiMask (briefly HiFiMask). Specifically, a total of 54,600 videos are recorded from 75 subjects with 225 realistic masks by 7 new kinds of sensors. Together with the dataset, we propose a novel Contrastive Context-aware Learning framework, namely CCL. CCL is a new training methodology for supervised PAD tasks, which is able to learn by leveraging rich contexts accurately (e.g., subjects, mask material and lighting) among pairs of live faces and high-fidelity mask attacks. Extensive experimental evaluations on HiFiMask and three additional 3D mask datasets demonstrate the effectiveness of our method.

Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection

Mar 30, 2021

Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, Li Zhang

Abstract: The objective of this paper is to learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection. We make the following contributions: (i) rather than appealing to the complicated pseudo-LiDAR based approach, we propose a depth-conditioned dynamic message propagation (DDMP) network to effectively integrate the multi-scale depth information with the image context; (ii) this is achieved by first adaptively sampling context-aware nodes in the image context and then dynamically predicting hybrid depth-dependent filter weights and affinity matrices for propagating information; (iii) by augmenting a center-aware depth encoding (CDE) task, our method successfully alleviates the inaccurate depth prior; (iv) we thoroughly demonstrate the effectiveness of our proposed approach and show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset. Particularly, we rank 1st in the highly competitive KITTI monocular 3D object detection track on the submission day (November 16th, 2020). Code and models are released at https://github.com/fudan-zvg/DDMP

* CVPR 2021. Code at https://github.com/fudan-zvg/DDMP

Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking

Feb 08, 2021

Nan Jiang, Kuiran Wang, Xiaoke Peng, Xuehui Yu, Qiang Wang, Junliang Xing, Guorong Li, Jian Zhao, Guodong Guo, Zhenjun Han

Abstract: Unmanned Aerial Vehicles (UAVs) offer many applications in both commerce and recreation. As a result, monitoring the operation status of UAVs is crucially important. In this work, we consider the task of tracking UAVs, providing rich information such as location and trajectory. To facilitate research on this topic, we propose a dataset, Anti-UAV, with more than 300 video pairs containing over 580k manually annotated bounding boxes. The release of such a large-scale dataset could be a useful initial step in research on tracking UAVs. Furthermore, progress on the research challenges in Anti-UAV can help the design of anti-UAV systems, leading to better surveillance of UAVs. In addition, a novel approach named dual-flow semantic consistency (DFSC) is proposed for UAV tracking. Modulated by the semantic flow across video sequences, the tracker learns more robust class-level semantic information and obtains more discriminative instance-level features. Experimental results demonstrate that Anti-UAV is very challenging, and the proposed method can effectively improve the tracker's performance. The Anti-UAV benchmark and the code of the proposed approach will be publicly available at https://github.com/ucas-vg/Anti-UAV.

* 13 pages, 8 figures, submitted to IEEE T-MM

GINet: Graph Interaction Network for Scene Parsing

Sep 14, 2020

Tianyi Wu, Yu Lu, Yu Zhu, Chuang Zhang, Ming Wu, Zhanyu Ma, Guodong Guo

Abstract: Recently, context reasoning using image regions beyond local convolution has shown great potential for scene parsing. In this work, we explore how to incorporate linguistic knowledge to promote context reasoning over image regions by proposing a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss). The GI unit is capable of enhancing feature representations of convolution networks over high-level semantics and learning the semantic coherency adaptively to each sample. Specifically, the dataset-based linguistic knowledge is first incorporated in the GI unit to promote context reasoning over the visual graph, then the evolved representations of the visual graph are mapped to each local representation to enhance the discriminative capability for scene parsing. The GI unit is further improved by the SC-loss to enhance the semantic representations over the exemplar-based semantic graph. We perform full ablation studies to demonstrate the effectiveness of each component in our approach. Particularly, the proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.

* Accepted by ECCV 2020

Binarized Neural Architecture Search for Efficient Object Recognition

Sep 08, 2020

Hanlin Chen, Li'an Zhuo, Baochang Zhang, Xiawu Zheng, Jianzhuang Liu, Rongrong Ji, David Doermann, Guodong Guo

Abstract: Traditional neural architecture search (NAS) has had a significant impact in computer vision by automatically designing network architectures for various tasks. In this paper, binarized neural architecture search (BNAS), with a search space of binarized convolutions, is introduced to produce extremely compressed models to reduce the huge computational cost on embedded devices for edge computing. The BNAS calculation is more challenging than NAS due to the learning inefficiency caused by optimization requirements and the huge architecture space, and the performance loss when handling the wild data in various computing applications. To address these issues, we introduce operation space reduction and channel sampling into BNAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy that is robust to wild data, which is further used to abandon operations with less potential. Furthermore, we introduce the Upper Confidence Bound (UCB) to solve 1-bit BNAS. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves a comparable performance to NAS on both CIFAR and ImageNet databases. An accuracy of 96.53% vs. 97.22% is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and a 40% faster search than the state-of-the-art PC-DARTS. On the wild face recognition task, our binarized models achieve a performance similar to their corresponding full-precision models.

* arXiv admin note: substantial text overlap with arXiv:1911.10862
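
The abstract above mentions the Upper Confidence Bound (UCB) without giving its form. As a rough illustration only, the sketch below uses the generic UCB1 rule from the bandit literature: score each candidate by its mean observed reward plus an exploration bonus that shrinks as the candidate is tried more often. This is not BNAS's exact formulation; the function names, the operation names, and the reward values are all hypothetical.

```python
import math


def ucb_score(mean_reward: float, times_selected: int, total_rounds: int) -> float:
    """Generic UCB1 score: mean reward plus an exploration bonus."""
    if times_selected == 0:
        # Untried candidates get priority so that each is sampled at least once.
        return math.inf
    bonus = math.sqrt(2.0 * math.log(max(total_rounds, 1)) / times_selected)
    return mean_reward + bonus


def pick_operation(stats: dict, total_rounds: int) -> str:
    """Return the candidate with the highest UCB1 score.

    stats maps a candidate name to (mean_reward, times_selected).
    """
    return max(
        stats,
        key=lambda name: ucb_score(stats[name][0], stats[name][1], total_rounds),
    )


# Hypothetical statistics for two candidate operations with equal mean reward.
stats = {"conv3x3": (0.6, 50), "skip": (0.6, 5)}
print(pick_operation(stats, total_rounds=55))
```

With equal mean rewards, the rule favors the less-explored candidate, which is the usual motivation for UCB: it balances exploiting operations that already look good against exploring ones with few observations.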

iffDetector: Inference-aware Feature Filtering for Object Detection

Jun 23, 2020

Mingyuan Mao, Yuxin Tian, Baochang Zhang, Qixiang Ye, Wanquan Liu, Guodong Guo, David Doermann

Abstract: Modern CNN-based object detectors focus on feature configuration during training but often ignore feature optimization during inference. In this paper, we propose a new feature optimization approach to enhance features and suppress background noise in both the training and inference stages. We introduce a generic Inference-aware Feature Filtering (IFF) module that can easily be combined with modern detectors, resulting in our iffDetector. Unlike conventional open-loop feature calculation approaches without feedback, the IFF module performs closed-loop optimization by leveraging high-level semantics to enhance the convolutional features. By applying Fourier transform analysis, we demonstrate that the IFF module acts as a negative feedback that theoretically guarantees the stability of feature learning. IFF can be fused with CNN-based object detectors in a plug-and-play manner with negligible computational cost overhead. Experiments on the PASCAL VOC and MS COCO datasets demonstrate that our iffDetector consistently outperforms state-of-the-art methods by significant margins. The test code and model are anonymously available at https://github.com/anonymous2020new/iffDetector.

* 14 pages, 6 figures

Self-supervised Video Object Segmentation

Jun 22, 2020

Fangrui Zhu, Li Zhang, Yanwei Fu, Guodong Guo, Weidi Xie

Abstract: The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking). We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching, which resolves the challenge caused by the disappearance and reappearance of objects; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drifts caused by spatial-temporal discontinuity, e.g. occlusions or dis-occlusions, fast motions; (iii) we explore the efficiency of self-supervised representation learning for dense tracking and, surprisingly, show that a powerful tracking model can be trained with as few as 100 raw video clips (equivalent to a duration of 11 minutes), indicating that low-level statistics are already effective for tracking tasks; (iv) we demonstrate state-of-the-art results among the self-supervised approaches on DAVIS-2017 and YouTube-VOS, as well as surpassing most methods trained with millions of manual segmentation annotations, further bridging the gap between self-supervised and supervised learning. Code is released to foster further research (https://github.com/fangruizhu/self_sup_semiVOS).
