Tianfu Wu

High Resolution Face Completion with Multiple Controllable Attributes via Fully End-to-End Progressive Generative Adversarial Networks

Jan 23, 2018
Zeyuan Chen, Shaoliang Nie, Tianfu Wu, Christopher G. Healey

We present a deep learning approach for high resolution face completion with multiple controllable attributes (e.g., male and smiling) under arbitrary masks. Face completion entails understanding both structural meaningfulness and appearance consistency, locally and globally, to fill in "holes" whose content does not appear elsewhere in an input image. It is a challenging task whose difficulty increases significantly with resolution, the complexity of the "holes" and the controllable attributes of the filled-in fragments. Our system addresses these challenges by learning a fully end-to-end framework that trains generative adversarial networks (GANs) progressively from low resolution to high resolution with conditional vectors encoding controllable attributes. We design novel network architectures to exploit information across multiple scales effectively and efficiently. We introduce new loss functions encouraging sharp completion. We show that our system can complete faces with large structural and appearance variations using a single feed-forward pass of computation, with a mean inference time of 0.007 seconds for images at 1024 x 1024 resolution. We also perform a pilot human study showing that our approach outperforms state-of-the-art face completion methods in terms of rank analysis. The code will be released upon publication.
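
To make the attribute conditioning concrete, the sketch below (PyTorch) shows one plausible way a completion generator could take a masked image, its binary hole mask, and an attribute vector, then composite the generated content back into the hole. The CompletionGenerator class, its layer sizes, and the compositing rule are illustrative assumptions, not the authors' released architecture.

# Minimal sketch of attribute-conditioned completion; hypothetical, not the paper's model.
import torch
import torch.nn as nn

class CompletionGenerator(nn.Module):
    def __init__(self, attr_dim=2, base_ch=32):
        super().__init__()
        # Input channels: masked RGB image (3) + binary mask (1) + attribute maps (attr_dim).
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1 + attr_dim, base_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base_ch, base_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base_ch, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, image, mask, attrs):
        b, _, h, w = image.shape
        # Broadcast the attribute vector (e.g., [male, smiling]) to spatial maps.
        attr_maps = attrs.view(b, -1, 1, 1).expand(b, attrs.shape[1], h, w)
        masked = image * (1 - mask)                       # zero out the "holes"
        raw = self.net(torch.cat([masked, mask, attr_maps], dim=1))
        return raw * mask + image * (1 - mask)            # keep known pixels, fill the hole

x = torch.rand(1, 3, 64, 64)
m = torch.zeros(1, 1, 64, 64); m[..., 16:48, 16:48] = 1   # square hole mask
out = CompletionGenerator()(x, m, torch.tensor([[1.0, 0.0]]))
print(out.shape)  # torch.Size([1, 3, 64, 64])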

AOGNets: Deep AND-OR Grammar Networks for Visual Recognition

Nov 15, 2017
Xilai Li, Tianfu Wu, Xi Song, Hamid Krim

This paper presents a method of learning deep AND-OR Grammar (AOG) networks for visual recognition, which we term AOGNets. An AOGNet consists of a number of stages, each of which is composed of a number of AOG building blocks. An AOG building block is designed based on a principled AND-OR grammar and represented by a hierarchical and compositional AND-OR graph. Each node applies some basic operation (e.g., Conv-BatchNorm-ReLU) to its input. There are three types of nodes: an AND-node explores composition, and its input is computed by concatenating the features of its child nodes; an OR-node represents alternative ways of composition in the spirit of exploitation, and its input is the element-wise sum of the features of its child nodes; and a Terminal-node takes as input a channel-wise slice of the input feature map of the AOG building block. AOGNets aim to harness the best of two worlds (grammar models and deep neural networks) in representation learning with end-to-end training. In experiments, AOGNets are tested on three highly competitive image classification benchmarks: CIFAR-10, CIFAR-100 and ImageNet-1K. AOGNets obtain better performance than the widely used Residual Net and its variants, and are comparable to the Dense Net. AOGNets are also tested in object detection on PASCAL VOC 2007 and 2012 using the vanilla Faster RCNN system and obtain better performance than the Residual Net.
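
As a rough illustration of the three node types, here is a minimal PyTorch sketch of a toy building block with a single binary composition; the channel sizes and wiring are assumptions, not the paper's actual block configuration.

import torch
import torch.nn as nn

def node_op(in_ch, out_ch):
    # The basic per-node operation mentioned above: Conv-BatchNorm-ReLU.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

class ToyAOGBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        half = channels // 2
        self.t_left = node_op(half, half)            # Terminal-node on the first channel slice
        self.t_right = node_op(half, half)           # Terminal-node on the second channel slice
        self.t_full = node_op(channels, channels)    # Terminal-node on the full feature map
        self.and_node = node_op(channels, channels)  # AND-node: children are concatenated
        self.or_node = node_op(channels, channels)   # OR-node: children are summed element-wise

    def forward(self, x):
        half = x.shape[1] // 2
        left, right = self.t_left(x[:, :half]), self.t_right(x[:, half:])
        composed = self.and_node(torch.cat([left, right], dim=1))  # composition (exploration)
        return self.or_node(composed + self.t_full(x))             # alternatives (exploitation)

y = ToyAOGBlock()(torch.rand(2, 16, 8, 8))
print(y.shape)  # torch.Size([2, 16, 8, 8])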

* 10 pages 

An Attention-Driven Approach of No-Reference Image Quality Assessment

May 29, 2017
Diqi Chen, Yizhou Wang, Tianfu Wu, Wen Gao

In this paper, we present a novel method for no-reference image quality assessment (NR-IQA), which predicts the perceptual quality score of a given image without using any reference image. The proposed method harnesses three components: (i) the visual attention mechanism, which affects many aspects of visual perception, including image quality assessment, yet is overlooked in the NR-IQA literature; the method assumes that the fixation areas on an image contain key information for the process of IQA; (ii) the robust averaging strategy, a means, supported by psychology studies, of integrating multiple/step-wise evidence to make a final perceptual judgment; and (iii) multi-task learning, which is believed to be an effective means of shaping representation learning and could result in a more generalized model. To exploit the synergy of the three, we consider NR-IQA as a dynamic perception process in which the model samples a sequence of "informative" areas and aggregates the information to learn a representation for the tasks of jointly predicting the image quality score and the distortion type. Model learning is implemented by a reinforcement strategy, in which the rewards of both tasks guide the learning of the optimal sampling policy to acquire the "task-informative" image regions so that the predictions can be made accurately and efficiently (in terms of the number of sampling steps). The reinforcement learning is realized by a deep network with the policy gradient method and trained through back-propagation. In experiments, the model is tested on the TID2008 dataset and it outperforms several state-of-the-art methods. Furthermore, the model is very efficient in the sense that only a small number of fixations are used in NR-IQA.
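
A rough PyTorch sketch of the core loop, in which the two task losses also drive a REINFORCE-style fixation policy; the tiny linear networks, the additive aggregation, and the negative-loss reward are assumptions made only for illustration.

import torch
import torch.nn as nn

feat_dim, n_regions, n_distortions, steps = 32, 16, 5, 3
policy = nn.Linear(feat_dim, n_regions)          # scores candidate fixation regions
quality_head = nn.Linear(feat_dim, 1)            # predicts the perceptual quality score
distortion_head = nn.Linear(feat_dim, n_distortions)

region_feats = torch.rand(n_regions, feat_dim)   # pre-pooled features, one per region
target_score, target_type = torch.tensor([0.7]), torch.tensor([2])

state, log_probs = torch.zeros(feat_dim), []
for _ in range(steps):                           # sample a sequence of "informative" regions
    probs = torch.softmax(policy(state), dim=-1)
    idx = torch.multinomial(probs, 1)
    log_probs.append(torch.log(probs[idx]))
    state = state + region_feats[idx.item()]     # simple additive aggregation of evidence

task_loss = nn.functional.mse_loss(quality_head(state), target_score) \
    + nn.functional.cross_entropy(distortion_head(state).unsqueeze(0), target_type)
reward = -task_loss.detach()                     # both tasks shape the sampling policy
loss = task_loss - reward * torch.stack(log_probs).sum()  # REINFORCE term (no baseline here)
loss.backward()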

* 9 pages, 7 figures 

Object Detection via Aspect Ratio and Context Aware Region-based Convolutional Networks

Mar 22, 2017
Bo Li, Tianfu Wu, Shuai Shao, Lun Zhang, Rufeng Chu

Jointly integrating aspect ratio and context has been extensively studied and shown to improve performance in traditional object detection systems such as DPMs. It has, however, been largely ignored in deep neural network based detection systems. This paper presents a method of integrating a mixture of object models and region-based convolutional networks for accurate object detection. Each mixture component accounts for both object aspect ratio and multi-scale contextual information explicitly: (i) it exploits a mixture of tiling configurations in the RoI pooling to remedy the warping artifacts caused by a single type of RoI pooling (e.g., with equally-sized 7 x 7 cells) and to better respect the underlying object shapes; (ii) it "looks from both the inside and the outside of a RoI" by incorporating contextual information at two scales: global context pooled from the whole image and local context pooled from the surroundings of a RoI. To facilitate accurate detection, this paper proposes a multi-stage detection scheme for integrating the mixture of object models, which uses the detection results of the model at the previous stage as proposals for the current stage in both training and testing. The proposed method is called the aspect ratio and context aware region-based convolutional network (ARC-R-CNN). In experiments, ARC-R-CNN shows very competitive results with Faster R-CNN [41] and R-FCN [10] on two datasets: the PASCAL VOC and the Microsoft COCO. It obtains significantly better mAP performance at high IoU thresholds on both datasets.
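
As a sketch of the pooling side (PyTorch/torchvision), the snippet below pools one RoI with several tiling grids and adds local and global context features; the specific grid shapes and the 1.5x enlargement factor are illustrative assumptions.

import torch
from torchvision.ops import roi_align

feat = torch.rand(1, 64, 50, 50)                  # backbone feature map
roi = torch.tensor([[0., 10., 10., 30., 40.]])    # (batch_idx, x1, y1, x2, y2)

def enlarge(r, s):                                # enlarged RoI for local context
    cx, cy = (r[:, 1] + r[:, 3]) / 2, (r[:, 2] + r[:, 4]) / 2
    w, h = (r[:, 3] - r[:, 1]) * s, (r[:, 4] - r[:, 2]) * s
    return torch.stack([r[:, 0], cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

tilings = [(7, 7), (5, 9), (9, 5)]                # a mixture of tiling configurations
parts = [roi_align(feat, roi, size) for size in tilings]
local_ctx = roi_align(feat, enlarge(roi, 1.5), (7, 7))   # "outside of a RoI"
global_ctx = feat.mean(dim=(2, 3), keepdim=True)         # whole-image context
print([p.shape for p in parts], local_ctx.shape, global_ctx.shape)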

Zero-Shot Learning posed as a Missing Data Problem

Feb 21, 2017
Bo Zhao, Botong Wu, Tianfu Wu, Yizhou Wang

This paper presents a method for zero-shot learning (ZSL) which poses ZSL as a missing data problem rather than a missing label problem. Specifically, most existing ZSL methods focus on learning mapping functions from the image feature space to the label embedding space. In contrast, the proposed method explores a simple yet effective transductive framework in the reverse direction: it estimates the data distribution of unseen classes in the image feature space by transferring knowledge from the label embedding space. In experiments, our method outperforms the state of the art on two popular datasets.
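
A minimal sketch of that reverse, transductive direction on synthetic data (numpy/scikit-learn): regress class centers in the image feature space from the label embeddings of seen classes, then classify unseen samples by the nearest estimated center. The ridge regressor and nearest-center rule are simplifying assumptions, not the paper's exact estimator.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d_feat, d_attr, n_seen, n_unseen = 64, 16, 20, 5
seen_centers = rng.normal(size=(n_seen, d_feat))      # per-class feature means (seen classes)
seen_attrs = rng.normal(size=(n_seen, d_attr))        # label embeddings of seen classes
unseen_attrs = rng.normal(size=(n_unseen, d_attr))    # label embeddings of unseen classes

# Transfer knowledge from the label embedding space to the image feature space.
reg = Ridge(alpha=1.0).fit(seen_attrs, seen_centers)
unseen_centers = reg.predict(unseen_attrs)            # estimated "missing" class data

x = unseen_centers[3] + 0.1 * rng.normal(size=d_feat) # a test feature near class 3
pred = int(np.argmin(np.linalg.norm(unseen_centers - x, axis=1)))
print(pred)  # expected: 3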

Online Object Tracking, Learning and Parsing with And-Or Graphs

Sep 03, 2016
Tianfu Wu, Yang Lu, Song-Chun Zhu

This paper presents a method, called AOGTracker, for simultaneous tracking, learning and parsing (TLP) of unknown objects in video sequences with a hierarchical and compositional And-Or graph (AOG) representation. The AOG captures both structural and appearance variations of a target object in a principled way. The TLP method is formulated in the Bayesian framework with spatial and temporal dynamic programming (DP) algorithms inferring object bounding boxes on-the-fly. During online learning, the AOG is discriminatively learned using latent SVM to account for appearance variations (e.g., lighting and partial occlusion) and structural variations (e.g., different poses and viewpoints) of a tracked object, as well as distractors (e.g., similar objects) in the background. Three key issues in online inference and learning are addressed: (i) maintaining the purity of positive and negative examples collected online, (ii) controlling model complexity in latent structure learning, and (iii) identifying critical moments to re-learn the structure of the AOG based on its intrackability. The intrackability measure quantifies the uncertainty of an AOG based on its score maps in a frame. In experiments, our AOGTracker is tested on two popular tracking benchmark suites with the same parameter setting: the TB-100/50/CVPR2013 benchmarks, and the VOT benchmarks (VOT 2013, 2014, 2015 and TIR2015, the latter for thermal imagery tracking). On the former, our AOGTracker outperforms state-of-the-art tracking algorithms, including two trackers based on deep convolutional networks. On the latter, our AOGTracker outperforms all other trackers in VOT2013 and is comparable to the state-of-the-art methods in VOT2014, 2015 and TIR2015.
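
One way to picture the intrackability idea is as the entropy of a normalized score map, as in the numpy sketch below; the exact definition used in the paper may differ, so treat this as an illustrative assumption.

import numpy as np

def intrackability(score_map):
    p = np.exp(score_map - score_map.max())
    p /= p.sum()                                   # softmax over all locations
    return float(-(p * np.log(p + 1e-12)).sum())   # entropy of the response distribution

peaked = np.zeros((32, 32)); peaked[16, 16] = 10.0 # confident, single-peak response
flat = np.zeros((32, 32))                          # ambiguous, flat response
print(intrackability(peaked) < intrackability(flat))  # True: the peaked map is more trackable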

* 17 pages, Reproducibility: The source code is released with this paper for reproducing all results, which is available at https://github.com/tfwu/RGM-AOGTracker 

Face Detection with End-to-End Integration of a ConvNet and a 3D Model

Aug 29, 2016
Yunzhu Li, Benyuan Sun, Tianfu Wu, Yizhou Wang

This paper presents a method for face detection in the wild, which integrates a ConvNet and a 3D mean face model in an end-to-end multi-task discriminative learning framework. The 3D mean face model is predefined and fixed (e.g., we used the one provided in the AFLW dataset). The ConvNet consists of two components: (i) The face proposal component computes face bounding box proposals by estimating facial key-points and the 3D transformation (rotation and translation) parameters for each predicted key-point w.r.t. the 3D mean face model. (ii) The face verification component computes detection results by pruning and refining proposals based on facial key-point based configuration pooling. The proposed method addresses two issues in adapting state-of-the-art generic object detection ConvNets (e.g., Faster R-CNN) for face detection: (i) one is to eliminate the heuristic design of predefined anchor boxes in the region proposal network (RPN) by exploiting a 3D mean face model; (ii) the other is to replace the generic RoI (Region-of-Interest) pooling layer with a configuration pooling layer to respect underlying object structures. The multi-task loss consists of three terms: the classification Softmax loss and the location smooth-L1 losses [14] of both the facial key-points and the face bounding boxes. In experiments, our ConvNet is trained on the AFLW dataset only and tested on the FDDB benchmark with fine-tuning and on the AFW benchmark without fine-tuning. The proposed method obtains very competitive state-of-the-art performance on the two benchmarks.
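
A small numpy sketch of how a face box proposal could be derived from an estimated 3D transformation of mean-face key-points; the key-point coordinates and the weak-perspective projection used here are assumptions made for illustration.

import numpy as np

mean_face_3d = np.array([[-30., 30., -20.], [30., 30., -20.],  # eye corners
                         [0., 0., 0.], [0., -30., -10.]])      # nose tip, mouth

def propose_box(R, t, s):
    pts_2d = s * (mean_face_3d @ R.T)[:, :2] + t   # rotate, project, translate
    x1, y1 = pts_2d.min(axis=0)
    x2, y2 = pts_2d.max(axis=0)
    return np.array([x1, y1, x2, y2])              # tight box around projected key-points

R = np.eye(3)                                      # frontal pose
print(propose_box(R, t=np.array([100., 80.]), s=1.0))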

* 16 pages, Y. Li and B. Sun contributed equally to this work 

Recognizing Car Fluents from Video

Mar 26, 2016
Bo Li, Tianfu Wu, Caiming Xiong, Song-Chun Zhu

Physical fluents, a term originally used by Newton [40], refer to time-varying object states in dynamic scenes. In this paper, we are interested in inferring the fluents of vehicles from video. For example, a door (hood, trunk) is opened or closed through various actions, or a light blinks to signal a turn. Recognizing these fluents has broad applications, yet it has received scant attention in the computer vision literature. Car fluent recognition entails a unified framework for car detection, car part localization and part status recognition, which is made difficult by large structural and appearance variations, low resolutions and occlusions. This paper learns a spatial-temporal And-Or hierarchical model to represent car fluents. The learning of this model is formulated under the latent structural SVM framework. Since there is no publicly available related dataset, we collect and annotate a car fluent dataset consisting of car videos with diverse fluents. In experiments, the proposed method outperforms several highly related baseline methods in terms of car fluent recognition and car part localization.
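
For intuition about the learning formulation, here is a toy numpy sketch of a binary, linear latent-SVM objective in which the score of an example maximizes over candidate latent assignments; the paper's structural, spatial-temporal version is considerably richer.

import numpy as np

def latent_score(w, features_per_latent):
    # features_per_latent: one feature vector per candidate latent assignment h
    scores = features_per_latent @ w
    h_star = int(np.argmax(scores))                # best latent assignment (e.g., a parse)
    return scores[h_star], h_star

def objective(w, examples):
    # examples: list of (features_per_latent, label in {+1, -1})
    total = 0.5 * float(np.dot(w, w))              # L2 regularization
    for feats, y in examples:
        s, _ = latent_score(w, feats)
        total += max(0.0, 1.0 - y * s)             # hinge loss on the max-over-latent score
    return total

rng = np.random.default_rng(0)
w = rng.normal(size=8)
examples = [(rng.normal(size=(4, 8)), +1), (rng.normal(size=(4, 8)), -1)]
print(objective(w, examples))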

* Accepted by CVPR 2016 

A Restricted Visual Turing Test for Deep Scene and Event Understanding

Dec 16, 2015
Hang Qi, Tianfu Wu, Mun-Wai Lee, Song-Chun Zhu

This paper presents a restricted visual Turing test (VTT) for story-line based deep understanding of long-term, multi-camera captured videos. Given a set of videos of a scene (such as a multi-room office, a garden, and a parking lot) and a sequence of story-line based queries, the task is to provide answers either simply in binary "true/false" form (for a polar query) or as an accurate natural language description (for a non-polar query). Queries, polar or non-polar, consist of view-based queries, which can be answered from a particular camera view, and scene-centered queries, which involve joint inference across different cameras. The story lines are collected to cover spatial, temporal and causal understanding of the input videos. The data and queries distinguish our VTT from recently proposed visual question answering on images and video captioning. A vision system is proposed to perform joint video and query parsing, integrating different vision modules, a knowledge base and a query engine. The system provides unified interfaces for the different modules so that individual modules can be reconfigured to test a new method. We provide a benchmark dataset and a toolkit for ontology-guided story-line query generation, which consists of about 93.5 hours of video captured at four different locations and 3,426 queries split into 127 story lines. We also provide a baseline implementation and result analyses.
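
A lightweight Python sketch of how a story-line query and its answer could be represented for such a test; the field names are hypothetical, not the benchmark's actual schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class StoryLineQuery:
    text: str
    polar: bool                      # True: answered "true/false"; False: needs a description
    scene_centered: bool             # True: joint inference across cameras; False: single view
    camera_id: Optional[str] = None  # set only for view-based queries

@dataclass
class Answer:
    value: str                       # "true"/"false" for polar queries, free-form text otherwise

q = StoryLineQuery("Is the parking lot empty at 9am?", polar=True, scene_centered=True)
print(q, Answer("false"))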

Learning And-Or Models to Represent Context and Occlusion for Car Detection and Viewpoint Estimation

Sep 27, 2015
Tianfu Wu, Bo Li, Song-Chun Zhu

This paper presents a method for learning And-Or models to represent context and occlusion for car detection and viewpoint estimation. The learned And-Or model represents car-to-car context and occlusion configurations at three levels: (i) spatially-aligned cars, (ii) single car under different occlusion configurations, and (iii) a small number of parts. The And-Or model embeds a grammar for representing large structural and appearance variations in a reconfigurable hierarchy. The learning process consists of two stages in a weakly supervised way (i.e., only bounding boxes of single cars are annotated). Firstly, the structure of the And-Or model is learned with three components: (a) mining multi-car contextual patterns based on layouts of annotated single car bounding boxes, (b) mining occlusion configurations between single cars, and (c) learning different combinations of part visibility based on car 3D CAD simulation. The And-Or model is organized in a directed and acyclic graph which can be inferred by Dynamic Programming. Secondly, the model parameters (for appearance, deformation and bias) are jointly trained using Weak-Label Structural SVM. In experiments, we test our model on four car detection datasets (the KITTI dataset [Geiger12], the PASCAL VOC2007 car dataset [pascal], and two self-collected car datasets, namely the Street-Parking car dataset and the Parking-Lot car dataset) and three datasets for car viewpoint estimation (the PASCAL VOC2006 car dataset [pascal], the 3D car dataset [savarese], and the PASCAL3D+ car dataset [xiang_wacv14]). Compared with state-of-the-art variants of deformable part-based models and other methods, our model achieves significant improvement consistently on the four detection datasets, and comparable performance on car viewpoint estimation.
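
A toy Python sketch of the dynamic-programming inference over an And-Or graph described above: Terminal-nodes contribute appearance scores, AND-nodes sum over their children, and OR-nodes maximize over alternatives. The graph layout and scores are invented for illustration.

from functools import lru_cache

graph = {                              # node -> (type, children)
    "root_or": ("OR", ["config_a", "config_b"]),
    "config_a": ("AND", ["t_front", "t_rear"]),
    "config_b": ("AND", ["t_front", "t_side"]),
    "t_front": ("TERM", []), "t_rear": ("TERM", []), "t_side": ("TERM", []),
}
terminal_scores = {"t_front": 1.0, "t_rear": 0.5, "t_side": 0.75}

@lru_cache(maxsize=None)
def dp_score(node):
    kind, children = graph[node]
    if kind == "TERM":
        return terminal_scores[node]           # appearance score of a part/terminal
    child_scores = [dp_score(c) for c in children]
    return sum(child_scores) if kind == "AND" else max(child_scores)

print(dp_score("root_or"))  # max(1.0 + 0.5, 1.0 + 0.75) = 1.75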

* 14 pages 