Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vlad I. Morariu

C-WSL: Count-guided Weakly Supervised Localization

Jul 25, 2018

Mingfei Gao, Ang Li, Ruichi Yu, Vlad I. Morariu, Larry S. Davis

Figure 1 for C-WSL: Count-guided Weakly Supervised Localization

Figure 2 for C-WSL: Count-guided Weakly Supervised Localization

Figure 3 for C-WSL: Count-guided Weakly Supervised Localization

Figure 4 for C-WSL: Count-guided Weakly Supervised Localization

Abstract:We introduce count-guided weakly supervised localization (C-WSL), an approach that uses per-class object count as a new form of supervision to improve weakly supervised localization (WSL). C-WSL uses a simple count-based region selection algorithm to select high-quality regions, each of which covers a single object instance during training, and improves existing WSL methods by training with the selected regions. To demonstrate the effectiveness of C-WSL, we integrate it into two WSL architectures and conduct extensive experiments on VOC2007 and VOC2012. Experimental results show that C-WSL leads to large improvements in WSL and that the proposed approach significantly outperforms the state-of-the-art methods. The results of annotation experiments on VOC2007 suggest that a modest extra time is needed to obtain per-class object counts compared to labeling only object categories in an image. Furthermore, we reduce the annotation time by more than $2\times$ and $38\times$ compared to center-click and bounding-box annotations.

* ECCV2018

Via

Access Paper or Ask Questions

Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition

Jun 12, 2018

Yaming Wang, Vlad I. Morariu, Larry S. Davis

Figure 1 for Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition

Figure 2 for Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition

Figure 3 for Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition

Figure 4 for Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition

Abstract:Compared to earlier multistage frameworks using CNN features, recent end-to-end deep approaches for fine-grained recognition essentially enhance the mid-level learning capability of CNNs. Previous approaches achieve this by introducing an auxiliary network to infuse localization information into the main classification network, or a sophisticated feature encoding method to capture higher order feature statistics. We show that mid-level representation learning can be enhanced within the CNN framework, by learning a bank of convolutional filters that capture class-specific discriminative patches without extra part or bounding box annotations. Such a filter bank is well structured, properly initialized and discriminatively learned through a novel asymmetric multi-stream architecture with convolutional filter supervision and a non-random layer initialization. Experimental results show that our approach achieves state-of-the-art on three publicly available fine-grained recognition datasets (CUB-200-2011, Stanford Cars and FGVC-Aircraft). Ablation studies and visualizations are provided to understand our approach.

Via

Access Paper or Ask Questions

Learning Rich Features for Image Manipulation Detection

May 13, 2018

Peng Zhou, Xintong Han, Vlad I. Morariu, Larry S. Davis

Figure 1 for Learning Rich Features for Image Manipulation Detection

Figure 2 for Learning Rich Features for Image Manipulation Detection

Figure 3 for Learning Rich Features for Image Manipulation Detection

Figure 4 for Learning Rich Features for Image Manipulation Detection

Abstract:Image manipulation detection is different from traditional semantic object detection because it pays more attention to tampering artifacts than to image content, which suggests that richer features need to be learned. We propose a two-stream Faster R-CNN network and train it endto- end to detect the tampered regions given a manipulated image. One of the two streams is an RGB stream whose purpose is to extract features from the RGB image input to find tampering artifacts like strong contrast difference, unnatural tampered boundaries, and so on. The other is a noise stream that leverages the noise features extracted from a steganalysis rich model filter layer to discover the noise inconsistency between authentic and tampered regions. We then fuse features from the two streams through a bilinear pooling layer to further incorporate spatial co-occurrence of these two modalities. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream, and also achieves state-of-the-art performance compared to alternative methods with robustness to resizing and compression.

* CVPR 2018 Camera Ready

Via

Access Paper or Ask Questions

Fused Deep Neural Networks for Efficient Pedestrian Detection

May 02, 2018

Xianzhi Du, Mostafa El-Khamy, Vlad I. Morariu, Jungwon Lee, Larry Davis

Figure 1 for Fused Deep Neural Networks for Efficient Pedestrian Detection

Figure 2 for Fused Deep Neural Networks for Efficient Pedestrian Detection

Figure 3 for Fused Deep Neural Networks for Efficient Pedestrian Detection

Figure 4 for Fused Deep Neural Networks for Efficient Pedestrian Detection

Abstract:In this paper, we present an efficient pedestrian detection system, designed by fusion of multiple deep neural network (DNN) systems. Pedestrian candidates are first generated by a single shot convolutional multi-box detector at different locations with various scales and aspect ratios. The candidate generator is designed to provide the majority of ground truth pedestrian annotations at the cost of a large number of false positives. Then, a classification system using the idea of ensemble learning is deployed to improve the detection accuracy. The classification system further classifies the generated candidates based on opinions of multiple deep verification networks and a fusion network which utilizes a novel soft-rejection fusion method to adjust the confidence in the detection results. To improve the training of the deep verification networks, a novel soft-label method is devised to assign floating point labels to the generated pedestrian candidates. A deep context aggregation semantic segmentation network also provides pixel-level classification of the scene and its results are softly fused with the detection results by the single shot detector. Our pedestrian detector compared favorably to state-of-art methods on all popular pedestrian detection datasets. For example, our fused DNN has better detection accuracy on the Caltech Pedestrian dataset than all previous state of art methods, while also being the fastest. We significantly improved the log-average miss rate on the Caltech pedestrian dataset to 7.67% and achieved the new state-of-the-art.

* 11 pages

Via

Access Paper or Ask Questions

Representing Videos based on Scene Layouts for Recognizing Agent-in-Place Actions

Apr 04, 2018

Ruichi Yu, Hongcheng Wang, Ang Li, Jingxiao Zheng, Vlad I. Morariu, Larry S. Davis

Figure 1 for Representing Videos based on Scene Layouts for Recognizing Agent-in-Place Actions

Figure 2 for Representing Videos based on Scene Layouts for Recognizing Agent-in-Place Actions

Figure 3 for Representing Videos based on Scene Layouts for Recognizing Agent-in-Place Actions

Figure 4 for Representing Videos based on Scene Layouts for Recognizing Agent-in-Place Actions

Abstract:We address the recognition of agent-in-place actions, which are associated with agents who perform them and places where they occur, in the context of outdoor home surveillance. We introduce a representation of the geometry and topology of scene layouts so that a network can generalize from the layouts observed in the training set to unseen layouts in the test set. This Layout-Induced Video Representation (LIVR) abstracts away low-level appearance variance and encodes geometric and topological relationships of places in a specific scene layout. LIVR partitions the semantic features of a video clip into different places to force the network to learn place-based feature descriptions; to predict the confidence of each action, LIVR aggregates features from the place associated with an action and its adjacent places on the scene layout. We introduce the Agent-in-Place Action dataset to show that our method allows neural network models to generalize significantly better to unseen scenes.

Via

Access Paper or Ask Questions

Two-Stream Neural Networks for Tampered Face Detection

Mar 29, 2018

Peng Zhou, Xintong Han, Vlad I. Morariu, Larry S. Davis

Figure 1 for Two-Stream Neural Networks for Tampered Face Detection

Figure 2 for Two-Stream Neural Networks for Tampered Face Detection

Figure 3 for Two-Stream Neural Networks for Tampered Face Detection

Figure 4 for Two-Stream Neural Networks for Tampered Face Detection

Abstract:We propose a two-stream network for face tampering detection. We train GoogLeNet to detect tampering artifacts in a face classification stream, and train a patch based triplet network to leverage features capturing local noise residuals and camera characteristics as a second stream. In addition, we use two different online face swapping applications to create a new dataset that consists of 2010 tampered images, each of which contains a tampered face. We evaluate the proposed two-stream network on our newly collected dataset. Experimental results demonstrate the effectiveness of our method.

* 2017 CVPR workshop

Via

Access Paper or Ask Questions

Dynamic Zoom-in Network for Fast Object Detection in Large Images

Mar 27, 2018

Mingfei Gao, Ruichi Yu, Ang Li, Vlad I. Morariu, Larry S. Davis

Figure 1 for Dynamic Zoom-in Network for Fast Object Detection in Large Images

Figure 2 for Dynamic Zoom-in Network for Fast Object Detection in Large Images

Figure 3 for Dynamic Zoom-in Network for Fast Object Detection in Large Images

Figure 4 for Dynamic Zoom-in Network for Fast Object Detection in Large Images

Abstract:We introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain for analyzing a region at a higher resolution and another model (Q-net) that sequentially selects regions to zoom in. Experiments on the Caltech Pedestrians dataset show that our approach reduces the number of processed pixels by over 50% without a drop in detection accuracy. The merits of our approach become more significant on a high resolution test set collected from YFCC100M dataset, where our approach maintains high detection performance while reducing the number of processed pixels by about 70% and the detection time by over 50%.

* CVPR2018

Via

Access Paper or Ask Questions

NISP: Pruning Networks using Neuron Importance Score Propagation

Mar 21, 2018

Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, Larry S. Davis

Figure 1 for NISP: Pruning Networks using Neuron Importance Score Propagation

Figure 2 for NISP: Pruning Networks using Neuron Importance Score Propagation

Figure 3 for NISP: Pruning Networks using Neuron Importance Score Propagation

Figure 4 for NISP: Pruning Networks using Neuron Importance Score Propagation

Abstract:To reduce the significant redundancy in deep Convolutional Neural Networks (CNNs), most existing methods prune neurons by only considering statistics of an individual layer or two consecutive layers (e.g., prune one layer to minimize the reconstruction error of the next layer), ignoring the effect of error propagation in deep networks. In contrast, we argue that it is essential to prune neurons in the entire neuron network jointly based on a unified goal: minimizing the reconstruction error of important responses in the "final response layer" (FRL), which is the second-to-last layer before classification, for a pruned network to retrain its predictive power. Specifically, we apply feature ranking techniques to measure the importance of each neuron in the FRL, and formulate network pruning as a binary integer optimization problem and derive a closed-form solution to it for pruning neurons in earlier layers. Based on our theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses to every neuron in the network. The CNN is pruned by removing neurons with least importance, and then fine-tuned to retain its predictive power. NISP is evaluated on several datasets with multiple CNN models and demonstrated to achieve significant acceleration and compression with negligible accuracy loss.

Via

Access Paper or Ask Questions

Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Aug 03, 2017

Ruichi Yu, Ang Li, Vlad I. Morariu, Larry S. Davis

Figure 1 for Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Figure 2 for Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Figure 3 for Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Figure 4 for Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Abstract:Understanding visual relationships involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the (subj,obj) pair (both semantically and spatially) to predict the predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships, but complicates learning since the semantic space of visual relationships is huge and the training data is limited, especially for the long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj,obj) pair. Then, we distill the knowledge into a deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on VRD zero-shot testing set).

* ICCV 2017

Via

Access Paper or Ask Questions

Generating Holistic 3D Scene Abstractions for Text-based Image Retrieval

Apr 11, 2017

Ang Li, Jin Sun, Joe Yue-Hei Ng, Ruichi Yu, Vlad I. Morariu, Larry S. Davis

Figure 1 for Generating Holistic 3D Scene Abstractions for Text-based Image Retrieval

Figure 2 for Generating Holistic 3D Scene Abstractions for Text-based Image Retrieval

Figure 3 for Generating Holistic 3D Scene Abstractions for Text-based Image Retrieval

Figure 4 for Generating Holistic 3D Scene Abstractions for Text-based Image Retrieval

Abstract:Spatial relationships between objects provide important information for text-based image retrieval. As users are more likely to describe a scene from a real world perspective, using 3D spatial relationships rather than 2D relationships that assume a particular viewing direction, one of the main challenges is to infer the 3D structure that bridges images with users' text descriptions. However, direct inference of 3D structure from images requires learning from large scale annotated data. Since interactions between objects can be reduced to a limited set of atomic spatial relations in 3D, we study the possibility of inferring 3D structure from a text description rather than an image, applying physical relation models to synthesize holistic 3D abstract object layouts satisfying the spatial constraints present in a textual description. We present a generic framework for retrieving images from a textual description of a scene by matching images with these generated abstract object layouts. Images are ranked by matching object detection outputs (bounding boxes) to 2D layout candidates (also represented by bounding boxes) which are obtained by projecting the 3D scenes with sampled camera directions. We validate our approach using public indoor scene datasets and show that our method outperforms baselines built upon object occurrence histograms and learned 2D pairwise relations.

* CVPR 2017

Via

Access Paper or Ask Questions