Abstract:State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards "small". Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.
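A minimal sketch of the pretrain-then-transfer recipe described above, treating hashtag prediction as multi-label classification before swapping the classifier head for the target task. The backbone choice, the hashtag-vocabulary size, and the sigmoid/BCE objective are placeholders, not the paper's exact setup:

```python
import torch.nn as nn
import torchvision

# Pretraining stage: treat hashtag prediction as multi-label classification.
backbone = torchvision.models.resnet50()
backbone.fc = nn.Linear(backbone.fc.in_features, 17000)  # placeholder hashtag-vocabulary size
hashtag_criterion = nn.BCEWithLogitsLoss()               # one common multi-label objective

# ...pretrain on (image, hashtag-set) pairs with the objective above...

# Transfer stage: swap the classifier head and fine-tune on the target task.
backbone.fc = nn.Linear(backbone.fc.in_features, 1000)   # e.g. ImageNet-1k classes
```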
Abstract:Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
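The two techniques named above can be summarized as a single schedule: scale the reference learning rate linearly with the minibatch size, and ramp up to that scaled rate over the first few epochs. A minimal sketch, assuming the commonly used ResNet-50 defaults (base rate 0.1 per 256 images, 5 warmup epochs, step decay):

```python
def lr_for_epoch(epoch, minibatch_size, base_lr=0.1, base_batch=256,
                 warmup_epochs=5, decay_epochs=(30, 60, 80), decay_factor=0.1):
    """Linear scaling rule with gradual warmup; defaults are illustrative."""
    scaled_lr = base_lr * minibatch_size / base_batch        # linear scaling rule
    if epoch < warmup_epochs:
        # gradual warmup: ramp from the small-minibatch rate to the scaled rate
        return base_lr + (scaled_lr - base_lr) * (epoch + 1) / warmup_epochs
    lr = scaled_lr
    for milestone in decay_epochs:
        if epoch >= milestone:
            lr *= decay_factor                               # standard step decay
    return lr

# Example: with a minibatch of 8192 images, the post-warmup rate is 0.1 * 8192 / 256 = 3.2.
```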
Abstract:We propose and study a novel panoptic segmentation (PS) task. Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we first propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. Second, we are working to introduce panoptic segmentation tracks at upcoming recognition challenges. The aim of our work is to revive the interest of the community in a more unified view of image segmentation.
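For reference, the PQ metric decomposes into a segmentation-quality term (average IoU of matched segments) and a recognition-quality term (an F1-style detection score). A minimal sketch for one class, assuming predicted and ground-truth segments are matched at IoU > 0.5:

```python
def panoptic_quality(matched_ious, num_pred, num_gt):
    """PQ for a single class. `matched_ious` holds the IoU of each predicted
    segment matched to a ground-truth segment (IoU > 0.5), so each entry is a
    true positive; unmatched predictions are false positives and unmatched
    ground-truth segments are false negatives."""
    tp = len(matched_ious)
    fp = num_pred - tp
    fn = num_gt - tp
    if tp + fp + fn == 0:
        return float("nan")                       # class not present in this evaluation
    sq = sum(matched_ious) / tp if tp else 0.0    # segmentation quality
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)          # recognition quality
    return sq * rq                                # PQ = SQ * RQ
```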
Abstract:Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both the Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at https://github.com/facebookresearch/video-nonlocal-net .
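The operation itself is compact: embed the input, compute pairwise affinities between all positions, and use them to take a weighted sum of the features at every position. A sketch of one such block for 2D feature maps with softmax (embedded-Gaussian-style) weighting; the 1x1-conv embeddings and the residual connection are common choices rather than the only instantiation:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local block for [N, C, H, W] feature maps: the response at each
    position is a weighted sum of the features at all positions."""

    def __init__(self, channels, inner_channels=None):
        super().__init__()
        inner = inner_channels or channels // 2
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)   # query embedding
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)     # key embedding
        self.g = nn.Conv2d(channels, inner, kernel_size=1)       # value embedding
        self.out = nn.Conv2d(inner, channels, kernel_size=1)     # project back

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)          # [N, HW, C']
        phi = self.phi(x).flatten(2)                              # [N, C', HW]
        g = self.g(x).flatten(2).transpose(1, 2)                  # [N, HW, C']
        weights = torch.softmax(theta @ phi, dim=-1)              # pairwise affinities
        y = (weights @ g).transpose(1, 2).reshape(n, -1, h, w)    # weighted sum over all positions
        return x + self.out(y)                                    # residual connection
```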
Abstract:Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views. Incorporating this ability to hallucinate novel instances of new concepts might help machine vision systems perform better low-shot learning, i.e., learning concepts from few examples. We present a novel approach to low-shot learning that uses this idea. Our approach builds on recent progress in meta-learning ("learning to learn") by combining a meta-learner with a "hallucinator" that produces additional training examples, and optimizing both models jointly. Our hallucinator can be incorporated into a variety of meta-learners and provides significant gains: up to a 6 point boost in classification accuracy when only a single training example is available, yielding state-of-the-art performance on the challenging ImageNet low-shot classification benchmark.
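A rough sketch of the hallucinator idea: a small network takes a seed example's feature vector plus noise and emits an additional, plausible feature vector for the same class, which is then mixed into the meta-learner's training set. The MLP architecture and the number of generated examples below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """Maps a seed feature vector plus random noise to an additional feature
    vector for the same class; intended to be trained jointly with a meta-learner."""

    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, seed_feats):                       # [k, feat_dim] real examples
        noise = torch.randn_like(seed_feats)
        return self.net(torch.cat([seed_feats, noise], dim=-1))

# Usage sketch: augment a one-shot class before handing it to the meta-learner.
#   extra = hallucinator(real_feats.repeat(5, 1))        # 5 hallucinated examples
#   train_feats = torch.cat([real_feats, extra], dim=0)
```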
Abstract:Most methods for object instance segmentation require all training examples to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ~100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models on a large set of categories, all of which have box annotations but only a small fraction of which have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We evaluate our approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.
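The weight transfer function can be pictured as a small learned mapping from a category's box-head parameters to its mask-head parameters, so that mask weights can be predicted even for categories that have only box annotations. A sketch under assumed layer sizes:

```python
import torch.nn as nn

class WeightTransferFunction(nn.Module):
    """Predicts a category's mask-head weights from its box-head weights, so
    mask parameters exist even for categories trained with boxes only."""

    def __init__(self, box_w_dim, mask_w_dim, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(box_w_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, mask_w_dim),
        )

    def forward(self, box_weights):          # [num_classes, box_w_dim]
        return self.mlp(box_weights)         # [num_classes, mask_w_dim]
```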
Abstract:To understand the visual world, a machine must recognize not only individual object instances but also how they interact. Humans are often at the center of such interactions, and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting <human, verb, object> triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person -- their pose, clothing, and action -- is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.
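One way to picture the action-specific density is as a Gaussian over candidate object-box offsets whose parameters are predicted from the detected person's appearance; candidates are then scored by how well they fall under that density. The Gaussian parameterization below is an illustrative assumption:

```python
import torch

def target_object_density(mu, sigma, candidate_offsets):
    """Scores candidate object boxes under an action-specific Gaussian density.
    `mu` and `sigma` ([4] each) would be predicted from the detected person's
    appearance; `candidate_offsets` ([n, 4]) are box offsets relative to the person."""
    return torch.exp(-((candidate_offsets - mu) ** 2 / (2 * sigma ** 2)).sum(dim=-1))
```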
Abstract:The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
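Concretely, the focal loss multiplies the cross-entropy term by a modulating factor (1 - p_t)^gamma, so well-classified examples (p_t close to 1) contribute almost nothing to training. A minimal sketch for binary per-anchor classification; gamma = 2 and alpha = 0.25 are the commonly used settings, and the mean reduction here is a simplification:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for binary classification logits. `targets` is a float tensor
    of 0/1 labels with the same shape as `logits`. Returning the mean is a
    simplification of the normalization used in practice."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)               # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()         # down-weight easy examples
```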
Abstract:We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron
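The added mask branch is a small per-RoI fully convolutional head that predicts one binary mask per class, run in parallel with the existing box branch. A sketch with assumed layer sizes and RoI resolution:

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Per-RoI mask branch: a small FCN over RoI-aligned features that predicts
    one binary mask per class, in parallel with the box branch."""

    def __init__(self, in_channels, num_classes, dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        self.upsample = nn.ConvTranspose2d(dim, dim, 2, stride=2)
        self.predict = nn.Conv2d(dim, num_classes, 1)      # one mask per class

    def forward(self, roi_feats):          # [num_rois, C, 14, 14] RoI-aligned features
        x = F.relu(self.upsample(self.convs(roi_feats)))
        return self.predict(x)             # [num_rois, num_classes, 28, 28] mask logits
```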
Abstract:We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
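In outline, data distillation runs a single trained model on several transformed copies of an unlabeled image, maps the predictions back to the original frame, averages them, and keeps only the confident ones as automatically generated labels. A sketch with hypothetical transform helpers and an assumed score threshold:

```python
import torch

def distill_annotations(model, image, transforms, inverse_transforms, score_thresh=0.9):
    """Runs one model over several transformed copies of an unlabeled image
    (e.g. flips, multiple scales), maps predictions back to the original frame,
    averages them, and keeps the confident ones as generated labels.
    `transforms`/`inverse_transforms` are hypothetical helper callables."""
    with torch.no_grad():
        preds = [inv(model(t(image))) for t, inv in zip(transforms, inverse_transforms)]
    scores = torch.stack(preds).mean(dim=0)            # ensemble over transformations
    keep = scores.max(dim=-1).values >= score_thresh   # select confident predictions
    return scores[keep]
```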