Rahul Sukthankar

Object category learning and retrieval with weak supervision

Jul 23, 2018
Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar

We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision. To that end, we propose a fully differentiable unsupervised deep clustering approach to learn semantic classes in an end-to-end fashion without individual class labeling using only unlabeled object proposals. The key contributions of our work are 1) a k-means clustering objective where the clusters are learned as parameters of the network and are represented as memory units, and 2) simultaneously building a feature representation, or embedding, while learning to cluster it. This approach shows promising results on two popular computer vision datasets: on CIFAR10 for clustering objects, and on the more complex and challenging Cityscapes dataset for semantically discovering classes which visually correspond to cars, people, and bicycles. Currently, the only supervision provided is segmentation objectness masks, but this method can be extended to use an unsupervised objectness-based object generation mechanism, which would make the approach completely unsupervised.
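
The k-means objective with cluster centers treated as learnable parameters can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `kmeans_loss_and_grad`, the plain gradient step, and the toy data are assumptions made for illustration, and the embedding network that the paper trains jointly is left out.

```python
import numpy as np

def kmeans_loss_and_grad(embeddings, centers):
    """Mean squared distance to the nearest center, plus its gradient w.r.t. the centers.

    embeddings: (N, D) array of feature vectors (held fixed in this sketch).
    centers:    (K, D) array of cluster centers, treated as learnable parameters.
    """
    # Pairwise squared distances, shape (N, K).
    diffs = embeddings[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Hard assignment of every point to its nearest center.
    assign = np.argmin(sq_dists, axis=1)
    n = embeddings.shape[0]
    loss = sq_dists[np.arange(n), assign].mean()
    # d(loss)/d(c_k) = (2 / N) * sum over points assigned to k of (c_k - x_i).
    grad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = embeddings[assign == k]
        if len(members) > 0:
            grad[k] = 2.0 * (len(members) * centers[k] - members.sum(axis=0)) / n
    return loss, grad, assign

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[1.0, 1.0], [9.0, 1.0]])

loss_before, grad, _ = kmeans_loss_and_grad(points, centers)
centers_after = centers - 0.5 * grad  # one gradient-descent step on the centers
loss_after, _, _ = kmeans_loss_and_grad(points, centers_after)
print(loss_before, loss_after)
```

In an end-to-end setting, the same loss would also be back-propagated into the network that produces the embeddings, which is the second contribution the abstract describes.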

* Camera-ready version for NIPS 2017 workshop Learning with Limited Labeled Data

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Apr 30, 2018
Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

* To appear in CVPR 2018. Check dataset page https://research.google.com/ava/ for details

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Apr 20, 2018
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, Rahul Sukthankar

We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge.

* Accepted in CVPR 2018

Beyond Skip Connections: Top-Down Modulation for Object Detection

Sep 19, 2017
Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, Abhinav Gupta

In recent years, we have seen tremendous progress in the field of object detection. Most of the recent improvements have been achieved by targeting deeper feedforward networks. However, many hard object categories such as bottle, remote, etc. require representation of fine details and not just coarse, semantic representations. But most of these fine details are lost in the early convolutional layers. What we need is a way to incorporate finer details from lower layers into the detection architecture. Skip connections have been proposed to combine high-level and low-level features, but we argue that selecting the right features from the low level requires top-down contextual information. Inspired by the human visual pathway, in this paper we propose top-down modulations as a way to incorporate fine details into the detection framework. Our approach supplements the standard bottom-up, feedforward ConvNet with a top-down modulation (TDM) network, connected using lateral connections. These connections are responsible for the modulation of lower layer filters, and the top-down network handles the selection and integration of contextual information and low-level features. The proposed TDM architecture provides a significant boost on the COCO test-dev benchmark, achieving 28.6 AP for VGG16, 35.2 AP for ResNet101, and 37.3 AP for the InceptionResNetv2 network, without any bells and whistles (e.g., multi-scale, iterative box refinement, etc.).

WebVision Challenge: Visual Learning and Understanding With Web Data

May 16, 2017
Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, Jesse Berent, Abhinav Gupta, Rahul Sukthankar, Luc Van Gool

We present the 2017 WebVision Challenge, a public image recognition challenge designed for deep learning based on web images without instance-level human annotation. Following the spirit of previous vision challenges, such as ILSVRC, Places2 and PASCAL VOC, which have played critical roles in the development of computer vision by contributing to the community with large scale annotated data for model designing and standardized benchmarking, we contribute with this challenge a large scale web images dataset, and a public competition with a workshop co-located with CVPR 2017. The WebVision dataset contains more than 2.4 million web images crawled from the Internet by using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information is also included. A validation set and test set containing human annotated images are also provided to facilitate algorithmic development. The 2017 WebVision challenge consists of two tracks, the image classification task on WebVision test set, and the transfer learning task on PASCAL VOC 2012 dataset. In this paper, we describe the details of data collection and annotation, highlight the characteristics of the dataset, and introduce the evaluation metrics.

* project page: http://www.vision.ee.ethz.ch/webvision/

Motion Prediction Under Multimodality with Conditional Stochastic Networks

May 05, 2017
Katerina Fragkiadaki, Jonathan Huang, Alex Alemi, Sudheendra Vijayanarasimhan, Susanna Ricco, Rahul Sukthankar

Given a visual history, multiple future outcomes for a video scene are equally probable; in other words, the distribution of future outcomes has multiple modes. Multimodality is notoriously hard to handle by standard regressors or classifiers: the former regress to the mean and the latter discretize a continuous high dimensional output space. In this work, we present stochastic neural network architectures that handle such multimodality through stochasticity: future trajectories of objects, body joints or frames are represented as deep, non-linear transformations of random (as opposed to deterministic) variables. Such random variables are sampled from simple Gaussian distributions whose means and variances are parametrized by the output of convolutional encoders over the visual history. We introduce novel convolutional architectures for predicting future body joint trajectories that outperform fully connected alternatives \cite{DBLP:journals/corr/WalkerDGH16}. We introduce stochastic spatial transformers through optical flow warping for predicting future frames, which outperform their deterministic equivalents \cite{DBLP:journals/corr/PatrauceanHC15}. Training stochastic networks involves an intractable marginalization over stochastic variables. We compare various training schemes that handle such marginalization through a) straightforward sampling from the prior, b) conditional variational autoencoders \cite{NIPS2015_5775,DBLP:journals/corr/WalkerDGH16}, and c) a proposed K-best-sample loss that penalizes the best prediction under a fixed "prediction budget". We show experimental results on object trajectory prediction, human body joint trajectory prediction and video prediction under varying future uncertainty, validating quantitatively and qualitatively our architectural choices and training schemes.
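
The K-best-sample idea mentioned in the abstract, penalizing only the best of K sampled predictions, might be sketched as follows. The squared-error distance, the function name `k_best_sample_loss`, and the toy numbers are assumptions for illustration; the abstract does not state which error measure the paper uses.

```python
import numpy as np

def k_best_sample_loss(predictions, target):
    """Return the error of the best of K sampled predictions and its index.

    predictions: (K, D) array, one row per sampled prediction (the "prediction budget" is K).
    target:      (D,) array, the observed future.
    Assumption: squared Euclidean error is used as the per-sample distance.
    """
    errors = np.sum((predictions - target) ** 2, axis=1)
    best = int(np.argmin(errors))
    return float(errors[best]), best

# Hypothetical example with a budget of K = 3 samples in 2-D.
samples = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
observed = np.array([1.0, 2.0])
loss, best_index = k_best_sample_loss(samples, observed)
print(loss, best_index)
```

Because only the closest sample contributes to the loss, the remaining samples are free to cover other modes of the future distribution instead of being pulled toward the mean.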

SfM-Net: Learning of Structure and Motion from Video

Apr 25, 2017
Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates. The model can be trained with various degrees of supervision: 1) self-supervised by the re-projection photometric error (completely unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations. It often successfully segments the moving objects in the scene, even though such supervision is never provided.
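
The step of converting depth and camera motion into a dense motion field can be illustrated with a standard pinhole-camera sketch. This is a simplified illustration, not SfM-Net itself: it assumes a known intrinsics matrix `K`, handles only camera motion (no per-object motion or segmentation), and the function name `rigid_flow` is hypothetical.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced by a rigid camera motion (R, t) for a given depth map.

    depth: (H, W) array of per-pixel depths in the first frame.
    K:     (3, 3) pinhole intrinsics matrix (assumed known here).
    R, t:  (3, 3) rotation and (3,) translation applied to the 3D points.
    Returns an (H, W, 2) array of (du, dv) pixel displacements.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixel coordinates
    rays = pix @ np.linalg.inv(K).T                    # back-project to unit-depth rays
    points = rays * depth[..., None]                   # 3D points in the first camera frame
    moved = points @ R.T + t                           # apply the rigid motion
    proj = moved @ K.T                                 # re-project into the image plane
    u2 = proj[..., 0] / proj[..., 2]
    v2 = proj[..., 1] / proj[..., 2]
    return np.stack([u2 - u, v2 - v], axis=-1)

# Hypothetical example: constant depth of 2, focal length 100, sideways shift of 0.1.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
flow = rigid_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(flow[0, 0])
```

Every operation above is differentiable with respect to depth and motion, which is what allows a photometric warping error to be back-propagated in the self-supervised setting the abstract describes.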

Cognitive Mapping and Planning for Visual Navigation

Apr 23, 2017
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as "go to a chair".

* To Appear at CVPR 2017. Project website with code, models, simulation environment and videos: https://sites.google.com/view/cognitive-mapping-and-planning/

Robust Adversarial Reinforcement Learning

Mar 08, 2017
Lerrel Pinto, James Davidson, Rahul Sukthankar, Abhinav Gupta

Deep neural networks coupled with fast simulation and improved computation have led to recent successes in the field of reinforcement learning (RL). However, most current RL-based approaches fail to generalize since: (a) the gap between simulation and the real world is so large that policy-learning approaches fail to transfer; (b) even if policy learning is done in the real world, the data scarcity leads to failed generalization from training to test scenarios (e.g., due to different friction or object masses). Inspired by H-infinity control methods, we note that both modeling errors and differences in training and test scenarios can be viewed as extra forces/disturbances in the system. This paper proposes the idea of robust adversarial reinforcement learning (RARL), where we train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. The jointly trained adversary is reinforced -- that is, it learns an optimal destabilization policy. We formulate the policy learning as a zero-sum, minimax objective function. Extensive experiments in multiple environments (InvertedPendulum, HalfCheetah, Swimmer, Hopper and Walker2d) conclusively demonstrate that our method (a) improves training stability; (b) is robust to differences in training/test conditions; and (c) outperforms the baseline even in the absence of the adversary.
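
A zero-sum, minimax objective of the kind the abstract mentions can be written in a generic two-player form. The notation below is an assumption for illustration rather than the paper's own: $\mu$ denotes the agent's policy, $\nu$ the adversary's policy, $r$ the agent's reward, and $T$ the episode horizon. The adversary receives $-r$, which is what makes the game zero-sum.

```latex
% Hypothetical notation: \mu is the agent's policy, \nu the adversary's policy,
% r the agent's reward (the adversary receives -r), and T the episode horizon.
\[
  \mu^{*} = \arg\max_{\mu} \, \min_{\nu} \;
  \mathbb{E}\!\left[ \sum_{t=0}^{T-1} r\!\left(s_t, a^{1}_t, a^{2}_t\right) \right],
  \qquad a^{1}_t \sim \mu(\cdot \mid s_t), \quad a^{2}_t \sim \nu(\cdot \mid s_t)
\]
```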

* 10 pages

Traffic Lights with Auction-Based Controllers: Algorithms and Real-World Data

Feb 03, 2017
Shumeet Baluja, Michele Covell, Rahul Sukthankar

Real-time optimization of traffic flow addresses important practical problems: reducing a driver's wasted time, improving city-wide efficiency, reducing gas emissions and improving air quality. Much of the current research in traffic-light optimization relies on extending the capabilities of traffic lights to either communicate with each other or communicate with vehicles. However, before such capabilities become ubiquitous, opportunities exist to improve traffic lights by being more responsive to current traffic situations within the current, already deployed, infrastructure. In this paper, we introduce a traffic light controller that employs bidding within micro-auctions to efficiently incorporate traffic sensor information; no other outside sources of information are assumed. We train and test traffic light controllers on large-scale data collected from opted-in Android cell-phone users over a period of several months in Mountain View, California and the River North neighborhood of Chicago, Illinois. The learned auction-based controllers surpass (in both the relevant metrics of road-capacity and mean travel time) the currently deployed lights, optimized static-program lights, and longer-term planning approaches, in both cities, measured using real user driving data.
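
The general shape of a micro-auction between signal phases can be sketched as below. The abstract does not say how bids are computed or learned, so everything here is a hypothetical illustration: the bid rule (queued vehicles plus a weighted waiting time), the sensor field names, and the function `run_micro_auction` are all assumptions, not the paper's controller.

```python
def run_micro_auction(phase_sensors, wait_weight=0.5):
    """Pick the signal phase with the highest bid.

    phase_sensors: dict mapping a phase name to its sensor readings, each a dict with
                   'queued' (vehicles detected) and 'waited_s' (seconds since last green).
    wait_weight:   hypothetical weight trading off queue length against waiting time.
    """
    bids = {}
    for phase, reading in phase_sensors.items():
        # Assumed bid rule: more queued vehicles and longer waits produce higher bids.
        bids[phase] = reading["queued"] + wait_weight * reading["waited_s"]
    winner = max(bids, key=bids.get)
    return winner, bids

# Hypothetical sensor snapshot at one intersection.
snapshot = {
    "north_south": {"queued": 4, "waited_s": 30},
    "east_west": {"queued": 6, "waited_s": 5},
}
winner, bids = run_micro_auction(snapshot)
print(winner, bids)
```

Only local sensor readings enter the bids in this sketch, which mirrors the abstract's constraint that no other outside sources of information are assumed.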
