
Tushar Nagarajan


EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Jan 05, 2023
Shuhan Tan, Tushar Nagarajan, Kristen Grauman

Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.

* Tech report. Project page: https://vision.cs.utexas.edu/projects/egodistill 
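
For a concrete sense of the distillation objective described in the abstract, here is a minimal PyTorch-style sketch: a lightweight student fuses sparse-frame features with IMU (head-motion) features and is trained to regress the heavy teacher's clip features. The encoder architectures, feature dimensions, and loss are illustrative placeholders, not the released EgoDistill implementation.

    # Hypothetical sketch: distill heavy clip features into a (frame + IMU) student.
    import torch
    import torch.nn as nn

    class DistillStudent(nn.Module):
        def __init__(self, frame_dim=2048, imu_dim=128, clip_dim=2304):
            super().__init__()
            self.frame_encoder = nn.Sequential(nn.Linear(frame_dim, 512), nn.ReLU())
            self.imu_encoder = nn.Sequential(nn.Linear(imu_dim, 512), nn.ReLU())
            self.fusion = nn.Linear(1024, clip_dim)   # reconstruct heavy clip feature

        def forward(self, frame_feat, imu_feat):
            z = torch.cat([self.frame_encoder(frame_feat),
                           self.imu_encoder(imu_feat)], dim=-1)
            return self.fusion(z)

    student = DistillStudent()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    frame_feat = torch.randn(8, 2048)    # sparse-frame features (placeholder)
    imu_feat = torch.randn(8, 128)       # head-motion IMU features (placeholder)
    teacher_feat = torch.randn(8, 2304)  # heavy video-model clip features (placeholder)

    pred = student(frame_feat, imu_feat)
    loss = nn.functional.mse_loss(pred, teacher_feat)  # feature-distillation loss
    loss.backward()
    opt.step()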

Egocentric scene context for human-centric environment understanding from video

Jul 22, 2022
Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and only capture what is directly seen. We present an approach that links egocentric video and camera pose over time by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings to facilitate human-centric environment understanding. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on real-world videos of house tours from unseen environments. We show that by grounding videos in their physical environment, our models surpass traditional scene classification models at predicting which room a camera-wearer is in (where frame-level information is insufficient), and can leverage this grounding to localize video moments corresponding to environment-centric queries, outperforming prior methods. Project page: http://vision.cs.utexas.edu/projects/ego-scene-context/
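
One way to picture the idea of predicting a camera-wearer's (possibly unseen) surroundings from video and pose is the sketch below: a recurrent model consumes per-frame visual features together with camera poses and predicts labels for the local surroundings (e.g., nearby room categories). The architecture, dimensions, and pose encoding are assumptions for illustration, not the paper's model.

    # Hedged sketch: video + pose -> prediction of the camera-wearer's surroundings.
    import torch
    import torch.nn as nn

    class SceneContextModel(nn.Module):
        def __init__(self, feat_dim=512, pose_dim=4, hidden=256, num_rooms=10):
            super().__init__()
            self.gru = nn.GRU(feat_dim + pose_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_rooms)  # labels for local surroundings

        def forward(self, frame_feats, poses):
            x = torch.cat([frame_feats, poses], dim=-1)  # (B, T, feat + pose)
            h, _ = self.gru(x)
            return self.head(h[:, -1])                   # prediction at the final step

    model = SceneContextModel()
    frame_feats = torch.randn(2, 16, 512)  # per-frame visual features (placeholder)
    poses = torch.randn(2, 16, 4)          # camera pose per frame (placeholder)
    room_logits = model(frame_feats, poses)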


Shaping embodied agent behavior with activity-context priors from egocentric video

Oct 14, 2021
Tushar Nagarajan, Kristen Grauman

Complex physical tasks entail a sequence of object interactions, each with its own preconditions -- which can be difficult for robotic agents to learn efficiently solely through their own experience. We introduce an approach to discover activity-context priors from in-the-wild egocentric video captured with human-worn cameras. For a given object, an activity-context prior represents the set of other compatible objects that are required for activities to succeed (e.g., a knife and cutting board brought together with a tomato are conducive to cutting). We encode our video-based prior as an auxiliary reward function that encourages an agent to bring compatible objects together before attempting an interaction. In this way, our model translates everyday human experience into embodied agent skills. We demonstrate our idea using egocentric EPIC-Kitchens video of people performing unscripted kitchen activities to benefit virtual household robotic agents performing various complex tasks in AI2-iTHOR, significantly accelerating agent learning. Project page: http://vision.cs.utexas.edu/projects/ego-rewards/
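
The auxiliary reward described above can be sketched as follows: a compatibility prior over object pairs (which the paper estimates from egocentric video; random numbers stand in for it here) rewards an agent for gathering compatible objects near the target object before attempting an interaction. Object names, the prior matrix, and the weighting are hypothetical.

    # Hedged sketch of an activity-context shaping reward from a video-derived prior.
    import numpy as np

    objects = ["knife", "cutting_board", "tomato", "mug"]
    idx = {o: i for i, o in enumerate(objects)}

    # prior[i, j]: how often object j accompanies successful interactions on object i
    # (would be estimated from egocentric video; random values here for illustration)
    rng = np.random.default_rng(0)
    prior = rng.random((len(objects), len(objects)))

    def activity_context_reward(target, nearby, weight=0.1):
        """Auxiliary reward for bringing objects compatible with `target` nearby."""
        scores = [prior[idx[target], idx[o]] for o in nearby if o != target]
        return weight * float(np.sum(scores))

    # Agent gathered a knife and cutting board before trying to cut the tomato:
    r_aux = activity_context_reward("tomato", ["knife", "cutting_board"])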


Ego4D: Around the World in 3,000 Hours of Egocentric Video

Oct 13, 2021
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/


Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

Apr 16, 2021
Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman

We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training results in models that benefit from both the scale and diversity of third-person video data, as well as representations that capture salient egocentric properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.

* Accepted by CVPR-2021 
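
A rough sketch of the pre-training recipe, under the assumption that the egocentric-specific signals are exposed as scalar teacher scores: auxiliary heads on a standard video backbone regress those scores alongside the usual third-person action classification loss. The toy backbone, head names, and loss weighting below are placeholders, not the Ego-Exo implementation.

    # Hedged sketch: third-person pre-training with auxiliary "egocentric cue" losses.
    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 512), nn.ReLU())
    action_head = nn.Linear(512, 400)        # standard third-person action labels
    ego_cue_heads = nn.ModuleDict({
        "hand_object": nn.Linear(512, 1),    # matches a hand-object teacher score
        "ego_motion": nn.Linear(512, 1),     # matches a camera-motion teacher score
    })

    def pretrain_loss(video, action_label, teacher_scores, alpha=0.5):
        feat = backbone(video)
        loss = nn.functional.cross_entropy(action_head(feat), action_label)
        for name, head in ego_cue_heads.items():
            loss = loss + alpha * nn.functional.mse_loss(
                head(feat).squeeze(-1), teacher_scores[name])
        return loss

    video = torch.randn(4, 3, 8, 32, 32)     # tiny clip tensor for illustration
    labels = torch.randint(0, 400, (4,))
    teacher = {"hand_object": torch.rand(4), "ego_motion": torch.rand(4)}
    loss = pretrain_loss(video, labels, teacher)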

Environment Predictive Coding for Embodied Agents

Feb 03, 2021
Santhosh K. Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out portions of an agent's trajectory and predict them from the unmasked portions, conditioned on the agent's camera poses. By learning such representations on a collection of videos, we demonstrate successful transfer to multiple downstream navigation-oriented tasks. Our experiments on the photorealistic 3D environments of Gibson and Matterport3D show that our method outperforms the state-of-the-art on challenging tasks with only a limited budget of experience.

* 9 pages, 6 figures, appendix 
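
The zone-prediction task can be pictured with the following stand-in: segments of a trajectory are masked out and a transformer, conditioned on camera poses, regresses the masked zone embeddings from the unmasked ones. The regression loss, masking pattern, and dimensions are illustrative simplifications of the paper's self-supervised objective.

    # Hedged sketch: predict masked trajectory "zones" from unmasked ones + poses.
    import torch
    import torch.nn as nn

    feat_dim, pose_dim, T = 128, 4, 12
    proj = nn.Linear(feat_dim + pose_dim, 128)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
        num_layers=2)
    predict = nn.Linear(128, feat_dim)

    zone_feats = torch.randn(2, T, feat_dim)  # per-zone visual embeddings (placeholder)
    poses = torch.randn(2, T, pose_dim)       # camera pose for each zone (placeholder)
    mask = torch.zeros(2, T, dtype=torch.bool)
    mask[:, 4:8] = True                       # zones to hide and predict

    inp = torch.cat([zone_feats.masked_fill(mask.unsqueeze(-1), 0.0), poses], dim=-1)
    pred = predict(encoder(proj(inp)))
    loss = nn.functional.mse_loss(pred[mask], zone_feats[mask])  # predict masked zones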

Differentiable Causal Discovery Under Unmeasured Confounding

Oct 14, 2020
Rohit Bhattacharya, Tushar Nagarajan, Daniel Malinsky, Ilya Shpitser

The data drawn from biological, economic, and social systems are often confounded due to the presence of unmeasured variables. Prior work in causal discovery has focused on discrete search procedures for selecting acyclic directed mixed graphs (ADMGs), specifically ancestral ADMGs, that encode ordinary conditional independence constraints among the observed variables of the system. However, confounded systems also exhibit more general equality restrictions that cannot be represented via these graphs, placing a limit on the kinds of structures that can be learned using ancestral ADMGs. In this work, we derive differentiable algebraic constraints that fully characterize the space of ancestral ADMGs, as well as more general classes of ADMGs, arid ADMGs and bow-free ADMGs, that capture all equality restrictions on the observed variables. We use these constraints to cast causal discovery as a continuous optimization problem and design differentiable procedures to find the best fitting ADMG when the data comes from a confounded linear system of equations with correlated errors. We demonstrate the efficacy of our method through simulations and application to a protein expression dataset.

* Main draft: 9 pages. Appendix: 9 pages 
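
To make the continuous-optimization formulation concrete, here is a heavily simplified sketch for the linear Gaussian case: the implied covariance Sigma(B, Omega) = (I - B)^{-T} Omega (I - B)^{-1} is fit to a sample covariance while penalizing cycles in the directed part B (a NOTEARS-style trace-exponential term) and penalizing any pair that carries both a directed and a bidirected edge (a bow-free penalty). The least-squares fit, fixed penalty weights, and plain gradient loop are placeholders for the paper's actual objective and solver.

    # Hedged sketch: ADMG discovery as penalized continuous optimization.
    import torch

    d = 4
    S = torch.eye(d) + 0.1 * torch.ones(d, d)   # placeholder sample covariance
    B = torch.zeros(d, d, requires_grad=True)   # directed edge coefficients
    L = torch.zeros(d, d, requires_grad=True)   # Omega = L @ L.T (correlated errors)
    opt = torch.optim.Adam([B, L], lr=1e-2)

    for _ in range(500):
        opt.zero_grad()
        omega = L @ L.T + 1e-3 * torch.eye(d)
        inv = torch.linalg.inv(torch.eye(d) - B)
        sigma = inv.T @ omega @ inv                          # implied covariance
        fit = ((sigma - S) ** 2).sum()
        acyclic = torch.trace(torch.matrix_exp(B * B)) - d   # NOTEARS-style penalty
        off = omega - torch.diag(torch.diag(omega))          # bidirected edges
        bow_free = ((B * off) ** 2).sum() + ((B.T * off) ** 2).sum()
        loss = fit + 10.0 * acyclic + 10.0 * bow_free
        loss.backward()
        opt.step()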

Learning Affordance Landscapes for Interaction Exploration in 3D Environments

Aug 21, 2020
Tushar Nagarajan, Kristen Grauman

Embodied agents operating in human spaces must be able to master how their environment works: what objects can the agent use, and how can it use them? We introduce a reinforcement learning approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new unmapped 3D environment (such as an unfamiliar kitchen). Given an egocentric RGB-D camera and a high-level action space, the agent is rewarded for maximizing successful interactions while simultaneously training an image-based affordance segmentation model. The former yields a policy for acting efficiently in new environments to prepare for downstream interaction tasks, while the latter yields a convolutional neural network that maps image regions to the likelihood they permit each action, densifying the rewards for exploration. We demonstrate our idea with AI2-iTHOR. The results show that agents can learn how to use new home environments intelligently, and that this exploration prepares them to rapidly address various downstream tasks like "find a knife and put it in the drawer." Project page: http://vision.cs.utexas.edu/projects/interaction-exploration/
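
As a rough sketch of how the two pieces can interact, the function below combines a sparse novelty bonus for first-time successful interactions with a dense shaping term taken from an online affordance model's confidence that the attempted action is possible in the current view. The affordance network, action space, and weights are hypothetical placeholders.

    # Hedged sketch: sparse interaction-novelty reward + dense affordance shaping.
    import torch
    import torch.nn as nn

    affordance_net = nn.Sequential(              # maps image features to per-action
        nn.Linear(512, 256), nn.ReLU(),          # "is this action possible here?"
        nn.Linear(256, 5), nn.Sigmoid())

    seen_interactions = set()

    def exploration_reward(obs_feat, action_id, success, obj_id, dense_weight=0.5):
        r = 0.0
        if success and (obj_id, action_id) not in seen_interactions:
            seen_interactions.add((obj_id, action_id))   # sparse novelty bonus
            r += 1.0
        with torch.no_grad():
            afford = affordance_net(obs_feat)[action_id] # densified shaping term
        return r + dense_weight * float(afford)

    obs_feat = torch.randn(512)                          # placeholder image feature
    r = exploration_reward(obs_feat, action_id=2, success=True, obj_id=7)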


EGO-TOPO: Environment Affordances from Egocentric Video

Jan 14, 2020
Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on their intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an ego-video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video.
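
A toy version of the topological-map construction: assign each frame to an existing zone when its feature is similar enough to that zone, otherwise open a new zone, and record transitions between consecutively visited zones. The cosine-similarity test below stands in for the paper's learned zone-localization step, and the threshold is an assumption.

    # Hedged sketch: build a zone graph from per-frame features of an ego-video.
    import torch

    def build_zone_graph(frame_feats, sim_thresh=0.8):
        zones, edges, visits = [], set(), []
        for f in frame_feats:
            f = torch.nn.functional.normalize(f, dim=0)
            sims = [float(f @ z) for z in zones]
            if sims and max(sims) > sim_thresh:
                zid = int(torch.tensor(sims).argmax())   # revisit an existing zone
            else:
                zones.append(f)                          # new zone discovered
                zid = len(zones) - 1
            if visits and visits[-1] != zid:
                edges.add((visits[-1], zid))             # transition between zones
            visits.append(zid)
        return zones, edges, visits                      # nodes, edges, visit sequence

    frame_feats = torch.randn(50, 256)                   # placeholder frame features
    zones, edges, visits = build_zone_graph(frame_feats)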


Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)

Jun 03, 2019
Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction "hotspots" directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching videos of real human behavior and anticipating afforded actions. Given a novel image or video, our model infers a spatial hotspot map indicating how an object would be manipulated in a potential interaction, even if the object is currently at rest. Through results with both first and third person video, we show the value of grounding affordances in real human-object interactions. Not only are our weakly supervised hotspots competitive with strongly supervised affordance methods, but they can also anticipate object interaction for novel object categories. Project page: http://vision.cs.utexas.edu/projects/interaction-hotspots/

* arXiv admin note: substantial text overlap with arXiv:1812.04558 
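
One simple way to ground an anticipated action on the image, in the spirit of the hotspot maps above, is a gradient-weighted activation map: back-propagate one action score to a convolutional feature map and combine channels by their gradient importance. The toy backbone and the Grad-CAM-style weighting below are assumptions for illustration rather than the paper's exact hotspot derivation.

    # Hedged sketch: gradient-weighted activation map for one anticipated action.
    import torch
    import torch.nn as nn

    conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
    classifier = nn.Linear(16, 10)                     # anticipated-action scores

    image = torch.randn(1, 3, 64, 64)
    fmap = conv(image)                                 # (1, 16, 64, 64) feature map
    fmap.retain_grad()
    scores = classifier(fmap.mean(dim=(2, 3)))         # global-average-pool + head
    scores[0, 3].backward()                            # gradient of one action score

    weights = fmap.grad.mean(dim=(2, 3), keepdim=True) # channel importance
    hotspot = torch.relu((weights * fmap).sum(dim=1))  # (1, 64, 64) hotspot map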