Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jitendra Malik

Humans in 4D: Reconstructing and Tracking Humans with Transformers

May 31, 2023

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, Jitendra Malik

Figure 1 for Humans in 4D: Reconstructing and Tracking Humans with Transformers

Figure 2 for Humans in 4D: Reconstructing and Tracking Humans with Transformers

Figure 3 for Humans in 4D: Reconstructing and Tracking Humans with Transformers

Figure 4 for Humans in 4D: Reconstructing and Tracking Humans with Transformers

Abstract:We present an approach to reconstruct humans and track them over time. At the core of our approach, we propose a fully "transformerized" version of a network for human mesh recovery. This network, HMR 2.0, advances the state of the art and shows the capability to analyze unusual poses that have in the past been difficult to reconstruct from single images. To analyze video, we use 3D reconstructions from HMR 2.0 as input to a tracking system that operates in 3D. This enables us to deal with multiple people and maintain identities through occlusion events. Our complete approach, 4DHumans, achieves state-of-the-art results for tracking people from monocular video. Furthermore, we demonstrate the effectiveness of HMR 2.0 on the downstream task of action recognition, achieving significant improvements over previous pose-based action recognition approaches. Our code and models are available on the project website: https://shubham-goel.github.io/4dhumans/.

* Project Webpage: https://shubham-goel.github.io/4dhumans/

Via

Access Paper or Ask Questions

More Than an Arm: Using a Manipulator as a Tail for Enhanced Stability in Legged Locomotion

May 02, 2023

Huang Huang, Antonio Loquercio, Ashish Kumar, Neerja Thakkar, Ken Goldberg, Jitendra Malik

Figure 1 for More Than an Arm: Using a Manipulator as a Tail for Enhanced Stability in Legged Locomotion

Figure 2 for More Than an Arm: Using a Manipulator as a Tail for Enhanced Stability in Legged Locomotion

Figure 3 for More Than an Arm: Using a Manipulator as a Tail for Enhanced Stability in Legged Locomotion

Figure 4 for More Than an Arm: Using a Manipulator as a Tail for Enhanced Stability in Legged Locomotion

Abstract:Is a manipulator on a legged robot a liability or an asset for locomotion? Prior works mainly designed specific controllers to account for the added payload and inertia from a manipulator. In contrast, biological systems typically benefit from additional limbs, which can simplify postural control. For instance, cats use their tails to enhance the stability of their bodies and prevent falls under disturbances. In this work, we show that a manipulator can be an important asset for maintaining balance during locomotion. To do so, we train a sensorimotor policy using deep reinforcement learning to create a synergy between the robot's limbs. This policy enables the robot to maintain stability despite large disturbances. However, learning such a controller can be quite challenging. To account for these challenges, we propose a stage-wise training procedure to learn complex behaviors. Our proposed method decomposes this complex task into three stages and then incrementally learns these tasks to arrive at a single policy capable of solving the final control task, achieving a success rate up to 2.35 times higher than baselines in simulation. We deploy our learned policy in the real world and show stability during locomotion under strong disturbances.

Via

Access Paper or Ask Questions

Navigating to Objects Specified by Images

Apr 03, 2023

Jacob Krantz, Theophile Gervet, Karmesh Yadav, Austin Wang, Chris Paxton, Roozbeh Mottaghi, Dhruv Batra, Jitendra Malik, Stefan Lee, Devendra Singh Chaplot

Figure 1 for Navigating to Objects Specified by Images

Figure 2 for Navigating to Objects Specified by Images

Figure 3 for Navigating to Objects Specified by Images

Figure 4 for Navigating to Objects Specified by Images

Abstract:Images are a convenient way to specify which particular object instance an embodied agent should navigate to. Solving this task requires semantic visual reasoning and exploration of unknown environments. We present a system that can perform this task in both simulation and the real world. Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation. We re-identify the goal instance in egocentric vision using feature-matching and localize the goal instance by projecting matched features to a map. Each sub-task is solved using off-the-shelf components requiring zero fine-tuning. On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy 7x and a state-of-the-art ImageNav model 2.3x (56% vs 25% success). We deploy this system to a mobile robot platform and demonstrate effective real-world performance, achieving an 88% success rate across a home and an office environment.

Via

Access Paper or Ask Questions

On the Benefits of 3D Pose and Tracking for Human Action Recognition

Apr 03, 2023

Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, Jitendra Malik

Abstract:In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view on analysing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stand allows us to use the tracklets of people to predict their actions. In this spirit, first we show the benefits of using 3D pose to infer actions, and study person-person interactions. Subsequently, we propose a Lagrangian Action Recognition model by fusing 3D pose and contextualized appearance over tracklets. To this end, our method achieves state-of-the-art performance on the AVA v2.2 dataset on both pose only settings and on standard benchmark settings. When reasoning about the action using only pose cues, our pose model achieves +10.0 mAP gain over the corresponding state-of-the-art while our fused model has a gain of +2.8 mAP over the best state-of-the-art model. Code and results are available at: https://brjathu.github.io/LART

* CVPR2023 (project page: https://brjathu.github.io/LART)

Via

Access Paper or Ask Questions

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Mar 31, 2023

Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik(+5 more)

Figure 1 for Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Figure 2 for Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Figure 3 for Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Figure 4 for Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Abstract:We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving competitive or superior performance than the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and can be found on our website for the benefit of the research community.

* Project website: https://eai-vc.github.io

Via

Access Paper or Ask Questions

Decoupling Human and Camera Motion from Videos in the Wild

Mar 20, 2023

Vickie Ye, Georgios Pavlakos, Jitendra Malik, Angjoo Kanazawa

Figure 1 for Decoupling Human and Camera Motion from Videos in the Wild

Figure 2 for Decoupling Human and Camera Motion from Videos in the Wild

Figure 3 for Decoupling Human and Camera Motion from Videos in the Wild

Figure 4 for Decoupling Human and Camera Motion from Videos in the Wild

Abstract:We propose a method to reconstruct global human trajectories from videos in the wild. Our optimization method decouples the camera and human motion, which allows us to place people in the same world coordinate frame. Most existing methods do not model the camera motion; methods that rely on the background pixels to infer 3D human motion usually require a full scene reconstruction, which is often not possible for in-the-wild videos. However, even when existing SLAM systems cannot recover accurate scene reconstructions, the background pixel motion still provides enough signal to constrain the camera motion. We show that relative camera estimates along with data-driven human motion priors can resolve the scene scale ambiguity and recover global human trajectories. Our method robustly recovers the global 3D trajectories of people in challenging in-the-wild videos, such as PoseTrack. We quantify our improvement over existing methods on 3D human dataset Egobody. We further demonstrate that our recovered camera scale allows us to reason about motion of multiple people in a shared coordinate frame, which improves performance of downstream tracking in PoseTrack. Code and video results can be found at https://vye16.github.io/slahmr.

* Project site: https://vye16.github.io/slahmr. CVPR 2023

Via

Access Paper or Ask Questions

Learning Humanoid Locomotion with Transformers

Mar 06, 2023

Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, Koushil Sreenath

Abstract:We present a sim-to-real learning-based approach for real-world humanoid locomotion. Our controller is a causal Transformer trained by autoregressive prediction of future actions from the history of observations and actions. We hypothesize that the observation-action history contains useful information about the world that a powerful Transformer model can use to adapt its behavior in-context, without updating its weights. We do not use state estimation, dynamics models, trajectory optimization, reference trajectories, or pre-computed gait libraries. Our controller is trained with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation and deployed to the real world in a zero-shot fashion. We evaluate our approach in high-fidelity simulation and successfully deploy it to the real robot as well. To the best of our knowledge, this is the first demonstration of a fully learning-based method for real-world full-sized humanoid locomotion.

* Project page: https://humanoid-transformer.github.io

Via

Access Paper or Ask Questions

Big Little Transformer Decoder

Feb 15, 2023

Sehoon Kim, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

Abstract:The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment, and which makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to review and correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, summarization on CNN/DailyMail, and language modeling on WikiText-2. On an NVIDIA Titan Xp GPU, our framework achieves a speedup of up to 2.13x without any performance drop, and it achieves up to 2.38x speedup with only ~1 point degradation. Furthermore, our framework is fully plug-and-play as it does not require any training or modifications to model architectures. Our code will be open-sourced.

Via

Access Paper or Ask Questions

Reversible Vision Transformers

Feb 09, 2023

Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

Figure 1 for Reversible Vision Transformers

Figure 2 for Reversible Vision Transformers

Figure 3 for Reversible Vision Transformers

Figure 4 for Reversible Vision Transformers

Abstract:We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 2.3x over their non-reversible counterparts. Full code and trained models are available at https://github.com/facebookresearch/slowfast. A simpler, easy to understand and modify version is also available at https://github.com/karttikeya/minREV

* Oral at CVPR 2022, updated version

Via

Access Paper or Ask Questions

Multiview Compressive Coding for 3D Reconstruction

Jan 19, 2023

Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

Figure 1 for Multiview Compressive Coding for 3D Reconstruction

Figure 2 for Multiview Compressive Coding for 3D Reconstruction

Figure 3 for Multiview Compressive Coding for 3D Reconstruction

Figure 4 for Multiview Compressive Coding for 3D Reconstruction

Abstract:A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models and category-specific priors which hinder scaling to novel settings. In this work, we explore single-view 3D reconstruction by learning generalizable representations inspired by advances in self-supervised learning. We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry to predict the 3D structure by querying a 3D-aware decoder. MCC's generality and efficiency allow it to learn from large-scale and diverse data sources with strong generalization to novel objects imagined by DALL$\cdot$E 2 or captured in-the-wild with an iPhone.

* Project page: https://mcc3d.github.io/

Via

Access Paper or Ask Questions