Jitendra Malik

A Zero-Shot Adaptive Quadcopter Controller

Sep 19, 2022
Dingqi Zhang, Antonio Loquercio, Xiangyu Wu, Ashish Kumar, Jitendra Malik, Mark W. Mueller

This paper proposes a universal adaptive controller for quadcopters, which can be deployed zero-shot to quadcopters of very different mass, arm lengths and motor constants, and also shows rapid adaptation to unknown disturbances during runtime. The core algorithmic idea is to learn a single policy that can adapt online at test time not only to the disturbances applied to the drone, but also to the robot dynamics and hardware in the same framework. We achieve this by training a neural network to estimate a latent representation of the robot and environment parameters, which is used to condition the behaviour of the controller, also represented as a neural network. We train both networks exclusively in simulation with the goal of flying the quadcopters to goal positions and avoiding crashes to the ground. We directly deploy the same controller trained in the simulation without any modifications on two quadcopters with differences in mass, inertia, and maximum motor speed of up to 4 times. In addition, we show rapid adaptation to sudden and large disturbances (up to 35.7%) in the mass and inertia of the quadcopters. We perform an extensive evaluation in both simulation and the physical world, where we outperform a state-of-the-art learning-based adaptive controller and a traditional PID controller specifically tuned to each platform individually. Video results can be found at https://dz298.github.io/universal-drone-controller/.

* Video results can be found on the project webpage https://dz298.github.io/universal-drone-controller/

Multi-skill Mobile Manipulation for Object Rearrangement

Sep 06, 2022
Jiayuan Gu, Devendra Singh Chaplot, Hao Su, Jitendra Malik

We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location where a stationary manipulation skill cannot reach its target to manipulate. To this end, we propose that the manipulation skills should include mobility to have flexibility in interacting with the target object from multiple locations and at the same time the navigation skill could have multiple end points which lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and training a navigation skill with region goals instead of point goals. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance as compared to the baselines.

* Project website: https://sites.google.com/view/hab-m3

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Jun 02, 2022
Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After reexamining the design choices for both the macro and micro-architecture of Conformer, we propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure, which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of feed-forward module, followed up by multi-head attention or convolution modules, instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depth-wise downsampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate on Librispeech test-other without external language models. This is 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.
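As a rough illustration of why reducing the sequence length lowers the cost of multi-head attention: the attention score computation grows with the square of the number of frames, so halving the frames cuts that term by about a factor of four. The downsampling factor of 2, the frame count, the model width, and the function names below are assumptions chosen for illustration, not values taken from the paper.

```python
def attention_score_flops(seq_len: int, dim: int) -> int:
    """Approximate multiply-adds for the score matrix plus the weighted sum over values."""
    return 2 * seq_len * seq_len * dim


def downsample(frames: list, factor: int = 2) -> list:
    """Keep every `factor`-th frame (a simple stand-in for a learned downsampling layer)."""
    return frames[::factor]


frames = list(range(1000))  # hypothetical 1000-frame utterance
dim = 256                   # hypothetical model width

full = attention_score_flops(len(frames), dim)
reduced = attention_score_flops(len(downsample(frames)), dim)
ratio = full / reduced      # how much cheaper attention is after downsampling
print(ratio)
```

Only the quadratic attention term is counted here; feed-forward and convolution modules scale linearly with sequence length and would shrink by a smaller factor.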

Adapting Rapid Motor Adaptation for Bipedal Robots

May 30, 2022
Ashish Kumar, Zhongyu Li, Jun Zeng, Deepak Pathak, Koushil Sreenath, Jitendra Malik

Recent advances in legged locomotion have enabled quadrupeds to walk on challenging terrains. However, bipedal robots are inherently more unstable and hence it's harder to design walking controllers for them. In this work, we leverage recent advances in rapid adaptation for locomotion control, and extend them to work on bipedal robots. Similar to existing works, we start with a base policy which produces actions while taking as input an estimated extrinsics vector from an adaptation module. This extrinsics vector contains information about the environment and enables the walking controller to rapidly adapt online. However, the extrinsics estimator could be imperfect, which might lead to poor performance of the base policy which expects a perfect estimator. In this paper, we propose A-RMA (Adapting RMA), which additionally adapts the base policy for the imperfect extrinsics estimator by finetuning it using model-free RL. We demonstrate that A-RMA outperforms a number of RL-based baseline controllers and model-based controllers in simulation, and show zero-shot deployment of a single A-RMA policy to enable a bipedal robot, Cassie, to walk in a variety of different scenarios in the real world beyond what it has seen during training. Videos and results at https://ashish-kmr.github.io/a-rma/

* First two authors contributed equally. Website at https://ashish-kmr.github.io/a-rma/

Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity

Apr 12, 2022
Weiyao Wang, Matt Feiszli, Heng Wang, Jitendra Malik, Du Tran

Open-world instance segmentation is the task of grouping pixels into object instances without any pre-determined taxonomy. This is challenging, as state-of-the-art methods rely on explicit class semantics obtained from large labeled datasets, and out-of-domain evaluation performance drops significantly. Here we propose a novel approach for mask proposals, Generic Grouping Networks (GGNs), constructed without semantic supervision. Our approach combines a local measure of pixel affinity with instance-level mask supervision, producing a training regimen designed to make the model as generic as the data diversity allows. We introduce a method for predicting Pairwise Affinities (PA), a learned local relationship between pairs of pixels. PA generalizes very well to unseen categories. From PA we construct a large set of pseudo-ground-truth instance masks; combined with human-annotated instance masks we train GGNs and significantly outperform the SOTA on open-world instance segmentation on various benchmarks including COCO, LVIS, ADE20K, and UVO. Code is available on the project website: https://sites.google.com/view/generic-grouping/.

* CVPR 2022

Masked Visual Pre-training for Motor Control

Mar 11, 2022
Tete Xiao, Ilija Radosavovic, Trevor Darrell, Jitendra Malik

This paper shows that self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels. We first train the visual representations by masked modeling of natural images. We then freeze the visual encoder and train neural network controllers on top with reinforcement learning. We do not perform any task-specific fine-tuning of the encoder; the same visual representations are used for all motor control tasks. To the best of our knowledge, this is the first self-supervised model to exploit real-world images at scale for motor control. To accelerate progress in learning from pixels, we contribute a benchmark suite of hand-designed tasks varying in movements, scenes, and robots. Without relying on labels, state-estimation, or expert demonstrations, we consistently outperform supervised encoders by up to 80% absolute success rate, sometimes even matching the oracle state performance. We also find that in-the-wild images, e.g., from YouTube or Egocentric videos, lead to better visual representations for various manipulation tasks than ImageNet images.

* Code and videos at: https://tetexiao.com/projects/mvp

Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging

Feb 10, 2022
Anastasios N Angelopoulos, Amit P Kohli, Stephen Bates, Michael I Jordan, Jitendra Malik, Thayer Alshaabi, Srigokul Upadhyayula, Yaniv Romano

Image-to-image regression is an important learning task, used frequently in biological imaging. Current algorithms, however, do not generally offer statistical guarantees that protect against a model's mistakes and hallucinations. To address this, we develop uncertainty quantification techniques with rigorous statistical guarantees for image-to-image regression problems. In particular, we show how to derive uncertainty intervals around each pixel that are guaranteed to contain the true value with a user-specified confidence probability. Our methods work in conjunction with any base machine learning model, such as a neural network, and endow it with formal mathematical guarantees -- regardless of the true unknown data distribution or choice of model. Furthermore, they are simple to implement and computationally inexpensive. We evaluate our procedure on three image-to-image regression tasks: quantitative phase microscopy, accelerated magnetic resonance imaging, and super-resolution transmission electron microscopy of a Drosophila melanogaster brain.

* Code available at https://github.com/aangelopoulos/im2im-uq
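The idea of per-pixel uncertainty intervals tuned to a user-specified error level can be sketched as a calibration loop: start from heuristic interval widths produced by a base model, then pick the smallest scaling factor for which the fraction of miscovered pixels on held-out calibration images falls below the target. This is a deliberately simplified sketch, not the released implementation: it uses the plain empirical miscoverage and omits the finite-sample confidence bound that a formal guarantee would require. All names, the grid of scaling factors, and the toy data are hypothetical.

```python
def miscoverage(pred, lower, upper, truth, lam):
    """Fraction of pixels whose true value lies outside [pred - lam*lower, pred + lam*upper]."""
    misses = 0
    for p, l, u, t in zip(pred, lower, upper, truth):
        if not (p - lam * l <= t <= p + lam * u):
            misses += 1
    return misses / len(truth)


def calibrate(cal_set, alpha, lams):
    """Return the smallest scaling factor whose mean miscoverage on cal_set is <= alpha."""
    for lam in sorted(lams):
        risk = sum(miscoverage(*img, lam) for img in cal_set) / len(cal_set)
        if risk <= alpha:
            return lam
    return None  # no factor in the grid is wide enough


# Each calibration image: (prediction, lower width, upper width, ground truth), flattened.
cal_set = [
    ([50, 20, 90, 40], [10] * 4, [10] * 4, [55, 45, 85, 40]),
    ([10, 70, 30, 60], [10] * 4, [10] * 4, [10, 90, 35, 50]),
]
grid = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

lam_loose = calibrate(cal_set, alpha=0.3, lams=grid)   # tolerate 30% miscovered pixels
lam_strict = calibrate(cal_set, alpha=0.1, lams=grid)  # tolerate 10% miscovered pixels
print(lam_loose, lam_strict)
```

A stricter error level forces wider intervals, which is the trade-off a user controls through the confidence setting.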

PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

Jan 25, 2022
Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, Kristen Grauman

State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'. Our key insight is that 'where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation. Experiments on Gibson and Matterport3D demonstrate that our method achieves the state-of-the-art for ObjectGoal navigation while incurring up to 1,600x less computational cost for training.

* 8 pages + appendix
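One way to picture how two potential functions could decide where to look is to score candidate map locations by a weighted sum of the two and pick the best one. The sketch below assumes this simple linear combination; the names `area_potential` and `object_potential`, the weight, and the toy values are hypothetical and stand in for the outputs of the trained network.

```python
def pick_goal(candidates, area_potential, object_potential, weight=0.5):
    """Return the candidate cell with the highest weighted sum of the two potentials."""
    def score(cell):
        return weight * area_potential[cell] + (1 - weight) * object_potential[cell]
    return max(candidates, key=score)


# Hypothetical candidate cells on a top-down map, with predicted potentials per cell.
candidates = [(0, 3), (4, 1), (2, 5)]
area_potential = {(0, 3): 0.9, (4, 1): 0.2, (2, 5): 0.4}    # how much unexplored space lies beyond
object_potential = {(0, 3): 0.1, (4, 1): 0.9, (2, 5): 0.5}  # how likely the target is nearby

balanced_goal = pick_goal(candidates, area_potential, object_potential)
explore_goal = pick_goal(candidates, area_potential, object_potential, weight=1.0)
print(balanced_goal, explore_goal)
```

The chosen cell would then be handed to the separate 'how to navigate to (x, y)?' component.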

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Jan 20, 2022
Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models will be made publicly available.

* Technical report
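The online processing strategy with cached memory can be sketched as a loop over clips that keeps a bounded cache of earlier features and lets each new clip look at that cache. This is a toy sketch under stated assumptions: `encode` is a hypothetical stand-in for a video encoder, a plain average replaces attention, and the memory length is arbitrary.

```python
from collections import deque


def encode(clip):
    """Hypothetical stand-in for a clip encoder: one scalar feature per clip."""
    return sum(clip) / len(clip)


def run_online(clips, memory_len=2):
    """Process clips one at a time, reusing cached features from earlier clips."""
    memory = deque(maxlen=memory_len)  # oldest cached feature is dropped automatically
    outputs = []
    for clip in clips:
        current = encode(clip)
        context = list(memory) + [current]           # cached context plus the current clip
        outputs.append(sum(context) / len(context))  # averaging stands in for attention
        memory.append(current)                       # cache for later iterations
    return outputs


clips = [[1, 1], [3, 3], [5, 5], [7, 7]]
outputs = run_online(clips)
print(outputs)
```

Each step encodes only one new clip, so the extra work per step is limited to reading the cache, while the effective temporal context spans several clips.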

Tracking People by Predicting 3D Appearance, Location & Pose

Dec 08, 2021
Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Jitendra Malik

In this paper, we present an approach for tracking people in monocular videos, by predicting their future 3D representations. To achieve this, we first lift people to 3D from a single frame in a robust way. This lifting includes information about the 3D pose of the person, his or her location in the 3D space, and the 3D appearance. As we track a person, we collect 3D observations over time in a tracklet representation. Given the 3D nature of our observations, we build temporal models for each one of the previous attributes. We use these models to predict the future state of the tracklet, including 3D location, 3D appearance, and 3D pose. For a future frame, we compute the similarity between the predicted state of a tracklet and the single frame observations in a probabilistic manner. Association is solved with simple Hungarian matching, and the matches are used to update the respective tracklets. We evaluate our approach on various benchmarks and report state-of-the-art results.

* Project page: https://brjathu.github.io/PHALP/
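The association step can be illustrated with a small cost matrix between predicted tracklet states and single-frame detections. The sketch below finds the minimum-cost one-to-one matching by brute force over permutations, which gives the same optimal assignment the Hungarian algorithm would return but is only practical for a handful of tracklets. The cost values are made up, and a square matrix (equal numbers of tracklets and detections) is assumed.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (total_cost, matches); matches[i] is the detection assigned to tracklet i."""
    n = len(cost)
    best = None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


# Hypothetical costs: rows are tracklets, columns are detections in the new frame.
# Lower cost means the detection is more similar to the tracklet's predicted state.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, matches = best_assignment(cost)
print(total, matches)
```

Each matched detection would then be used to update its tracklet before predicting the next frame.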
