Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chuang Gan

Retrospectives on the Embodied AI Workshop

Oct 17, 2022

Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi(+29 more)

Figure 1 for Retrospectives on the Embodied AI Workshop

Figure 2 for Retrospectives on the Embodied AI Workshop

Figure 3 for Retrospectives on the Embodied AI Workshop

Figure 4 for Retrospectives on the Embodied AI Workshop

Abstract:We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.

Via

Access Paper or Ask Questions

Revisiting the Roles of "Text" in Text Games

Oct 15, 2022

Yi Gu, Shunyu Yao, Chuang Gan, Joshua B. Tenenbaum, Mo Yu

Figure 1 for Revisiting the Roles of "Text" in Text Games

Figure 2 for Revisiting the Roles of "Text" in Text Games

Figure 3 for Revisiting the Roles of "Text" in Text Games

Figure 4 for Revisiting the Roles of "Text" in Text Games

Abstract:Text games present opportunities for natural language understanding (NLU) methods to tackle reinforcement learning (RL) challenges. However, recent work has questioned the necessity of NLU by showing random text hashes could perform decently. In this paper, we pursue a fine-grained investigation into the roles of text in the face of different RL challenges, and reconcile that semantic and non-semantic language representations could be complementary rather than contrasting. Concretely, we propose a simple scheme to extract relevant contextual information into an approximate state hash as extra input for an RNN-based text agent. Such a lightweight plug-in achieves competitive performance with state-of-the-art text agents using advanced NLU techniques such as knowledge graph and passage retrieval, suggesting non-NLU methods might suffice to tackle the challenge of partial observability. However, if we remove RNN encoders and use approximate or even ground-truth state hash alone, the model performs miserably, which confirms the importance of semantic function approximation to tackle the challenge of combinatorially large observation and action spaces. Our findings and analysis provide new insights for designing better text game task setups and agents.

Via

Access Paper or Ask Questions

Learning Active Camera for Multi-Object Navigation

Oct 14, 2022

Peihao Chen, Dongyu Ji, Kunyang Lin, Weiwen Hu, Wenbing Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

Figure 1 for Learning Active Camera for Multi-Object Navigation

Figure 2 for Learning Active Camera for Multi-Object Navigation

Figure 3 for Learning Active Camera for Multi-Object Navigation

Figure 4 for Learning Active Camera for Multi-Object Navigation

Abstract:Getting robots to navigate to multiple objects autonomously is essential yet difficult in robot applications. One of the key challenges is how to explore environments efficiently with camera sensors only. Existing navigation methods mainly focus on fixed cameras and few attempts have been made to navigate with active cameras. As a result, the agent may take a very long time to perceive the environment due to limited camera scope. In contrast, humans typically gain a larger field of view by looking around for a better perception of the environment. How to make robots perceive the environment as efficiently as humans is a fundamental problem in robotics. In this paper, we consider navigating to multiple objects more efficiently with active cameras. Specifically, we cast moving camera to a Markov Decision Process and reformulate the active camera problem as a reinforcement learning problem. However, we have to address two new challenges: 1) how to learn a good camera policy in complex environments and 2) how to coordinate it with the navigation policy. To address these, we carefully design a reward function to encourage the agent to explore more areas by moving camera actively. Moreover, we exploit human experience to infer a rule-based camera action to guide the learning process. Last, to better coordinate two kinds of policies, the camera policy takes navigation actions into account when making camera moving decisions. Experimental results show our camera policy consistently improves the performance of multi-object navigation over four baselines on two datasets.

* Accepted by NeurIPS 2022

Via

Access Paper or Ask Questions

Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

Oct 14, 2022

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan

Figure 1 for Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

Figure 2 for Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

Figure 3 for Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

Figure 4 for Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

Abstract:We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions. The instructions often contain descriptions of objects in the environment. To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects. However, enabling a robot to build a map that well represents the environment is extremely challenging as the environment often involves diverse objects with various attributes. In this paper, we propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively. Moreover, we propose a weakly-supervised auxiliary task, which requires the agent to localize instruction-relevant objects on the map. Through this task, the agent not only learns to localize the instruction-relevant objects for navigation but also is encouraged to learn a better map representation that reveals object information. We then feed the learned map and instruction to a waypoint predictor to determine the next navigation goal. Experimental results show our method outperforms the state-of-the-art by 4.0% and 4.6% w.r.t. success rate both in seen and unseen environments, respectively on VLN-CE dataset. Code is available at https://github.com/PeihaoChen/WS-MGMap.

* Accepted by NeurIPS 2022

Via

Access Paper or Ask Questions

Learning Physical Dynamics with Subequivariant Graph Neural Networks

Oct 13, 2022

Jiaqi Han, Wenbing Huang, Hengbo Ma, Jiachen Li, Joshua B. Tenenbaum, Chuang Gan

Figure 1 for Learning Physical Dynamics with Subequivariant Graph Neural Networks

Figure 2 for Learning Physical Dynamics with Subequivariant Graph Neural Networks

Figure 3 for Learning Physical Dynamics with Subequivariant Graph Neural Networks

Figure 4 for Learning Physical Dynamics with Subequivariant Graph Neural Networks

Abstract:Graph Neural Networks (GNNs) have become a prevailing tool for learning physical dynamics. However, they still encounter several challenges: 1) Physical laws abide by symmetry, which is a vital inductive bias accounting for model generalization and should be incorporated into the model design. Existing simulators either consider insufficient symmetry, or enforce excessive equivariance in practice when symmetry is partially broken by gravity. 2) Objects in the physical world possess diverse shapes, sizes, and properties, which should be appropriately processed by the model. To tackle these difficulties, we propose a novel backbone, Subequivariant Graph Neural Network, which 1) relaxes equivariance to subequivariance by considering external fields like gravity, where the universal approximation ability holds theoretically; 2) introduces a new subequivariant object-aware message passing for learning physical interactions between multiple objects of various shapes in the particle-based representation; 3) operates in a hierarchical fashion, allowing for modeling long-range and complex interactions. Our model achieves on average over 3% enhancement in contact prediction accuracy across 8 scenarios on Physion and 2X lower rollout MSE on RigidFall compared with state-of-the-art GNN simulators, while exhibiting strong generalization and data efficiency.

* Accepted by NeurIPS 2022

Via

Access Paper or Ask Questions

M$^3$Video: Masked Motion Modeling for Self-Supervised Video Representation Learning

Oct 12, 2022

Xinyu Sun, Peihao Chen, Liangwei Chen, Thomas H. Li, Mingkui Tan, Chuang Gan

Figure 1 for M$^3$Video: Masked Motion Modeling for Self-Supervised Video Representation Learning

Figure 2 for M$^3$Video: Masked Motion Modeling for Self-Supervised Video Representation Learning

Figure 3 for M$^3$Video: Masked Motion Modeling for Self-Supervised Video Representation Learning

Figure 4 for M$^3$Video: Masked Motion Modeling for Self-Supervised Video Representation Learning

Abstract:We study self-supervised video representation learning that seeks to learn video features from unlabeled videos, which is widely used for video analysis as labeling videos is labor-intensive. Current methods often mask some video regions and then train a model to reconstruct spatial information in these regions (e.g., original pixels). However, the model is easy to reconstruct this information by considering content in a single frame. As a result, it may neglect to learn the interactions between frames, which are critical for video analysis. In this paper, we present a new self-supervised learning task, called Masked Motion Modeling (M$^3$Video), for learning representation by enforcing the model to predict the motion of moving objects in the masked regions. To generate motion targets for this task, we track the objects using optical flow. The motion targets consist of position transitions and shape changes of the tracked objects, thus the model has to consider multiple frames comprehensively. Besides, to help the model capture fine-grained motion details, we enforce the model to predict trajectory motion targets in high temporal resolution based on a video in low temporal resolution. After pre-training using our M$^3$Video task, the model is able to anticipate fine-grained motion details even taking a sparsely sampled video as input. We conduct extensive experiments on four benchmark datasets. Remarkably, when doing pre-training with 400 epochs, we improve the accuracy from 67.6\% to 69.2\% and from 78.8\% to 79.7\% on Something-Something V2 and Kinetics-400 datasets, respectively.

Via

Access Paper or Ask Questions

MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning

Sep 17, 2022

Kefan Su, Siyuan Zhou, Chuang Gan, Xiangjun Wang, Zongqing Lu

Figure 1 for MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning

Figure 2 for MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning

Figure 3 for MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning

Figure 4 for MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning

Abstract:Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in decentralized learning. In the paper, we tackle the non-stationarity problem in the simplest and fundamental way and propose \textit{multi-agent alternate Q-learning} (MA2QL), where agents take turns to update their Q-functions by Q-learning. MA2QL is a \textit{minimalist} approach to fully decentralized cooperative MARL but is theoretically grounded. We prove that when each agent guarantees a $\varepsilon$-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL only requires minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL, despite such minimal changes.

Via

Access Paper or Ask Questions

Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation

Jul 29, 2022

Hongbin Lin, Yifan Zhang, Zhen Qiu, Shuaicheng Niu, Chuang Gan, Yanxia Liu, Mingkui Tan

Figure 1 for Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation

Figure 2 for Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation

Figure 3 for Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation

Figure 4 for Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation

Abstract:This paper studies a new, practical but challenging problem, called Class-Incremental Unsupervised Domain Adaptation (CI-UDA), where the labeled source domain contains all classes, but the classes in the unlabeled target domain increase sequentially. This problem is challenging due to two difficulties. First, source and target label sets are inconsistent at each time step, which makes it difficult to conduct accurate domain alignment. Second, previous target classes are unavailable in the current step, resulting in the forgetting of previous knowledge. To address this problem, we propose a novel Prototype-guided Continual Adaptation (ProCA) method, consisting of two solution strategies. 1) Label prototype identification: we identify target label prototypes by detecting shared classes with cumulative prediction probabilities of target samples. 2) Prototype-based alignment and replay: based on the identified label prototypes, we align both domains and enforce the model to retain previous knowledge. With these two strategies, ProCA is able to adapt the source model to a class-incremental unlabeled target domain effectively. Extensive experiments demonstrate the effectiveness and superiority of ProCA in resolving CI-UDA. The source code is available at https://github.com/Hongbin98/ProCA.git

* Accepted by ECCV 2022

Via

Access Paper or Ask Questions

On-Device Training Under 256KB Memory

Jul 14, 2022

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

Figure 1 for On-Device Training Under 256KB Memory

Figure 2 for On-Device Training Under 256KB Memory

Figure 3 for On-Device Training Under 256KB Memory

Figure 4 for On-Device Training Under 256KB Memory

Abstract:On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to mixed bit-precision and the lack of normalization; (2) the limited hardware resource (memory and computation) does not allow full backward computation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads the runtime auto-differentiation to compile time. Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB SRAM), using less than 1/100 of the memory of existing frameworks while matching the accuracy of cloud training+edge deployment for the tinyML application VWW. Our study enables IoT devices to not only perform inference but also continuously adapt to new data for on-device lifelong learning.

Via

Access Paper or Ask Questions

3D Concept Grounding on Neural Fields

Jul 13, 2022

Yining Hong, Yilun Du, Chunru Lin, Joshua B. Tenenbaum, Chuang Gan

Figure 1 for 3D Concept Grounding on Neural Fields

Figure 2 for 3D Concept Grounding on Neural Fields

Figure 3 for 3D Concept Grounding on Neural Fields

Figure 4 for 3D Concept Grounding on Neural Fields

Abstract:In this paper, we address the challenging problem of 3D concept grounding (i.e. segmenting and learning visual concepts) by looking at RGBD images and reasoning about paired questions and answers. Existing visual reasoning approaches typically utilize supervised methods to extract 2D segmentation masks on which concepts are grounded. In contrast, humans are capable of grounding concepts on the underlying 3D representation of images. However, traditionally inferred 3D representations (e.g., point clouds, voxelgrids, and meshes) cannot capture continuous 3D features flexibly, thus making it challenging to ground concepts to 3D regions based on the language description of the object being referred to. To address both issues, we propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts. Specifically, each 3D coordinate in a scene is represented as a high-dimensional descriptor. Concept grounding can then be performed by computing the similarity between the descriptor vector of a 3D coordinate and the vector embedding of a language concept, which enables segmentations and concept learning to be jointly learned on neural fields in a differentiable fashion. As a result, both 3D semantic and instance segmentations can emerge directly from question answering supervision using a set of defined neural operators on top of neural fields (e.g., filtering and counting). Experimental results show that our proposed framework outperforms unsupervised/language-mediated segmentation models on semantic and instance segmentation tasks, as well as outperforms existing models on the challenging 3D aware visual reasoning tasks. Furthermore, our framework can generalize well to unseen shape categories and real scans.

* Project page: http://3d-cg.csail.mit.edu

Via

Access Paper or Ask Questions