Wentao Yuan

University of Washington

M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

Nov 02, 2023
Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox

With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulty generalizing to out-of-distribution objects due to the limited capabilities of the low-level action primitives they rely on. In contrast, existing task-specific models excel at low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model that reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on a real robot, outperforming a baseline system built from state-of-the-art task-specific models by about 19% in overall performance and 37.5% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language-conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both the real world and simulation are available on our project website https://m2-t2.github.io.

* 12 pages, 8 figures, accepted by CoRL 2023 
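
To make the contact-reasoning idea above concrete, here is a minimal sketch of a masked-transformer-style contact head, written under assumptions rather than from the released model: it presumes a point-cloud backbone that yields per-point features, uses hypothetical learned query tokens (one group per action mode, e.g. pick vs. place), and reads out per-point contact-mask logits as query/point dot products.

```python
import torch

# Sketch of a masked-transformer contact head (assumptions only, not the released M2T2 model).
class ContactMaskHead(torch.nn.Module):
    def __init__(self, dim=128, queries_per_mode=8, num_modes=2):
        super().__init__()
        # One group of learned queries per action mode (e.g. pick vs. place) -- hypothetical sizes.
        self.queries = torch.nn.Parameter(torch.randn(num_modes * queries_per_mode, dim))
        self.decoder = torch.nn.TransformerDecoder(
            torch.nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

    def forward(self, point_feats):
        """point_feats: (B, N, C) per-point features from a point-cloud backbone."""
        q = self.queries.unsqueeze(0).expand(point_feats.shape[0], -1, -1)
        q = self.decoder(q, point_feats)                      # queries attend to the scene
        return torch.einsum('bqc,bnc->bqn', q, point_feats)   # per-query, per-point contact-mask logits

# Toy usage with random features standing in for the backbone output.
feats = torch.randn(2, 1024, 128)
mask_logits = ContactMaskHead()(feats)                        # (2, 16, 1024)
```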

Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning

Oct 22, 2023
Chahyon Ku, Carl Winge, Ryan Diaz, Wentao Yuan, Karthik Desingh

This paper primarily focuses on evaluating and benchmarking the robustness of visual representations in the context of object assembly tasks. Specifically, it investigates the alignment and insertion of objects with geometrical extrusions and intrusions, commonly referred to as a peg-in-hole task. The accuracy required to detect and orient the peg and the hole geometry in SE(3) space for successful assembly poses significant challenges. Addressing this, we employ a general framework in visuomotor policy learning that utilizes visual pretraining models as vision encoders. Our study investigates the robustness of this framework when applied to a dual-arm manipulation setup, specifically with respect to grasp variations. Our quantitative analysis shows that existing pretrained models fail to capture the essential visual features necessary for this task. However, a visual encoder trained from scratch consistently outperforms the frozen pretrained models. Moreover, we discuss rotation representations and associated loss functions that substantially improve policy learning. We present a novel task scenario designed to evaluate the progress in visuomotor policy learning, with a specific focus on improving the robustness of intricate assembly tasks that require both geometrical and spatial reasoning. Videos, additional experiments, dataset, and code are available at https://bit.ly/geometric-peg-in-hole .
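
As one concrete example of the rotation-representation point above, the sketch below shows a continuous 6D rotation parameterization mapped to a rotation matrix by Gram-Schmidt, paired with a geodesic-distance loss; this is an illustrative choice on our part, not necessarily the exact representation or loss used in the paper.

```python
import torch

# Illustrative 6D rotation representation + geodesic loss (assumptions, not the paper's code).
def sixd_to_matrix(x):
    """x: (B, 6) -> (B, 3, 3) rotation matrices via Gram-Schmidt orthogonalization."""
    a1, a2 = x[:, :3], x[:, 3:]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)

def geodesic_loss(R_pred, R_gt, eps=1e-7):
    """Mean rotation angle between predicted and ground-truth rotations."""
    cos = ((R_pred @ R_gt.transpose(-1, -2)).diagonal(dim1=-2, dim2=-1).sum(-1) - 1) / 2
    return torch.acos(cos.clamp(-1 + eps, 1 - eps)).mean()

# Toy usage: compare random 6D predictions against the identity rotation.
R_pred = sixd_to_matrix(torch.randn(4, 6))
loss = geodesic_loss(R_pred, torch.eye(3).expand(4, 3, 3))
```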

TerrainNet: Visual Modeling of Complex Terrain for High-speed, Off-road Navigation

Mar 28, 2023
Xiangyun Meng, Nathan Hatch, Alexander Lambert, Anqi Li, Nolan Wagener, Matthew Schmittle, JoonHo Lee, Wentao Yuan, Zoey Chen, Samuel Deng, Greg Okopal, Dieter Fox, Byron Boots, Amirreza Shaban

Effective use of camera-based vision systems is essential for robust performance in autonomous off-road driving, particularly in the high-speed regime. Despite success in structured, on-road settings, current end-to-end approaches for scene prediction have yet to be successfully adapted for complex outdoor terrain. To this end, we present TerrainNet, a vision-based terrain perception system for semantic and geometric terrain prediction for aggressive, off-road navigation. The approach relies on several key insights and practical considerations for achieving reliable terrain modeling. The network includes a multi-headed output representation to capture fine- and coarse-grained terrain features necessary for estimating traversability. Accurate depth estimation is achieved using self-supervised depth completion with multi-view RGB and stereo inputs. Requirements for real-time performance and fast inference speeds are met using efficient, learned image feature projections. Furthermore, the model is trained on a large-scale, real-world off-road dataset collected across a variety of diverse outdoor environments. We show how TerrainNet can also be used for costmap prediction and provide a detailed framework for integration into a planning module. We demonstrate the performance of TerrainNet through extensive comparison to current state-of-the-art baselines for camera-only scene prediction. Finally, we showcase the effectiveness of integrating TerrainNet within a complete autonomous-driving stack by conducting a real-world vehicle test in a challenging off-road scenario.
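
The multi-headed output mentioned above can be pictured with a small sketch: a shared bird's-eye-view feature map decoded by separate heads, e.g. per-cell semantic classes alongside per-cell elevation. The layer sizes and head choices below are illustrative assumptions, not TerrainNet's actual architecture.

```python
import torch

# Assumption-level sketch of a multi-headed bird's-eye-view (BEV) decoder.
class MultiHeadBEVDecoder(torch.nn.Module):
    def __init__(self, in_ch=64, num_classes=8):
        super().__init__()
        self.semantic = torch.nn.Conv2d(in_ch, num_classes, 1)  # coarse-grained: terrain class per cell
        self.elevation = torch.nn.Conv2d(in_ch, 2, 1)           # fine-grained: e.g. ground / ceiling height per cell

    def forward(self, bev_feats):
        return self.semantic(bev_feats), self.elevation(bev_feats)

# Toy usage on a 256x256 BEV grid of features.
bev = torch.randn(1, 64, 256, 256)
sem_logits, elev = MultiHeadBEVDecoder()(bev)
```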

KD-MVS: Knowledge Distillation Based Self-supervised Learning for MVS

Jul 21, 2022
Yikang Ding, Qingtian Zhu, Xiangyue Liu, Wentao Yuan, Haotian Zhang, Chi Zhang

Supervised multi-view stereo (MVS) methods have achieved remarkable progress in terms of reconstruction quality, but suffer from the challenge of collecting large-scale ground-truth depth. In this paper, we propose a novel self-supervised training pipeline for MVS based on knowledge distillation, termed KD-MVS, which mainly consists of self-supervised teacher training and distillation-based student training. Specifically, the teacher model is trained in a self-supervised fashion using both photometric and featuremetric consistency. Then we distill the knowledge of the teacher model to the student model through probabilistic knowledge transfer. With the supervision of validated knowledge, the student model is able to outperform its teacher by a large margin. Extensive experiments performed on multiple datasets show that our method can even outperform supervised methods.
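
A minimal sketch of the distillation step, under our own assumptions rather than the paper's code: the teacher's per-pixel probability over depth hypotheses supervises the student's probability volume through a masked KL term, where the mask keeps only pixels whose teacher depth passed validation (e.g. cross-view consistency).

```python
import torch
import torch.nn.functional as F

# Assumption-level sketch of probabilistic teacher-to-student distillation for MVS.
def distillation_loss(student_prob, teacher_prob, valid_mask, eps=1e-6):
    """student_prob, teacher_prob: (B, D, H, W) probability volumes over D depth bins.
    valid_mask: (B, 1, H, W) mask of pixels whose teacher depth was validated."""
    kl = (teacher_prob * (torch.log(teacher_prob + eps) - torch.log(student_prob + eps))).sum(dim=1, keepdim=True)
    return (kl * valid_mask).sum() / valid_mask.sum().clamp(min=1)

# Toy usage with random volumes standing in for network outputs.
B, D, H, W = 2, 48, 32, 40
student_prob = F.softmax(torch.randn(B, D, H, W), dim=1)
teacher_prob = F.softmax(torch.randn(B, D, H, W), dim=1)
valid_mask = (torch.rand(B, 1, H, W) > 0.3).float()
loss = distillation_loss(student_prob, teacher_prob, valid_mask)
```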

Sobolev Training for Implicit Neural Representations with Approximated Image Derivatives

Jul 21, 2022
Wentao Yuan, Qingtian Zhu, Xiangyue Liu, Yikang Ding, Haotian Zhang, Chi Zhang

Recently, Implicit Neural Representations (INRs) parameterized by neural networks have emerged as a powerful and promising tool for representing different kinds of signals due to their continuous, differentiable properties, showing advantages over classical discretized representations. However, the training of neural networks for INRs only utilizes input-output pairs, and the derivatives of the target output with respect to the input, which can be accessed in some cases, are usually ignored. In this paper, we propose a training paradigm for INRs whose target output is image pixels, which encodes image derivatives in addition to image values in the neural network. Specifically, we use finite differences to approximate image derivatives. We show how the training paradigm can be leveraged to solve typical INR problems, i.e., image regression and inverse rendering, and demonstrate that this training paradigm can improve the data efficiency and generalization capabilities of INRs. The code of our method is available at https://github.com/megvii-research/Sobolev_INRs.
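
The following is a minimal, self-contained sketch of the idea for a 2D image INR: the network's derivatives with respect to its input coordinates, obtained by autograd, are supervised with central-difference approximations of the image derivatives. The architecture, loss weight, and optimizer settings are illustrative assumptions, not the released configuration.

```python
import torch

# Sketch of Sobolev-style training for an image INR (coordinates -> intensity); illustrative only.
def finite_diff(img):
    """Central-difference approximations of d(img)/dx (width) and d(img)/dy (height)."""
    dx = (img[:, 2:] - img[:, :-2]) / 2.0
    dy = (img[2:, :] - img[:-2, :]) / 2.0
    return dx[1:-1, :], dy[:, 1:-1]          # both cropped to the 62x62 interior

inr = torch.nn.Sequential(
    torch.nn.Linear(2, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

img = torch.rand(64, 64)                     # stand-in for the target image
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64),
                        torch.linspace(-1, 1, 64), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).requires_grad_(True)

opt = torch.optim.Adam(inr.parameters(), lr=1e-4)
for step in range(200):
    pred = inr(coords).reshape(64, 64)
    # Network derivatives w.r.t. input coordinates via autograd, supervised by the
    # finite-difference image derivatives (the constant scale between pixel units and
    # normalized coordinates is absorbed into the loss weight).
    grads = torch.autograd.grad(pred.sum(), coords, create_graph=True)[0]
    gx = grads[:, 0].reshape(64, 64)[1:-1, 1:-1]
    gy = grads[:, 1].reshape(64, 64)[1:-1, 1:-1]
    tx, ty = finite_diff(img)
    loss = ((pred - img) ** 2).mean() \
        + 0.1 * (((gx - tx) ** 2).mean() + ((gy - ty) ** 2).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
```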

TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers

Nov 29, 2021
Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, Xiao Liu

In this paper, we present TransMVSNet, based on our exploration of feature matching in multi-view stereo (MVS). We cast MVS back to its nature as a feature matching task and therefore propose a powerful Feature Matching Transformer (FMT) that leverages intra- (self-) and inter- (cross-) attention to aggregate long-range context information within and across images. To facilitate a better adaptation of the FMT, we leverage an Adaptive Receptive Field (ARF) module to ensure a smooth transition in the scope of features, and bridge different stages with a feature pathway to pass transformed features and gradients across different scales. In addition, we apply pair-wise feature correlation to measure similarity between features, and adopt an ambiguity-reducing focal loss to strengthen the supervision. To the best of our knowledge, TransMVSNet is the first attempt to leverage Transformers for the task of MVS. As a result, our method achieves state-of-the-art performance on the DTU dataset, the Tanks and Temples benchmark, and the BlendedMVS dataset. The code of our method will be made available at https://github.com/MegviiRobot/TransMVSNet .
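
The core intra-/inter-attention idea can be sketched as follows (an illustration with standard PyTorch attention layers, not the released FMT): reference and source feature maps are flattened into token sequences, self-attention aggregates context within each image, cross-attention exchanges context between them, and pair-wise feature correlation is computed on the transformed features.

```python
import torch

# Assumption-level sketch of alternating intra- and inter-attention for feature matching.
class MatchingBlock(torch.nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ref, src):
        ref = ref + self.self_attn(ref, ref, ref)[0]    # intra-attention within the reference image
        src = src + self.self_attn(src, src, src)[0]    # intra-attention within the source image
        ref = ref + self.cross_attn(ref, src, src)[0]   # inter-attention: reference queries source
        src = src + self.cross_attn(src, ref, ref)[0]   # inter-attention: source queries reference
        return ref, src

# Toy usage: (B, H*W, C) token sequences from two 32x32 feature maps.
B, HW, C = 1, 32 * 32, 64
ref, src = torch.randn(B, HW, C), torch.randn(B, HW, C)
ref, src = MatchingBlock(C)(ref, src)
corr = torch.einsum('bnc,bmc->bnm', ref, src)           # pair-wise feature correlation
```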

SORNet: Spatial Object-Centric Representations for Sequential Manipulation

Sep 08, 2021
Wentao Yuan, Chris Paxton, Karthik Desingh, Dieter Fox

Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state, where the ability to reason about spatial relationships among object entities from raw sensor inputs is crucial. Prior works relying on explicit state estimation or end-to-end learning struggle with novel objects. In this work, we propose SORNet (Spatial Object-Centric Representation Network), which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest. We show that the object embeddings learned by SORNet generalize zero-shot to unseen object entities on three spatial reasoning tasks: spatial relationship classification, skill precondition classification, and relative direction regression, significantly outperforming baselines. Further, we present real-world robotic experiments demonstrating the use of the learned object embeddings in task planning for sequential manipulation.
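
A minimal sketch of the readout side of this idea, with names and sizes that are our own assumptions: given per-object embeddings from the encoder, a small MLP head over pairs of embeddings predicts spatial relations, so the same head can be applied zero-shot to new objects simply by supplying their canonical views to the encoder.

```python
import torch

# Assumption-level sketch of a pairwise relation readout over object-centric embeddings.
class RelationHead(torch.nn.Module):
    def __init__(self, dim=128, num_relations=4):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, num_relations),
        )

    def forward(self, emb_a, emb_b):
        return self.mlp(torch.cat([emb_a, emb_b], dim=-1))   # logits for relations between objects a and b

# Toy usage with random embeddings standing in for the encoder output.
emb = torch.randn(2, 5, 128)                                 # batch of 2 scenes, 5 query objects each
logits = RelationHead()(emb[:, 0], emb[:, 1])                # e.g. "is object 0 left of object 1?"
```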

STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering

Dec 22, 2020
Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, Steven Lovegrove

We present STaR, a novel method that performs Self-supervised Tracking and Reconstruction of dynamic scenes with rigid motion from multi-view RGB videos without any manual annotation. Recent work has shown that neural networks are surprisingly effective at the task of compressing many views of a scene into a learned function which maps from a viewing ray to an observed radiance value via volume rendering. Unfortunately, these methods lose all their predictive power once any object in the scene has moved. In this work, we explicitly model the rigid motion of objects in the context of neural representations of radiance fields. We show that, without any additional human-specified supervision, we can reconstruct a dynamic scene with a single rigid object in motion by simultaneously decomposing it into its two constituent parts and encoding each with its own neural representation. We achieve this by jointly optimizing the parameters of two neural radiance fields and a set of rigid poses which align the two fields at each frame. On both synthetic and real-world datasets, we demonstrate that our method can render photorealistic novel views, where novelty is measured on both spatial and temporal axes. Our factored representation furthermore enables animation of unseen object motion.
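
The joint decomposition can be sketched as a composite volume-rendering step: samples along a ray are additionally expressed in the object frame via the per-frame rigid pose, both radiance fields are queried, and their densities and colors are composited before the standard transmittance-weighted sum. The sketch below uses toy stand-in fields and is an assumption-level illustration, not the paper's implementation.

```python
import torch

# Compositing a static and a rigidly moving radiance field along one ray (illustrative sketch).
def composite(static_field, dynamic_field, pts, dirs, R_t, t_t, deltas):
    """pts: (N, 3) samples along a ray in the world frame; (R_t, t_t): per-frame rigid
    pose of the object (world-from-object); deltas: (N,) spacing between samples."""
    pts_obj = (pts - t_t) @ R_t                       # world -> object frame
    sigma_s, rgb_s = static_field(pts, dirs)
    sigma_d, rgb_d = dynamic_field(pts_obj, dirs @ R_t)
    sigma = sigma_s + sigma_d                         # densities add
    rgb = (sigma_s[:, None] * rgb_s + sigma_d[:, None] * rgb_d) / (sigma[:, None] + 1e-8)
    alpha = 1.0 - torch.exp(-sigma * deltas)          # per-sample opacity
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)  # transmittance
    return (T * alpha)[:, None].mul(rgb).sum(dim=0)   # rendered ray color

# Toy fields standing in for the two radiance-field MLPs.
def toy_field(pts, dirs):
    return torch.relu(pts[:, 0]), torch.sigmoid(pts)  # (N,) density, (N, 3) color

pts = torch.randn(64, 3)
dirs = torch.nn.functional.normalize(torch.randn(64, 3), dim=-1)
color = composite(toy_field, toy_field, pts, dirs,
                  torch.eye(3), torch.zeros(3), torch.full((64,), 0.05))
```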

DeepGMR: Learning Latent Gaussian Mixture Models for Registration

Aug 20, 2020
Wentao Yuan, Ben Eckart, Kihwan Kim, Varun Jampani, Dieter Fox, Jan Kautz

Point cloud registration is a fundamental problem in 3D computer vision, graphics and robotics. For the last few decades, existing registration algorithms have struggled in situations with large transformations, noise, and time constraints. In this paper, we introduce Deep Gaussian Mixture Registration (DeepGMR), the first learning-based registration method that explicitly leverages a probabilistic registration paradigm by formulating registration as the minimization of KL-divergence between two probability distributions modeled as mixtures of Gaussians. We design a neural network that extracts pose-invariant correspondences between raw point clouds and Gaussian Mixture Model (GMM) parameters and two differentiable compute blocks that recover the optimal transformation from matched GMM parameters. This construction allows the network to learn an SE(3)-invariant feature space, producing a global registration method that is real-time, generalizable, and robust to noise. Across synthetic and real-world data, our proposed method shows favorable performance when compared with state-of-the-art geometry-based and learning-based registration methods.

* ECCV 2020 spotlight 
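
To illustrate the two differentiable compute blocks, here is an assumption-level sketch (not the released code): soft point-to-component assignments, which in DeepGMR come from the network, are turned into GMM weights and means for each cloud, and the rigid transform aligning the matched components is recovered in closed form with a weighted Procrustes/SVD solve. In the toy usage, the two clouds share the same soft assignments, so the recovered transform matches the ground truth exactly.

```python
import math
import torch

# Sketch of GMM-parameter estimation and closed-form transform recovery (illustrative only).
def gmm_params(points, logits):
    """points: (N, 3); logits: (N, K) soft point-to-component assignments from a network."""
    gamma = logits.softmax(dim=1)                   # (N, K) responsibilities
    pi = gamma.mean(dim=0)                          # (K,) mixture weights
    mu = (gamma.t() @ points) / gamma.sum(dim=0, keepdim=True).t()  # (K, 3) component means
    return pi, mu

def weighted_procrustes(mu_src, mu_tgt, weights):
    """Closed-form rigid transform aligning matched component means (weighted Kabsch/SVD)."""
    w = weights / weights.sum()
    c_src = (w[:, None] * mu_src).sum(dim=0)
    c_tgt = (w[:, None] * mu_tgt).sum(dim=0)
    H = (mu_src - c_src).t() @ torch.diag(w) @ (mu_tgt - c_tgt)
    U, _, Vt = torch.linalg.svd(H)
    d = torch.det(Vt.t() @ U.t())
    D = torch.diag(torch.tensor([1.0, 1.0, torch.sign(d).item()]))  # reflection correction
    R = Vt.t() @ D @ U.t()
    t = c_tgt - R @ c_src
    return R, t

# Toy usage: the same cloud under a known rigid transform, with random soft assignments
# standing in for the network's pose-invariant output (shared between the two clouds).
src = torch.randn(1024, 3)
c, s = math.cos(0.3), math.sin(0.3)
R_true = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
tgt = src @ R_true.t() + torch.tensor([0.1, -0.2, 0.05])
logits = torch.randn(1024, 16)
pi, mu_src = gmm_params(src, logits)
_, mu_tgt = gmm_params(tgt, logits)
R, t = weighted_procrustes(mu_src, mu_tgt, pi)      # recovers R_true and the translation
```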