
Yufei Ye

Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips

Sep 11, 2023
Yufei Ye, Poorvi Hebbar, Abhinav Gupta, Shubham Tulsiani

We tackle the task of reconstructing hand-object interactions from short video clips. Given an input video, our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape, as well as the time-varying motion and hand articulation. While the input video naturally provides some multi-view cues to guide 3D inference, these are insufficient on their own due to occlusions and limited viewpoint variation. To obtain accurate 3D reconstructions, we augment the multi-view signals with generic data-driven priors. Specifically, we learn a diffusion network to model the conditional distribution of (geometric) renderings of objects conditioned on hand configuration and category label, and leverage it as a prior to guide the novel-view renderings of the reconstructed scene. We empirically evaluate our approach on egocentric videos across six object categories and observe significant improvements over prior single-view and multi-view methods. Finally, we demonstrate our system's ability to reconstruct arbitrary clips from YouTube, showing both first- and third-person interactions.

* Accepted to ICCV 2023 (Oral). Project page: https://judyye.github.io/diffhoi-www/
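
At a high level, the per-video optimization described above combines reprojection losses on the observed frames with diffusion-based guidance on novel-view geometric renderings. Below is a minimal sketch of such a loop; `scene`, its `render` interface, and `diffusion_prior.guidance_loss` are hypothetical stand-ins, not the released implementation.

```python
import torch
import torch.nn.functional as F

def optimize_clip(scene, frames, masks, hand_poses, diffusion_prior,
                  steps=2000, prior_weight=1.0):
    """Illustrative per-video optimization: multi-view reprojection losses on
    observed frames, plus a diffusion prior scoring novel-view geometric
    renderings conditioned on hand pose. All interfaces are assumptions."""
    opt = torch.optim.Adam(scene.parameters(), lr=1e-3)
    for _ in range(steps):
        t = torch.randint(len(frames), (1,)).item()
        # Multi-view cue: re-render the observed viewpoint and compare.
        pred = scene.render(view="observed", frame=t)   # dict with "rgb", "mask"
        loss = F.mse_loss(pred["rgb"], frames[t]) + F.l1_loss(pred["mask"], masks[t])
        # Data-driven prior: a geometric rendering (e.g. depth/normals) from a
        # random novel view, scored by the hand- and category-conditioned
        # diffusion model (score-distillation-style guidance).
        novel = scene.render(view="novel", frame=t)
        loss = loss + prior_weight * diffusion_prior.guidance_loss(
            novel["geometry"], hand_pose=hand_poses[t])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scene
```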

Affordance Diffusion: Synthesizing Hand-Object Interactions

Mar 25, 2023
Yufei Ye, Xueting Li, Abhinav Gupta, Shalini De Mello, Stan Birchfield, Jiaming Song, Shubham Tulsiani, Sifei Liu

Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to text- or image-conditioned generation for synthesizing an entire image, transferring texture, or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method generalizes better to novel objects and performs surprisingly well on out-of-distribution in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation. Project page: https://judyye.github.io/affordiffusion-www
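
The two-step factorization can be read as two conditional samplers chained together: one deciding where and how the hand is placed, one deciding what it looks like. A minimal sketch follows; `layout_net`, `content_net`, their `sample` methods, and the `HOILayout` fields are illustrative assumptions rather than the released API.

```python
from dataclasses import dataclass
import torch

@dataclass
class HOILayout:
    # Articulation-agnostic placement of the hand relative to the object.
    center: torch.Tensor    # (2,) normalized image coordinates
    size: torch.Tensor      # (1,) relative hand size
    approach: torch.Tensor  # (1,) approaching orientation (radians)

def synthesize_interaction(object_image, layout_net, content_net, n_samples=4):
    """Two-step generation: sample a hand-object layout, then synthesize the
    hand appearance conditioned on that layout. Both nets are assumed to be
    diffusion models built on a large pretrained backbone."""
    results = []
    for _ in range(n_samples):
        layout: HOILayout = layout_net.sample(object_image)   # where / how big / which direction
        image = content_net.sample(object_image, layout)      # what the grasping hand looks like
        results.append((layout, image))
    return results
```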

What's in your hands? 3D Reconstruction of Generic Objects in Hands

Apr 14, 2022
Yufei Ye, Abhinav Gupta, Shubham Tulsiani

Our work aims to reconstruct hand-held objects from a single RGB image. In contrast to prior works that typically assume known 3D templates and reduce the problem to 3D pose estimation, our work reconstructs generic hand-held objects without knowing their 3D templates. Our key insight is that hand articulation is highly predictive of object shape, and we propose an approach that conditionally reconstructs the object based on the articulation and the visual input. Given an image depicting a hand-held object, we first use off-the-shelf systems to estimate the underlying hand pose and then infer the object shape in a normalized hand-centric coordinate frame. We parameterize the object by its signed distance field, which is inferred by an implicit network that combines visual features with articulation-aware coordinates to process each query point. We perform experiments across three datasets and show that our method consistently outperforms baselines and is able to reconstruct a diverse set of objects. We analyze the benefits and robustness of explicit articulation conditioning and also show that it allows the hand pose estimate to further improve with test-time optimization.

* Accepted to CVPR 2022
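
To make the conditioning concrete, here is a sketch of an articulation-conditioned implicit decoder: it predicts the signed distance of a query point from a visual feature together with the point's coordinates expressed in the local frames of the hand joints. Layer sizes, the number of joints, and the module interface are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ArticulationConditionedSDF(nn.Module):
    """Illustrative implicit decoder: predicts the signed distance of a 3D
    query point in a hand-centric frame from (a) a visual feature and (b)
    articulation-aware coordinates, i.e. the point expressed in the local
    frames of the hand joints. Dimensions are made up."""
    def __init__(self, visual_dim=256, n_joints=16, hidden=256):
        super().__init__()
        in_dim = visual_dim + 3 + 3 * n_joints
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual_feat, query_xyz, joint_frames):
        # joint_frames: (B, n_joints, 4, 4) hand-joint transforms from an
        # off-the-shelf hand pose estimator; query_xyz: (B, 3).
        B = query_xyz.shape[0]
        homo = torch.cat([query_xyz, torch.ones_like(query_xyz[:, :1])], dim=-1)  # (B, 4)
        # Express the query point in each joint's local frame.
        local = torch.einsum("bjik,bk->bji", torch.inverse(joint_frames), homo)[..., :3]
        x = torch.cat([visual_feat, query_xyz, local.reshape(B, -1)], dim=-1)
        return self.mlp(x)  # signed distance at the query point
```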

Shelf-Supervised Mesh Prediction in the Wild

Feb 11, 2021
Yufei Ye, Shubham Tulsiani, Abhinav Gupta

We aim to infer the 3D shape and pose of objects from a single image and propose a learning-based approach that can train from unstructured image collections, supervised only by segmentation outputs from off-the-shelf recognition systems (i.e. 'shelf-supervised'). We first infer a volumetric representation in a canonical frame, along with the camera pose. We enforce that this representation is geometrically consistent with both appearance and masks, and that synthesized novel views are indistinguishable from the image collections. The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame. These two steps allow both shape-pose factorization from image collections and per-instance reconstruction with finer details. We evaluate the method on both synthetic and real-world datasets and demonstrate its scalability to 50 categories in the wild, an order of magnitude more classes than existing works handle.
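
Read as an inference pipeline, the two stages amount to a coarse canonical-frame volumetric prediction followed by per-instance mesh refinement in the predicted camera frame. The sketch below only illustrates this factorization; every module name is a placeholder, not the released code.

```python
import torch

def shelf_supervised_reconstruct(image, mask, volume_net, camera_net,
                                 voxel_to_mesh, mesh_refiner):
    """Illustrative two-stage pipeline:
    1) predict a coarse volumetric shape in a canonical frame plus a camera
       pose (these networks are trained against masks, appearance, and an
       adversarial novel-view signal);
    2) convert the volume to a mesh and refine it per instance in the
       predicted camera frame."""
    with torch.no_grad():
        volume = volume_net(image)    # canonical-frame occupancy grid
        camera = camera_net(image)    # predicted camera pose
    mesh = voxel_to_mesh(volume)      # e.g. marching cubes on the grid
    mesh = mesh_refiner(mesh, image, mask, camera)  # instance-specific details
    return mesh, camera
```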

Object-centric Forward Modeling for Model Predictive Control

Oct 08, 2019
Yufei Ye, Dhiraj Gandhi, Abhinav Gupta, Shubham Tulsiani

We present an approach to learn an object-centric forward model, and show that this allows us to plan sequences of actions to achieve distant desired goals. We propose to model a scene as a collection of objects, each with an explicit spatial location and an implicit visual feature, and learn to model the effects of actions using random interaction data. Our model captures robot-object and object-object interactions, and leads to more sample-efficient and accurate predictions. We show that this learned model can be leveraged to search for action sequences that lead to desired goal configurations, and that, in conjunction with a learned correction module, this allows for robust closed-loop execution. We present experiments both in simulation and in the real world, and show that our approach improves over alternate implicit or pixel-space forward models. Please see our project page (https://judyye.github.io/ocmpc/) for result videos.
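
One common way to use such a learned forward model for planning is random-shooting model-predictive control: sample candidate action sequences, roll the model forward, and pick the sequence whose predicted object locations best match the goal. The sketch below illustrates this under assumed interfaces for the state dictionaries and forward model; it is not necessarily the paper's exact planner.

```python
import torch

def plan_actions(forward_model, init_state, goal_state, action_dim,
                 horizon=5, n_candidates=1000):
    """Illustrative random-shooting planner over a learned object-centric
    forward model. `init_state` / `goal_state` hold per-object locations and
    implicit features; the cost compares predicted and desired locations."""
    candidates = torch.randn(n_candidates, horizon, action_dim)  # sampled action sequences
    costs = torch.zeros(n_candidates)
    for i in range(n_candidates):
        state = init_state
        for t in range(horizon):
            state = forward_model(state, candidates[i, t])       # roll the model forward
        costs[i] = (state["locations"] - goal_state["locations"]).norm()
    best = costs.argmin()
    # In closed-loop execution the first action is applied, the scene is
    # re-observed, and planning repeats (optionally with a learned correction).
    return candidates[best]
```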

Compositional Video Prediction

Aug 22, 2019
Yufei Ye, Maneesh Singh, Abhinav Gupta, Shubham Tulsiani

We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is composed of distinct entities that undergo motion, and present an approach that operationalizes this insight. We implicitly predict the future states of independent entities while reasoning about their interactions, and compose future video frames using these predicted states. We overcome the inherent multi-modality of the task using a global trajectory-level latent random variable, and show that this allows us to sample diverse and plausible futures. We empirically validate our approach against alternate representations and ways of incorporating multi-modality. We examine two datasets, one comprising stacked objects that may fall and the other containing videos of humans performing activities in a gym, and show that our approach allows realistic stochastic video prediction across these diverse settings. See https://judyye.github.io/CVP/ for video predictions.

* Accepted to ICCV 2019
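
The entity-centric rollout described above can be sketched as: encode per-entity states, sample a single trajectory-level latent to handle multi-modality, advance the entities jointly while modeling their interactions, and compose the decoded entities back into frames. All module names and the latent handling below are illustrative assumptions, not the released code.

```python
import torch

def predict_future_frames(encoder, dynamics, decoder, frame0, n_steps=8):
    """Illustrative entity-centric rollout with a global trajectory-level
    latent; one latent sample yields one plausible future."""
    entities = encoder(frame0)                 # per-entity states from the input frame
    z = torch.randn(dynamics.latent_dim)       # trajectory-level latent (assumed attribute)
    frames = []
    for _ in range(n_steps):
        entities = dynamics(entities, z)       # joint update, reasoning about interactions
        frames.append(decoder(entities))       # compose entities into a predicted frame
    return frames
```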

A New Approach for Resource Scheduling with Deep Reinforcement Learning

Jun 21, 2018
Yufei Ye, Xiaoqin Ren, Jin Wang, Lingxiao Xu, Wenxia Guo, Wenqiang Huang, Wenhong Tian

With the rapid development of deep learning, deep reinforcement learning (DRL) has begun to appear in the field of resource scheduling in recent years. Building on previous DRL research in the literature, we introduce the online resource scheduling algorithm DeepRM2 and the offline resource scheduling algorithm DeepRM_Off. Compared with the state-of-the-art DRL algorithm DeepRM and with heuristic algorithms, our proposed algorithms converge faster and achieve better scheduling efficiency with regard to average slowdown, job completion time, and reward.

Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs

Apr 08, 2018
Xiaolong Wang, Yufei Ye, Abhinav Gupta

We consider the problem of zero-shot recognition: learning a visual classifier for a category with zero training examples, using only the word embedding of the category and its relationships to other categories for which visual data are provided. The key to dealing with an unfamiliar or novel category is to transfer knowledge obtained from familiar classes to describe the unfamiliar class. In this paper, we build upon the recently introduced Graph Convolutional Network (GCN) and propose an approach that uses both semantic embeddings and categorical relationships to predict the classifiers. Given a learned knowledge graph (KG), our approach takes as input the semantic embedding of each node (representing a visual category). After a series of graph convolutions, we predict the visual classifier for each category. During training, the visual classifiers for a few categories are given to learn the GCN parameters. At test time, the trained GCN is used to predict the visual classifiers of unseen categories. We show that our approach is robust to noise in the KG. More importantly, our approach provides significant improvement in performance compared to the current state-of-the-art results (from 2-3% on some metrics to a whopping 20% on a few).

* CVPR 2018 
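
A minimal sketch of a classifier-regressing GCN follows: word embeddings sit on the nodes of the knowledge graph, a few graph-convolution layers propagate them over the normalized adjacency, and the output at each node is interpreted as that category's visual classifier weights. Dimensions and layer count here are arbitrary choices for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ClassifierGCN(nn.Module):
    """Illustrative GCN mapping per-category word embeddings, propagated over
    a knowledge-graph adjacency, to visual classifier weights."""
    def __init__(self, embed_dim=300, hidden=512, classifier_dim=2048, n_layers=3):
        super().__init__()
        dims = [embed_dim] + [hidden] * (n_layers - 1) + [classifier_dim]
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(n_layers)])

    def forward(self, adj_norm, word_embeddings):
        # adj_norm: (N, N) normalized adjacency of the knowledge graph
        # word_embeddings: (N, embed_dim), one row per category node
        h = word_embeddings
        for i, layer in enumerate(self.layers):
            h = adj_norm @ layer(h)              # graph convolution step
            if i < len(self.layers) - 1:
                h = torch.relu(h)
        return h                                 # (N, classifier_dim) predicted classifiers

# Training supervises only the rows for seen categories against their known
# classifier weights; at test time, the rows for unseen categories serve
# directly as zero-shot classifiers applied to image features.
```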