Dingbang Huang

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

Jun 03, 2026

Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos

Abstract: In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: https://dingbang777.github.io/RePHO/

* CVPR 2026. Project Page: https://dingbang777.github.io/RePHO/

PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment

May 16, 2025

Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, Bo Dai

Abstract: Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers, such as rational global layout, physics-plausible contacts, and visual effects like shadows and reflections, while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also spatial and visual interactions among layers for global coherence.

* Project Page: https://github.com/dingbang777/PSDiffusion/
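
For readers unfamiliar with the layered output format described in the abstract (one RGB background plus several RGBA foregrounds), the sketch below shows how such layers could be flattened into a single image with the standard "over" compositing operator. It is not taken from PSDiffusion: the function names `over` and `flatten_layers` are hypothetical, and it assumes straight (non-premultiplied) alpha with all channel values in [0, 1].

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]


def over(fg: RGBA, bg: RGB) -> RGB:
    """Composite one straight-alpha RGBA pixel over an opaque RGB pixel."""
    r, g, b, a = fg
    # Standard "over" operator: out = alpha * foreground + (1 - alpha) * background
    return tuple(a * f + (1.0 - a) * c for f, c in zip((r, g, b), bg))


def flatten_layers(
    background: List[List[RGB]],
    foregrounds: List[List[List[RGBA]]],
) -> List[List[RGB]]:
    """Flatten an RGB background and RGBA foregrounds (ordered bottom to top).

    Assumption: every layer has the same height and width as the background.
    """
    canvas = [list(row) for row in background]  # copy so the input is untouched
    for layer in foregrounds:
        for y, row in enumerate(layer):
            for x, pixel in enumerate(row):
                canvas[y][x] = over(pixel, canvas[y][x])
    return canvas


# Usage: a 1x2 black background, an opaque red patch on the left pixel,
# and a half-transparent white layer on top of everything.
background = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
red_patch = [[(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.0)]]
white_haze = [[(1.0, 1.0, 1.0, 0.5), (1.0, 1.0, 1.0, 0.5)]]
flattened = flatten_layers(background, [red_patch, white_haze])
```

Because each foreground keeps its own alpha channel, effects such as soft shadows can live in a layer's semi-transparent pixels and still blend correctly with whatever lies beneath.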

Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions

Mar 20, 2025

Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, Yong-Lu Li

Abstract: Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty in acquiring 3D object assets. However, with the development of 3D reconstruction from single images, recently it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides the 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work. Data and code will be publicly available at https://wenboran2002.github.io/3dhoi.

* Accepted to CVPR 2025
