Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Heng Yu

, Tsinghua University

AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion

Apr 20, 2026

Hongjie Li, Heng Yu, Jiaman Li, Hong-Xing Yu, Ehsan Adeli, C. Karen Liu, Jiajun Wu

Abstract:Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.

* CVPR 2026. Project website: https://awfuact.github.io/anylift/ The first two authors contribute equally

Via

Access Paper or Ask Questions

Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations

Mar 01, 2026

Chengtai Li, Yuting He, Jianfeng Ren, Ruibin Bai, Yitian Zhao, Heng Yu, Xudong Jiang

Abstract:While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A$^2$CL), \ie, to identify an outlier image given three other images that follow the same compositional rules. To address the challenge of modelling abundant compositional rules, an Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC$^2$R datasets show that PR-A$^2$CL significantly outperforms state-of-the-art reasoning models.

* Accepted by IEEE Transactions on Multimedia

Via

Access Paper or Ask Questions

Compress, Cross and Scale: Multi-Level Compression Cross Networks for Efficient Scaling in Recommender Systems

Feb 12, 2026

Heng Yu, Xiangjun Zhou, Jie Xia, Heng Zhao, Anxin Wu, Yu Zhao, Dongying Kong

Abstract:Modeling high-order feature interactions efficiently is a central challenge in click-through rate and conversion rate prediction. Modern industrial recommender systems are predominantly built upon deep learning recommendation models, where the interaction backbone plays a critical role in determining both predictive performance and system efficiency. However, existing interaction modules often struggle to simultaneously achieve strong interaction capacity, high computational efficiency, and good scalability, resulting in limited ROI when models are scaled under strict production constraints. In this work, we propose MLCC, a structured feature interaction architecture that organizes feature crosses through hierarchical compression and dynamic composition, which can efficiently capture high-order feature dependencies while maintaining favorable computational complexity. We further introduce MC-MLCC, a Multi-Channel extension that decomposes feature interactions into parallel subspaces, enabling efficient horizontal scaling with improved representation capacity and significantly reduced parameter growth. Extensive experiments on three public benchmarks and a large-scale industrial dataset show that our proposed models consistently outperform strong DLRM-style baselines by up to 0.52 AUC, while reducing model parameters and FLOPs by up to 26$\times$ under comparable performance. Comprehensive scaling analyses demonstrate stable and predictable scaling behavior across embedding dimension, head number, and channel count, with channel-based scaling achieving substantially better efficiency than conventional embedding inflation. Finally, online A/B testing on a real-world advertising platform validates the practical effectiveness of our approach, which has been widely adopted in Bilibili advertising system under strict latency and resource constraints.

* 11 pages, 3 figures

Via

Access Paper or Ask Questions

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Dec 16, 2025

Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli

Figure 1 for ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Figure 2 for ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Figure 3 for ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Figure 4 for ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Abstract:Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task co-speech gesture or text-to-motion that maps a fixed utterance to motion clips-without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/

* Project page: https://ai.stanford.edu/~juze/ViBES/

Via

Access Paper or Ask Questions

SocialGen: Modeling Multi-Human Social Interaction with Language Models

Mar 28, 2025

Heng Yu, Juze Zhang, Changan Chen, Tiange Xiang, Yusu Fang, Juan Carlos Niebles, Ehsan Adeli

Abstract:Human interactions in everyday life are inherently social, involving engagements with diverse individuals across various contexts. Modeling these social interactions is fundamental to a wide range of real-world applications. In this paper, we introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals, to address this crucial yet challenging problem. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals and aligning them with the language space. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors. To tackle the challenges of data scarcity, we curate a comprehensive multi-human interaction dataset, SocialX, enriched with textual annotations. Leveraging this dataset, we establish the first comprehensive benchmark for multi-human interaction tasks. Our method achieves state-of-the-art performance across motion-language tasks, setting a new standard for multi-human interaction modeling.

Via

Access Paper or Ask Questions

MagicGeo: Training-Free Text-Guided Geometric Diagram Generation

Feb 19, 2025

Junxiao Wang, Ting Zhang, Heng Yu, Jingdong Wang, Hua Huang

Abstract:Geometric diagrams are critical in conveying mathematical and scientific concepts, yet traditional diagram generation methods are often manual and resource-intensive. While text-to-image generation has made strides in photorealistic imagery, creating accurate geometric diagrams remains a challenge due to the need for precise spatial relationships and the scarcity of geometry-specific datasets. This paper presents MagicGeo, a training-free framework for generating geometric diagrams from textual descriptions. MagicGeo formulates the diagram generation process as a coordinate optimization problem, ensuring geometric correctness through a formal language solver, and then employs coordinate-aware generation. The framework leverages the strong language translation capability of large language models, while formal mathematical solving ensures geometric correctness. We further introduce MagicGeoBench, a benchmark dataset of 220 geometric diagram descriptions, and demonstrate that MagicGeo outperforms current methods in both qualitative and quantitative evaluations. This work provides a scalable, accurate solution for automated diagram generation, with significant implications for educational and academic applications.

Via

Access Paper or Ask Questions

DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset

Jan 21, 2025

Yupei Li, Zifan Wei, Heng Yu, Huichi Zhou, Björn W. Schuller

Figure 1 for DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset

Figure 2 for DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset

Figure 3 for DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset

Figure 4 for DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset

Abstract:Code-switching, the alternation between two or more languages within communication, poses great challenges for Automatic Speech Recognition (ASR) systems. Existing models and datasets are limited in their ability to effectively handle these challenges. To address this gap and foster progress in code-switching ASR research, we introduce the DOTA-ME-CS: Daily oriented text audio Mandarin-English code-switching dataset, which consists of 18.54 hours of audio data, including 9,300 recordings from 34 participants. To enhance the dataset's diversity, we apply artificial intelligence (AI) techniques such as AI timbre synthesis, speed variation, and noise addition, thereby increasing the complexity and scalability of the task. The dataset is carefully curated to ensure both diversity and quality, providing a robust resource for researchers addressing the intricacies of bilingual speech recognition with detailed data analysis. We further demonstrate the dataset's potential in future research. The DOTA-ME-CS dataset, along with accompanying code, will be made publicly available.

Via

Access Paper or Ask Questions

End-to-end Generative Spatial-Temporal Ultrasonic Odometry and Mapping Framework

Dec 23, 2024

Fuhua Jia, Xiaoying Yang, Mengshen Yang, Yang Li, Hang Xu, Adam Rushworth, Salman Ijaz, Heng Yu, Tianxiang Cui

Figure 1 for End-to-end Generative Spatial-Temporal Ultrasonic Odometry and Mapping Framework

Figure 2 for End-to-end Generative Spatial-Temporal Ultrasonic Odometry and Mapping Framework

Figure 3 for End-to-end Generative Spatial-Temporal Ultrasonic Odometry and Mapping Framework

Figure 4 for End-to-end Generative Spatial-Temporal Ultrasonic Odometry and Mapping Framework

Abstract:Performing simultaneous localization and mapping (SLAM) in low-visibility conditions, such as environments filled with smoke, dust and transparent objets, has long been a challenging task. Sensors like cameras and Light Detection and Ranging (LiDAR) are significantly limited under these conditions, whereas ultrasonic sensors offer a more robust alternative. However, the low angular resolution, slow update frequency, and limited detection accuracy of ultrasonic sensors present barriers for SLAM. In this work, we propose a novel end-to-end generative ultrasonic SLAM framework. This framework employs a sensor array with overlapping fields of view, leveraging the inherently low angular resolution of ultrasonic sensors to implicitly encode spatial features in conjunction with the robot's motion. Consecutive time frame data is processed through a sliding window mechanism to capture temporal features. The spatiotemporally encoded sensor data is passed through multiple modules to generate dense scan point clouds and robot pose transformations for map construction and odometry. The main contributions of this work include a novel ultrasonic sensor array that spatiotemporally encodes the surrounding environment, and an end-to-end generative SLAM framework that overcomes the inherent defects of ultrasonic sensors. Several real-world experiments demonstrate the feasibility and robustness of the proposed framework.

* 5 pages, 4 figures and 1 table

Via

Access Paper or Ask Questions

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Jun 11, 2024

Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee

Figure 1 for 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Figure 2 for 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Figure 3 for 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Figure 4 for 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Abstract:Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, delicately generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.

Via

Access Paper or Ask Questions

WeightedPose: Generalizable Cross-Pose Estimation via Weighted SVD

May 03, 2024

Xuxin Cheng, Heng Yu, Harry Zhang, Wenxing Deng

Figure 1 for WeightedPose: Generalizable Cross-Pose Estimation via Weighted SVD

Figure 2 for WeightedPose: Generalizable Cross-Pose Estimation via Weighted SVD

Figure 3 for WeightedPose: Generalizable Cross-Pose Estimation via Weighted SVD

Abstract:We present a novel method for robotic manipulation tasks in human environments that require reasoning about the 3D geometric relationship between a pair of objects. Traditional end-to-end trained policies, which map from pixel observations to low-level robot actions, struggle to reason about complex pose relationships and have difficulty generalizing to unseen object configurations. To address these challenges, we propose a method that learns to reason about the 3D geometric relationship between objects, focusing on the relationship between key parts on one object with respect to key parts on another object. Our standalone model utilizes Weighted SVD to reason about both pose relationships between articulated parts and between free-floating objects. This approach allows the robot to understand the relationship between the oven door and the oven body, as well as the relationship between the lasagna plate and the oven, for example. By considering the 3D geometric relationship between objects, our method enables robots to perform complex manipulation tasks that reason about object-centric representations. We open source the code and demonstrate the results here

* arXiv admin note: text overlap with arXiv:2211.09325

Via

Access Paper or Ask Questions