Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

James M. Rehg

Learning Predictive Visuomotor Coordination

Mar 30, 2025

Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg

Abstract:Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a \textit{Visuomotor Coordination Representation} (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.

Via

Access Paper or Ask Questions

Towards Online Multi-Modal Social Interaction Understanding

Mar 25, 2025

Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M. Rehg, Yapeng Tian

Figure 1 for Towards Online Multi-Modal Social Interaction Understanding

Figure 2 for Towards Online Multi-Modal Social Interaction Understanding

Figure 3 for Towards Online Multi-Modal Social Interaction Understanding

Figure 4 for Towards Online Multi-Modal Social Interaction Understanding

Abstract:Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which hinders them from applying to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenges of missing the useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.

Via

Access Paper or Ask Questions

Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings

Feb 03, 2025

Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar

Abstract:Photoplethysmography (PPG)-based foundation models are gaining traction due to the widespread use of PPG in biosignal monitoring and their potential to generalize across diverse health applications. In this paper, we introduce Pulse-PPG, the first open-source PPG foundation model trained exclusively on raw PPG data collected over a 100-day field study with 120 participants. Existing PPG foundation models are either open-source but trained on clinical data or closed-source, limiting their applicability in real-world settings. We evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its performance against a state-of-the-art foundation model trained on clinical data. Our results demonstrate that Pulse-PPG, trained on uncurated field data, exhibits superior generalization across clinical and mobile health applications in both lab and field settings. This suggests that exposure to real-world variability enables the model to learn fine-grained representations, making it more adaptable across tasks. Furthermore, pre-training on field data surprisingly outperforms its pre-training on clinical data in many tasks, reinforcing the importance of training on real-world, diverse datasets. To encourage further advancements in robust foundation models leveraging field data, we plan to release Pulse-PPG, providing researchers with a powerful resource for developing more generalizable PPG-based models.

* The first two listed authors contributed equally to this research

Via

Access Paper or Ask Questions

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Jan 08, 2025

Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M. Rehg, Varun Jampani

Figure 1 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Figure 2 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Figure 3 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Figure 4 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Abstract:We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables probabilistic modeling of the ill-posed single-image 3D task while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds. Project page with code and model: https://spar3d.github.io

Via

Access Paper or Ask Questions

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Dec 12, 2024

Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg

Figure 1 for Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Figure 2 for Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Figure 3 for Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Figure 4 for Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Abstract:We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle .

Via

Access Paper or Ask Questions

PyPulse: A Python Library for Biosignal Imputation

Dec 09, 2024

Kevin Gao, Maxwell A. Xu, James M. Rehg, Alexander Moreno

Figure 1 for PyPulse: A Python Library for Biosignal Imputation

Figure 2 for PyPulse: A Python Library for Biosignal Imputation

Figure 3 for PyPulse: A Python Library for Biosignal Imputation

Abstract:We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. Missingness is commonplace in these settings and can arise from multiple causes, such as insecure sensor attachment or data transmission loss. PyPulse's framework provides a modular and extendable framework with high ease-of-use for a broad userbase, including non-machine-learning bioresearchers. Specifically, its new capabilities include using pre-trained imputation methods out-of-the-box on custom datasets, running the full workflow of training or testing a baseline method with a single line of code, and comparing baseline methods in an interactive visualization tool. We released PyPulse under the MIT License on Github and PyPI. The source code can be found at: https://github.com/rehg-lab/pulseimpute.

* 7 pages, 3 figures. Implementation and documentation are available at https://github.com/rehg-lab/pulseimpute

Via

Access Paper or Ask Questions

Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Dec 03, 2024

Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang(+1 more)

Figure 1 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Figure 2 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Figure 3 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Figure 4 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Abstract:Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$, that can $\textbf{insta}$ntly learn a new image $\textbf{manip}$ulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin ($\geq$19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.

* 18 pages, 16 figures, 5 tables

Via

Access Paper or Ask Questions

Optimization-Free Image Immunization Against Diffusion-Based Editing

Nov 27, 2024

Tarik Can Ozden, Ozgur Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, Sandeep P. Chinchali, James M. Rehg

Figure 1 for Optimization-Free Image Immunization Against Diffusion-Based Editing

Figure 2 for Optimization-Free Image Immunization Against Diffusion-Based Editing

Figure 3 for Optimization-Free Image Immunization Against Diffusion-Based Editing

Figure 4 for Optimization-Free Image Immunization Against Diffusion-Based Editing

Abstract:Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image-taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framework for image immunization, specifically designed to prevent diffusion-based editing. Our approach enables effective generalization to unseen content, reducing computational costs and cutting immunization time from days to milliseconds-achieving a 250,000x speedup. This is achieved through a loss term that ensures the failure of editing attempts and the imperceptibility of the perturbations. Extensive qualitative and quantitative results demonstrate that our model is scalable, optimization-free, adaptable to various diffusion-based editing tools, robust against counter-attacks, and, for the first time, effectively protects video content from editing. Our code is provided in our project webpage.

* Project webpage: https://diffvax.github.io/

Via

Access Paper or Ask Questions

RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Nov 27, 2024

Maxwell A. Xu, Jaya Narain, Gregory Darnell, Haraldur Hallgrimsson, Hyewon Jeong, Darren Forde, Richard Fineman, Karthik J. Raghuram, James M. Rehg, Shirley Ren

Figure 1 for RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Figure 2 for RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Figure 3 for RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Figure 4 for RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Abstract:We present RelCon, a novel self-supervised \textit{Rel}ative \textit{Con}trastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss for training an motion foundation model from wearable sensors. The learnable distance measure captures motif similarity and domain-specific semantic information such as rotation invariance. The learned distance provides a measurement of semantic similarity between a pair of accelerometer time-series segments, which is used to measure the distance between an anchor and various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.

Via

Access Paper or Ask Questions

Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Nov 26, 2024

Xiang Li, Zixuan Huang, Anh Thai, James M. Rehg

Figure 1 for Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Figure 2 for Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Figure 3 for Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Figure 4 for Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Abstract:Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.

* Project page: https://ryanxli.github.io/reflect3d/

Via

Access Paper or Ask Questions