Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Abhishek Saroha

EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation

Apr 01, 2026

Abhishek Saroha, Huajian Zeng, Xingxing Zuo, Daniel Cremers, Xi Wang

Abstract:Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79%, and strong generalization to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.

* CVPR 2026: https://abhi-rf.github.io/egoflow/

Via

Access Paper or Ask Questions

GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Mar 18, 2026

Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang

Abstract:Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goaloriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learningbased manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian- zeng.github. io/projects/gmt/.

* Accpeted by 3DV 2026. Project Page: https://huajian-zeng.github.io/projects/gmt/

Via

Access Paper or Ask Questions

GAS-NeRF: Geometry-Aware Stylization of Dynamic Radiance Fields

Mar 11, 2025

Nhat Phuong Anh Vu, Abhishek Saroha, Or Litany, Daniel Cremers

Abstract:Current 3D stylization techniques primarily focus on static scenes, while our world is inherently dynamic, filled with moving objects and changing environments. Existing style transfer methods primarily target appearance -- such as color and texture transformation -- but often neglect the geometric characteristics of the style image, which are crucial for achieving a complete and coherent stylization effect. To overcome these shortcomings, we propose GAS-NeRF, a novel approach for joint appearance and geometry stylization in dynamic Radiance Fields. Our method leverages depth maps to extract and transfer geometric details into the radiance field, followed by appearance transfer. Experimental results on synthetic and real-world datasets demonstrate that our approach significantly enhances the stylization quality while maintaining temporal coherence in dynamic scenes.

Via

Access Paper or Ask Questions

Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction

Jan 10, 2025

Cecilia Curreli, Dominik Muhle, Abhishek Saroha, Zhenzhang Ye, Riccardo Marin, Daniel Cremers

Abstract:Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics. Visit our project page: https://ceveloper.github.io/publications/skeletondiffusion/

Via

Access Paper or Ask Questions

ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

Jan 07, 2025

Abhishek Saroha, Florian Hofherr, Mariia Gladkova, Cecilia Curreli, Or Litany, Daniel Cremers

Figure 1 for ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

Figure 2 for ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

Figure 3 for ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

Figure 4 for ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

Abstract:Stylizing a dynamic scene based on an exemplar image is critical for various real-world applications, including gaming, filmmaking, and augmented and virtual reality. However, achieving consistent stylization across both spatial and temporal dimensions remains a significant challenge. Most existing methods are designed for static scenes and often require an optimization process for each style image, limiting their adaptability. We introduce ZDySS, a zero-shot stylization framework for dynamic scenes, allowing our model to generalize to previously unseen style images at inference. Our approach employs Gaussian splatting for scene representation, linking each Gaussian to a learned feature vector that renders a feature map for any given view and timestamp. By applying style transfer on the learned feature vectors instead of the rendered feature map, we enhance spatio-temporal consistency across frames. Our method demonstrates superior performance and coherence over state-of-the-art baselines in tests on real-world dynamic scenes, making it a robust solution for practical applications.

Via

Access Paper or Ask Questions

DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting

Jul 24, 2024

Linus Härenstam-Nielsen, Lu Sang, Abhishek Saroha, Nikita Araslanov, Daniel Cremers

Abstract:Neural implicit surfaces can be used to recover accurate 3D geometry from imperfect point clouds. In this work, we show that state-of-the-art techniques work by minimizing an approximation of a one-sided Chamfer distance. This shape metric is not symmetric, as it only ensures that the point cloud is near the surface but not vice versa. As a consequence, existing methods can produce inaccurate reconstructions with spurious surfaces. Although one approach against spurious surfaces has been widely used in the literature, we theoretically and experimentally show that it is equivalent to regularizing the surface area, resulting in over-smoothing. As a more appealing alternative, we propose DiffCD, a novel loss function corresponding to the symmetric Chamfer distance. In contrast to previous work, DiffCD also assures that the surface is near the point cloud, which eliminates spurious surfaces without the need for additional regularization. We experimentally show that DiffCD reliably recovers a high degree of shape detail, substantially outperforming existing work across varying surface complexity and noise levels. Project code is available at https://github.com/linusnie/diffcd.

Via

Access Paper or Ask Questions

Gaussian Splatting in Style

Mar 13, 2024

Abhishek Saroha, Mariia Gladkova, Cecilia Curreli, Tarun Yenamandra, Daniel Cremers

Figure 1 for Gaussian Splatting in Style

Figure 2 for Gaussian Splatting in Style

Figure 3 for Gaussian Splatting in Style

Figure 4 for Gaussian Splatting in Style

Abstract:Scene stylization extends the work of neural style transfer to three spatial dimensions. A vital challenge in this problem is to maintain the uniformity of the stylized appearance across a multi-view setting. A vast majority of the previous works achieve this by optimizing the scene with a specific style image. In contrast, we propose a novel architecture trained on a collection of style images, that at test time produces high quality stylized novel views. Our work builds up on the framework of 3D Gaussian splatting. For a given scene, we take the pretrained Gaussians and process them using a multi resolution hash grid and a tiny MLP to obtain the conditional stylised views. The explicit nature of 3D Gaussians give us inherent advantages over NeRF-based methods including geometric consistency, along with having a fast training and rendering regime. This enables our method to be useful for vast practical use cases such as in augmented or virtual reality applications. Through our experiments, we show our methods achieve state-of-the-art performance with superior visual quality on various indoor and outdoor real-world data.

Via

Access Paper or Ask Questions

ResolvNet: A Graph Convolutional Network with multi-scale Consistency

Sep 30, 2023

Christian Koke, Abhishek Saroha, Yuesong Shen, Marvin Eisenberger, Daniel Cremers

Figure 1 for ResolvNet: A Graph Convolutional Network with multi-scale Consistency

Figure 2 for ResolvNet: A Graph Convolutional Network with multi-scale Consistency

Figure 3 for ResolvNet: A Graph Convolutional Network with multi-scale Consistency

Figure 4 for ResolvNet: A Graph Convolutional Network with multi-scale Consistency

Abstract:It is by now a well known fact in the graph learning community that the presence of bottlenecks severely limits the ability of graph neural networks to propagate information over long distances. What so far has not been appreciated is that, counter-intuitively, also the presence of strongly connected sub-graphs may severely restrict information flow in common architectures. Motivated by this observation, we introduce the concept of multi-scale consistency. At the node level this concept refers to the retention of a connected propagation graph even if connectivity varies over a given graph. At the graph-level, multi-scale consistency refers to the fact that distinct graphs describing the same object at different resolutions should be assigned similar feature vectors. As we show, both properties are not satisfied by poular graph neural network architectures. To remedy these shortcomings, we introduce ResolvNet, a flexible graph neural network based on the mathematical concept of resolvents. We rigorously establish its multi-scale consistency theoretically and verify it in extensive experiments on real world data: Here networks based on this ResolvNet architecture prove expressive; out-performing baselines significantly on many tasks; in- and outside the multi-scale setting.

Via

Access Paper or Ask Questions

Weight-Aware Implicit Geometry Reconstruction with Curvature-Guided Sampling

Jun 03, 2023

Lu Sang, Abhishek Saroha, Maolin Gao, Daniel Cremers

Abstract:Neural surface implicit representations offer numerous advantages, including the ability to easily modify topology and surface resolution. However, reconstructing implicit geometry representation with only limited known data is challenging. In this paper, we present an approach that effectively interpolates and extrapolates within training points, generating additional training data to reconstruct a surface with superior qualitative and quantitative results. We also introduce a technique that efficiently calculates differentiable geometric properties, i.e., mean and Gaussian curvatures, to enhance the sampling process during training. Additionally, we propose a weight-aware implicit neural representation that not only streamlines surface extraction but also extend to non-closed surfaces by depicting non-closed areas as locally degenerated patches, thereby mitigating the drawbacks of the previous assumption in implicit neural representations.

* 9 pages

Via

Access Paper or Ask Questions

Implicit Shape Completion via Adversarial Shape Priors

Apr 21, 2022

Abhishek Saroha, Marvin Eisenberger, Tarun Yenamandra, Daniel Cremers

Figure 1 for Implicit Shape Completion via Adversarial Shape Priors

Figure 2 for Implicit Shape Completion via Adversarial Shape Priors

Figure 3 for Implicit Shape Completion via Adversarial Shape Priors

Figure 4 for Implicit Shape Completion via Adversarial Shape Priors

Abstract:We present a novel neural implicit shape method for partial point cloud completion. To that end, we combine a conditional Deep-SDF architecture with learned, adversarial shape priors. More specifically, our network converts partial inputs into a global latent code and then recovers the full geometry via an implicit, signed distance generator. Additionally, we train a PointNet++ discriminator that impels the generator to produce plausible, globally consistent reconstructions. In that way, we effectively decouple the challenges of predicting shapes that are both realistic, i.e. imitate the training set's pose distribution, and accurate in the sense that they replicate the partial input observations. In our experiments, we demonstrate state-of-the-art performance for completing partial shapes, considering both man-made objects (e.g. airplanes, chairs, ...) and deformable shape categories (human bodies). Finally, we show that our adversarial training approach leads to visually plausible reconstructions that are highly consistent in recovering missing parts of a given object.

Via

Access Paper or Ask Questions