Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Srinath Sridhar

ProcTex: Consistent and Interactive Text-to-texture Synthesis for Procedural Models

Jan 28, 2025

Ruiqi Xu, Zihan Zhu, Xianghao Xu, Srinath Sridhar, Daniel Ritchie

Figure 1 for ProcTex: Consistent and Interactive Text-to-texture Synthesis for Procedural Models

Figure 2 for ProcTex: Consistent and Interactive Text-to-texture Synthesis for Procedural Models

Figure 3 for ProcTex: Consistent and Interactive Text-to-texture Synthesis for Procedural Models

Figure 4 for ProcTex: Consistent and Interactive Text-to-texture Synthesis for Procedural Models

Abstract:Recent advancement in 2D image diffusion models has driven significant progress in text-guided texture synthesis, enabling realistic, high-quality texture generation from arbitrary text prompts. However, current methods usually focus on synthesizing texture for single static 3D objects, and struggle to handle entire families of shapes, such as those produced by procedural programs. Applying existing methods naively to each procedural shape is too slow to support exploring different parameter settings at interactive rates, and also results in inconsistent textures across the procedural shapes. To this end, we introduce ProcTex, the first text-to-texture system designed for procedural 3D models. ProcTex enables consistent and real-time text-guided texture synthesis for families of shapes, which integrates seamlessly with the interactive design flow of procedural models. To ensure consistency, our core approach is to generate texture for the shape produced by one setting of the procedural parameters, followed by a texture transfer stage to apply the texture to other parameter settings. We also develop several techniques, including a novel UV displacement network for real-time texture transfer, the retexturing pipeline to support structural changes from discrete procedural parameters, and part-level UV texture map generation for local appearance editing. Extensive experiments on a diverse set of professional procedural models validate ProcTex's ability to produce high-quality, visually consistent textures while supporting real-time, interactive applications.

Via

Access Paper or Ask Questions

Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Dec 18, 2024

Tao Lu, Ankit Dhiman, R Srinath, Emre Arslan, Angela Xing, Yuanbo Xiangli, R Venkatesh Babu, Srinath Sridhar

Figure 1 for Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Figure 2 for Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Figure 3 for Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Figure 4 for Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Abstract:Novel-view synthesis is an important problem in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent methods like 3D Gaussian Splatting (3DGS) have become the preferred method for this task, providing high-quality novel views in real time. However, the training time of a 3DGS model is slow, often taking 30 minutes for a scene with 200 views. In contrast, our goal is to reduce the optimization time by training for fewer steps while maintaining high rendering quality. Specifically, we combine the guidance from both the position error and the appearance error to achieve a more effective densification. To balance the rate between adding new Gaussians and fitting old Gaussians, we develop a convergence-aware budget control mechanism. Moreover, to make the densification process more reliable, we selectively add new Gaussians from mostly visited regions. With these designs, we reduce the Gaussian optimization steps to one-third of the previous approach while achieving a comparable or even better novel view rendering quality. To further facilitate the rapid fitting of 4K resolution images, we introduce a dilation-based rendering technique. Our method, Turbo-GS, speeds up optimization for typical scenes and scales well to high-resolution (4K) scenarios on standard datasets. Through extensive experiments, we show that our method is significantly faster in optimization than other methods while retaining quality. Project page: https://ivl.cs.brown.edu/research/turbo-gs.

* Project page: https://ivl.cs.brown.edu/research/turbo-gs

Via

Access Paper or Ask Questions

GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Dec 05, 2024

Rao Fu, Dingxi Zhang, Alex Jiang, Wanjia Fu, Austin Funk, Daniel Ritchie, Srinath Sridhar

Figure 1 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Figure 2 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Figure 3 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Figure 4 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Abstract:Understanding bimanual human hand activities is a critical problem in AI and robotics. We cannot build large models of bimanual activities because existing datasets lack the scale, coverage of diverse hand activities, and detailed annotations. We introduce GigaHands, a massive annotated dataset capturing 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames paired with 84k text annotations. Our markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort required for text annotation. The scale and diversity of GigaHands enable broad applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction.

Via

Access Paper or Ask Questions

FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Dec 03, 2024

Kefan Chen, Chaerin Min, Linguang Zhang, Shreyas Hampali, Cem Keskin, Srinath Sridhar

Figure 1 for FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Figure 2 for FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Figure 3 for FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Figure 4 for FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Abstract:Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.

Via

Access Paper or Ask Questions

HFGaussian: Learning Generalizable Gaussian Human with Integrated Human Features

Nov 05, 2024

Arnab Dey, Cheng-You Lu, Andrew I. Comport, Srinath Sridhar, Chin-Teng Lin, Jean Martinet

Abstract:Recent advancements in radiance field rendering show promising results in 3D scene representation, where Gaussian splatting-based techniques emerge as state-of-the-art due to their quality and efficiency. Gaussian splatting is widely used for various applications, including 3D human representation. However, previous 3D Gaussian splatting methods either use parametric body models as additional information or fail to provide any underlying structure, like human biomechanical features, which are essential for different applications. In this paper, we present a novel approach called HFGaussian that can estimate novel views and human features, such as the 3D skeleton, 3D key points, and dense pose, from sparse input images in real time at 25 FPS. The proposed method leverages generalizable Gaussian splatting technique to represent the human subject and its associated features, enabling efficient and generalizable reconstruction. By incorporating a pose regression network and the feature splatting technique with Gaussian splatting, HFGaussian demonstrates improved capabilities over existing 3D human methods, showcasing the potential of 3D human representations with integrated biomechanics. We thoroughly evaluate our HFGaussian method against the latest state-of-the-art techniques in human Gaussian splatting and pose estimation, demonstrating its real-time, state-of-the-art performance.

Via

Access Paper or Ask Questions

DA-VIL: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control

Oct 25, 2024

Md Faizal Karim, Shreya Bollimuntha, Mohammed Saad Hashmi, Autrio Das, Gaurav Singh, Srinath Sridhar, Arun Kumar Singh, Nagamanikandan Govindan, K Madhava Krishna

Figure 1 for DA-VIL: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control

Figure 2 for DA-VIL: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control

Figure 3 for DA-VIL: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control

Figure 4 for DA-VIL: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control

Abstract:Dual-arm manipulation is an area of growing interest in the robotics community. Enabling robots to perform tasks that require the coordinated use of two arms, is essential for complex manipulation tasks such as handling large objects, assembling components, and performing human-like interactions. However, achieving effective dual-arm manipulation is challenging due to the need for precise coordination, dynamic adaptability, and the ability to manage interaction forces between the arms and the objects being manipulated. We propose a novel pipeline that combines the advantages of policy learning based on environment feedback and gradient-based optimization to learn controller gains required for the control outputs. This allows the robotic system to dynamically modulate its impedance in response to task demands, ensuring stability and dexterity in dual-arm operations. We evaluate our pipeline on a trajectory-tracking task involving a variety of large, complex objects with different masses and geometries. The performance is then compared to three other established methods for controlling dual-arm robots, demonstrating superior results.

Via

Access Paper or Ask Questions

MotionGlot: A Multi-Embodied Motion Generation Model

Oct 22, 2024

Sudarshan Harithas, Srinath Sridhar

Figure 1 for MotionGlot: A Multi-Embodied Motion Generation Model

Figure 2 for MotionGlot: A Multi-Embodied Motion Generation Model

Figure 3 for MotionGlot: A Multi-Embodied Motion Generation Model

Figure 4 for MotionGlot: A Multi-Embodied Motion Generation Model

Abstract:This paper introduces MotionGlot, a model that can generate motion across multiple embodiments with different action dimensions, such as quadruped robots and human bodies. By leveraging the well-established training procedures commonly used in large language models (LLMs), we introduce an instruction-tuning template specifically designed for motion-related tasks. Our approach demonstrates that the principles underlying LLM training can be successfully adapted to learn a wide range of motion generation tasks across multiple embodiments with different action dimensions. We demonstrate the various abilities of MotionGlot on a set of 6 tasks and report an average improvement of 35.3% across tasks. Additionally, we contribute two new datasets: (1) a dataset of expert-controlled quadruped locomotion with approximately 48,000 trajectories paired with direction-based text annotations, and (2) a dataset of over 23,000 situational text prompts for human motion generation tasks. Finally, we conduct hardware experiments to validate the capabilities of our system in real-world applications.

Via

Access Paper or Ask Questions

CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes

Oct 19, 2024

Xianghao Xu, Srinath Sridhar, Daniel Ritchie

Figure 1 for CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes

Figure 2 for CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes

Figure 3 for CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes

Figure 4 for CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes

Abstract:We propose a zero-shot text-driven 3D shape deformation system that deforms an input 3D mesh of a manufactured object to fit an input text description. To do this, our system optimizes the parameters of a deformation model to maximize an objective function based on the widely used pre-trained vision language model CLIP. We find that CLIP-based objective functions exhibit many spurious local optima; to circumvent them, we parameterize deformations using a novel deformation model called BoxDefGraph which our system automatically computes from an input mesh, the BoxDefGraph is designed to capture the object aligned rectangular/circular geometry features of most manufactured objects. We then use the CMA-ES global optimization algorithm to maximize our objective, which we find to work better than popular gradient-based optimizers. We demonstrate that our approach produces appealing results and outperforms several baselines.

Via

Access Paper or Ask Questions

EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos

Jul 30, 2024

Aashish Rai, Srinath Sridhar

Figure 1 for EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos

Figure 2 for EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos

Figure 3 for EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos

Figure 4 for EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos

Abstract:We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot easily capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strength of latent diffusion models for conditioned audio synthesis. We first encode and process audio and video data into a form that is suitable for generation. The encoded data is used to train our model to generate audio tracks that capture the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enables temporal synchronization to the synthesized audio. Extensive evaluations show that our model outperforms existing work in audio quality, and in our newly proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.

* preprint

Via

Access Paper or Ask Questions

GenHeld: Generating and Editing Handheld Objects

Jun 07, 2024

Chaerin Min, Srinath Sridhar

Figure 1 for GenHeld: Generating and Editing Handheld Objects

Figure 2 for GenHeld: Generating and Editing Handheld Objects

Figure 3 for GenHeld: Generating and Editing Handheld Objects

Figure 4 for GenHeld: Generating and Editing Handheld Objects

Abstract:Grasping is an important human activity that has long been studied in robotics, computer vision, and cognitive science. Most existing works study grasping from the perspective of synthesizing hand poses conditioned on 3D or 2D object representations. We propose GenHeld to address the inverse problem of synthesizing held objects conditioned on 3D hand model or 2D image. Given a 3D model of hand, GenHeld 3D can select a plausible held object from a large dataset using compact object representations called object codes.The selected object is then positioned and oriented to form a plausible grasp without changing hand pose. If only a 2D hand image is available, GenHeld 2D can edit this image to add or replace a held object. GenHeld 2D operates by combining the abilities of GenHeld 3D with diffusion-based image editing. Results and experiments show that we outperform baselines and can generate plausible held objects in both 2D and 3D. Our experiments demonstrate that our method achieves high quality and plausibility of held object synthesis in both 3D and 2D.

Via

Access Paper or Ask Questions