Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haoxin Chen

Brian

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Apr 22, 2026

Open-H-Embodiment Consortium, :, Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo(+206 more)

Abstract:Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

* Project website: https://open-h.github.io/open-h-embodiment/

Via

Access Paper or Ask Questions

Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Jul 14, 2024

Qinyu Yang, Haoxin Chen, Yong Zhang, Menghan Xia, Xiaodong Cun, Zhixun Su, Ying Shan

Figure 1 for Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Figure 2 for Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Figure 3 for Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Figure 4 for Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Abstract:In order to improve the quality of synthesized videos, currently, one predominant method involves retraining an expert diffusion model and then implementing a noising-denoising process for refinement. Despite the significant training costs, maintaining consistency of content between the original and enhanced videos remains a major challenge. To tackle this challenge, we propose a novel formulation that considers both visual quality and consistency of content. Consistency of content is ensured by a proposed loss function that maintains the structure of the input, while visual quality is improved by utilizing the denoising process of pretrained diffusion models. To address the formulated optimization problem, we have developed a plug-and-play noise optimization strategy, referred to as Noise Calibration. By refining the initial random noise through a few iterations, the content of original video can be largely preserved, and the enhancement effect demonstrates a notable improvement. Extensive experiments have demonstrated the effectiveness of the proposed method.

* ECCV 2024, Project Page: https://yangqy1110.github.io/NC-SDEdit/, Code Repo: https://github.com/yangqy1110/NC-SDEdit/

Via

Access Paper or Ask Questions

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Feb 16, 2024

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen(+2 more)

Figure 1 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Figure 2 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Figure 3 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Figure 4 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Abstract:Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.

* Project Page: https://guolanqing.github.io/Self-Cascade/

Via

Access Paper or Ask Questions

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Jan 17, 2024

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan

Figure 1 for VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Figure 2 for VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Figure 3 for VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Figure 4 for VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Abstract:Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.

* Homepage: https://ailab-cvc.github.io/videocrafter; Github: https://github.com/AILab-CVC/VideoCrafter

Via

Access Paper or Ask Questions

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Dec 01, 2023

Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao Wang, Yujiu Yang, Ying Shan

Figure 1 for StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Figure 2 for StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Figure 3 for StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Figure 4 for StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Abstract:Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image. Considering the scarcity of stylized video datasets, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy. Additionally, we design a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors.

* Project page: https://gongyeliu.github.io/StyleCrafter.github.io/ ; GitHub repository: https://github.com/GongyeLiu/StyleCrafter

Via

Access Paper or Ask Questions

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Oct 30, 2023

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang(+2 more)

Figure 1 for VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Figure 2 for VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Figure 3 for VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Figure 4 for VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Abstract:Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.

* Tech Report; Github: https://github.com/AILab-CVC/VideoCrafter Homepage: https://ailab-cvc.github.io/videocrafter/

Via

Access Paper or Ask Questions

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Oct 18, 2023

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan

Figure 1 for EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Figure 2 for EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Figure 3 for EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Figure 4 for EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Abstract:The vision and language generative models have been overgrown in recent years. For video generation, various open-sourced models and public-available services are released for generating high-visual quality videos. However, these methods often use a few academic metrics, for example, FVD or IS, to evaluate the performance. We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Thus, we propose a new framework and pipeline to exhaustively evaluate the performance of the generated videos. To achieve this, we first conduct a new prompt list for text-to-video generation by analyzing the real-world prompt list with the help of the large language model. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmarks, in terms of visual qualities, content qualities, motion qualities, and text-caption alignment with around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients to align the objective metrics to the users' opinions. Based on the proposed opinion alignment method, our final score shows a higher correlation than simply averaging the metrics, showing the effectiveness of the proposed evaluation method.

* Technical Report, Project page: https://evalcrafter.github.io/

Via

Access Paper or Ask Questions

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Oct 18, 2023

Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, Ying Shan

Figure 1 for DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Figure 2 for DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Figure 3 for DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Figure 4 for DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Abstract:Enhancing a still image with motion offers more engaged visual experience. Traditional image animation techniques mainly focus on animating natural scenes with random dynamics, such as clouds and fluid, and thus limits their applicability to generic visual contents. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance. Given an image, we first project it into a text-aligned rich image embedding space using a learnable image encoding network, which facilitates the video model to digest the image content compatibly. However, some visual details still struggle to be preserved in the resulting videos. To supplement more precise image information, we further feed the full image to the diffusion model by concatenating it with the initial noises. Experimental results reveal that our proposed method produces visually convincing animated videos, exhibiting both natural motions and high fidelity to the input image. Comparative evaluation demonstrates the notable superiority of our approach over existing competitors. The source code will be released upon publication.

* Preliminary demo code: https://github.com/AILab-CVC/VideoCrafter

Via

Access Paper or Ask Questions

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Oct 11, 2023

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan

Figure 1 for ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Figure 2 for ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Figure 3 for ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Figure 4 for ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Abstract:In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

* Project page: https://yingqinghe.github.io/scalecrafter/ Github: https://github.com/YingqingHe/ScaleCrafter

Via

Access Paper or Ask Questions

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Jul 13, 2023

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan(+1 more)

Figure 1 for Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Figure 2 for Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Figure 3 for Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Figure 4 for Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Abstract:Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by developing a framework comprised of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instruction. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the specification of the desired character identities through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines.

* Github: https://github.com/VideoCrafter/Animate-A-Story Project page: https://videocrafter.github.io/Animate-A-Story

Via

Access Paper or Ask Questions