Anyi Rao

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Feb 12, 2025

Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao(+3 more)

Abstract:Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the image quality, ensuring coherent lighting transitions across frames. Project page: https://bujiazi.github.io/light-a-video.github.io/.

* Project Page: https://bujiazi.github.io/light-a-video.github.io/
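
The linear blending described in the abstract can be illustrated with a minimal sketch. It assumes a simple linear weight schedule and flat lists of pixel intensities. The function names `fusion_weights` and `blend`, and the schedule itself, are hypothetical and not taken from the paper, whose actual Progressive Light Fusion procedure may differ.

```python
def fusion_weights(num_steps):
    """Hypothetical linear schedule: the weight on the relit
    appearance rises from 0.0 to 1.0 over num_steps steps."""
    if num_steps < 2:
        return [1.0]
    return [t / (num_steps - 1) for t in range(num_steps)]


def blend(source, relit, w):
    """Per-pixel linear blend: (1 - w) * source + w * relit."""
    return [(1.0 - w) * s + w * r for s, r in zip(source, relit)]


# Toy "frames": three pixel intensities each (illustrative values only).
source = [0.2, 0.4, 0.6]
relit = [0.8, 0.6, 0.2]

for w in fusion_weights(5):
    print(round(w, 2), [round(v, 3) for v in blend(source, relit, w)])
```

With a weight of 0 the output equals the source appearance, and with a weight of 1 it equals the relit appearance. Intermediate weights move gradually between the two, which is the intuition behind a smooth transition in illumination.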

ScriptViz: A Visualization Tool to Aid Scriptwriting based on a Large Movie Database

Oct 04, 2024

Anyi Rao, Jean-Peïc Chou, Maneesh Agrawala

Abstract:Scriptwriters usually rely on their mental visualization to create a vivid story by using their imagination to see, feel, and experience the scenes they are writing. Besides mental visualization, they often refer to existing images or scenes in movies and analyze the visual elements to create a certain mood or atmosphere. In this paper, we develop ScriptViz to provide external visualization based on a large movie database for the screenwriting process. It retrieves reference visuals on the fly based on scripts' text and dialogue from a large movie database. The tool provides two types of control on visual elements that enable writers to 1) see exactly what they want with fixed visual elements and 2) see variances in uncertain elements. User evaluation among 15 scriptwriters shows that ScriptViz is able to present scriptwriters with consistent yet diverse visual possibilities, aligning closely with their scripts and helping their creation.

* Accepted in the 37th Annual ACM Symposium on User Interface Software and Technology (UIST'24). Webpage: https://virtualfilmstudio.github.io/projects/scriptviz

CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion

Aug 30, 2024

Yiran Chen, Anyi Rao, Xuekun Jiang, Shishi Xiao, Ruiqing Ma, Zeyu Wang, Hui Xiong, Bo Dai

Abstract:With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods mainly rely on text descriptions and struggle with camera placement, a key component of previsualization. To address these issues, we introduce CinePreGen, a visual previsualization system enhanced with engine-powered diffusion. It features a novel camera and storyboard interface that offers dynamic control, from global to local camera adjustments. This is combined with a user-friendly AI rendering workflow, which aims to achieve consistent results through multi-masked IP-Adapter and engine simulation guidelines. In our comprehensive evaluation study, we demonstrate that our system reduces development viscosity (i.e., the complexity and challenges in the development process), meets users' needs for extensive control and iteration in the design process, and outperforms other AI video production workflows in cinematic camera movement, as shown by our experiments and a within-subjects user study. With its intuitive camera controls and realistic rendering of camera motion, CinePreGen shows great potential for improving video production for both individual creators and industry professionals.

Cinematic Behavior Transfer via NeRF-based Differentiable Filming

Nov 29, 2023

Xuekun Jiang, Anyi Rao, Jingbo Wang, Dahua Lin, Bo Dai

Abstract:In the evolving landscape of digital media and video production, the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections, neglecting 3D statuses. To address these issues, we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. The incorporation of 3D engine workflow enables superior rendering and control abilities, which also achieves a higher rating in the user study.

* Project Page: https://virtualfilmstudio.github.io/projects/cinetransfer

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

Nov 28, 2023

Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, Bo Dai

Abstract:The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at https://guoyww.github.io/projects/SparseCtrl .

* Project page: https://guoyww.github.io/projects/SparseCtrl

Automated Conversion of Music Videos into Lyric Videos

Aug 28, 2023

Jiaju Ma, Anyi Rao, Li-Yi Wei, Rubaiat Habib Kazi, Hijung Valentina Shin, Maneesh Agrawala

Abstract:Musicians and fans often produce lyric videos, a form of music videos that showcase the song's lyrics, for their favorite songs. However, making such videos can be challenging and time-consuming as the lyrics need to be added in synchrony and visual harmony with the video. Informed by prior work and close examination of existing lyric videos, we propose a set of design guidelines to help creators make such videos. Our guidelines ensure the readability of the lyric text while maintaining a unified focus of attention. We instantiate these guidelines in a fully automated pipeline that converts an input music video into a lyric video. We demonstrate the robustness of our pipeline by generating lyric videos from a diverse range of input sources. A user study shows that lyric videos generated by our pipeline are effective in maintaining text readability and unifying the focus of attention.

Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

Aug 07, 2023

Yujie Zhou, Wenwen Qiang, Anyi Rao, Ning Lin, Bing Su, Jiaqi Wang

Abstract:Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a single feature vector and then mapping the features to an identical anchor point within the embedding space. Their performance is hindered by 1) ignoring the global visual/semantic distribution alignment, which limits their ability to capture the true interdependence between the two spaces; and 2) neglecting temporal information, since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between visual and semantic space for distribution alignment; 2) we leverage the temporal information for estimating the MI by encouraging MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE.

* Accepted by ACM MM 2023

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Jul 10, 2023

Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai

Abstract:With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at https://animatediff.github.io/ .

* Project page: https://animatediff.github.io/

HireVAE: An Online and Adaptive Factor Model Based on Hierarchical and Regime-Switch VAE

Jun 05, 2023

Zikai Wei, Anyi Rao, Bo Dai, Dahua Lin

Abstract:Factor models are a fundamental tool in quantitative investment and can be empowered by deep learning to become more flexible and efficient in complicated practical investing situations. However, it is still an open question how to build a factor model that can conduct stock prediction in an online and adaptive setting, where the model adapts itself to match the current market regime identified based on only point-in-time market information. To tackle this problem, we propose the first deep learning based online and adaptive factor model, HireVAE, at the core of which is a hierarchical latent space that embeds the underlying relationship between the market situation and stock-wise latent factors, so that HireVAE can effectively estimate useful latent factors given only historical market information and subsequently predict accurate stock returns. Across four commonly used real stock market benchmarks, the proposed HireVAE demonstrates superior performance in terms of active returns over previous methods, verifying the potential of such an online and adaptive factor model.

* Accepted to IJCAI 2023

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

May 27, 2023

Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang

Abstract:Vision-language models have achieved tremendous progress far beyond what we ever expected. However, their computational costs and latency are also dramatically growing with rapid development, making model acceleration exceedingly critical for researchers with limited resources and consumers with low-end devices. Although extensively studied for unimodal models, the acceleration for multimodal models, especially the vision-language Transformers, is still relatively under-explored. Accordingly, this paper proposes Cross-Guided Ensemble of Tokens (CrossGET) as a universal vision-language Transformer acceleration framework, which adaptively reduces token numbers during inference via cross-modal guidance on-the-fly, leading to significant model acceleration while keeping high performance. Specifically, the proposed CrossGET has two key designs: 1) Cross-Guided Matching and Ensemble. CrossGET incorporates cross-modal guided token matching and ensemble to merge tokens effectively, only introducing cross-modal tokens with negligible extra parameters. 2) Complete-Graph Soft Matching. In contrast to the previous bipartite soft matching approach, CrossGET introduces an efficient and effective complete-graph soft matching policy to achieve more reliable token-matching results. Extensive experiments on various vision-language tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed CrossGET framework. The code will be at https://github.com/sdc17/CrossGET.

* Preprint
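
To give a rough feel for reducing token numbers by merging similar tokens, here is a heavily simplified sketch. It compares every pair of tokens by cosine similarity and averages the single most similar pair. This is an illustrative assumption, not the CrossGET algorithm: it includes no cross-modal guidance, and the helper names `cosine` and `merge_most_similar` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def merge_most_similar(tokens):
    """Compare all token pairs, then replace the most similar pair
    with its element-wise average (appended at the end)."""
    if len(tokens) < 2:
        return list(tokens)
    best = None  # (similarity, i, j)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if best is None or s > best[0]:
                best = (s, i, j)
    _, i, j = best
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    kept = [t for k, t in enumerate(tokens) if k not in (i, j)]
    return kept + [merged]


# Toy token embeddings (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
reduced = merge_most_similar(tokens)
print(len(tokens), "->", len(reduced))
```

Each call removes one token, so repeated calls shrink the sequence step by step. Comparing all pairs costs quadratic time in the number of tokens, which is one reason practical methods rely on more efficient matching policies.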
