Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tianyu He

VidTok: A Versatile and Open-Source Video Tokenizer

Dec 17, 2024

Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, Jiang Bian

Figure 1 for VidTok: A Versatile and Open-Source Video Tokenizer

Figure 2 for VidTok: A Versatile and Open-Source Video Tokenizer

Figure 3 for VidTok: A Versatile and Open-Source Video Tokenizer

Figure 4 for VidTok: A Versatile and Open-Source Video Tokenizer

Abstract:Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy in pixel-level representations. Consequently, there is a growing demand for high-performance, open-source video tokenizers as video-centric research gains prominence. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenizations. VidTok incorporates several key advancements over existing approaches: 1) model architecture such as convolutional layers and up/downsampling modules; 2) to address the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ), we integrate Finite Scalar Quantization (FSQ) into discrete video tokenization; 3) improved training strategies, including a two-stage training process and the use of reduced frame rates. By integrating these advancements, VidTok achieves substantial improvements over existing methods, demonstrating superior performance across multiple metrics, including PSNR, SSIM, LPIPS, and FVD, under standardized evaluation settings.

* Code & Models: https://github.com/microsoft/VidTok

Via

Access Paper or Ask Questions

Compositional 3D-aware Video Generation with LLM Director

Aug 31, 2024

Hanxin Zhu, Tianyu He, Anni Tang, Junliang Guo, Zhibo Chen, Jiang Bian

Figure 1 for Compositional 3D-aware Video Generation with LLM Director

Figure 2 for Compositional 3D-aware Video Generation with LLM Director

Figure 3 for Compositional 3D-aware Video Generation with LLM Director

Figure 4 for Compositional 3D-aware Video Generation with LLM Director

Abstract:Significant progress has been made in text-to-video generation through the use of powerful generative models and large-scale internet data. However, substantial challenges remain in precisely controlling individual concepts within the generated video, such as the motion and appearance of specific characters and the movement of viewpoints. In this work, we propose a novel paradigm that generates each concept in 3D representation separately and then composes them with priors from Large Language Models (LLM) and 2D diffusion models. Specifically, given an input textual prompt, our scheme consists of three stages: 1) We leverage LLM as the director to first decompose the complex query into several sub-prompts that indicate individual concepts within the video~(\textit{e.g.}, scene, objects, motions), then we let LLM to invoke pre-trained expert models to obtain corresponding 3D representations of concepts. 2) To compose these representations, we prompt multi-modal LLM to produce coarse guidance on the scales and coordinates of trajectories for the objects. 3) To make the generated frames adhere to natural image distribution, we further leverage 2D diffusion priors and use Score Distillation Sampling to refine the composition. Extensive experiments demonstrate that our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept. Project page: \url{https://aka.ms/c3v}.

Via

Access Paper or Ask Questions

A Generic Review of Integrating Artificial Intelligence in Cognitive Behavioral Therapy

Jul 28, 2024

Meng Jiang, Qing Zhao, Jianqiang Li, Fan Wang, Tianyu He, Xinyan Cheng, Bing Xiang Yang, Grace W. K. Ho, Guanghui Fu

Abstract:Cognitive Behavioral Therapy (CBT) is a well-established intervention for mitigating psychological issues by modifying maladaptive cognitive and behavioral patterns. However, delivery of CBT is often constrained by resource limitations and barriers to access. Advancements in artificial intelligence (AI) have provided technical support for the digital transformation of CBT. Particularly, the emergence of pre-training models (PTMs) and large language models (LLMs) holds immense potential to support, augment, optimize and automate CBT delivery. This paper reviews the literature on integrating AI into CBT interventions. We begin with an overview of CBT. Then, we introduce the integration of AI into CBT across various stages: pre-treatment, therapeutic process, and post-treatment. Next, we summarized the datasets relevant to some CBT-related tasks. Finally, we discuss the benefits and current limitations of applying AI to CBT. We suggest key areas for future research, highlighting the need for further exploration and validation of the long-term efficacy and clinical utility of AI-enhanced CBT. The transformative potential of AI in reshaping the practice of CBT heralds a new era of more accessible, efficient, and personalized mental health interventions.

Via

Access Paper or Ask Questions

Cheems: Wonderful Matrices More Efficient and More Effective Architecture

Jul 25, 2024

Jingze Shi, Lu He, Yuhan Wang, Tianyu He, Bingheng Wu, Mingkun Hou

Figure 1 for Cheems: Wonderful Matrices More Efficient and More Effective Architecture

Figure 2 for Cheems: Wonderful Matrices More Efficient and More Effective Architecture

Figure 3 for Cheems: Wonderful Matrices More Efficient and More Effective Architecture

Figure 4 for Cheems: Wonderful Matrices More Efficient and More Effective Architecture

Abstract:Recent studies have shown that, relative position encoding performs well in selective state space model scanning algorithms, and the architecture that balances SSM and Attention enhances the efficiency and effectiveness of the algorithm, while the sparse activation of the mixture of experts reduces the training cost. I studied the effectiveness of using different position encodings in structured state space dual algorithms, and the more effective SSD-Attn internal and external function mixing method, and designed a more efficient cross domain mixture of experts. I found that the same matrix is very wonderful in different algorithms, which allows us to establish a new hybrid sparse architecture: Cheems. Compared with other hybrid architectures, it is more efficient and more effective in language modeling tasks.

Via

Access Paper or Ask Questions

Video In-context Learning

Jul 10, 2024

Wentao Zhang, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian

Abstract:In-context learning for vision data has been underexplored compared with that in natural language. Previous works studied image in-context learning, urging models to generate a single image guided by demonstrations. In this paper, we propose and study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences, each semantically guided by the prompted video demonstrations. To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets. We thoroughly analyze the effect of different datasets and represent frames as discrete tokens, and then model them by next token predictions. We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results. Our model follows the scaling law and generates high-quality video clips that accurately align with the semantic guidance provided by in-context examples.

Via

Access Paper or Ask Questions

GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors

Jun 14, 2024

Xiqian Yu, Hanxin Zhu, Tianyu He, Zhibo Chen

Figure 1 for GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors

Figure 2 for GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors

Figure 3 for GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors

Figure 4 for GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors

Abstract:Achieving high-resolution novel view synthesis (HRNVS) from low-resolution input views is a challenging task due to the lack of high-resolution data. Previous methods optimize high-resolution Neural Radiance Field (NeRF) from low-resolution input views but suffer from slow rendering speed. In this work, we base our method on 3D Gaussian Splatting (3DGS) due to its capability of producing high-quality images at a faster rendering speed. To alleviate the shortage of data for higher-resolution synthesis, we propose to leverage off-the-shelf 2D diffusion priors by distilling the 2D knowledge into 3D with Score Distillation Sampling (SDS). Nevertheless, applying SDS directly to Gaussian-based 3D super-resolution leads to undesirable and redundant 3D Gaussian primitives, due to the randomness brought by generative priors. To mitigate this issue, we introduce two simple yet effective techniques to reduce stochastic disturbances introduced by SDS. Specifically, we 1) shrink the range of diffusion timestep in SDS with an annealing strategy; 2) randomly discard redundant Gaussian primitives during densification. Extensive experiments have demonstrated that our proposed GaussainSR can attain high-quality results for HRNVS with only low-resolution inputs on both synthetic and real-world datasets. Project page: https://chchnii.github.io/GaussianSR/

Via

Access Paper or Ask Questions

Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Jun 12, 2024

Runyi Yu, Tianyu He, Ailing Zeng, Yuchi Wang, Junliang Guo, Xu Tan, Chang Liu, Jie Chen, Jiang Bian

Figure 1 for Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Figure 2 for Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Figure 3 for Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Figure 4 for Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Abstract:We aim to edit the lip movements in talking video according to the given speech while preserving the personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis. Current solutions handle the two sub-problems within a single generative model, resulting in a challenging trade-off between lip-sync quality and visual details preservation. Instead, we propose to disentangle the motion and appearance, and then generate them one by one with a speech-to-motion diffusion model and a motion-conditioned appearance generation model. However, there still remain challenges in each stage, such as motion-aware identity preservation in (1) and visual details preservation in (2). Therefore, to preserve personal identity, we adopt landmarks to represent the motion, and further employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders to encode the lip, non-lip appearance and motion, and then integrate them with a learned fusion module. We train MyTalk on a large-scale and diverse dataset. Experiments show that our method generalizes well to the unknown, even out-of-domain person, in terms of both lip sync and visual detail preservation. We encourage the readers to watch the videos on our project page (https://Ingrid789.github.io/MyTalk/).

* 14 pages of main text, 23 pages in total, 9 figures

Via

Access Paper or Ask Questions

Grokking Modular Polynomials

Jun 05, 2024

Darshil Doshi, Tianyu He, Aritra Das, Andrey Gromov

Figure 1 for Grokking Modular Polynomials

Figure 2 for Grokking Modular Polynomials

Figure 3 for Grokking Modular Polynomials

Abstract:Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest. This limitation remains unmoved by the choice of architecture and training strategies. On the other hand, an analytical solution for the weights of Multi-layer Perceptron (MLP) networks that generalize on the modular addition task is known in the literature. In this work, we (i) extend the class of analytical solutions to include modular multiplication as well as modular addition with many terms. Additionally, we show that real networks trained on these datasets learn similar solutions upon generalization (grokking). (ii) We combine these "expert" solutions to construct networks that generalize on arbitrary modular polynomials. (iii) We hypothesize a classification of modular polynomials into learnable and non-learnable via neural networks training; and provide experimental evidence supporting our claims.

* 7+4 pages, 3 figures, 2 tables

Via

Access Paper or Ask Questions

Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Jun 04, 2024

Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov

Figure 1 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Figure 2 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Figure 3 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Figure 4 for Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Abstract:Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a \, x + b \, y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is \emph{transient}, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing the highly structured representations in both phases; and discuss the learnt algorithm.

* 21 pages, 19 figures

Via

Access Paper or Ask Questions

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

May 24, 2024

Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian

Abstract:Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, equipped with a novel two-branch diffusion-based generator to predict avatars with audio and text instructions at the same time. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/.

* Project page: https://wangyuchi369.github.io/InstructAvatar/

Via

Access Paper or Ask Questions