Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hui Li

Jiangnan University, Wuxi, China

Kronecker Mask and Interpretive Prompts are Language-Action Video Learners

Feb 05, 2025

Jingyi Yang, Zitong Yu, Xiuming Ni, Jia He, Hui Li

Abstract:Contrastive language-image pretraining (CLIP) has significantly advanced image-based vision learning. A pressing topic subsequently arises: how can we effectively adapt CLIP to the video domain? Recent studies have focused on adjusting either the textual or visual branch of CLIP for action recognition. However, we argue that adaptations of both branches are crucial. In this paper, we propose \textbf{CLAVER}: a \textbf{C}ontrastive \textbf{L}anguage-\textbf{A}ction \textbf{V}ideo Learn\textbf{er}, designed to shift CLIP's focus from the alignment of static visual objects and concrete nouns to the alignment of dynamic action behaviors and abstract verbs. Specifically, we introduce a novel Kronecker mask attention for temporal modeling. Our tailored Kronecker mask offers three benefits 1) it expands the temporal receptive field for each token, 2) it serves as an effective spatiotemporal heterogeneity inductive bias, mitigating the issue of spatiotemporal homogenization, and 3) it can be seamlessly plugged into transformer-based models. Regarding the textual branch, we leverage large language models to generate diverse, sentence-level and semantically rich interpretive prompts of actions, which shift the model's focus towards the verb comprehension. Extensive experiments on various benchmarks and learning scenarios demonstrate the superiority and generality of our approach. The code will be available soon.

Via

Access Paper or Ask Questions

Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Feb 03, 2025

Aidan Curtis, Eric Li, Michael Noseworthy, Nishad Gothoskar, Sachin Chitta, Hui Li, Leslie Pack Kaelbling, Nicole Carey

Figure 1 for Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Figure 2 for Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Figure 3 for Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Figure 4 for Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Abstract:Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies trained in simulation. By randomizing environment properties during training, the learned policy can become robust to uncertainties along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate automatically discovering a sampling distribution via entropy-regularized reward maximization of a normalizing-flow-based neural sampling distribution. We show that this architecture is more flexible and provides greater robustness than existing approaches that learn simpler, parameterized sampling distributions, as demonstrated in six simulated and one real-world robotics domain. Lastly, we explore how these learned sampling distributions, combined with a privileged value function, can be used for out-of-distribution detection in an uncertainty-aware multi-step manipulation planner.

Via

Access Paper or Ask Questions

FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Jan 14, 2025

Yabo Zhang, Xinpeng Zhou, Yihan Zeng, Hang Xu, Hui Li, Wangmeng Zuo

Figure 1 for FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Figure 2 for FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Figure 3 for FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Figure 4 for FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Abstract:Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, so necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, so that inherit powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various of editing signals: it domainantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, \eg, automatically adjust the reflection of the cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, \eg, transform the clownfish into shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.

* Code: https://github.com/YBYBZhang/FramePainter

Via

Access Paper or Ask Questions

Method of data forward generation with partial differential equations for machine learning modeling in fluid mechanics

Jan 06, 2025

Ruilin Chen, Xiaowei Jin, Nikolaus A. Adams, Hui Li

Figure 1 for Method of data forward generation with partial differential equations for machine learning modeling in fluid mechanics

Figure 2 for Method of data forward generation with partial differential equations for machine learning modeling in fluid mechanics

Figure 3 for Method of data forward generation with partial differential equations for machine learning modeling in fluid mechanics

Figure 4 for Method of data forward generation with partial differential equations for machine learning modeling in fluid mechanics

Abstract:Artificial intelligence (AI) for fluid mechanics has become attractive topic. High-fidelity data is one of most critical issues for the successful applications of AI in fluid mechanics, however, it is expensively obtained or even inaccessible. This study proposes a high-efficient data forward generation method from the partial differential equations (PDEs). Specifically, the solutions of the PDEs are first generated either following a random field (e.g. Gaussian random field, GRF, computational complexity O(NlogN), N is the number of spatial points) or physical laws (e.g. a kind of spectra, computational complexity O(NM), M is the number of modes), then the source terms, boundary conditions and initial conditions are computed to satisfy PDEs. Thus, the data pairs of source terms, boundary conditions and initial conditions with corresponding solutions of PDEs can be constructed. A Poisson neural network (Poisson-NN) embedded in projection method and a wavelet transform convolutional neuro network (WTCNN) embedded in multigrid numerical simulation for solving incompressible Navier-Stokes equations is respectively proposed. The feasibility of generated data for training Poisson-NN and WTCNN is validated. The results indicate that even without any DNS data, the generated data can train these two models with excellent generalization and accuracy. The data following physical laws can significantly improve the convergence rate, generalization and accuracy than that generated following GRF.

Via

Access Paper or Ask Questions

Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

Jan 05, 2025

Hui Li, Xiaoyu Ren, Hongjiu Yu, Huiyu Duan, Kai Li, Ying Chen, Libo Wang, Xiongkuo Min, Guangtao Zhai, Xu Liu

Figure 1 for Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

Figure 2 for Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

Figure 3 for Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

Figure 4 for Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

Abstract:Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in this paper we present LiveBeauty, the first large-scale live-specific FAP dataset, in a more challenging application scenario, i.e., live streaming. 10,000 face images are collected from a live streaming platform directly, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset in the challenging live scenario. Furthermore, a multi-modal FAP method is proposed to measure the facial attractiveness in live streaming. Specifically, we first extract holistic facial prior knowledge and multi-modal aesthetic semantic features via a Personalized Attractiveness Prior Module (PAPM) and a Multi-modal Attractiveness Encoder Module (MAEM), respectively, then integrate the extracted features through a Cross-Modal Fusion Module (CMFM). Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. Dataset will be available soon.

Via

Access Paper or Ask Questions

DeepSeek-V3 Technical Report

Dec 27, 2024

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang(+188 more)

Figure 1 for DeepSeek-V3 Technical Report

Figure 2 for DeepSeek-V3 Technical Report

Figure 3 for DeepSeek-V3 Technical Report

Figure 4 for DeepSeek-V3 Technical Report

Abstract:We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Via

Access Paper or Ask Questions

PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Dec 09, 2024

Qian Zhang, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Mei Chen, Hui Li, Yanhao Wang

Figure 1 for PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Figure 2 for PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Figure 3 for PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Figure 4 for PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Abstract:The emergence of Large Language Models (LLMs) in the medical domain has stressed a compelling need for standard datasets to evaluate their question-answering (QA) performance. Although there have been several benchmark datasets for medical QA, they either cover common knowledge across different departments or are specific to another department rather than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations for further improvements. Our code and data are published at https://github.com/ACMISLab/PediaBench.

* 21 pages, 12 figures

Via

Access Paper or Ask Questions

Video2Reward: Generating Reward Function from Videos for Legged Robot Behavior Learning

Dec 07, 2024

Runhao Zeng, Dingjie Zhou, Qiwei Liang, Junlin Liu, Hui Li, Changxin Huang, Jianqiang Li, Xiping Hu, Fuchun Sun

Abstract:Learning behavior in legged robots presents a significant challenge due to its inherent instability and complex constraints. Recent research has proposed the use of a large language model (LLM) to generate reward functions in reinforcement learning, thereby replacing the need for manually designed rewards by experts. However, this approach, which relies on textual descriptions to define learning objectives, fails to achieve controllable and precise behavior learning with clear directionality. In this paper, we introduce a new video2reward method, which directly generates reward functions from videos depicting the behaviors to be mimicked and learned. Specifically, we first process videos containing the target behaviors, converting the motion information of individuals in the videos into keypoint trajectories represented as coordinates through a video2text transforming module. These trajectories are then fed into an LLM to generate the reward function, which in turn is used to train the policy. To enhance the quality of the reward function, we develop a video-assisted iterative reward refinement scheme that visually assesses the learned behaviors and provides textual feedback to the LLM. This feedback guides the LLM to continually refine the reward function, ultimately facilitating more efficient behavior learning. Experimental results on tasks involving bipedal and quadrupedal robot motion control demonstrate that our method surpasses the performance of state-of-the-art LLM-based reward generation methods by over 37.6% in terms of human normalized score. More importantly, by switching video inputs, we find our method can rapidly learn diverse motion behaviors such as walking and running.

* Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2024), Santiago de Compostela, Spain, October 19-24, 2024. Frontiers in Artificial Intelligence and Applications, vol. 392, IOS Press, pp. 4369-4376
* 8 pages, 6 figures, ECAI2024

Via

Access Paper or Ask Questions

OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Dec 03, 2024

Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang(+1 more)

Figure 1 for OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Figure 2 for OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Figure 3 for OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Figure 4 for OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Abstract:Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos. Project page https://fudan-generative-vision.github.io/OpenHumanVid

* 11 pages, 8 figures, 5 tables

Via

Access Paper or Ask Questions

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Dec 01, 2024

Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu

Abstract:Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://github.com/fudan-generative-vision/hallo3.

Via

Access Paper or Ask Questions