Kai Han

and Other Contributors

DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning

Mar 09, 2025

Chengxuan Qian, Kai Han, Jingchao Wang, Zhenlong Yuan, Rui Qian, Chongwen Lyu, Jun Chen, Zhe Liu

Abstract:Multimodal learning integrates complementary information from diverse modalities to enhance the decision-making process. However, the potential of multimodal collaboration remains under-exploited due to disparities in data quality and modality representation capabilities. To address this, we introduce DynCIM, a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample's difficulty according to prediction deviation, consistency, and stability, while a modality-level curriculum measures modality contributions from both global and local perspectives. Furthermore, a gating-based dynamic fusion mechanism is introduced to adaptively adjust modality contributions, minimizing redundancy and optimizing fusion effectiveness. Extensive experiments on six multimodal benchmarking datasets, spanning both bimodal and trimodal scenarios, demonstrate that DynCIM consistently outperforms state-of-the-art methods. Our approach effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks. Our code is available at https://github.com/Raymond-Qiancx/DynCIM.
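The gating-based fusion mentioned in the abstract can be illustrated with a minimal sketch. It assumes a softmax gate over one scalar score per modality followed by a weighted sum of equal-length feature vectors; the names `gated_fusion`, `features`, and `gate_scores` are hypothetical, and this is not the DynCIM implementation (see the authors' repository for that).

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features: Dict[str, List[float]],
                 gate_scores: Dict[str, float]) -> List[float]:
    """Fuse per-modality feature vectors with softmax-normalised gates.

    `features` maps a modality name to its feature vector (all the same
    length); `gate_scores` maps the same names to unnormalised gate logits.
    Both inputs are hypothetical and chosen only for illustration.
    """
    names = sorted(features)
    weights = softmax([gate_scores[n] for n in names])
    fused = [0.0] * len(features[names[0]])
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused


features = {"audio": [1.0, 0.0], "video": [0.0, 1.0]}
# Equal gate logits give each modality the same weight.
fused_equal = gated_fusion(features, {"audio": 0.0, "video": 0.0})
# A larger logit for "video" shifts the fused vector towards that modality.
fused_video = gated_fusion(features, {"audio": 0.0, "video": 2.0})
```

Because the gate weights sum to one, raising one modality's score necessarily lowers the contribution of the others, which is one simple way to adjust modality contributions adaptively.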

ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

Feb 21, 2025

Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman

Abstract:The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP/SigLIP and the state-of-the-art BLIP-2 architectures. To train the architecture with limited computing resources, we develop a 'student friendly' best practice involving global hard sample mining, and selection and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. Benefiting from the novel architecture and data curation, experiments show our enhanced network significantly boosts CLIP/SigLIP performance and outperforms the state-of-the-art BLIP-2 model on text-to-image retrieval.

Parallel Sequence Modeling via Generalized Spatial Propagation Network

Jan 21, 2025

Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu

Abstract:We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over $84\times$ when generating 16K images.
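The claim that a line scan reduces the effective sequence length to $\sqrt{N}$ can be seen in a toy sketch. It assumes fixed, uniform three-neighbour weights that sum to one and a hypothetical blending parameter `mix`; the learnable, input-dependent weights and the Stability-Context Condition described in the abstract are not reproduced here.

```python
from typing import List, Tuple


def line_scan(grid: List[List[float]],
              mix: float = 0.5) -> Tuple[List[List[float]], int]:
    """Top-to-bottom line scan over a 2D grid.

    Each row's hidden state blends the input row with the average of up to
    three neighbouring entries of the previous row's hidden state. Positions
    within a row do not depend on each other, so only rows are sequential.
    Returns (hidden_states, number_of_sequential_steps).
    """
    height, width = len(grid), len(grid[0])
    hidden = [list(grid[0])]  # the first row has no predecessor
    steps = 1
    for r in range(1, height):
        prev = hidden[-1]
        row = []
        for c in range(width):
            neighbours = prev[max(0, c - 1):min(width, c + 2)]
            propagated = sum(neighbours) / len(neighbours)  # weights sum to 1
            row.append(mix * propagated + (1.0 - mix) * grid[r][c])
        hidden.append(row)
        steps += 1
    return hidden, steps


side = 4  # a 4 x 4 map, so N = 16 elements
grid = [[1.0] * side for _ in range(side)]
hidden, steps = line_scan(grid)
# `steps` equals `side`, i.e. sqrt(N), rather than N for a flattened 1D scan.
```

The inner loop over columns could run in parallel, since each entry reads only the previous row; the sequential depth is therefore the number of rows.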

* Project page: http://whj363636.github.io/GSPN/

VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models

Jan 21, 2025

Chaohao Xie, Kai Han, Kwan-Yee K. Wong

Abstract:Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames in either the image space or the feature space. However, they produce severe artifacts at the mask center when the masked area is too large and no pixel correspondences can be found for the center. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporally coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, that conditions a diffusion model during the reverse diffusion process to produce temporally coherent inpainting results without requiring any training data or fine-tuning of the pre-trained diffusion model. VipDiff takes optical flow as guidance to extract valid pixels from reference frames to serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows for generating diverse video inpainting results from different sampled noise. Experiments demonstrate that VipDiff largely outperforms state-of-the-art video inpainting methods in terms of both spatial-temporal coherence and fidelity.

* 10 pages, 5 Figures (Accepted at WACV 2025)

Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

Jan 08, 2025

Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang

Abstract:Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations below 3B parameters, Eve performs distinctly better on language benchmarks and achieves a state-of-the-art result of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model.

PruneVid: Visual Token Pruning for Efficient Video Large Language Models

Dec 20, 2024

Xiaohu Huang, Hao Zhou, Kai Han

Abstract:In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities. However, the substantial redundancy in video data presents significant computational challenges for LLMs. To address this issue, we introduce a training-free method that 1) minimizes video redundancy by merging spatial-temporal tokens, and 2) leverages LLMs' reasoning capabilities to selectively retain the visual features relevant to question tokens and prune the rest, enhancing model efficiency. We validate our method across multiple video benchmarks, which demonstrate that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different model networks. This highlights its superior effectiveness and efficiency compared to existing pruning methods. Code: https://github.com/Visual-AI/PruneVid.
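Question-guided pruning of the kind described in step 2 can be sketched as a top-k selection. The sketch assumes each visual token already has a scalar relevance score (for instance, attention received from the question tokens); the function `prune_tokens`, the `keep_ratio` parameter, and the scores are hypothetical and do not come from the PruneVid code base.

```python
from typing import List


def prune_tokens(relevance: List[float], keep_ratio: float) -> List[int]:
    """Return the indices of the visual tokens to keep, in original order.

    `relevance` holds one hypothetical question-relevance score per token;
    `keep_ratio` is the fraction of tokens that survive pruning.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, round(len(relevance) * keep_ratio))
    # Rank token indices from most to least relevant, then keep the top ones.
    ranked = sorted(range(len(relevance)),
                    key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:keep])


relevance = [0.02, 0.40, 0.01, 0.05, 0.30, 0.03, 0.01, 0.08, 0.06, 0.04]
kept = prune_tokens(relevance, keep_ratio=0.2)  # prunes 80% of the tokens
```

Returning the surviving indices in their original order preserves the spatial-temporal ordering of the remaining tokens, which a downstream language model would typically expect.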

* Efficient Video Large Language Models

OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Dec 19, 2024

Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, Kai Han

Abstract:In recent years, the field of text-to-video (T2V) generation has made significant strides. Despite this progress, there is still a gap between theoretical advancements and practical application, amplified by issues like degraded image quality and flickering artifacts. Recent advancements in enhancing the video diffusion model (VDM) through feedback learning have shown promising results. However, these methods still exhibit notable limitations, such as misaligned feedback and inferior scalability. To tackle these issues, we introduce OnlineVPO, a more efficient preference learning approach tailored specifically for video diffusion models. Our method features two novel designs. First, instead of directly using image-based reward feedback, we leverage a video quality assessment (VQA) model trained on synthetic data as the reward model to provide distribution- and modality-aligned feedback on the video diffusion model. Second, we introduce an online DPO algorithm to address the off-policy optimization and scalability issues in existing video preference learning frameworks. By employing the video reward model to offer concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance. Extensive experiments on the open-source video-diffusion model demonstrate that OnlineVPO is a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models, offering valuable insights for future advancements in this domain.
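For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO objective together with a simple way to build a preference pair on the fly from reward scores. It is an assumption-laden illustration: `select_pair`, `reward_fn`, and the scalar log-probabilities are hypothetical stand-ins, and the exact loss and sampling procedure used by OnlineVPO may differ.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_pair(samples: Sequence[T],
                reward_fn: Callable[[T], float]) -> Tuple[T, T]:
    """Pick (preferred, rejected) from freshly generated samples.

    `reward_fn` stands in for a video reward model; here it is any callable
    that maps a sample to a scalar score.
    """
    ranked = sorted(samples, key=reward_fn)
    return ranked[-1], ranked[0]


def dpo_loss(policy_logp_win: float, policy_logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_logp_win - ref_logp_win)
              - (policy_logp_lose - ref_logp_lose))
    z = -beta * margin
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy usage: strings play the role of videos, string length the reward.
preferred, rejected = select_pair(["a", "bbb", "cc"], len)
loss_at_start = dpo_loss(0.0, 0.0, 0.0, 0.0)       # policy equals reference
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favours the winner
```

When the policy and the reference agree, the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred sample's likelihood relative to the rejected one. Drawing the pair from the current policy's own samples is what makes the procedure online rather than off-policy.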

Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games

Dec 18, 2024

Wenye Lin, Jonathan Roberts, Yunhan Yang, Samuel Albanie, Zongqing Lu, Kai Han

Abstract:Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these challenges, we introduce GAMEBoT, a gaming arena designed for rigorous and transparent assessment of LLM reasoning capabilities. GAMEBoT decomposes complex reasoning in games into predefined modular subproblems. This decomposition allows us to design a suite of Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs in addressing these subproblems before action selection. Furthermore, we develop a suite of rule-based algorithms to generate ground truth for these subproblems, enabling rigorous validation of the LLMs' intermediate reasoning steps. This approach facilitates evaluation of both the quality of final actions and the accuracy of the underlying reasoning process. GAMEBoT also naturally alleviates the risk of data contamination through dynamic games and head-to-head LLM competitions. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts. Project page: https://visual-ai.github.io/gamebot

* 8 pages

Mr. DETR: Instructive Multi-Route Training for Detection Transformers

Dec 13, 2024

Chang-Bin Zhang, Yujie Zhong, Kai Han

Abstract:Existing methods enhance the training of detection transformers by incorporating an auxiliary one-to-many assignment. In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self-attention, cross-attention, and feed-forward network. Our empirical results demonstrate that any independent component in the decoder can effectively learn both targets simultaneously, even when other components are shared. This finding leads us to propose a multi-route training mechanism, featuring a primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. We enhance the training mechanism with a novel instructive self-attention that dynamically and flexibly guides object queries for one-to-many prediction. The auxiliary routes are removed during inference, ensuring no impact on model architecture or inference cost. We conduct extensive experiments on various baselines, achieving consistent improvements as shown in Figure 1.

* Tech. report

Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

Dec 12, 2024

Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, Yunhe Wang

Abstract:Large Language Models (LLMs) have shown remarkable abilities across various language tasks, but solving complex reasoning problems remains a challenge. While existing methods like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) enhance reasoning by decomposing problems or structuring prompts, they typically perform a single pass of reasoning and may fail to revisit flawed paths, compromising accuracy. To address this, we propose a novel reasoning framework called Forest-of-Thought (FoT), which integrates multiple reasoning trees to leverage collective decision-making for solving complex logical problems. FoT utilizes sparse activation strategies to select the most relevant reasoning paths, improving both efficiency and accuracy. Additionally, we introduce a dynamic self-correction strategy that enables real-time error correction and learning from past mistakes, as well as consensus-guided decision making strategies to optimize correctness and computational resources. Experimental results demonstrate that the FoT framework, combined with these strategies, significantly enhances the reasoning capabilities of LLMs, enabling them to solve complex tasks with greater precision and efficiency.
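The idea of combining several reasoning trees through collective decision-making can be illustrated with a plain majority vote. This is only a stand-in: the abstract does not specify the consensus rule, so the `consensus` function and the example answers below are hypothetical, and sparse activation and self-correction are not modelled.

```python
from collections import Counter
from typing import List, Tuple


def consensus(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of trees agreeing.

    `answers` holds one final answer per reasoning tree. Ties are broken
    in favour of the answer that appears first in the list.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


# Four hypothetical trees, three of which reach the same answer.
answer, agreement = consensus(["42", "41", "42", "42"])
```

The agreement fraction could also serve as a stopping signal, for example halting early once enough trees agree, which is one way a consensus rule can trade correctness against compute.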

* Preprint
