Tieyuan Chen

Awakening Diffusion Transformers: Eliciting Stronger Generation and Understanding via Massive Activation Modulation

Jul 03, 2026

Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Supavadee Aramvith, Mehrtash Harandi, Weiyao Lin

Abstract:Massive Activations (MAs) have been widely observed in Transformer-based models, yet their structure and functional roles in Diffusion Transformers (DiTs) remain insufficiently understood. In this work, we systematically analyze MAs in representative DiTs and find that they are spatially distributed across image tokens while concentrated in a small set of fixed feature dimensions. We further show that these dimensions are closely aligned with AdaLN residual scaling factors and are primarily modulated by the denoising timestep rather than text conditions. This structure leads to two task-dependent effects: for generation, MAs are critical for fine-grained detail synthesis while having limited influence on global semantics; for understanding, their shared high-magnitude directions make raw DiT features overly similar across spatial tokens and weaken dense feature discrimination. Based on these findings, we introduce Eliciting Massive Activation (EMA), a training-free framework that leverages MAs as a unified modulation signal to improve both generative and representational capabilities of DiTs. For generation, EMA proposes MA-driven Detail Guidance (DG), which suppresses MA dimensions to construct a detail-deficient counterfactual prediction and guides sampling toward finer visual details. DG further supports efficient partial-forward inference, integration with classifier-free guidance, and token-level Local DG for refining selected image regions. For understanding, EMA introduces MA-modulated REPresentation extraction (MREP), which uses pretrained AdaLN channel-wise modulation to reduce MA directional dominance and concatenates spatially normalized MA maps to preserve useful spatial structure. Extensive experiments demonstrate that EMA consistently improves both the generation quality and representation capability of DiTs.
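
The Detail Guidance idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the selection rule, the suppression operator, or the guidance formula. The sketch assumes that MA dimensions are the channels with the largest mean absolute activation over tokens, that suppression means zeroing those channels, and that guidance extrapolates away from the suppressed prediction in the same way classifier-free guidance does. The names `find_ma_dims`, `suppress_dims`, and `detail_guidance` are hypothetical, and the two prediction vectors stand in for the outputs of two model forward passes.

```python
from typing import List

Matrix = List[List[float]]  # rows are image tokens, columns are feature channels


def find_ma_dims(features: Matrix, k: int) -> List[int]:
    """Return the k channels with the largest mean |activation| over all tokens."""
    n_tokens = len(features)
    n_channels = len(features[0])
    mean_abs = [
        sum(abs(row[c]) for row in features) / n_tokens for c in range(n_channels)
    ]
    top = sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)[:k]
    return sorted(top)


def suppress_dims(features: Matrix, dims: List[int]) -> Matrix:
    """Zero the given channels in every token (an assumed form of suppression)."""
    drop = set(dims)
    return [[0.0 if c in drop else v for c, v in enumerate(row)] for row in features]


def detail_guidance(
    pred: List[float], pred_suppressed: List[float], scale: float
) -> List[float]:
    """Extrapolate away from the detail-deficient prediction (assumed CFG-like rule)."""
    return [p + scale * (p - q) for p, q in zip(pred, pred_suppressed)]


# Toy hidden features: channel 1 is far larger than the others in every token.
features = [[0.1, 50.0, -0.2], [0.3, 48.0, 0.1]]
ma_dims = find_ma_dims(features, k=1)
suppressed = suppress_dims(features, ma_dims)

# Stand-ins for the normal prediction and the prediction with MA channels suppressed.
guided = detail_guidance([1.0, 2.0], [0.5, 2.0], scale=2.0)
```

In this toy setting only the first output coordinate differs between the two predictions, so only that coordinate is pushed further by the guidance step.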

Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

Jul 01, 2026

Shijie Li, Yilin Gao, Siyuan Yang, Tieyuan Chen, Chaofan Gan, Zhihao He, Zicheng Zhao, Yuyu Guo, Weiyao Lin, Hang Yu

Abstract:Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this "answer leakage". We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.
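
To make the two KL directions concrete, here is a minimal sketch under strong assumptions: both the posterior and the prior are taken to be diagonal Gaussians, and the two divergences are combined as a weighted sum. The abstract does not state the distribution family, which direction the authors call "forward", the weights, or where gradients are stopped, so the function `dual_kl` and its parameters `w_prior` and `w_post` are hypothetical illustrations rather than the AMVL objective itself.

```python
import math
from typing import Sequence


def kl_diag_gauss(
    mu_a: Sequence[float],
    var_a: Sequence[float],
    mu_b: Sequence[float],
    var_b: Sequence[float],
) -> float:
    """Closed-form KL(N(mu_a, var_a) || N(mu_b, var_b)) for diagonal Gaussians."""
    return sum(
        0.5 * (math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b)
    )


def dual_kl(
    post_mu: Sequence[float],
    post_var: Sequence[float],
    prior_mu: Sequence[float],
    prior_var: Sequence[float],
    w_prior: float = 1.0,
    w_post: float = 1.0,
) -> float:
    """Weighted sum of both KL directions (assumed form of a bidirectional objective)."""
    post_to_prior = kl_diag_gauss(post_mu, post_var, prior_mu, prior_var)
    prior_to_post = kl_diag_gauss(prior_mu, prior_var, post_mu, post_var)
    return w_prior * post_to_prior + w_post * prior_to_post


# The two directions differ whenever the variances differ.
narrow_to_wide = kl_diag_gauss([0.0], [1.0], [0.0], [4.0])
wide_to_narrow = kl_diag_gauss([0.0], [4.0], [0.0], [1.0])
```

Because KL divergence is asymmetric, the two terms penalize different kinds of mismatch, which is why a single direction and a sum of both directions are not interchangeable objectives.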

GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

Jun 15, 2026

Ruoxuan Yang, Tieyuan Chen, Xiaofeng Huang, Haibing Yin, Jun Wang, Xiping Chen, Jun Yin, Xuesong Gao, Weiyao Lin

Abstract:Viewer sentiment prediction in video advertisements aims to infer the latent affective response evoked in the audience. To bridge the gap between what is shown and what is felt, models must deduce hidden viewer emotions from explicit visual narratives, concrete character-object interactions, and visible textual cues. However, standard Multimodal Large Language Models (MLLMs) typically rely on holistic frame representations, which leave these fine-grained, affect-relevant events implicit and complicate precise emotional reasoning. To address this, we propose a grounded action-centric evidence augmentation framework that enhances video MLLMs' clue extraction and comprehension by introducing explicit event structure and localized visual evidence. Our method extracts temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions, grounds subject and object entities as visual entity crops, and then enables the MLLM to perform clue-enhanced emotional reasoning based on these extracted structured clues. In this way, action triplets specify "what happens", while grounded visual entity crops anchor "who or what participates in each event" to concrete visual evidence. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines. Ablation studies, cross-dataset evaluation on AdsQA, and transfer experiments on an emotion-focused TVQA subset further support the effectiveness and generalization of our approach.

* 13 pages, 5 figures

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Apr 22, 2026

Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li (+8 more)

Abstract:We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.

* LLaDA2.0-Uni Technical Report

VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Jan 25, 2026

Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin

Abstract:Standard autoregressive Video LLMs inevitably suffer from causal masking biases that hinder global spatiotemporal modeling, leading to suboptimal understanding efficiency. We propose VidLaDA, a Video LLM based on a Diffusion Language Model that uses bidirectional attention to capture bidirectional dependencies. To further tackle the inference bottleneck of diffusion decoding on massive video tokens, we introduce MARS-Cache. This framework accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, effectively pruning redundancy while preserving global connectivity via anchor tokens. Extensive experiments show that VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering over 12x speedup without compromising reasoning accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.

Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)

Sep 16, 2025

Zhihao He, Tianyao He, Tieyuan Chen, Yun Xu, Huabin Liu, Chaofan Gan, Gui Zou, Weiyao Lin

Abstract:Despite rapid progress in video language models, the pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video's knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models.

Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Aug 11, 2025

Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu (+3 more)

Abstract:The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.

Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning

Jun 09, 2025

Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Gui Zou, Weiyao Lin

Abstract:Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, Implicit Video Question Answering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. Additionally, IRM achieves SOTA performance on similar implicit advertisement understanding and future prediction in traffic-VQA. Datasets and codes are available for double-blind review in anonymous repo: https://github.com/tychen-SJTU/Implicit-VideoQA.

* Preprint

Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

May 24, 2025

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin

Abstract:Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., +9.4% on SPair-71k and +4.4% on AP-10K-C.S.).
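
A toy example shows why activations concentrated in a shared dimension make token features uninformative, and why discarding that channel helps. This is only a sketch of the channel-discard idea in isolation: it does not include the AdaLN-based localization and normalization that DiTF uses, the feature values are invented for illustration, and the helper names `cosine` and `discard_channels` are hypothetical.

```python
import math
from typing import Iterable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def discard_channels(vec: Sequence[float], dims: Iterable[int]) -> List[float]:
    """Drop the listed channels from a feature vector."""
    drop = set(dims)
    return [v for c, v in enumerate(vec) if c not in drop]


# Two patch tokens whose local content is orthogonal (channels 0 and 1),
# but which share one massive activation in channel 2.
token_a = [1.0, 0.0, 100.0]
token_b = [0.0, 1.0, 100.0]

sim_raw = cosine(token_a, token_b)
sim_clean = cosine(discard_channels(token_a, [2]), discard_channels(token_b, [2]))
```

With the massive channel present, the two tokens look almost identical under cosine similarity; once it is discarded, the similarity reflects only the local channels, which is the property dense correspondence matching relies on.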

* Under Review

Contrastive Representation Distillation via Multi-Scale Feature Decoupling

Feb 09, 2025

Cuipeng Wang, Tieyuan Chen, Haipeng Wang

Abstract:Knowledge distillation is a technique aimed at enhancing the performance of a smaller student network without increasing its parameter size by transferring knowledge from a larger, pre-trained teacher network. Previous approaches have predominantly focused on distilling global feature information while overlooking the importance of disentangling the diverse types of information embedded within different regions of the feature. In this work, we introduce multi-scale decoupling in the feature transfer process for the first time, where the decoupled local features are individually processed and integrated with contrastive learning. Moreover, compared to previous contrastive learning-based distillation methods, our approach not only reduces computational costs but also enhances efficiency, enabling performance improvements for the student network using only single-batch samples. Extensive evaluations on CIFAR-100 and ImageNet demonstrate our method's superiority, with some student networks distilled using our method even surpassing the performance of their pre-trained teacher networks. These results underscore the effectiveness of our approach in enabling student networks to thoroughly absorb knowledge from teacher networks.
