Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jun Zhu

FutureFab.AI

Oscillation-Reduced MXFP4 Training for Vision Transformers

Feb 28, 2025

Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen

Figure 1 for Oscillation-Reduced MXFP4 Training for Vision Transformers

Figure 2 for Oscillation-Reduced MXFP4 Training for Vision Transformers

Figure 3 for Oscillation-Reduced MXFP4 Training for Vision Transformers

Figure 4 for Oscillation-Reduced MXFP4 Training for Vision Transformers

Abstract:Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with MXFP4 data format still results in significant degradation and there is a lack of systematic research on the reason. In this work, we propose a novel training method TetraJet for a more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than $50\%$ compared to the baseline, and can even achieve competitive performance compared to full precision training. The codes are available at https://github.com/thu-ml/TetraJet-MXFP4Training

Via

Access Paper or Ask Questions

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Feb 25, 2025

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen

Abstract:An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.

Via

Access Paper or Ask Questions

RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Feb 21, 2025

Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, Jun Zhu

Figure 1 for RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Figure 2 for RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Figure 3 for RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Figure 4 for RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Abstract:Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch--achieving high-quality $2\times$ extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables $3\times$ extrapolation by minimal fine-tuning without long videos. Project page and codes: \href{https://riflex-video.github.io/}{https://riflex-video.github.io/.}

Via

Access Paper or Ask Questions

Exploratory Diffusion Policy for Unsupervised Reinforcement Learning

Feb 11, 2025

Chengyang Ying, Huayu Chen, Xinning Zhou, Zhongkai Hao, Hang Su, Jun Zhu

Figure 1 for Exploratory Diffusion Policy for Unsupervised Reinforcement Learning

Figure 2 for Exploratory Diffusion Policy for Unsupervised Reinforcement Learning

Figure 3 for Exploratory Diffusion Policy for Unsupervised Reinforcement Learning

Figure 4 for Exploratory Diffusion Policy for Unsupervised Reinforcement Learning

Abstract:Unsupervised reinforcement learning (RL) aims to pre-train agents by exploring states or skills in reward-free environments, facilitating the adaptation to downstream tasks. However, existing methods often overlook the fitting ability of pre-trained policies and struggle to handle the heterogeneous pre-training data, which are crucial for achieving efficient exploration and fast fine-tuning. To address this gap, we propose Exploratory Diffusion Policy (EDP), which leverages the strong expressive ability of diffusion models to fit the explored data, both boosting exploration and obtaining an efficient initialization for downstream tasks. Specifically, we estimate the distribution of collected data in the replay buffer with the diffusion policy and propose a score intrinsic reward, encouraging the agent to explore unseen states. For fine-tuning the pre-trained diffusion policy on downstream tasks, we provide both theoretical analyses and practical algorithms, including an alternating method of Q function optimization and diffusion policy distillation. Extensive experiments demonstrate the effectiveness of EDP in efficient exploration during pre-training and fast adaptation during fine-tuning.

Via

Access Paper or Ask Questions

CANeRV: Content Adaptive Neural Representation for Video Compression

Feb 10, 2025

Lv Tang, Jun Zhu, Xinfeng Zhang, Li Zhang, Siwei Ma, Qingming Huang

Figure 1 for CANeRV: Content Adaptive Neural Representation for Video Compression

Figure 2 for CANeRV: Content Adaptive Neural Representation for Video Compression

Figure 3 for CANeRV: Content Adaptive Neural Representation for Video Compression

Figure 4 for CANeRV: Content Adaptive Neural Representation for Video Compression

Abstract:Recent advances in video compression introduce implicit neural representation (INR) based methods, which effectively capture global dependencies and characteristics of entire video sequences. Unlike traditional and deep learning based approaches, INR-based methods optimize network parameters from a global perspective, resulting in superior compression potential. However, most current INR methods utilize a fixed and uniform network architecture across all frames, limiting their adaptability to dynamic variations within and between video sequences. This often leads to suboptimal compression outcomes as these methods struggle to capture the distinct nuances and transitions in video content. To overcome these challenges, we propose Content Adaptive Neural Representation for Video Compression (CANeRV), an innovative INR-based video compression network that adaptively conducts structure optimisation based on the specific content of each video sequence. To better capture dynamic information across video sequences, we propose a dynamic sequence-level adjustment (DSA). Furthermore, to enhance the capture of dynamics between frames within a sequence, we implement a dynamic frame-level adjustment (DFA). {Finally, to effectively capture spatial structural information within video frames, thereby enhancing the detail restoration capabilities of CANeRV, we devise a structure level hierarchical structural adaptation (HSA).} Experimental results demonstrate that CANeRV can outperform both H.266/VVC and state-of-the-art INR-based video compression techniques across diverse video datasets.

Via

Access Paper or Ask Questions

Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Feb 09, 2025

Yijun Yang, Lichao Wang, Xiao Yang, Lanqing Hong, Jun Zhu

Figure 1 for Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Figure 2 for Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Figure 3 for Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Figure 4 for Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Abstract:Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications, but also increasing the risk of generating unsafe responses. In response, leading companies have implemented Multi-Layered safety defenses, including alignment training, safety system prompts, and content moderation. However, their effectiveness against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose MultiFaceted Attack, a novel attack framework designed to systematically bypass Multi-Layered Defenses in VLLMs. It comprises three complementary attack facets: Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; Alignment Breaking Attack that manipulates the model's alignment mechanism to prioritize the generation of contrasting responses; and Adversarial Signature that deceives content moderators by strategically placing misleading information at the end of the response. Extensive evaluations on eight commercial VLLMs in a black-box setting demonstrate that MultiFaceted Attack achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.

Via

Access Paper or Ask Questions

A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

Feb 07, 2025

Yuhao Zhou, Jintao Xu, Chenglong Bao, Chao Ding, Jun Zhu

Figure 1 for A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

Figure 2 for A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

Figure 3 for A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

Abstract:We consider the problem of finding an $\epsilon$-stationary point of a nonconvex function with a Lipschitz continuous Hessian and propose a quadratic regularized Newton method incorporating a new class of regularizers constructed from the current and previous gradients. The method leverages a recently developed linear conjugate gradient approach with a negative curvature monitor to solve the regularized Newton equation. Notably, our algorithm is adaptive, requiring no prior knowledge of the Lipschitz constant of the Hessian, and achieves a global complexity of $O(\epsilon^{-\frac{3}{2}}) + \tilde O(1)$ in terms of the second-order oracle calls, and $\tilde O(\epsilon^{-\frac{7}{4}})$ for Hessian-vector products, respectively. Moreover, when the iterates converge to a point where the Hessian is positive definite, the method exhibits quadratic local convergence. Preliminary numerical results illustrate the competitiveness of our algorithm.

Via

Access Paper or Ask Questions

Elucidating the Preconditioning in Consistency Distillation

Feb 05, 2025

Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu

Abstract:Consistency distillation is a prevalent way for accelerating diffusion models adopted in consistency (trajectory) models, in which a student model is trained to traverse backward on the probability flow (PF) ordinary differential equation (ODE) trajectory determined by the teacher model. Preconditioning is a vital technique for stabilizing consistency distillation, by linear combining the input data and the network output with pre-defined coefficients as the consistency function. It imposes the boundary condition of consistency functions without restricting the form and expressiveness of the neural network. However, previous preconditionings are hand-crafted and may be suboptimal choices. In this work, we offer the first theoretical insights into the preconditioning in consistency distillation, by elucidating its design criteria and the connection to the teacher ODE trajectory. Based on these analyses, we further propose a principled way dubbed \textit{Analytic-Precond} to analytically optimize the preconditioning according to the consistency gap (defined as the gap between the teacher denoiser and the optimal student denoiser) on a generalized teacher ODE. We demonstrate that Analytic-Precond can facilitate the learning of trajectory jumpers, enhance the alignment of the student trajectory with the teacher's, and achieve $2\times$ to $3\times$ training acceleration of consistency trajectory models in multi-step generation across various datasets.

* Accepted at ICLR 2025

Via

Access Paper or Ask Questions

STAIR: Improving Safety Alignment with Introspective Reasoning

Feb 04, 2025

Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu

Figure 1 for STAIR: Improving Safety Alignment with Introspective Reasoning

Figure 2 for STAIR: Improving Safety Alignment with Introspective Reasoning

Figure 3 for STAIR: Improving Safety Alignment with Introspective Reasoning

Figure 4 for STAIR: Improving Safety Alignment with Introspective Reasoning

Abstract:Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and the susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Itrospective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.

* 22 pages, 8 figures

Via

Access Paper or Ask Questions

Towards the Worst-case Robustness of Large Language Models

Jan 31, 2025

Huanran Chen, Yinpeng Dong, Zeming Wei, Hang Su, Jun Zhu

Figure 1 for Towards the Worst-case Robustness of Large Language Models

Figure 2 for Towards the Worst-case Robustness of Large Language Models

Figure 3 for Towards the Worst-case Robustness of Large Language Models

Figure 4 for Towards the Worst-case Robustness of Large Language Models

Abstract:Recent studies have revealed the vulnerability of Large Language Models (LLMs) to adversarial attacks, where the adversary crafts specific input sequences to induce harmful, violent, private, or incorrect outputs. Although various defenses have been proposed, they have not been evaluated by strong adaptive attacks, leaving the worst-case robustness of LLMs still intractable. By developing a stronger white-box attack, our evaluation results indicate that most typical defenses achieve nearly 0\% robustness.To solve this, we propose \textit{DiffTextPure}, a general defense that diffuses the (adversarial) input prompt using any pre-defined smoothing distribution, and purifies the diffused input using a pre-trained language model. Theoretically, we derive tight robustness lower bounds for all smoothing distributions using Fractal Knapsack or 0-1 Knapsack solvers. Under this framework, we certify the robustness of a specific case -- smoothing LLMs using a uniform kernel -- against \textit{any possible attack} with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.

Via

Access Paper or Ask Questions