Zihan Liu

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Jun 16, 2025

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Abstract:In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B

* The AceReason-Nemotron collection: https://huggingface.co/collections/nvidia/acereason-682f4e1261dc22f697fd1485
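
The abstract above ties the choice of sampling temperature to keeping a "temperature-adjusted entropy" near 0.3. As a minimal sketch, assuming this quantity is the Shannon entropy (in nats) of the softmax of the logits divided by the temperature at a single token position, it could be computed and matched to a target as follows. The function names, the bisection bounds, and the single-position setup are illustrative assumptions, not the authors' implementation; in practice the entropy would presumably be averaged over many sampled tokens.

```python
import math


def temperature_adjusted_entropy(logits, temperature):
    """Shannon entropy (nats) of softmax(logits / temperature).

    Assumed definition for illustration; the paper may define it differently.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_temperature(logits, target=0.3, lo=0.05, hi=5.0, iters=60):
    """Bisection search for the temperature whose entropy matches `target`.

    Relies on the entropy of softmax(logits / T) being non-decreasing in T.
    The bounds `lo` and `hi` are hypothetical and must bracket the target.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if temperature_adjusted_entropy(logits, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.0, -1.0]  # made-up logits for one position
    t = pick_temperature(example_logits)
    print(t, temperature_adjusted_entropy(example_logits, t))
```

A sharper (lower-entropy) initial policy would need a higher temperature to reach the same target, which is one way to read the abstract's point that the temperature should depend on the SFT initialization.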

Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

Jun 06, 2025

Haosong Liu, Yuge Cheng, Zihan Liu, Aiyue Chen, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, Minyi Guo

Abstract:Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).

Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

May 23, 2025

Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, Biwei Huang

Abstract:Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs long-form reasoning attributes, such as output length and self-reflection. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training, resulting in significantly increased self-reflection rates and accuracy. Moreover, we find that the activation dynamics follow predictable trajectories, with a sharp rise after special tokens and a subsequent exponential decay. Building on these insights, we introduce a general training-free activation control technique. It leverages a few contrastive examples to identify key activations, and employs simple analytic functions to modulate their values at inference time to elicit long CoTs. Extensive experiments confirm the effectiveness of our method in efficiently eliciting long CoT reasoning in LLMs and improving their performance. Additionally, we propose a parameter-efficient fine-tuning method that trains only a last-layer activation amplification module and a few LoRA layers, outperforming full LoRA fine-tuning on reasoning benchmarks with significantly fewer parameters. Our code and data are publicly released.
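
The abstract above describes activations that rise sharply after special tokens and then decay exponentially, and a control method that modulates them with simple analytic functions. The sketch below illustrates one such function under stated assumptions: a per-token scaling factor that jumps to `peak` at each trigger position (for example, an inserted "wait" token) and decays exponentially back toward 1. The function name, the parameters `peak` and `decay`, and the exact functional form are hypothetical and are not taken from the paper.

```python
import math


def amplification_schedule(num_tokens, trigger_positions, peak=2.0, decay=0.1):
    """Return one scaling factor per token position.

    Before any trigger the factor is 1.0 (no amplification). At a trigger
    position it equals `peak`, then decays exponentially toward 1.0 with
    rate `decay` until the next trigger resets it. All values are
    illustrative assumptions.
    """
    triggers = set(trigger_positions)
    factors = []
    last_trigger = None
    for t in range(num_tokens):
        if t in triggers:
            last_trigger = t
        if last_trigger is None:
            factors.append(1.0)
        else:
            elapsed = t - last_trigger
            factors.append(1.0 + (peak - 1.0) * math.exp(-decay * elapsed))
    return factors


if __name__ == "__main__":
    for t, f in enumerate(amplification_schedule(8, [2, 5], decay=0.5)):
        print(t, round(f, 3))
```

In an actual model, factors like these would multiply the selected activations at inference time; which activations to select is the part the abstract attributes to contrastive examples and is not sketched here.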

AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

May 22, 2025

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Abstract:Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.

* We release the model at: https://huggingface.co/nvidia/AceReason-Nemotron-14B

From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

Apr 08, 2025

Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro

Abstract:Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.

Learning-Based Passive Fault-Tolerant Control of a Quadrotor with Rotor Failure

Mar 04, 2025

Jiehao Chen, Kaidong Zhao, Zihan Liu, YanJie Li, Yunjiang Lou

Abstract:This paper proposes a learning-based passive fault-tolerant control (PFTC) method for quadrotors capable of handling arbitrary single-rotor failures, including conditions ranging from fault-free to complete rotor failure, without requiring any rotor fault information or controller switching. Unlike existing methods that treat rotor faults as disturbances and rely on a single controller for multiple fault scenarios, our approach introduces a novel Selector-Controller network structure. This architecture integrates the fault detection module and the controller into a unified policy network, effectively combining the adaptability to multiple fault scenarios of PFTC with the superior control performance of active fault-tolerant control (AFTC). To optimize performance, the policy network is trained using a hybrid framework that synergizes reinforcement learning (RL), behavior cloning (BC), and supervised learning with fault information. Extensive simulations and real-world experiments validate the proposed method, demonstrating significant improvements in fault response speed and position tracking performance compared to state-of-the-art PFTC and AFTC approaches.

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Feb 18, 2025

Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

Abstract:Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .

Hybrid Adaptive Modeling using Neural Networks Trained with Nonlinear Dynamics Based Features

Jan 21, 2025

Zihan Liu, Prashant N. Kambali, C. Nataraj

Abstract:Accurate models are essential for design, performance prediction, control, and diagnostics in complex engineering systems. Physics-based models excel during the design phase but often become outdated during system deployment due to changing operational conditions, unknown interactions, excitations, and parametric drift. While data-based models can capture the current state of complex systems, they face significant challenges, including excessive data dependence, limited generalizability to changing conditions, and inability to predict parametric dependence. This has led to combining physics and data in modeling, termed physics-infused machine learning, often using numerical simulations from physics-based models. This paper introduces a novel approach that departs from standard techniques by uncovering information from nonlinear dynamical modeling and embedding it in data-based models. The goal is to create a hybrid adaptive modeling framework that integrates data-based modeling with newly measured data and analytical nonlinear dynamical models for enhanced accuracy, parametric dependence, and improved generalizability. By explicitly incorporating nonlinear dynamic phenomena through perturbation methods, the predictive capabilities are more realistic and insightful compared to knowledge obtained from brute-force numerical simulations. In particular, perturbation methods are utilized to derive asymptotic solutions which are parameterized to generate frequency responses. Frequency responses provide comprehensive insights into dynamics and nonlinearity which are quantified and extracted as high-quality features. A machine-learning model, trained by these features, tracks parameter variations and updates the mismatched model. The results demonstrate that this adaptive modeling method outperforms numerical gray box models in prediction accuracy and computational efficiency.

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Dec 19, 2024

Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Abstract:In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath
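
The abstract above reports an rm@8 score without defining it. A minimal sketch, assuming rm@k means that k candidate solutions are sampled per problem, the reward model picks the highest-scoring one, and the metric is the fraction of problems where that pick is correct. The function name and the `(reward_score, is_correct)` data layout are hypothetical, chosen only for illustration.

```python
def rm_at_k(problems, k=8):
    """Fraction of problems whose top-reward candidate is correct.

    `problems` is a list with one entry per problem; each entry is a list of
    (reward_score, is_correct) pairs, one per sampled solution. Only the
    first k candidates of each problem are considered. This is an assumed
    reading of rm@k, not a definition quoted from the paper.
    """
    if not problems:
        raise ValueError("need at least one problem")
    hits = 0
    for candidates in problems:
        # Select the candidate the reward model scores highest.
        best = max(candidates[:k], key=lambda c: c[0])
        hits += int(best[1])
    return hits / len(problems)


if __name__ == "__main__":
    toy = [
        [(0.9, True), (0.2, False)],   # reward model picks the correct answer
        [(0.1, True), (0.8, False)],   # reward model picks a wrong answer
    ]
    print(rm_at_k(toy, k=2))
```

Under this reading, rm@k sits between single-sample accuracy and pass@k: a correct candidate only counts if the reward model also ranks it first.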

NVLM: Open Frontier-Class Multimodal LLMs

Sep 17, 2024

Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Abstract:We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
