Wu

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Jun 12, 2026

NVIDIA: Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif (+564 more)

Abstract: We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies: LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution

Mar 22, 2026

Yu-Shan Tai, An-Yeu Wu

Abstract: Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.

Post-Training Quantization for Vision Mamba with k-Scaled Quantization and Reparameterization

Jan 28, 2025

Bo-Yun Shi, Yi-Cheng Lo, An-Yeu Wu

Abstract: The Mamba model, utilizing a structured state-space model (SSM), offers linear time complexity and demonstrates significant potential. Vision Mamba (ViM) extends this framework to vision tasks by incorporating a bidirectional SSM and patch embedding, surpassing Transformer-based models in performance. While model quantization is essential for efficient computing, existing works have focused solely on the original Mamba model and have not been applied to ViM. Additionally, they neglect quantizing the SSM layer, which is central to Mamba and whose inherent structure makes naive quantization prone to substantial error propagation. In this paper, we focus on the post-training quantization (PTQ) of ViM. We address the issues with three core techniques: 1) a k-scaled token-wise quantization method for linear and convolutional layers, 2) a reparameterization technique to simplify hidden state quantization, and 3) a factor-determining method that reduces computational overhead by integrating operations. Through these methods, the error caused by PTQ can be mitigated. Experimental results on ImageNet-1k demonstrate only a 0.8-1.2% accuracy degradation due to PTQ, highlighting the effectiveness of our approach.
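
To illustrate why token-wise scaling can help, the sketch below compares plain per-tensor and per-token symmetric quantization on a toy activation matrix. It is not the paper's k-scaled method, whose details the abstract does not give; the function names, the 8-bit setting, and the sample activations are all assumptions made for illustration.

```python
def quantize_dequantize(values, scale, bits=8):
    """Round values to a signed integer grid with step `scale`, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(tokens, bits=8):
    """One scale shared by every token (row) of the activation matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for row in tokens for v in row) / qmax or 1.0
    return [quantize_dequantize(row, scale, bits) for row in tokens]


def per_token(tokens, bits=8):
    """A separate scale per token, so small-magnitude tokens keep their resolution."""
    qmax = 2 ** (bits - 1) - 1
    return [
        quantize_dequantize(row, max(abs(v) for v in row) / qmax or 1.0, bits)
        for row in tokens
    ]


def total_abs_error(original, restored):
    """Sum of absolute element-wise differences between two matrices."""
    return sum(abs(a - b) for r, s in zip(original, restored) for a, b in zip(r, s))


# Hypothetical activations: the second token carries much larger values.
tokens = [[0.1, -0.2, 0.05], [8.0, -6.0, 4.0]]
err_tensor = total_abs_error(tokens, per_tensor(tokens))
err_token = total_abs_error(tokens, per_token(tokens))
```

On this toy input the large second token sets the shared per-tensor scale, so the small first token is rounded coarsely; giving each token its own scale removes most of that error.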

SoK: Prompt Hacking of Large Language Models

Oct 16, 2024

Baha Rababah, Shang Wu, Matthew Kwiatkowski, Carson Leung, Cuneyt Gurcan Akcora

Abstract: The safety and robustness of large language model (LLM)-based applications remain critical challenges in artificial intelligence. Among the key threats to these applications are prompt hacking attacks, which can significantly undermine the security and reliability of LLM-based systems. In this work, we offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection, addressing the nuances that differentiate them despite their overlapping characteristics. To enhance the evaluation of LLM-based applications, we propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification. This approach provides more granular insights into the AI's behavior, improving diagnostic precision and enabling more targeted enhancements to the system's safety and robustness.

Efficient and Reliable Vector Similarity Search Using Asymmetric Encoding with NAND-Flash for Many-Class Few-Shot Learning

Sep 12, 2024

Hao-Wei Chiang, Chi-Tse Huang, Hsiang-Yun Cheng, Po-Hao Tseng, Ming-Hsiu Lee, An-Yeu Wu

Abstract: While memory-augmented neural networks (MANNs) offer an effective solution for few-shot learning (FSL) by integrating deep neural networks with external memory, the capacity requirements and energy overhead of data movement become enormous due to the large number of support vectors in many-class FSL scenarios. Various in-memory search solutions have emerged to improve the energy efficiency of MANNs. NAND-based multi-bit content addressable memory (MCAM) is a promising option due to its high density and large capacity. Despite its potential, MCAM faces limitations such as a restricted number of word lines, limited quantization levels, and non-ideal effects like varying string currents and bottleneck effects, which lead to significant accuracy drops. To address these issues, we propose several innovative methods. First, the Multi-bit Thermometer Code (MTMC) leverages the extensive capacity of MCAM to enhance vector precision using cumulative encoding rules, thereby mitigating the bottleneck effect. Second, the Asymmetric vector similarity search (AVSS) reduces the precision of the query vector while maintaining that of the support vectors, thereby minimizing the search iterations and improving efficiency in many-class scenarios. Finally, the Hardware-Aware Training (HAT) method optimizes controller training by modeling the hardware characteristics of MCAM, thus enhancing the reliability of the system. Our integrated framework reduces search iterations by up to 32 times, and increases overall accuracy by 1.58% to 6.94%.
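
As background for the cumulative encoding mentioned above, here is a minimal sketch of the classic binary thermometer code. It is not the paper's MTMC, which targets multi-bit cells and is not specified in the abstract; the function names and the 8-level example are assumptions for illustration.

```python
def thermometer_encode(value, levels):
    """Encode an integer in [0, levels - 1] as `value` ones followed by zeros."""
    if not 0 <= value < levels:
        raise ValueError("value must lie in [0, levels - 1]")
    return [1] * value + [0] * (levels - 1 - value)


def mismatches(code_a, code_b):
    """Count positions where two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Hypothetical 8-level quantized values.
code_2 = thermometer_encode(2, 8)
code_6 = thermometer_encode(6, 8)
distance = mismatches(code_2, code_6)
```

Because the code is cumulative, the number of mismatching positions equals the absolute difference between the two encoded values, which is what makes mismatch counting usable as a distance measure in a content-addressable search.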

Relay-Assisted Carrier Aggregation (RACA) Uplink System for Enhancing Data Rate of Extended Reality (XR)

Jul 02, 2024

Chi-Wei Chen, Wen-Chiao Tsai, Lung-Sheng Tsai, An-Yeu Wu

Abstract: In Extended Reality (XR) applications, high data rates and low latency are crucial for immersive experiences. Uplink transmission in XR is challenging due to the limited antennas and power of lightweight XR devices. To improve data transmission rates, we investigate a relay-assisted carrier aggregation (RACA) system. The XR device simultaneously transmits data to an access point (AP) and a relay in proximity over low-frequency and high-frequency bands, respectively. Then, the relay down-converts and amplifies the signals to the AP, effectively acting as an additional transmit antenna for the XR device. In this paper, we propose two algorithms to maximize the data rate of the XR device in their respective protocols. In the centralized protocol, the rate maximization problem is transformed into an equivalent weighted mean square error minimization (WMMSE) problem, which can be solved iteratively by alternating optimization. In the distributed protocol, the rate maximization problem is decomposed into two independent sub-problems where the rate of the direct link and the rate of the relay link are maximized by singular value decomposition (SVD)-based methods with water-filling (WF). Simulation results show that the rate of the RACA system is improved by 32% compared to that of the conventional carrier aggregation scheme.
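
The water-filling step named in the distributed protocol can be sketched with the textbook algorithm for parallel sub-channels, such as those obtained from an SVD of a channel matrix. This is a generic illustration, not the paper's method; the gain values (squared singular values divided by noise power) and the power budget are assumed for the example.

```python
import math


def water_filling(gains, total_power):
    """Allocate p_i = max(0, mu - 1/g_i) so that the powers sum to total_power."""
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    inverse = [1.0 / gains[i] for i in order]  # per-channel "floor" heights
    powers = [0.0] * len(gains)
    # Try using the k strongest channels, dropping the weakest until all get power.
    for k in range(len(gains), 0, -1):
        mu = (total_power + sum(inverse[:k])) / k  # candidate water level
        if mu > inverse[k - 1]:
            for j in range(k):
                powers[order[j]] = mu - inverse[j]
            break
    return powers


def sum_rate(gains, powers):
    """Sum rate in bits per channel use over the parallel sub-channels."""
    return sum(math.log2(1.0 + g * p) for g, p in zip(gains, powers))


# Hypothetical sub-channel gains and a unit power budget.
gains = [4.0, 1.0, 0.1]
powers = water_filling(gains, total_power=1.0)
uniform = [1.0 / len(gains)] * len(gains)
```

For these gains the weakest sub-channel receives no power, and the resulting sum rate exceeds that of splitting the budget evenly.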

LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer

Apr 11, 2024

Jiing-Ping Wang, Ming-Guang Lin, An-Yeu Wu

Abstract: With the rise of Transformer models in the NLP and CV domains, Multi-Head Attention (MHA) has proven to be a game-changer. However, its expensive computation poses challenges to model throughput and efficiency, especially for long-sequence tasks. Exploiting the sparsity in attention has proven to be an effective way to reduce computation. Nevertheless, prior works do not consider the differing distributions among heads and lack a systematic method to determine the threshold. To address these challenges, we propose Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer (LATTE). LATTE employs a head-wise threshold-based filter with a low-precision dot product and a computation reuse mechanism to reduce the computation of MHA. Moreover, the trainable threshold is introduced to provide a systematic method for adjusting the thresholds and enable end-to-end optimization. Experimental results indicate LATTE can smoothly adapt to both NLP and CV tasks, offering significant computation savings with only a minor compromise in performance. Also, the trainable threshold is shown to be essential for balancing performance against computation. As a result, LATTE filters out up to 85.16% of keys with only a 0.87% accuracy drop in the CV task and 89.91% of keys with a 0.86 perplexity increase in the NLP task.
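
The filtering idea can be sketched for a single query in a single head: score every key with a cheap low-precision dot product, keep only keys whose approximate score clears a threshold, and run full-precision softmax attention on the survivors. This is a simplified illustration under assumptions, not LATTE itself: the threshold here is a fixed argument rather than a trained per-head parameter, the computation reuse mechanism is omitted, and all names and the 4-bit setting are hypothetical.

```python
import math


def quantize(vec, bits=4):
    """Symmetric uniform quantization of a vector to signed integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / qmax or 1.0
    return [round(v / scale) for v in vec], scale


def filtered_attention(q, keys, values, threshold, bits=4):
    """Attention for one query, computed only over keys that pass a cheap filter."""
    q_int, q_scale = quantize(q, bits)
    approx = []
    for k in keys:
        k_int, k_scale = quantize(k, bits)
        # Integer dot product, rescaled to approximate the real-valued score.
        approx.append(sum(a * b for a, b in zip(q_int, k_int)) * q_scale * k_scale)
    kept = [i for i, s in enumerate(approx) if s >= threshold]
    if not kept:  # always keep the best-scoring key so softmax stays defined
        kept = [max(range(len(keys)), key=lambda i: approx[i])]
    # Full-precision scaled dot-product attention over the kept keys only.
    scores = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(len(q)) for i in kept]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    out = [
        sum(w * values[i][c] for w, i in zip(weights, kept)) / norm
        for c in range(len(values[0]))
    ]
    return out, kept


# Hypothetical query, keys, and values.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0], [0.9, 0.1]]
values = [[1.0], [2.0], [3.0]]
out, kept = filtered_attention(q, keys, values, threshold=0.0)
```

With the threshold at 0.0, the key pointing away from the query is dropped before any full-precision work is done; lowering the threshold to negative infinity recovers ordinary attention over all keys.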

MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer

Feb 01, 2024

Yu-Shan Tai, An-Yeu Wu

Abstract: While vision transformers (ViTs) have shown great potential in computer vision tasks, their intense computation and memory requirements pose challenges for practical applications. Existing post-training quantization methods leverage value redistribution or specialized quantizers to address the non-normal distribution in ViTs. However, without considering the asymmetry in activations and relying on hand-crafted settings, these methods often struggle to maintain performance under low-bit quantization. To overcome these challenges, we introduce SmoothQuant with bias term (SQ-b) to alleviate the asymmetry issue and reduce the clamping loss. We also introduce optimal scaling factor ratio search (OPT-m) to determine quantization parameters automatically through a data-dependent mechanism. To further enhance the compressibility, we incorporate the above-mentioned techniques and propose a mixed-precision post-training quantization framework for vision transformers (MPTQ-ViT). We develop greedy mixed-precision quantization (Greedy MP) to allocate layer-wise bit-width considering both model performance and compressibility. Our experiments on ViT, DeiT, and Swin demonstrate significant accuracy improvements compared with SOTA on the ImageNet dataset. Specifically, our proposed methods achieve accuracy improvements ranging from 0.90% to 23.35% on 4-bit ViTs with single-precision and from 3.82% to 78.14% on 5-bit fully quantized ViTs with mixed-precision.

TSPTQ-ViT: Two-scaled post-training quantization for vision transformer

May 22, 2023

Yu-Shan Tai, Ming-Guang Lin, An-Yeu Wu

Abstract: Vision transformers (ViTs) have achieved remarkable performance in various computer vision tasks. However, intensive memory and computation requirements impede ViTs from running on resource-constrained edge devices. Due to the non-normally distributed values after Softmax and GeLU, post-training quantization on ViTs results in severe accuracy degradation. Moreover, conventional methods fail to address the high channel-wise variance in LayerNorm. To reduce the quantization loss and improve classification accuracy, we propose a two-scaled post-training quantization scheme for vision transformer (TSPTQ-ViT). We design the value-aware two-scaled scaling factors (V-2SF) specialized for post-Softmax and post-GeLU values, which leverage the bit sparsity in non-normal distribution to save bit-widths. In addition, the outlier-aware two-scaled scaling factors (O-2SF) are introduced to LayerNorm, alleviating the dominant impacts from outlier values. Our experimental results show that the proposed methods reach near-lossless accuracy drops (<0.5%) on the ImageNet classification task under 8-bit fully quantized ViTs.

C3-SL: Circular Convolution-Based Batch-Wise Compression for Communication-Efficient Split Learning

Jul 25, 2022

Cheng-Yen Hsieh, Yu-Chuan Chuang, An-Yeu Wu

Abstract: Most existing studies improve the efficiency of split learning (SL) by compressing the transmitted features. However, most works focus on dimension-wise compression that transforms high-dimensional features into a low-dimensional space. In this paper, we propose circular convolution-based batch-wise compression for SL (C3-SL) to compress multiple features into one single feature. To avoid information loss while merging multiple features, we exploit the quasi-orthogonality of features in high-dimensional space with circular convolution and superposition. To the best of our knowledge, we are the first to explore the potential of batch-wise compression under the SL scenario. Based on the simulation results on CIFAR-10 and CIFAR-100, our method achieves a 16x compression ratio with negligible accuracy drops compared with the vanilla SL. Moreover, C3-SL reduces memory overhead by 1152x and computation overhead by 2.25x compared to the state-of-the-art dimension-wise compression method.

* 6 pages, IEEE MLSP 2022, GitHub: https://github.com/WesleyHsieh0806/Split-Learning-Compression
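
The merging of several features into one vector by circular convolution and superposition can be sketched as follows. This is an illustration in the style of holographic reduced representations, not the repository's implementation: the random Gaussian keys, the decoding by circular correlation, the stand-in random features, and every name and size below are assumptions.

```python
import math
import random


def circular_convolve(a, b):
    """(a * b)[i] = sum over m of a[m] * b[(i - m) mod n]; binds b to key a."""
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]


def circular_correlate(a, b):
    """Approximate inverse of binding with key a: sum over m of a[m] * b[(i + m) mod n]."""
    n = len(a)
    return [sum(a[m] * b[(i + m) % n] for m in range(n)) for i in range(n)]


def compress(features, keys):
    """Bind each feature to its own key, then superpose everything into one vector."""
    merged = [0.0] * len(features[0])
    for x, k in zip(features, keys):
        merged = [m + b for m, b in zip(merged, circular_convolve(k, x))]
    return merged


def decompress(merged, keys):
    """Recover a noisy estimate of each feature from the single merged vector."""
    return [circular_correlate(k, merged) for k in keys]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
dim, batch = 256, 4  # hypothetical feature size and number of merged features
keys = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)] for _ in range(batch)]
features = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]
merged = compress(features, keys)
recovered = decompress(merged, keys)
```

Only `merged`, a single vector of length `dim`, would need to be transmitted in place of the four original features. The recovered vectors are noisy estimates; because random high-dimensional keys are nearly orthogonal, each estimate should correlate far more strongly with its own original than with the others.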
