Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jae-Joon Kim

LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Oct 16, 2025

Beomseok Kang, Jiwon Song, Jae-Joon Kim

Abstract:Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks, e.g., OBQA, CSQA, and StrategyQA, show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

Via

Access Paper or Ask Questions

Retrospective Sparse Attention for Efficient Long-Context Generation

Aug 12, 2025

Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

Figure 1 for Retrospective Sparse Attention for Efficient Long-Context Generation

Figure 2 for Retrospective Sparse Attention for Efficient Long-Context Generation

Figure 3 for Retrospective Sparse Attention for Efficient Long-Context Generation

Figure 4 for Retrospective Sparse Attention for Efficient Long-Context Generation

Abstract:Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6$\times$ and accuracy by up to 21.9\%.

Via

Access Paper or Ask Questions

Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

May 20, 2025

Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim

Figure 1 for Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Figure 2 for Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Figure 3 for Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Figure 4 for Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Abstract:Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective in solving problems that require logical thinking, long reasoning paths significantly increase memory usage and throughput of token generation, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining KV cache that receive high importance score, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves generation throughput of QwQ-32B by up to 1.60$\times$ compared to the inference with full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.

Via

Access Paper or Ask Questions

FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Feb 03, 2025

Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim

Abstract:While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00$\times$ and 1.40$\times$ improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.

Via

Access Paper or Ask Questions

COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

Jan 12, 2025

Jihoon Park, Jeongin Choe, Dohyun Kim, Jae-Joon Kim

Figure 1 for COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

Figure 2 for COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

Figure 3 for COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

Figure 4 for COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

Abstract:Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent with the significantly increasing network sizes compared to the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted for networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency, memory accesses, and improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate much more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.

* Accepted IEEE DATE 2025

Via

Access Paper or Ask Questions

Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Jun 18, 2024

Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim

Figure 1 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Figure 2 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Figure 3 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Figure 4 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Abstract:Binarization, which converts weight parameters to binary values, has emerged as an effective strategy to reduce the size of large language models (LLMs). However, typical binarization techniques significantly diminish linguistic effectiveness of LLMs. To address this issue, we introduce a novel binarization technique called Mixture of Scales (BinaryMoS). Unlike conventional methods, BinaryMoS employs multiple scaling experts for binary weights, dynamically merging these experts for each token to adaptively generate scaling factors. This token-adaptive approach boosts the representational power of binarized LLMs by enabling contextual adjustments to the values of binary weights. Moreover, because this adaptive process only involves the scaling factors rather than the entire weight matrix, BinaryMoS maintains compression efficiency similar to traditional static binarization methods. Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques in various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining similar model size to static binarization techniques.

Via

Access Paper or Ask Questions

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Feb 14, 2024

Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim

Figure 1 for SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Figure 2 for SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Figure 3 for SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Figure 4 for SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Abstract:Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at reducing the size and complexity of LLMs, offers a potential solution by removing redundant components from the network. Despite the promise of pruning, existing methods often struggle to achieve substantial end-to-end LLM inference speedup. In this paper, we introduce SLEB, a novel approach designed to streamline LLMs by eliminating redundant transformer blocks. We choose the transformer block as the fundamental unit for pruning, because LLMs exhibit block-level redundancy with high similarity between the outputs of neighboring blocks. This choice allows us to effectively enhance the processing speed of LLMs. Our experimental results demonstrate that SLEB successfully accelerates LLM inference without compromising the linguistic capabilities of these models, making it a promising technique for optimizing the efficiency of LLMs. The code is available at: https://github.com/leapingjagg-dev/SLEB

Via

Access Paper or Ask Questions

Squeezing Large-Scale Diffusion Models for Mobile

Jul 03, 2023

Jiwoong Choi, Minkyu Kim, Daehyun Ahn, Taesu Kim, Yulhwa Kim, Dongwon Jo, Hyesung Jeon, Jae-Joon Kim, Hyungjun Kim

Abstract:The emergence of diffusion models has greatly broadened the scope of high-fidelity image synthesis, resulting in notable advancements in both practical implementation and academic research. With the active adoption of the model in various real-world applications, the need for on-device deployment has grown considerably. However, deploying large diffusion models such as Stable Diffusion with more than one billion parameters to mobile devices poses distinctive challenges due to the limited computational and memory resources, which may vary according to the device. In this paper, we present the challenges and solutions for deploying Stable Diffusion on mobile devices with TensorFlow Lite framework, which supports both iOS and Android devices. The resulting Mobile Stable Diffusion achieves the inference latency of smaller than 7 seconds for a 512x512 image generation on Android devices with mobile GPUs.

* 7 pages, 8 figures, ICML 2023 Workshop on Challenges in Deployable Generative AI

Via

Access Paper or Ask Questions

INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Apr 18, 2022

Changhun Lee, Hyungjun Kim, Eunhyeok Park, Jae-Joon Kim

Figure 1 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Figure 2 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Figure 3 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Figure 4 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Abstract:Binary Neural Networks (BNNs) have emerged as a promising solution for reducing the memory footprint and compute costs of deep neural networks. BNNs, on the other hand, suffer from information loss because binary activations are limited to only two values, resulting in reduced accuracy. To improve the accuracy, previous studies have attempted to control the distribution of binary activation by manually shifting the threshold of the activation function or making the shift amount trainable. During the process, they usually depended on statistical information computed from a batch. We argue that using statistical data from a batch fails to capture the crucial information for each input instance in BNN computations, and the differences between statistical information computed from each instance need to be considered when determining the binary activation threshold of each instance. Based on the concept, we propose the Binary Neural Network with INSTAnce-aware threshold (INSTA-BNN), which decides the activation threshold value considering the difference between statistical data computed from a batch and each instance. The proposed INSTA-BNN outperforms the baseline by 2.5% and 2.3% on the ImageNet classification task with comparable computing cost, achieving 68.0% and 71.7% top-1 accuracy on ResNet-18 and MobileNetV1 based models, respectively.

* 19 pages, 7 figures; excluded axessibility package

Via

Access Paper or Ask Questions

Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Dec 02, 2020

Hyungjun Kim, Jihoon Park, Changhun Lee, Jae-Joon Kim

Figure 1 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Figure 2 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Figure 3 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Figure 4 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Abstract:Binarization of neural network models is considered as one of the promising methods to deploy deep neural network models on resource-constrained environments such as mobile devices. However, Binary Neural Networks (BNNs) tend to suffer from severe accuracy degradation compared to the full-precision counterpart model. Several techniques were proposed to improve the accuracy of BNNs. One of the approaches is to balance the distribution of binary activations so that the amount of information in the binary activations becomes maximum. Based on extensive analysis, in stark contrast to previous work, we argue that unbalanced activation distribution can actually improve the accuracy of BNNs. We also show that adjusting the threshold values of binary activation functions results in the unbalanced distribution of the binary activation, which increases the accuracy of BNN models. Experimental results show that the accuracy of previous BNN models (e.g. XNOR-Net and Bi-Real-Net) can be improved by simply shifting the threshold values of binary activation functions without requiring any other modification.

* 11 pages, 10 figures

Via

Access Paper or Ask Questions