Abstract: Chain-of-Thought (CoT) prompting improves reasoning but often produces long, redundant traces that substantially increase inference cost. We present SyncThink, a training-free, plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning steps and instead concentrate on the special terminator token "</think>", indicating an information bottleneck at that position. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning early once that signal fires. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1-distilled models show that SyncThink achieves 62.00% average Top-1 accuracy with 656 generated tokens and 28.68 s latency, versus 61.22%, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink further yields up to +8.1 points of absolute accuracy by preventing over-thinking.
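A minimal sketch of this decoding loop follows, under assumptions the abstract does not pin down: here the "reasoning-transition signal" is taken to be the model's own probability of emitting the </think> terminator, and the model name and threshold are illustrative choices, not values from the paper.

```python
# Hedged sketch of SyncThink-style early-exit CoT decoding.
# Assumptions (not from the paper): the transition signal is the probability
# mass the model places on </think>; greedy decoding; illustrative threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # illustrative choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# </think> is a single vocabulary token in the DeepSeek-R1 distill tokenizers.
end_think_id = tok.convert_tokens_to_ids("</think>")

@torch.no_grad()
def generate_with_early_exit(prompt, max_reason_tokens=512, threshold=0.05):
    ids = tok(prompt, return_tensors="pt").input_ids
    # Phase 1: decode reasoning while watching the terminator's probability.
    for _ in range(max_reason_tokens):
        logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        if probs[end_think_id] > threshold:
            break  # transition signal fired: stop reasoning early
        next_id = torch.argmax(probs)  # greedy decoding for simplicity
        if next_id.item() == end_think_id:
            break  # the model ended its reasoning on its own
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    # Phase 2: force the terminator, then decode the answer normally.
    ids = torch.cat([ids, torch.tensor([[end_think_id]])], dim=-1)
    out = model.generate(ids, max_new_tokens=256)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

Re-running the full forward pass each step keeps the sketch short; a real implementation would reuse the KV cache across steps.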

Abstract: While Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), the excessive length of reasoning tokens increases latency and KV cache memory usage, and may even truncate final answers under context limits. We propose ThinkLess, an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model. Attention analysis reveals that answer tokens attend minimally to earlier reasoning steps and focus primarily on the reasoning terminator token, due to information migration under causal masking. Building on this insight, ThinkLess inserts the terminator token at earlier positions to skip redundant reasoning while preserving the underlying knowledge transfer. To prevent the format disruption caused by early termination, ThinkLess employs a lightweight post-regulation mechanism that relies on the model's natural instruction-following ability to produce well-structured answers. Without fine-tuning or auxiliary data, ThinkLess achieves accuracy comparable to full-length CoT decoding while greatly reducing decoding time and memory consumption.
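A minimal sketch of the terminator-insertion idea, with the assumptions labeled plainly: the raw-text prompt format, the model name, the small reasoning budget, and the POST_REGULATION wording are illustrative stand-ins, since the abstract does not specify any of them.

```python
# Hedged sketch of ThinkLess-style decoding: close the <think> block early,
# then append a short formatting instruction ("post-regulation") before
# decoding the answer. All concrete strings below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # illustrative choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Hypothetical post-regulation instruction; the paper's wording may differ.
POST_REGULATION = "\nGive the final answer directly, in the expected format.\n"

@torch.no_grad()
def thinkless_generate(question, reason_budget=0, max_answer_tokens=256):
    # Open the reasoning block, optionally decode a tiny reasoning budget,
    # then force-close it with </think> to skip the bulk of the CoT.
    ids = tok(question + "\n<think>\n", return_tensors="pt").input_ids
    if reason_budget > 0:
        ids = model.generate(ids, max_new_tokens=reason_budget, do_sample=False)
    tail = tok("</think>" + POST_REGULATION,
               return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([ids, tail], dim=-1)
    out = model.generate(ids, max_new_tokens=max_answer_tokens)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

With reason_budget=0 the reasoning block is closed immediately, which matches the abstract's claim that knowledge transfer survives without the long trace; a small nonzero budget trades a few tokens for extra robustness.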