Abstract:We investigate learned KV cache compression through Speculative Importance Prediction (SIP), a 1.7M parameter non-query-aware scorer that predicts token importance from KV representations alone. Despite architectural sophistication (multi-horizon lookahead, cross-attention), SIP does not outperform simple baselines, including random selection, across 5 seeds, 4 retention levels, and 3 tasks. Key findings: (1) position-based heuristics (keep first 4 + last N tokens) match or exceed learned approaches; (2) prefill attention provides equivalent signal to complex learned scorers; (3) marginal information in KV representations beyond position and prefill attention appears limited for importance prediction. We hypothesize that circular dependence between future queries and generation trajectories contributes to this difficulty.
Abstract:We present a controlled study of multi-hop contextual reasoning in large language models, providing a clean demonstration of the task-method dissociation: rule-based pattern matching achieves 100% success on structured information retrieval but only 6.7% on tasks requiring cross-document reasoning, while LLM-based multi-agent systems show the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail. Using a synthetic evaluation framework with 120 trials across four models (LLaMA-3 8B, LLaMA-2 13B, Mixtral 8x7B, DeepSeek-V2 16B), we report three key findings: (1) Multi-agent amplification depends on base capability: statistically significant gains occur only for models with sufficient reasoning ability (p < 0.001 for LLaMA-3 8B, p = 0.014 for Mixtral), with improvements of up to 46.7 percentage points, while weaker models show no benefit, suggesting amplification rather than compensation; (2) Active parameters predict reasoning performance: Mixtral's performance aligns with its ~12B active parameters rather than 47B total, consistent with the hypothesis that inference-time compute drives reasoning capability in MoE architectures; (3) Architecture quality matters: LLaMA-3 8B outperforms LLaMA-2 13B despite fewer parameters, consistent with known training improvements. Our results provide controlled quantitative evidence for intuitions about multi-agent coordination and MoE scaling, while highlighting the dependence of multi-agent benefits on base model capability. We release our evaluation framework to support reproducible research on reasoning in mid-scale models.