Abstract: Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.
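The distinction between offloading and eviction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`assign_tiers`, `attention`, `demo`), the random stand-in for cumulative attention scores, and the single-query, single-head setup are all assumptions made for illustration, and the compressed tier is omitted for brevity. The sketch only shows that tokens copied out to a separate store and copied back at full precision yield the same attention output, whereas dropping tokens changes it.

```python
import numpy as np

HBM, DDR, EVICTED = 0, 1, 2  # hypothetical tier codes; the compressed tier is omitted


def softmax(x):
    x = x - x.max()  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()


def attention(q, K, V):
    """Single-query scaled dot-product attention over the cached keys and values."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V


def assign_tiers(cum_attn, hbm_ratio, evict_ratio):
    """Rank tokens by cumulative attention: top share -> HBM, bottom share -> EVICTED, rest -> DDR."""
    assert hbm_ratio + evict_ratio <= 1.0
    n = cum_attn.shape[0]
    order = np.argsort(-cum_attn)  # most important token first
    n_hbm = int(round(hbm_ratio * n))
    n_evict = int(round(evict_ratio * n))
    tiers = np.full(n, DDR, dtype=np.int64)
    tiers[order[:n_hbm]] = HBM
    if n_evict > 0:
        tiers[order[n - n_evict:]] = EVICTED
    return tiers


def demo(seed=0, n=64, d=16):
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    q = rng.standard_normal(d)
    cum_attn = rng.random(n)  # assumption: stand-in for real cumulative attention scores

    full = attention(q, K, V)

    # Offloading: DDR-tier rows are copied to a separate "CPU" store, then
    # prefetched back at full precision before the attention step.
    tiers = assign_tiers(cum_attn, hbm_ratio=0.5, evict_ratio=0.0)
    hbm, ddr = tiers == HBM, tiers == DDR
    cpu_K, cpu_V = K[ddr].copy(), V[ddr].copy()
    K_re, V_re = np.empty_like(K), np.empty_like(V)
    K_re[hbm], V_re[hbm] = K[hbm], V[hbm]
    K_re[ddr], V_re[ddr] = cpu_K, cpu_V
    offloaded = attention(q, K_re, V_re)

    # Eviction: the same number of low-importance tokens is dropped permanently.
    tiers = assign_tiers(cum_attn, hbm_ratio=0.5, evict_ratio=0.5)
    keep = tiers != EVICTED
    evicted = attention(q, K[keep], V[keep])

    return full, offloaded, evicted
```

Calling `demo()` returns the full-cache, offloaded, and evicted attention outputs, so the first two can be compared for exact equality and the third inspected for its deviation. In a real system the copy would be a GPU-CPU transfer rather than an in-memory `copy()`.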




Abstract: Real-time, low-cost, wireless monitoring of mechanical vibration is needed in industrial applications to track the operating status of equipment, in environmental applications to proactively predict natural disasters, and in day-to-day applications such as vital sign monitoring. Despite this urgent need, existing solutions, such as laser vibrometers, commercial Wi-Fi devices, and cameras, have not seen wide practical deployment because of their limited sensitivity and functionality. In this work, we propose and verify that a fully passive, resonance-based vibration processing device attached to the vibrating surface can improve the sensitivity of wireless vibration measurement methods by more than 10 times at designated frequencies. Additionally, the device provides analog, real-time vibration filtering and labeling, and serves as a platform for surface editing, which adds functionality to current non-contact sensing systems. Finally, the working frequency of the device is adjustable over orders of magnitude, broadening its applicability to different applications.