Abstract:On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local updates, and acts as a unit of communication and hot-swapping.We introduce LoRDBA, a LoRA-compatible adapter that replaces both low-rank factors with binary sign carriers while representing magnitudes through lightweight, channel-wise scales, converting the dense adapter branch into two sign-accumulation matrix multiplications interleaved with channel-wise scaling. A finite-sample analysis shows that reconstruction quality is governed by the residual-to-magnitude ratio of the original LoRA factors. In adapter-mode experiments, LoRDBA outperforms low-bit baselines at matched model sizes while matching fp16 LoRA quality in selected regimes. The unmerged adapter incurs at most 8% prefill latency overhead at matched rank r=16 despite an over 10x reduction in adapter footprint, with moderate training memory overhead of approximately 1.6x that of fp16 LoRA.
Abstract:Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer--solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
Abstract:Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise refinement and global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.
Abstract:Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.
Abstract:For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive as it enables efficient inference without sacrificing accuracy. However, the scaling parameters of DBF are too restrictive; after factoring out signs, all rank components share the same magnitude profile, resulting in performance saturation. We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope. By sharing sign matrices among envelope components, MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.
Abstract:Transformers operate as horizontal token-by-token scanners; at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes dominate inference throughput rather than arithmetic computation. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. Experimental results show that PHOTON is superior to competitive Transformer-based language models regarding the throughput-quality trade-off, offering significant advantages in long-context and multi-query tasks. This reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit memory.




Abstract:Layer-wise post-training quantization has emerged as a widely used technique for compressing large language models (LLMs) without retraining. However, recent progress in this line of research is saturating, underscoring the need to revisit its core limitation and explore further improvements. This study identifies a critical bottleneck in existing layer-wise PTQ methods: the accumulation of quantization errors across layers significantly degrades performance, particularly in low-bit regimes. To address this, we propose Quantization Error Propagation (QEP), a lightweight and general framework that enhances layer-wise PTQ by explicitly propagating the quantization error which enable compensating for accumulated quantization errors. Additionally, we introduce a tunable propagation mechanism that allows for control over both propagation strength and computational overhead, making the framework adaptable to various architectures and resource constraints. Empirical evaluation on LLaMA2 models (7B, 13B, 70B) demonstrate that incorporating QEP into standard layer-wise PTQ pipelines outperforms standard PTQ methods. Notably, QEP yields substantial performance improvements under extreme low-bit quantization settings.




Abstract:We propose ratio divergence (RD) learning for discrete energy-based models, a method that utilizes both training data and a tractable target energy function. We apply RD learning to restricted Boltzmann machines (RBMs), which are a minimal model that satisfies the universal approximation theorem for discrete distributions. RD learning combines the strength of both forward and reverse Kullback-Leibler divergence (KLD) learning, effectively addressing the "notorious" issues of underfitting with the forward KLD and mode-collapse with the reverse KLD. Since the summation of forward and reverse KLD seems to be sufficient to combine the strength of both approaches, we include this learning method as a direct baseline in numerical experiments to evaluate its effectiveness. Numerical experiments demonstrate that RD learning significantly outperforms other learning methods in terms of energy function fitting, mode-covering, and learning stability across various discrete energy-based models. Moreover, the performance gaps between RD learning and the other learning methods become more pronounced as the dimensions of target models increase.

Abstract:Min-max optimization problems, also known as saddle point problems, have attracted significant attention due to their applications in various fields, such as fair beamforming, generative adversarial networks (GANs), and adversarial learning. However, understanding the properties of these min-max problems has remained a substantial challenge. This study introduces a statistical mechanical formalism for analyzing the equilibrium values of min-max problems in the high-dimensional limit, while appropriately addressing the order of operations for min and max. As a first step, we apply this formalism to bilinear min-max games and simple GANs, deriving the relationship between the amount of training data and generalization error and indicating the optimal ratio of fake to real data for effective learning. This formalism provides a groundwork for a deeper theoretical analysis of the equilibrium properties in various machine learning methods based on min-max problems and encourages the development of new algorithms and architectures.
Abstract:Learning-based methods have gained attention as general-purpose solvers because they can automatically learn problem-specific heuristics, reducing the need for manually crafted heuristics. However, these methods often face challenges with scalability. To address these issues, the improved Sampling algorithm for Combinatorial Optimization (iSCO) using discrete Langevin dynamics has been proposed, demonstrating better performance than several learning-based solvers. This study proposes a different approach that integrates gradient-based update through continuous relaxation, combined with Quasi-Quantum Annealing (QQA). QQA smoothly transitions the objective function from a simple convex form, where half-integral solutions dominate, to the original objective function, where the variables are restricted to 0 or 1. Furthermore, we incorporate parallel run communication leveraging GPUs, enhancing exploration capabilities and accelerating convergence. Numerical experiments demonstrate that our approach is a competitive general-purpose solver, achieving comparable performance to iSCO across various benchmark problems. Notably, our method exhibits superior trade-offs between speed and solution quality for large-scale instances compared to iSCO, commercial solvers, and specialized algorithms.