Abstract:Process reward models (PRMs) are widely used to provide dense step-level supervision in language-model training. This use assumes that PRM scores remain stable proxies for step correctness under label-preserving transformations, which change the structure of a reasoning chain but preserve its final answer. We argue that this assumption is not well validated: such transformations can change how PRM scores relate to correctness signals, leading to different failure modes across models. To address this gap, we introduce \textbf{EST-PRM}, a stress-testing framework for dense process rewards. It applies three transformations: (1) step inflation, (2) dependency-aware step reordering, and (3) confidence markers. We define a vulnerability decomposition that separates reward inflation from loss of correctness sensitivity, and we evaluate five PRM-style models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench. The results indicate clear differences in vulnerability patterns across models. Math-Shepherd shows the strongest sensitivity to position perturbations, with a Pearson correlation drop of $0.152 \pm 0.038$ and a $32.8 \pm 4.9\%$ score inflation rate. Qwen2.5-Math-PRM is most affected by step inflation, reaching a $47.6 \pm 4.3\%$ inflation rate. Confidence-based perturbations also distort reward calibration, revealing inconsistencies in correctness estimation. We evaluate three mitigation strategies, which highlight trade-offs between robustness coverage and false-positive rates.
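The two headline quantities in the abstract above, a Pearson correlation drop and a score inflation rate, can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the ones below are assumptions: the drop is taken as the correlation between scores and correctness labels before a perturbation minus the same correlation after it, and the inflation rate as the fraction of chains whose score rises by more than a margin. The function names and the `margin` parameter are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def correlation_drop(orig_scores, pert_scores, labels):
    """Assumed definition: corr(original, labels) - corr(perturbed, labels)."""
    return pearson(orig_scores, labels) - pearson(pert_scores, labels)


def inflation_rate(orig_scores, pert_scores, margin=0.0):
    """Assumed definition: share of chains whose score rises by more than `margin`."""
    raised = sum(p > o + margin for o, p in zip(orig_scores, pert_scores))
    return raised / len(orig_scores)


# Toy data: two correct and two incorrect chains; the perturbation lifts
# the scores of the incorrect chains only.
labels = [1, 1, 0, 0]
orig = [0.9, 0.8, 0.2, 0.1]
pert = [0.9, 0.8, 0.7, 0.6]
drop = correlation_drop(orig, pert, labels)
rate = inflation_rate(orig, pert)
```

Reporting the two numbers separately mirrors the decomposition the abstract describes: a model can inflate scores uniformly without losing rank agreement with correctness, or lose that agreement with little net inflation.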
Abstract:Federated continual learning (FCL) lets distributed clients adapt language-model heads to evolving NLP tasks without sharing raw text. Under user-level differential privacy (DP), replay-based continual learning faces a structural obstacle: clients can release only small noisy lists of candidate replay summaries, and those lists are unordered across clients. We introduce Canonicalized Stable-List Replay (CSLR), where clients privately produce candidate replay distributions over a shared sentence-embedding space and the server aligns them using signatures induced by public anchor sentences. The anchors provide identifiability for aggregation rather than additional replay data. We prove that, under an observable anchor-signature margin, $O(\log(N/η)/p)$ anchors distinguish $N$ candidate list elements with probability at least $1-η$, and we give a scoped anchorless non-identifiability result for unordered-label oracle models. Across five seeds on continual classification, NER, and dialogue benchmarks, CSLR improves the final average task metric by 3.9--5.6 points over the strongest non-CSLR DP baseline at $\eps=4$ under the reported replay-release budget, while also outperforming Hungarian and optimal-transport matchers. The formal privacy guarantee covers replay release; end-to-end private training additionally requires composition with a private optimizer for task-head updates.
Abstract:State abstraction in reinforcement learning is usually formulated as a partition of states based on reward and transition similarity. This excludes a common structural pattern in navigation, graph, and hierarchical decision problems: interface states such as doors, hubs, and bottlenecks naturally participate in more than one region. We introduce \emph{tangle-core abstraction}, an overlapping state-abstraction framework based on graph tangles of empirical transition graphs. The method constructs abstract states from consistently oriented low-order separations and represents shared interfaces through a membership kernel rather than a hard partition. We give value-preservation guarantees for the induced overlapping abstract MDP under an explicit action-consistency condition, identify an interior-homogeneity/boundary-leakage error decomposition, and prove a quantitative interface-overlap result showing when hard partitions incur an avoidable boundary error. Empirically, tangle-core abstractions achieve favorable compression--return tradeoffs against reward-aware, learned, topological-map, and graph-partitioning baselines across bottlenecked tabular domains, procedurally generated mazes, and MiniGrid representations. We also identify a clear failure regime in which transition topology is uninformative, where tangles predictably offer little benefit. These results position graph tangles as an effective topology-aware abstraction prior for decision problems with shared interface structure.
Abstract:When many reinforcement-learning policies achieve near-optimal return, a post-hoc auditor may have to distinguish among many behaviorally distinct but return-equivalent policies. We formalize this phenomenon through an occupancy-measure analogue of Rashomon capacity: the metric entropy of the near-optimal occupancy region, computed relative to an audited deployment class. Because occupancy measures identify behavior only up to occupancy equivalence, we formulate auditing at the occupancy-class level and distinguish exact local-query oracles from noisy sample-query oracles. Our main exact-query result is conditional: if the audited class contains a $2/H$-separated near-optimal packing whose local signatures are $b$-sparse, then exact local-query auditing requires $Ω(M/b)$ queries; when the packing realizes deployment-class capacity and $b=O(1)$, this becomes $Ω(2^{\Hopt^\cF(\eps)})$. We give a finite discounted hidden-branch MDP attaining this bound and show the exact Bayes success law. For noisy hidden-trigger testing, we prove a mixture lower bound of order $M/β$, where $β$ is the per-sample KL signal, yielding $Ω(2^{\Hopt^\cF(\eps)}/(ρ^2Δ^2))$ for capacity-order packings with $β=O(ρ^2Δ^2)$. We also provide a static target-recognition information lower bound, a transcript-compatible oracle-cover verification upper bound, and a canonical occupancy regularizer whose regularized audited capacity collapses when a trusted reference occupancy is available. Controlled benchmarks distinguish positive sparse-signature instances from high-capacity negative controls where exact auditing is easy, and map the noisy-trigger law to post-processed continuous-control and visual-RL auditing regimes.
Abstract:Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed replay ratios are inherently limited because the optimal mixture varies with the current domain, the training stage, and the evolving vulnerability of prior behaviors. We propose PROXYMIX, a framework that learns a dynamic replay controller on a small proxy model and transfers the frozen controller to a larger target. The controller never observes future tasks and constructs its state from normalized validation losses and their temporal dynamics, producing a masked mixture over the current task and accessible replay buffers. Our core empirical hypothesis is forgetting mirroring: task vulnerability rankings remain largely consistent across model scales even when absolute loss magnitudes differ. We validate this assumption empirically before transferring controllers across scales. On LLaMA-3-8B across five continual instruction tuning sequences, PROXYMIX improves average accuracy by 3.4 points, reduces final forgetting by 3.5 points, and raises safety score by 5.8 points over the strongest non-oracle baseline, at roughly 50x lower policy learning cost than Oracle Target RL. The framework is leakage-free and architecture-independent at the interface level, and we also identify settings where the proxy assumption breaks down, highlighting limitations for robust deployment.
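The controller interface described above (a state built from normalized validation losses and their temporal dynamics, and a masked mixture over accessible sources) can be sketched as follows. This is an illustration under stated assumptions, not the paper's controller: the normalization by a reference loss, the one-step loss difference as the "temporal dynamics" feature, and the masked softmax are all assumed, and the controller that would map the state to logits is omitted. All names are hypothetical.

```python
import math


def build_state(val_losses, prev_losses, ref_losses):
    """Assumed state: per task, loss normalized by a reference loss,
    plus its one-step change on the same normalized scale."""
    state = []
    for cur, prev, ref in zip(val_losses, prev_losses, ref_losses):
        state.append(cur / ref)
        state.append((cur - prev) / ref)
    return state


def masked_mixture(logits, accessible):
    """Softmax restricted to accessible sources; masked sources get weight 0."""
    active = [l for l, ok in zip(logits, accessible) if ok]
    m = max(active)  # subtract the max for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, accessible)]
    z = sum(exps)
    return [e / z for e in exps]


# Three sources: the current task, one accessible replay buffer, and one
# buffer for a task not yet seen (masked out, so it cannot leak).
weights = masked_mixture([0.0, 0.0, 5.0], [True, True, False])
state = build_state([2.0], [3.0], [4.0])
```

Masking before normalization, rather than zeroing weights afterwards, keeps the mixture a valid distribution over only the sources the controller is allowed to use.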
Abstract:As retrieval-augmented generation (RAG) systems scale, it becomes increasingly challenging to ensure faithful grounding in external evidence. Large language models may still prioritize parametric knowledge over retrieved information when conflicts arise. We propose a training-free decoding framework, \emph{Grounded Decoding}, designed to improve factual consistency in RAG without modifying model parameters. Unlike standard approaches that rely on a single conditional distribution, our method constructs two matched-prompt distributions at every generation step: (1) a full RAG distribution conditioned on the query, retrieved documents, and generated prefix, and (2) a retrieval-only distribution conditioned solely on retrieved evidence and the same prefix. The final next-token distribution is derived as the unique solution to a KL-barycenter objective over the probability simplex, yielding a normalized geometric fusion of the two distributions. This formulation naturally recovers standard RAG when the grounding weight is zero and smoothly shifts probability mass toward retrieved evidence as grounding strength increases. We further introduce a conflict-aware adaptive weighting scheme that dynamically adjusts grounding based on distributional disagreement and retriever confidence. Experiments on ALCE, Natural Questions, and FActScore demonstrate consistent improvements in factual accuracy and citation quality over standard RAG and competitive decoding-time baselines, while maintaining fluency. Our results indicate that probability-level fusion provides a strong and efficient alternative to logit-level intervention methods for faithful RAG decoding.
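The normalized geometric fusion described above can be sketched for a single decoding step. The parameterization is an assumption chosen to match the abstract's statement that a grounding weight of zero recovers standard RAG: the fused distribution is taken proportional to `p_rag ** (1 - lam) * p_ret ** lam`, which is the minimizer of a weighted sum of KL divergences from the fused distribution to the two inputs. The names `fuse`, `lam`, and `eps` are hypothetical, and the adaptive weighting scheme is not modeled.

```python
import math


def fuse(p_rag, p_ret, lam, eps=1e-12):
    """Normalized geometric mean of two next-token distributions.

    lam = 0 returns p_rag (up to eps smoothing); larger lam moves
    probability mass toward the retrieval-only distribution p_ret.
    """
    logs = [
        (1.0 - lam) * math.log(a + eps) + lam * math.log(b + eps)
        for a, b in zip(p_rag, p_ret)
    ]
    m = max(logs)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(l - m) for l in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Toy vocabulary of three tokens: the full RAG distribution prefers token 0,
# the retrieval-only distribution prefers token 2.
p_rag = [0.7, 0.2, 0.1]
p_ret = [0.1, 0.2, 0.7]
same_as_rag = fuse(p_rag, p_ret, lam=0.0)
balanced = fuse(p_rag, p_ret, lam=0.5)
```

Because the fusion is multiplicative, a token that either distribution considers nearly impossible stays unlikely in the fused distribution, which is one way such a rule differs from a linear mixture.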
Abstract:Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.
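The per-task threshold refit described above can be illustrated with standard split conformal prediction. This is a minimal sketch, assuming the nonconformity score is one minus the model's probability for the true label; the abstract does not state the score function, and the names `refit_thresholds`, `buffers`, and `model_prob` are hypothetical. The `(m + 1)` correction in the quantile index is the standard finite-sample adjustment that gives coverage of at least `1 - alpha` under exchangeability.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((m + 1) * (1 - alpha))-th smallest score."""
    m = len(scores)
    k = math.ceil((m + 1) * (1.0 - alpha))
    if k > m:
        return float("inf")  # buffer too small for this alpha: include every label
    return sorted(scores)[k - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


def refit_thresholds(buffers, model_prob, alpha):
    """Recompute one threshold per task from its held-out buffer under the current model.

    buffers: {task_name: [(x, true_label), ...]}
    model_prob: callable mapping x to a list of class probabilities
    """
    return {
        task: conformal_threshold([1.0 - model_prob(x)[y] for x, y in buf], alpha)
        for task, buf in buffers.items()
    }


# Toy usage with a stand-in "model" over two classes.
toy_model = lambda x: [0.2, 0.8] if x == 1 else [0.9, 0.1]
thresholds = refit_thresholds({"task_a": [(0, 0), (1, 1), (2, 0)]}, toy_model, alpha=0.5)
labels = prediction_set([0.7, 0.25, 0.05], threshold=0.8)
```

Only forward passes on the buffer are needed after each update, which is consistent with the abstract's point that the procedure adds no training-time gradient cost.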
Abstract:Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels, hidden dimensions with unusually large activations, causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30\%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.
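The greedy selection described above can be sketched in its budgeted form: pick, at each step, the sample that adds the most uncovered channel weight. This is an illustration under assumptions rather than the COVERCAL implementation. The inputs (`sample_channels`, mapping each candidate sample to the outlier channels it activates, and `channel_weight`) are hypothetical stand-ins for the pre-computed activation statistics, and how a channel counts as "activated" is not specified in the abstract. For monotone submodular objectives under a cardinality budget, this greedy rule carries the classical (1 - 1/e) approximation guarantee.

```python
def greedy_weighted_cover(sample_channels, channel_weight, budget):
    """Greedily pick up to `budget` samples maximizing covered channel weight.

    sample_channels: {sample_id: set of outlier-channel ids it activates}
    channel_weight:  {channel_id: non-negative weight}
    """
    covered = set()
    chosen = []
    remaining = set(sample_channels)
    for _ in range(budget):
        best, best_gain = None, 0.0
        for s in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = sum(channel_weight[c] for c in sample_channels[s] - covered)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:  # no remaining sample adds uncovered weight
            break
        chosen.append(best)
        covered |= sample_channels[best]
        remaining.remove(best)
    return chosen, covered


# Toy instance: channel 3 is a heavy outlier that only sample "b" activates.
samples = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
weights = {1: 1.0, 2: 1.0, 3: 5.0, 4: 2.5}
chosen, covered = greedy_weighted_cover(samples, weights, budget=2)
```

The loop touches only set membership and a weight table, which is in line with the abstract's remark that selection operates on pre-computed statistics and needs no GPU time.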
Abstract:Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $Ω(\sqrt{\Tmix/n})$. We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $Ω(\Tmix/\sqrt{n})$, exhibiting a $\sqrt{\Tmix}$ algorithmic gap. Finally, we propose \emph{adaptive spectral routing}, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\mathcal{O}(\sqrt{\Tmix/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $\Tmix$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
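The partitioning step described above, splitting data by the sign of the Fiedler eigenvector of a dependency graph, can be sketched without external libraries. This is a minimal illustration, not the adaptive spectral routing algorithm: the dependency graph is assumed to be given as a dense symmetric adjacency matrix of a connected graph with at least one edge, the eigenvector is approximated by projected power iteration, and the function names are hypothetical. In practice a library eigensolver would replace the hand-written iteration.

```python
import math


def fiedler_vector(adj, iters=500):
    """Approximate the Fiedler vector of L = D - A by power iteration on c*I - L.

    Projecting out the all-ones vector each step removes the trivial
    eigenvector, so the iteration converges to the eigenvector of the
    second-smallest Laplacian eigenvalue.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2.0 * max(deg)  # 2 * max degree upper-bounds the largest eigenvalue of L
    v = [i - (n - 1) / 2.0 for i in range(n)]  # deterministic start, orthogonal to ones
    for _ in range(iters):
        w = [
            (c - deg[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        mean = sum(w) / n
        w = [x - mean for x in w]  # re-project onto the complement of ones
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def spectral_split(adj):
    """Partition node indices by the sign of the approximate Fiedler vector."""
    v = fiedler_vector(adj)
    left = [i for i, x in enumerate(v) if x < 0]
    right = [i for i, x in enumerate(v) if x >= 0]
    return left, right


# Toy dependency graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1
left, right = spectral_split(adj)
```

On this toy graph the split falls on the single bridging edge, which is the kind of weak-dependence cut one would want between sub-ensembles.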
Abstract:Crash classification models in transportation safety are typically evaluated using accuracy, F1, or AUC, metrics that cannot reveal whether a model is silently overfitting. We introduce a spectral diagnostic framework grounded in Random Matrix Theory (RMT) and Heavy-Tailed Self-Regularization (HTSR) that spans the ML taxonomy: weight matrices for BERT/ALBERT/Qwen2.5, out-of-fold increment matrices for XGBoost/Random Forest, empirical Hessians for Logistic Regression, induced affinity matrices for Decision Trees, and Graph Laplacians for KNN. Evaluating nine model families on two Iowa DOT crash classification tasks (173,512 and 371,062 records respectively), we find that the power-law exponent $α$ provides a structural quality signal: well-regularized models consistently yield $α$ within $[2, 4]$ (mean $2.87 \pm 0.34$), while overfit variants show $α < 2$ or spectral collapse. We observe a strong rank correlation between $α$ and expert agreement (Spearman $ρ = 0.89$, $p < 0.001$), suggesting spectral quality captures model behaviors aligned with expert reasoning. We propose an $α$-based early stopping criterion and a spectral model selection protocol, and validate both against cross-validated F1 baselines. Sparse Lanczos approximations make the framework scalable to large datasets.
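The power-law exponent used as the quality signal above can be estimated from a set of eigenvalues with the standard continuous maximum-likelihood estimator, `alpha = 1 + n / sum(log(x_i / x_min))` over the tail `x_i >= x_min`. This is a minimal sketch under assumptions: the abstract does not say which estimator or which choice of `x_min` the framework uses, and the function names and the band check are hypothetical helpers built around the `[2, 4]` range quoted in the abstract.

```python
import math


def powerlaw_alpha(eigenvalues, x_min):
    """Continuous power-law MLE for the tail exponent, using values >= x_min.

    Assumes at least one eigenvalue lies strictly above x_min.
    """
    tail = [x for x in eigenvalues if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def in_reported_band(alpha, low=2.0, high=4.0):
    """True if alpha falls in the range the abstract reports for well-regularized models."""
    return low <= alpha <= high


# Synthetic check: deterministic quantiles of a power law with exponent 3,
# generated by inverting its CDF on a midpoint grid.
true_alpha, x_min, n = 3.0, 1.0, 2000
synthetic = [
    x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0)) for i in range(n)
]
estimate = powerlaw_alpha(synthetic, x_min)
```

The estimate is sensitive to `x_min`, so any threshold rule built on it (for example, an early-stopping trigger) would need the tail cutoff chosen consistently across checkpoints.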