In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $\kappa_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0--3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard -- GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
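The judge validation above reports a weighted kappa ($\kappa_w$) against a domain expert. As a minimal sketch of how such an agreement figure is computed on a 0--3 ordinal rubric (the rating vectors below are illustrative, not the paper's data):

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes=4):
    """Quadratic-weighted Cohen's kappa for ordinal labels 0..n_classes-1."""
    n = len(rater_a)
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for i, j in zip(rater_a, rater_b):
        obs[i][j] += 1
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n           # expected count under independence
    return 1.0 - num / den

# Judge vs. expert scores on the 0--3 rubric (illustrative, not the paper's data).
judge  = [0, 1, 2, 3, 2, 1]
expert = [0, 1, 2, 2, 2, 1]
print(round(quadratic_weighted_kappa(judge, expert), 3))  # -> 0.889
```

Quadratic weighting penalizes a 0-vs-3 disagreement nine times as heavily as an adjacent one, which is why it is the standard choice for graded rubrics.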
We study the parameterized complexity of testing approximate first-order stationarity at a prescribed point for continuous piecewise-affine (PA) functions, a basic task in nonsmooth optimization. PA functions form a canonical model for nonsmooth stationarity testing and capture the local polyhedral geometry that appears in ReLU-type training losses. Recent work by Tian and So (SODA 2025) shows that testing approximate stationarity notions for PA functions is computationally intractable in the worst case, and identifies fixed-dimensional tractability as an open direction. We address this direction from the viewpoint of parameterized complexity, with the ambient dimension $d$ as the parameter. We give XP algorithms in fixed dimension for the tractable sides, and prove W[1]-hardness for the complementary sides. Moreover, lower bounds under the Exponential Time Hypothesis rule out algorithms running in time $\rho(d)\size^{o(d)}$ for any computable function $\rho$, where $\size$ denotes the total binary encoding length of the stationarity-testing instance. As a further consequence, our results yield the corresponding parameterized complexity picture for testing local minimality of continuous PA functions. We further extend our hardness results to a family of shallow ReLU CNN training losses, with stationarity tested in the trainable weight space. Thus, the same parameterized-complexity picture also appears for simple CNN training losses.
GUI agents are beginning to operate the web, mobile, and desktop as interactive worlds, where successful control depends on carrying forward visual, procedural, and task-level evidence beyond the fleeting present screen. Yet most agents still treat memory as an external, human-readable artifact: histories are summarized, categorized, retrieved, and reinserted as text or structured records before being encoded again by the policy. This creates a mismatch between the representational form in which experience is stored and the latent embedding sequence over which modern GUI policies actually act. We introduce Mem-W, a series of latent-memory-native GUI agents that treat memory as part of the agent's continuous context rather than as an auxiliary symbolic scaffold. Mem-W weaves both historical trajectories (as experiential memory) and in-session segments (as working memory) into compact memory tokens through a shared trajectory-to-latent compressor. These tokens are woven with the current GUI observation and local context into one continuous embedding sequence, allowing the agent to read successes, failures, and unfinished progress through the same machine-native interface. Mem-W is trained with self-distillation and outcome-aware supervision to preserve decision-relevant state while filtering memory toward evidence that truly supports task success. Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to $+30.0$, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agency.
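The weave of compressed memory tokens with current-screen embeddings can be illustrated with a toy compressor. The mean-pooling compressor, dimensions, and token counts below are placeholders for Mem-W's learned trajectory-to-latent compressor:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_mem = 8, 4

def compress(trajectory_embs, n_tokens):
    """Toy trajectory-to-latent compressor: mean-pool equal chunks of a
    step-embedding sequence into n_tokens memory tokens (the real
    compressor is a trained network, not pooling)."""
    chunks = np.array_split(trajectory_embs, n_tokens)
    return np.stack([c.mean(axis=0) for c in chunks])

history = rng.standard_normal((37, d_model))  # past-trajectory step embeddings
screen = rng.standard_normal((16, d_model))   # current GUI observation tokens
mem = compress(history, n_mem)
context = np.concatenate([mem, screen])       # one continuous embedding sequence
print(context.shape)  # -> (20, 8)
```

The point of the sketch is the interface: the policy consumes memory and observation through a single latent sequence, with no detour through human-readable text.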
With the rapid growth of mobile robotics and embedded intelligence, there is an increasing demand for efficient on-device data processing on edge platforms. A promising research direction is the use of neuromorphic sensors inspired by human sensory systems, which generate sparse, event-based data encoding changes in the environment. In this work, we present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, enabling real-time processing of raw audio data. The proposed architecture eliminates conventional signal preprocessing and operates directly on event-based audio streams. Leveraging a compute-near-memory network architecture, the system achieves efficient inference with low latency and low power consumption. Experimental results demonstrate an accuracy of 87.43% after quantization on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, with end-to-end latency below 35 µs and average power consumption of 1.12 W. The processed datasets, software models, and hardware modules are available at https://github.com/vision-agh/NAS-GNN-KWS.
A Mamba state-space model trained only for next-step prediction appears to recover Granger-causal structure through a simple readout $S = |W_{out} W_{in}|$, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at $p < 10^{-5}$. We package the protocol used to test that claim -- standardized synthetic generators (VAR/Lorenz/CauseMe-style), three intervention semantics ($do(X=c)$, soft-noise, random-forcing), edge-provenance cards on three real datasets, and size-matched control arms -- as a reusable falsification benchmark, and walk the claim through it in five stages. The method-level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe-style benchmarks, and on Lorenz-96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) roughly 60% of the headline intervention advantage is a sample-size confound, and the residual disappears under standard $do(X=c)$ interventions, surviving only under a non-standard random-forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger -- the effect is method-agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.
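The readout itself is a one-line computation once the bottleneck weights are in hand. A minimal sketch with hypothetical learned projections and an arbitrary edge threshold (not the paper's trained values):

```python
import numpy as np

# Hypothetical learned projections of a rank-2 linear bottleneck over d=3
# channels: W_in maps the observation into the latent, W_out maps the
# latent back to the next-step prediction.
W_in = np.array([[0.9, 0.0, 0.1],
                 [0.4, 0.0, 0.8]])
W_out = np.array([[0.8, 0.0],
                  [0.0, 0.0],
                  [0.1, 0.9]])

S = np.abs(W_out @ W_in)       # (d, d) score matrix: S[i, j] scores edge j -> i
edges = (S > 0.3).astype(int)  # hypothetical threshold for a binary graph
print(edges)                   # rows index targets, columns index sources
```

Because the same readout applies to any model with an input and output projection, the benchmark's linear-bottleneck control arm is a drop-in swap for the Mamba backbone.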
Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work. Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions. We formalize continuous auditing as a $T$-round Stackelberg game between an auditor that commits to a temporal policy and an adaptive auditee, and identify a structural feature of any noise-aware static-auditor design: a cover regime in which coverage gaps and granularity gaps cannot be closed simultaneously. We make this formal as Observation 1 and show that two minimal extension policies, each derived from the observation, close the regime along orthogonal axes: a sample-size-aware static rule (Periodic-with-floor) closes the granularity-failure case, while a history-conditioned suspicion-escalation policy closes the coverage-failure case for the naive Drift strategy -- and neither closes both, exactly as the observation predicts; an audit-aware OffAuditDrift strategy that exploits Stackelberg commitment defeats both. To support empirical study we contribute a non-additive harm decomposition (welfare loss $W$, coverage loss $C$) that exposes how attrition shifts harm from the regulator-accountable surface to a regulator-invisible one; an initial library of five auditee strategies (Delay, Drift, Cherry-pick, Attrition, OffAuditDrift) and five auditor policies, calibrated to summary statistics from published audits of the DSA Transparency Database; and a reproducible simulator with a small, extensible Python interface.
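A simulator of this kind reduces to a small round loop over an auditor policy and an auditee strategy. The sketch below mirrors the Periodic-with-floor policy and the Drift strategy in spirit only; class names, thresholds, and the flagging rule are illustrative, not the paper's interface:

```python
class PeriodicWithFloor:
    """Auditor policy sketch: audit every `period` rounds, and always audit
    when the reported sample size falls below `floor` (illustrative names)."""
    def __init__(self, period=4, floor=50):
        self.period, self.floor = period, floor

    def audit(self, t, report):
        return t % self.period == 0 or report["n"] < self.floor

class Drift:
    """Auditee strategy sketch: bias the reported metric upward a little
    each round, staying inside a plausible noise envelope early on."""
    def report(self, t, true_metric):
        return {"metric": true_metric + 0.01 * t, "n": 100}

def run(auditor, auditee, T=12, true_metric=0.7, tol=0.03):
    """T-round loop: return the rounds where an audit fires and the
    report is out of tolerance."""
    flagged = []
    for t in range(T):
        rep = auditee.report(t, true_metric)
        if auditor.audit(t, rep) and abs(rep["metric"] - true_metric) > tol:
            flagged.append(t)
    return flagged

print(run(PeriodicWithFloor(), Drift()))  # -> [4, 8]
```

Even this toy loop shows the Stackelberg structure: an audit-aware strategy that knows `period` could confine its drift to the rounds the static policy never inspects.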
The impressive performance of large language models (LLMs) arises from their massive scale and heterogeneous module composition. However, this structural heterogeneity introduces additional optimization challenges. While adaptive optimizers such as Adam(W) provide per-parameter adaptivity, they do not explicitly account for module-level gradient heterogeneity, resulting in slower convergence, suboptimal performance, or training instability. Existing approaches typically rely on manually tuned module-specific learning rates or specific optimization strategies, which are computationally costly and difficult to generalize across tasks or models. To establish a more principled approach, we first analyze the noise-damping behavior of Adam in high-noise modules and introduce \textbf{Module-wise Learning Rate Scaling via SNR (MoLS)}. MoLS estimates module-level signal-to-noise ratios (SNRs) to scale Adam updates, allowing automated module-wise learning rate allocation without manual tuning. Empirical results across multiple LLM training benchmarks demonstrate that MoLS improves convergence speed and generalization, achieving performance comparable to carefully tuned module-specific learning rates, while remaining compatible with memory-efficient training algorithms.
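One way to realize SNR-driven module-wise scaling is to keep a short gradient history per module and rescale a base learning rate by each module's SNR relative to the mean. The estimator and normalization below are assumptions for illustration, not MoLS's exact formulas:

```python
import numpy as np

def module_snr(grad_history):
    """Module SNR: norm of the mean gradient over the norm-scale of its
    step-to-step variation (an assumed estimator, not the paper's)."""
    g = np.stack(grad_history)                 # (steps, n_params)
    signal = np.linalg.norm(g.mean(axis=0))
    noise = np.linalg.norm(g.std(axis=0)) + 1e-12
    return signal / noise

def mols_scales(histories, base_lr=1e-3):
    """Scale each module's learning rate by its SNR relative to the mean SNR,
    so the overall step size stays near base_lr."""
    snrs = {m: module_snr(h) for m, h in histories.items()}
    mean_snr = sum(snrs.values()) / len(snrs) + 1e-12
    return {m: base_lr * s / mean_snr for m, s in snrs.items()}

# Consistent gradients (high SNR) get a larger step than sign-flipping noise.
hist = {"attn": [np.ones(4)] * 8,
        "mlp": [(-1.0) ** t * np.ones(4) for t in range(8)]}
scales = mols_scales(hist)
print(scales["attn"] > scales["mlp"])  # -> True
```

The normalization against the mean SNR is what makes this an allocation rule rather than a second global learning-rate knob.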
Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavy, manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually-tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT, to reduce the per-step cost from $K+1$ backward passes to near the single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at $0.97\times$ the training speed of baseline training.
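For two rewards, the harmonizing QP has a closed form: the minimum-norm point in the convex hull of the per-reward gradients (the MGDA-style subproblem). A sketch with hypothetical gradients; MARBLE's actual solver handles $K$ rewards plus the amortization and EMA smoothing described above:

```python
import numpy as np

def min_norm_combine(g1, g2):
    """Closed-form two-reward case of the harmonizing QP: the minimum-norm
    point in conv{g1, g2}. The resulting direction has non-negative inner
    product with both per-reward gradients."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        return g1  # identical gradients: nothing to harmonize
    gamma = min(1.0, max(0.0, float((g2 - g1) @ g2) / denom))
    return gamma * g1 + (1.0 - gamma) * g2

# Hypothetical per-reward policy gradients that partially conflict.
g_aes = np.array([1.0, 0.0])   # e.g. an aesthetics reward
g_txt = np.array([-0.6, 1.0])  # e.g. a text-faithfulness reward
d = min_norm_combine(g_aes, g_txt)
print(float(d @ g_aes) > 0, float(d @ g_txt) > 0)  # -> True True
```

The non-negative alignment with every per-reward gradient is exactly the property a fixed weighted sum cannot guarantee once rollouts specialize.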
We study channel parameter estimation for multiuser orthogonal time frequency space (OTFS) systems in the delay-Doppler (DD) domain. To enable structured parametric estimation, we adopt a multi-user pilot cyclic prefix (MU-PCP) design, which multiplexes users along the Doppler dimension while preserving a separable exponential structure. This structure facilitates high-resolution estimation of fractional delay and Doppler parameters in the multiuser setting. Building on this framework, we extend weighted MUSIC (W-MUSIC) to multiuser OTFS, providing a computationally efficient approach with mild grid dependency, and develop a matrix pencil (MP)-based method that achieves fully grid-independent delay-Doppler parameter estimation. Numerical results demonstrate the effectiveness of the proposed methods and reveal a robustness-complexity tradeoff: W-MUSIC performs better at low SNR, while MP achieves higher estimation accuracy at moderate-to-high SNR with significantly lower computational complexity.
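The matrix-pencil step can be sketched in a few lines for the noiseless single-exponential case; real delay-Doppler estimation involves 2-D pencils over the MU-PCP structure, which this toy omits:

```python
import numpy as np

def matrix_pencil_poles(y, n_poles, L=None):
    """Estimate poles z_k of y[n] = sum_k a_k z_k^n by the matrix-pencil
    method (noiseless 1-D sketch; pencil parameter L is the usual knob)."""
    N = len(y)
    L = L or N // 2
    Y = np.array([y[i:i + L + 1] for i in range(N - L)])  # Hankel data matrix
    Y0, Y1 = Y[:, :-1], Y[:, 1:]                          # shift-related sub-matrices
    vals = np.linalg.eigvals(np.linalg.pinv(Y0) @ Y1)     # pencil eigenvalues
    return sorted(vals, key=abs, reverse=True)[:n_poles]

# One exponential at normalized frequency 0.12 (a stand-in for one user's
# Doppler tap); the pole angle recovers it with no grid at all.
f_true = 0.12
y = np.exp(2j * np.pi * f_true * np.arange(32))
z = matrix_pencil_poles(y, n_poles=1)[0]
print(round(float(np.angle(z)) / (2 * np.pi), 4))  # -> 0.12
```

The grid independence claimed for the MP method comes from reading the parameter off an eigenvalue rather than off a search grid, which is also why its complexity does not scale with grid resolution.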
Large Language Model ensembles improve reasoning accuracy up to a performance boundary; beyond it, additional deliberation degrades accuracy. Static-budget methods cannot detect this boundary. Extended-thinking architectures compound the problem: a wrong answer after 120k tokens is indistinguishable from a correct one. We introduce DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative ensemble deliberation that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. Two configurations are evaluated: a persistence heuristic and DASE-Spatial (arena half-width W). Three contributions. (1) DASE produces a commit-type routing partition complementary to verbalized single-call confidence. On a contamination-controlled corpus (AIME 2010-2023, N=254, 3 seeds), a 120B ensemble achieves a 24.8 pp routing gap (right-wall 97.1% vs. left-wall 73.6%), statistically equivalent to Opus 4.6 Standard verbalized confidence at a coverage-matched threshold (25.7 pp gap; bootstrap CI on difference: [-12.0, +10.3] pp, p=0.873). The two mechanisms disagree on 27% of routing assignments, establishing them as complements rather than substitutes; every DASE decision is accompanied by a machine-readable deliberation record. (2) Adaptive stopping, not injection bandwidth, drives accuracy gains. On AIME-300, bandwidth accounts for only 0.3 pp (ns); on GPQA-Extended, 4.4 pp bandwidth versus 5.0 pp stopping effect. DASE-Spatial ties Debate-Dense at its optimal budget using one-tenth the injection bandwidth and identifies that budget automatically; W=8 (65.0%) significantly outperforms W=4 (59.3%) on AIME-300 (adj p=0.0042). (3) Injection-based methods exhibit a retrospective accuracy-vs-inference inverted-U on both benchmarks; this pattern is hypothesis-generating for future work.
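A consensus-commit-or-frequency-fallback rule of this kind can be sketched as follows; the parameter names and the simple last-$k$ consensus test are placeholders for the paper's persistence and spatial variants:

```python
from collections import Counter

def dase_stop(answers, k_consensus=3, max_rounds=8):
    """Stopping-rule sketch: commit once the last k answers agree
    (genuine consensus); if the round budget is exhausted without
    consensus, fall back to the globally most frequent answer.
    Parameter names are illustrative, not the paper's."""
    tail = answers[-k_consensus:]
    if len(tail) == k_consensus and len(set(tail)) == 1:
        return True, tail[0], "consensus"
    if len(answers) >= max_rounds:
        return True, Counter(answers).most_common(1)[0][0], "frequency"
    return False, None, "continue"

print(dase_stop(["42", "42", "42"]))                        # early commit
print(dase_stop(["a", "b", "a", "c", "a", "b", "a", "b"]))  # budget fallback
```

The returned commit type ("consensus" vs. "frequency") is the raw material for the routing partition: each answer arrives tagged with how the deliberation ended.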