Abstract:Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
Abstract:Graph Signal Processing (GSP) and Graph Neural Networks (GNNs) rely fundamentally on the matrix representation of the underlying graph topology. This representation defines key operators such as the graph Fourier transform, spectral filtering, and convolution. Existing parameterized operator families interpolate only partial subsets of classical graph matrices, while broader formulations become non-compact when representing transition-type operators, limiting both theoretical analysis and stable learning. To address this issue, we propose the Generalized Linear Graph Representation (GLGR), denoted by $\mathbf{Q}_{α,l}$, as a compact two-parameter operator family defined on a bounded linear domain. GLGR unifies major classical operators together with transition-type operators without requiring asymptotic parameters. Theoretically, we show that $\mathbf{Q}_{α,l}$ admits a variational decomposition balancing local smoothness and global degree-weighted energy, derive spectral perturbation bounds, and establish graph-aware sufficient conditions for positive semi-definiteness. Building on this formulation, we develop Adaptive GLGR Convolution (AG-Conv), which makes the propagation operator itself learnable within end-to-end GNNs. Experiments on graph classification and node classification benchmarks show that GLGR improves both fixed-operator representation search and adaptive graph learning across multiple backbones.
Abstract:Graph signal denoising is a fundamental task in graph signal processing. While the node-oriented filtering approach enhances spatial adaptability, it suffers from spectral rigidity due to its reliance on the graph Fourier transform. Conversely, emerging fractional-domain transforms provide crucial spectral flexibility but are fundamentally limited by their globally shared filtering paradigm, failing to accommodate localized topological variations. To bridge this gap, this paper proposes a generalized node-oriented fractional filtering (NOFF) framework that seamlessly integrates localized spatial adaptability with proactive spectral modulation across various fractional transforms. However, straightforwardly assigning independent full-rank filters to all vertices incurs a prohibitive parameter space, leading to severe overfitting on random noise. To mitigate this, we introduce the low-rank NOFF (LRNOFF) architecture. By imposing a strict low-rank constraint, LRNOFF inherently acts as a powerful implicit regularizer, preventing noise memorization and ensuring the extraction of robust spectral bases. Furthermore, we develop an efficient computational implementation termed LRNOFF-Fast, which drastically reduces computational and memory overhead while preserving theoretical optimality. Experiments on real-world datasets demonstrate that the proposed framework achieves state-of-the-art performance.
Abstract:In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
Abstract:Graph signal processing extends spectral analysis to data supported on irregular domains. Existing fractional transforms for two-dimensional graph signals, including the two-dimensional graph fractional Fourier transform (GFRFT), typically impose a shared fractional order across dimensions, which limits adaptivity to heterogeneous spatiotemporal spectra. To address this limitation, we propose the two-dimensional graph bi-fractional Fourier transform, which assigns independent fractional orders to the factor graphs of a Cartesian product, enabling decoupled spectral control while preserving separability, unitarity, and invertibility. To further resolve the basis ambiguity in temporal fractional analysis, we develop a geodesic-coupled GFRFT by constructing a coupling path along the principal geodesic on the unitary manifold, thereby unifying graph-induced and discrete temporal bases with guaranteed unitarity and a closed-form inverse. Building on these transforms, we derive a differentiable Wiener-type filtering framework with a hybrid optimization strategy: the fractional orders are learned end-to-end from data, while the coupling parameter is fixed as a structural regularizer. Experiments on real-world time-varying graph datasets and dynamic image restoration tasks demonstrate consistent gains over state-of-the-art fractional transforms and competitive learning-based baselines.
Abstract:Accurate traffic prediction is a key task for intelligent transportation systems. The core difficulty lies in accurately modeling the complex spatial-temporal dependencies in traffic data. In recent years, improvements in network architecture have failed to bring significant performance enhancements, while embedding technology has shown great potential. However, existing embedding methods often ignore graph structure information or rely solely on static graph structures, making it difficult to effectively capture the dynamic associations between nodes that evolve over time. To address this issue, this letter proposes a novel dynamic weighted graph structure (DWGS) embedding method, which relies on a graph structure that can truly reflect the changes in the strength of dynamic associations between nodes over time. By first combining the DWGS embedding with the spatial-temporal adaptive embedding, as well as the temporal embedding and feature embedding, and then integrating attention and frequency-domain multi-layer perceptrons (MLPs), we design a novel traffic prediction model, termed the DWGS embedding integrated with attention and frequency-domain MLPs (DWAFM). Experiments on five real-world traffic datasets show that the DWAFM achieves better prediction performance than some state-of-the-arts.
Abstract:The graph fractional Fourier transform (GFRFT) generalizes the graph Fourier transform (GFT) but suffers from a significant computational bottleneck: determining the optimal transform order requires expensive eigendecomposition and matrix multiplication, leading to $O(N^3)$ complexity. To address this issue, we propose a fast GFRFT (FGFRFT) algorithm for unitary GFT matrices based on Fourier series approximation and an efficient caching strategy. FGFRFT reduces the complexity of generating transform matrices to $O(2LN^2)$ while preserving differentiability, thereby enabling adaptive order learning. We validate the algorithm through theoretical analysis, approximation accuracy tests, and order learning experiments. Furthermore, we demonstrate its practical efficacy for image and point cloud denoising and present the fractional specformer, which integrates the FGFRFT into the specformer architecture. This integration enables the model to overcome the limitations of a fixed GFT basis and learn optimal fractional orders for complex data. Experimental results confirm that the proposed algorithm significantly accelerates computation and achieves superior performance compared with the GFRFT.
Abstract:Large language models (LLMs) have achieved remarkable progress in code generation, yet their potential for software protection remains largely untapped. Reverse engineering continues to threaten software security, while traditional virtual machine protection (VMP) relies on rigid, rule-based transformations that are costly to design and vulnerable to automated analysis. In this work, we present the first protection-aware framework that learns robust representations of VMP-protected code. Our approach builds large-scale paired datasets of source code and normalized VM implementations, and introduces hierarchical dependency modeling at intra-, preceding-, and inter-instruction levels. We jointly optimize language modeling with functionality-aware and protection-aware contrastive objectives to capture both semantic equivalence and protection strength. To further assess resilience, we propose a protection effectiveness optimization task that quantifies and ranks different VM variants derived from the same source. Coupled with a two-stage continual pre-training and fine-tuning pipeline, our method enables models to generate, compare, and reason over protected code. Extensive experiments show that our framework significantly improves robustness across diverse protection levels, opening a new research direction for learning-based software defense. In this work, we present ShieldedCode, the first protection-aware framework that learns robust representations of VMP-protected code. Our method achieves 26.95% Pass@1 on L0 VM code generation compared to 22.58% for GPT-4o., and improves binary similarity detection Recall@1 by 10% over state of art methods like jTrans.
Abstract:Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks attempting to measure this matter rely on synthetic, Visual Question Answer templates or focus on perceptual video quality that is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation(VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
Abstract:Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can disturb the model from the generation of the desired action tokens in each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate, as well as exploring the performance upper boundaries of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieving relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attentions in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.