Abstract:Despite the rapid evolution of training paradigms, the decoder backbone of large vision--language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively ``get lost in attention'' rather than efficiently leveraging visual context.
Abstract:In high-conflict mixed-traffic scenarios involving human-driven and autonomous vehicles, most existing autonomous driving systems default to overly conservative behaviors, lack proactive interaction, and consequently suffer from limited public acceptance. To mitigate intent misunderstandings and decision failures, we present a Large Language Model based interactive decision-making framework that augments scene understanding and intent-aware interaction to jointly improve safety and efficiency. The approach uses Object-Process Methodology to semantically model complex multi-vehicle scenes, abstracting low-level perceptual data into objects, processes, and relations, thereby streamlining reasoning over latent causal structure. Building on this representation, the Large Language Model parses both explicit and implicit intents of surrounding agents and, under jointly enforced safety and efficiency constraints, selects candidate maneuvers. We further generate perturbed trajectory candidates via Monte Carlo sampling and evaluate them to obtain an optimized executable trajectory. To foster transparency and coordination with nearby road users, the final decision is translated by the Large Language Model into concise natural-language messages and broadcast through an external Human-Machine Interface, completing a closed loop from scene understanding to action to language. Experiments in a cluster driving simulator demonstrate that the proposed method outperforms traditional baselines across safety, comfort, and efficiency metrics, while a Turing-test-style evaluation indicates a high degree of human-likeness in decision making. Besides, these results suggest that coupling semantic scene abstraction with Large Language Model mediated intent reasoning and language-based eHMI communication offers a practical pathway toward interactive, trustworthy autonomous driving in dense mixed traffic.
Abstract:The rapid iteration of autonomous driving algorithms has created a growing demand for high-fidelity, replayable, and diagnosable testing data. However, many public datasets lack real vehicle dynamics feedback and closed-loop interaction with surrounding traffic and road infrastructure, limiting their ability to reflect deployment readiness. To address this gap, we present OVPD (OnSite Virtual-Physical Dataset), a virtual-physical fusion testing dataset released from the 2025 OnSite Autonomous Driving Challenge. Centered on real-vehicle-in-the-loop testing, OVPD integrates virtual background traffic with vehicle-infrastructure perception to build controllable and interactive closed-loop test environments on a proving ground. The dataset contains 20 testing clips from 20 teams over a scenario chain of 15 atomic scenarios, totaling nearly 3 hours of multi-modal data, including vehicle trajectories and states, control commands, and digital-twin-rendered surround-view observations. OVPD supports long-tail planning and decision-making validation, open-loop or platform-enabled closed-loop evaluation, and comprehensive assessment across safety, efficiency, comfort, rule compliance, and traffic impact, providing actionable evidence for failure diagnosis and iterative improvement. The dataset is available via: https://huggingface.co/datasets/Yuhang253820/Onsite_OPVD
Abstract:High-precision direction-of-arrival (DOA) estimation, as a key sensing capability for 6G-enabled applications such as autonomous driving and extended reality, is increasingly dependent on the effective exploitation of spatial degrees of freedom (DOFs). This paper integrates two frontier DOFs-oriented paradigms and proposes a fluid antenna-enabled hybrid analog-digital (FA-HAD) architecture, which features an extremely lightweight front-end configuration mechanism and efficient spatial DOFs exploitation. Within this architecture, a collaborative spatial-phase sampling strategy is first developed to enable real-time 2-D DOA estimation under compressive observations, and a single-source CRLB analysis is provided to quantify the achievable performance limit, offering quantitative guidance for accuracy-overhead trade-offs. Furthermore, an efficient virtual-array spatial covariance matrix reconstruction method is proposed to recover a physically meaningful covariance representation, thereby providing a covariance-domain interface that is directly reusable by a broad class of existing covariance-based array processing and array design techniques, which strengthens the scalability and transferability of the proposed architecture. Building upon the reconstructed SCM, a Jacobi-Anger expansion based dimension-reduced MUSIC estimator is further derived for arbitrary planar arrays with a favorable computational cost. Simulation results demonstrate that the proposed FA-HAD framework attains DOA accuracy close to fully digital systems while substantially reducing RF hardware complexity and training overhead.
Abstract:LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real-world deployments, particularly for on-device post-deployment adaptation. Real-world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on-device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post-deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods, while achieving up to a 13.8x speedup in retraining.
Abstract:With the rapid expansion of low-altitude economy (LAE) services and the growing demand for integrated sensing and communication (ISAC) in air-ground networks, reliable direction-of-arrival (DOA) estimation has become essential for both directional communication and sensing functions. DOA underpins beam alignment, spatial-reuse scheduling, and ISAC-critical tasks such as airspace situational awareness and multi-target monitoring. Hybrid analog-digital (HAD) architectures have emerged as a practical solution for large-aperture directional operation under stringent radio frequency (RF), analog-to-digital converter (ADC), and size, weight, and power (SWaP) constraints. However, HAD compresses antenna-domain observations through analog combining, fundamentally reshaping the measurement model and introducing new algorithmic and system-level challenges for DOA estimation. This article first reviews the principles and representative architectures of HAD, highlighting their advantages for scalable beam-centric and ISAC-oriented operation in LAE scenarios. We then provide a structured overview of HAD-enabled DOA estimation methodologies, including spatial covariance matrix (SCM) reconstruction, multi-combiner scan-based acquisition, and pilot-aided estimation, along with key design tradeoffs. Finally, we discuss open challenges and outline reliability-driven research directions toward robust, deployable HAD-enabled DOA solutions for practical ISAC-enabled low-altitude environments.
Abstract:Unlike fixed-position arrays with static observation entropy, the scalable fluid antenna system (S-FAS) can dynamically adjust its aperture to form different observation spaces with configuration-dependent entropy budgets. This reconfigurability requires an information-theoretic framework beyond traditional algebraic identifiability analysis. This paper establishes an observation entropy framework for S-FAS, which unifies the derivation of identifiability limits, the diagnosis of processing bottlenecks, and system design optimization. For an S-FAS with mutual coupling suppression, we derive a complete capacity hierarchy among compressed, extended, and jointly stacked configurations. The entropy framework reveals that sequential two-stage processing suffers from an information bottleneck that restricts achievable capacity, while the noise entropy ratio can be used to distinguish fundamental performance limits from algorithmic deficiencies. A joint MUSIC algorithm is proposed to approach the theoretical joint capacity bound. Extensive Monte Carlo simulations, validated by both algebraic and information-theoretic criteria, verify the derived capacity hierarchy and identifiability boundaries.
Abstract:Autonomous driving requires reliable reasoning over fine-grained 3D scene facts. Fine-grained question answering over multi-modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving-oriented large language model (LLM) methods still suffer from unreliable scene facts, hallucinations, opaque reasoning, and heavy reliance on task-specific training. We present KLDrive, the first knowledge-graph-augmented LLM reasoning framework for fine-grained question answering in autonomous driving. KLDrive addresses this problem through designing two tightly coupled components: an energy-based scene fact construction module that consolidates multi-source evidence into a reliable scene knowledge graph, and an LLM agent that performs fact-grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few-shot in-context exemplars, the framework adapts to diverse reasoning tasks without heavy task-specific fine-tuning. Experiments on two large-scale autonomous-driving QA benchmarks show that KLDrive outperforms prior state-of-the-art methods, achieving the best overall accuracy of 65.04% on NuScenes-QA and the best SPICE score of 42.45 on GVQA. On counting, the most challenging factual reasoning task, it improves over the strongest baseline by 46.01 percentage points, demonstrating substantially reduced hallucinations and the benefit of coupling reliable scene fact construction with explicit reasoning.
Abstract:Large-scale sparse multi-objective optimization problems (LSMOPs) are prevalent in real-world applications, where optimal solutions typically contain only a few nonzero variables, such as in adversarial attacks, critical node detection, and sparse signal reconstruction. Since the function evaluation of LSMOPs often relies on large-scale datasets involving a large number of decision variables, the search space becomes extremely high-dimensional. The coexistence of sparsity and high dimensionality greatly intensifies the conflict between exploration and exploitation, making it difficult for existing multi-objective evolutionary algorithms (MOEAs) to identify the critical nonzero decision variables within limited function evaluations. To address this challenge, this paper proposes an evolutionary algorithm with probabilistic annealing for large-scale sparse multi-objective optimization. The algorithm is driven by two probability vectors with distinct entropy characteristics: a convergence-oriented probability vector with relatively low entropy ensures stable exploitation, whereas an annealed probability vector with gradually decreasing entropy enables an adaptive transition from global exploration to local refinement. By integrating these complementary search dynamics, the proposed algorithm achieves a dynamic equilibrium between exploration and exploitation. Experimental results on benchmark problems and real-world applications demonstrate that the proposed algorithm outperforms state-of-the-art evolutionary algorithms in terms of both convergence and diversity.
Abstract:Chain-of-thought (CoT) improves reasoning reliability but increases token cost, motivating post-training compression of explicit reasoning traces. However, the shortest sufficient reasoning is not universal: it depends on difficulty, model capacity, and training state, making fixed length targets brittle. In practice, naive RL-based compression can also undesirably shorten the user-facing answer, because a single completion-level learning signal leaks across the think/answer boundary. We propose Difficulty-Scaled Segment-Wise GRPO (DSS-GRPO), which decomposes returns into think and answer components, computes group-relative advantages per segment, and routes them with hard token masks so compression updates act only on think while answer alignment acts only on answer. DSS-GRPO uses prompt-wise within-group shaping and difficulty-aware scaling to encourage concise reasoning without collapsing answer behavior.