Abstract:In a fluid antenna system (FAS), a single reconfigurable antenna can activate one of $N$ correlated ports to exploit spatial diversity. However, outage analysis is challenging because exact evaluation requires an $N$-dimensional multivariate integral, while existing closed-form approximations based on block-correlation models tend to underestimate the true outage probability. This paper shows that the spatial correlation matrix of a FAS with normalized linear aperture length $W$ has at most $K^{*}=2\lceil W\rceil+1$ significant eigenmodes, regardless of the number of deployed ports. This is a spatial counterpart of the Slepian-Landau-Pollak spectral concentration theorem and reveals that the spatial degrees of freedom are determined by aperture size rather than port count. Motivated by this result, we derive an \emph{equivalent degree of freedom} (EDoF) approximation, under which the outage probability can be expressed in closed form as that of selection combining over $K^{*}$ independent branches. We then propose a refined \emph{weighted independent modes} (WIM) approximation that incorporates eigenvalue-dependent branch weights $\{\beta_k\}$ and yields a product-form closed-form expression with improved accuracy at moderate signal-to-noise ratio (SNR). Both approximations achieve the exact diversity order, become asymptotically exact at high SNR, and, by Anderson's inequality, provably never underestimate the true outage probability. The proposed framework is further extended to obtain closed-form expressions for ergodic capacity and to characterize multi-user fluid antenna multiple access (FAMA) with explicit interference-limited outage floors. We also analyze two-dimensional planar FAS, for which the diversity order scales multiplicatively with the aperture dimensions.
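The EDoF idea above can be illustrated with a minimal sketch. It assumes i.i.d. Rayleigh fading with the same mean SNR on every branch, which the abstract does not state; the function names and the per-branch outage expression are illustrative assumptions, not the paper's exact closed form.

```python
import math


def significant_modes(W: float) -> int:
    """Number of significant eigenmodes, K* = 2*ceil(W) + 1, as stated in the abstract."""
    return 2 * math.ceil(W) + 1


def edof_outage(W: float, snr_threshold: float, mean_snr: float) -> float:
    """Outage of selection combining over K* independent branches.

    Assumption (not from the abstract): each branch is Rayleigh faded with
    the same mean SNR, so one branch is in outage with probability
    1 - exp(-snr_threshold / mean_snr). Selection combining is in outage
    only when every branch is, hence the power of K*.
    """
    k_star = significant_modes(W)
    per_branch = 1.0 - math.exp(-snr_threshold / mean_snr)
    return per_branch ** k_star
```

Under these assumptions, an aperture of $W=2$ gives $K^{*}=5$ branches, and increasing $W$ lowers the outage probability regardless of how many ports are deployed.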
Abstract:A fluid antenna system (FAS), which continuously repositions a single physical element across a deployment region $[0, D]$, breaks this limit by freeing antenna positions from the discrete grid entirely. This paper establishes the theoretical foundations of sparse FAS design for direction-of-arrival (DOA) estimation and shows that continuous position freedom unlocks three compounding advantages over the classical designs. \emph{First}, we derive a universal dual DOF bound and prove that FAS-optimized positions can approach it, growing the DOF linearly with $D/\lambda$, where $\lambda$ is the signal wavelength, rather than saturating at $O(N^2)$. \emph{Second}, the CRB scales as $O(1/D^{2L})$ for $L$ sources, a $(D/(N^2 d_0))^{2L}$ improvement over the best grid design, with $d_0 = \lambda/2$; the D-optimal positions admit a closed-form solution for a single source and an efficient Frank-Wolfe algorithm for multiple sources. \emph{Third}, we propose a two-stage FAS-MUSIC approach that combines coarray MUSIC disambiguation with full-aperture local maximum likelihood (ML) refinement to track the CRB, overcoming the grating-lobe ambiguity inherent in large-aperture non-uniform arrays. Robustness to minimum spacing constraints, mutual coupling, and finite position accuracy is also analyzed. Extensive simulations show that FAS-MUSIC achieves $17.5\times$ lower root mean squared error (RMSE) than uniform linear array (ULA) MUSIC and that FAS with $4$ antennas outperforms a minimum redundancy array (MRA) with $8$ antennas, gains that are unattainable by any grid-constrained design.
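To make the stated CRB scaling concrete, the short sketch below evaluates the ratio $(D/(N^2 d_0))^{2L}$ with $d_0 = \lambda/2$. It is only the scaling factor quoted in the abstract, with lengths measured in wavelengths; constants hidden by the $O(\cdot)$ notation are ignored, and the function name is hypothetical.

```python
def crb_scaling_gain(D_over_lambda: float, N: int, L: int) -> float:
    """Scaling-law CRB improvement (D / (N^2 * d0))^(2L) from the abstract.

    Lengths are expressed in wavelengths, so d0 = lambda/2 becomes 0.5.
    This is a scaling ratio only; constant factors are not modeled.
    """
    d0 = 0.5  # half-wavelength grid spacing, in wavelengths
    return (D_over_lambda / (N ** 2 * d0)) ** (2 * L)
```

For example, a region of $D = 100\lambda$ with $N = 4$ and $L = 1$ gives a base ratio of $100/8 = 12.5$ and hence a scaling gain of $12.5^2 = 156.25$.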
Abstract:Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.
Abstract:Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance on thousand-word-scale, open-ended writing is inadequately assessed by both traditional reference-based metrics and modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency that often arises when an LLM-as-a-judge aggregates all sub-features in text evaluation. ToW adopts a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, indicating that these scores cannot be improved simply by piling more information into the input.
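The notion of explicitly modeled aggregation weights can be sketched as a weighted score tree. The abstract does not specify the tree layout, the sub-feature names, or how weights are obtained, so the `Node` class, its fields, and the normalized weighted average below are all illustrative assumptions rather than ToW's actual procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A sub-feature in a hypothetical evaluation tree."""
    name: str
    weight: float = 1.0              # weight relative to sibling nodes
    score: Optional[float] = None    # set on leaves only
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return the node's score: a leaf score, or the weighted mean of its children."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    weighted_sum = sum(child.weight * aggregate(child) for child in node.children)
    return weighted_sum / total_weight
```

Because every weight is written down, the final score can be traced back through the tree, which is the property an implicit single-pass judgment lacks.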
Abstract:Embodied AI for Science (EAI4S) brings intelligence into the laboratory by uniting perception, reasoning, and robotic action to autonomously run experiments in the physical world. For the Global South, this shift is not about adopting advanced automation for its own sake, but about overcoming a fundamental capacity constraint: too few hands to run too many experiments. By enabling continuous, reliable experimentation under limits of manpower, power, and connectivity, EAI4S turns automation from a luxury into essential scientific infrastructure. The main obstacle, however, is not algorithmic capability. It is infrastructure. Open-source AI and foundation models have narrowed the knowledge gap, but EAI4S depends on dependable edge compute, energy-efficient hardware, modular robotic systems, localized data pipelines, and open standards. Without these foundations, even the most capable models remain trapped in well-resourced laboratories. This article argues for an infrastructure-first approach to EAI4S and outlines the practical requirements for deploying embodied intelligence at scale, offering a concrete pathway for Global South institutions to translate AI advances into sustained scientific capacity and competitive research output.
Abstract:Unlike fixed-position arrays with static observation entropy, the scalable fluid antenna system (S-FAS) can dynamically adjust its aperture to form different observation spaces with configuration-dependent entropy budgets. This reconfigurability requires an information-theoretic framework beyond traditional algebraic identifiability analysis. This paper establishes an observation entropy framework for S-FAS, which unifies the derivation of identifiability limits, the diagnosis of processing bottlenecks, and system design optimization. For an S-FAS with mutual coupling suppression, we derive a complete capacity hierarchy among compressed, extended, and jointly stacked configurations. The entropy framework reveals that sequential two-stage processing suffers from an information bottleneck that restricts achievable capacity, while the noise entropy ratio can be used to distinguish fundamental performance limits from algorithmic deficiencies. A joint MUSIC algorithm is proposed to approach the theoretical joint capacity bound. Extensive Monte Carlo simulations, validated by both algebraic and information-theoretic criteria, verify the derived capacity hierarchy and identifiability boundaries.
Abstract:Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
Abstract:Non-fixed flexible antenna architectures, such as fluid antenna system (FAS), movable antenna (MA), and pinching antenna, have garnered significant interest in recent years. Among them, rotatable antenna (RA) has emerged as a promising technology for enhancing wireless communication and sensing performance through flexible antenna orientation/boresight rotation. By enabling mechanical or electronic boresight adjustment without altering physical antenna positions, RA introduces additional spatial degrees of freedom (DoFs) beyond conventional beamforming. In this paper, we provide a comprehensive tutorial on the fundamentals, architectures, and applications of RA-empowered wireless networks. Specifically, we begin by reviewing the historical evolution of RA-related technologies and clarifying the distinctive role of RA among flexible antenna architectures. Then, we establish a unified mathematical framework for RA-enabled systems, including general antenna/array rotation models, as well as channel models that cover near- and far-field propagation characteristics, wideband frequency selectivity, and polarization effects. Building upon this foundation, we investigate antenna/array rotation optimization in representative communication and sensing scenarios. Furthermore, we examine RA channel estimation/acquisition strategies encompassing orientation scheduling mechanisms and signal processing methods that exploit multi-view channel observations. Beyond theoretical modeling and algorithmic design, we discuss practical RA configurations and deployment strategies. We also present recent RA prototypes and experimental results that validate the practical performance gains enabled by antenna rotation. Finally, we highlight promising extensions of RA to emerging wireless paradigms and outline open challenges to inspire future research.
Abstract:Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves the success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, at the cost of only a single-pass VLM reranking overhead.
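The draft-and-verify selection step can be sketched as follows. The abstract only says the metric is "perplexity-style", so this sketch assumes each candidate action chunk has been tokenized and scored by the VLM as a list of per-token log-probabilities, and that the candidate with the lowest perplexity is chosen; both function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one candidate: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("candidate has no tokens")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(candidate_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate action chunk with the lowest perplexity.

    Assumption: log-probabilities for all candidates come from one batched
    forward pass of the VLM; that pass itself is not modeled here.
    """
    scores = [perplexity(lp) for lp in candidate_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)
```

Normalizing by token count keeps candidates of different tokenized lengths comparable, which is one reason to prefer perplexity over a raw summed log-likelihood.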
Abstract:Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: https://github.com/GaoScience/point-to-mask.