Abstract:Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires \emph{no} annotation and \emph{no} reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
Abstract:The joint design of analog beamforming and power allocation is investigated for a single radio-frequency chain multiuser time-division multiple access system under a max-min signal-to-noise ratio (SNR) criterion. A hardware-efficient phased-array architecture is considered, where the beamforming vector is shared by all users and is subject to constant-modulus constraints. For any fixed analog beamformer, the optimal power allocation is first derived in closed form, by which the original problem is reduced to phase-shift optimization only. Then, globally optimal branch-and-bound (BB) algorithms are developed for discrete and continuous phase shifts. Numerical results show that the proposed BB algorithms achieve the global optimum and provide reliable benchmarks for evaluating the performance gap of low-complexity alternating-optimization methods.
Abstract:3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.
Abstract:This paper studies an unmanned aerial vehicle (UAV) position and attitude sensing problem, where a base station equipped with an antenna array transmits signals to a predetermined potential flight region of a flying UAV, and exploits the reflected echoes for wireless imaging. The UAV is represented by an electromagnetic point cloud in this region that contains its spatial information and electromagnetic properties (EPs), enabling the unified extraction of UAV position, attitude, and shape from the reconstructed point cloud. To accomplish this task, we develop a generative UAV sensing approach. The position and signal-to-noise ratio embedding are adopted to assist the UAV features extraction from the estimated sensing channel under the measurement noise and channel variations. Guided by the obtained features, a conditional diffusion model is utilized to generate the point cloud. The simulation results demonstrate that the reconstructed point clouds via the proposed approach present higher fidelity compared to the competing schemes, thereby enabling a more accurate capture of the UAV attitude and shape information, as well as a more precise position estimation.
Abstract:Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on "mental navigation": the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose \textbf{NavMind}, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.
Abstract:This paper investigates secure Directional Modulation (DM) design enhanced by a rotatable active Reconfigurable Intelligent Surface (RIS). In conventional RIS-assisted DM networks, the security performance gain is limited due to the multiplicative path loss introduced by the RIS reflection path. To address this challenge, a Secrecy Rate (SR) maximization problem is formulated, subject to constraints including the eavesdropper's Direction Of Arrival (DOA) estimation performance, transmit power, rotatable range, and maximum reflection amplitude of the RIS elements. To solve this non-convex optimization problem, three algorithms are proposed: a multi-stream null-space projection and leakage-based method, an enhanced leakage-based method, and an optimization scheme based on the Distributed Soft Actor-Critic with Three refinements (DSAC-T). Simulation results validate the effectiveness of the proposed algorithms. A performance trade-off is observed between eavesdropper's DOA estimation accuracy and the achievable SR. The security enhancement provided by the RIS is more significant in systems equipped with a small number of antennas. By optimizing the orientation of the RIS, a 52.6\% improvement in SR performance can be achieved.
Abstract:Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize across varied object geometries and placements. We present DexMulti, a sample-efficient approach for real-world dexterous multi-task manipulation that decomposes demonstrations into object-centric skills with well-defined temporal boundaries. Rather than learning monolithic policies, our method retrieves demonstrated skills based on current object geometry, aligns them to the observed object state using an uncertainty-aware estimator that tracks centroid and yaw, and executes them via a retrieve-align-execute paradigm. We evaluate on three multi-stage tasks requiring concurrent manipulation (Grasp + Pull, Grasp + Open, and Grasp + Grasp) across two dexterous hands (Allegro and LEAP) in over 1,000 real-world trials. Our approach achieves an average success rate of 66% on training objects with only 3-4 demonstrations per object, outperforming diffusion policy baselines by 2-3x while requiring far fewer demonstrations. Results demonstrate robust generalization to held-out objects and spatial variations up to +/-25 cm.
Abstract:Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.
Abstract:A segmented waveguide-enabled pinching-antenna system (SWAN)-based tri-hybrid beamforming architecture is proposed for uplink multi-user MIMO communications, which jointly optimizes digital, analog, and pinching beamforming. Both fully-connected (FC) and partially-connected (PC) structures between RF chains and segment feed points are considered. For the FC architecture, tri-hybrid beamforming is optimized using the weighted minimum mean-square error (WMMSE) and zero-forcing (ZF) approaches. Specifically, the digital, analog, and pinching beamforming components are optimized via a closed-form solution, Riemannian manifold optimization, and a Gauss-Seidel search, respectively. For the PC architecture, an interleaved topology tailored to the SWAN receiver is proposed, in which segments assigned to each RF chain (sub-array) are interleaved with those from other sub-arrays. Based on this structure, a WMMSE-based tri-hybrid design is developed, in which the Riemannian-manifold update used for the FC structure is replaced by element-wise phase calibration to exploit sparsity in analog beamforming. To gain insight into the performance of the proposed system, the rate-scaling laws with respect to the number of segments are derived for both the FC and PC structures. Our results demonstrate that: i)~SWAN with the proposed tri-hybrid beamforming consistently outperforms conventional hybrid beamforming and conventional pinching-antenna systems with pinching beamforming for both the FC and PC structures; and ii)~the PC structure can strike a good balance between sum rate and energy consumption when the number of segments is large; and iii) the achievable rate does not necessarily increase with the number of segments.
Abstract:Deep learning-based Personal Sound Zones (PSZs) rely on simulated acoustic transfer functions (ATFs) for training, yet idealized point-source models exhibit large sim-to-real gaps. While physically informed components improve generalization, individual contributions remain unclear. This paper presents a controlled ablation study on a head-pose-conditioned binaural PSZ renderer using the Binaural Spatial Audio Neural Network (BSANN). We progressively enrich simulated ATFs with three components: (i) anechoically measured frequency responses of the particular loudspeakers(FR), (ii) analytic circular-piston directivity (DIR), and (iii) rigid-sphere head-related transfer functions (RS-HRTF). Four configurations are evaluated via in-situ measurements with two dummy heads. Performance metrics include inter-zone isolation (IZI), inter-program interference (IPI), and crosstalk cancellation (XTC) over 100-20000 Hz. Results show FR provides spectral calibration, yielding modest XTC improvements and reduced inter-listener IPI imbalance. DIR delivers the most consistent sound-zone separation gains (10.05 dB average IZI/IPI). RS-HRTF dominates binaural separation, boosting XTC by +2.38/+2.89 dB (average 4.51 to 7.91 dB), primarily above 2 kHz, while introducing mild listener-dependent IZI/IPI shifts. These findings guide prioritization of measurements and models when constructing training ATFs under limited budgets.