Abstract: Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize across varied object geometries and placements. We present DexMulti, a sample-efficient approach for real-world dexterous multi-task manipulation that decomposes demonstrations into object-centric skills with well-defined temporal boundaries. Rather than learning monolithic policies, our method retrieves demonstrated skills based on current object geometry, aligns them to the observed object state using an uncertainty-aware estimator that tracks centroid and yaw, and executes them via a retrieve-align-execute paradigm. We evaluate on three multi-stage tasks requiring concurrent manipulation (Grasp + Pull, Grasp + Open, and Grasp + Grasp) across two dexterous hands (Allegro and LEAP) in over 1,000 real-world trials. Our approach achieves an average success rate of 66% on training objects with only 3-4 demonstrations per object, outperforming diffusion policy baselines by 2-3x while requiring far fewer demonstrations. Results demonstrate robust generalization to held-out objects and spatial variations up to +/-25 cm.
Abstract: Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks demonstrate the efficacy of StructAttack.
Abstract: A segmented waveguide-enabled pinching-antenna system (SWAN)-based tri-hybrid beamforming architecture is proposed for uplink multi-user MIMO communications, which jointly optimizes digital, analog, and pinching beamforming. Both fully-connected (FC) and partially-connected (PC) structures between RF chains and segment feed points are considered. For the FC architecture, tri-hybrid beamforming is optimized using the weighted minimum mean-square error (WMMSE) and zero-forcing (ZF) approaches. Specifically, the digital, analog, and pinching beamforming components are optimized via a closed-form solution, Riemannian manifold optimization, and a Gauss-Seidel search, respectively. For the PC architecture, an interleaved topology tailored to the SWAN receiver is proposed, in which segments assigned to each RF chain (sub-array) are interleaved with those from other sub-arrays. Based on this structure, a WMMSE-based tri-hybrid design is developed, in which the Riemannian-manifold update used for the FC structure is replaced by element-wise phase calibration to exploit sparsity in analog beamforming. To gain insight into the performance of the proposed system, the rate-scaling laws with respect to the number of segments are derived for both the FC and PC structures. Our results demonstrate that: i) SWAN with the proposed tri-hybrid beamforming consistently outperforms conventional hybrid beamforming and conventional pinching-antenna systems with pinching beamforming for both the FC and PC structures; ii) the PC structure can strike a good balance between sum rate and energy consumption when the number of segments is large; and iii) the achievable rate does not necessarily increase with the number of segments.
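As a toy illustration of the zero-forcing (ZF) component mentioned above, the sketch below shows plain ZF uplink combining for a known effective channel matrix; the analog and pinching stages, the WMMSE alternative, and all names here (`H`, `W`, dimensions) are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Toy ZF uplink combining: K single-antenna users, N-element receiver.
# The ZF combiner W = (H^H H)^{-1} H^H satisfies W @ H = I_K, so each
# user's stream is recovered free of inter-user interference (noiseless case).
rng = np.random.default_rng(0)
N, K = 8, 3
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)

W = np.linalg.inv(H.conj().T @ H) @ H.conj().T  # ZF combiner

x = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # transmitted symbols
y = H @ x                                                  # noiseless received signal
x_hat = W @ y                                              # ZF estimate
print(np.allclose(x_hat, x))  # → True: interference fully removed
```

In the tri-hybrid setting this digital combiner would act on the output of the analog and pinching stages rather than on the raw channel.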
Abstract: Deep learning-based Personal Sound Zones (PSZs) rely on simulated acoustic transfer functions (ATFs) for training, yet idealized point-source models exhibit large sim-to-real gaps. While physically informed components improve generalization, individual contributions remain unclear. This paper presents a controlled ablation study on a head-pose-conditioned binaural PSZ renderer using the Binaural Spatial Audio Neural Network (BSANN). We progressively enrich simulated ATFs with three components: (i) anechoically measured frequency responses of the particular loudspeakers (FR), (ii) analytic circular-piston directivity (DIR), and (iii) rigid-sphere head-related transfer functions (RS-HRTF). Four configurations are evaluated via in-situ measurements with two dummy heads. Performance metrics include inter-zone isolation (IZI), inter-program interference (IPI), and crosstalk cancellation (XTC) over 100-20000 Hz. Results show FR provides spectral calibration, yielding modest XTC improvements and reduced inter-listener IPI imbalance. DIR delivers the most consistent sound-zone separation gains (10.05 dB average IZI/IPI). RS-HRTF dominates binaural separation, boosting XTC by +2.38/+2.89 dB (average 4.51 to 7.91 dB), primarily above 2 kHz, while introducing mild listener-dependent IZI/IPI shifts. These findings guide prioritization of measurements and models when constructing training ATFs under limited budgets.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness, such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector-field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
Abstract: Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on FakeSV, FakeTT, and FakeVV demonstrate FactGuard's state-of-the-art performance and validate its excellent robustness and generalization capacity.
Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87x speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54x speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
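The dynamic-programming selection of key timesteps can be sketched as follows. The cost matrix here is a hypothetical stand-in for the Path-Aware Cost Tensor (which DPCache estimates from a calibration set), and `plan_key_steps` and its parameters are illustrative names, not the released API.

```python
import math

def plan_key_steps(cost, T, B):
    """Pick at most B key steps from {0, ..., T-1} (step 0 is always key)
    minimizing the summed skip cost, where cost[i][j] is the error of
    caching from key step i until step j."""
    INF = math.inf
    # dp[b][t] = min cost to reach step t with b key steps used so far.
    dp = [[INF] * T for _ in range(B + 1)]
    parent = [[None] * T for _ in range(B + 1)]
    dp[1][0] = 0.0
    for b in range(1, B):
        for i in range(T):
            if dp[b][i] == INF:
                continue
            for j in range(i + 1, T):
                c = dp[b][i] + cost[i][j]
                if c < dp[b + 1][j]:
                    dp[b + 1][j] = c
                    parent[b + 1][j] = i
    # Backtrack the cheapest plan whose last key step is the final step T-1.
    best_b = min(range(2, B + 1), key=lambda b: dp[b][T - 1])
    path, t, b = [T - 1], T - 1, best_b
    while parent[b][t] is not None:
        t = parent[b][t]
        b -= 1
        path.append(t)
    return dp[best_b][T - 1], sorted(path)

# Tiny example: 5 steps; skipping n steps at once costs n - 1 (hypothetical).
T = 5
cost = [[abs(j - i - 1) * 1.0 for j in range(T)] for i in range(T)]
total, keys = plan_key_steps(cost, T, B=5)
print(total, keys)  # → 0.0 [0, 1, 2, 3, 4]: with a full budget, every step is key
```

Shrinking `B` forces the planner to trade extra skip cost for fewer full computations, which is the budgeted regime an accelerated sampler operates in.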
Abstract: A signal processing-based framework is proposed for detecting random segment failures in segmented waveguide-enabled pinching-antenna systems. To decouple the passively combined uplink signal and to provide per-segment observability, tagged pilots are employed. A simple tag is attached to each segment and is used to apply a known low-rate modulation at the segment feed, which assigns a unique signature to each segment. Based on the tagged-pilot model, a low-complexity per-segment maximum-likelihood (ML) detector is developed for the case in which the pilot length is no smaller than the number of segments. For the case in which the pilot length is smaller than the number of segments, sparsity in the failure-indicator vector is exploited and a compressive sensing-based detector is adopted. Numerical results show that the per-segment detector approaches joint ML performance, while the compressive sensing-based detector achieves reliable detection with a short pilot and can outperform baselines that require much longer pilots.
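A minimal sketch of the tagged-pilot idea, under simplifying assumptions of my own (orthogonal real-valued signatures, unit segment gains, failures modeled as zeroed contributions); the paper's actual signal model and ML detector are more general.

```python
import numpy as np

# Tagged pilots: each segment modulates the shared pilot with a known
# signature, so the passively combined uplink observation is
#   y = S @ a + noise,
# where column k of S is segment k's signature and a[k] is 1 (working)
# or 0 (failed). With orthonormal signatures and pilot length L >= K,
# a per-segment matched filter decouples the segments.
rng = np.random.default_rng(1)
L, K = 16, 8                                        # pilot length, segments
S = np.linalg.qr(rng.standard_normal((L, K)))[0]    # orthonormal signatures
a_true = np.ones(K)
a_true[[2, 5]] = 0.0                                # segments 2 and 5 failed

y = S @ a_true + 0.05 * rng.standard_normal(L)      # noisy observation

stats = S.T @ y                       # per-segment matched-filter statistics
a_hat = (stats > 0.5).astype(float)   # threshold halfway between 0 and 1
print(a_hat)
```

In the compressed regime (L smaller than K) the signatures can no longer be orthogonal, which is where the sparsity of the failure-indicator vector and a compressive sensing recovery step take over.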
Abstract: The pinching-antenna system (PASS) enables wireless channel reconfiguration through optimized placement of pinching antennas along dielectric waveguides. In this article, a unified analytical framework is proposed to characterize the maintainability of PASS. Within this framework, random waveguide failures and repairs are modeled by treating the waveguide lifetime and repair time as exponentially distributed random variables, which are characterized by the failure rate and the repair rate, respectively. The operational state of the waveguide is described by a two-state continuous-time Markov chain, for which the transition probabilities and steady-state probabilities of the waveguide being working or failed are analyzed. By incorporating the randomness of the waveguide operational state into the transmission rate, system maintainability is characterized using the probability of non-zero rate (PNR) and outage probability (OP). The proposed framework is applied to both a conventional PASS employing a single long waveguide and a segmented waveguide-enabled pinching-antenna system (SWAN) composed of multiple short waveguide segments under two operational protocols: segment switching (SS) and segment aggregation (SA). Closed-form expressions for the PNR and OP are derived for both architectures, and the corresponding scaling laws are analyzed with respect to the service-region size and the number of segments. It is proven that both SS-based and SA-based SWAN achieve higher PNR and lower OP than conventional PASS, which confirms the maintainability advantage of segmentation. Numerical results demonstrate that: i) the maintainability gain of SWAN over conventional PASS increases with the number of segments, and ii) SA provides stronger maintainability than SS.
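The two-state failure/repair chain described above is the standard availability model, which can be checked numerically; the rates below are illustrative values, not figures from the paper.

```python
import math

# Two-state continuous-time Markov chain for one waveguide: exponential
# lifetime with failure rate lam, exponential repair time with repair rate mu.
# Starting from the working state, the probability of being operational is
#   P_up(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) * t),
# which tends to the steady-state availability mu/(lam+mu) as t grows.
lam, mu = 0.2, 1.0  # illustrative failure and repair rates (per unit time)

def p_up(t, lam, mu):
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

# Cross-check the closed form by forward-Euler integration of the
# Kolmogorov equation dP/dt = -lam * P + mu * (1 - P), with P(0) = 1.
P, dt = 1.0, 1e-4
for _ in range(int(5 / dt)):  # integrate up to t = 5
    P += dt * (-lam * P + mu * (1 - P))

print(abs(P - p_up(5.0, lam, mu)))  # small numerical discrepancy
```

The segmented architectures then combine several such independent two-state chains, which is what drives the derived PNR and OP gains.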
Abstract: In this work, we propose a deep unified (DU) encoder that embeds source information in a codeword that contains sufficient redundancy to handle both channel and source uncertainties, without enforcing an explicit pilot-data separation. At the receiver, we design a parallel flow-matching (PFM) decoder that leverages flow-based generative priors to jointly estimate the channel and the source, yielding much more efficient inference than the existing diffusion-based approaches. To benchmark performance limits, we derive the Bayesian Cramér-Rao bound (BCRB) for the joint channel and source estimation problem. Extensive simulations over block-fading MIMO-OFDM channels demonstrate that the proposed DU-PFM approach drastically outperforms the state-of-the-art methods in both channel estimation accuracy and source reconstruction quality.