Abstract:Physical layer key generation (PLKG) has emerged as a promising solution for achieving highly secure and low-latency key distribution, offering information-theoretic security that is inherently resilient to quantum attacks. However, simultaneously ensuring a high data transmission rate and a high secret key generation rate under eavesdropping attacks remains a major challenge. In time-division duplex (TDD) systems with multiple antennas, we derive closed-form expressions for both rates by modeling the legitimate channel as a time-correlated autoregressive (AR) process. This formulation leads to a highly nonconvex and time-coupled optimization problem, rendering traditional optimization methods ineffective. To address this issue, we propose a multi-agent soft actor-critic (SAC) framework equipped with a long short-term memory (LSTM) adversary prediction module to cope with the partial observability of the eavesdropper's mode. Simulation results demonstrate that the proposed approach achieves superior performance compared with other benchmark algorithms, while effectively balancing the trade-off between secret key generation rate and data transmission rate. The results also confirm the robustness of the proposed framework against intelligent eavesdropping and partial observation uncertainty.
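The abstract above does not state the order or parameters of its AR channel model. As a rough illustration only, the sketch below assumes a first-order (Gauss-Markov) process for a single complex channel coefficient, h[t] = rho * h[t-1] + sqrt(1 - rho^2) * w[t], with unit-variance complex Gaussian innovations. The function name `ar1_channel`, the correlation parameter `rho`, and the `seed` argument are hypothetical and are not taken from the paper.

```python
import math
import random


def ar1_channel(num_steps, rho, seed=0):
    """Simulate one complex channel tap as a first-order AR process.

    Assumption (not from the paper): h[t] = rho*h[t-1] + sqrt(1-rho^2)*w[t],
    where w[t] is circularly symmetric complex Gaussian with unit variance.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")

    rng = random.Random(seed)
    component_std = math.sqrt(0.5)  # real and imaginary parts each carry half the power

    def complex_gaussian():
        return complex(rng.gauss(0.0, component_std), rng.gauss(0.0, component_std))

    # The scaling keeps the channel power stationary (unit variance at every step).
    innovation_scale = math.sqrt(1.0 - rho ** 2)
    samples = [complex_gaussian()]
    for _ in range(num_steps - 1):
        samples.append(rho * samples[-1] + innovation_scale * complex_gaussian())
    return samples
```

With `rho` close to 1 consecutive channel samples are strongly correlated, and with `rho = 0` they are independent, which is the time coupling that the abstract's optimization problem inherits.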
Abstract:Efficient 3D LiDAR point cloud compression (LPCC) and streaming are critical for edge server-assisted robotic systems, enabling real-time communication with compact data representations. A widely adopted approach represents LiDAR point clouds as range images, enabling the direct use of mature image and video compression codecs. However, because these codecs are designed with human visual perception in mind, they often compromise geometric details, which degrades the performance of downstream robotic tasks such as mapping and object detection. Furthermore, rate-distortion optimization (RDO)-based rate control remains largely underexplored for range image compression (RIC) under dynamic bandwidth conditions. To address these limitations, we propose D-Compress, a new detail-preserving and fast RIC framework tailored for real-time streaming. D-Compress integrates both intra- and inter-frame prediction with an adaptive discrete wavelet transform approach for precise residual compression. Additionally, we introduce a new RDO-based rate control algorithm for RIC through new rate-distortion modeling. Extensive evaluations on various datasets demonstrate the superiority of D-Compress, which outperforms state-of-the-art (SOTA) compression methods in both geometric accuracy and downstream task performance, particularly at compression ratios exceeding 100x, while maintaining real-time execution on resource-constrained hardware. Moreover, evaluations under dynamic bandwidth conditions validate the robustness of its rate control mechanism.
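The abstract above does not say which wavelet D-Compress uses or how the transform adapts. Purely as an illustration of a wavelet step applied to prediction residuals, the sketch below uses a one-level reversible integer Haar transform (the S-transform) on one row of integer residuals. This is a swapped-in textbook technique, not the paper's method, and the names `haar_forward` and `haar_inverse` are hypothetical.

```python
def haar_forward(residual):
    """One-level integer Haar (S-transform) of an even-length list of ints.

    Returns (approx, detail). Both are integers, so the step is lossless.
    """
    if len(residual) % 2 != 0:
        raise ValueError("residual length must be even")
    approx, detail = [], []
    for i in range(0, len(residual), 2):
        a, b = residual[i], residual[i + 1]
        d = a - b          # detail: difference of the pair
        s = b + (d >> 1)   # approximation: floor of the pair's mean
        approx.append(s)
        detail.append(d)
    return approx, detail


def haar_inverse(approx, detail):
    """Exactly invert haar_forward."""
    out = []
    for s, d in zip(approx, detail):
        b = s - (d >> 1)
        a = b + d
        out.extend((a, b))
    return out
```

Small residuals produce detail coefficients near zero, which is what makes a transform of this kind useful ahead of entropy coding.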
Abstract:This paper introduces CSI-RFF, a new framework that leverages micro-signals embedded within Channel State Information (CSI) curves to realize Radio-Frequency Fingerprinting of commodity off-the-shelf (COTS) WiFi devices for open-set authentication. The micro-signals that serve as RF fingerprints are termed "micro-CSI". Through experimentation, we have found that the presence of micro-CSI can primarily be attributed to imperfections in the RF circuitry. Furthermore, this characteristic signal is detectable in WiFi 4/5/6 network interface cards (NICs). We have conducted further experiments to determine the most effective CSI collection configurations to stabilize micro-CSI. Yet, extracting micro-CSI for authentication purposes poses a significant challenge. This complexity arises from the fact that CSI measurements inherently include both micro-CSI and the distortions introduced by wireless channels. These two elements are intricately intertwined, making their separation non-trivial. To tackle this challenge, we have developed a signal space-based extraction technique for line-of-sight (LoS) scenarios, which can effectively separate the distortions caused by wireless channels and micro-CSI. Over the course of our comprehensive CSI data collection period extending beyond one year, we found that the extracted micro-CSI displays unique characteristics specific to each WiFi device and remains invariant over time. This establishes micro-CSI as a suitable candidate for device fingerprinting. Finally, we conduct a case study focusing on area access control for mobile robots. Our experimental results demonstrate that the micro-CSI-based authentication algorithm can achieve an average attack detection rate close to 99% with a false alarm rate of 0% in both static and mobile conditions when using 20 CSI measurements to construct one fingerprint.
Abstract:Transparent liquid manipulation in robotic pouring remains challenging for perception systems: specular/refraction effects and lighting variability degrade visual cues, undermining reliable level estimation. To address this challenge, we introduce RadarEye, a real-time mmWave radar signal processing pipeline for robust liquid level estimation and tracking during the whole pouring process. RadarEye integrates (i) a high-resolution range-angle beamforming module for liquid level sensing and (ii) a physics-informed mid-pour tracker that suppresses multipath to maintain lock on the liquid surface despite stream-induced clutter and source container reflections. The pipeline delivers sub-millisecond latency. In real-robot water-pouring experiments, RadarEye achieves a 0.35 cm median absolute height error at 0.62 ms per update, substantially outperforming vision and ultrasound baselines.
Abstract:Orthogonal time frequency space (OTFS) modulation has demonstrated significant advantages in high-mobility scenarios in future 6G networks. However, existing channel estimation methods often overlook the structured sparsity and clustering characteristics inherent in realistic clustered delay line (CDL) channels, leading to degraded performance in practical systems. To address this issue, we propose a novel nonparametric Bayesian learning (NPBL) framework for OTFS channel estimation. Specifically, a stick-breaking process is introduced to automatically infer the number of multipath components and assign each path to its corresponding cluster. The channel coefficients within each cluster are modeled by a Gaussian mixture distribution to capture complex fading statistics. Furthermore, an effective pruning criterion is designed to eliminate spurious multipath components, thereby enhancing estimation accuracy and reducing computational complexity. Simulation results demonstrate that the proposed method achieves superior performance in terms of normalized mean squared error compared to existing methods.
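The abstract above relies on a stick-breaking process but gives no construction details. The sketch below shows only the standard truncated stick-breaking draw of mixture weights, with v_k ~ Beta(1, alpha) and pi_k = v_k * prod_{j<k} (1 - v_j). It is not the paper's inference procedure. The function name `stick_breaking_weights`, the concentration parameter `alpha`, and the truncation level are illustrative assumptions.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Draw truncated stick-breaking weights (a sketch, not the paper's algorithm).

    Each step breaks off a Beta(1, alpha) fraction of the remaining stick.
    The final weight absorbs whatever is left so the weights sum to one.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if truncation < 1:
        raise ValueError("truncation must be at least 1")

    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # leftover mass goes to the last component
    return weights
```

A small `alpha` concentrates mass on the first few components, so most weights are negligible. That is the property that lets a model of this kind leave the number of active clusters to be inferred rather than fixed in advance.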
Abstract:Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although numerous large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding for any object cognition according to already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach can engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of "guess what". Here, MS-VLS leverages multi-scale visual-language soft-alignment for VLM to generate snippets from the results of class-agnostic object detection, while CCP can form the concept of flow referring to MS-VLS and then make LLM understand snippets for OVOD. Finally, extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM can achieve superior OVOD performance compared to state-of-the-art methods without any training step.
Abstract:Large Language Models (LLMs) have become indispensable across various domains, but this comes at the cost of substantial computational and memory resources. Model pruning addresses this by removing redundant components from models. In particular, block pruning can achieve significant compression and inference acceleration. However, existing block pruning methods are often unstable and struggle to attain globally optimal solutions. In this paper, we propose MI-PRUN, a mutual-information-based pruning method for LLMs. Specifically, we leverage mutual information to identify redundant blocks by evaluating transitions in hidden states. Additionally, we incorporate the Data Processing Inequality (DPI) to reveal the relationship between the importance of entire contiguous blocks and that of individual blocks. Moreover, we develop the Fast-Block-Select algorithm, which iteratively updates block combinations to achieve a globally optimal solution while significantly improving efficiency. Extensive experiments across various models and datasets demonstrate the stability and effectiveness of our method.




Abstract:Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents. This framework encompasses the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafetyBench, a meticulously crafted safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt-decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state-of-the-art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance.
Abstract:Recent advances in the visual-language area have developed natural multi-modal large language models (MLLMs) for spatial reasoning through visual prompting. However, because remote sensing (RS) imagery contains abundant geospatial information that differs from natural images, it is challenging to effectively adapt natural spatial models to the RS domain. Moreover, current RS MLLMs are limited by overly narrow interpretation levels and interaction modes, hindering their applicability in real-world scenarios. To address these challenges, a spatial MLLM named EarthGPT-X is proposed, enabling a comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Moreover, EarthGPT-X unifies two types of critical spatial tasks (i.e., referring and grounding) into a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is the multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Subsequently, a cross-domain one-stage fusion training strategy is proposed, utilizing the large language model (LLM) as a unified interface for multi-source multi-task learning. Furthermore, by incorporating a pixel perception module, the referring and grounding tasks are seamlessly unified within a single framework. In addition, the experiments conducted demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks and its impressive flexibility in multi-modal interaction, revealing significant advancements of MLLM in the RS field.




Abstract:Transparent objects are prevalent in everyday environments, but their distinct physical properties pose significant challenges for camera-guided robotic arms. Current research is mainly dependent on camera-only approaches, which often falter in suboptimal conditions, such as low-light environments. In response to this challenge, we present FuseGrasp, the first radar-camera fusion system tailored to enhance transparent object manipulation. FuseGrasp exploits the weak penetrating property of millimeter-wave (mmWave) signals, which causes transparent materials to appear opaque, and combines it with the precise motion control of a robotic arm to acquire high-quality mmWave radar images of transparent objects. The system employs a carefully designed deep neural network to fuse radar and camera imagery, thereby improving depth completion and elevating the success rate of object grasping. Nevertheless, training FuseGrasp effectively is non-trivial, due to limited radar image datasets for transparent objects. We address this issue with an effective two-stage training approach: we first pre-train FuseGrasp on a large public RGB-D dataset of transparent objects, then fine-tune it on a self-built small RGB-D-Radar dataset. Furthermore, as a byproduct, FuseGrasp can determine the composition of transparent objects, such as glass or plastic, leveraging the material identification capability of mmWave radar. This identification result facilitates the robotic arm in modulating its grip force appropriately. Extensive testing reveals that FuseGrasp significantly improves the accuracy of depth reconstruction and material identification for transparent objects. Moreover, real-world robotic trials have confirmed that FuseGrasp markedly enhances the handling of transparent items. A video demonstration of FuseGrasp is available at https://youtu.be/MWDqv0sRSok.