Abstract:Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
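The ADD and ADD-S scores quoted above are standard pose-error metrics, and a minimal sketch of how they are usually computed may help in reading the numbers. The function names, the NumPy implementation, and the toy square-shaped point set are assumptions made for illustration and are not taken from the ViTa-Zero paper. The area-under-curve (AUC) step that turns these errors into the reported percentages is not shown.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R, t) to an (N, 3) array of model points."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points under both poses."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return float(np.linalg.norm(gt - est, axis=1).mean())

def add_s_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to its closest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    # Dense N x N distance matrix; fine for small N, use a KD-tree for large models.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

# Toy example: four corners of a square, estimated pose off by 90 degrees about z.
square = np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0],
                   [-1.0, -1.0, 0.0], [1.0, -1.0, 0.0]])
identity = np.eye(3)
rot_z_90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
zero = np.zeros(3)

print(add_metric(square, identity, zero, rot_z_90, zero))    # 2.0
print(add_s_metric(square, identity, zero, rot_z_90, zero))  # 0.0
```

The toy case shows why both metrics are reported: the square is symmetric under a 90-degree rotation, so ADD penalizes the estimate while ADD-S treats it as a perfect match.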
Abstract:This paper develops a virtual domain extension (VDE) framework for imposing boundary conditions (BCs) in flow simulation using a pre-trained local neural operator (LNO). It adds extended virtual domains to the input function to compensate for the corrosion nature of computational domains during LNO inference, thus turning the implementation of BCs into the determination of field values on the extended domain. Several strategies for calculating these field values, including padding operation, direct imposition, pressure symmetry, and optimization by backpropagation, are proposed, validated on numerical examples, and compared with boundary imposition in traditional solvers. It is found that the large time interval of LNO requires a relatively wide near-boundary domain to be processed, so imposing BCs on only a few nodes near the boundary, following the immersed boundary concept in traditional solvers, can hardly achieve high accuracy. With appropriate values assigned on the extended virtual domains, VDE can accurately impose BCs and lead to reasonable flow field predictions. This work provides guidance for imposing BCs reliably in LNO prediction, which could facilitate the reuse of pre-trained LNOs in more applications.
Abstract:Humans naturally integrate vision and haptics for robust object perception during manipulation. The loss of either modality significantly degrades performance. Inspired by this multisensory integration, prior object pose estimation research has attempted to combine visual and haptic/tactile feedback. Although these works demonstrate improvements in controlled environments or synthetic datasets, they often underperform vision-only approaches in real-world settings due to poor generalization across diverse grippers, sensor layouts, or sim-to-real environments. Furthermore, they typically estimate the object pose for each frame independently, resulting in less coherent tracking over sequences in real-world deployments. To address these limitations, we introduce a novel unified haptic representation that effectively handles multiple gripper embodiments. Building on this representation, we introduce a new visuo-haptic transformer-based object pose tracker that seamlessly integrates visual and haptic input. We validate our framework on our dataset and the Feelsight dataset, demonstrating significant performance improvements on challenging sequences. Notably, our method achieves superior generalization and robustness across novel embodiments, objects, and sensor types (both taxel-based and vision-based tactile sensors). In real-world experiments, we demonstrate that our approach outperforms state-of-the-art visual trackers by a large margin. We further show that we can perform precise manipulation tasks by incorporating our real-time object tracking results into motion plans, underscoring the advantages of visuo-haptic perception. Our model and dataset will be made open source upon acceptance of the paper. Project website: https://lhy.xyz/projects/v-hop/
Abstract:Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. We also introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data and benchmark will be released at https://github.com/appletea233/LLaVA-ST .
Abstract:Beyond diagonal reconfigurable intelligent surface (BD-RIS) is a new architecture for RIS where elements are interconnected to provide more wave manipulation flexibility than traditional single-connected RIS, enhancing data rate and coverage. However, channel estimation for BD-RIS is challenging due to the more complex multiple-connection structure of its scattering elements. To address this issue, this paper proposes a decoupled channel estimation method for BD-RIS that yields separate estimates of the involved channels and, by capitalizing on the Kronecker structure of the overall combined channel, enhances the accuracy of its estimate. Starting from a least squares estimate of the combined channel and by properly reshaping the resulting filtered signal, the proposed algorithm resorts to a Khatri-Rao Factorization (KRF) method that teases out the individual channels based on simple rank-one matrix approximation steps. Numerical results show that the proposed decoupled channel estimation yields more accurate channel estimates than the classical least squares scheme.
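The rank-one approximation step behind a Khatri-Rao factorization can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's algorithm: the function names, the matrix shapes, and the noiseless random test data are hypothetical, and the reshaping of the least squares estimate that precedes this step in the abstract is not reproduced. The idea is that each column of a Khatri-Rao product is the Kronecker product of two vectors, so reshaping that column yields a rank-one matrix whose dominant singular vectors recover the two factors up to a per-column scaling ambiguity.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R), giving an (I*J) x R matrix."""
    I, R = A.shape
    J, R_b = B.shape
    assert R == R_b, "A and B must have the same number of columns"
    # Entry (i*J + j, r) equals A[i, r] * B[j, r].
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def khatri_rao_factorize(X, I, J):
    """Estimate A (I x R) and B (J x R) such that X is close to khatri_rao(A, B)."""
    R = X.shape[1]
    A_hat = np.zeros((I, R), dtype=X.dtype)
    B_hat = np.zeros((J, R), dtype=X.dtype)
    for r in range(R):
        # kron(a, b) reshaped row-major to I x J is the outer product a b^T.
        M = X[:, r].reshape(I, J)
        U, s, Vh = np.linalg.svd(M, full_matrices=False)
        # Best rank-one approximation: s[0] * U[:, 0] * Vh[0, :], split evenly.
        A_hat[:, r] = np.sqrt(s[0]) * U[:, 0]
        B_hat[:, r] = np.sqrt(s[0]) * Vh[0, :]
    return A_hat, B_hat

# Usage on noiseless complex data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
B = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))
X = khatri_rao(A, B)
A_hat, B_hat = khatri_rao_factorize(X, 4, 5)
print(np.allclose(khatri_rao(A_hat, B_hat), X))
```

The individual factors are only identifiable up to a complex scalar per column, so the check compares the reconstructed product rather than `A_hat` against `A`.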
Abstract:Reconfigurable intelligent surface (RIS) has been envisioned as a key technology in future wireless communication networks to enable a smart radio environment. To further enhance the passive beamforming capability of RIS, beyond diagonal (BD)-RIS has been proposed considering reconfigurable interconnections among different RIS elements. BD-RIS has a unique feature that cannot be enabled by conventional diagonal RIS; it can be realized by non-reciprocal circuits and thus enables an asymmetric scattering matrix. This feature provides the capability to break the wireless channel reciprocity, and has the potential to benefit full-duplex (FD) systems. In this paper, we model the BD-RIS-assisted FD systems, where the impact of BD-RIS non-reciprocity and that of structural scattering, which refers to the specular reflection generated by RIS when the RIS is turned OFF, are explicitly captured. To assess the benefits of non-reciprocal BD-RIS, we optimize the scattering matrix, precoder and combiner to maximize the DL and UL sum-rates in the FD system. To tackle this optimization problem, we propose an iterative algorithm based on block coordinate descent (BCD) and penalty dual decomposition (PDD). Numerical results demonstrate a surprising benefit of non-reciprocal BD-RIS: it can achieve much higher DL and UL sum-rates in the FD scenario than reciprocal BD-RIS and conventional diagonal RIS.
Abstract:Reconfigurable intelligent surface (RIS) has been envisioned as a key technology in future wireless communication networks to enable a smart radio environment. To further enhance the passive beamforming capability of RIS, beyond diagonal (BD)-RIS has been proposed considering interconnections among different RIS elements. BD-RIS has a unique feature that cannot be enabled by conventional diagonal RIS; it can be realized by non-reciprocal circuits and thus has an asymmetric scattering matrix. This feature makes it possible to break the wireless channel reciprocity, and thus has the potential to benefit the full-duplex (FD) system. In this paper, we model the BD-RIS-assisted FD systems, where the impact of BD-RIS non-reciprocity and that of structural scattering, which refers to the virtual direct channel constructed by RIS when the RIS is turned OFF, are explicitly captured. To illustrate the analysis, we design the scattering matrix, precoder and combiner to maximize the DL and UL sum-rates in the FD system. To tackle this optimization problem, we propose an iterative algorithm based on block coordinate descent (BCD) and penalty dual decomposition (PDD). Numerical results demonstrate a surprising benefit of non-reciprocal BD-RIS: it can achieve higher DL and UL sum-rates in the FD scenario than reciprocal BD-RIS and conventional diagonal RIS.
Abstract:Reconfigurable Intelligent Surface (RIS) is a breakthrough technology enabling the dynamic control of the propagation environment in wireless communications through programmable surfaces. To improve the flexibility of conventional diagonal RIS (D-RIS), beyond diagonal RIS (BD-RIS) has emerged as a family of more general RIS architectures. However, D-RIS and BD-RIS have commonly been studied while neglecting mutual coupling effects, and the global optimization of RIS with mutual coupling, its performance limits, and scaling laws remain unexplored. This study addresses these gaps by deriving globally optimal closed-form solutions that maximize the channel gain for BD-RIS with mutual coupling, specifically for fully- and tree-connected RISs. In addition, we provide closed-form expressions for the maximum channel gain achievable in the presence of mutual coupling and for its scaling law. By using the derived scaling laws, we analytically prove that mutual coupling increases the channel gain on average under Rayleigh fading channels. Our theoretical analysis, confirmed by numerical simulations, shows that both fully- and tree-connected RISs with mutual coupling achieve the same channel gain upper bound when optimized with the proposed globally optimal solutions. Furthermore, we observe that a mutual coupling-unaware optimization of RIS can cause a channel gain degradation of up to 5 dB.
Abstract:Beyond diagonal reconfigurable intelligent surfaces (BD-RIS) is a new advance in RIS techniques that introduces reconfigurable inter-element connections to generate scattering matrices not limited to being diagonal. BD-RIS has been recently proposed and proven to have benefits in enhancing channel gain and enlarging coverage in wireless communications. Uniquely, BD-RIS enables reciprocal and non-reciprocal architectures characterized by symmetric and non-symmetric scattering matrices, respectively. However, the performance benefits and new use cases enabled by non-reciprocal BD-RIS for wireless systems remain unexplored. This work takes a first step toward closing this knowledge gap and studies the non-reciprocal BD-RIS in full-duplex systems and its performance benefits over reciprocal counterparts. We start by deriving a general RIS-aided full-duplex system model using multiport circuit theory, followed by a simplified channel model based on physically consistent assumptions. With the considered channel model, we investigate the effect of BD-RIS non-reciprocity and identify the theoretical conditions for reciprocal and non-reciprocal BD-RISs to simultaneously achieve the maximum received power of the signal of interest in the uplink and the downlink. Simulation results validate the theory and highlight the significant benefits offered by non-reciprocal BD-RIS in full-duplex systems. These gains stem from the non-reciprocity principle: a wave that hits the non-reciprocal BD-RIS from one direction is treated differently from a wave that hits it from the opposite direction. This enables an uplink user and a downlink user at different locations to optimally communicate with the same full-duplex base station via a non-reciprocal BD-RIS, which would not be possible with reciprocal surfaces.
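The link between a symmetric scattering matrix and channel reciprocity can be checked numerically with a small sketch. It rests on assumptions that are not taken from the paper: a single-antenna cascaded channel written as `h_rx^T Theta h_tx`, reciprocal propagation channels on both sides of the surface, and randomly drawn unitary scattering matrices. All function and variable names are hypothetical.

```python
import numpy as np

def random_unitary(n, rng):
    """Draw an n x n unitary matrix from the QR decomposition of a complex Gaussian matrix."""
    Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    Q, _ = np.linalg.qr(Z)
    return Q

def cascaded_channel(h_rx, theta, h_tx):
    """Scalar effective channel h_rx^T Theta h_tx (plain transpose, no conjugation)."""
    return h_rx @ theta @ h_tx

rng = np.random.default_rng(0)
n = 4
U = random_unitary(n, rng)
theta_reciprocal = U @ U.T                    # unitary and symmetric
theta_nonreciprocal = random_unitary(n, rng)  # unitary, generically not symmetric

# Channels from two nodes a and b to the surface (assumed reciprocal themselves).
h_a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
h_b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

for name, theta in [("reciprocal", theta_reciprocal),
                    ("non-reciprocal", theta_nonreciprocal)]:
    b_to_a = cascaded_channel(h_a, theta, h_b)
    a_to_b = cascaded_channel(h_b, theta, h_a)
    print(name, "channels equal in both directions:", bool(np.isclose(b_to_a, a_to_b)))
```

With a symmetric `theta` the two directions always coincide because `h_a^T Theta h_b` equals `h_b^T Theta^T h_a`. A non-symmetric `theta` removes that constraint, which is the extra freedom that non-reciprocal designs exploit.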
Abstract:Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, directly using phrase features as static prompts to apply frozen diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the diffusion UNet that dynamically updates phrase prompts with image features and injects the multimodal cues back, making fuller use of the fine-grained image-text alignment capability of diffusion models. In addition, we design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.