Abstract:Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA converts 2D guidance into 3D trajectories and generates smooth spatial masks that capture 4D spatio-temporal context, providing a stable interface between semantic reasoning and continuous control. To enable effective learning of such representations, we introduce ST-Human, a large-scale human manipulation dataset with 14 tasks and 300k episodes, annotated with 2D, 3D, and 4D supervision via a semi-automated pipeline. Using ST-Human, we train ST-VLM, a spatio-temporal vision-language model that generates spatially grounded and temporally coherent 3D representations to guide policy execution. The smooth spatial masks focus on task-relevant geometry and stabilize latent representations, enabling online replanning and long-horizon reasoning. Experiments on RLBench and real-world manipulation tasks show that ST-VLA significantly outperforms state-of-the-art baselines, improving zero-shot success rates by 44.6% and 30.3%. These results demonstrate that offloading spatio-temporal reasoning to VLMs with unified 3D-4D representations substantially improves robustness and generalization for open-world robotic manipulation. Project website: https://oucx117.github.io/ST-VLA/.
Abstract:Intelligent forest tree breeding has advanced plant phenotyping, yet existing research largely focuses on large-leaf agricultural crops, with limited attention to fine-grained leaf analysis of sapling trees in open-field environments. Natural scenes introduce challenges including scale variation, illumination changes, and irregular leaf morphology. To address these issues, we collected UAV RGB imagery of field-grown saplings and constructed the Poplar-leaf dataset, containing 1,202 branches and 19,876 pixel-level annotated leaf instances. To our knowledge, this is the first instance segmentation dataset specifically designed for forestry leaves in open-field conditions. We propose LeafInst, a novel segmentation framework tailored for irregular and multi-scale leaf structures. The model integrates an Asymptotic Feature Pyramid Network (AFPN) for multi-scale perception, a Dynamic Asymmetric Spatial Perception (DASP) module for irregular shape modeling, and a dual-residual Dynamic Anomalous Regression Head (DARH) with Top-down Concatenation decoder Feature Fusion (TCFU) to improve detection and segmentation performance. On Poplar-leaf, LeafInst achieves 68.4 mAP, outperforming YOLOv11 by 7.1 percent and MaskDINO by 6.5 percent. On the public PhenoBench benchmark, it reaches 52.7 box mAP, exceeding MaskDINO by 3.4 percent. Additional experiments demonstrate strong generalization and practical utility for large-scale leaf phenotyping.
Abstract:3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS. Our code will be made publicly available at (https://github.com/Bistu3DV/MND-GS/).




Abstract:Automatic Modulation Recognition (AMR) is an essential part of Intelligent Transportation System (ITS) dynamic spectrum allocation. However, current deep learning-based AMR (DL-AMR) methods struggle to extract discriminative and robust features at low signal-to-noise ratios (SNRs), where the representation of modulation symbols is heavily corrupted by noise. Furthermore, current research on GNN methods for AMR tasks generally suffers from issues related to graph structure construction and computational complexity. In this paper, we propose a Spatial-Temporal-Frequency Graph Convolution Network (STF-GCN) framework, with the temporal domain as the anchor point, to fuse spatial and frequency domain features embedded in the graph structure nodes. On this basis, an adaptive correlation-based adjacency matrix construction method is proposed, which significantly enhances the graph structure's capacity to aggregate local information into individual nodes. In addition, a PoolGAT layer is proposed to coarsen and compress the global key features of the graph, significantly reducing the computational complexity. The experimental results confirm that STF-GCN achieves recognition performance well beyond the state-of-the-art DL-AMR algorithms, with overall accuracies of 64.35%, 66.04% and 70.95% on the RML2016.10a, RML2016.10b and RML22 datasets, respectively. Furthermore, the average recognition accuracies under low SNR conditions from -14 dB to 0 dB outperform the state-of-the-art (SOTA) models by 1.20%, 1.95% and 1.83%, respectively.
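
The abstract does not spell out how the correlation-based adjacency matrix is built. As a minimal sketch of the general idea, assuming Pearson correlation between node feature vectors and a fixed threshold (both are assumptions, as are the names `pearson` and `correlation_adjacency`), two graph nodes can be linked whenever their features are strongly correlated:

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_adjacency(node_features, threshold=0.5):
    """Symmetric 0/1 adjacency matrix with self-loops.

    Nodes i and j are linked when |corr(features_i, features_j)| >= threshold.
    The threshold rule is an illustrative assumption, not the paper's method.
    """
    n = len(node_features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1  # self-loop so each node keeps its own features
        for j in range(i + 1, n):
            if abs(pearson(node_features[i], node_features[j])) >= threshold:
                adj[i][j] = adj[j][i] = 1
    return adj


# Toy node features: nodes 0 and 1 follow the same trend, node 2 oscillates.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [1.0, -1.0, 1.0, -1.0],
]
adjacency = correlation_adjacency(features, threshold=0.9)
```

In this toy case only nodes 0 and 1 end up connected. An adaptive scheme, as the abstract describes, would presumably learn or adjust the linking rule rather than fix a threshold by hand.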




Abstract:Reconfigurable Intelligent Surfaces (RISs) have emerged as a transformative technology for next-generation wireless communication systems, offering unprecedented control over electromagnetic wave propagation. In particular, Simultaneously Transmitting and Reflecting RISs (STAR-RISs) have garnered significant attention due to their full-space coverage. This paper presents an active STAR-RIS, which enables independent control of both transmission and reflection phases and features out-of-band harmonic suppression. Unlike the traditional passive RIS, the proposed design integrates active amplification to overcome the inherent passive losses, significantly enhancing signal strength and system performance. Additionally, the system supports dynamic power allocation between transmission and reflection modes, providing greater flexibility to meet diverse communication demands in complex propagation environments. The versatility of the design is further validated by extending the Radar Cross Section (RCS)-based path loss model to the STAR-RIS. This design improves efficiency, flexibility, and adaptability, offering a promising solution for future wireless communication systems, particularly in scenarios requiring simultaneous control of transmission and reflection signals.
Abstract:Novel view synthesis has made significant progress in the field of 3D computer vision. However, rendering view-consistent novel views from imperfect camera poses remains challenging. In this paper, we introduce a hybrid bundle-adjusting 3D Gaussians model that enables view-consistent rendering with pose optimization. This model jointly extracts image-based and neural 3D representations to simultaneously generate view-consistent images and camera poses within forward-facing scenes. The effectiveness of our model is demonstrated through extensive experiments conducted on both real and synthetic datasets. These experiments clearly illustrate that our model can effectively optimize neural scene representations while simultaneously resolving significant camera pose misalignments. The source code is available at https://github.com/Bistu3DV/hybridBA.




Abstract:This paper introduces a size-adaptable robotic endoscope design that aims to improve the efficiency and comfort of colonoscopy. The proposed robotic endoscope combines an expansion mechanism with an external drive system and can adjust its shape to different pipe diameters, improving stability and propulsion force during locomotion. As the actuator in the expansion mechanism, flexible bellows provide a normal force of 3.89 N and an axial deformation of nearly 10 mm at maximum pressure, with a 53% expansion rate in the size of the expandable tip. In locomotion tests, we characterized how the prototype's propulsion depends on the friction coefficient of the pipe and the motor angular velocity. In experiments with artificial bowel tissues, the prototype generates a propelling force of 2.83 N and a maximum linear speed of 29.29 m/s on average, and produces effective propulsion when passing through different pipe sizes. The results show that the prototype can adapt its shape to obtain greater propulsion. The relationship between propelling force and traction force, structural optimization, and miniaturization still need further exploration.




Abstract:Reconfigurable intelligent surface (RIS) is a promising technology that has the potential to change the way we interact with the wireless propagation environment. In this paper, we design and fabricate an RIS system that can be used in fifth generation (5G) mobile communication networks. We also propose a practical two-step spatial-oversampling codebook algorithm for RIS beamforming, which is based on the spatial structure of the wireless channel. This algorithm has much lower complexity than the two-dimensional full-space searching-based codebook, with only negligible performance loss. Then, a series of experiments is conducted with the fabricated RIS systems, covering office, corridor, and outdoor environments, to verify the effectiveness of RIS in both laboratory and current 5G commercial networks. In the office and corridor scenarios, the 5.8 GHz RIS provided a 10-20 dB power gain at the receiver. In the outdoor test, over 35 dB power gain was observed with RIS compared to the non-deployment case. However, in commercial 5G networks, the 2.6 GHz RIS improved indoor signal strength by only 4-7 dB. The experimental results indicate that RIS achieves higher power gain when transceivers are equipped with directional antennas instead of omni-directional antennas.
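
The abstract does not describe the two steps of the codebook algorithm, so the following is only a generic coarse-to-fine beam search that illustrates why a two-step search needs fewer measurements than an exhaustive sweep. The one-dimensional half-wavelength array model, the angle grid, and the names `received_power` and `two_step_search` are all assumptions made for illustration, not the authors' design.

```python
import cmath
import math


def received_power(steer_deg, true_deg, num_elements=16):
    """Simulated power of a half-wavelength linear array steered to steer_deg
    while the receiver sits at true_deg (hypothetical stand-in for a measurement)."""
    delta = math.sin(math.radians(true_deg)) - math.sin(math.radians(steer_deg))
    field = sum(cmath.exp(1j * math.pi * n * delta) for n in range(num_elements))
    return abs(field) ** 2


def two_step_search(measure, lo=-60.0, hi=60.0, coarse_step=6.0, fine_step=1.0):
    """Step 1: coarse sweep over [lo, hi].
    Step 2: fine sweep within one coarse step of the best coarse beam.
    Returns the best angle and the number of measurements used."""
    coarse = [lo + i * coarse_step for i in range(int((hi - lo) / coarse_step) + 1)]
    best_coarse = max(coarse, key=measure)
    half = int(coarse_step / fine_step)
    fine = [best_coarse + k * fine_step for k in range(-half, half + 1)]
    fine = [a for a in fine if lo <= a <= hi]  # stay inside the scan range
    best_fine = max(fine, key=measure)
    return best_fine, len(coarse) + len(fine)


# Receiver placed at 23 degrees (arbitrary illustrative value).
best_angle, num_measurements = two_step_search(
    lambda angle: received_power(angle, true_deg=23.0)
)
```

With these illustrative settings the search uses 34 measurements, whereas an exhaustive 1-degree sweep of the same range would need 121. The coarse step has to stay within the main-lobe width of the array, otherwise the first step can miss the beam entirely.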




Abstract:Most research on reconfigurable intelligent surfaces (RIS) relies on an idealized model of the reflection coefficients, i.e., uniform reflection amplitude for any phase and sufficient phase shifting capability. In practice, however, such models are oversimplified. This paper introduces a realistic reflection coefficient model for RIS based on measurements. The reflection coefficients are modeled as discrete complex values that have non-uniform amplitudes and suffer from insufficient phase shift capability. We then propose a group-based query algorithm that takes the imperfect coefficients into consideration while calculating the reflection coefficients. We analyze the performance of the proposed algorithm and derive closed-form expressions to characterize the received power of an RIS-aided wireless communication system. The performance gains of the proposed algorithm are confirmed in simulations. Furthermore, we validate the proposed theoretical results by experiments with our fabricated RIS prototype systems. The simulation and measurement results match well with the theoretical analysis.
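
To make the coefficient model concrete, the sketch below represents each RIS element by a small set of discrete complex states with unequal amplitudes and a phase range short of 360 degrees, then picks states to maximize received power. The state values are invented for illustration, and the selection rule (a sweep over reference phases with a per-element best-projection choice) is a simple baseline, not the group-based query algorithm proposed in the paper.

```python
import cmath
import math

# Hypothetical states as (amplitude, phase in degrees); values are illustrative only.
# Amplitudes differ per state and the phases do not cover the full circle evenly.
STATES = [
    cmath.rect(a, math.radians(p))
    for a, p in [(1.0, 0), (0.7, 80), (0.5, 160), (0.8, 230)]
]


def select_coefficients(channels, states=STATES, num_refs=36):
    """For each candidate reference phase, give every element the state whose
    contribution projects furthest onto that phase, then keep the reference
    phase that yields the highest total power |sum_n h_n * gamma_n|^2."""
    best_power, best_choice = -1.0, None
    for k in range(num_refs):
        ref = cmath.exp(-2j * math.pi * k / num_refs)  # rotate target phase to the real axis
        choice = [
            max(range(len(states)), key=lambda s: (h * states[s] * ref).real)
            for h in channels
        ]
        total = sum(h * states[s] for h, s in zip(channels, choice))
        power = abs(total) ** 2
        if power > best_power:
            best_power, best_choice = power, choice
    return best_choice, best_power


# Cascaded channel per element, unit magnitude with arbitrary phases (illustrative).
channels = [cmath.exp(1j * phase) for phase in (0.0, 1.3, 2.9, 4.1, 5.5)]
choice, power = select_coefficients(channels)
```

Because amplitude varies with the state, the best choice is not always the state whose phase is closest to the ideal one: a slightly misaligned state with a larger amplitude can contribute more, which is the effect an idealized uniform-amplitude model hides.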



Abstract:Reconfigurable Intelligent Surface (RIS) has recently been regarded as a paradigm-shifting technology beyond 5G for its flexibility in smartly adjusting the response to impinging electromagnetic (EM) waves. Usually, an RIS is implemented by properly reconfiguring the adjustable parameters of each RIS unit to align the signal phase on the receiver side. It is also believed that the phase alignment can be achieved mechanically by a metal plate of the same physical size. However, we found in prototype experiments that a well-rotated metal plate can only approximately match the performance of an RIS under limited conditions, although its scattering efficiency is relatively higher. In the case of an impinging spherical wave, the RIS outperforms the metal plate even beyond the receiving near-field regions. We analyze this phenomenon with wave optics theory and propose explicit scattering models for both the metal plate and RIS in general scenarios. Finally, the models are validated by simulations and field measurements.