Shibiao Xu

CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

Oct 14, 2025

Jinzhou Lin, Jie Zhou, Wenhao Xu, Rongtao Xu, Changwei Wang, Shunpeng Chen, Kexue Fu, Yihua Shao, Li Guo, Shibiao Xu

Abstract:Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.
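
The curriculum described in the abstract can be pictured as a schedule that gradually shifts the depth input from LiDAR to stereo. The sketch below is only an illustration: the linear schedule, the per-pixel blending rule, the use of `None` for missing LiDAR returns, and the names `stereo_weight` and `blend_depth` are all assumptions, not details taken from the paper.

```python
def stereo_weight(step, total_steps):
    """Hypothetical linear curriculum: 0.0 (LiDAR only) rising to 1.0 (stereo only)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def blend_depth(lidar, stereo, weight):
    """Blend two depth maps given as flat lists of depths in metres.

    LiDAR is sparse, so missing returns are encoded as None (an assumption);
    those pixels always fall back to the stereo estimate.
    """
    blended = []
    for d_lidar, d_stereo in zip(lidar, stereo):
        if d_lidar is None:
            blended.append(d_stereo)
        else:
            blended.append((1.0 - weight) * d_lidar + weight * d_stereo)
    return blended


lidar = [10.0, None, 20.0]   # sparse but accurate
stereo = [12.0, 5.0, 18.0]   # dense but noisy
for step in (0, 50, 100):
    w = stereo_weight(step, 100)
    print(step, w, blend_depth(lidar, stereo, w))
```

Early in training the blended depth follows LiDAR wherever it exists; by the final step it matches the stereo depth that would be available at deployment.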


3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Jul 16, 2025

Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen, Changwei Wang, Li Guo, Xiaodan Liang, Shibiao Xu

Abstract:With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 of 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed at https://3D-MoRe.github.io.

* Accepted by IROS 2025


SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

May 29, 2025

Wenhao Xu, Shuchen Zheng, Changwei Wang, Zherui Zhang, Chuan Ren, Rongtao Xu, Shibiao Xu

Abstract:Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.
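
The gating idea behind a module such as DPCF can be illustrated with a generic sigmoid gate that mixes a high-resolution and a low-resolution feature value. This is a minimal sketch of gated fusion in general, not the authors' module: the scalar gate parameters `w_high`, `w_low`, and `bias` and the function name `gated_fusion` are hypothetical, and real features would be tensors already resampled to a common resolution.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(high, low, w_high=1.0, w_low=1.0, bias=0.0):
    """Fuse two aligned feature lists element by element.

    gate = sigmoid(w_high * h + w_low * l + bias) decides how much of the
    high-resolution value is kept; the remainder comes from the low-resolution one.
    """
    fused = []
    for h, l in zip(high, low):
        gate = sigmoid(w_high * h + w_low * l + bias)
        fused.append(gate * h + (1.0 - gate) * l)
    return fused


high_res = [0.9, -0.2, 0.4]   # fine detail
low_res = [0.1, 0.3, 0.5]     # coarse context
print(gated_fusion(high_res, low_res))
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the two inputs, so neither branch can be ignored completely.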

* Information Fusion 2025


Movable-Element STARS-Assisted Near-Field Wideband Communications

May 25, 2025

Guangyu Zhu, Xidong Mu, Li Guo, Ao Huang, Shibiao Xu

Abstract:A novel movable-element simultaneously transmitting and reflecting surface (ME-STARS)-assisted near-field wideband communication framework is proposed. In particular, the position of each STARS element can be adjusted to combat the significant wideband beam squint issue in the near field instead of using costly true-time delay components. Four practical ME-STARS element movement modes are proposed, namely region-based (RB), horizontal-based (HB), vertical-based (VB), and diagonal-based (DB) modes. Based on this, a near-field wideband multi-user downlink communication scenario is considered, where a sum rate maximization problem is formulated by jointly optimizing the base station (BS) precoding, ME-STARS beamforming, and element positions. To solve this intractable problem, a two-layer algorithm is developed. For the inner layer, the block coordinate descent optimization framework is utilized to solve the BS precoding and ME-STARS beamforming in an iterative manner. For the outer layer, the particle swarm optimization-based heuristic search method is employed to determine the desired element positions. Numerical results show that: 1) the ME-STARSs can effectively address the beam squint for near-field wideband communications compared to conventional STARSs with fixed element positions; 2) the RB mode achieves the most efficient beam squint effect mitigation, while the DB mode achieves the best trade-off between performance gain and hardware overhead; and 3) an increase in the number of ME-STARS elements or BS subcarriers substantially improves the system performance.
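
The outer-layer search mentioned in the abstract relies on particle swarm optimization (PSO). The sketch below is a textbook PSO on a toy objective, shown only to illustrate the search pattern. The toy objective stands in for the negative sum rate, and the function name `pso`, the hyperparameters, and the box bounds are assumptions rather than values from the paper.

```python
import random


def pso(objective, dim, bounds, n_particles=20, n_iters=100, seed=0,
        inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimise `objective` over a box with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # per-particle best positions
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Clamp to the feasible region (e.g. allowed element positions).
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def toy_objective(x):
    # Placeholder for "negative sum rate given element positions x".
    return sum((xi - 1.0) ** 2 for xi in x)


best_x, best_f = pso(toy_objective, dim=2, bounds=(-5.0, 5.0))
print(best_x, best_f)
```

In a two-layer scheme of the kind the abstract describes, each call to the objective would itself run the inner optimization for the candidate positions, which is why a derivative-free heuristic is a natural fit for the outer layer.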


FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

May 23, 2025

Zherui Zhang, Jiaxin Wu, Changwei Wang, Rongtao Xu, Longzhao Huang, Wenhao Xu, Wenbo Xu, Li Guo, Shibiao Xu

Abstract:Prompt learning is a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repetitive online inference of the teacher model sacrifices the inherent training efficiency advantage of prompt learning. In this paper, we propose Faster Distillation-Based Prompt Learning (FDBPL), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions containing multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving 2.2× faster training speed.


Image Recognition with Online Lightweight Vision Transformer: A Survey

May 06, 2025

Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu(+1 more)

Abstract:The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet they lack inductive biases and efficiency benefits and face significant computational and memory challenges that limit their real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT


CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

Apr 30, 2025

Zherui Zhang, Changwei Wang, Rongtao Xu, Wenhao Xu, Shibiao Xu, Yu Zhang, Li Guo

Abstract:Data-Free Knowledge Distillation (DFKD) enables knowledge transfer from a given pre-trained teacher network to a target student model without access to the real training data. Existing DFKD methods focus primarily on improving image recognition performance on associated datasets, often neglecting the crucial aspect of the transferability of learned representations. In this paper, we propose Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which addresses, at the embedding level, the limitations of previous image-level methods that improve model generalization but fail when directly applied to DFKD. The superiority and flexibility of CAE-DFKD are extensively evaluated, including: (i) significant efficiency advantages resulting from altering the generator training paradigm; (ii) competitive performance with existing DFKD state-of-the-art methods on image recognition tasks; and (iii) remarkable transferability of data-free learned representations demonstrated in downstream tasks.
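
For readers unfamiliar with the teacher-to-student transfer that DFKD builds on, the following is a minimal sketch of the classic temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). It illustrates knowledge distillation in general and is not the CAE-DFKD objective. The function names and the temperature value are assumptions, and in the data-free setting the logits would come from synthetic rather than real inputs.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.5]
print(distillation_loss(student, teacher))
print(distillation_loss(teacher, teacher))  # identical logits give zero loss
```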


Pinching-Antenna Systems (PASS)-enabled Secure Wireless Communications

Apr 18, 2025

Guangyu Zhu, Xidong Mu, Li Guo, Shibiao Xu, Yuanwei Liu, Naofal Al-Dhahir

Abstract:A novel pinching-antenna systems (PASS)-enabled secure wireless communication framework is proposed. By dynamically adjusting the positions of dielectric particles, namely pinching antennas (PAs), along the waveguides, PASS introduces a novel concept of pinching beamforming to enhance the performance of physical layer security. A fundamental PASS-enabled secure communication system is considered with one legitimate user and one eavesdropper. Both single-waveguide and multiple-waveguide scenarios are studied. 1) For the single-waveguide scenario, the secrecy rate (SR) maximization is formulated to optimize the pinching beamforming. A PA-wise successive tuning (PAST) algorithm is proposed, which ensures constructive signal superposition at the legitimate user while inducing a destructive legitimate signal at the eavesdropper. 2) For the multiple-waveguide scenario, artificial noise (AN) is employed to further improve secrecy performance. A pair of practical transmission architectures are developed: waveguide division (WD) and waveguide multiplexing (WM). The key difference lies in whether each waveguide carries a single type of signal or a mixture of signals with baseband beamforming. For the SR maximization problem under the WD case, a two-stage algorithm is developed, where the pinching beamforming is designed with the PAST algorithm and the baseband power allocation among AN and legitimate signals is solved using successive convex approximation (SCA). For the WM case, an alternating optimization algorithm is developed, where the baseband beamforming is optimized with SCA and the pinching beamforming is designed employing particle swarm optimization.
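
The secrecy rate that the algorithms above maximize is, in its standard form, the positive part of the difference between the legitimate user's and the eavesdropper's achievable rates. The helper below is a small sketch of that textbook definition. The function name and the example SNR values are illustrative assumptions, and the paper's own expression (which depends on pinching-antenna positions and artificial noise) is not reproduced here.

```python
import math


def secrecy_rate(snr_user, snr_eve):
    """Secrecy rate in bit/s/Hz: max(log2(1 + SNR_user) - log2(1 + SNR_eve), 0)."""
    if snr_user < 0 or snr_eve < 0:
        raise ValueError("SNR values must be non-negative (linear scale)")
    return max(math.log2(1.0 + snr_user) - math.log2(1.0 + snr_eve), 0.0)


# Constructive superposition at the user and destructive superposition at the
# eavesdropper widen the gap between the two SNRs.
print(secrecy_rate(15.0, 3.0))  # user stronger than eavesdropper
print(secrecy_rate(1.0, 3.0))   # eavesdropper stronger: rate clipped at zero
```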


Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Apr 14, 2025

Changwei Wang, Shunpeng Chen, Yukun Song, Rongtao Xu, Zherui Zhang, Jiguang Zhang, Haoran Yang, Yu Zhang, Kexue Fu, Shide Du(+4 more)

Abstract:Visual Place Recognition (VPR) aims to predict the location of a query image by referencing a database of geotagged images. In the VPR task, a few discriminative local regions in an image often have an outsized effect, while mundane background regions contribute nothing or even cause perceptual aliasing because they overlap easily. However, existing methods lack precise modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to boost the performance of image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of ground-truth local correspondences for the VPR task. Third, we propose an efficient and precise re-ranking pipeline guided by discriminative regions. Finally, experimental results show that our FoL achieves state-of-the-art results on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at https://github.com/chenshunpeng/FoL

* Accepted by AAAI 2025


Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Apr 03, 2025

Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang(+2 more)

Abstract:Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.

* 27 pages, 11 figures, survey paper submitted to Information Fusion
