Wenxian Yu

CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

Nov 17, 2025

Dexin Zuo, Ang Li, Wei Wang, Wenxian Yu, Danping Zou

Abstract:Object 6D pose estimation, a crucial task for robotics and augmented reality applications, becomes particularly challenging when dealing with novel objects whose 3D models are not readily available. To reduce dependency on 3D models, recent studies have explored one-reference-based pose estimation, which requires only a single reference view instead of a complete 3D model. However, existing methods that rely on real-valued coordinate regression suffer from limited global consistency due to the local nature of convolutional architectures and face challenges in symmetric or occluded scenarios owing to a lack of uncertainty modeling. We present CoordAR, a novel autoregressive framework for one-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a map of discrete tokens, which is obtained in an autoregressive and probabilistic manner. To enable accurate correspondence regression, CoordAR introduces 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features and the partially generated token sequence. With these novel mechanisms, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests.

* 7 pages, accepted by AAAI 2026 (oral)
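
The abstract describes a coordinate map tokenization that allows probabilistic prediction over a discretized 3D space. As a rough illustration of what such a discretization could look like, the sketch below quantizes coordinates normalized to [0, 1] into a uniform grid and packs the three bin indices into one integer token. The uniform grid, the bin count `NUM_BINS`, and all function names are assumptions made for this example; they are not taken from the paper, whose actual tokenizer may differ.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 16  # hypothetical per-axis resolution, not a value from the paper


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, 1] to a bin index in [0, num_bins - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * num_bins), num_bins - 1)


def tokenize_point(xyz: Sequence[float], num_bins: int = NUM_BINS) -> int:
    """Pack the three per-axis bin indices into a single discrete token."""
    ix, iy, iz = (quantize(v, num_bins) for v in xyz)
    return (ix * num_bins + iy) * num_bins + iz


def detokenize(token: int, num_bins: int = NUM_BINS) -> Tuple[float, float, float]:
    """Recover the bin-center coordinate represented by a token."""
    iz = token % num_bins
    iy = (token // num_bins) % num_bins
    ix = token // (num_bins * num_bins)
    return ((ix + 0.5) / num_bins, (iy + 0.5) / num_bins, (iz + 0.5) / num_bins)


def tokenize_map(coord_map: List[List[Sequence[float]]]) -> List[List[int]]:
    """Tokenize an H x W map whose entries are normalized (x, y, z) triples."""
    return [[tokenize_point(p) for p in row] for row in coord_map]
```

With discrete tokens, a decoder can output a categorical distribution over bins for each pixel instead of a single regressed value, which is one way to represent the ambiguity that symmetric or occluded objects introduce. The round-trip error of this uniform scheme is bounded by half a bin width per axis.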

StableTracker: Learning to Stably Track Target via Differentiable Simulation

Sep 17, 2025

Fanxing Li, Shengyang Wang, Fangyu Sun, Shuyu Wu, Dexin Zuo, Wenxian Yu, Danping Zou

Abstract:FPV object tracking methods rely heavily on handcrafted modular designs, resulting in hardware overload and cumulative error, which seriously degrades tracking performance, especially for rapidly accelerating or decelerating targets. To address these challenges, we present StableTracker, a learning-based control policy that enables quadrotors to robustly follow a moving target from arbitrary perspectives. The policy is trained using backpropagation-through-time via differentiable simulation, allowing the quadrotor to maintain the target at the center of the visual field in both horizontal and vertical directions while keeping a fixed relative distance, thereby functioning as an autonomous aerial camera. We compare StableTracker against both state-of-the-art traditional algorithms and learning baselines. Simulation experiments demonstrate that our policy achieves superior accuracy, stability, and generalization across varying safe distances, trajectories, and target velocities. Furthermore, a real-world experiment on a quadrotor with an onboard computer validates the practicality of the proposed approach.

Collaborative Learning for Unsupervised Multimodal Remote Sensing Image Registration: Integrating Self-Supervision and MIM-Guided Diffusion-Based Image Translation

May 28, 2025

Xiaochen Wei, Weiwei Guo, Wenxian Yu

Abstract:The substantial modality-induced variations in radiometric, texture, and structural characteristics pose significant challenges for the accurate registration of multimodal images. While supervised deep learning methods have demonstrated strong performance, they often rely on large-scale annotated datasets, limiting their practical application. Traditional unsupervised methods usually optimize registration by minimizing differences in feature representations, yet often fail to robustly capture geometric discrepancies, particularly under substantial spatial and radiometric variations, thus hindering convergence stability. To address these challenges, we propose a Collaborative Learning framework for Unsupervised Multimodal Image Registration, named CoLReg, which reformulates unsupervised registration learning into a collaborative training paradigm comprising three components: (1) a cross-modal image translation network, MIMGCD, which employs a learnable Maximum Index Map (MIM) guided conditional diffusion model to synthesize modality-consistent image pairs; (2) a self-supervised intermediate registration network, which learns to estimate geometric transformations using accurate displacement labels derived from MIMGCD outputs; and (3) a distilled cross-modal registration network trained with pseudo-labels predicted by the intermediate network. The three networks are jointly optimized through an alternating training strategy wherein each network enhances the performance of the others. This mutual collaboration progressively reduces modality discrepancies, enhances the quality of pseudo-labels, and improves registration accuracy. Extensive experimental results on multiple datasets demonstrate that CoLReg achieves competitive or superior performance compared to state-of-the-art unsupervised approaches and even surpasses several supervised baselines.

OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model

Apr 08, 2025

Xiaochen Wei, Weiwei Guo, Wenxian Yu, Feiming Wei, Dongying Li

Abstract:Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, current methods often fail to extract modality-invariant features when aligning image pairs with large nonlinear radiometric differences. To address this issue, we propose OSDM-MReg, a novel multimodal image registration framework based on image-to-image translation that eliminates the gap between multimodal images. First, we propose a novel one-step unaligned target-guided conditional denoising diffusion probabilistic model (UTGOS-CDDPM) to translate multimodal images into a unified domain. In the inference stage, a traditional conditional DDPM generates the translated source image through a large number of iterations, which severely slows down the image registration task. To address this issue, we use the unaligned target image as a condition to promote the generation of low-frequency features of the translated source image. Furthermore, during the training stage, we add the inverse process of directly predicting the translated image to ensure that the translated source image can be generated in one step during the testing stage. Additionally, to supervise the detail features of the translated source image, we propose a new perceptual loss that focuses on the high-frequency feature differences between the translated and ground-truth images. Finally, a multimodal multiscale image registration network (MM-Reg) fuses the multimodal features of the unimodal images and multimodal images using the proposed multimodal feature fusion strategy. Experiments demonstrate superior accuracy and efficiency across various multimodal registration tasks, particularly for SAR-optical image pairs.

mmDEAR: mmWave Point Cloud Density Enhancement for Accurate Human Body Reconstruction

Mar 04, 2025

Jiarui Yang, Songpengcheng Xia, Zengyuan Lai, Lan Sun, Qi Wu, Wenxian Yu, Ling Pei

Abstract:Millimeter-wave (mmWave) radar offers robust sensing capabilities in diverse environments, making it a highly promising solution for human body reconstruction due to its privacy-friendly and non-intrusive nature. However, the significant sparsity of mmWave point clouds limits the estimation accuracy. To overcome this challenge, we propose a two-stage deep learning framework that enhances mmWave point clouds and improves human body reconstruction accuracy. Our method includes a mmWave point cloud enhancement module that densifies the raw data by leveraging temporal features and a multi-stage completion network, followed by a 2D-3D fusion module that extracts both 2D and 3D motion features to refine SMPL parameters. The mmWave point cloud enhancement module learns the detailed shape and posture information from 2D human masks in single-view images. However, image-based supervision is involved only during the training phase, and the inference relies solely on sparse point clouds to maintain privacy. Experiments on multiple datasets demonstrate that our approach outperforms state-of-the-art methods, with the enhanced point clouds further improving performance when integrated into existing models.

Multi-Domain Features Guided Supervised Contrastive Learning for Radar Target Detection

Dec 17, 2024

Junjie Wang, Yuze Gao, Dongying Li, Wenxian Yu

Abstract:Detecting small targets in sea clutter is challenging due to dynamic maritime conditions. Existing solutions either model sea clutter for detection or extract target features based on clutter-target echo differences, including statistical and deep features. While more common, the latter often excels in controlled scenarios but struggles with robust detection and generalization in diverse environments, limiting practical use. In this letter, we propose a multi-domain features guided supervised contrastive learning (MDFG_SCL) method, which integrates statistical features derived from multi-domain differences with deep features obtained through supervised contrastive learning, thereby capturing both low-level domain-specific variations and high-level semantic information. This comprehensive feature integration enables the model to effectively distinguish between small targets and sea clutter, even under challenging conditions. Experiments conducted on real-world datasets demonstrate that the proposed shallow-to-deep detector not only achieves effective identification of small maritime targets but also maintains superior detection performance across varying sea conditions, outperforming the mainstream unsupervised contrastive learning and supervised contrastive learning methods.

Seeing Through Pixel Motion: Learning Obstacle Avoidance from Optical Flow with One Camera

Nov 07, 2024

Yu Hu, Yuang Zhang, Yunlong Song, Yang Deng, Feng Yu, Linzuo Zhang, Weiyao Lin, Danping Zou, Wenxian Yu

Abstract:Optical flow captures the motion of pixels in an image sequence over time, providing information about movement, depth, and environmental structure. Flying insects utilize this information to navigate and avoid obstacles, allowing them to execute highly agile maneuvers even in complex environments. Despite its potential, autonomous flying robots have yet to fully leverage this motion information to achieve comparable levels of agility and robustness. Challenges of control from optical flow include extracting accurate optical flow at high speeds, handling noisy estimation, and ensuring robust performance in complex environments. To address these challenges, we propose a novel end-to-end system for quadrotor obstacle avoidance using monocular optical flow. We develop an efficient differentiable simulator coupled with a simplified quadrotor model, allowing our policy to be trained directly through first-order gradient optimization. Additionally, we introduce a central flow attention mechanism and an action-guided active sensing strategy that enhances the policy's focus on task-relevant optical flow observations to enable more responsive decision-making during flight. Our system is validated both in simulation and the real world using an FPV racing drone. Despite being trained in a simple environment in simulation, our system demonstrates agile and robust flight in various unknown, cluttered environments in the real world at speeds of up to 6 m/s.

Rotation Perturbation Robustness in Point Cloud Analysis: A Perspective of Manifold Distillation

Nov 04, 2024

Xinyu Xu, Huazhen Liu, Feiming Wei, Huilin Xiong, Wenxian Yu, Tao Zhang

Abstract:A point cloud is often regarded as a discrete sampling of a Riemannian manifold and plays a pivotal role in 3D image interpretation. In particular, rotation perturbation, an unexpected small change in rotation caused by various factors (such as equipment offset, system instability, and measurement errors), can easily lead to inferior results in point cloud learning tasks. However, classical point cloud learning methods are sensitive to rotation perturbation, and existing rotation-robust networks still have much room for improvement in terms of performance and noise tolerance. Given this, this paper remodels the point cloud from the perspective of the manifold and designs a manifold distillation method to achieve robustness to rotation perturbation without any coordinate transformation. In brief, during the training phase, we introduce a teacher network to learn the rotation robustness information and transfer this information to the student network through online distillation. In the inference phase, the student network directly utilizes the original 3D coordinate information to achieve robustness to rotation perturbation. Experiments carried out on four different datasets verify the effectiveness of our method. On average, on the ModelNet40 and ScanObjectNN classification datasets with random rotation perturbations, our classification accuracy improves by 4.92% and 4.41%, respectively, compared to popular rotation-robust networks; on the ShapeNet and S3DIS segmentation datasets, the mIoU improvements over the rotation-robust networks are 7.36% and 4.82%, respectively. In addition, the experimental results show that the proposed algorithm also performs well in resisting noise and outliers.

* 13 pages, 8 figures, submitted to TCSVT

Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation

Nov 04, 2024

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Shaofeng Zhang, Yi Yu, Wenxian Yu, Junchi Yan

Abstract:In recent years, aerial object detection has been increasingly pivotal in various earth observation applications. However, current algorithms are limited to detecting a set of pre-defined object categories, demand sufficient annotated training samples, and fail to detect novel object categories. In this paper, we put forth a novel formulation of the aerial object detection problem, namely open-vocabulary aerial object detection (OVAD), which can detect objects beyond training categories without the cost of collecting new labeled data. We propose CastDet, a CLIP-activated student-teacher detection framework that serves as the first OVAD detector specifically designed for the challenging aerial scenario, where objects often exhibit weak appearance features and arbitrary orientations. Our framework integrates a robust localization teacher along with several box selection strategies to generate high-quality proposals for novel objects. Additionally, the RemoteCLIP model is adopted as an omniscient teacher, which provides rich knowledge to enhance classification capabilities for novel categories. A dynamic label queue is devised to maintain high-quality pseudo-labels during training. By doing so, the proposed CastDet boosts not only novel object proposals but also classification. Furthermore, we extend our approach from horizontal OVAD to oriented OVAD with tailored algorithm designs to effectively manage bounding box representation and pseudo-label generation. Extensive experiments for both tasks on multiple existing aerial object detection datasets demonstrate the effectiveness of our approach. The code is available at https://github.com/lizzy8587/CastDet.

PreCM: The Padding-based Rotation Equivariant Convolution Mode for Semantic Segmentation

Nov 03, 2024

Xinyu Xu, Huazhen Liu, Huilin Xiong, Wenxian Yu, Tao Zhang

Abstract:Semantic segmentation is an important branch of image processing and computer vision. With the popularity of deep learning, various deep semantic segmentation networks have been proposed for pixel-level classification and segmentation tasks. However, imaging angles are often arbitrary in the real world, as in water body images in remote sensing and capillary and polyp images in the medical field, and we usually cannot obtain prior orientation information to guide these networks to extract more effective features. Additionally, learning the features of objects with multiple orientations is also challenging, as most CNN-based semantic segmentation networks lack the rotation equivariance needed to resist the disturbance from orientation information. To address these issues, in this paper, we first establish a universal convolution-group framework to more fully utilize the orientation information and make the networks rotation equivariant. Then, we mathematically construct the padding-based rotation equivariant convolution mode (PreCM), which can be used not only for multi-scale images and convolution kernels, but also as a replacement component for multiple convolutions, such as dilated convolution, transposed convolution, and variable stride convolution. To verify the realization of rotation equivariance, a new evaluation metric named rotation difference (RD) is finally proposed. The experiments carried out on the datasets Satellite Images of Water Bodies, DRIVE and Floodnet show that the PreCM-based networks can achieve better segmentation performance than the original and data augmentation-based networks. In terms of the average RD value, the former is 0% and the latter two are respectively 7.0503% and 3.2606%. Last but not least, PreCM also effectively enhances the robustness of networks to rotation perturbations.

* 14 pages, 14 figures, submitted to TIP
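
Rotation equivariance, the property this abstract targets, means that rotating the input and then applying an operator gives the same result as applying the operator and then rotating the output. The sketch below checks this for 90-degree rotations with a plain zero-padded 3x3 filter. It is only an illustration of the property: it is neither PreCM nor the paper's RD metric (which the abstract does not define), and the helper names `rot90`, `filter3x3`, and `rotation_gap` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def rot90(grid: Grid) -> Grid:
    """Rotate a 2D grid by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def filter3x3(grid: Grid, kernel: Grid) -> Grid:
    """Apply a 3x3 kernel (cross-correlation) with zero padding."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * grid[rr][cc]
            out[r][c] = acc
    return out


def rotation_gap(kernel: Grid, grid: Grid) -> float:
    """Largest absolute difference between filter(rot(x)) and rot(filter(x))."""
    a = filter3x3(rot90(grid), kernel)
    b = rot90(filter3x3(grid, kernel))
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


BOX = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]         # unchanged by 90-degree rotation
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # direction-dependent kernel
IMAGE = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

print(rotation_gap(BOX, IMAGE), rotation_gap(SOBEL_X, IMAGE))
```

The box kernel is itself invariant under 90-degree rotation, so its gap is zero, while the horizontal-gradient kernel produces a nonzero gap. An ordinary learned convolution generally behaves like the second case, which is why equivariance has to be built into the operator rather than assumed.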
