Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Changyong Shu

FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection

Nov 14, 2025

Jiangyong Yu, Changyong Shu, Sifan Zhou, Zichen Yu, Xing Hu, Yan Chen, Dawei Yang

Abstract:Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features-specifically, image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g. PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.

* I made an operational error. I intended to update the paper with Identifier arXiv:2502.15488, not submit a new paper with a different identifier. Therefore, I would like to withdraw the current submission and resubmit an updated version for Identifier arXiv:2502.15488

Via

Access Paper or Ask Questions

A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

May 23, 2025

Xiaobao Wei, Jiawei Liu, Dongbo Yang, Junda Cheng, Changyong Shu, Wei Wang

Figure 1 for A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

Figure 2 for A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

Figure 3 for A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

Figure 4 for A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

Abstract:We find that the EPE evaluation metrics of RAFT-stereo converge inconsistently in the low and high frequency regions, resulting high frequency degradation (e.g., edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that it optimizes all frequency components together without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high and low frequency components using discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter to provide adaptive refined high-frequency features at different iteration steps by fine-tuning the initial high-frequency features. By processing high and low frequency components separately, our framework can simultaneously refine high-frequency information in edges and low-frequency information in smooth regions, which is especially suitable for challenging scenes with fine details and textures in the distance. Extensive experiments demonstrate that our Wavelet-Stereo outperforms the state-of-the-art methods and ranks 1st on both the KITTI 2015 and KITTI 2012 leaderboards for almost all metrics. We will provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (https://github.com/SIA-IDE/Wavelet-Stereo).

Via

Access Paper or Ask Questions

Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection

Feb 21, 2025

Jiangyong Yu, Changyong Shu, Dawei Yang, Zichen Yu, Xing Hu, Yan Chen

Figure 1 for Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection

Figure 2 for Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection

Figure 3 for Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection

Figure 4 for Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection

Abstract:PETR-based methods have dominated benchmarks in 3D perception and are increasingly becoming a key component in modern autonomous driving systems. However, their quantization performance significantly degrades when INT8 inference is required, with a degradation of 58.2% in mAP and 36.9% in NDS on the NuScenes dataset. To address this issue, we propose a quantization-aware position embedding transformation for multi-view 3D object detection, termed Q-PETR. Q-PETR offers a quantizationfriendly and deployment-friendly architecture while preserving the original performance of PETR. It substantially narrows the accuracy gap between INT8 and FP32 inference for PETR-series methods. Without bells and whistles, our approach reduces the mAP and NDS drop to within 1% under standard 8-bit per-tensor post-training quantization. Furthermore, our method exceeds the performance of the original PETR in terms of floating-point precision. Extensive experiments across a variety of PETR-series models demonstrate its broad generalization.

Via

Access Paper or Ask Questions

GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting

Dec 19, 2024

Qianpu Sun, Changyong Shu, Sifan Zhou, Zichen Yu, Yan Chen, Dawei Yang, Yuan Chun

Figure 1 for GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting

Figure 2 for GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting

Figure 3 for GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting

Figure 4 for GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting

Abstract:3D occupancy perception is gaining increasing attention due to its capability to offer detailed and precise environment representations. Previous weakly-supervised NeRF methods balance efficiency and accuracy, with mIoU varying by 5-10 points due to sampling count along camera rays. Recently, real-time Gaussian splatting has gained widespread popularity in 3D reconstruction, and the occupancy prediction task can also be viewed as a reconstruction task. Consequently, we propose GSRender, which naturally employs 3D Gaussian Splatting for occupancy prediction, simplifying the sampling process. In addition, the limitations of 2D supervision result in duplicate predictions along the same camera ray. We implemented the Ray Compensation (RC) module, which mitigates this issue by compensating for features from adjacent frames. Finally, we redesigned the loss to eliminate the impact of dynamic objects from adjacent frames. Extensive experiments demonstrate that our approach achieves SOTA (state-of-the-art) results in RayIoU (+6.0), while narrowing the gap with 3D supervision methods. Our code will be released soon.

Via

Access Paper or Ask Questions

UltimateDO: An Efficient Framework to Marry Occupancy Prediction with 3D Object Detection via Channel2height

Sep 17, 2024

Zichen Yu, Changyong Shu

Figure 1 for UltimateDO: An Efficient Framework to Marry Occupancy Prediction with 3D Object Detection via Channel2height

Figure 2 for UltimateDO: An Efficient Framework to Marry Occupancy Prediction with 3D Object Detection via Channel2height

Figure 3 for UltimateDO: An Efficient Framework to Marry Occupancy Prediction with 3D Object Detection via Channel2height

Figure 4 for UltimateDO: An Efficient Framework to Marry Occupancy Prediction with 3D Object Detection via Channel2height

Abstract:Occupancy and 3D object detection are characterized as two standard tasks in modern autonomous driving system. In order to deploy them on a series of edge chips with better precision and time-consuming trade-off, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from deployment difficulties (i.e., 3D convolution, transformer and so on) or deficiencies in task coordination. Instead, we argue that a favorable framework should be devised in pursuit of ease deployment on diverse chips and high precision with little time-consuming. Oriented at this, we revisit the paradigm for interaction between 3D object detection and occupancy prediction, reformulate the model with 2D convolution and prioritize the tasks such that each contributes to other. Thus, we propose a method to achieve fast 3D object detection and occupancy prediction (UltimateDO), wherein the light occupancy prediction head in FlashOcc is married to 3D object detection network, with negligible additional timeconsuming of only 1.1ms while facilitating each other. We instantiate UltimateDO on the challenging nuScenes-series benchmarks.

Via

Access Paper or Ask Questions

Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center

Jun 15, 2024

Zichen Yu, Changyong Shu, Qianpu Sun, Junjie Linghu, Xiaobao Wei, Jiangyong Yu, Zongdai Liu, Dawei Yang, Hui Li, Yan Chen

Abstract:Panoptic occupancy poses a novel challenge by aiming to integrate instance occupancy and semantic occupancy within a unified framework. However, there is still a lack of efficient solutions for panoptic occupancy. In this paper, we propose Panoptic-FlashOcc, a straightforward yet robust 2D feature framework that enables realtime panoptic occupancy. Building upon the lightweight design of FlashOcc, our approach simultaneously learns semantic occupancy and class-aware instance clustering in a single network, these outputs are jointly incorporated through panoptic occupancy procession for panoptic occupancy. This approach effectively addresses the drawbacks of high memory and computation requirements associated with three-dimensional voxel-level representations. With its straightforward and efficient design that facilitates easy deployment, Panoptic-FlashOcc demonstrates remarkable achievements in panoptic occupancy prediction. On the Occ3D-nuScenes benchmark, it achieves exceptional performance, with 38.5 RayIoU and 29.1 mIoU for semantic occupancy, operating at a rapid speed of 43.9 FPS. Furthermore, it attains a notable score of 16.0 RayPQ for panoptic occupancy, accompanied by a fast inference speed of 30.2 FPS. These results surpass the performance of existing methodologies in terms of both speed and accuracy. The source code and trained models can be found at the following github repository: https://github.com/Yzichen/FlashOCC.

Via

Access Paper or Ask Questions

FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

Nov 18, 2023

Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, Yan Chen

Figure 1 for FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

Figure 2 for FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

Figure 3 for FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

Figure 4 for FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

Abstract:Given the capability of mitigating the long-tail deficiencies and intricate-shaped absence prevalent in 3D object detection, occupancy prediction has become a pivotal component in autonomous driving systems. However, the procession of three-dimensional voxel-level representations inevitably introduces large overhead in both memory and computation, obstructing the deployment of to-date occupancy prediction approaches. In contrast to the trend of making the model larger and more complicated, we argue that a desirable framework should be deployment-friendly to diverse chips while maintaining high precision. To this end, we propose a plug-and-play paradigm, namely FlashOCC, to consolidate rapid and memory-efficient occupancy prediction while maintaining high precision. Particularly, our FlashOCC makes two improvements based on the contemporary voxel-level occupancy prediction approaches. Firstly, the features are kept in the BEV, enabling the employment of efficient 2D convolutional layers for feature extraction. Secondly, a channel-to-height transformation is introduced to lift the output logits from the BEV into the 3D space. We apply the FlashOCC to diverse occupancy prediction baselines on the challenging Occ3D-nuScenes benchmarks and conduct extensive experiments to validate the effectiveness. The results substantiate the superiority of our plug-and-play paradigm over previous state-of-the-art methods in terms of precision, runtime efficiency, and memory costs, demonstrating its potential for deployment. The code will be made available.

* 10 pages, 4 figures

Via

Access Paper or Ask Questions

3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Nov 27, 2022

Changyong Shu, Fisher Yu, Yifan Liu

Figure 1 for 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Figure 2 for 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Figure 3 for 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Figure 4 for 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Abstract:Multi-camera 3D object detection, a critical component for vision-only driving systems, has achieved impressive progress. Notably, transformer-based methods with 2D features augmented by 3D positional encodings (PE) have enjoyed great success. However, the mechanism and options of 3D PE have not been thoroughly explored. In this paper, we first explore, analyze and compare various 3D positional encodings. In particular, we devise 3D point PE and show its superior performance since more precise positioning may lead to superior 3D detection. In practice, we utilize monocular depth estimation to obtain the 3D point positions for multi-camera 3D object detection. The PE with estimated 3D point locations can bring significant improvements compared to the commonly used camera-ray PE. Among DETR-based strategies, our method achieves state-of-the-art 45.6 mAP and 55.1 NDS on the competitive nuScenes valuation set. It's the first time that the performance gap between the vision-only (DETR-based) and LiDAR-based methods is reduced within 5\% mAP and 6\% NDS.

* 10 pages, 7 figures

Via

Access Paper or Ask Questions

Few-Shot Head Swapping in the Wild

Apr 27, 2022

Changyong Shu, Hemao Wu, Hang Zhou, Jiaming Liu, Zhibin Hong, Changxing Ding, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Figure 1 for Few-Shot Head Swapping in the Wild

Figure 2 for Few-Shot Head Swapping in the Wild

Figure 3 for Few-Shot Head Swapping in the Wild

Figure 4 for Few-Shot Head Swapping in the Wild

Abstract:The head swapping task aims at flawlessly placing a source head onto a target body, which is of great importance to various entertainment scenarios. While face swapping has drawn much attention, the task of head swapping has rarely been explored, particularly under the few-shot setting. It is inherently challenging due to its unique needs in head modeling and background blending. In this paper, we present the Head Swapper (HeSer), which achieves few-shot head swapping in the wild through two delicately designed modules. Firstly, a Head2Head Aligner is devised to holistically migrate pose and expression information from the target to the source head by examining multi-scale information. Secondly, to tackle the challenges of skin color variations and head-background mismatches in the swapping procedure, a Head2Scene Blender is introduced to simultaneously modify facial skin color and fill mismatched gaps in the background around the head. Particularly, seamless blending is achieved with the help of a Semantic-Guided Color Reference Creation procedure and a Blending UNet. Extensive experiments demonstrate that the proposed method produces superior head swapping results in a variety of scenes.

* Accepted to CVPR 2022 as Oral. Demo videos and code are available at https://jmliu88.github.io/HeSer

Via

Access Paper or Ask Questions

Channel-wise Distillation for Semantic Segmentation

Nov 26, 2020

Changyong Shu, Yifan Liu, Jianfei Gao, Lin Xu, Chunhua Shen

Figure 1 for Channel-wise Distillation for Semantic Segmentation

Figure 2 for Channel-wise Distillation for Semantic Segmentation

Figure 3 for Channel-wise Distillation for Semantic Segmentation

Figure 4 for Channel-wise Distillation for Semantic Segmentation

Abstract:Knowledge distillation (KD) has been proven to be a simple and effective tool for training compact models. Almost all KD variants for semantic segmentation align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancy. Observing that in semantic segmentation, some layers' feature activations of each channel tend to encode saliency of scene categories (analogue to class activation mapping), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a distribution using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence of the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft distributions of channels between networks. In particular, the KL divergence enables learning to pay more attention to the most salient regions of the channel-wise maps, presumably corresponding to the most useful signals for semantic segmentation. Experiments demonstrate that our channel-wise distillation outperforms almost all existing spatial distillation methods for semantic segmentation considerably, and requires less computational cost during training. We consistently achieve superior performance on three benchmarks with various network structures. Code is available at: https://git.io/ChannelDis

* Code is available at: https://git.io/ChannelDis

Via

Access Paper or Ask Questions