Qiang Nie

Can the Query-based Object Detector Be Designed with Fewer Stages?

Sep 28, 2023
Jialin Li, Weifu Fu, Yuhuan Lin, Qiang Nie, Yong Liu

Query-based object detectors have advanced significantly since the publication of DETR. However, most existing methods still rely on multi-stage encoders and decoders, or a combination of both. Despite achieving high accuracy, the multi-stage paradigm (typically six stages) imposes a heavy computational burden, prompting us to reconsider its necessity. In this paper, we explore multiple techniques for enhancing query-based detectors and, building on these findings, propose a novel model called GOLO (Global Once and Local Once), which follows a two-stage decoding paradigm. Compared with other mainstream query-based models that use multi-stage decoders, our model employs fewer decoder stages while still achieving competitive performance. Experimental results on the COCO dataset demonstrate the effectiveness of our approach.
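
To make the "global once, local once" idea concrete, here is a minimal PyTorch sketch of a two-stage query decoder: the queries attend to the feature map once globally and once more for refinement, in place of the usual six decoder stages. Module choices, shapes, and head designs are illustrative assumptions, not the GOLO implementation.

```python
# Minimal sketch of a two-stage query decoder (global pass + local pass).
# All module choices and shapes are illustrative, not the GOLO implementation.
import torch
import torch.nn as nn

class TwoStageQueryDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_heads=8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        # Stage 1: queries attend once to the full (global) feature map.
        self.global_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Stage 2: queries attend once more to refine their estimates.
        self.local_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h)
        self.cls_head = nn.Linear(d_model, 80)   # e.g. COCO classes

    def forward(self, feats):                     # feats: (B, HW, d_model)
        q = self.queries.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        q, _ = self.global_attn(q, feats, feats)  # "global once"
        q, _ = self.local_attn(q, feats, feats)   # "local once" (refinement)
        return self.box_head(q).sigmoid(), self.cls_head(q)

decoder = TwoStageQueryDecoder()
boxes, logits = decoder(torch.randn(2, 64 * 64, 256))
print(boxes.shape, logits.shape)  # torch.Size([2, 100, 4]) torch.Size([2, 100, 80])
```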


Semi-supervised Domain Adaptation with Inter and Intra-domain Mixing for Semantic Segmentation

Aug 30, 2023
Weifu Fu, Qiang Nie, Jialin Li, Yuhuan Lin, Kai Wu, Yong Liu, Chengjie Wang

Despite recent advances in semantic segmentation, an inevitable challenge is the performance degradation caused by domain shift in real-world applications. The current dominant approach to this problem is unsupervised domain adaptation (UDA). However, the absence of labeled target data in UDA is overly restrictive and limits performance. To overcome this limitation, a more practical scenario called semi-supervised domain adaptation (SSDA) has been proposed. Existing SSDA methods are derived from the UDA paradigm and primarily focus on leveraging the unlabeled target data and the source data. In this paper, we highlight the significance of exploiting the intra-domain information between the limited labeled target data and the unlabeled target data, as it greatly benefits domain adaptation. Instead of solely using the scarce labeled data for supervision, we propose a novel SSDA framework that incorporates both inter-domain mixing and intra-domain mixing, where inter-domain mixing mitigates the source-target domain gap and intra-domain mixing enriches the available target-domain information. By learning from inter-domain mixing and intra-domain mixing simultaneously, the network can capture more domain-invariant features and improve its performance on the target domain. We also explore different domain-mixing operations to better exploit the target-domain information. Comprehensive experiments on the GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes benchmarks demonstrate the effectiveness of our method, which surpasses previous methods by a large margin.

* 7 pages, 4 figures 
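
As a rough illustration of the two mixing directions, the sketch below uses a simple CutMix-style region swap as a stand-in for the paper's mixing operations (which it compares empirically): inter-domain mixing pastes source pixels into an unlabeled target image, while intra-domain mixing pastes labeled target pixels into one.

```python
# Illustrative sketch of the two mixing directions; a CutMix-style region swap
# stands in for the paper's mixing operations, which it explores empirically.
import torch

def cutmix(img_a, lbl_a, img_b, lbl_b, size=0.5):
    """Paste a random square region of (img_b, lbl_b) into (img_a, lbl_a)."""
    _, h, w = img_a.shape
    ph, pw = int(h * size), int(w * size)
    y = torch.randint(0, h - ph + 1, (1,)).item()
    x = torch.randint(0, w - pw + 1, (1,)).item()
    img, lbl = img_a.clone(), lbl_a.clone()
    img[:, y:y + ph, x:x + pw] = img_b[:, y:y + ph, x:x + pw]
    lbl[y:y + ph, x:x + pw] = lbl_b[y:y + ph, x:x + pw]
    return img, lbl

src_img, src_lbl = torch.rand(3, 512, 512), torch.randint(0, 19, (512, 512))
tgt_l_img, tgt_l_lbl = torch.rand(3, 512, 512), torch.randint(0, 19, (512, 512))
tgt_u_img = torch.rand(3, 512, 512)
tgt_u_pseudo = torch.randint(0, 19, (512, 512))  # pseudo-labels from a teacher

# Inter-domain mixing: source pixels pasted into an unlabeled target image.
inter_img, inter_lbl = cutmix(tgt_u_img, tgt_u_pseudo, src_img, src_lbl)
# Intra-domain mixing: labeled target pixels pasted into an unlabeled target image.
intra_img, intra_lbl = cutmix(tgt_u_img, tgt_u_pseudo, tgt_l_img, tgt_l_lbl)
```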

Distribution-Aware Calibration for Object Detection with Noisy Bounding Boxes

Aug 23, 2023
Donghao Zhou, Jialin Li, Jinpeng Li, Jiancheng Huang, Qiang Nie, Yong Liu, Bin-Bin Gao, Qiong Wang, Pheng-Ann Heng, Guangyong Chen

Large-scale, well-annotated datasets are of great importance for training an effective object detector. However, obtaining accurate bounding-box annotations is laborious and demanding, and the resulting noisy bounding boxes can corrupt the supervision signal and thus diminish detection performance. Motivated by the observation that the real ground truth usually lies in the aggregation region of the proposals assigned to a noisy ground-truth box, we propose DIStribution-aware CalibratiOn (DISCO), which models the spatial distribution of proposals to calibrate the supervision signal. In DISCO, spatial distribution modeling is performed to statistically extract the potential locations of objects. Based on the modeled distribution, three distribution-aware techniques, i.e., distribution-aware proposal augmentation (DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware confidence estimation (DA-Est), are developed to improve classification, localization, and interpretability, respectively. Extensive experiments on large-scale noisy image datasets (i.e., Pascal VOC and MS-COCO) demonstrate that DISCO achieves state-of-the-art detection performance, especially at high noise levels.

* 12 pages, 9 figures 
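
A toy sketch of the core observation, under the assumption (ours, for illustration) that the proposal cluster can be summarized by simple first- and second-order statistics: the cluster mean acts as a refined box (cf. DA-Ref) and the cluster tightness as a confidence score (cf. DA-Est).

```python
# Toy sketch: proposals assigned to a noisy box aggregate around the true
# object, so their statistics can calibrate it. The statistics used below
# are an assumption for illustration, not the DISCO implementation.
import torch

def calibrate(proposals):
    """proposals: (N, 4) boxes (x1, y1, x2, y2) assigned to one noisy GT."""
    mean = proposals.mean(dim=0)            # refined box estimate (cf. DA-Ref)
    std = proposals.std(dim=0)              # spread of the proposal cluster
    confidence = 1.0 / (1.0 + std.mean())   # tighter cluster -> higher confidence
    return mean, confidence                 # (cf. DA-Est)

props = torch.tensor([[48., 52., 148., 150.],
                      [50., 49., 152., 148.],
                      [53., 51., 149., 153.]])
box, conf = calibrate(props)
print(box, conf)  # a box near [50, 50, 150, 150] plus a cluster-tightness score
```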

NeRF-Loc: Visual Localization with Conditional Neural Radiance Field

Apr 17, 2023
Jianlin Liu, Qiang Nie, Yong Liu, Chengjie Wang

We propose a novel visual re-localization method based on direct matching between implicit 3D descriptors and the 2D image with a transformer. A conditional neural radiance field (NeRF) serves as the 3D scene representation in our pipeline, supporting continuous 3D descriptor generation and neural rendering. By unifying feature matching and scene-coordinate regression in the same framework, our model learns generalizable knowledge and scene-specific priors during two respective training stages. Furthermore, to improve localization robustness when a domain gap exists between the training and testing phases, we propose an appearance adaptation layer that explicitly aligns styles between the 3D model and the query image. Experiments show that our method achieves higher localization accuracy than other learning-based approaches on multiple benchmarks. Code is available at https://github.com/JenningsL/nerf-loc.

* accepted by ICRA 2023 
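
A hedged sketch of what direct 2D-3D matching with a transformer can look like: implicit 3D point descriptors cross-attend to 2D image features, and a dense similarity map over the fused descriptors yields candidate 2D-3D correspondences for pose estimation. The module layout and shapes are assumptions for illustration, not the released pipeline.

```python
# Hedged sketch of direct 2D-3D matching with a transformer: 3D point
# descriptors (e.g. sampled from a conditional NeRF) cross-attend to 2D
# image features; shapes and modules are illustrative assumptions.
import torch
import torch.nn as nn

class Matcher2D3D(nn.Module):
    def __init__(self, d_model=128, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, desc_3d, feat_2d):
        # desc_3d: (B, N, d) implicit 3D descriptors; feat_2d: (B, HW, d).
        fused, _ = self.cross_attn(desc_3d, feat_2d, feat_2d)
        # Dense 3D-point-to-pixel similarity; mutual nearest neighbors over
        # this map give 2D-3D correspondences for PnP-based pose estimation.
        return torch.einsum('bnd,bmd->bnm', fused, feat_2d)

sim = Matcher2D3D()(torch.randn(1, 1024, 128), torch.randn(1, 60 * 80, 128))
print(sim.shape)  # torch.Size([1, 1024, 4800])
```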

HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation

Feb 28, 2023
Kai Zhai, Qiang Nie, Bo Ouyang, Xiang Li, ShanLin Yang

2D-to-3D human pose lifting is fundamental to 3D human pose estimation (HPE). Graph convolutional networks (GCNs) have proven inherently suitable for modeling the human skeletal topology. However, current GCN-based 3D HPE methods update each node's features by aggregating its neighbors' information, without considering the interaction of joints under different motion patterns. Although some studies incorporate limb information to learn movement patterns, the latent synergies among joints, such as maintaining balance during motion, are seldom investigated. We propose a hop-wise GraphFormer with intragroup joint refinement (HopFIR) to tackle the 3D HPE problem. HopFIR consists mainly of a novel Hop-wise GraphFormer (HGF) module and an Intragroup Joint Refinement (IJR) module that leverages prior limb information to refine peripheral joints. The HGF module groups the joints by $k$-hop neighbors and applies a hop-wise transformer-like attention mechanism among these groups to discover latent joint synergies. Extensive experimental results show that HopFIR outperforms state-of-the-art (SOTA) methods by a large margin (on the Human3.6M dataset, a mean per-joint position error (MPJPE) of 32.67 mm). We further demonstrate that previous SOTA GCN-based methods can also benefit from the proposed hop-wise attention mechanism, with SemGCN and MGCN improved by 8.9% and 4.5%, respectively.
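
To illustrate the $k$-hop grouping behind the HGF module, the sketch below forms joint groups by hop distance over the skeleton adjacency with a plain BFS; how HopFIR builds its groups internally may differ, so treat this as an assumed reading.

```python
# Sketch of k-hop grouping: joints are grouped by hop distance in the
# skeleton graph before group-wise attention. The BFS below is an assumed
# reading of how such groups are formed, not the HopFIR source.
import torch

def hop_groups(adjacency, root, max_hops=3):
    """Return {hop: [joint indices]} by BFS over the skeleton adjacency."""
    n = adjacency.size(0)
    dist = torch.full((n,), -1, dtype=torch.long)
    dist[root] = 0
    frontier = [root]
    for hop in range(1, max_hops + 1):
        nxt = []
        for j in frontier:
            for k in torch.nonzero(adjacency[j]).flatten().tolist():
                if dist[k] < 0:
                    dist[k] = hop
                    nxt.append(k)
        frontier = nxt
    return {h: (dist == h).nonzero().flatten().tolist() for h in range(max_hops + 1)}

# A 5-joint toy chain: 0-1-2-3-4 (e.g. hip -> knee -> ankle -> ...).
A = torch.zeros(5, 5)
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[a, b] = A[b, a] = 1
print(hop_groups(A, root=0))  # {0: [0], 1: [1], 2: [2], 3: [3]}
```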


Rethinking Dimensionality Reduction in Grid-based 3D Object Detection

Sep 24, 2022
Dihe Huang, Ying Chen, Yikang Ding, Jinli Liao, Jianlin Liu, Kai Wu, Qiang Nie, Yong Liu, Chengjie Wang

The bird's-eye view (BEV) is widely adopted by most current point cloud detectors because it allows well-explored 2D detection techniques to be applied. However, existing methods obtain BEV features by simply collapsing voxel or point features along the height dimension, which causes a heavy loss of 3D spatial information. To alleviate this information loss, we propose a novel point cloud detection network based on a multi-level feature dimensionality reduction strategy, called MDRNet. In MDRNet, Spatial-aware Dimensionality Reduction (SDR) is designed to dynamically focus on the valuable parts of an object during the voxel-to-BEV feature transformation. Furthermore, Multi-level Spatial Residuals (MSR) are proposed to fuse multi-level spatial information in the BEV feature maps. Extensive experiments on nuScenes show that the proposed method outperforms state-of-the-art methods. The code will be available upon publication.

* Submitted to ICRA 2023 
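
One plausible reading of spatial-aware dimensionality reduction, sketched below as an assumption rather than the released code: instead of max- or mean-pooling the height axis, predict a per-height attention weight and collapse the voxel grid as a weighted sum.

```python
# Sketch of spatial-aware height collapse: predict per-height weights and
# take a weighted sum over the Z axis instead of max/mean pooling. This is
# an illustrative reading of SDR, not the released implementation.
import torch
import torch.nn as nn

class SpatialAwareCollapse(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.weight_net = nn.Conv3d(channels, 1, kernel_size=1)  # per-voxel score

    def forward(self, voxels):                          # voxels: (B, C, Z, Y, X)
        attn = self.weight_net(voxels).softmax(dim=2)   # normalize over height Z
        return (voxels * attn).sum(dim=2)               # BEV features: (B, C, Y, X)

bev = SpatialAwareCollapse()(torch.randn(2, 64, 10, 128, 128))
print(bev.shape)  # torch.Size([2, 64, 128, 128])
```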

Lifting 2D Human Pose to 3D with Domain Adapted 3D Body Concept

Nov 23, 2021
Qiang Nie, Ziwei Liu, Yunhui Liu

Lifting a 2D human pose to a 3D pose is an important yet challenging task. Existing 3D pose estimation methods suffer from 1) the inherent ambiguity between 2D and 3D data and 2) the lack of well-labeled 2D-3D pose pairs in the wild. Human beings can imagine a 3D human pose from a 2D image or a set of 2D body keypoints with minimal ambiguity, which should be attributed to the prior knowledge of the human body that we have acquired. Inspired by this, we propose a new framework that leverages labeled 3D human poses to learn a 3D concept of the human body and thereby reduce the ambiguity. To apply this body concept to 2D poses, our key insight is to treat the 2D human pose and the 3D human pose as two different domains. By adapting the two domains, the body knowledge learned from 3D poses is applied to 2D poses and guides the 2D pose encoder to generate an informative 3D "imagination" as the embedding for pose lifting. Benefiting from this domain-adaptation perspective, the proposed framework unifies supervised and semi-supervised 3D pose estimation in a principled way. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on standard benchmarks. More importantly, we validate that the explicitly learned 3D body concept effectively alleviates the 2D-3D ambiguity in 2D pose lifting, improves generalization, and enables the network to exploit abundant unlabeled 2D data.

* 15 pages, a paper submitted to IJCV 
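
A minimal sketch of the two-domain view, with placeholder architectures: a 2D-pose encoder and a 3D-pose encoder map into a shared "body concept" embedding, and an adversarial term pushes the 2D embedding to be indistinguishable from the 3D one so the body knowledge transfers.

```python
# Hedged sketch of the two-domain view: encoders for 2D and 3D poses share
# an embedding space, aligned adversarially. All architectures here are
# placeholders, not the paper's networks.
import torch
import torch.nn as nn

d = 128
enc_2d = nn.Sequential(nn.Linear(17 * 2, 256), nn.ReLU(), nn.Linear(256, d))
enc_3d = nn.Sequential(nn.Linear(17 * 3, 256), nn.ReLU(), nn.Linear(256, d))
lifter = nn.Linear(d, 17 * 3)                       # embedding -> 3D pose
disc = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))

pose_2d, pose_3d = torch.randn(8, 17 * 2), torch.randn(8, 17 * 3)
z2, z3 = enc_2d(pose_2d), enc_3d(pose_3d)
# Adversarial alignment: the 2D embedding should look like a 3D-pose
# embedding, so the 3D "body concept" transfers to the 2D encoder.
bce = nn.BCEWithLogitsLoss()
loss_align = bce(disc(z2), torch.ones(8, 1))        # fool the discriminator
loss_lift = (lifter(z2) - pose_3d).pow(2).mean()    # supervised lifting term
print(loss_align.item(), loss_lift.item())
```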

Unsupervised Human 3D Pose Representation with Viewpoint and Pose Disentanglement

Jul 14, 2020
Qiang Nie, Ziwei Liu, Yunhui Liu

Learning a good 3D human pose representation is important for pose-related tasks, e.g., 3D human pose estimation and action recognition. Across these problems, preserving the intrinsic pose information and adapting to view variations are two critical issues. In this work, we propose a novel Siamese denoising autoencoder that learns a 3D pose representation by disentangling pose-dependent and view-dependent features from human skeleton data in a fully unsupervised manner. The two disentangled features are used together as the representation of the 3D pose. To capture both kinematic and geometric dependencies, we further propose a sequential bidirectional recursive network (SeBiReNet) to model the human skeleton data. Extensive experiments demonstrate that the learned representation 1) preserves the intrinsic information of the human pose and 2) shows good transferability across datasets and tasks. Notably, our approach achieves state-of-the-art performance on two inherently different tasks: pose denoising and unsupervised action recognition. Code and models are available at https://github.com/NIEQiang001/unsupervised-human-pose.git

* To appear in ECCV 2020. Code and models are available at: https://github.com/NIEQiang001/unsupervised-human-pose.git 
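
A minimal sketch of the disentanglement idea, with illustrative shapes: the latent code is split into a pose part and a view part, and cross-reconstruction between two views of the same pose supplies the unsupervised signal that forces the split. The Siamese pairing and the recurrent SeBiReNet details are omitted here.

```python
# Minimal sketch of pose/view disentanglement in an autoencoder: one latent
# half encodes the pose, the other the viewpoint. Shapes are illustrative,
# and the Siamese/SeBiReNet machinery of the paper is omitted.
import torch
import torch.nn as nn

class DisentangledAE(nn.Module):
    def __init__(self, joints=17, d_pose=64, d_view=16):
        super().__init__()
        self.enc = nn.Linear(joints * 3, d_pose + d_view)
        self.dec = nn.Linear(d_pose + d_view, joints * 3)
        self.d_pose = d_pose

    def forward(self, skel):
        z = self.enc(skel)
        z_pose, z_view = z[:, :self.d_pose], z[:, self.d_pose:]
        return self.dec(torch.cat([z_pose, z_view], dim=1)), z_pose, z_view

ae = DisentangledAE()
view_a, view_b = torch.randn(4, 51), torch.randn(4, 51)  # same poses, two views
_, pose_a, _ = ae(view_a)
_, _, view_feat_b = ae(view_b)
# Cross reconstruction: pose code from view A + view code from view B should
# reconstruct view B, which is the training signal that forces the split.
recon = ae.dec(torch.cat([pose_a, view_feat_b], dim=1))
print(recon.shape)  # torch.Size([4, 51])
```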