Ailing Zeng

FreeMan: Towards Benchmarking 3D Human Pose Estimation in the Wild

Sep 12, 2023
Jiong Wang, Fengyu Yang, Wenbo Gou, Bingliang Li, Danqi Yan, Ailing Zeng, Yijun Gao, Junle Wang, Ruimao Zhang

Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. This task carries great importance for fields like AIGC and human-robot interaction. In practice, 3D human pose estimation in real-world settings is a critical initial step in solving this problem. However, current datasets, often collected under controlled laboratory conditions with complex motion capture equipment and unvarying backgrounds, are insufficient, and the absence of real-world datasets is stalling progress on this crucial task. To facilitate the development of 3D pose estimation, we present FreeMan, the first large-scale, real-world multi-view dataset. FreeMan was captured by synchronizing 8 smartphones across diverse scenarios. It comprises 11M frames from 8000 sequences, viewed from different perspectives. These sequences cover 40 subjects across 10 different scenarios, each with varying lighting conditions. We have also established an automated, precise labeling pipeline that enables efficient large-scale processing. We provide comprehensive evaluation baselines for a range of tasks, underlining the significant challenges posed by FreeMan. Further evaluations on standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes. FreeMan is now publicly available at https://wangjiongw.github.io/freeman.
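
The automatic labeling pipeline summarized above builds 3D poses from the 8 synchronized, calibrated smartphone views. As a rough illustration of the core geometric step such pipelines rely on (the paper's actual pipeline also involves 2D detection, cross-view association, and temporal smoothing), here is a minimal multi-view DLT triangulation sketch; the function name and inputs are assumptions, not the FreeMan API.

    import numpy as np

    def triangulate_joint(points_2d, proj_mats):
        """Linear (DLT) triangulation of one joint observed by N calibrated cameras.

        points_2d: list of (x, y) pixel coordinates, one per camera view.
        proj_mats: list of 3x4 camera projection matrices, one per view.
        """
        A = []
        for (x, y), P in zip(points_2d, proj_mats):
            A.append(x * P[2] - P[0])  # each view contributes two linear constraints
            A.append(y * P[2] - P[1])
        _, _, vt = np.linalg.svd(np.asarray(A))
        X = vt[-1]                     # homogeneous 3D point: last right singular vector
        return X[:3] / X[3]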

* 18 pages, 9 figures. Project page: https://wangjiongw.github.io/freeman/ ; API: https://github.com/wangjiongw/FreeMan_API 

Neural Interactive Keypoint Detection

Aug 20, 2023
Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, Lei Zhang

This work proposes an end-to-end neural interactive keypoint detection framework named Click-Pose, which reduces the labeling cost of 2D keypoint annotation by more than 10 times compared with manual-only annotation. Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct the predicted keypoints in an interactive way for a faster and more effective annotation process. Specifically, we design a pose error modeling strategy that feeds the ground-truth pose combined with four typical pose errors into the decoder and trains the model to reconstruct the correct poses, which enhances the self-correction ability of the model. Then, we attach an interactive human-feedback loop that receives users' clicks to correct one or several predicted keypoints and iteratively uses the decoder to update all other keypoints, targeting a minimum number of clicks (NoC) for efficient annotation. We validate Click-Pose on in-domain and out-of-domain scenes, as well as a new task of keypoint adaptation. For annotation, Click-Pose only needs 1.97 and 6.45 NoC@95 (at precision 95%) on COCO and Human-Art, reducing annotation effort by 31.4% and 36.3% compared with the SOTA model (ViTPose) with manual correction, respectively. Besides, without user clicks, Click-Pose surpasses the previous end-to-end model by 1.4 AP on COCO and 3.0 AP on Human-Art. The code is available at https://github.com/IDEA-Research/Click-Pose.
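
A minimal sketch of the human-in-the-loop annotation cycle described above; the model and annotator methods used here (predict_pose, refine_with_clicks, worst_keypoint, correct) are hypothetical placeholders rather than the released Click-Pose API.

    def annotate_image(image, model, annotator, tolerance=0.95, max_clicks=10):
        """Interactive annotation loop: predict, let the user fix the worst
        keypoint, and let the decoder propagate that correction to the rest."""
        keypoints = model.predict_pose(image)             # initial end-to-end prediction
        clicks = []
        for _ in range(max_clicks):
            bad = annotator.worst_keypoint(image, keypoints, tolerance)
            if bad is None:                               # all keypoints within tolerance
                break
            clicks.append(annotator.correct(image, bad))  # one user click = one corrected joint
            # Conditioning the decoder on the corrected keypoints lets a single
            # click also move neighbouring joints, which is what keeps NoC low.
            keypoints = model.refine_with_clicks(image, keypoints, clicks)
        return keypoints, len(clicks)                     # len(clicks) is the NoC for this image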

* Accepted to ICCV 2023 

Effective Whole-body Pose Estimation with Two-stages Distillation

Jul 29, 2023
Zhendong Yang, Ailing Zeng, Chun Yuan, Yu Li

Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an image. This task is challenging due to multi-scale body parts, fine-grained localization in low-resolution regions, and data scarcity. Meanwhile, there is an urgent need for highly efficient and accurate pose estimators that can serve a wide range of human-centric understanding and generation tasks. In this work, we present a two-stage pose Distillation framework for Whole-body Pose estimators, named DWPose, to improve their effectiveness and efficiency. The first-stage distillation uses a weight-decay strategy while leveraging a teacher's intermediate features and final logits, with both visible and invisible keypoints, to supervise the student from scratch. The second stage distills the student model from itself to further improve performance. Different from previous self-knowledge distillation, this stage finetunes the student's head with only 20% of the training time, as a plug-and-play training strategy. To address data limitations, we explore the UBody dataset, which contains diverse facial expressions and hand gestures for real-life applications. Comprehensive experiments show the superiority of our proposed simple yet effective methods. We achieve new state-of-the-art performance on COCO-WholeBody, significantly boosting the whole-body AP of RTMPose-l from 64.8% to 66.5%, even surpassing the RTMPose-x teacher with 65.3% AP. We release a series of models of different sizes, from tiny to large, to satisfy various downstream tasks. Our codes and models are available at https://github.com/IDEA-Research/DWPose.
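
A hedged sketch of the two training stages described above; the tensor shapes, loss choices (plain MSE), weighting factors, and linear decay schedule are illustrative assumptions, not the released DWPose training code.

    import torch.nn.functional as F

    def stage1_distill_loss(s_feat, t_feat, s_logits, t_logits, target,
                            step, total_steps, alpha=1.0, beta=1.0):
        """Stage 1: the student is trained from scratch with its task loss plus
        imitation of the teacher's intermediate features and final logits
        (covering visible and invisible keypoints); the distillation terms are
        decayed towards zero over training (the weight-decay strategy)."""
        decay = 1.0 - step / total_steps
        task = F.mse_loss(s_logits, target)
        feat = F.mse_loss(s_feat, t_feat.detach())
        logit = F.mse_loss(s_logits, t_logits.detach())
        return task + decay * (alpha * feat + beta * logit)

    def stage2_head_selfdistill_loss(head_logits, self_teacher_logits):
        """Stage 2: with the backbone frozen, only the student's head is
        finetuned against the converged student's own predictions (about 20%
        of the original training time)."""
        return F.mse_loss(head_logits, self_teacher_logits.detach())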

* Codes and models are available at https://github.com/IDEA-Research/DWPose 

FITS: Modeling Time Series with $10k$ Parameters

Jul 06, 2023
Zhijian Xu, Ailing Zeng, Qiang Xu

In this paper, we introduce FITS, a lightweight yet powerful model for time series analysis. Unlike existing models that directly process raw time-domain data, FITS operates on the principle that time series can be manipulated through interpolation in the complex frequency domain. By discarding high-frequency components with negligible impact on the time series, FITS achieves performance comparable to state-of-the-art models for time series forecasting and anomaly detection tasks, while having a remarkably compact size of only approximately 10k parameters. Such a lightweight model can be easily trained and deployed on edge devices, creating opportunities for various applications. The anonymous code repository is available at https://anonymous.4open.science/r/FITS
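
A minimal sketch of the frequency-domain interpolation idea summarized above, assuming a single-channel series and a fixed low-pass cutoff; the layer names and scaling details are illustrative and differ from the official FITS release.

    import torch
    import torch.nn as nn

    class FITSSketch(nn.Module):
        """Low-pass filter the input spectrum, then interpolate the kept complex
        coefficients with one complex-valued linear layer so that the inverse
        FFT returns the input extended by the forecast horizon."""

        def __init__(self, seq_len, pred_len, cutoff):
            super().__init__()
            self.seq_len, self.pred_len = seq_len, pred_len
            self.cutoff = cutoff                                   # kept rFFT bins
            out_bins = int(cutoff * (seq_len + pred_len) / seq_len)
            self.upsample = nn.Linear(cutoff, out_bins, dtype=torch.cfloat)

        def forward(self, x):                         # x: (batch, seq_len)
            spec = torch.fft.rfft(x, dim=-1)          # complex spectrum
            spec = spec[..., : self.cutoff]           # discard high frequencies
            spec = self.upsample(spec)                # complex-valued interpolation
            out_len = self.seq_len + self.pred_len
            y = torch.fft.irfft(spec, n=out_len, dim=-1)
            return y * (out_len / self.seq_len)       # compensate for the length change

    model = FITSSketch(seq_len=96, pred_len=96, cutoff=24)
    print(model(torch.randn(8, 96)).shape)            # torch.Size([8, 192])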

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

Jul 03, 2023
Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang

In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected in limited laboratory scenes with manually labeled textual descriptions, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline that can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video as well as fine-grained whole-body pose descriptions for each frame. This pipeline is highly precise, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 13.7M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 96K motion sequences from a large variety of scenes. In addition, Motion-X provides 13.7M frame-level whole-body pose descriptions and 96K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.
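
For readers unfamiliar with the annotation format, the sketch below shows the parameters that make up one frame of a standard SMPL-X whole-body annotation like those Motion-X provides; the field names and dimensions follow the common SMPL-X parameterization, while the dataset's actual on-disk layout may differ.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SMPLXFrame:
        """One frame of a whole-body pose annotation in SMPL-X parameters."""
        transl: np.ndarray           # (3,)  root translation in metres
        global_orient: np.ndarray    # (3,)  root rotation, axis-angle
        body_pose: np.ndarray        # (63,) 21 body joints x 3 (axis-angle)
        left_hand_pose: np.ndarray   # (45,) 15 finger joints x 3
        right_hand_pose: np.ndarray  # (45,)
        jaw_pose: np.ndarray         # (3,)  drives mouth opening
        expression: np.ndarray       # (10,) facial expression coefficients
        betas: np.ndarray            # (10,) body shape coefficients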

* A large-scale 3D whole-body human motion-text dataset; GitHub: https://github.com/IDEA-Research/Motion-X 

detrex: Benchmarking Detection Transformers

Jun 13, 2023
Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang

The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the field currently lacks a unified and comprehensive benchmark specifically tailored to DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports the majority of mainstream DETR-based instance recognition algorithms, covering various fundamental tasks including object detection, segmentation, and pose estimation. We conduct extensive experiments with detrex and perform a comprehensive benchmark of DETR-based models. Moreover, we enhance the performance of detection transformers by refining training hyper-parameters, providing strong baselines for the supported algorithms. We hope that detrex can offer research communities a standardized and unified platform to evaluate and compare different DETR-based models while fostering a deeper understanding and driving advancements in DETR-based instance recognition. Our code is available at https://github.com/IDEA-Research/detrex. The project is under active development, and we encourage the community to use the detrex codebase for further development and contributions.

* project link: https://github.com/IDEA-Research/detrex 

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars

May 21, 2023
Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, Lei Zhang

We present DreamWaltz, a novel framework for generating and animating complex avatars given text guidance and a parametric human body prior. While recent methods have shown encouraging results in text-to-3D generation of common objects, creating high-quality, animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent, occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning and enables complex avatar generation free of artifacts such as multiple faces. For animation, our method learns an animatable and generalizable avatar representation that can map arbitrary poses to the canonical pose representation. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions.
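
A minimal sketch of skeleton-conditioned SDS as described above; the diffusion interface (add_noise, predict_noise) and the weighting function are placeholders standing in for a pretrained, pose-conditioned diffusion model, not the DreamWaltz code.

    import torch

    def sds_gradient(diffusion, rendered_rgb, text_emb, skeleton_map, t, weight_fn):
        """Score Distillation Sampling step for one rendered view of the avatar.

        rendered_rgb: differentiably rendered image of the implicit avatar.
        skeleton_map: the 3D-aware skeleton rendered from the same camera,
                      which keeps the diffusion guidance view-aligned and
                      suppresses multi-face artifacts."""
        noise = torch.randn_like(rendered_rgb)
        noisy = diffusion.add_noise(rendered_rgb, noise, t)
        with torch.no_grad():
            eps_pred = diffusion.predict_noise(noisy, t, text_emb, skeleton_map)
        # This gradient is back-propagated through the renderer to the
        # implicit representation's parameters.
        return weight_fn(t) * (eps_pred - noise)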

Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model

May 20, 2023
Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, Ruimao Zhang

This paper investigates the limitations of current HOI detection methods and introduces DiffHOI, a novel HOI detection scheme grounded on a pre-trained text-to-image diffusion model, which enhances the detector's performance via improved data diversity and HOI representation. We demonstrate that the internal representation space of a frozen text-to-image diffusion model is highly relevant to verb concepts and their corresponding context. Accordingly, we propose an adapter-style tuning method that extracts semantically associated representations from a frozen diffusion model and a CLIP model to enhance the human and object representations from the pre-trained detector, further reducing ambiguity in interaction prediction. Moreover, to fill the gaps in existing HOI datasets, we propose SynHOI, a class-balanced, large-scale, and high-diversity synthetic dataset containing over 140K HOI images with full triplet annotations. It is built using an automatic and scalable pipeline designed to scale up the generation of diverse and high-precision HOI-annotated data. SynHOI can effectively relieve the long-tail issue in existing datasets and facilitate learning interaction representations. Extensive experiments demonstrate that DiffHOI significantly outperforms the state of the art in regular detection (i.e., 41.50 mAP) and zero-shot detection. Furthermore, SynHOI improves HOI detection performance in a model-agnostic and backbone-agnostic way, most notably delivering an 11.55% mAP improvement on rare classes.
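
A hedged sketch of the adapter-style fusion described above: features pooled from the frozen diffusion and CLIP backbones are projected by a small trainable adapter and added to the detector's human/object representations. The class name, dimensions, and residual fusion rule are illustrative assumptions, not the DiffHOI implementation.

    import torch.nn as nn

    class SemanticAdapter(nn.Module):
        """Trainable adapter that injects frozen diffusion/CLIP semantics into
        detector queries while both large backbones stay frozen."""

        def __init__(self, frozen_dim, det_dim, hidden=256):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(frozen_dim, hidden),
                nn.GELU(),
                nn.Linear(hidden, det_dim),
            )

        def forward(self, det_queries, frozen_feats):
            # det_queries:  (B, Q, det_dim)    human/object queries from the detector
            # frozen_feats: (B, Q, frozen_dim) features pooled per query box from
            #               the frozen diffusion / CLIP models
            return det_queries + self.proj(frozen_feats)   # residual enrichment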

A Strong and Reproducible Object Detector with Only Public Datasets

Apr 25, 2023
Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang

This work presents Focal-Stable-DINO, a strong and reproducible object detection model that achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters, without any test-time augmentation. It explores the combination of the powerful FocalNet-Huge backbone with the effective Stable-DINO detector. Different from existing SOTA models that utilize an extensive number of parameters and complex training techniques on large-scale private or merged data, our model is exclusively trained on the publicly available dataset Objects365, which ensures the reproducibility of our approach.

* 64.8 AP on COCO test-dev 

HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation

Apr 09, 2023
Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, Qiang Xu

Controllable human image generation (HIG) has numerous real-life applications. State-of-the-art solutions, such as ControlNet and T2I-Adapter, introduce an additional learnable branch on top of the frozen pre-trained stable diffusion (SD) model, which can enforce various conditions, including skeleton guidance for HIG. While such a plug-and-play approach is appealing, the inevitable and uncertain conflicts between the original images produced by the frozen SD branch and the given condition pose significant challenges for the learnable branch, which essentially performs image feature editing to enforce the condition. In this work, we propose a native skeleton-guided diffusion model for controllable HIG called HumanSD. Instead of performing image editing with dual-branch diffusion, we fine-tune the original SD model using a novel heatmap-guided denoising loss. This strategy effectively and efficiently strengthens the given skeleton condition during model training while mitigating catastrophic forgetting. HumanSD is fine-tuned on an assembly of three large-scale human-centric datasets with text-image-pose information, two of which are established in this work. As shown in Figure 1 of the paper, HumanSD outperforms ControlNet in terms of accurate pose control and image quality, particularly when the given skeleton guidance is sophisticated.
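
A minimal sketch of a heatmap-guided denoising loss in the spirit described above: the standard diffusion MSE is reweighted so that regions covered by the conditioning skeleton contribute more. The weighting formula and the scale factor are illustrative assumptions, not the paper's exact definition.

    import torch.nn.functional as F

    def heatmap_guided_denoising_loss(eps_pred, eps_target, keypoint_heatmap, scale=4.0):
        """eps_pred / eps_target: predicted and true noise in latent space, (B, C, H, W).
        keypoint_heatmap: skeleton heatmap rendered from the pose condition,
        resized to (B, 1, H, W) with values in [0, 1]."""
        weight = 1.0 + scale * keypoint_heatmap          # emphasize human / skeleton regions
        per_pixel = F.mse_loss(eps_pred, eps_target, reduction="none")
        return (weight * per_pixel).mean()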
