Xuansong Xie

Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization

Sep 07, 2023
Tao Yang, Peiran Ren, Xuansong Xie, Lei Zhang

Realistic image super-resolution (Real-ISR) aims to reproduce perceptually realistic image details from a low-quality input. Commonly used adversarial-training-based Real-ISR methods often introduce unnatural visual artifacts and fail to generate realistic textures for natural scene images. The recently developed generative stable diffusion models provide a potential solution to Real-ISR with their strong pre-learned image priors. However, existing methods along this line either fail to keep faithful pixel-wise image structures or resort to extra skip connections to reproduce details, which requires additional training in image space and limits their extension to other related tasks in latent space such as image stylization. In this work, we propose a pixel-aware stable diffusion (PASD) network to achieve robust Real-ISR as well as personalized stylization. Specifically, a pixel-aware cross-attention module is introduced to enable diffusion models to perceive local image structures at the pixel level, while a degradation removal module extracts degradation-insensitive features to guide the diffusion process together with high-level image information. By simply replacing the base diffusion model with a personalized one, our method can generate diverse stylized images without the need to collect pairwise training data. PASD can be easily integrated into existing diffusion models such as Stable Diffusion. Experiments on Real-ISR and personalized stylization demonstrate the effectiveness of the proposed approach. The source code and models can be found at \url{https://github.com/yangxy/PASD}.
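
Below is a minimal sketch of the pixel-aware cross-attention idea, with illustrative module names and tensor shapes that are assumptions rather than the released PASD code: latent diffusion features act as queries and attend to pixel-level features extracted from the low-quality input, injecting local structure guidance as a residual.

    # Minimal sketch (not the authors' code) of a pixel-aware cross-attention layer.
    import torch
    import torch.nn as nn

    class PixelAwareCrossAttention(nn.Module):
        def __init__(self, latent_dim=320, pixel_dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
            self.to_kv = nn.Linear(pixel_dim, latent_dim)  # project pixel features to latent width
            self.norm = nn.LayerNorm(latent_dim)

        def forward(self, latent_tokens, pixel_tokens):
            # latent_tokens: (B, N_latent, latent_dim) from the diffusion U-Net
            # pixel_tokens:  (B, N_pixel, pixel_dim) from a degradation-removal encoder
            kv = self.to_kv(pixel_tokens)
            out, _ = self.attn(self.norm(latent_tokens), kv, kv)
            return latent_tokens + out  # residual injection of pixel-level guidance

    x = torch.randn(2, 64 * 64, 320)    # latent tokens (e.g. a 64x64 latent grid)
    p = torch.randn(2, 128 * 128, 256)  # pixel-level feature tokens
    print(PixelAwareCrossAttention()(x, p).shape)  # torch.Size([2, 4096, 320])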

Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

Sep 06, 2023
Hanbing Liu, Wangmeng Xiang, Jun-Yan He, Zhi-Qi Cheng, Bin Luo, Yifeng Geng, Xuansong Xie

Accurately estimating the 3D pose of humans in video sequences requires both precise temporal modeling and a well-structured architecture. Building on the success of transformers, we introduce the Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends intra-block temporal modeling via its Temporal Pyramidal Compression-and-Amplification (TPCA) structure and refines inter-block feature interaction with a Cross-Layer Refinement (XLR) module. In particular, the TPCA block exploits a temporal pyramid paradigm, reinforcing key and value representation capabilities and seamlessly extracting spatial semantics from motion sequences. We stitch these TPCA blocks together with XLR, which promotes rich semantic representation through continuous interaction of queries, keys, and values. This strategy integrates early-stage information with the current flow, addressing the deficits in detail and stability seen in other transformer-based methods. We demonstrate the effectiveness of RTPCA by achieving state-of-the-art results on the Human3.6M, HumanEva-I, and MPI-INF-3DHP benchmarks with minimal computational overhead. The source code is available at https://github.com/hbing-l/RTPCA.
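
Below is a minimal sketch, under assumed shapes and names rather than the released RTPCA code, of how a temporal pyramidal compression-and-amplification attention block could look: keys and values are pooled to coarser temporal scales (compression), attended to at each scale, and fused back at full temporal resolution (amplification).

    # Illustrative sketch of pyramidal key/value compression in temporal attention.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TemporalPyramidAttention(nn.Module):
        def __init__(self, dim=256, heads=8, scales=(1, 2, 4)):
            super().__init__()
            self.scales = scales
            self.attn = nn.ModuleList(
                [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales])
            self.fuse = nn.Linear(dim * len(scales), dim)

        def forward(self, x):  # x: (B, T, dim) per-frame pose tokens
            outs = []
            for s, attn in zip(self.scales, self.attn):
                # compress keys/values along time by average pooling with window s
                kv = x if s == 1 else F.avg_pool1d(x.transpose(1, 2), s).transpose(1, 2)
                out, _ = attn(x, kv, kv)  # queries stay at full temporal resolution
                outs.append(out)
            return x + self.fuse(torch.cat(outs, dim=-1))  # fuse scales back (amplification)

    frames = torch.randn(2, 81, 256)  # e.g. an 81-frame pose sequence embedded to 256-d
    print(TemporalPyramidAttention()(frames).shape)  # torch.Size([2, 81, 256])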

* 11 pages, 5 figures 

FaceChain: A Playground for Identity-Preserving Portrait Generation

Aug 28, 2023
Yang Liu, Cheng Yu, Lei Shang, Ziheng Wu, Xingjun Wang, Yuze Zhao, Lin Zhu, Chen Cheng, Weitao Chen, Chao Xu, Haoyu Xie, Yuan Yao, Wenmeng Zhou, Yingda Chen, Xuansong Xie, Baigui Sun

Recent advances in personalized image generation have unveiled the intriguing capability of pre-trained text-to-image models to learn identity information from a collection of portrait images. However, existing solutions are often unreliable in producing truthful details and typically suffer from several defects: (i) the generated face exhibits its own unique characteristics, i.e., the facial shape and facial feature positioning may not resemble key characteristics of the input, and (ii) the synthesized face may contain warped, blurred, or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation models with a rich set of face-related perceptual understanding models (e.g., face detection, deep face embedding extraction, and facial attribute recognition) to tackle the aforementioned challenges and generate truthful personalized portraits from only a handful of portrait images. Concretely, we inject several SOTA face models into the generation procedure, achieving more efficient label tagging, data processing, and model post-processing than previous solutions such as DreamBooth~\cite{ruiz2023dreambooth}, InstantBooth~\cite{shi2023instantbooth}, and other LoRA-only approaches~\cite{hu2021lora}. Through the development of FaceChain, we have identified several potential directions to accelerate Face/Human-Centric AIGC research and application. FaceChain is designed as a framework of pluggable components that can be easily adjusted to accommodate different styles and personalized needs, and we hope it can grow to serve the burgeoning needs of the community. FaceChain is open-sourced under the Apache-2.0 license at \url{https://github.com/modelscope/facechain}.
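
As one concrete piece of such a pipeline, the sketch below shows embedding-based post-filtering of generated portraits by cosine similarity against the averaged input identity embedding; the threshold and the embedding source are illustrative assumptions, not FaceChain defaults.

    # Toy example of identity-based post-filtering with face embeddings (not FaceChain's API).
    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def filter_by_identity(candidate_embs, reference_embs, threshold=0.6):
        """Keep generated images whose face embedding matches the averaged input identity."""
        reference = np.mean(reference_embs, axis=0)
        return [i for i, emb in enumerate(candidate_embs)
                if cosine_similarity(emb, reference) >= threshold]

    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(5, 512))                                   # user's portrait embeddings
    same_id = np.mean(inputs, axis=0) + 0.1 * rng.normal(size=(3, 512))  # close to the input identity
    other_id = rng.normal(size=(3, 512))                                 # unrelated faces
    candidates = np.vstack([same_id, other_id])
    print(filter_by_identity(candidates, inputs))  # indices of candidates matching the identity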

* This is an ongoing work that will be consistently refined and improved upon 

TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective

Aug 20, 2023
Jun Dan, Yang Liu, Haoyu Xie, Jiankang Deng, Haoran Xie, Xuansong Xie, Baigui Sun

Vision Transformers (ViTs) have demonstrated powerful representation ability in various visual tasks thanks to their intrinsically data-hungry nature. However, we unexpectedly find that ViTs perform poorly when applied to face recognition (FR) scenarios with extremely large datasets. We investigate the reasons for this phenomenon and discover that existing data augmentation approaches and hard sample mining strategies are incompatible with ViT-based FR backbones, because they lack tailored mechanisms for preserving facial structural information and leveraging the information in each local token. To remedy these problems, this paper proposes a superior FR model called TransFace, which employs a patch-level data augmentation strategy named DPAP and a hard sample mining strategy named EHSM. Specifically, DPAP randomly perturbs the amplitude information of dominant patches to expand sample diversity, which effectively alleviates the overfitting problem in ViTs. EHSM utilizes the information entropy in the local tokens to dynamically adjust the importance weights of easy and hard samples during training, leading to more stable predictions. Experiments on several benchmarks demonstrate the superiority of our TransFace. Code and models are available at https://github.com/DanJun6737/TransFace.
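
Below is a minimal sketch of the general amplitude-perturbation idea behind such a patch-level augmentation, written as an assumption rather than the released TransFace code: the FFT amplitude of a patch is blended with that of another sample while the phase, which carries structural information, is kept.

    # Illustrative patch amplitude perturbation: blend amplitude spectra, preserve phase.
    import torch

    def perturb_patch_amplitude(patch, ref_patch, alpha=0.5):
        # patch, ref_patch: (C, H, W) image patches
        f = torch.fft.fft2(patch)
        f_ref = torch.fft.fft2(ref_patch)
        amp = torch.abs(f) * (1 - alpha) + torch.abs(f_ref) * alpha  # blend amplitude spectra
        phase = torch.angle(f)                                       # keep original phase (structure)
        mixed = amp * torch.exp(1j * phase)
        return torch.fft.ifft2(mixed).real

    x = torch.rand(3, 14, 14)
    y = torch.rand(3, 14, 14)
    print(perturb_patch_amplitude(x, y).shape)  # torch.Size([3, 14, 14])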

* Accepted by ICCV 2023 

PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

Aug 18, 2023
Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie

Current 3D human pose estimators face challenges in adapting to new datasets due to the scarcity of 2D-3D pose pairs in target-domain training sets. We present the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to overcome this issue without extensive target-domain annotation. Using a diffusion-centric structure, PoSynDA simulates the 3D pose distribution in the target domain, filling the data diversity gap. By incorporating a multi-hypothesis network, it creates diverse pose hypotheses and aligns them with the target domain. Target-specific source augmentation obtains target-domain distribution data from the source domain by decoupling the scale and position parameters. A teacher-student paradigm and low-rank adaptation further refine the process. PoSynDA demonstrates competitive performance on benchmarks such as Human3.6M, MPI-INF-3DHP, and 3DPW, even comparable with the target-trained MixSTE model~\cite{zhang2022mixste}. This work paves the way for the practical application of 3D human pose estimation. The code is available at https://github.com/hbing-l/PoSynDA.
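
Below is a minimal sketch, with illustrative shapes and names rather than the PoSynDA release, of target-specific source augmentation by decoupling scale and position: source 3D poses are re-scaled and re-centered so that their global statistics match those observed in the target domain.

    # Illustrative decoupled scale/position alignment of source 3D poses to a target domain.
    import numpy as np

    def align_scale_and_position(source_poses, target_scale, target_root_mean):
        # source_poses: (N, J, 3) 3D joint coordinates; joint 0 is taken as the root
        roots = source_poses[:, :1, :]                              # (N, 1, 3) root positions
        centered = source_poses - roots                             # remove position
        src_scale = np.linalg.norm(centered, axis=-1).mean()        # average joint extent
        rescaled = centered * (target_scale / (src_scale + 1e-8))   # match target scale
        return rescaled + target_root_mean                          # re-inject target position

    rng = np.random.default_rng(0)
    src = rng.normal(size=(4, 17, 3))  # 4 poses with 17 joints each
    aug = align_scale_and_position(src, target_scale=0.9,
                                   target_root_mean=np.array([0.0, 0.0, 4.5]))
    print(aug.shape)  # (4, 17, 3)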

* Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynDA 

Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning

Aug 14, 2023
Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Yifeng Geng, Xuansong Xie

Depth-aware panoptic segmentation is an emerging topic in computer vision which combines semantic and geometric understanding for more robust scene interpretation. Recent works pursue unified frameworks to tackle this challenge but mostly still treat it as two individual learning tasks, which limits their potential for exploiting cross-domain information. We propose a deeply unified framework for depth-aware panoptic segmentation, which performs segmentation and depth estimation jointly, both in a per-segment manner with shared object queries. To narrow the gap between the two tasks, we further design a geometric query enhancement method, which integrates scene geometry into the object queries using latent representations. In addition, we propose a bi-directional guidance learning approach to facilitate cross-task feature learning by taking advantage of their mutual relations. Our method sets a new state of the art for depth-aware panoptic segmentation on both the Cityscapes-DVPS and SemKITTI-DVPS datasets. Moreover, our guidance learning approach is shown to deliver performance improvements even with incomplete supervision labels.
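
Below is a minimal sketch of the shared-query idea, with hypothetical module names and an assumed depth range rather than the paper's code: each object query predicts a class, a mask, and a per-segment depth map from the same representation, so segmentation and depth are estimated jointly per segment.

    # Illustrative joint prediction heads over shared object queries.
    import torch
    import torch.nn as nn

    class JointQueryHeads(nn.Module):
        def __init__(self, dim=256, num_classes=19):
            super().__init__()
            self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
            self.mask_embed = nn.Linear(dim, dim)
            self.depth_embed = nn.Linear(dim, dim)

        def forward(self, queries, pixel_feats):
            # queries: (B, Q, dim) object queries; pixel_feats: (B, dim, H, W) decoder features
            logits = self.cls_head(queries)
            masks = torch.einsum("bqc,bchw->bqhw", self.mask_embed(queries), pixel_feats)
            depth = torch.einsum("bqc,bchw->bqhw", self.depth_embed(queries), pixel_feats)
            depth = depth.sigmoid() * 80.0   # per-segment depth, 0-80 m range assumed
            return logits, masks, depth

    q = torch.randn(2, 100, 256)
    f = torch.randn(2, 256, 64, 128)
    logits, masks, depth = JointQueryHeads()(q, f)
    print(logits.shape, masks.shape, depth.shape)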

* to be published in ICCV 2023 

PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction

Jun 12, 2023
Yanwen Fang, Chao Li, Jintai Chen, Peng-Tao Jiang, Yifeng Geng, Xuansong Xie, Eddy K. F. Lam, Guodong Li

Multi-person motion prediction is a challenging task, especially for real-world scenarios with densely interacting persons. Most previous works have been devoted to the case of weak interactions (e.g., hand-shaking) and typically forecast each human pose in isolation. In this paper, we focus on motion prediction for multiple persons with extreme collaborations and attempt to explore the relationships between the motion trajectories of highly interactive persons. Specifically, a novel cross-query attention (XQA) module is proposed to bilaterally learn the cross-dependencies between the two pose sequences tailored for this situation. Additionally, we introduce a proxy entity that bridges the involved persons, cooperates with the proposed XQA module, and subtly controls the bidirectional information flows, acting as a motion intermediary. We then adapt these designs to a Transformer-based architecture and devise a simple yet effective end-to-end framework called the proxy-bridged game Transformer (PGformer) for multi-person interactive motion prediction. The effectiveness of our method is evaluated on the challenging ExPI dataset, which involves highly interactive actions. We show that PGformer consistently outperforms state-of-the-art methods in both short- and long-term predictions by a large margin. Moreover, our approach is also compatible with the weakly interacting CMU-Mocap and MuPoTS-3D datasets and achieves encouraging results. Our code will become publicly available upon acceptance.
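
Below is a minimal sketch of bilateral cross-query attention routed through a learnable proxy, with assumed shapes and names rather than the PGformer release: each person's motion tokens query the other person's tokens together with the shared proxy tokens.

    # Illustrative bilateral cross-query attention with a shared proxy intermediary.
    import torch
    import torch.nn as nn

    class CrossQueryAttention(nn.Module):
        def __init__(self, dim=128, heads=4, proxy_tokens=8):
            super().__init__()
            self.proxy = nn.Parameter(torch.randn(1, proxy_tokens, dim))  # shared motion intermediary
            self.a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.b2a = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, seq_a, seq_b):
            # seq_a, seq_b: (B, T, dim) motion tokens of the two interacting persons
            proxy = self.proxy.expand(seq_a.size(0), -1, -1)
            ctx_b = torch.cat([proxy, seq_b], dim=1)   # person B context, bridged by the proxy
            ctx_a = torch.cat([proxy, seq_a], dim=1)
            out_a, _ = self.a2b(seq_a, ctx_b, ctx_b)   # A queries B (+ proxy)
            out_b, _ = self.b2a(seq_b, ctx_a, ctx_a)   # B queries A (+ proxy)
            return seq_a + out_a, seq_b + out_b

    a = torch.randn(2, 50, 128)
    b = torch.randn(2, 50, 128)
    out_a, out_b = CrossQueryAttention()(a, b)
    print(out_a.shape, out_b.shape)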

KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration

May 25, 2023
Xu Bao, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Wangmeng Xiang, Jingdong Sun, Hanbing Liu, Wei Liu, Bin Luo, Yifeng Geng, Xuansong Xie

In the realm of facial analysis, accurate landmark detection is crucial for applications ranging from face recognition and expression analysis to animation. Conventional heatmap- or coordinate-regression-based techniques, however, often face challenges in terms of computational burden and quantization errors. To address these issues, we present the KeyPoint Positioning System (KeyPosS), a facial landmark detection framework that stands out from existing methods. For the first time, KeyPosS employs the true-range multilateration algorithm, a technique originally used in GPS systems, to achieve rapid and precise facial landmark detection without relying on computationally intensive regression approaches. The framework utilizes a fully convolutional network to predict a distance map that encodes the distance between a Point of Interest (POI) and multiple anchor points. These anchor points are then used to recover the POI's position via true-range multilateration. Notably, the plug-and-play nature of KeyPosS enables seamless integration into any decoding stage, ensuring a versatile and adaptable solution. We conduct a thorough evaluation of KeyPosS by benchmarking it against state-of-the-art models on four datasets. The results show that KeyPosS substantially outperforms leading methods in low-resolution settings while requiring minimal time overhead. The code is available at https://github.com/zhiqic/KeyPosS.
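
Below is a toy 2D example of true-range multilateration itself, not the KeyPosS decoding code: pairs of range equations are subtracted to obtain a linear system, which is then solved by least squares for the POI coordinates.

    # Toy 2D true-range multilateration from anchor distances via linear least squares.
    import numpy as np

    def multilaterate(anchors, distances):
        # anchors: (K, 2) known anchor coordinates; distances: (K,) measured ranges to the POI
        x0, y0, d0 = anchors[0, 0], anchors[0, 1], distances[0]
        A = 2 * (anchors[1:] - anchors[0])                     # (K-1, 2) from subtracted equations
        b = (d0 ** 2 - distances[1:] ** 2
             + np.sum(anchors[1:] ** 2, axis=1) - (x0 ** 2 + y0 ** 2))
        pos, *_ = np.linalg.lstsq(A, b, rcond=None)
        return pos

    anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
    true_poi = np.array([3.0, 7.0])
    dists = np.linalg.norm(anchors - true_poi, axis=1)
    print(multilaterate(anchors, dists))  # approximately [3. 7.]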
