Guo-Jun Qi

BARET : Balanced Attention based Real image Editing driven by Target-text Inversion

Dec 09, 2023

Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, Siyu Wu, Guo-Jun Qi

Abstract: Image editing approaches based on diffusion models have developed rapidly, yet their applicability is limited by requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of diffusion models. To alleviate these limitations and realize efficient real image editing, we propose a novel editing technique that requires only an input image and target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model. Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence. (II) Progressive Transition Scheme applies progressive linear interpolation between the target text embedding and its fine-tuned version to generate a transition embedding that maintains non-rigid editing capability. (III) Balanced Attention Module (BAM) balances the tradeoff between textual description and image semantics. By combining the self-attention map from the reconstruction process and the cross-attention map from the transition process, the guidance of target text embeddings in the diffusion process is optimized. To demonstrate the editing capability, effectiveness, and efficiency of the proposed BARET, we have conducted extensive qualitative and quantitative experiments. Moreover, results from a user study and an ablation study further demonstrate its superiority over other methods.

* Accepted by AAAI2024
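
The progressive linear interpolation mentioned in the BARET abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the evenly spaced schedule, and the direction of the transition (from the original target embedding toward the fine-tuned one) are all assumptions made for illustration, and embeddings are reduced to plain lists of floats.

```python
def lerp(a, b, t):
    """Linearly interpolate between two equal-length vectors a and b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def transition_embeddings(target, tuned, num_steps):
    """Return one interpolated embedding per step.

    Hypothetical schedule: the weight grows evenly from 0 to 1, so the
    first embedding equals `target` and the last equals `tuned`.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return [lerp(target, tuned, i / (num_steps - 1)) for i in range(num_steps)]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" standing in for real text embeddings.
    for emb in transition_embeddings([0.0, 2.0], [1.0, 4.0], 3):
        print(emb)
```

In a real pipeline each interpolated embedding would condition one or more denoising steps; how steps map to interpolation weights is a design choice the abstract does not specify.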

OmniMotionGPT: Animal Motion Generation with Limited Data

Nov 30, 2023

Zhangsihao Yang, Mingyuan Zhou, Mengyi Shan, Bingbing Wen, Ziwei Xuan, Mitch Hill, Junjie Bai, Guo-Jun Qi, Yalin Wang

Abstract: Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and benchmarked, it remains challenging to transfer this success to other skeleton structures with limited data. In this work, we design a model architecture that imitates Generative Pretraining Transformer (GPT), transferring prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for both animal and human motions and at the same time optimize through the similarity scores among human motion encoding, animal motion encoding, and text CLIP embedding. Presenting the first solution to this problem, we are able to generate animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data. Additionally, we introduce AnimalML3D, the first text-animal motion dataset, with 1240 animation sequences spanning 36 different animal identities. We hope this dataset will mitigate the data scarcity problem in text-driven animal motion generation, providing a new playground for the research community.

* The project page is at https://zshyang.github.io/omgpt-website/
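
The similarity scores among the three encodings in the OmniMotionGPT abstract can be illustrated with a small sketch. The abstract does not name the similarity measure; cosine similarity is assumed here because it is the usual choice with CLIP embeddings, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def pairwise_similarities(human, animal, text):
    """Scores for each pair among the three encodings (assumed layout)."""
    return {
        "human_animal": cosine_similarity(human, animal),
        "human_text": cosine_similarity(human, text),
        "animal_text": cosine_similarity(animal, text),
    }


if __name__ == "__main__":
    # Toy 2-dimensional encodings standing in for real model outputs.
    print(pairwise_similarities([1.0, 0.0], [1.0, 1.0], [0.0, 1.0]))
```

A training objective would typically push these scores up for matching motion and text pairs; the exact loss is not given in the abstract.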

Exploring the Robustness of Human Parsers Towards Common Corruptions

Sep 07, 2023

Sanyi Zhang, Xiaochun Cao, Rui Wang, Guo-Jun Qi, Jie Zhou

Abstract: Human parsing aims to segment each pixel of the human image with fine-grained semantic categories. However, current human parsers trained with clean data are easily confused by numerous image corruptions such as blur and noise. To improve the robustness of human parsers, in this paper, we construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models. Inspired by the data augmentation strategy, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions. Specifically, two types of data augmentations from different views, i.e., image-aware augmentation and model-aware image-to-image transformation, are integrated in a sequential manner for adapting to unforeseen image corruptions. The image-aware augmentation enriches the diversity of training images with the help of common image operations. The model-aware augmentation strategy improves the diversity of input data by considering the model's randomness. The proposed method is model-agnostic, and it can plug and play into arbitrary state-of-the-art human parsing frameworks. The experimental results show that the proposed method demonstrates good universality: it can improve the robustness of human parsing models, and even semantic segmentation models, when facing various common image corruptions. Meanwhile, it still obtains comparable performance on clean data.

* Accepted by IEEE Transactions on Image Processing (TIP)

High-Fidelity Clothed Avatar Reconstruction from a Single Image

Apr 08, 2023

Tingting Liao, Xiaomei Zhang, Yuliang Xiu, Hongwei Yi, Xudong Liu, Guo-Jun Qi, Yong Zhang, Xuan Wang, Xiangyu Zhu, Zhen Lei

Abstract: This paper presents a framework for efficient 3D clothed avatar reconstruction. By combining the high accuracy of optimization-based methods with the efficiency of learning-based methods, we propose a coarse-to-fine way to realize high-fidelity clothed avatar reconstruction (CAR) from a single image. At the first stage, we use an implicit model to learn the general shape in the canonical space of a person in a learning-based way, and at the second stage, we refine the surface detail by estimating the non-rigid deformation in the posed space in an optimization-based way. A hyper-network is utilized to generate a good initialization so that the convergence of the optimization process is greatly accelerated. Extensive experiments on various datasets show that the proposed CAR successfully produces high-fidelity avatars for arbitrarily clothed humans in real scenes.

Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver

Apr 03, 2023

Xianpeng Liu, Ce Zheng, Kelvin Cheng, Nan Xue, Guo-Jun Qi, Tianfu Wu

Abstract: The main challenge of monocular 3D object detection is the accurate localization of the 3D center. Motivated by a new and strong observation that this challenge can be remedied by a 3D-space local-grid search scheme in an ideal case, we propose a stage-wise approach, which combines the information flow from 2D-to-3D (3D bounding box proposal generation with a single 2D image) and 3D-to-2D (proposal verification by denoising with 3D-to-2D contexts) in a top-down manner. Specifically, we first obtain initial proposals from off-the-shelf backbone monocular 3D detectors. Then, we generate a 3D anchor space by local-grid sampling from the initial proposals. Finally, we perform 3D bounding box denoising at the 3D-to-2D proposal verification stage. To effectively learn discriminative features for denoising highly overlapped proposals, this paper presents a method of using the Perceiver I/O model to fuse the 3D-to-2D geometric information and the 2D appearance information. With the encoded latent representation of a proposal, the verification head is implemented with a self-attention module. Our method, named MonoXiver, is generic and can be easily adapted to any backbone monocular 3D detector. Experimental results on the well-established KITTI dataset and the challenging large-scale Waymo dataset show that MonoXiver consistently achieves improvement with limited computation overhead.
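
The local-grid sampling step in the MonoXiver abstract, which builds a 3D anchor space around an initial proposal, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the cubic grid shape, and the `step` and `radius` parameters are hypothetical and are not taken from the paper.

```python
from itertools import product


def local_grid_anchors(center, step, radius):
    """Sample candidate 3D centers on a regular grid around `center`.

    `step` is the assumed grid spacing (e.g. in meters) and `radius` is
    the number of grid steps taken in each direction along every axis,
    giving (2 * radius + 1) ** 3 anchors including the center itself.
    """
    if radius < 0 or step <= 0:
        raise ValueError("radius must be >= 0 and step must be > 0")
    cx, cy, cz = center
    offsets = range(-radius, radius + 1)
    return [
        (cx + i * step, cy + j * step, cz + k * step)
        for i, j, k in product(offsets, repeat=3)
    ]


if __name__ == "__main__":
    anchors = local_grid_anchors((1.0, 1.5, 20.0), step=0.5, radius=1)
    print(len(anchors), anchors[0], anchors[-1])
```

Each anchor would then be scored by a verification stage; the sketch covers only the sampling, not the Perceiver I/O based denoising described in the abstract.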

DDT: A Diffusion-Driven Transformer-based Framework for Human Mesh Recovery from a Video

Mar 29, 2023

Ce Zheng, Guo-Jun Qi, Chen Chen

Abstract: Human mesh recovery (HMR) provides rich human body information for various real-world applications such as gaming, human-computer interaction, and virtual reality. Compared to single image-based methods, video-based methods can utilize temporal information to further improve performance by incorporating human body motion priors. However, many-to-many approaches such as VIBE suffer from a lack of motion smoothness and from temporal inconsistency, while many-to-one approaches such as TCMR and MPS-Net rely on future frames, which is non-causal and time-inefficient during inference. To address these challenges, a novel Diffusion-Driven Transformer-based framework (DDT) for video-based HMR is presented. DDT is designed to decode specific motion patterns from the input sequence, enhancing motion smoothness and temporal consistency. As a many-to-many approach, the decoder of our DDT outputs the human mesh of all the frames, making DDT more viable for real-world applications where time efficiency is crucial and a causal model is desired. Extensive experiments are conducted on widely used datasets (Human3.6M, MPI-INF-3DHP, and 3DPW), demonstrating the effectiveness and efficiency of our DDT.

POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery

Mar 23, 2023

Ce Zheng, Xianpeng Liu, Guo-Jun Qi, Chen Chen

Abstract: Transformer architectures have achieved SOTA performance on human mesh recovery (HMR) from monocular images. However, the performance gain has come at the cost of substantial memory and computational overhead. A lightweight and efficient model to reconstruct accurate human mesh is needed for real-world applications. In this paper, we propose a pure transformer architecture named POoling aTtention TransformER (POTTER) for the HMR task from single images. Observing that the conventional attention module is expensive in both memory and computation, we propose an efficient pooling attention module, which significantly reduces the memory and computational cost without sacrificing performance. Furthermore, we design a new transformer architecture by integrating a High-Resolution (HR) stream for the HMR task. The high-resolution local and global features from the HR stream can be utilized for recovering more accurate human mesh. Our POTTER outperforms the SOTA method METRO while requiring only 7% of total parameters and 14% of the Multiply-Accumulate Operations on the Human3.6M (PA-MPJPE metric) and 3DPW (all three metrics) datasets. The project webpage is https://zczcwh.github.io/potter_page.

* CVPR 2023

AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+

Mar 14, 2023

Xiao Wang, Ying Wang, Ziwei Xuan, Guo-Jun Qi

Abstract: Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among them is Masked Image Modeling (MIM), which aligns with the pretraining of language transformers by predicting masked patches as a pretext task. A criterion in unsupervised pretraining is that the pretext task needs to be sufficiently hard to prevent the transformer encoder from learning trivial low-level features that do not generalize well to downstream tasks. For this purpose, we propose an Adversarial Positional Embedding (AdPE) approach. It distorts the local visual structures by perturbing the position encodings so that the learned transformer cannot simply use the locally correlated patches to predict the missing ones. We hypothesize that it forces the transformer encoder to learn more discriminative features in a global context with stronger generalizability to downstream tasks. We consider both absolute and relative positional encodings, where adversarial positions can be imposed in both the embedding mode and the coordinate mode. We also present a new MAE+ baseline that brings the performance of MIM pretraining to a new level with the AdPE. The experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ over 1600 epochs of pretraining ViT-B and ViT-L on ImageNet-1K. For the transfer learning task, it outperforms the MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO, respectively. These results are obtained with the AdPE being a pure MIM approach that does not use any extra models or external datasets for pretraining. The code is available at https://github.com/maple-research-lab/AdPE.

* 9 pages, 5 figures

MorphGANFormer: Transformer-based Face Morphing and De-Morphing

Feb 18, 2023

Na Zhang, Xudong Liu, Xin Li, Guo-Jun Qi

Abstract: Semantic face image manipulation has received increasing attention in recent years. StyleGAN-based approaches to face morphing are among the leading techniques; however, they often suffer from noticeable blurring and artifacts as a result of the uniform attention in the latent feature space. In this paper, we propose a transformer-based alternative for face morphing and demonstrate its superiority to StyleGAN-based methods. Our contributions are threefold. First, inspired by GANformer, we introduce a bipartite structure to exploit long-range interactions in face images for iterative propagation of information from latent variables to salient facial features. Special loss functions are designed to support the optimization of face morphing. Second, we extend the study of transformer-based face morphing to demorphing by presenting an effective defense strategy with access to a reference image using the same generator of MorphGANFormer. Such demorphing is conceptually similar to unmixing of hyperspectral images but operates in the latent (instead of pixel) space. Third, for the first time, we address a fundamental issue of vulnerability-detectability trade-off for face morphing studies. It is argued that neither doppelganger nor random pair selection is optimal, and a Lagrangian multiplier-based approach should be used to achieve an improved trade-off between recognition vulnerability and attack detectability.

* 13 pages, 13 figures

Efficient Image Super-Resolution with Feature Interaction Weighted Hybrid Network

Dec 29, 2022

Wenjie Li, Juncheng Li, Guangwei Gao, Weihong Deng, Jian Yang, Guo-Jun Qi, Chia-Wen Lin

Abstract: Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning technology. However, the existing methods usually require a large computational cost. Meanwhile, the activation function will cause some features of the intermediate layer to be lost. Therefore, it is a challenge to make the model lightweight while reducing the impact of intermediate feature loss on the reconstruction quality. In this paper, we propose a Feature Interaction Weighted Hybrid Network (FIWHN) to alleviate the above problem. Specifically, FIWHN consists of a series of novel Wide-residual Distillation Interaction Blocks (WDIB) as the backbone, where every three WDIBs form a Feature shuffle Weighted Group (FSWG) by mutual information mixing and fusion. In addition, to mitigate the adverse effects of intermediate feature loss on the reconstruction results, we introduce well-designed Wide Convolutional Residual Weighting (WCRW) and Wide Identical Residual Weighting (WIRW) units in WDIB, and effectively cross-fuse features of different finenesses through a Wide-residual Distillation Connection (WRDC) framework and a Self-Calibrating Fusion (SCF) unit. Finally, to complement the global features lacking in the CNN model, we introduce the Transformer into our model and explore a new way of combining the CNN and Transformer. Extensive quantitative and qualitative experiments on low-level and high-level tasks show that our proposed FIWHN can achieve a good balance between performance and efficiency, and is more conducive to downstream tasks that solve problems in low-pixel scenarios.

* 15 pages, 14 figures, extension of our AAAI2022
