Guojun Qi

LatentAvatar: Learning Latent Expression Code for Expressive Neural Head Avatar

May 03, 2023
Yuelang Xu, Hongwen Zhang, Lizhen Wang, Xiaochen Zhao, Han Huang, Guojun Qi, Yebin Liu

Existing approaches to animatable NeRF-based head avatars are either built upon face templates or use the expression coefficients of templates as the driving signal. Despite promising progress, their performance is tightly bound by the expressive power and the tracking accuracy of the templates. In this work, we present LatentAvatar, an expressive neural head avatar driven by latent expression codes. These latent expression codes are learned in an end-to-end, self-supervised manner without templates, freeing our method from the expressiveness and tracking limitations of templates. To achieve this, we leverage a latent head NeRF to learn person-specific latent expression codes from a monocular portrait video, and further design a Y-shaped network to learn latent expression codes shared across different subjects for cross-identity reenactment. By optimizing the photometric reconstruction objectives in NeRF, the latent expression codes become 3D-aware while faithfully capturing high-frequency expression details. Moreover, by learning a mapping between the latent expression codes of the shared and person-specific settings, LatentAvatar is able to perform expressive reenactment between different subjects. Experimental results show that LatentAvatar captures challenging expressions and the subtle movement of teeth and even eyeballs, outperforming previous state-of-the-art solutions in both quantitative and qualitative comparisons. Project page: https://www.liuyebin.com/latentavatar.
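The Y-shaped network is only sketched in the abstract; the following is a minimal PyTorch sketch of that idea under simple assumptions: one shared image encoder produces the latent expression code, and two person-specific decoders reconstruct each subject, so the code is shared across identities. All layer sizes, the toy 32x32 resolution, and the class name are hypothetical, not the authors' implementation (which decodes through a latent head NeRF rather than a 2D decoder).

```python
# Minimal PyTorch sketch of a Y-shaped autoencoder: a shared encoder maps
# portraits of either subject to a latent expression code, and two
# person-specific decoders reconstruct the corresponding subject's image.
# All layer sizes and names are hypothetical, not the authors' implementation.
import torch
import torch.nn as nn

class YShapedAutoencoder(nn.Module):
    def __init__(self, code_dim: int = 128):
        super().__init__()
        # Shared encoder: image -> latent expression code (shared across subjects).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )
        # One decoder per subject: latent code -> low-resolution portrait.
        self.decoders = nn.ModuleList([self._make_decoder(code_dim) for _ in range(2)])

    @staticmethod
    def _make_decoder(code_dim: int) -> nn.Module:
        return nn.Sequential(
            nn.Linear(code_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image: torch.Tensor, subject_id: int) -> torch.Tensor:
        code = self.encoder(image)                 # shared latent expression code
        return self.decoders[subject_id](code)     # subject-specific reconstruction


if __name__ == "__main__":
    model = YShapedAutoencoder()
    x = torch.rand(4, 3, 32, 32)                   # toy 32x32 portraits
    recon = model(x, subject_id=0)
    loss = torch.nn.functional.mse_loss(recon, x)  # photometric reconstruction loss
    print(recon.shape, loss.item())
```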

* Accepted by SIGGRAPH 2023 

OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering

Mar 26, 2023
Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, Zhen Lei, Lei Zhang

Controllability, generalizability, and efficiency are the major objectives in constructing face avatars represented by neural implicit fields. However, existing methods have not managed to meet all three requirements simultaneously. They either focus on static portraits, restricting the representation to a specific subject, or suffer from substantial computational cost, limiting their flexibility. In this paper, we propose One-shot Talking face Avatar (OTAvatar), which constructs face avatars via a generalized, controllable tri-plane rendering solution, so that each personalized avatar can be constructed from only one portrait as the reference. Specifically, OTAvatar first inverts a portrait image to a motion-free identity code. Second, the identity code and a motion code are used to modulate an efficient CNN that generates a tri-plane-formulated volume, which encodes the subject in the desired motion. Finally, volume rendering is employed to generate an image from any view. The core of our solution is a novel decoupling-by-inverting strategy that disentangles identity and motion in the latent code via optimization-based inversion. Benefiting from the efficient tri-plane representation, we achieve controllable rendering of a generalized face avatar at 35 FPS on an A100 GPU. Experiments show promising cross-identity reenactment performance on subjects outside the training set, as well as better 3D consistency.
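As a rough illustration of the tri-plane volume that OTAvatar decodes and renders, here is a minimal PyTorch sketch of tri-plane feature sampling under simple assumptions: a 3D point is projected onto three axis-aligned feature planes, the bilinearly sampled features are summed, and a small MLP maps them to density and color for volume rendering. The plane resolution, feature width, and MLP are illustrative placeholders; in the paper the planes are produced by a CNN modulated by the identity and motion codes rather than learned directly.

```python
# Minimal sketch of tri-plane feature sampling: project each 3D point onto
# the XY, XZ, and YZ feature planes, sum the sampled features, and decode
# them to density and color. Shapes and the tiny MLP are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneField(nn.Module):
    def __init__(self, feat_dim: int = 32, res: int = 64):
        super().__init__()
        # Three learnable planes: XY, XZ, YZ (in OTAvatar these would come
        # from a CNN modulated by identity and motion codes).
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (N, 3) in [-1, 1]^3
        coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]  # XY, XZ, YZ
        feats = 0.0
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)                            # (1, N, 1, 2)
            sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                    align_corners=True)            # (1, C, N, 1)
            feats = feats + sampled.squeeze(0).squeeze(-1).t()     # (N, C)
        out = self.mlp(feats)                                      # (N, 4)
        density, rgb = out[:, :1], torch.sigmoid(out[:, 1:])       # for volume rendering
        return torch.cat([density, rgb], dim=-1)


if __name__ == "__main__":
    field = TriPlaneField()
    samples = torch.rand(1024, 3) * 2 - 1                          # points along camera rays
    print(field(samples).shape)                                    # torch.Size([1024, 4])
```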

* Accepted by CVPR 2023. The code is available at https://github.com/theEricMa/OTAvatar 

Causal Attention for Vision-Language Tasks

Mar 05, 2021
Xu Yang, Hanwang Zhang, Guojun Qi, Jianfei Cai

We present a novel attention mechanism, Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module to focus on spurious correlations in the training data, damaging model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge of the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and can therefore replace any attention module, such as top-down attention or self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training: for example, it can promote the lighter LXMERT [tan2019lxmert], which uses less data and computational power, to a level comparable to the heavier UNITER [chen2020uniter]. Code is published at https://github.com/yangxuntu/catt.
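To make the IS-ATT / CS-ATT combination concrete, here is a minimal PyTorch sketch under simple assumptions: in-sample attention attends over the current sample's own features, cross-sample attention attends over a bank of features drawn from other samples, and the two outputs are concatenated and projected. The memory construction, head counts, and module names are hypothetical, not the released CATT code.

```python
# Minimal sketch of combining In-Sample Attention (queries, keys, values from
# the same sample) with Cross-Sample Attention (keys and values taken from a
# bank of other samples), then fusing the two outputs.
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.is_att = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cs_att = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x:      (B, N, D) features of the current sample
        # memory: (B, M, D) features gathered from *other* samples
        is_out, _ = self.is_att(query=x, key=x, value=x)            # In-Sample Attention
        cs_out, _ = self.cs_att(query=x, key=memory, value=memory)  # Cross-Sample Attention
        return self.proj(torch.cat([is_out, cs_out], dim=-1))       # combine both estimates


if __name__ == "__main__":
    att = CausalAttention()
    feats = torch.randn(2, 36, 256)     # e.g. region features of the current image
    bank = torch.randn(2, 100, 256)     # features drawn from other training samples
    print(att(feats, bank).shape)       # torch.Size([2, 36, 256])
```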

FLAT: Few-Shot Learning via Autoencoding Transformation Regularizers

Dec 29, 2019
Haohang Xu, Hongkai Xiong, Guojun Qi

One of the most significant challenges facing a few-shot learning task is the generalizability of the (meta-)model from base to novel categories. Most existing few-shot learning models attempt to address this challenge either by learning meta-knowledge from multiple simulated tasks on the base categories, or by resorting to data augmentation that applies various transformations to training examples. However, the supervised nature of model training in these approaches limits their ability to explore variations across different categories, restricting their cross-category generalizability in modeling novel concepts. To this end, we present a novel regularization mechanism that learns the change of feature representations induced by a distribution of transformations, without using the labels of data examples. We expect this regularizer to expand the semantic space of base categories to cover that of novel categories through the transformation of feature representations. It minimizes the risk of overfitting to base categories by inspecting transformation-augmented variations at the encoded feature level. This results in the proposed FLAT (Few-shot Learning via Autoencoding Transformations) approach, which autoencodes the applied transformations. Experimental results show performance superior to current state-of-the-art methods in the literature.
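A minimal sketch of the transformation-autoencoding regularizer idea, under simple assumptions: a random rotation is applied to each image, both views are encoded with the same feature encoder, and a small decoder must recover which transformation was applied from the feature pair, without any class labels. The rotation family, toy encoder, and loss form are illustrative choices, not the authors' FLAT implementation.

```python
# Minimal sketch of an autoencoding-transformation regularizer: predict the
# applied transformation (a 0/90/180/270 degree rotation) from the features
# of the original and transformed views, with no class labels involved.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformationRegularizer(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, n_transforms: int = 4):
        super().__init__()
        self.encoder = encoder
        # Decoder recovers the transformation from the concatenated features.
        self.decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, n_transforms)
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        k = torch.randint(0, 4, (images.size(0),))                    # sampled transformations
        rotated = torch.stack([torch.rot90(img, int(ki), dims=(1, 2))
                               for img, ki in zip(images, k)])
        z, z_t = self.encoder(images), self.encoder(rotated)          # features of both views
        logits = self.decoder(torch.cat([z, z_t], dim=-1))
        return F.cross_entropy(logits, k)                             # unsupervised regularizer


if __name__ == "__main__":
    enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))     # toy feature encoder
    reg = TransformationRegularizer(enc, feat_dim=64)
    x = torch.rand(8, 3, 32, 32)
    reg_loss = reg(x)            # would be added to the supervised few-shot loss
    print(reg_loss.item())
```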

An End-to-End Foreground-Aware Network for Person Re-Identification

Oct 25, 2019
Yiheng Liu, Wengang Zhou, Jianzhuang Liu, Guojun Qi, Qi Tian, Houqiang Li

Person re-identification is the crucial task of identifying pedestrians of interest across multiple surveillance camera views. In person re-identification, a pedestrian is usually represented with features extracted from a rectangular image region that inevitably contains the scene background, which introduces ambiguity in distinguishing different pedestrians and degrades accuracy. To this end, we propose an end-to-end foreground-aware network that discriminates foreground from background by learning a soft mask for person re-identification. In our method, in addition to the pedestrian ID as supervision for the foreground, we introduce the camera ID of each pedestrian image for background modeling. The foreground branch and the background branch are optimized collaboratively. By introducing a target attention loss, the pedestrian features extracted from the foreground branch become insensitive to the background, which greatly reduces the negative impact of changing backgrounds on matching the same identity across different camera views. Notably, in contrast to existing methods, our approach does not require any additional dataset to train a human landmark detector or a segmentation model for locating background regions. Experimental results on three challenging datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, demonstrate the effectiveness of our approach.
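A minimal PyTorch sketch of the two-branch, soft-mask design described above, under simple assumptions: a mask head predicts a per-location foreground probability, the foreground branch classifies pedestrian identity from masked features, and the background branch classifies camera ID from the complementary features. The backbone, head sizes, and Market-1501-style class counts are placeholders, and the target attention loss is omitted.

```python
# Minimal sketch of a foreground-aware re-ID network: a soft mask splits the
# feature map into foreground (supervised by pedestrian ID) and background
# (supervised by camera ID) descriptors.
import torch
import torch.nn as nn

class ForegroundAwareReID(nn.Module):
    def __init__(self, feat_dim: int = 64, n_ids: int = 751, n_cams: int = 6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Conv2d(feat_dim, 1, 1)        # soft foreground mask
        self.id_head = nn.Linear(feat_dim, n_ids)         # supervised by pedestrian ID
        self.cam_head = nn.Linear(feat_dim, n_cams)       # supervised by camera ID

    def forward(self, x: torch.Tensor):
        f = self.backbone(x)                              # (B, C, H, W)
        m = torch.sigmoid(self.mask_head(f))              # (B, 1, H, W) in [0, 1]
        fg = (f * m).mean(dim=(2, 3))                     # foreground descriptor
        bg = (f * (1 - m)).mean(dim=(2, 3))               # background descriptor
        return self.id_head(fg), self.cam_head(bg), fg    # fg is the re-ID feature


if __name__ == "__main__":
    model = ForegroundAwareReID()
    imgs = torch.rand(4, 3, 128, 64)                      # typical re-ID crop size
    id_logits, cam_logits, feat = model(imgs)
    print(id_logits.shape, cam_logits.shape, feat.shape)
```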

* TIP Under Review 

Rethink and Redesign Meta learning

Dec 24, 2018
Yunxiao Qin, Weiguo Zhang, Chenxu Zhao, Zezheng Wang, Hailin Shi, Guojun Qi, Jingping Shi, Zhen Lei

Recently, meta-learning has emerged as a promising way to improve the ability to learn from few data in many computer vision tasks. However, existing meta-learning approaches still fall far behind humans, and like many deep learning algorithms, they also suffer from overfitting. We name this the Task-Over-Fitting (TOF) problem: the meta-learner overfits to the training tasks rather than to the training data. Human beings can learn from few data mainly because we leverage past knowledge to rapidly understand images of new categories. Furthermore, benefiting from a flexible attention mechanism, we can accurately extract and select key features from images and thus solve few-shot learning tasks with excellent performance. In this paper, we rethink the meta-learning algorithm and find that existing meta-learning approaches fail to consider the attention mechanism and past knowledge. To this end, we present a novel meta-learning paradigm with three developments that introduce the attention mechanism and past knowledge step by step. In this way, we narrow the problem space and improve the performance of meta-learning, and the TOF problem is also significantly reduced. Extensive experiments demonstrate the effectiveness of our design and methods, with state-of-the-art performance not only on several few-shot learning benchmarks but also on the Cross-Entropy across Tasks (CET) metric.

* 16 pages. arXiv admin note: text overlap with arXiv:1811.07545 

Rank Subspace Learning for Compact Hash Codes

Mar 19, 2015
Kai Li, Guojun Qi, Jun Ye, Kien A. Hua

The era of Big Data has spawned unprecedented interest in developing hashing algorithms for efficient storage and fast nearest-neighbor search. Most existing works learn hash functions that are numeric quantizations of feature values in a projected feature space. In this work, we propose a novel hash learning framework that encodes features' rank orders, instead of their numeric values, in a number of optimal low-dimensional ranking subspaces. We formulate the ranking-subspace learning problem as the optimization of a piecewise-linear convex-concave function and present two versions of our algorithm: one with independent optimization of each hash bit, and the other exploiting a sequential learning framework. Our work is a generalization of the Winner-Take-All (WTA) hash family and naturally enjoys the numeric-stability benefits of rank correlation measures while being optimized to achieve high precision at very short code lengths. We compare with several state-of-the-art hashing algorithms in both supervised and unsupervised domains, showing superior performance on a number of datasets.
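A minimal NumPy sketch of the rank-order hashing idea, in the spirit of the Winner-Take-All family it generalizes, under simple assumptions: each code is the index of the largest response after projecting a feature into a small low-dimensional subspace, here random rather than learned as in the paper. The subspace count and dimension are illustrative.

```python
# Minimal sketch of rank-order hashing: project each feature into m small
# subspaces and keep only the rank information (the argmax index), which is
# invariant to monotonic rescaling of the feature values.
import numpy as np

def rank_subspace_hash(X: np.ndarray, projections: np.ndarray) -> np.ndarray:
    """X: (n, d) features; projections: (m, k, d) subspaces -> (n, m) codes."""
    responses = np.einsum("mkd,nd->nmk", projections, X)   # (n, m, k) subspace responses
    return responses.argmax(axis=2).astype(np.uint8)       # (n, m), each code in [0, k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 128))                      # 5 features of dimension 128
    P = rng.standard_normal((16, 4, 128))                  # 16 random subspaces of dimension 4
    codes = rank_subspace_hash(X, P)                       # 16 codes of 2 bits each per feature
    print(codes.shape, codes)
```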

* 10 pages 