Xuhui Jia

Controllable One-Shot Face Video Synthesis With Semantic Aware Prior

Apr 27, 2023
Kangning Liu, Yu-Chuan Su, Wei Hong, Ruijin Cang, Xuhui Jia

The one-shot talking-head synthesis task aims to animate a source image to another pose and expression, which is dictated by a driving frame. Recent methods warp the appearance features extracted from the source using motion fields estimated from sparse keypoints that are learned in an unsupervised manner. Due to their lightweight formulation, these methods are suitable for video conferencing with reduced bandwidth. However, our study shows that current methods suffer from two major limitations: 1) unsatisfactory generation quality for large head poses and when there is observable pose misalignment between the source and the first frame of the driving video; and 2) failure to capture fine yet critical facial motion details, owing to the lack of semantic understanding and appropriate face geometry regularization. To address these shortcomings, we propose a novel method that leverages rich face prior information. The proposed model generates face videos with improved semantic consistency (improving the baseline by $7\%$ in average keypoint distance) and better expression preservation (outperforming the baseline by $15\%$ in average emotion embedding distance) under equivalent bandwidth. Additionally, incorporating such prior information provides a convenient interface for highly controllable generation in terms of both pose and expression.
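
The core mechanism the abstract refers to, warping source appearance features with a dense motion field derived from sparse keypoints, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation; the Gaussian-weighted motion aggregation and all function and variable names are hypothetical.

```python
# Hypothetical sketch of keypoint-driven feature warping (not the paper's code).
import torch
import torch.nn.functional as F

def sparse_to_dense_motion(kp_src, kp_drv, H, W, sigma=0.1):
    """Turn K source/driving keypoint pairs into a dense sampling grid.

    kp_src, kp_drv: (B, K, 2) keypoints in normalized [-1, 1] (x, y) coordinates.
    Returns a (B, H, W, 2) grid usable by F.grid_sample.
    """
    B, K, _ = kp_src.shape
    ys, xs = torch.linspace(-1, 1, H), torch.linspace(-1, 1, W)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H, W, 2) as (y, x)
    grid = grid.flip(-1).view(1, 1, H, W, 2)                           # reorder to (x, y)
    offset = (kp_src - kp_drv).view(B, K, 1, 1, 2)     # per-keypoint motion, driving -> source
    dist = ((grid - kp_drv.view(B, K, 1, 1, 2)) ** 2).sum(-1)          # (B, K, H, W)
    weight = torch.softmax(-dist / sigma, dim=1).unsqueeze(-1)         # soft keypoint assignment
    return (weight * (grid + offset)).sum(dim=1)                       # (B, H, W, 2)

def warp_source_features(src_feat, kp_src, kp_drv):
    """Warp (B, C, H, W) source appearance features toward the driving pose."""
    _, _, H, W = src_feat.shape
    grid = sparse_to_dense_motion(kp_src, kp_drv, H, W)
    return F.grid_sample(src_feat, grid, align_corners=True)
```

The warped features would then be fed to a generator that renders the output frame; the bandwidth advantage comes from transmitting only the sparse driving keypoints.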

Federated Learning of Shareable Bases for Personalization-Friendly Image Classification

Apr 16, 2023
Hong-You Chen, Jike Zhong, Mingda Zhang, Xuhui Jia, Hang Qi, Boqing Gong, Wei-Lun Chao, Li Zhang

Personalized federated learning (PFL) aims to harness the collective wisdom of clients' data to build customized models tailored to individual clients' data distributions. Existing works offer personalization primarily to clients who participate in the FL process, making it hard to serve new clients who were absent or join later. In this paper, we propose FedBasis, a novel PFL framework to tackle this deficiency. FedBasis learns a small set of shareable ``basis'' models, which can be linearly combined to form personalized models for clients. For a new client, only a small set of combination coefficients, not the models themselves, needs to be learned. This design makes FedBasis more parameter-efficient, robust, and accurate than other competitive PFL baselines, especially in the low-data regime, without increasing the inference cost. To demonstrate its applicability, we also present a more practical PFL testbed for image classification, featuring larger data discrepancies across clients in both the image and label spaces as well as more faithful training and test splits.
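
A minimal sketch of the linear-combination idea described above: each personalized model is a convex combination of a few shared basis models, so a new client only needs to learn the mixing coefficients. This is not the FedBasis implementation; names are illustrative, and it assumes PyTorch 2.x for torch.func.functional_call.

```python
# Illustrative sketch, not the FedBasis code.
import torch
from torch.func import functional_call

def personalized_forward(bases, coeffs, x):
    """Forward pass of a personalized model formed as a convex combination of
    the basis models' parameters (differentiable w.r.t. `coeffs`).

    bases:  list of nn.Module with identical architectures (kept frozen).
    coeffs: (num_bases,) tensor of mixing logits, learned per client.
    """
    w = torch.softmax(coeffs, dim=0)
    names = [n for n, _ in bases[0].named_parameters()]
    mixed = {n: sum(w[i] * dict(b.named_parameters())[n] for i, b in enumerate(bases))
             for n in names}
    return functional_call(bases[0], mixed, (x,))

# For a new client: freeze the bases and optimize only
# coeffs = torch.nn.Parameter(torch.zeros(num_bases)) on the client's local data.
```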

* Preprint 

Identity Encoder for Personalized Diffusion

Apr 14, 2023
Yu-Chuan Su, Kelvin C. K. Chan, Yandong Li, Yang Zhao, Han Zhang, Boqing Gong, Huisheng Wang, Xuhui Jia

Many applications can benefit from personalized image generation models, including image enhancement and video conferencing, to name just a few. Existing works achieve personalization by fine-tuning one model for each person. While successful, this approach incurs additional computation and storage overhead for each new identity. Furthermore, it usually requires tens or hundreds of examples per identity to achieve the best performance. To overcome these challenges, we propose an encoder-based approach to personalization. We learn an identity encoder that extracts an identity representation from a set of reference images of a subject, together with a diffusion generator that generates new images of the subject conditioned on the identity representation. Once trained, the model can generate images of arbitrary identities given a few examples, even if it has not been trained on those identities. Our approach greatly reduces the overhead of personalized image generation and is more applicable to many potential applications. Empirical results show that our approach consistently outperforms existing fine-tuning-based approaches in both image generation and reconstruction, and its outputs are preferred by users more than 95% of the time compared with the best-performing baseline.
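
A hypothetical sketch of the encoder-based personalization described above: the identity encoder pools projected features from a few reference images into a single identity embedding that conditions the diffusion generator. The module structure, the mean pooling, and all names are assumptions, not the paper's architecture.

```python
# Illustrative identity encoder sketch (not the paper's model).
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Pools features from a few reference images into one identity embedding."""

    def __init__(self, backbone: nn.Module, feat_dim: int, id_dim: int):
        super().__init__()
        self.backbone = backbone              # any image feature extractor
        self.proj = nn.Linear(feat_dim, id_dim)

    def forward(self, refs: torch.Tensor) -> torch.Tensor:
        # refs: (N_ref, 3, H, W) reference images of a single subject
        feats = self.backbone(refs)           # (N_ref, feat_dim)
        return self.proj(feats).mean(dim=0, keepdim=True)  # (1, id_dim)

# The embedding is passed to the diffusion generator as an extra conditioning
# signal (e.g. a cross-attention token), so personalizing to a new identity is
# a single forward pass through the encoder rather than per-identity fine-tuning.
```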

Subject-driven Text-to-Image Generation via Apprenticeship Learning

Apr 14, 2023
Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, William W. Cohen

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject by fine-tuning an ``expert model'' for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine-tuning with \emph{in-context} learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by {\em apprenticeship learning}, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject, and use these clusters to train a massive number of expert models, each specialized in a different subject. The apprentice model SuTI then learns to mimic the behavior of these experts through the proposed apprenticeship learning algorithm. SuTI can generate high-quality, customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing approaches like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, and Re-Imagen, while performing on par with DreamBooth.
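
The apprenticeship-learning recipe described above can be summarized as the loop below. This is a hypothetical outline, not the SuTI pipeline; the callables stand in for components the abstract mentions (an expert fine-tuned per mined subject cluster, and a single apprentice distilled from the experts' outputs).

```python
# High-level, illustrative outline only; all components are stand-ins.
from typing import Callable, Iterable, List

def apprenticeship_training(
    clusters: Iterable[List[str]],                  # image paths per mined subject cluster
    train_expert: Callable[[List[str]], Callable[[str], object]],
    apprentice_step: Callable[[List[str], str, object], float],
    prompts: List[str],
) -> None:
    """Fine-tune one expert per subject cluster, then distill it into the apprentice."""
    for cluster in clusters:
        expert = train_expert(cluster)              # subject-specific expert model
        demos = cluster[:3]                         # a few demonstration images of the subject
        for prompt in prompts:
            target = expert(prompt)                 # the expert's rendition of the subject
            apprentice_step(demos, prompt, target)  # one imitation update on the apprentice
```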

* Work in Progress 

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Apr 05, 2023
Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su

This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in ensuring the object-specific embedding is faithfully reflected in the generation process, while preserving control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method can synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.
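
A minimal sketch of what a regularized joint training objective of this kind could look like: the standard diffusion denoising loss plus an identity preservation term that keeps the object embedding recognizable in the output. The cosine form of the identity term, the weighting, and all names are assumptions rather than the paper's exact loss.

```python
# Illustrative loss sketch, not the paper's objective.
import torch.nn.functional as F

def joint_training_loss(noise_pred, noise, obj_embed, gen_embed, lam=0.1):
    """noise_pred, noise: (B, C, H, W) predicted vs. true diffusion noise.
    obj_embed: (B, D) embedding of the reference object from the encoder.
    gen_embed: (B, D) embedding re-extracted from the generated (denoised) image.
    """
    denoise_loss = F.mse_loss(noise_pred, noise)                       # standard diffusion loss
    identity_loss = 1.0 - F.cosine_similarity(obj_embed, gen_embed, dim=-1).mean()
    return denoise_loss + lam * identity_loss                          # regularized joint objective
```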

What's in a Name? Beyond Class Indices for Image Recognition

Apr 05, 2023
Kai Han, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia

Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models like CLIP are able to assign semantic class names to unseen objects in a `zero-shot' manner, although they still rely on a predefined set of candidate names at test time. In this paper, we reconsider the recognition problem and task a vision-language model with assigning class names to images given only a large and essentially unconstrained vocabulary of categories as prior information. We use non-parametric methods to establish relationships between images, which allow the model to automatically narrow down the set of possible candidate names. Specifically, we propose iteratively clustering the data and voting on class names within each cluster, showing that this enables a roughly 50\% improvement over the baseline on ImageNet. Furthermore, we tackle this problem in both unsupervised and partially supervised settings, as well as with both coarse-grained and fine-grained search spaces as the unconstrained dictionary.
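
A rough sketch of the cluster-then-vote idea described above, under stated assumptions: image and vocabulary embeddings come precomputed from a CLIP-style model, and a single hard vote replaces the paper's iterative procedure.

```python
# Illustrative sketch; assumes precomputed, L2-normalized CLIP-style embeddings.
import numpy as np
from sklearn.cluster import KMeans

def name_clusters(img_embs, name_embs, names, k):
    """img_embs: (N, D) image embeddings; name_embs: (V, D) text embeddings of
    the unconstrained vocabulary `names`. Returns cluster labels and one voted
    name per cluster."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(img_embs)
    assigned = {}
    for c in range(k):
        members = img_embs[labels == c]                   # images in this cluster
        votes = (members @ name_embs.T).argmax(axis=1)    # each image votes for its best name
        assigned[c] = names[np.bincount(votes, minlength=len(names)).argmax()]
    return labels, assigned
```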

Joint Representation Learning and Novel Category Discovery on Single- and Multi-modal Data

Apr 27, 2021
Xuhui Jia, Kai Han, Yukun Zhu, Bradley Green

This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data. In particular, we propose using category discrimination on labelled data and cross-modal discrimination on multi-modal data to augment the instance discrimination used in conventional contrastive learning approaches. We further employ the Winner-Take-All (WTA) hashing algorithm on the shared representation space to generate pairwise pseudo labels for unlabelled data, allowing better prediction of cluster assignments. We thoroughly evaluate our framework on the large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound and the image benchmarks CIFAR10, CIFAR100, and ImageNet, obtaining state-of-the-art results.
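
A minimal sketch of how WTA hashing over the shared representation can yield pairwise pseudo labels for unlabelled data, as described above. The hyperparameters, the agreement threshold, and the dense N x N comparison are assumptions for illustration only.

```python
# Illustrative WTA pseudo-labelling sketch (hyperparameters are assumptions).
import numpy as np

def wta_hash(feats, n_hashes=256, window=4, seed=0):
    """feats: (N, D) embeddings -> (N, n_hashes) WTA codes (argmax index per random window)."""
    rng = np.random.default_rng(seed)
    D = feats.shape[1]
    perms = np.stack([rng.permutation(D)[:window] for _ in range(n_hashes)])
    return feats[:, perms].argmax(axis=2)                 # (N, n_hashes)

def pairwise_pseudo_labels(feats, threshold=0.5):
    """Mark a pair as pseudo-positive if enough of their WTA codes agree."""
    codes = wta_hash(feats)
    agreement = (codes[:, None, :] == codes[None, :, :]).mean(axis=2)  # (N, N) fraction of matches
    return (agreement > threshold).astype(np.float32)
```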

Ranking Neural Checkpoints

Nov 23, 2020
Yandong Li, Xuhui Jia, Ruoxin Sang, Yukun Zhu, Bradley Green, Liqiang Wang, Boqing Gong

This paper is concerned with ranking many pre-trained deep neural networks (DNNs), called checkpoints, for transfer learning to a downstream task. Thanks to the broad use of DNNs, we may easily collect hundreds of checkpoints from various sources. Which of them transfers best to our downstream task of interest? Striving to answer this question thoroughly, we establish a neural checkpoint ranking benchmark (NeuCRaB) and study some intuitive ranking measures. These measures are generic, applying to checkpoints with different output types and without knowing how or on which dataset the checkpoints were pre-trained. They also incur low computational cost, making them practically meaningful. Our results suggest that the linear separability of the features extracted by a checkpoint is a strong indicator of its transferability. We also arrive at a new ranking measure, NLEEP, which gives rise to the best performance in our experiments.
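
The linear-separability finding described above suggests a simple ranking recipe, sketched below: extract downstream features with each checkpoint, fit a linear probe, and rank checkpoints by its held-out accuracy. The probe choice and scoring are illustrative; the paper's measures, including NLEEP, are defined differently.

```python
# Illustrative ranking-by-linear-probe sketch (not the NeuCRaB measures).
from sklearn.linear_model import LogisticRegression

def linear_separability_score(feats_train, y_train, feats_val, y_val):
    """Held-out accuracy of a linear probe on features from one checkpoint."""
    probe = LogisticRegression(max_iter=1000).fit(feats_train, y_train)
    return probe.score(feats_val, y_val)

def rank_checkpoints(feature_sets, y_train, y_val):
    """feature_sets: {checkpoint_name: (train_feats, val_feats)} extracted per checkpoint."""
    scores = {name: linear_separability_score(tr, y_train, va, y_val)
              for name, (tr, va) in feature_sets.items()}
    return sorted(scores, key=scores.get, reverse=True)   # best-transferring checkpoints first
```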

Boosting Image-based Mutual Gaze Detection using Pseudo 3D Gaze

Oct 15, 2020
Bardia Doosti, Ching-Hui Chen, Raviteja Vemulapalli, Xuhui Jia, Yukun Zhu, Bradley Green

Mutual gaze detection, i.e., predicting whether or not two people are looking at each other, plays an important role in understanding human interactions. In this work, we focus on the task of image-based mutual gaze detection and propose a simple and effective approach that boosts performance by using an auxiliary 3D gaze estimation task during training. We achieve this performance boost without additional labeling cost by training the 3D gaze estimation branch on pseudo 3D gaze labels deduced from the mutual gaze labels. By sharing the head-image encoder between the 3D gaze estimation and mutual gaze detection branches, we learn better head features than by training the mutual gaze detection branch alone. Experimental results on three image datasets show that the proposed approach improves detection performance significantly without additional annotations. This work also introduces a new image dataset consisting of 33.1K pairs of humans annotated with mutual gaze labels across 29.2K images.
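
A minimal sketch of the multi-task setup described above: a shared head-image encoder feeds both a mutual gaze classification branch and an auxiliary 3D gaze regression branch trained on pseudo 3D gaze labels. The architecture details and head designs are assumptions, not the paper's model.

```python
# Illustrative multi-task model sketch (not the paper's architecture).
import torch
import torch.nn as nn

class MutualGazeNet(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = encoder                            # shared head-image encoder
        self.mutual_head = nn.Linear(2 * feat_dim, 1)     # head pair -> mutual gaze logit
        self.gaze_head = nn.Linear(feat_dim, 3)           # single head -> 3D gaze vector

    def forward(self, head_a, head_b):
        fa, fb = self.encoder(head_a), self.encoder(head_b)
        mutual_logit = self.mutual_head(torch.cat([fa, fb], dim=-1))
        return mutual_logit, self.gaze_head(fa), self.gaze_head(fb)

# Training would combine a binary cross-entropy loss on mutual_logit with a
# regression loss on the pseudo 3D gaze labels; the auxiliary gaze branch is
# only needed during training.
```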
