Liming Jiang

Evaluating General-Purpose AI with Psychometrics

Oct 25, 2023
Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, Luning Sun, David Stillwell, Fang Luo, Xing Xie

Artificial intelligence (AI) has witnessed an evolution from task-specific to general-purpose systems that trend toward human versatility. As AI systems begin to play pivotal roles in society, it is important to ensure that they are adequately evaluated. Current AI benchmarks typically assess performance on collections of specific tasks, which has drawbacks when used for assessing general-purpose AI systems. First, it is difficult to predict whether an AI system could complete a new task that it has never seen or that did not previously exist. Second, these benchmarks often focus on overall performance metrics, potentially overlooking the finer details crucial for making informed decisions. Lastly, there are growing concerns about the reliability of existing benchmarks and questions about what is actually being measured. To address these challenges, this paper suggests that psychometrics, the science of psychological measurement, should be placed at the core of evaluating general-purpose AI. Psychometrics provides a rigorous methodology for identifying and measuring the latent constructs that underlie performance across multiple tasks. We discuss its merits, warn against potential pitfalls, and propose a framework for putting it into practice. Finally, we explore future opportunities to integrate psychometrics with AI.
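To make the latent-construct idea concrete, here is a minimal, illustrative sketch (not from the paper): it treats each AI system's scores on a battery of tasks as psychometric items, fits a one-factor model to estimate a latent ability, and checks the battery's internal consistency with Cronbach's alpha. The data are synthetic and all variable names are assumptions.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_systems, n_tasks = 50, 12
ability = rng.normal(size=(n_systems, 1))               # unobserved latent construct
loadings = rng.uniform(0.5, 1.0, size=(1, n_tasks))     # how strongly each task taps it
scores = ability @ loadings + 0.3 * rng.normal(size=(n_systems, n_tasks))

# One-factor model: estimate each system's latent ability from its task scores.
fa = FactorAnalysis(n_components=1, random_state=0)
ability_hat = fa.fit_transform(scores)

# Cronbach's alpha: reliability of the task battery as a measure of one construct.
k = n_tasks
item_var = scores.var(axis=0, ddof=1).sum()
total_var = scores.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_var / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")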

* Work in progress 

PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation

Oct 14, 2023
Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu

Recent advances in zero-shot text-to-3D human generation, which employ a human model prior (e.g., SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Directly leveraging existing strategies for high-fidelity text-to-3D human texturing is therefore challenging. In this work, we propose a model called PaintHuman to address these challenges from two aspects. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as geometric guidance to ensure that textures are semantically aligned with human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art methods, validate the efficacy of our approach.
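For context, the standard SDS gradient that DSD modifies is commonly written as follows (following DreamFusion); the exact form of the negative gradient components introduced by DSD is specified in the paper and is not reproduced here.

\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\, \big(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \,\right]

where x = g(\theta) is the rendered image of the textured human, x_t its noised version at timestep t, \epsilon_\phi the pre-trained text-to-image diffusion model conditioned on the prompt y, \epsilon the injected Gaussian noise, and w(t) a timestep-dependent weight.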


StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation

Aug 31, 2023
Yuhan Wang, Liming Jiang, Chen Change Loy

Unconditional video generation is a challenging task that involves synthesizing high-quality videos that are both coherent and of extended duration. To address this challenge, researchers have used pretrained StyleGAN image generators for high-quality frame synthesis and focused on motion generator design. The motion generator is typically trained in an autoregressive manner using heavy 3D convolutional discriminators to ensure motion coherence during video generation. In this paper, we introduce a novel motion generator design that uses a learning-based GAN inversion network. The encoder in our method captures rich and smooth priors from encoding images to latents, and given the latent of an initially generated frame as guidance, our method can generate smooth future latents by modulating the inversion encoder temporally. Our method enjoys the advantage of sparse training, and the inversion network guided by the initial frame naturally constrains the generation space of our motion generator, eliminating the need for heavy discriminators. Moreover, our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator. Extensive experiments conducted on various benchmarks demonstrate the superiority of our method in generating long, high-resolution videos with decent single-frame quality and temporal consistency.
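Below is a minimal, hypothetical sketch of the idea described above: an inversion encoder maps the initial frame to a latent, and a temporal modulation of that latent (a simplification of modulating the encoder itself) produces smooth future latents for a frozen StyleGAN generator. All module names, shapes, and the modulation form are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TemporalStyleModulatedInversion(nn.Module):
    def __init__(self, latent_dim=512, motion_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                       # stand-in for a GAN inversion encoder
            nn.Conv2d(3, 64, 3, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )
        # Maps (motion code, timestep) to a per-channel scale/shift of the latent.
        self.modulation = nn.Linear(motion_dim + 1, 2 * latent_dim)

    def forward(self, first_frame, motion_code, t):
        w0 = self.encoder(first_frame)                      # latent of the initial frame
        cond = torch.cat([motion_code, t[:, None]], dim=1)
        scale, shift = self.modulation(cond).chunk(2, dim=1)
        return w0 * (1 + scale) + shift                     # latent of frame t, fed to StyleGAN

# Usage: latents for 16 future frames from one initial frame and one motion code.
model = TemporalStyleModulatedInversion()
frame0, motion = torch.randn(1, 3, 256, 256), torch.randn(1, 128)
latents = [model(frame0, motion, torch.tensor([t / 16.0])) for t in range(16)]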

* ICCV 2023. Code: https://github.com/johannwyh/StyleInV Project page: https://www.mmlab-ntu.com/project/styleinv/index.html 

Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation

Aug 24, 2023
Yuxin Jiang, Liming Jiang, Shuai Yang, Chen Change Loy

Automatic high-quality rendering of anime scenes from complex real-world images is of significant practical value. The challenges of this task lie in the complexity of the scenes, the unique features of anime style, and the lack of high-quality datasets to bridge the domain gap. Despite promising attempts, previous efforts still fall short of achieving satisfactory results with consistent semantic preservation, evident stylization, and fine details. In this study, we propose Scenimefy, a novel semi-supervised image-to-image translation framework that addresses these challenges. Our approach guides the learning with structure-consistent pseudo paired data, simplifying the purely unsupervised setting. The pseudo data are derived uniquely from a semantics-constrained StyleGAN that leverages rich model priors such as CLIP. We further apply segmentation-guided data selection to obtain high-quality pseudo supervision. A patch-wise contrastive style loss is introduced to improve stylization and fine details. In addition, we contribute a high-resolution anime scene dataset to facilitate future research. Our extensive experiments demonstrate the superiority of our method over state-of-the-art baselines in terms of both perceptual quality and quantitative performance.
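As a rough illustration, the sketch below implements a patch-wise contrastive loss in the spirit of PatchNCE (from CUT): corresponding patches of the source and translated features are positives, and all other sampled patches are negatives. The exact patch-wise contrastive style loss used in Scenimefy may differ; the feature shapes and temperature are assumptions.

import torch
import torch.nn.functional as F

def patch_contrastive_loss(feat_src, feat_tgt, num_patches=64, tau=0.07):
    # feat_src, feat_tgt: (B, C, H, W) features from corresponding network layers.
    b, c, h, w = feat_src.shape
    idx = torch.randperm(h * w, device=feat_src.device)[:num_patches]  # shared locations
    q = feat_src.flatten(2)[:, :, idx].permute(0, 2, 1)   # (B, N, C) query patches
    k = feat_tgt.flatten(2)[:, :, idx].permute(0, 2, 1)   # (B, N, C) key patches
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = torch.bmm(q, k.transpose(1, 2)) / tau         # (B, N, N) patch similarities
    labels = torch.arange(num_patches, device=q.device).expand(b, -1)
    # The matching patch is the positive; every other sampled patch is a negative.
    return F.cross_entropy(logits.reshape(-1, num_patches), labels.reshape(-1))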

* ICCV 2023. The first two authors contributed equally. Code: https://github.com/Yuxinn-J/Scenimefy Project page: https://yuxinn-j.github.io/projects/Scenimefy.html 

GP-UNIT: Generative Prior for Versatile Unsupervised Image-to-Image Translation

Jun 07, 2023
Shuai Yang, Liming Jiang, Ziwei Liu, Chen Change Loy

Recent years have witnessed many successful unsupervised image-to-image translation models that learn correspondences between two visual domains without paired data. However, it remains a great challenge to build robust mappings between various domains, especially those with drastic visual discrepancies. In this paper, we introduce a novel versatile framework, Generative Prior-guided UNsupervised Image-to-image Translation (GP-UNIT), that improves the quality, applicability, and controllability of existing translation models. The key idea of GP-UNIT is to distill the generative prior from pre-trained class-conditional GANs to build coarse-level cross-domain correspondences, and to apply the learned prior to adversarial translations to excavate fine-level correspondences. With the learned multi-level content correspondences, GP-UNIT is able to perform valid translations between both close and distant domains. For close domains, GP-UNIT can be conditioned on a parameter to determine the intensity of the content correspondences during translation, allowing users to balance between content and style consistency. For distant domains, semi-supervised learning is explored to guide GP-UNIT to discover accurate semantic correspondences that are hard to learn solely from appearance. Through extensive experiments, we validate the superiority of GP-UNIT over state-of-the-art translation models in producing robust, high-quality, and diversified translations between various domains.
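A purely illustrative reading of the controllability claim (an assumption for illustration, not the authors' objective): a user-chosen scalar weights how strongly the learned content correspondences constrain the translation relative to style consistency.

def translation_objective(adv_loss, content_corr_loss, style_loss, content_weight=0.5):
    # content_weight in [0, 1]: higher values preserve more of the source content
    # structure; lower values favor stronger adherence to the target style.
    return adv_loss + content_weight * content_corr_loss + (1.0 - content_weight) * style_loss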

* Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Code: https://github.com/williamyang1991/GP-UNIT Project page: https://www.mmlab-ntu.com/project/gpunit/. arXiv admin note: substantial text overlap with arXiv:2204.03641 

CelebV-Text: A Large-Scale Facial Text-Video Dataset

Mar 26, 2023
Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu

Text-driven generation models are flourishing in video generation and editing. However, face-centric text-to-video generation remains a challenge due to the lack of a suitable dataset containing high-quality videos and highly relevant texts. This paper presents CelebV-Text, a large-scale, diverse, and high-quality dataset of facial text-video pairs, to facilitate research on facial text-to-video generation tasks. CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts are of high quality, describing both static and dynamic attributes precisely. The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance. The effectiveness and potential of CelebV-Text are further shown through extensive self-evaluation. A benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task. All data and models are publicly available.

* Accepted by CVPR2023. Project Page: https://celebv-text.github.io/ 

StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces

Mar 10, 2023
Shuai Yang, Liming Jiang, Ziwei Liu, Chen Change Loy

Recent advances in face manipulation using StyleGAN have produced impressive results. However, StyleGAN is inherently limited to cropped, aligned faces at the fixed image resolution it is pre-trained on. In this paper, we propose a simple and effective solution to this limitation by using dilated convolutions to rescale the receptive fields of shallow layers in StyleGAN, without altering any model parameters. This allows fixed-size small features at shallow layers to be extended into larger ones that can accommodate variable resolutions, making them more robust in characterizing unaligned faces. To enable real face inversion and manipulation, we introduce a corresponding encoder that provides the first-layer feature of the extended StyleGAN in addition to the latent style code. We validate the effectiveness of our method using unaligned face inputs of various resolutions in a diverse set of face manipulation tasks, including facial attribute editing, super-resolution, sketch/mask-to-face translation, and face toonification.
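A minimal sketch of the core idea (the layer below is a stand-in, not StyleGAN's actual modulated convolution): the pretrained convolution's weights are reused unchanged, but applied with a dilation so that a shallow layer can operate on larger, variable-resolution feature maps.

import torch
import torch.nn.functional as F

pretrained_conv = torch.nn.Conv2d(512, 512, kernel_size=3, padding=1)  # weights stay fixed

def dilated_forward(x, dilation=2):
    # Same weights and bias, larger receptive field; padding grows with the
    # dilation so the spatial size of the feature map is preserved.
    return F.conv2d(x, pretrained_conv.weight, pretrained_conv.bias,
                    stride=1, padding=dilation, dilation=dilation)

feat = torch.randn(1, 512, 8, 8)          # a shallow feature at twice the original 4x4 size
out = dilated_forward(feat, dilation=2)   # (1, 512, 8, 8), no parameters altered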

* Code: https://github.com/williamyang1991/StyleGANEX Project page: https://www.mmlab-ntu.com/project/styleganex/ 

VToonify: Controllable High-Resolution Portrait Video Style Transfer

Sep 30, 2022
Shuai Yang, Liming Jiang, Ziwei Liu, Chen Change Loy

Generating high-quality artistic portrait videos is an important and desirable task in computer graphics and vision. Although a series of successful portrait image toonification models built upon the powerful StyleGAN have been proposed, these image-oriented methods have obvious limitations when applied to videos, such as the fixed frame size, the requirement of face alignment, missing non-facial details, and temporal inconsistency. In this work, we investigate the challenging task of controllable high-resolution portrait video style transfer by introducing a novel VToonify framework. Specifically, VToonify leverages the mid- and high-resolution layers of StyleGAN to render high-quality artistic portraits based on the multi-scale content features extracted by an encoder, in order to better preserve frame details. The resulting fully convolutional architecture accepts non-aligned faces in videos of variable size as input, contributing to complete face regions with natural motions in the output. Our framework is compatible with existing StyleGAN-based image toonification models, extending them to video toonification, and inherits their appealing features for flexible style control over color and intensity. This work presents two instantiations of VToonify built upon Toonify and DualStyleGAN for collection-based and exemplar-based portrait video style transfer, respectively. Extensive experimental results demonstrate the effectiveness of the proposed VToonify framework over existing methods in generating high-quality and temporally coherent artistic portrait videos with flexible style controls.

* ACM Transactions on Graphics (SIGGRAPH Asia 2022). Code: https://github.com/williamyang1991/VToonify Project page: https://www.mmlab-ntu.com/project/vtoonify/ 

CelebV-HQ: A Large-Scale Video Facial Attributes Dataset

Jul 25, 2022
Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, Chen Change Loy

Large-scale datasets have played indispensable roles in the recent success of face generation and editing, and have significantly facilitated the advances of emerging research fields. However, the academic community still lacks a video dataset with diverse facial attribute annotations, which is crucial for research on face-related videos. In this work, we propose a large-scale, high-quality, and diverse video dataset with rich facial attribute annotations, named the High-Quality Celebrity Video Dataset (CelebV-HQ). CelebV-HQ contains 35,666 video clips with a resolution of at least 512x512, involving 15,653 identities. All clips are labeled manually with 83 facial attributes covering appearance, action, and emotion. We conduct a comprehensive analysis in terms of age, ethnicity, brightness stability, motion smoothness, head pose diversity, and data quality to demonstrate the diversity and temporal coherence of CelebV-HQ. In addition, its versatility and potential are validated on two representative tasks, i.e., unconditional video generation and video facial attribute editing. Furthermore, we envision the future potential of CelebV-HQ, as well as the new opportunities and challenges it would bring to related research directions. Data, code, and models are publicly available. Project page: https://celebv-hq.github.io.

* ECCV 2022. Project Page: https://celebv-hq.github.io/ ; Dataset: https://github.com/CelebV-HQ/CelebV-HQ 