
Nenghai Yu


Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

Sep 22, 2023
Jiazhen Wang, Bin Liu, Changtao Miao, Zhiwei Zhao, Wanyi Zhuang, Qi Chu, Nenghai Yu

AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the $\rm DGM^4$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.
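A minimal PyTorch sketch of the dual-branch cross-attention (DCA) idea described above: each modality keeps its own stream while querying the other. Module and tensor names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DualBranchCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One cross-attention layer per branch so each modality keeps its own stream.
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image branch queries the text tokens; text branch queries the image tokens.
        img_fused, _ = self.img2txt(img_tokens, txt_tokens, txt_tokens)
        txt_fused, _ = self.txt2img(txt_tokens, img_tokens, img_tokens)
        # Residual connections preserve the modality-specific features.
        return self.norm_img(img_tokens + img_fused), self.norm_txt(txt_tokens + txt_fused)

# Dummy encoder outputs: batch of 2, 196 image patches, 40 text tokens.
img = torch.randn(2, 196, 256)
txt = torch.randn(2, 40, 256)
img_out, txt_out = DualBranchCrossAttention()(img, txt)
print(img_out.shape, txt_out.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 40, 256])
```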

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting

Aug 22, 2023
Qidong Huang, Xiaoyi Dong, Dongdong Chen, Yinpeng Chen, Lu Yuan, Gang Hua, Weiming Zhang, Nenghai Yu


In this paper, we investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE). A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods. This observation drives us to rethink the basic differences between these BERT pretraining methods and how these differences affect robustness against adversarial perturbations. Our empirical analysis reveals that the adversarial robustness of BERT pretraining is highly related to the reconstruction target: predicting the raw pixels of masked image patches degrades the model's adversarial robustness more than predicting the semantic context, since it guides the model to concentrate more on the medium-/high-frequency components of images. Based on our analysis, we provide a simple yet effective way to boost the adversarial robustness of MAE. The basic idea is to use dataset-extracted domain knowledge to occupy the medium-/high-frequency components of images, thus narrowing the optimization space of adversarial perturbations. Specifically, we group the distribution of the pretraining data and optimize a set of cluster-specific visual prompts in the frequency domain. These prompts are incorporated with input images through prototype-based prompt selection at test time. Extensive evaluation shows that our method clearly boosts MAE's adversarial robustness while maintaining its clean performance on ImageNet-1k classification. Our code is available at: https://github.com/shikiw/RobustMAE.
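A hedged sketch of test-time frequency-domain prompting as described above: a cluster-specific prompt is selected by nearest prototype and added to the image's Fourier spectrum. The prompt shapes and the matching rule are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def apply_frequency_prompt(image, prompts, prototypes, feature):
    """
    image:      (B, C, H, W) input batch
    prompts:    (K, C, H, W) learnable frequency-domain prompts, one per cluster
    prototypes: (K, D) cluster prototypes in some matching feature space
    feature:    (B, D) features of the inputs used for prototype matching
    """
    # Prototype-based prompt selection: nearest prototype by cosine similarity.
    sim = F.normalize(feature, dim=-1) @ F.normalize(prototypes, dim=-1).T  # (B, K)
    idx = sim.argmax(dim=-1)                                                # (B,)
    prompt = prompts[idx]                                                   # (B, C, H, W)

    # Perturb the spectrum with the selected prompt and transform back;
    # the prompt occupies medium-/high-frequency bands of the image.
    spec = torch.fft.fft2(image)
    prompted = torch.fft.ifft2(spec + prompt.to(spec.dtype)).real
    return prompted

# Dummy usage: 4 clusters, 3x224x224 images, 512-d matching features.
imgs = torch.randn(2, 3, 224, 224)
prompts = torch.zeros(4, 3, 224, 224, requires_grad=True)
protos = torch.randn(4, 512)
feats = torch.randn(2, 512)
print(apply_frequency_prompt(imgs, prompts, protos, feats).shape)  # torch.Size([2, 3, 224, 224])
```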

* Accepted at ICCV 2023 

MotionGPT: Finetuned LLMs are General-Purpose Motion Generators

Jun 19, 2023
Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang


Generating realistic human motion from given action descriptions has seen significant advances, driven by the emerging demand for digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, to generate consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction that asks the LLM to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion from multimodal control signals, and we hope it can shed light on this new direction. Code will be released upon acceptance.
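An illustrative sketch of the unified prompt formulation: quantized control signals are rendered as special tokens inside one instruction for the LLM. The token format and template wording are assumptions, not the paper's released prompts.

```python
from typing import List, Optional

def build_motion_prompt(description: str, pose_codes: Optional[List[int]] = None) -> str:
    """Wrap a text description and optional single-frame pose codes into one instruction."""
    parts = ["Task: generate a human motion token sequence.",
             f"Text condition: {description}"]
    if pose_codes is not None:
        # Discrete codes from a (hypothetical) pose tokenizer, rendered as special tokens.
        pose_str = " ".join(f"<motion_{c}>" for c in pose_codes)
        parts.append(f"Initial pose condition: {pose_str}")
    parts.append("Answer with motion tokens only:")
    return "\n".join(parts)

print(build_motion_prompt("a person walks forward and waves", pose_codes=[17, 402, 93]))
```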

* 18 pages, 8 figures 

EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors

Jun 16, 2023
Yaqi Zhang, Yan Lu, Bin Liu, Zhiwei Zhao, Qi Chu, Nenghai Yu


Transformers have become popular in recent 3D human pose estimation, using long-term modeling to lift 2D keypoints into 3D space. However, current transformer-based methods do not fully exploit the prior knowledge of the human skeleton provided by the kinematic structure. In this paper, we propose EvoPose, a novel transformer-based model that effectively introduces human body prior knowledge into 3D human pose estimation. Specifically, a Structural Priors Representation (SPR) module represents human priors as structural features carrying rich body patterns, e.g., joint relationships. The structural features interact with the 2D pose sequences and help the model obtain more informative spatiotemporal features. Moreover, a Recursive Refinement (RR) module refines the 3D pose outputs by reusing the estimated results while further injecting human priors. Extensive experiments demonstrate the effectiveness of EvoPose, which achieves a new state of the art on the two most popular benchmarks, Human3.6M and MPI-INF-3DHP.
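A minimal sketch of the recursive refinement idea: an initial 3D estimate is concatenated with the 2D sequence and fed back for another pass. Shapes and the backbone are assumptions, not the EvoPose architecture.

```python
import torch
import torch.nn as nn

class RecursiveRefiner(nn.Module):
    def __init__(self, num_joints=17, dim=256, num_iters=2):
        super().__init__()
        self.num_iters = num_iters
        self.embed = nn.Linear(num_joints * (2 + 3), dim)   # 2D input + current 3D estimate
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, pose2d):
        # pose2d: (B, T, J, 2) sequence of 2D keypoints
        B, T, J, _ = pose2d.shape
        pose3d = torch.zeros(B, T, J, 3, device=pose2d.device)  # initial guess
        for _ in range(self.num_iters):
            x = torch.cat([pose2d, pose3d], dim=-1).flatten(2)   # (B, T, J*(2+3))
            feat = self.backbone(self.embed(x))                  # (B, T, dim)
            pose3d = pose3d + self.head(feat).view(B, T, J, 3)   # residual update
        return pose3d

print(RecursiveRefiner()(torch.randn(2, 27, 17, 2)).shape)  # torch.Size([2, 27, 17, 3])
```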

* 5 pages, 2 figures, 4 tables, published in the proceedings of IEEE ICASSP 2023 

Exploring the Application of Large-scale Pre-trained Models on Adverse Weather Removal

Jun 15, 2023
Zhentao Tan, Yue Wu, Qiankun Liu, Qi Chu, Le Lu, Jieping Ye, Nenghai Yu


Image restoration under adverse weather conditions (e.g., rain, snow, and haze) is a fundamental computer vision problem with important implications for various downstream applications. Different from early methods that are specially designed for a specific type of weather, most recent works tend to remove various adverse weather effects simultaneously, through either spatial feature representation learning or semantic information embedding. Inspired by the various successful applications of large-scale pre-trained models (e.g., CLIP), in this paper we explore their potential benefits for this task from both aspects: 1) for spatial feature representation learning, we design a Spatially-Adaptive Residual (SAR) Encoder to extract degraded areas adaptively. To facilitate its training, we propose a Soft Residual Distillation (CLIP-SRD) strategy to transfer spatial knowledge from CLIP between clean and adverse weather images; 2) for semantic information embedding, we propose a CLIP Weather Prior (CWP) embedding module that makes the network handle different weather conditions adaptively. This module integrates the sample-specific weather prior extracted by the CLIP image encoder with distribution-specific information learned by a set of parameters, and embeds them through a cross-attention mechanism. Extensive experiments demonstrate that our proposed method can achieve state-of-the-art performance under different and challenging adverse weather conditions. Code will be made available.
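A hedged sketch of embedding a CLIP-derived weather prior via cross-attention: restoration features query a prior built from a CLIP image embedding plus a small set of learned distribution tokens. The fusion layout and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WeatherPriorEmbedding(nn.Module):
    def __init__(self, feat_dim=256, clip_dim=512, num_dist_tokens=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, feat_dim)
        # Distribution-specific information learned as a small set of parameters.
        self.dist_tokens = nn.Parameter(torch.randn(num_dist_tokens, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, img_feat, clip_embed):
        # img_feat: (B, N, feat_dim) restoration-network features
        # clip_embed: (B, clip_dim) sample-specific prior from a frozen CLIP image encoder
        B = img_feat.size(0)
        prior = torch.cat([self.proj(clip_embed).unsqueeze(1),
                           self.dist_tokens.unsqueeze(0).expand(B, -1, -1)], dim=1)
        fused, _ = self.cross_attn(img_feat, prior, prior)  # features query the prior
        return img_feat + fused

print(WeatherPriorEmbedding()(torch.randn(2, 196, 256), torch.randn(2, 512)).shape)
# torch.Size([2, 196, 256])
```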


HQ-50K: A Large-scale, High-quality Dataset for Image Restoration

Jun 08, 2023
Qinhong Yang, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Lu Yuan, Gang Hua, Nenghai Yu


This paper introduces a new large-scale image restoration dataset, called HQ-50K, which contains 50,000 high-quality images with rich texture details and semantic diversity. We analyze existing image restoration datasets from five perspectives, including data scale, resolution, compression rates, texture details, and semantic coverage, and find that all of them are deficient in some of these aspects. In contrast, HQ-50K considers all five aspects during the data curation process and meets all the requirements. We also present a new Degradation-Aware Mixture of Expert (DAMoE) model, which enables a single model to handle multiple corruption types and unknown levels. Our extensive experiments demonstrate that HQ-50K consistently improves performance on various image restoration tasks, such as super-resolution, denoising, dejpeg, and deraining. Furthermore, our proposed DAMoE, trained on HQ-50K, outperforms existing state-of-the-art unified models designed for multiple restoration tasks and levels. The dataset and code are available at https://github.com/littleYaang/HQ-50K.
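A minimal sketch of degradation-aware expert routing in the spirit of the DAMoE description above; the gating scheme and expert definition are assumptions, not the proposed model.

```python
import torch
import torch.nn as nn

class DegradationAwareMoE(nn.Module):
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        # One lightweight expert per (hypothetical) degradation type.
        self.experts = nn.ModuleList(
            [nn.Conv2d(dim, dim, kernel_size=3, padding=1) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # routing weights from pooled features

    def forward(self, x):
        # x: (B, C, H, W) feature map from a shared backbone
        weights = self.gate(x.mean(dim=(2, 3))).softmax(dim=-1)          # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)    # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * expert_out).sum(dim=1)

print(DegradationAwareMoE()(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```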

* Dataset and code will be available at https://github.com/littleYaang/HQ-50K 

GPT Paternity Test: GPT Generated Text Detection with GPT Genetic Inheritance

May 21, 2023
Xiao Yu, Yuang Qi, Kejiang Chen, Guoqiang Chen, Xi Yang, Pengyuan Zhu, Weiming Zhang, Nenghai Yu


Large Language Models (LLMs) can generate text that carries the risk of various misuses, including plagiarism, planting fake reviews on e-commerce platforms, or creating fake social media postings that can sway election results. Detecting whether a text is machine-generated has thus become increasingly important. While machine-learning-based detection strategies exhibit superior performance, they often lack generalizability, limiting their practicality. In this work, we introduce the GPT Paternity Test (GPT-Pat), which reliably detects machine-generated text across varied datasets. Given a text under scrutiny, we leverage ChatGPT to generate a corresponding question and then re-answer that question. By comparing the similarity between the original text and the regenerated answer, we can determine whether the text is machine-generated. GPT-Pat consists of a Siamese network that computes the similarity between the original text and the regenerated answer, followed by a binary classifier. Our method achieves an average accuracy of 94.57% on four generalization test sets, surpassing the state-of-the-art RoBERTa-based method by 12.34%. Under re-translation and polishing attacks, the accuracy drop of our method is only about half that of the RoBERTa-based method.
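A pipeline sketch of the question/re-answer comparison described above. The chat() stub stands in for the ChatGPT API call and the bag-of-words similarity for the Siamese network score; only the control flow mirrors the description.

```python
from collections import Counter
import math

def chat(prompt: str) -> str:
    # Placeholder for an actual ChatGPT API call.
    return "stub response"

def cosine_bow(a: str, b: str) -> float:
    # Toy similarity stand-in for the Siamese network score.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb + 1e-8)

def gpt_pat_score(text: str) -> float:
    question = chat(f"Write a question that the following text answers:\n{text}")
    re_answer = chat(question)
    # Machine-generated text tends to be closer to its own re-answer.
    return cosine_bow(text, re_answer)

print(gpt_pat_score("An example passage whose origin we want to check."))
```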


Multi-spectral Class Center Network for Face Manipulation Detection and Localization

May 18, 2023
Changtao Miao, Qi Chu, Zhentao Tan, Zhenchao Jin, Wanyi Zhuang, Yue Wu, Bin Liu, Honggang Hu, Nenghai Yu


As Deepfake content continues to proliferate on the internet, advancing face manipulation forensics has become a pressing issue. To combat this emerging threat, previous methods mainly focus on studying how to distinguish authentic and manipulated face images. Despite impressive performance, image-level classification lacks explainability and is limited to specific application scenarios, while existing forgery localization methods suffer from imprecise and inconsistent pixel-level annotations. To alleviate these problems, this paper first re-constructs the FaceForensics++ dataset by introducing pixel-level annotations, and then builds an extensive benchmark for localizing tampered regions. Next, a novel Multi-Spectral Class Center Network (MSCCNet) is proposed for face manipulation detection and localization. Specifically, inspired by the power of frequency-related forgery traces, we design a Multi-Spectral Class Center (MSCC) module to learn more generalizable and semantic-agnostic features. Based on the features of different frequency bands, the MSCC module collects multi-spectral class centers and computes pixel-to-class relations. Applying these multi-spectral class-level representations suppresses the semantic information of visual concepts, which is insensitive to manipulations. Furthermore, we propose a Multi-level Features Aggregation (MFA) module to exploit more low-level forgery artifacts and structural textures. Experimental results quantitatively and qualitatively demonstrate the effectiveness and superiority of the proposed MSCCNet on the comprehensive localization benchmark. We expect this work to inspire more studies on pixel-level face manipulation localization. The annotations and code will be available.
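A sketch of collecting class centers and pixel-to-class relations for one spectral band, loosely following the MSCC description; the soft-assignment scheme and shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def class_centers_and_relations(feat, coarse_logits):
    # feat: (B, C, H, W) band-specific features; coarse_logits: (B, K, H, W) coarse predictions
    B, C, H, W = feat.shape
    prob = coarse_logits.softmax(dim=1).flatten(2)           # (B, K, HW) soft assignment
    f = feat.flatten(2).transpose(1, 2)                      # (B, HW, C)
    # Class centers: assignment-weighted average of pixel features.
    centers = torch.bmm(prob, f) / (prob.sum(-1, keepdim=True) + 1e-6)   # (B, K, C)
    # Pixel-to-class relations: similarity of every pixel to every class center.
    relation = torch.bmm(F.normalize(f, dim=-1),
                         F.normalize(centers, dim=-1).transpose(1, 2))   # (B, HW, K)
    return centers, relation.view(B, H, W, -1)

centers, rel = class_centers_and_relations(torch.randn(2, 64, 32, 32), torch.randn(2, 2, 32, 32))
print(centers.shape, rel.shape)  # torch.Size([2, 2, 64]) torch.Size([2, 32, 32, 2])
```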

* Email Address: miaoct@mail.ustc.edu.cn 

Watermarking Text Generated by Black-Box Language Models

May 14, 2023
Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, Nenghai Yu


LLMs now exhibit human-like skills in various fields, leading to worries about misuse; detecting generated text has thus become crucial. However, passive detection methods suffer from domain specificity and limited adversarial robustness. To achieve reliable detection, a watermark-based method was proposed for white-box LLMs, allowing them to embed watermarks during text generation. The method randomly divides the model vocabulary to obtain a special list and adjusts the probability distribution to promote the selection of words in that list; a detection algorithm aware of the list can then identify the watermarked text. However, this method is not applicable in many real-world scenarios where only black-box language models are available. For instance, third parties that develop API-based vertical applications cannot watermark text themselves because API providers only supply generated text and withhold probability distributions to protect their commercial interests. To allow third parties to autonomously inject watermarks into generated text, we develop a watermarking framework for black-box language model usage scenarios. Specifically, we first define a binary encoding function that computes a random binary encoding for each word. The encodings computed over non-watermarked text conform to a Bernoulli distribution, with the probability of a word representing bit-1 being approximately 0.5. To inject a watermark, we alter this distribution by selectively replacing words representing bit-0 with context-based synonyms that represent bit-1. A statistical test is then used to identify the watermark. Experiments demonstrate the effectiveness of our method on both Chinese and English datasets. Furthermore, results under re-translation, polishing, word deletion, and synonym substitution attacks show that it is arduous to remove the watermark without compromising the original semantics.
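A sketch of the detection side of such a scheme: score each word with a pseudo-random binary encoding and run a one-sided binomial test on the bit-1 rate. The hash-based encoding and whitespace tokenizer are simplifications of the context-aware scheme in the paper.

```python
import hashlib
from math import comb

def word_bit(word: str) -> int:
    # Deterministic but pseudo-random bit per word (~Bernoulli(0.5) over a vocabulary).
    return hashlib.sha256(word.lower().encode()).digest()[0] & 1

def binomial_p_value(ones: int, n: int, p: float = 0.5) -> float:
    # P(X >= ones) for X ~ Binomial(n, p); small values indicate a watermark.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(ones, n + 1))

def detect_watermark(text: str, alpha: float = 1e-3) -> bool:
    words = text.split()
    ones = sum(word_bit(w) for w in words)
    return binomial_p_value(ones, len(words)) < alpha

print(detect_watermark("an unwatermarked sentence should look like a fair coin"))
```

Watermark injection would shift the bit-1 rate above 0.5 by replacing bit-0 words with bit-1 synonyms, which is exactly what the statistical test above is sensitive to.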

* Code will be available at https://github.com/Kiode/Text_Watermark_Language_Models 

Clothes-Invariant Feature Learning by Causal Intervention for Clothes-Changing Person Re-identification

May 10, 2023
Xulin Li, Yan Lu, Bin Liu, Yuenan Hou, Yating Liu, Qi Chu, Wanli Ouyang, Nenghai Yu


Clothes-invariant feature extraction is critical to clothes-changing person re-identification (CC-ReID): it provides discriminative identity features and eliminates the negative effects caused by the confounder, clothing changes. However, we argue that a strong spurious correlation between clothes and human identity restricts the common likelihood-based ReID formulation P(Y|X) from extracting clothes-irrelevant features. In this paper, we propose a new Causal Clothes-Invariant Learning (CCIL) method that achieves clothes-invariant feature learning by modeling the causal intervention P(Y|do(X)). In the causal view, this causality-based model is inherently invariant to the confounder, which yields clothes-invariant features and avoids the barrier faced by likelihood-based methods. Extensive experiments on three CC-ReID benchmarks, including PRCC, LTCC, and VC-Clothes, demonstrate the effectiveness of our approach, which achieves a new state of the art.
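A hedged sketch of backdoor adjustment for P(Y|do(X)): identity logits are averaged over a dictionary of clothes-confounder prototypes weighted by their priors. The prototype dictionary and fusion are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BackdoorAdjustedClassifier(nn.Module):
    def __init__(self, feat_dim=256, num_ids=100, num_clothes_protos=16):
        super().__init__()
        self.clothes_protos = nn.Parameter(torch.randn(num_clothes_protos, feat_dim))
        # Uniform prior P(c) over confounder prototypes, kept fixed here.
        self.prior = nn.Parameter(torch.ones(num_clothes_protos) / num_clothes_protos,
                                  requires_grad=False)
        self.classifier = nn.Linear(feat_dim * 2, num_ids)

    def forward(self, feat):
        # feat: (B, feat_dim) person features; stratify over each clothes prototype c.
        B, K = feat.size(0), self.clothes_protos.size(0)
        x = torch.cat([feat.unsqueeze(1).expand(B, K, -1),
                       self.clothes_protos.unsqueeze(0).expand(B, K, -1)], dim=-1)
        logits_per_c = self.classifier(x)                              # (B, K, num_ids) ~ P(Y|X, c)
        return (self.prior.view(1, K, 1) * logits_per_c).sum(dim=1)    # sum_c P(Y|X, c) P(c)

print(BackdoorAdjustedClassifier()(torch.randn(4, 256)).shape)  # torch.Size([4, 100])
```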
