Zitong Yu

Biomedical Image Splicing Detection using Uncertainty-Guided Refinement

Sep 28, 2023
Xun Lin, Wenzhong Tang, Shuai Wang, Zitong Yu, Yizhong Liu, Haoran Wang, Ying Fu, Alex Kot

Recently, a surge in biomedical academic publications suspected of image manipulation has led to numerous retractions, turning biomedical image forensics into a research hotspot. While general image manipulation detection has received growing attention, the specific detection of splicing traces in biomedical images remains underexplored. Disruptive factors within biomedical images, such as artifacts, abnormal patterns, and noise, exhibit misleading features that resemble splicing traces, greatly increasing the challenge of this task. Moreover, the scarcity of high-quality spliced biomedical images also limits potential advancements in this field. In this work, we propose an Uncertainty-guided Refinement Network (URN) to mitigate the effects of these disruptive factors. Our URN explicitly suppresses the propagation of unreliable information flow caused by disruptive factors among regions, thereby obtaining robust features. Moreover, URN concentrates on refining uncertainly predicted regions during the decoding phase. Besides, we construct a dataset for Biomedical image Splicing (BioSp) detection, which consists of 1,290 spliced images. Compared with existing datasets, BioSp comprises the largest number of spliced images and the most diverse sources. Comprehensive experiments on three benchmark datasets demonstrate the superiority of the proposed method. Meanwhile, we verify the generalizability of URN under cross-dataset domain shifts and its robustness against post-processing operations. Our BioSp dataset will be released upon acceptance.
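
The abstract does not give implementation details of URN, so the PyTorch sketch below only illustrates the general idea of uncertainty-guided refinement: derive a per-pixel uncertainty map from a coarse splicing prediction and use it to gate a residual correction. The module name, layer sizes, and the binary-entropy uncertainty measure are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class UncertaintyGuidedRefiner(nn.Module):
    """Toy refinement head: a coarse splicing mask is corrected mainly
    where the prediction is uncertain (probability near 0.5)."""

    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + 1, feat_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 1, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor, coarse_logits: torch.Tensor):
        prob = torch.sigmoid(coarse_logits)  # (B, 1, H, W)
        # Binary-entropy uncertainty: ~0 for confident pixels, peaks at p = 0.5.
        uncertainty = -(prob * torch.log(prob + 1e-6)
                        + (1 - prob) * torch.log(1 - prob + 1e-6))
        # Residual correction, gated so confident regions stay untouched.
        correction = self.refine(torch.cat([feats, prob], dim=1))
        return coarse_logits + uncertainty * correction, uncertainty

# Example: backbone features plus a first-pass prediction at the same resolution.
refined, unc = UncertaintyGuidedRefiner()(torch.randn(2, 64, 64, 64),
                                          torch.randn(2, 1, 64, 64))
```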

S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens

Sep 07, 2023
Rizhao Cai, Zitong Yu, Chenqi Kong, Haoliang Li, Changsheng Chen, Yongjian Hu, Alex Kot

Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models, but their cross-domain generalization capabilities are often hindered by the domain shift problem, which arises from different distributions between training and testing data. In this study, we develop a generalized FAS method under the Efficient Parameter Transfer Learning (EPTL) paradigm, where we adapt pre-trained Vision Transformer models for the FAS task. During training, adapter modules are inserted into the pre-trained ViT model, and only the adapters are updated while the other pre-trained parameters remain fixed. We identify a limitation of previous vanilla adapters: they are built on linear layers, which lack a spoofing-aware inductive bias and thus restrict cross-domain generalization. To address this limitation and achieve cross-domain generalized FAS, we propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms. To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR), which aims to reduce domain style variance by regularizing Gram matrices extracted from tokens across different domains. Our experimental results demonstrate that the proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmarks. We will release the source code upon acceptance.
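
A rough sketch of the Gram-matrix idea behind TSR is given below, assuming token features of shape (batch, tokens, channels) taken from the same layer for two training domains; the exact normalization, layer choice, and loss weighting used by S-Adapter are not specified in the abstract and are assumed here.

```python
import torch
import torch.nn.functional as F

def token_gram(tokens: torch.Tensor) -> torch.Tensor:
    """Gram matrix over the channel dimension of a ViT token sequence.
    tokens: (B, N, C) -> (B, C, C), normalized by the number of tokens."""
    return torch.einsum('bnc,bnd->bcd', tokens, tokens) / tokens.shape[1]

def token_style_regularization(tokens_a: torch.Tensor,
                               tokens_b: torch.Tensor) -> torch.Tensor:
    """Penalize the gap in token 'style' (second-order statistics) between
    two domains by matching their batch-averaged Gram matrices."""
    return F.mse_loss(token_gram(tokens_a).mean(dim=0),
                      token_gram(tokens_b).mean(dim=0))

# Example: tokens from the same transformer layer for two source domains.
tsr_loss = token_style_regularization(torch.randn(4, 197, 768),
                                      torch.randn(4, 197, 768))
```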

Hyperbolic Face Anti-Spoofing

Aug 17, 2023
Shuangpeng Han, Rizhao Cai, Yawen Cui, Zitong Yu, Yongjian Hu, Alex Kot

Learning generalized face anti-spoofing (FAS) models against presentation attacks is essential for the security of face recognition systems. Previous FAS methods usually encourage models to extract discriminative features, pulling features of the same class (bonafide or attack) close while pushing bonafide and attack features apart. However, these methods are designed based on Euclidean distance, which lacks generalization ability for unseen attack detection due to poor hierarchy-embedding ability. Motivated by the evidence that different spoofing attacks are intrinsically hierarchical, we propose to learn richer hierarchical and discriminative spoofing cues in hyperbolic space. Specifically, for unimodal FAS learning, the feature embeddings are projected into the Poincaré ball, and a hyperbolic binary logistic regression layer is cascaded for classification. To further improve generalization, we conduct hyperbolic contrastive learning for the bonafide class only while relaxing the constraints on diverse spoofing attacks. To alleviate the vanishing gradient problem in hyperbolic space, a new feature clipping method is proposed to enhance the training stability of hyperbolic models. Besides, we further design a multimodal FAS framework with Euclidean multimodal feature decomposition and hyperbolic multimodal feature fusion and classification. Extensive experiments on three benchmark datasets (i.e., WMCA, PADISI-Face, and SiW-M) with diverse attack types demonstrate that the proposed method brings significant improvement over Euclidean baselines on unseen attack detection. In addition, the proposed framework also generalizes well on four benchmark datasets (i.e., MSU-MFSD, IDIAP REPLAY-ATTACK, CASIA-FASD, and OULU-NPU) with a limited number of attack types.
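
For readers unfamiliar with the hyperbolic machinery, the sketch below shows two standard ingredients mentioned in the abstract: the exponential map at the origin of the Poincaré ball and a norm-clipping step applied to Euclidean features before projection. The curvature value and clipping radius are illustrative assumptions, and the paper's exact clipping rule may differ.

```python
import torch

def clip_features(x: torch.Tensor, max_norm: float = 2.0) -> torch.Tensor:
    """Clip the Euclidean norm of features before projection; this keeps
    points away from the ball boundary, where hyperbolic gradients vanish."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return x * torch.clamp(max_norm / norm, max=1.0)

def expmap0(x: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Exponential map at the origin of the Poincaré ball with curvature c."""
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.tanh(sqrt_c * norm) * x / (sqrt_c * norm)

# Example: clip backbone embeddings, then project them into the ball
# before a hyperbolic binary logistic-regression head.
ball_points = expmap0(clip_features(torch.randn(8, 256)), c=1.0)
```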

Multi-scale Promoted Self-adjusting Correlation Learning for Facial Action Unit Detection

Aug 15, 2023
Xin Liu, Kaishen Yuan, Xuesong Niu, Jingang Shi, Zitong Yu, Huanjing Yue, Jingyu Yang

Facial Action Unit (AU) detection is a crucial task in affective computing and social robotics, as it helps identify emotions expressed through facial expressions. Anatomically, there are numerous correlations between AUs, which contain rich information and are vital for AU detection. Previous methods used fixed AU correlations based on expert experience or statistical rules on specific benchmarks, but it is challenging to comprehensively reflect complex correlations between AUs via hand-crafted settings. Alternative methods employ a fully connected graph to learn these dependencies exhaustively, but such approaches can result in a computational explosion and heavy dependence on large datasets. To address these challenges, this paper proposes a novel self-adjusting AU-correlation learning (SACL) method with less computation for AU detection. This method adaptively learns and updates AU correlation graphs by efficiently leveraging the characteristics of different levels of AU motion and emotion representation information extracted at different stages of the network. Moreover, this paper explores the role of multi-scale learning in correlation information extraction and designs a simple yet effective multi-scale feature learning (MSFL) method to promote better performance in AU detection. By integrating AU correlation information with multi-scale features, the proposed method obtains a more robust feature representation for the final AU detection. Extensive experiments show that the proposed method outperforms the state-of-the-art methods on widely used AU detection benchmark datasets, with only 28.7% and 12.0% of the parameters and FLOPs of the best comparison method, respectively. The code for this method is available at https://github.com/linuxsino/Self-adjusting-AU.
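
As a rough illustration of what self-adjusting correlation learning can look like (the abstract does not spell out SACL's exact graph construction), the sketch below re-estimates an AU adjacency matrix from the current per-AU features with a scaled dot product and propagates information along it; all layer names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAdjustingAUGraph(nn.Module):
    """Toy self-adjusting AU-correlation layer: the adjacency matrix is
    recomputed from the current per-AU features rather than fixed by
    hand-crafted rules or a fully connected graph."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, au_feats: torch.Tensor) -> torch.Tensor:
        # au_feats: (B, num_AUs, dim), one feature vector per action unit.
        q, k = self.query(au_feats), self.key(au_feats)
        adj = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        # Propagate information along the learned correlation graph.
        return au_feats + F.relu(self.update(adj @ au_feats))

# Example: 12 AU nodes with 128-d features per image.
out = SelfAdjustingAUGraph()(torch.randn(2, 12, 128))
```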

* 13 pages, 7 figures 

Visual Prompt Flexible-Modal Face Anti-Spoofing

Jul 26, 2023
Zitong Yu, Rizhao Cai, Yawen Cui, Ajian Liu, Changsheng Chen

Recently, vision transformer based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, multimodal face data collected from the real world is often imperfect due to missing modalities from various imaging sensors. Recently, flexible-modal FAS [yu2023flexible] has attracted more attention, which aims to develop a unified multimodal FAS model using complete multimodal face data that remains insensitive to test-time missing modalities. In this paper, we tackle one main challenge in flexible-modal FAS, i.e., when a modality is missing during either training or testing in real-world situations. Inspired by the recent success of prompt learning in language models, we propose Visual Prompt flexible-modal FAS (VP-FAS), which learns modal-relevant prompts to adapt a frozen pre-trained foundation model to the downstream flexible-modal FAS task. Specifically, both vanilla visual prompts and residual contextual prompts are plugged into multimodal transformers to handle general missing-modality cases, while requiring less than 4% learnable parameters compared to training the entire model. Furthermore, a missing-modality regularization is proposed to force models to learn consistent multimodal feature embeddings when partial modalities are missing. Extensive experiments conducted on two multimodal FAS benchmark datasets demonstrate the effectiveness of our VP-FAS framework, which improves performance under various missing-modality cases while alleviating the requirement of heavy model re-training.
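
A minimal sketch of the vanilla visual-prompt part is shown below, assuming prompts are prepended to the patch tokens of a frozen ViT branch; the residual contextual prompts and the missing-modality regularization are not reproduced, and the prompt count and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class VisualPrompts(nn.Module):
    """Toy visual-prompt wrapper: a small set of learnable prompt tokens is
    concatenated to the (frozen) patch tokens, so only the prompts and the
    task head need to be trained."""

    def __init__(self, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) from a frozen pre-trained ViT embedding.
        prompts = self.prompts.expand(patch_tokens.shape[0], -1, -1)
        return torch.cat([prompts, patch_tokens], dim=1)

# Example: prepend 8 learnable prompts to 196 RGB patch tokens; a missing
# modality's branch can rely on its prompts instead of real patch tokens.
tokens = VisualPrompts()(torch.randn(2, 196, 768))
```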

* arXiv admin note: text overlap with arXiv:2303.03369 by other authors 

Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization

Jun 29, 2023
Yingxin Lai, Zhiming Luo, Zitong Yu

The rapid advancements in computer vision have stimulated remarkable progress in face forgery techniques, capturing the dedicated attention of researchers committed to detecting forgeries and precisely localizing manipulated areas. Nonetheless, with limited fine-grained pixel-wise supervision labels, deepfake detection models perform unsatisfactorily on precise forgery detection and localization. To address this challenge, we introduce the well-trained vision segmentation foundation model, i.e., the Segment Anything Model (SAM), into face forgery detection and localization. Based on SAM, we propose the Detect Any Deepfakes (DADF) framework with a Multiscale Adapter, which can capture short- and long-range forgery contexts for efficient fine-tuning. Moreover, to better identify forged traces and augment the model's sensitivity towards forgery regions, a Reconstruction Guided Attention (RGA) module is proposed. The proposed framework seamlessly integrates end-to-end forgery localization and detection optimization. Extensive experiments on three benchmark datasets demonstrate the superiority of our approach for both forgery detection and localization. The code will be released soon at https://github.com/laiyingxin2/DADF.
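
The abstract only names the Multiscale Adapter, so the sketch below is one plausible reading: a bottleneck adapter with parallel dilated convolutions (short- and long-range context) added as a residual branch alongside frozen SAM encoder features. Channel sizes and dilation rates are assumptions, not the DADF configuration.

```python
import torch
import torch.nn as nn

class MultiscaleAdapter(nn.Module):
    """Toy multiscale adapter: a channel bottleneck followed by parallel
    dilated convolutions, added back to the frozen encoder features."""

    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Conv2d(dim, bottleneck, 1)
        self.branches = nn.ModuleList([
            nn.Conv2d(bottleneck, bottleneck, 3, padding=d, dilation=d)
            for d in (1, 2, 4)  # increasing receptive fields
        ])
        self.up = nn.Conv2d(bottleneck, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, dim, H, W) feature map from a frozen encoder block.
        h = torch.relu(self.down(x))
        h = sum(torch.relu(branch(h)) for branch in self.branches)
        return x + self.up(h)

out = MultiscaleAdapter()(torch.randn(1, 256, 64, 64))
```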

DEMIST: A deep-learning-based task-specific denoising approach for myocardial perfusion SPECT

Jun 14, 2023
Md Ashequr Rahman, Zitong Yu, Richard Laforest, Craig K. Abbey, Barry A. Siegel, Abhinav K. Jha

There is an important need for methods to process myocardial perfusion imaging (MPI) SPECT images acquired at lower radiation dose and/or shorter acquisition time such that the processed images improve observer performance on the clinical task of detecting perfusion defects. To address this need, we build upon concepts from model-observer theory and our understanding of the human visual system to propose a Detection task-specific deep-learning-based approach for denoising MPI SPECT images (DEMIST). The approach, while performing denoising, is designed to preserve features that influence observer performance on detection tasks. We objectively evaluated DEMIST on the task of detecting perfusion defects using a retrospective study with anonymized clinical data from patients who underwent MPI studies across two scanners (N = 338). The evaluation was performed at low-dose levels of 6.25%, 12.5%, and 25% using an anthropomorphic channelized Hotelling observer. Performance was quantified using the area under the receiver operating characteristic curve (AUC). Images denoised with DEMIST yielded significantly higher AUC compared to corresponding low-dose images and images denoised with a commonly used task-agnostic DL-based denoising method. Similar results were observed in stratified analyses based on patient sex and defect type. Additionally, DEMIST improved visual fidelity of the low-dose images as quantified using root mean squared error and the structural similarity index. A mathematical analysis revealed that DEMIST preserved features that assist in detection tasks while improving the noise properties, resulting in improved observer performance. The results provide strong evidence for further clinical evaluation of DEMIST to denoise low-count images in MPI SPECT.
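
The abstract does not disclose DEMIST's loss design; purely as a hedged illustration of what a "task-specific" denoising objective can look like, the sketch below combines a pixel-fidelity term with agreement of channelized responses (difference-of-Gaussians band-pass channels, a common stand-in for anthropomorphic observer channels) between denoised and normal-count images. All kernel sizes, sigmas, and weights are hypothetical.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int, sigma: float) -> torch.Tensor:
    """2-D Gaussian kernel normalized to sum to 1."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def dog_channels(size: int = 31, sigmas=(1.0, 2.0, 4.0, 8.0)) -> torch.Tensor:
    """Difference-of-Gaussians band-pass channels, shape (C, 1, size, size)."""
    kernels = [gaussian_kernel(size, s1) - gaussian_kernel(size, s2)
               for s1, s2 in zip(sigmas[:-1], sigmas[1:])]
    return torch.stack(kernels).unsqueeze(1)

def task_aware_loss(denoised, target, channels, alpha: float = 1.0):
    """Pixel fidelity plus agreement of channelized (detection-relevant)
    responses between the denoised and normal-count images."""
    pad = channels.shape[-1] // 2
    pix = F.mse_loss(denoised, target)
    ch = F.mse_loss(F.conv2d(denoised, channels, padding=pad),
                    F.conv2d(target, channels, padding=pad))
    return pix + alpha * ch

# Example with single-channel SPECT-like image tensors.
loss = task_aware_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64),
                       dog_channels())
```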

rPPG-MAE: Self-supervised Pre-training with Masked Autoencoders for Remote Physiological Measurement

Jun 04, 2023
Xin Liu, Yuting Zhang, Zitong Yu, Hao Lu, Huanjing Yue, Jingyu Yang

Remote photoplethysmography (rPPG) is an important technique for perceiving human vital signs and has received extensive attention. For a long time, researchers have focused on supervised methods that rely on large amounts of labeled data. These methods are limited by the requirement for large amounts of data and the difficulty of acquiring ground-truth physiological signals. To address these issues, several self-supervised methods based on contrastive learning have been proposed. However, they focus on contrastive learning between samples, neglecting the inherent self-similar prior in physiological signals, and seem to have a limited ability to cope with noise. In this paper, a linear self-supervised reconstruction task is designed to extract the inherent self-similar prior in physiological signals. Besides, a specific noise-insensitive strategy is explored to reduce the interference of motion and illumination. The proposed framework, namely rPPG-MAE, demonstrates excellent performance even on the challenging VIPL-HR dataset. We also evaluate the proposed method on two public datasets, namely PURE and UBFC-rPPG. The results show that our method not only outperforms existing self-supervised methods but also exceeds state-of-the-art (SOTA) supervised methods. One important observation is that the quality of the dataset seems to matter more than its size in self-supervised pre-training for rPPG. The source code is released at https://github.com/linuxsino/rPPG-MAE.
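
As a rough picture of the masked-reconstruction pre-training described above (simplified relative to a standard MAE, which encodes only the visible tokens), the sketch below replaces masked tokens of a tokenized spatio-temporal map with a learned mask token and scores the reconstruction only on the hidden positions. The tokenization, masking ratio, and model sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMaskedReconstruction(nn.Module):
    """Minimal masked-reconstruction objective on tokenized physiological
    maps: hide random tokens, encode, and reconstruct only the hidden ones."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, mask_ratio: float = 0.75):
        b, n, d = tokens.shape
        mask = torch.rand(b, n, device=tokens.device) < mask_ratio  # (B, N)
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(b, n, d), tokens)
        recon = self.head(self.encoder(corrupted))
        # Reconstruction is scored only where the input was hidden.
        return F.mse_loss(recon[mask], tokens[mask])

# Example: 49 tokens of a 64-d spatio-temporal representation per clip.
loss = ToyMaskedReconstruction()(torch.randn(2, 49, 64))
```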

FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing

May 05, 2023
Ajian Liu, Zichang Tan, Zitong Yu, Chenxu Zhao, Jun Wan, Yanyan Liang, Zhen Lei, Du Zhang, Stan Z. Li, Guodong Guo

The availability of handy multi-modal (i.e., RGB-D) sensors has brought about a surge of face anti-spoofing research. However, current multi-modal face presentation attack detection (PAD) has two defects: (1) The framework based on multi-modal fusion requires providing modalities consistent with the training input, which seriously limits deployment scenarios. (2) The performance of ConvNet-based models on high-fidelity datasets is increasingly limited. In this work, we present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT), for face anti-spoofing, which can flexibly target any single-modal (i.e., RGB) attack scenario with the help of available multi-modal data. Specifically, FM-ViT retains a specific branch for each modality to capture different modal information and introduces the Cross-Modal Transformer Block (CMTB), which consists of two cascaded attentions, Multi-headed Mutual-Attention (MMA) and Fusion-Attention (MFA), to guide each modal branch to mine potential features from informative patch tokens and to learn modality-agnostic liveness features by enriching the modal information of its own CLS token, respectively. Experiments demonstrate that a single model trained with FM-ViT can not only flexibly evaluate different modal samples but also outperforms existing single-modal frameworks by a large margin, and approaches multi-modal frameworks while introducing fewer FLOPs and model parameters.
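
The precise MMA/MFA formulations are not given in the abstract; the sketch below only shows the generic cross-modal attention pattern they build on, where queries from one modal branch attend to keys and values from another. Dimensions and the single-attention layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Toy cross-modal attention: one branch's tokens query another branch's
    tokens, letting each modality mine informative patches from its counterpart."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor):
        # tokens_a / tokens_b: (B, N, dim) from, e.g., the RGB and depth branches.
        fused, _ = self.attn(query=tokens_a, key=tokens_b, value=tokens_b)
        return self.norm(tokens_a + fused)

# Example: enrich RGB-branch tokens with depth-branch information.
rgb, depth = torch.randn(2, 197, 768), torch.randn(2, 197, 768)
out = CrossModalAttention()(rgb, depth)
```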

* 12 pages, 7 figures 