Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bin Li

Member, IEEE

Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

Sep 04, 2024

Kaiqing Lin, Yuzhen Lin, Weixiang Li, Taiping Yao, Bin Li

Figure 1 for Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

Figure 2 for Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

Figure 3 for Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

Figure 4 for Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

Abstract:The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performances are at less cost of trainable parameters, making it a promising approach for real-world applications.

Via

Access Paper or Ask Questions

SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors

Aug 28, 2024

Zhiqing Zhang, Tianyong Liu, Guojia Fan, Bin Li, Qianjin Feng, Shoujun Zhou

Figure 1 for SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors

Figure 2 for SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors

Figure 3 for SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors

Figure 4 for SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors

Abstract:Accurate segmentation of 3D clinical medical images is critical in the diagnosis and treatment of spinal diseases. However, the inherent complexity of spinal anatomy and uncertainty inherent in current imaging technologies, poses significant challenges for semantic segmentation of spinal images. Although convolutional neural networks (CNNs) and Transformer-based models have made some progress in spinal segmentation, their limitations in handling long-range dependencies hinder further improvements in segmentation accuracy.To address these challenges, we introduce a residual visual Mamba layer to effectively capture and model the deep semantic features and long-range spatial dependencies of 3D spinal data. To further enhance the structural semantic understanding of the vertebrae, we also propose a novel spinal shape prior module that captures specific anatomical information of the spine from medical images, significantly enhancing the model's ability to extract structural semantic information of the vertebrae. Comparative and ablation experiments on two datasets demonstrate that SpineMamba outperforms existing state-of-the-art models. On the CT dataset, the average Dice similarity coefficient for segmentation reaches as high as 94.40, while on the MR dataset, it reaches 86.95. Notably, compared to the renowned nnU-Net, SpineMamba achieves superior segmentation performance, exceeding it by up to 2 percentage points. This underscores its accuracy, robustness, and excellent generalization capabilities.

* 17 pages, 11 figures

Via

Access Paper or Ask Questions

Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Aug 21, 2024

Ziqiang Li, Yueqi Zeng, Pengfei Xia, Lei Liu, Zhangjie Fu, Bin Li

Figure 1 for Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Figure 2 for Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Figure 3 for Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Figure 4 for Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Abstract:With the burgeoning advancements in the field of natural language processing (NLP), the demand for training data has increased significantly. To save costs, it has become common for users and businesses to outsource the labor-intensive task of data collection to third-party entities. Unfortunately, recent research has unveiled the inherent risk associated with this practice, particularly in exposing NLP systems to potential backdoor attacks. Specifically, these attacks enable malicious control over the behavior of a trained model by poisoning a small portion of the training data. Unlike backdoor attacks in computer vision, textual backdoor attacks impose stringent requirements for attack stealthiness. However, existing attack methods meet significant trade-off between effectiveness and stealthiness, largely due to the high information entropy inherent in textual data. In this paper, we introduce the Efficient and Stealthy Textual backdoor attack method, EST-Bad, leveraging Large Language Models (LLMs). Our EST-Bad encompasses three core strategies: optimizing the inherent flaw of models as the trigger, stealthily injecting triggers with LLMs, and meticulously selecting the most impactful samples for backdoor injection. Through the integration of these techniques, EST-Bad demonstrates an efficient achievement of competitive attack performance while maintaining superior stealthiness compared to prior methods across various text classifier datasets.

* Under Review

Via

Access Paper or Ask Questions

SZU-AFS Antispoofing System for the ASVspoof 5 Challenge

Aug 19, 2024

Yuxiong Xu, Jiafeng Zhong, Sengui Zheng, Zefeng Liu, Bin Li

Abstract:This paper presents the SZU-AFS anti-spoofing system, designed for Track 1 of the ASVspoof 5 Challenge under open conditions. The system is built with four stages: selecting a baseline model, exploring effective data augmentation (DA) methods for fine-tuning, applying a co-enhancement strategy based on gradient norm aware minimization (GAM) for secondary fine-tuning, and fusing logits scores from the two best-performing fine-tuned models. The system utilizes the Wav2Vec2 front-end feature extractor and the AASIST back-end classifier as the baseline model. During model fine-tuning, three distinct DA policies have been investigated: single-DA, random-DA, and cascade-DA. Moreover, the employed GAM-based co-enhancement strategy, designed to fine-tune the augmented model at both data and optimizer levels, helps the Adam optimizer find flatter minima, thereby boosting model generalization. Overall, the final fusion system achieves a minDCF of 0.115 and an EER of 4.04% on the evaluation set.

* 8 pages, 2 figures, ASVspoof 5 Workshop (Interspeech2024 Satellite)

Via

Access Paper or Ask Questions

5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

Aug 15, 2024

Dongshuo Yin, Leiyi Hu, Bin Li, Youqun Zhang, Xue Yang

Figure 1 for 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

Figure 2 for 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

Figure 3 for 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

Figure 4 for 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

Abstract:Pre-training & fine-tuning can enhance the transferring efficiency and performance in visual tasks. Recent delta-tuning methods provide more options for visual classification tasks. Despite their success, existing visual delta-tuning art fails to exceed the upper limit of full fine-tuning on challenging tasks like object detection and segmentation. To find a competitive alternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter (Mona) tuning, a novel adapter-based tuning method. First, we introduce multiple vision-friendly filters into the adapter to enhance its ability to process visual signals, while previous methods mainly rely on language-friendly linear filters. Second, we add the scaled normalization layer in the adapter to regulate the distribution of input features for visual filters. To fully demonstrate the practicality and generality of Mona, we conduct experiments on multiple representative visual tasks, including instance segmentation on COCO, semantic segmentation on ADE20K, object detection on Pascal VOC, oriented object detection on DOTA/STAR, and image classification on three common datasets. Exciting results illustrate that Mona surpasses full fine-tuning on all these tasks, and is the only delta-tuning method outperforming full fine-tuning on the above various tasks. For example, Mona achieves 1% performance gain on the COCO dataset compared to full fine-tuning. Comprehensive results suggest that Mona-tuning is more suitable for retaining and utilizing the capabilities of pre-trained models than full fine-tuning. We will make the code publicly available.

* arXiv admin note: substantial text overlap with arXiv:2311.15010

Via

Access Paper or Ask Questions

Enhanced Scale-aware Depth Estimation for Monocular Endoscopic Scenes with Geometric Modeling

Aug 14, 2024

Ruofeng Wei, Bin Li, Kai Chen, Yiyao Ma, Yunhui Liu, Qi Dou

Figure 1 for Enhanced Scale-aware Depth Estimation for Monocular Endoscopic Scenes with Geometric Modeling

Figure 2 for Enhanced Scale-aware Depth Estimation for Monocular Endoscopic Scenes with Geometric Modeling

Figure 3 for Enhanced Scale-aware Depth Estimation for Monocular Endoscopic Scenes with Geometric Modeling

Figure 4 for Enhanced Scale-aware Depth Estimation for Monocular Endoscopic Scenes with Geometric Modeling

Abstract:Scale-aware monocular depth estimation poses a significant challenge in computer-aided endoscopic navigation. However, existing depth estimation methods that do not consider the geometric priors struggle to learn the absolute scale from training with monocular endoscopic sequences. Additionally, conventional methods face difficulties in accurately estimating details on tissue and instruments boundaries. In this paper, we tackle these problems by proposing a novel enhanced scale-aware framework that only uses monocular images with geometric modeling for depth estimation. Specifically, we first propose a multi-resolution depth fusion strategy to enhance the quality of monocular depth estimation. To recover the precise scale between relative depth and real-world values, we further calculate the 3D poses of instruments in the endoscopic scenes by algebraic geometry based on the image-only geometric primitives (i.e., boundaries and tip of instruments). Afterwards, the 3D poses of surgical instruments enable the scale recovery of relative depth maps. By coupling scale factors and relative depth estimation, the scale-aware depth of the monocular endoscopic scenes can be estimated. We evaluate the pipeline on in-house endoscopic surgery videos and simulated data. The results demonstrate that our method can learn the absolute scale with geometric modeling and accurately estimate scale-aware depth for monocular scenes.

Via

Access Paper or Ask Questions

Enhancing Partially Spoofed Audio Localization with Boundary-aware Attention Mechanism

Jul 31, 2024

Jiafeng Zhong, Bin Li, Jiangyan Yi

Abstract:The task of partially spoofed audio localization aims to accurately determine audio authenticity at a frame level. Although some works have achieved encouraging results, utilizing boundary information within a single model remains an unexplored research topic. In this work, we propose a novel method called Boundary-aware Attention Mechanism (BAM). Specifically, it consists of two core modules: Boundary Enhancement and Boundary Frame-wise Attention. The former assembles the intra-frame and inter-frame information to extract discriminative boundary features that are subsequently used for boundary position detection and authenticity decision, while the latter leverages boundary prediction results to explicitly control the feature interaction between frames, which achieves effective discrimination between real and fake frames. Experimental results on PartialSpoof database demonstrate our proposed method achieves the best performance. The code is available at https://github.com/media-sec-lab/BAM.

Via

Access Paper or Ask Questions

FedDEO: Description-Enhanced One-Shot Federated Learning with Diffusion Models

Jul 29, 2024

Mingzhao Yang, Shangchao Su, Bin Li, Xiangyang Xue

Abstract:In recent years, the attention towards One-Shot Federated Learning (OSFL) has been driven by its capacity to minimize communication. With the development of the diffusion model (DM), several methods employ the DM for OSFL, utilizing model parameters, image features, or textual prompts as mediums to transfer the local client knowledge to the server. However, these mediums often require public datasets or the uniform feature extractor, significantly limiting their practicality. In this paper, we propose FedDEO, a Description-Enhanced One-Shot Federated Learning Method with DMs, offering a novel exploration of utilizing the DM in OSFL. The core idea of our method involves training local descriptions on the clients, serving as the medium to transfer the knowledge of the distributed clients to the server. Firstly, we train local descriptions on the client data to capture the characteristics of client distributions, which are then uploaded to the server. On the server, the descriptions are used as conditions to guide the DM in generating synthetic datasets that comply with the distributions of various clients, enabling the training of the aggregated model. Theoretical analyses and sufficient quantitation and visualization experiments on three large-scale real-world datasets demonstrate that through the training of local descriptions, the server is capable of generating synthetic datasets with high quality and diversity. Consequently, with advantages in communication and privacy protection, the aggregated model outperforms compared FL or diffusion-based OSFL methods and, on some clients, outperforms the performance ceiling of centralized training.

* Accepted by MM 24

Via

Access Paper or Ask Questions

Robust Deep Hawkes Process under Label Noise of Both Event and Occurrence

Jul 29, 2024

Xiaoyu Tan, Bin Li, Xihe Qiu, Jingjing Huang, Yinghui Xu, Wei Chu

Figure 1 for Robust Deep Hawkes Process under Label Noise of Both Event and Occurrence

Figure 2 for Robust Deep Hawkes Process under Label Noise of Both Event and Occurrence

Figure 3 for Robust Deep Hawkes Process under Label Noise of Both Event and Occurrence

Figure 4 for Robust Deep Hawkes Process under Label Noise of Both Event and Occurrence

Abstract:Integrating deep neural networks with the Hawkes process has significantly improved predictive capabilities in finance, health informatics, and information technology. Nevertheless, these models often face challenges in real-world settings, particularly due to substantial label noise. This issue is of significant concern in the medical field, where label noise can arise from delayed updates in electronic medical records or misdiagnoses, leading to increased prediction risks. Our research indicates that deep Hawkes process models exhibit reduced robustness when dealing with label noise, particularly when it affects both event types and timing. To address these challenges, we first investigate the influence of label noise in approximated intensity functions and present a novel framework, the Robust Deep Hawkes Process (RDHP), to overcome the impact of label noise on the intensity function of Hawkes models, considering both the events and their occurrences. We tested RDHP using multiple open-source benchmarks with synthetic noise and conducted a case study on obstructive sleep apnea-hypopnea syndrome (OSAHS) in a real-world setting with inherent label noise. The results demonstrate that RDHP can effectively perform classification and regression tasks, even in the presence of noise related to events and their timing. To the best of our knowledge, this is the first study to successfully address both event and time label noise in deep Hawkes process models, offering a promising solution for medical applications, specifically in diagnosing OSAHS.

* ECAI2024

Via

Access Paper or Ask Questions

EAFormer: Scene Text Segmentation with Edge-Aware Transformers

Jul 24, 2024

Haiyang Yu, Teng Fu, Bin Li, Xiangyang Xue

Abstract:Scene text segmentation aims at cropping texts from scene images, which is usually used to help generative models edit or remove texts. The existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edge of texts. Specifically, we first design a text edge extractor to detect edges and filter out edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly-used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method can perform better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our methods, we have relabeled these datasets. Through experiments, we observe that our method can achieve a higher performance improvement when more accurate annotations are used for training.

* ECCV 2024

Via

Access Paper or Ask Questions