Pan Zhou

The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology

Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector

Oct 30, 2024

Youcheng Huang, Fengbin Zhu, Jingkun Tang, Pan Zhou, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua

Abstract: Visual Language Models (VLMs) are vulnerable to adversarial attacks, especially those from adversarial images, yet this threat remains under-explored in the literature. To facilitate research on this critical safety problem, we first construct a new laRge-scale Adversarial images dataset with Diverse hArmful Responses (RADAR), given that existing datasets are either small-scale or contain only limited types of harmful responses. With the new RADAR dataset, we further develop a novel and effective iN-time Embedding-based AdveRSarial Image DEtection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of VLMs, which we call the attacking direction, to distinguish adversarial images from benign ones in the input. Extensive experiments with two victim VLMs, LLaVA and MiniGPT-4, demonstrate the effectiveness, efficiency, and cross-model transferability of our proposed method. Our code is available at https://github.com/mob-scu/RADAR-NEARSIDE
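
The idea of detecting adversarial inputs with a single direction in embedding space can be illustrated with a minimal sketch. This is not the authors' implementation: the mean-difference construction of the direction, the midpoint threshold, and the synthetic Gaussian "embeddings" are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def attacking_direction(benign, adversarial):
    """Assumed construction: unit-normalized difference of the class means."""
    d = adversarial.mean(axis=0) - benign.mean(axis=0)
    return d / np.linalg.norm(d)

def fit_threshold(benign, adversarial, direction):
    """Assumed rule: midpoint between the mean projections of the two classes."""
    return 0.5 * ((benign @ direction).mean() + (adversarial @ direction).mean())

def is_adversarial(embedding, direction, threshold):
    """Flag an input whose projection onto the direction exceeds the threshold."""
    return float(embedding @ direction) > threshold

# Synthetic stand-ins for hidden states (hypothetical data, not RADAR).
rng = np.random.default_rng(0)
dim = 64
shift = np.zeros(dim)
shift[0] = 4.0  # adversarial samples are displaced along one axis
benign = rng.normal(size=(200, dim))
adversarial = rng.normal(size=(200, dim)) + shift

direction = attacking_direction(benign, adversarial)
threshold = fit_threshold(benign, adversarial, direction)

flags_benign = [is_adversarial(e, direction, threshold) for e in benign]
flags_adv = [is_adversarial(e, direction, threshold) for e in adversarial]
accuracy = (flags_benign.count(False) + flags_adv.count(True)) / 400
```

Detection in this sketch costs one dot product per input, which is why a single-vector detector can be cheap at inference time.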

Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Oct 29, 2024

Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Bo Li, Yang Tang, Pan Zhou

Abstract: Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities such as depth, infrared, and event. This limitation hinders their performance and restricts their application in real-world multimodal scenarios. To address this issue, we propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task, which utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities. Specifically, MADM comprises two complementary components that tackle the major challenges. First, due to the large modality gap, using data from one modality to generate pseudo labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion-based pseudo-label generation, which adds latent noise to stabilize pseudo-labels and enhance label accuracy. Second, to overcome the limitations of low-resolution latent features in diffusion models, MADM introduces the label palette and latent regression, which convert one-hot encoded labels into RGB form via a palette and regress them in the latent space, so that the pre-trained decoder can be used for up-sampling to obtain fine-grained features. Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities. We open-source our code and models at https://github.com/XiaRho/MADM.

* NeurIPS 2024
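
The label-palette step mentioned in the abstract (one-hot labels converted to RGB) can be sketched as a simple lookup. The palette colors, the function names, and the nearest-color decoding are assumptions for illustration; the latent regression through the diffusion model's decoder is not shown.

```python
import numpy as np

# Hypothetical 4-class palette: one RGB color per class index.
PALETTE = np.array(
    [[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.float32
)

def one_hot_to_rgb(one_hot, palette=PALETTE):
    """Map (H, W, C) one-hot labels to an (H, W, 3) RGB label image."""
    return one_hot.astype(np.float32) @ palette

def rgb_to_labels(rgb, palette=PALETTE):
    """Decode an RGB label image by choosing the nearest palette color per pixel."""
    dists = np.linalg.norm(rgb[..., None, :] - palette, axis=-1)  # (H, W, C)
    return dists.argmin(axis=-1)

labels = np.array([[0, 1], [2, 3]])
one_hot = np.eye(len(PALETTE))[labels]  # (2, 2, 4)
rgb = one_hot_to_rgb(one_hot)           # (2, 2, 3)

# Small perturbations, standing in for regression error, still decode correctly.
noisy = rgb + np.random.default_rng(0).normal(scale=10.0, size=rgb.shape)
recovered = rgb_to_labels(noisy)
```

Encoding labels as a 3-channel image is what lets an image decoder operate on them at all; the nearest-color decode shown here is only one plausible way to read class indices back out.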

Two are better than one: Context window extension with multi-grained self-injection

Oct 25, 2024

Wei Han, Pan Zhou, Soujanya Poria, Shuicheng Yan

Abstract: The limited context window of contemporary large language models (LLMs) remains a major barrier to their broader application across various domains. While continual pre-training on long-context data is a straightforward and effective solution, it incurs substantial costs in terms of data acquisition and computational resources. To alleviate this issue, we propose SharedLLM, a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval. SharedLLM is composed of two short-context LLMs such as LLaMA-2, termed the upper model and the lower model. The lower model functions as a compressor while the upper model acts as a decoder. The upper model receives compressed, multi-grained context information from the lower model and performs context-aware modeling on the running text. Information transfer between the compressor and decoder occurs only at the lowest layers, which avoids long forward paths in the lower model and redundant cross-attention modules in the upper model. Based on this architecture, we introduce a specialized tree-style data structure to efficiently encode, store, and retrieve multi-grained contextual information for text chunks. This structure, combined with a search algorithm, enables rapid encoding and retrieval of relevant information from various levels of the tree based on the input query. This entire process, wherein the sender and receiver are derived from the same LLM layer, is referred to as self-injection.

* The code is available at https://github.com/Clement25/SharedLLM

Towards Understanding Why FixMatch Generalizes Better Than Supervised Learning

Oct 15, 2024

Jingyang Li, Jiachun Pan, Vincent Y. F. Tan, Kim-Chuan Toh, Pan Zhou

Abstract: Semi-supervised learning (SSL), exemplified by FixMatch (Sohn et al., 2020), has shown significant generalization advantages over supervised learning (SL), particularly in the context of deep neural networks (DNNs). However, it is still unclear, from a theoretical standpoint, why FixMatch-like SSL algorithms generalize better than SL on DNNs. In this work, we present the first theoretical justification for the enhanced test accuracy observed in FixMatch-like SSL applied to DNNs by taking convolutional neural networks (CNNs) on classification tasks as an example. Our theoretical analysis reveals that the semantic feature learning processes in FixMatch and SL are rather different. In particular, FixMatch learns all the discriminative features of each semantic class, while SL only randomly captures a subset of features due to the well-known lottery ticket hypothesis. Furthermore, we show that our analysis framework can be applied to other FixMatch-like SSL methods, e.g., FlexMatch, FreeMatch, Dash, and SoftMatch. Inspired by our theoretical analysis, we develop an improved variant of FixMatch, termed Semantic-Aware FixMatch (SA-FixMatch). Experimental results corroborate our theoretical findings and the enhanced generalization capability of SA-FixMatch.
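
For readers unfamiliar with the baseline being analyzed, the unlabeled-data loss of standard FixMatch (Sohn et al., 2020) can be sketched as follows. This shows the original FixMatch objective only, not SA-FixMatch or the paper's theoretical framework; the function names and toy logits are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fixmatch_unlabeled_loss(weak_logits, strong_logits, tau=0.95):
    """Cross-entropy on strongly augmented views against confident pseudo-labels.

    Pseudo-labels come from predictions on weakly augmented views and are kept
    only where the maximum class probability is at least tau.
    """
    probs_weak = softmax(weak_logits)
    pseudo = probs_weak.argmax(axis=1)
    mask = probs_weak.max(axis=1) >= tau
    log_probs_strong = np.log(softmax(strong_logits))
    ce = -log_probs_strong[np.arange(len(pseudo)), pseudo]
    # Averaged over the whole unlabeled batch; masked samples contribute zero.
    return float((ce * mask).mean()), mask

# Two unlabeled samples: the first is predicted confidently, the second is not.
weak_logits = np.array([[10.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
strong_logits = np.zeros((2, 3))
loss, mask = fixmatch_unlabeled_loss(weak_logits, strong_logits)
```

Only the first sample passes the confidence threshold here, so the second contributes nothing to the loss.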

SubZero: Random Subspace Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning

Oct 11, 2024

Ziming Yu, Pan Zhou, Sike Wang, Jia Li, Hua Huang

Abstract: Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of gradient estimates typically scales linearly with the model's parameter dimension, a significant issue for LLMs. In this paper, we propose the random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs' high dimensionality. We introduce a low-rank perturbation tailored for LLMs that significantly reduces memory consumption while improving training performance. Additionally, we prove that our gradient estimation closely approximates the backpropagation gradient, exhibits lower variance than traditional ZO methods, and ensures convergence when combined with SGD. Experimental results show that SubZero enhances fine-tuning performance and achieves faster convergence compared to standard ZO approaches like MeZO across various language modeling tasks.
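
A zeroth-order gradient estimate with a low-rank perturbation can be sketched on a toy problem. The sketch assumes a perturbation of the form Z = U S V^T with random orthonormal U and V and a Gaussian S, resampled at every step, and a quadratic loss standing in for a model; these choices, the hyperparameters, and the function names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def random_orthonormal(rows, rank, rng):
    """Return a rows x rank matrix with orthonormal columns (via reduced QR)."""
    q, _ = np.linalg.qr(rng.normal(size=(rows, rank)))
    return q

def low_rank_zo_gradient(loss_fn, weight, rank, eps, rng):
    """Two-point zeroth-order estimate along a random low-rank perturbation."""
    m, n = weight.shape
    u = random_orthonormal(m, rank, rng)
    v = random_orthonormal(n, rank, rng)
    s = rng.normal(size=(rank, rank))
    z = u @ s @ v.T  # perturbation of rank at most `rank`
    # Central difference from two forward evaluations; no backpropagation.
    slope = (loss_fn(weight + eps * z) - loss_fn(weight - eps * z)) / (2 * eps)
    return slope * z

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 6))

def loss_fn(w):
    """Toy quadratic loss standing in for a model's training loss."""
    return 0.5 * float(np.sum((w - target) ** 2))

weight = np.zeros((6, 6))
initial_loss = loss_fn(weight)
for _ in range(300):
    weight -= 0.1 * low_rank_zo_gradient(loss_fn, weight, rank=2, eps=1e-3, rng=rng)
final_loss = loss_fn(weight)
```

Each step needs only two loss evaluations, and the perturbation is fully described by the small factors `u`, `s`, and `v` rather than a dense random matrix the size of the weights.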

Towards Natural Image Matting in the Wild via Real-Scenario Prior

Oct 09, 2024

Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Qianru Sun, Yang Tang, Bo Li, Pan Zhou

Abstract: Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine-tune the models on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex and occluded scenes. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO-Matting. Specifically, the construction of COCO-Matting includes accessory fusion and mask-to-matte, which select real-world complex images from COCO and convert semantic segmentation masks to matting labels. The resulting COCO-Matting comprises an extensive collection of 38,251 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and train only a lightweight matting decoder with end-to-end matting losses, which does not fully exploit the potential of the pre-trained SAM. Thus, we propose SEMat, which revamps the network architecture and training objectives. For network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features. The proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes. For training objectives, the proposed regularization and trimap loss aim to retain the prior from the pre-trained model and push the matting logits extracted from the mask decoder to contain trimap-based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method, proving its efficacy in interactive natural image matting. We open-source our code, models, and dataset at https://github.com/XiaRho/SEMat.

The Impact of Large Language Models in Academia: from Writing to Speaking

Sep 20, 2024

Mingmeng Geng, Caixi Chen, Yanru Wu, Dongping Chen, Yao Wan, Pan Zhou

Abstract: Large language models (LLMs) are increasingly impacting human society, particularly in textual information. Based on more than 30,000 papers and 1,000 presentations from machine learning conferences, we examined and compared the words used in writing and speaking, representing the first large-scale investigative study of how LLMs influence the two main modes of verbal communication and expression within the same group of people. Our empirical results show that LLM-style words such as "significant" have been used more frequently in abstracts and oral presentations. The impact on speaking is beginning to emerge and is likely to grow in the future, calling attention to the implicit influence and ripple effect of LLMs on human society.

* 16 pages

LPT++: Efficient Training on Mixture of Long-tailed Experts

Sep 17, 2024

Bowen Dong, Pan Zhou, Wangmeng Zuo

Abstract: We introduce LPT++, a comprehensive framework for long-tailed classification that combines parameter-efficient fine-tuning (PEFT) with a learnable model ensemble. LPT++ enhances frozen Vision Transformers (ViTs) through the integration of three core components. The first is a universal long-tailed adaptation module, which aggregates long-tailed prompts and visual adapters to adapt the pretrained model to the target domain while improving its discriminative ability. The second is the mixture of long-tailed experts framework with a mixture-of-experts (MoE) scorer, which adaptively calculates reweighting coefficients for confidence scores from both visual-only and visual-language (VL) model experts to generate more accurate predictions. Finally, LPT++ employs a three-phase training framework, wherein each critical module is learned separately, resulting in a stable and effective long-tailed classification training paradigm. In addition, we propose a simple version of LPT++, namely LPT, which integrates only a visual-only pretrained ViT and long-tailed prompts to form a single-model method. LPT clearly illustrates how long-tailed prompts work while achieving comparable performance without VL pretrained models. Experiments show that, with only ~1% extra trainable parameters, LPT++ achieves accuracy comparable to all the counterparts.

* Extended version of arXiv:2210.01033

MoExtend: Tuning New Experts for Modality and Task Extension

Aug 07, 2024

Shanshan Zhong, Shanghua Gao, Zhongzhan Huang, Wushao Wen, Marinka Zitnik, Pan Zhou

Abstract: Large language models (LLMs) excel in various tasks but are primarily trained on text data, limiting their application scope. Expanding LLM capabilities to include vision-language understanding is vital, yet training them on multimodal data from scratch is challenging and costly. Existing instruction tuning methods, e.g., LLaVA, often connect a pretrained CLIP vision encoder and LLMs by fully fine-tuning the LLMs to bridge the modality gap. However, full fine-tuning is plagued by catastrophic forgetting, i.e., forgetting previous knowledge, and by high training costs, particularly in an era of increasing tasks and modalities. To solve this issue, we introduce MoExtend, an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models. MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models such as MoE and vision encoders. This approach enables rapid adaptation and extension to new modal data or tasks, effectively addressing the challenge of accommodating new modalities within LLMs. Furthermore, MoExtend avoids tuning pretrained models, thus mitigating the risk of catastrophic forgetting. Experimental results demonstrate the efficacy and efficiency of MoExtend in enhancing the multimodal capabilities of LLMs, contributing to advancements in multimodal AI research. Code: https://github.com/zhongshsh/MoExtend.

* ACL 2024 - SRW

Can Large Language Models Automatically Jailbreak GPT-4V?

Jul 23, 2024

Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun

Abstract: GPT-4V has attracted considerable attention due to its extraordinary capacity for integrating and processing multimodal information. At the same time, its face recognition ability raises new safety concerns about privacy leakage. Despite researchers' efforts in safety alignment through RLHF or preprocessing filters, vulnerabilities might still be exploited. In our study, we introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. We leverage Large Language Models (LLMs) for red-teaming to refine the jailbreak prompt and employ weak-to-strong in-context learning prompts to boost efficiency. Furthermore, we present an effective search method that incorporates early stopping to minimize optimization time and token expenditure. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%. This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.

* TrustNLP@NAACL2024 (Fourth Workshop on Trustworthy Natural Language Processing)
