Yichen Zhu

Query-Relevant Images Jailbreak Large Multi-Modal Models

Nov 29, 2023
Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao

Warning: This paper contains examples of harmful language and images; reader discretion is advised. The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Large Multi-Modal Models (LMMs) remains understudied. In our study, we present a novel visual prompt attack that exploits query-relevant images to jailbreak open-source LMMs. Our method creates a composite image from one image generated by diffusion models and another that displays the text as typography, based on keywords extracted from a malicious query. We show that LMMs can be easily attacked by our approach, even when the employed Large Language Models are safely aligned. To evaluate the extent of this vulnerability in open-source LMMs, we have compiled a substantial dataset encompassing 13 scenarios with a total of 5,040 text-image pairs, using our presented attack technique. Our evaluation of 12 cutting-edge LMMs using this dataset shows the vulnerability of existing multi-modal models to adversarial attacks. This finding underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source LMMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench.
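
The composite-image layout described in the abstract can be sketched in a few lines. The snippet below is only an illustration of that layout (a diffusion-generated image stacked above a typography rendering of the extracted keyword), not the released MM-SafetyBench code; the placeholder image and all function names are assumptions.

```python
# Minimal sketch of the composite-image construction: a diffusion-generated
# image (replaced here by a placeholder) stacked on top of a typography
# rendering of the extracted keyword. Function names are illustrative only.
from PIL import Image, ImageDraw, ImageFont

def render_typography(keyword: str, width: int = 512, height: int = 128) -> Image.Image:
    """Render the extracted keyword as a plain typography image."""
    canvas = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(canvas)
    draw.text((10, height // 2), keyword, fill="black", font=ImageFont.load_default())
    return canvas

def compose_query_relevant_image(diffusion_image: Image.Image, keyword: str) -> Image.Image:
    """Stack the diffusion-generated image above the typography image."""
    typo = render_typography(keyword, width=diffusion_image.width)
    composite = Image.new("RGB", (diffusion_image.width,
                                  diffusion_image.height + typo.height), "white")
    composite.paste(diffusion_image, (0, 0))
    composite.paste(typo, (0, diffusion_image.height))
    return composite

if __name__ == "__main__":
    # The placeholder stands in for the output of a text-to-image diffusion model.
    placeholder = Image.new("RGB", (512, 512), "gray")
    compose_query_relevant_image(placeholder, "keyword extracted from the query").save("composite.png")
```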

* Technical report 

Unsupervised Discovery of Interpretable Directions in h-space of Pre-trained Diffusion Models

Oct 15, 2023
Zijian Zhang, Luping Liu, Zhijie Lin, Yichen Zhu, Zhou Zhao

We propose the first unsupervised and learning-based method to identify interpretable directions in the h-space of pre-trained diffusion models. Our method is derived from an existing technique that operates on the GAN latent space. In a nutshell, we employ a shift control module for pre-trained diffusion models to manipulate a sample into a shifted version of itself, followed by a reconstructor that reproduces both the type and the strength of the manipulation. By jointly optimizing them, the model spontaneously discovers disentangled and interpretable directions. To prevent the discovery of meaningless and destructive directions, we employ a discriminator to maintain the fidelity of the shifted sample. Due to the iterative generative process of diffusion models, training requires a substantial amount of GPU VRAM to store the numerous intermediate tensors needed for back-propagating gradients. To address this issue, we first propose a general VRAM-efficient training algorithm based on the gradient checkpointing technique that back-propagates any gradient through the whole generative process, with acceptable VRAM occupancy at the cost of some training efficiency. Compared with existing related works on diffusion models, our method inherently identifies global and scalable directions, without requiring any other complicated procedures. Extensive experiments on various datasets demonstrate the effectiveness of our method.
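
As a rough illustration of the joint optimization described above, the sketch below trains a bank of learnable directions together with a reconstructor that recovers the direction index and shift strength. The diffusion model, discriminator, and gradient-checkpointed back-propagation are omitted, and all dimensions and names are assumptions rather than the paper's implementation.

```python
# Toy sketch: learnable h-space directions + a reconstructor that predicts
# which direction was applied and with what strength, optimized jointly.
import torch
import torch.nn as nn
import torch.nn.functional as F

H_DIM, NUM_DIRS = 512, 10
directions = nn.Parameter(torch.randn(NUM_DIRS, H_DIM) * 0.01)   # learnable directions
reconstructor = nn.Sequential(                                    # predicts (index, strength)
    nn.Linear(2 * H_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_DIRS + 1)
)
optimizer = torch.optim.Adam([directions, *reconstructor.parameters()], lr=1e-4)

for step in range(100):
    h = torch.randn(8, H_DIM)                        # stand-in for h-space features
    idx = torch.randint(0, NUM_DIRS, (8,))           # which direction is applied
    alpha = torch.randn(8, 1)                        # manipulation strength
    h_shifted = h + alpha * F.normalize(directions, dim=-1)[idx]
    out = reconstructor(torch.cat([h, h_shifted], dim=-1))
    loss = (F.cross_entropy(out[:, :NUM_DIRS], idx)
            + F.mse_loss(out[:, NUM_DIRS], alpha.squeeze(1)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```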

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding

Jul 25, 2023
Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

3D visual grounding aims to localize a target object in a 3D point cloud given a free-form language description. Typically, sentences describing the target object provide information about its relative relations to other objects and its position within the whole scene. In this work, we propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net), which can effectively capture the relative spatial relationships between objects and enhance object attributes. Specifically, 1) we propose a 3D Relative Position Multi-head Attention (3DRP-MA) module to analyze relative relations from different directions in the context of object pairs, which helps the model focus on the specific object relations mentioned in the sentence. 2) We design a soft-labeling strategy to alleviate the spatial ambiguity caused by redundant points, which further stabilizes and enhances the learning process through a constant and discriminative distribution. Extensive experiments conducted on three benchmarks (i.e., ScanRefer and Nr3D/Sr3D) demonstrate that our method generally outperforms all state-of-the-art methods. The source code will be released on GitHub.
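
The general idea of relative-position-aware attention between object proposals can be sketched as follows: pairwise offsets between box centers are embedded and added as a bias to the attention logits. This is a generic illustration, not the 3DRP-MA module itself; dimensions and the bias formulation are assumptions.

```python
# Generic relative-position-aware multi-head attention over object proposals.
import torch
import torch.nn as nn

class RelativePositionAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.pos_bias = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, heads))
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) object features, centers: (B, N, 3) box centers
        B, N, _ = feats.shape
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        q = q.view(B, N, self.heads, -1).transpose(1, 2)          # (B, H, N, d)
        k = k.view(B, N, self.heads, -1).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        rel = centers[:, :, None, :] - centers[:, None, :, :]     # (B, N, N, 3) pairwise offsets
        bias = self.pos_bias(rel).permute(0, 3, 1, 2)             # (B, H, N, N) per-head bias
        attn = (q @ k.transpose(-2, -1)) * self.scale + bias
        out = attn.softmax(dim=-1) @ v                            # (B, H, N, d)
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))

feats, centers = torch.randn(2, 32, 256), torch.randn(2, 32, 3)
print(RelativePositionAttention()(feats, centers).shape)          # torch.Size([2, 32, 256])
```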

Revisiting Event-based Video Frame Interpolation

Jul 24, 2023
Jiaben Chen, Yichen Zhu, Dongze Lian, Jiaqi Yang, Yifu Wang, Renrui Zhang, Xinhang Liu, Shenhan Qian, Laurent Kneip, Shenghua Gao

Dynamic vision sensors, or event cameras, provide rich complementary information for video frame interpolation. Existing state-of-the-art methods follow the paradigm of combining both synthesis-based and warping networks. However, few of those methods fully respect the intrinsic characteristics of event streams. Given that event cameras only encode intensity changes and polarity rather than color intensities, estimating optical flow from events is arguably more difficult than from RGB information. We therefore propose to incorporate RGB information in an event-guided optical flow refinement strategy. Moreover, in light of the quasi-continuous nature of the time signals provided by event cameras, we propose a divide-and-conquer strategy in which event-based intermediate frame synthesis happens incrementally in multiple simplified stages rather than in a single, long stage. Extensive experiments on both synthetic and real-world datasets show that these modifications lead to more reliable and realistic intermediate frame results than previous video frame interpolation methods. Our findings underline that careful consideration of event characteristics such as high temporal density and elevated noise benefits interpolation accuracy.
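
The divide-and-conquer synthesis strategy can be pictured with a toy recursion: the midpoint frame of an interval is synthesized first, and the interval (together with its event stream) is then split recursively. `synthesize_midpoint` below is a hypothetical stand-in for the learned event-guided synthesis and flow-refinement networks, not the paper's model.

```python
# Toy recursion illustrating incremental, multi-stage intermediate frame synthesis.
import numpy as np

def synthesize_midpoint(frame_a: np.ndarray, frame_b: np.ndarray,
                        events: np.ndarray) -> np.ndarray:
    """Placeholder for the learned event-guided synthesis of the middle frame."""
    return 0.5 * (frame_a + frame_b)        # a real model would use events and refined flow

def interpolate(frame_a, frame_b, events, t_a, t_b, depth=3):
    """Recursively fill the interval [t_a, t_b] with intermediate frames."""
    if depth == 0:
        return []
    t_mid = 0.5 * (t_a + t_b)
    mid = synthesize_midpoint(frame_a, frame_b, events)
    left_events = events[events[:, 0] <= t_mid]      # events carry (timestamp, x, y, polarity)
    right_events = events[events[:, 0] > t_mid]
    left = interpolate(frame_a, mid, left_events, t_a, t_mid, depth - 1)
    right = interpolate(mid, frame_b, right_events, t_mid, t_b, depth - 1)
    return left + [(t_mid, mid)] + right

frame_a, frame_b = np.zeros((64, 64)), np.ones((64, 64))
events = np.random.rand(1000, 4)                     # columns: t, x, y, polarity
frames = interpolate(frame_a, frame_b, events, 0.0, 1.0)
print(len(frames))                                   # 7 intermediate frames for depth 3
```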

* Accepted by IROS 2023. Project site: https://jiabenchen.github.io/revisit_event 

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

Jul 18, 2023
Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are both time-consuming and expensive. To address the difficulty of obtaining fine-grained annotated data, we propose to leverage weakly supervised annotations to learn the 3D visual grounding model, i.e., only coarse scene-sentence correspondences are used to learn object-sentence links. To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we reconstruct the masked keywords of the sentence using each candidate one by one, and the reconstruction accuracy finely reflects the semantic similarity of each candidate to the query. Additionally, we distill the coarse-to-fine semantic matching knowledge into a typical two-stage 3D visual grounding model, which reduces inference costs and improves performance by taking full advantage of the well-studied structure of existing architectures. We conduct extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the effectiveness of our proposed method.
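
A schematic of the coarse-to-fine matching described above: cosine similarity selects the top-K proposals, and a masked-keyword reconstruction score re-ranks them. `reconstruction_score` is a hypothetical stand-in for the keyword-reconstruction network; all names and dimensions are assumptions.

```python
# Coarse selection by feature similarity, fine re-ranking by a (placeholder)
# masked-keyword reconstruction score.
import torch
import torch.nn.functional as F

def coarse_select(proposal_feats, sentence_feat, k=8):
    """Pick top-K proposals by feature similarity to the sentence embedding."""
    sim = F.cosine_similarity(proposal_feats, sentence_feat.unsqueeze(0), dim=-1)
    return sim.topk(min(k, proposal_feats.size(0))).indices

def reconstruction_score(proposal_feat, masked_sentence_tokens):
    """Placeholder: how well this candidate lets us reconstruct the masked keywords."""
    return torch.rand(())   # a real model would return keyword-reconstruction accuracy

def coarse_to_fine_match(proposal_feats, sentence_feat, masked_tokens, k=8):
    candidates = coarse_select(proposal_feats, sentence_feat, k)
    fine_scores = torch.stack([reconstruction_score(proposal_feats[i], masked_tokens)
                               for i in candidates])
    return candidates[fine_scores.argmax()]   # soft scores could also be kept for distillation

proposals = torch.randn(64, 256)              # 64 object proposals, 256-d features
sentence = torch.randn(256)
best = coarse_to_fine_match(proposals, sentence, masked_tokens=None)
print(int(best))
```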

* ICCV 2023 

Make A Long Image Short: Adaptive Token Length for Vision Transformers

Jul 05, 2023
Qiqi Zhou, Yichen Zhu

The vision transformer (ViT) breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in better performance, it also leads to a considerable increase in computational cost. Motivated by the saying "A picture is worth a thousand words," we propose an approach to accelerate ViT models by shortening long images. Specifically, we introduce a method for adaptively assigning the token length for each image at test time to accelerate inference. First, we train a Resizable-ViT (ReViT) model capable of processing inputs with diverse token lengths. Next, we extract token-length labels from ReViT that indicate the minimum number of tokens required for an accurate prediction. We then use these labels to train a lightweight Token-Length Assigner (TLA) that allocates the optimal token length for each image during inference. The TLA enables ReViT to process images with the minimum sufficient number of tokens, reducing the token count in the ViT model and improving inference speed. Our approach is general and compatible with modern vision transformer architectures, significantly reducing computational costs. We verify the effectiveness of our method on multiple representative ViT models for image classification and action recognition.
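
A condensed sketch of the training recipe in the abstract: derive a token-length label (the smallest token budget at which a length-flexible ViT is already correct), then fit a lightweight assigner to predict that label from the image. `revit_predict` is a hypothetical stand-in for the Resizable-ViT; budgets and shapes are assumptions.

```python
# Extract token-length labels from a (placeholder) Resizable-ViT and train a
# small Token-Length Assigner on them.
import torch
import torch.nn as nn

TOKEN_LENGTHS = [49, 98, 196]          # candidate token budgets, smallest first

def revit_predict(images: torch.Tensor, num_tokens: int) -> torch.Tensor:
    """Placeholder for a Resizable-ViT forward pass at a given token length."""
    return torch.randint(0, 1000, (images.size(0),))       # fake class predictions

def token_length_labels(images: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Label = index of the smallest token budget whose prediction is already correct."""
    labels = torch.full((images.size(0),), len(TOKEN_LENGTHS) - 1)
    for i in range(len(TOKEN_LENGTHS) - 1, -1, -1):         # check largest budget first ...
        correct = revit_predict(images, TOKEN_LENGTHS[i]) == targets
        labels[correct] = i                                 # ... so smaller budgets overwrite
    return labels

# The Token-Length Assigner is just a small classifier trained on these labels.
assigner = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                         nn.Linear(128, len(TOKEN_LENGTHS)))
images, targets = torch.randn(16, 3, 32, 32), torch.randint(0, 1000, (16,))
loss = nn.functional.cross_entropy(assigner(images), token_length_labels(images, targets))
loss.backward()
```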

* accepted to ECML PKDD. arXiv admin note: substantial text overlap with arXiv:2112.01686 

Prediction with Incomplete Data under Agnostic Mask Distribution Shift

May 18, 2023
Yichen Zhu, Jian Yuan, Bo Jiang, Tao Lin, Haiming Jin, Xinbing Wang, Chenghu Zhou

Data with missing values is ubiquitous in many applications. Recent years have witnessed increasing attention on prediction with only incomplete data, consisting of observed features and a mask that indicates the missing pattern. Existing methods assume that the training and testing distributions are the same, which may be violated in real-world scenarios. In this paper, we consider prediction with incomplete data in the presence of distribution shift. We focus on the case where the underlying joint distribution of complete features and label is invariant, but the missing pattern, i.e., the mask distribution, may shift agnostically between training and testing. To achieve generalization, we leverage the observation that for each mask there is an invariant optimal predictor. To avoid the exponential explosion of learning them separately, we approximate the optimal predictors jointly using a double parameterization technique. This has the undesirable side effect of allowing the learned predictors to rely on correlations within the mask and between features and the mask, so we perform decorrelation to minimize this effect. Combining these techniques, we propose a novel prediction method called StableMiss. Extensive experiments on both synthetic and real-world datasets show that StableMiss is robust and outperforms state-of-the-art methods under agnostic mask distribution shift.
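
A bare-bones sketch of the double-parameterization idea: one network maps the mask to the weights of a predictor, which is then applied to the zero-imputed observed features. The decorrelation step is omitted; the architecture and names are assumptions, not the StableMiss implementation.

```python
# Mask-conditioned (double-parameterized) linear predictor for incomplete data.
import torch
import torch.nn as nn

class DoubleParamPredictor(nn.Module):
    """A small network maps the mask to the weights and bias of a linear
    predictor, which is then applied to the observed features."""
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.weight_net = nn.Sequential(nn.Linear(num_features, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_features + 1))

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, d) zero-imputed features, mask: (B, d) with 1 = observed
        params = self.weight_net(mask)
        w, b = params[:, :-1], params[:, -1]
        return (w * x * mask).sum(dim=-1) + b        # mask-dependent linear prediction

model = DoubleParamPredictor(num_features=10)
mask = (torch.rand(32, 10) > 0.3).float()
x = torch.randn(32, 10) * mask                       # unobserved entries imputed with zero
y = torch.randn(32)
loss = nn.functional.mse_loss(model(x, mask), y)
loss.backward()
```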

Label-Guided Auxiliary Training Improves 3D Object Detector

Jul 24, 2022
Yaomin Huang, Xinmei Liu, Yichen Zhu, Zhiyuan Xu, Chaomin Shen, Zhengping Che, Guixu Zhang, Yaxin Peng, Feifei Feng, Jian Tang

Detecting 3D objects from point clouds is a practical yet challenging task that has attracted increasing attention recently. In this paper, we propose a Label-Guided auxiliary training method for 3D object detection (LG3D), which serves as an auxiliary network to enhance the feature learning of existing 3D object detectors. Specifically, we propose two novel modules: a Label-Annotation-Inducer that maps annotations and point clouds in bounding boxes to task-specific representations, and a Label-Knowledge-Mapper that assists the original features in obtaining detection-critical representations. The proposed auxiliary network is discarded at inference and thus incurs no extra computational cost at test time. We conduct extensive experiments on both indoor and outdoor datasets to verify the effectiveness of our approach. For example, our proposed LG3D improves VoteNet by 2.5% and 3.1% mAP on the SUN RGB-D and ScanNetV2 datasets, respectively.
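
A schematic sketch of a label-guided auxiliary branch as described above: an inducer encodes ground-truth annotations, a mapper injects that knowledge into detector features during training, and the whole branch is skipped at inference. The module names mirror the abstract, but the architecture and loss are assumptions, not the LG3D implementation.

```python
# Training-time-only auxiliary branch that injects label knowledge into features.
import torch
import torch.nn as nn

class LabelGuidedAux(nn.Module):
    def __init__(self, feat_dim: int = 256, anno_dim: int = 10):
        super().__init__()
        # "inducer": encodes ground-truth annotations into the feature space
        self.inducer = nn.Sequential(nn.Linear(anno_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        # "mapper": fuses label knowledge into the detector features
        self.mapper = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, feats, annotations=None):
        # feats: (B, N, feat_dim) proposal features; annotations: (B, N, anno_dim) or None
        if annotations is None:                      # inference: auxiliary branch skipped
            return feats, None
        label_feats = self.inducer(annotations)
        enhanced = self.mapper(torch.cat([feats, label_feats], dim=-1))
        aux_loss = nn.functional.mse_loss(feats, label_feats.detach())   # pull features toward label knowledge
        return enhanced, aux_loss

aux = LabelGuidedAux()
feats, annos = torch.randn(2, 128, 256), torch.randn(2, 128, 10)
enhanced, aux_loss = aux(feats, annos)     # training-time forward
plain, _ = aux(feats)                      # test-time forward: no extra cost
```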

* Yaomin Huang and Xinmei Liu contributed equally. Accepted by ECCV 2022 

UniLog: Deploy One Model and Specialize it for All Log Analysis Tasks

Dec 06, 2021
Yichen Zhu, Weibin Meng, Ying Liu, Shenglin Zhang, Tao Han, Shimin Tao, Dan Pei

* Technical report 