Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Sep 15, 2025

Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen

Abstract: Multimodal Large Language Models (MLLMs) have recently gained significant attention across various domains, but their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: their output preferences can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), which manipulates MLLM response preferences using a preference-hijacked image. The method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation, a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preference. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is available at https://github.com/Yifan-Lan/Phi.
