Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Jun 11, 2025

Beomsik Cho, Jaehyung Kim

Figure 1 for Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Figure 2 for Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Figure 3 for Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Figure 4 for Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Share this with someone who'll enjoy it:

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs for up to $2\times$.

* Code available at https://github.com/bscho333/ReVisiT

View paper on

Share this with someone who'll enjoy it:

Title:Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Paper and Code