Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yiqiang Li

PEVLM: Parallel Encoding for Vision-Language Models

Jun 24, 2025

Letian Kang, Shixian Luo, Yiqiang Li, Xiaoyang Yu, Shenxuan Zhou, Yong Wu

Figure 1 for PEVLM: Parallel Encoding for Vision-Language Models

Figure 2 for PEVLM: Parallel Encoding for Vision-Language Models

Figure 3 for PEVLM: Parallel Encoding for Vision-Language Models

Figure 4 for PEVLM: Parallel Encoding for Vision-Language Models

Abstract:Vision-Language Models (VLMs) have demonstrated strong performance in video-language tasks, yet their application to long video understanding remains constrained by the quadratic complexity of standard attention mechanisms. In this paper, we propose \textbf{PEVLM}, a parallel encoding strategy specifically designed to improve the prefill efficiency of VLMs without requiring model finetuning. PEVLM partitions the input into block-wise segments with a shared sink, preserves full-attention positional embeddings, and aligns attention weights to mimic full-attention distributions. This design reduces attention computation from $O((T \times N)^2)$ to $O(T \times N)$ while maintaining high accuracy. Extensive experiments on the LongVideoBench benchmark show that PEVLM achieves up to 8.37\% accuracy improvement over existing inference-efficient methods and delivers up to 7.47x speedup in attention computation and 40\% reduction in end-to-end latency. Under strict latency constraints, PEVLM significantly outperforms baselines, raising accuracy from 23.26\% to 61.03\%. These results highlight PEVLM's effectiveness for low-latency, long-context video understanding, making it well-suited for real-world applications such as autonomous driving.

Via

Access Paper or Ask Questions

Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Apr 28, 2025

Huichi Zhou, Zehao Xu, Munan Zhao, Kaihong Li, Yiqiang Li, Hongtao Wang

Figure 1 for Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Figure 2 for Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Figure 3 for Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Figure 4 for Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Abstract:In this paper, we introduce the Multilingual Moral Reasoning Benchmark (MMRB) to evaluate the moral reasoning abilities of large language models (LLMs) across five typologically diverse languages and three levels of contextual complexity: sentence, paragraph, and document. Our results show moral reasoning performance degrades with increasing context complexity, particularly for low-resource languages such as Vietnamese. We further fine-tune the open-source LLaMA-3-8B model using curated monolingual data for alignment and poisoning. Surprisingly, low-resource languages have a stronger impact on multilingual reasoning than high-resource ones, highlighting their critical role in multilingual NLP.

* 5 pages, 2 figures

Via

Access Paper or Ask Questions

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Jun 16, 2024

Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li(+10 more)

Figure 1 for GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Figure 2 for GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Figure 3 for GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Figure 4 for GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Abstract:Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the sparse GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.

Via

Access Paper or Ask Questions