Ashish Choithani

Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Apr 13, 2026

Shivam Sharma, Sankalp Nagaonkar, Ashish Choithani, Ashutosh Trivedi

Abstract: We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
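The abstract does not give formulas for its metrics, so the following is only a toy sketch of how ratio-style scores of this kind could be computed. Everything here is an assumption made for illustration: the function names, the `"content"`/`"meta"` sentence labels (imagined as coming from a judge model), and the set-overlap reading of coverage. None of it is the authors' implementation.

```python
def contentfulness(labels):
    """Fraction of thought-stream sentences labeled "content" rather than "meta".

    `labels` is a hypothetical per-sentence labeling, e.g. produced by a judge model.
    """
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "content") / len(labels)


def carried_fraction(thought_items, final_items):
    """Fraction of thought-stream items that also appear in the final output."""
    thought, final = set(thought_items), set(final_items)
    if not thought:
        return 0.0
    return len(thought & final) / len(thought)


def unsupported_fraction(thought_items, final_items):
    """Fraction of final-output items that never appeared in the thought stream."""
    thought, final = set(thought_items), set(final_items)
    if not final:
        return 0.0
    return len(final - thought) / len(final)


if __name__ == "__main__":
    labels = ["content", "meta", "content", "content"]
    thought = ["dog", "park", "running"]
    final = ["dog", "park", "frisbee"]
    print(contentfulness(labels))
    print(carried_fraction(thought, final))
    print(unsupported_fraction(thought, final))
```

Under this reading, a high unsupported fraction would correspond to the final output containing content the model never reasoned about, which is the behavior the abstract associates with tight reasoning budgets.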

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

Feb 10, 2025

Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi

Abstract: This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.

* Code and dataset: https://github.com/video-db/ocr-benchmark
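
For readers unfamiliar with the WER and CER metrics named in the abstract, here is a minimal sketch using their standard definitions: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, computed over words for WER and over characters for CER. The function names and the whitespace tokenization are assumptions for illustration; the linked repository may normalize text or compute the scores differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ground_truth = "the quick brown fox"
    ocr_output = "the quick brown box"
    print(wer(ground_truth, ocr_output))  # one of four words is wrong: 0.25
    print(cer(ground_truth, ocr_output))
```

Both scores can exceed 1.0 when the hypothesis is much longer than the reference, which is one way hallucinated text shows up in these metrics.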
