Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: the object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user studies and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle to produce accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
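As a rough illustration of the MLLM-based automatic evaluation described above, the sketch below scores a generated video on a single OSC case by asking a multimodal judge three yes/no questions: whether the initial state appears, whether the final state appears, and whether a coherent transition connects them. The `OSCCase` fields, the three-question rubric, and the `judge` callback are illustrative assumptions, not the benchmark's exact protocol.

```python
# Minimal sketch of an MLLM-as-judge protocol for object state change (OSC),
# in the spirit of OSCBench's automatic evaluation. The OSCCase fields, the
# three-question rubric, and the `judge` interface are illustrative
# assumptions, not the benchmark's exact setup.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class OSCCase:
    prompt: str         # full text prompt, e.g. "peeling a potato"
    obj: str            # the object undergoing the change, e.g. "potato"
    initial_state: str  # e.g. "unpeeled, skin intact"
    final_state: str    # e.g. "fully peeled, flesh exposed"

def score_osc(frames: List, case: OSCCase,
              judge: Callable[[List, str], str]) -> dict:
    """Ask a multimodal judge three yes/no questions about the sampled
    frames: initial state shown, final state shown, coherent transition."""
    questions = {
        "initial": f"Do the early frames show a {case.obj} that is "
                   f"{case.initial_state}?",
        "final": f"Do the late frames show a {case.obj} that is "
                 f"{case.final_state}?",
        "transition": f"Across the frames, does the {case.obj} change "
                      f"gradually from {case.initial_state} to "
                      f"{case.final_state}? Answer yes or no.",
    }
    # `judge` wraps whatever MLLM client is available and returns free text.
    return {name: judge(frames, q).strip().lower().startswith("yes")
            for name, q in questions.items()}
```

Passing the judge in as a callback keeps the rubric independent of any particular MLLM client.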




Abstract: Following recipes while cooking is an important but difficult task for visually impaired individuals. We developed OSCAR (Object Status Context Awareness for Recipes), a novel approach that provides recipe progress tracking and context-aware feedback on the completion of cooking tasks by tracking object statuses. OSCAR leverages both Large Language Models (LLMs) and Vision-Language Models (VLMs) to manipulate recipe steps, extract object status information, align visual frames with object statuses, and provide a cooking progress tracking log. We evaluated OSCAR's recipe-following functionality on 173 YouTube cooking videos and 12 real-world non-visual cooking videos, demonstrating its capability to track cooking steps and provide contextual guidance. Our results highlight the effectiveness of using object status, which improves performance over the baseline by more than 20% across different VLMs, and we identify factors that impact prediction performance. Furthermore, we contribute a dataset of real-world non-visual cooking videos with step annotations as an evaluation benchmark.
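To make the pipeline described above concrete, here is a minimal sketch of object-status-based step tracking under stated assumptions: `llm_extract_statuses` and `vlm_matches` are hypothetical stand-ins for the LLM and VLM calls, and the advance-when-all-statuses-observed policy is an illustrative simplification, not OSCAR's actual implementation.

```python
# Minimal sketch of OSCAR-style progress tracking via object statuses.
# `llm_extract_statuses` and `vlm_matches` are hypothetical stand-ins for
# the LLM/VLM calls; prompts, models, and the matching policy are assumptions.

def llm_extract_statuses(step_text: str) -> list[str]:
    """Hypothetical LLM call: turn a recipe step into checkable object
    statuses, e.g. 'Slice the lemon' -> ['lemon is sliced']."""
    raise NotImplementedError("plug in an LLM client here")

def vlm_matches(frame, status: str) -> bool:
    """Hypothetical VLM call: does this frame show the given object status?"""
    raise NotImplementedError("plug in a VLM client here")

def track_progress(frames, steps: list[str]) -> list[dict]:
    """Walk through the video frames, advancing to the next recipe step once
    all of the current step's object statuses have been observed, and return
    a progress log of (frame index, completed step) entries."""
    if not steps:
        return []
    log, step_idx = [], 0
    pending = llm_extract_statuses(steps[step_idx])
    for t, frame in enumerate(frames):
        pending = [s for s in pending if not vlm_matches(frame, s)]
        if not pending:  # every status for the current step has been seen
            log.append({"frame": t, "completed_step": steps[step_idx]})
            step_idx += 1
            if step_idx == len(steps):
                break
            pending = llm_extract_statuses(steps[step_idx])
    return log
```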




Abstract: Decision-making in unfamiliar domains can be challenging, demanding considerable user effort to compare different options with respect to various criteria. Prior research and our formative study found that people would benefit from seeing an overview of the information space upfront, such as the criteria that others have previously found useful. However, existing sensemaking tools struggle with the "cold-start" problem -- not only do they require significant input from previous users to generate and share these overviews, but the resulting overviews may also be biased and incomplete. In this work, we introduce a novel system, Selenite, which leverages LLMs as reasoning machines and knowledge retrievers to automatically produce a comprehensive overview of options and criteria that jumpstarts users' sensemaking processes. Selenite then adapts as people use it, helping users find, read, and navigate unfamiliar information in a systematic yet personalized manner. Through three studies, we found that Selenite reliably produced accurate, high-quality overviews, significantly accelerated users' information processing, and effectively improved their overall comprehension and sensemaking experience.
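As a hedged sketch of the overview-generation idea, the snippet below prompts an LLM for a structured overview of options and criteria in a given domain. The prompt wording, the JSON schema, and the `call_llm` helper are assumptions for illustration; Selenite's actual prompting and adaptation pipeline are more involved.

```python
# Minimal sketch of Selenite-style overview generation. The prompt text,
# JSON schema, and `call_llm` helper are illustrative assumptions, not
# Selenite's actual prompts or pipeline.
import json

OVERVIEW_PROMPT = """You are helping someone compare {domain}.
List the most common options and the criteria people typically use to
compare them. Respond as JSON:
{{"options": ["..."], "criteria": [{{"name": "...", "why_it_matters": "..."}}]}}"""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with your provider's chat API."""
    raise NotImplementedError("plug in an LLM client here")

def build_overview(domain: str) -> dict:
    """Produce an upfront overview of options and criteria for a domain,
    e.g. build_overview('standing desks')."""
    return json.loads(call_llm(OVERVIEW_PROMPT.format(domain=domain)))
```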