Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kaifeng Huang

See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

Feb 11, 2026

Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang

Abstract:Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning--acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.

Via

Access Paper or Ask Questions

Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems

Jan 10, 2023

Junming Cao, Bihuan Chen, Longjie Hu, Jie Gao, Kaifeng Huang, Xin Peng

Figure 1 for Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems

Figure 2 for Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems

Figure 3 for Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems

Figure 4 for Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems

Abstract:Machine learning (ML) enabled systems are emerging with recent breakthroughs in ML. A model-centric view is widely taken by the literature to focus only on the analysis of ML models. However, only a small body of work takes a system view that looks at how ML components work with the system and how they affect software engineering for MLenabled systems. In this paper, we adopt this system view, and conduct a case study on Rasa 3.0, an industrial dialogue system that has been widely adopted by various companies around the world. Our goal is to characterize the complexity of such a largescale ML-enabled system and to understand the impact of the complexity on testing. Our study reveals practical implications for software engineering for ML-enabled systems.

Via

Access Paper or Ask Questions