



Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, in the hope of bypassing the challenging task of visual modeling from screen pixels. However, a view hierarchy is not always available, and when it is, it is often corrupted by missing object descriptions or misaligned bounding box positions. As a result, although using the view hierarchy offers short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that takes only the screenshot of the UI and a region of interest on the screen -- the focus -- as the input. This general architecture is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model obtains state-of-the-art (SoTA) results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as input. Furthermore, we explore the multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction.
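
To make the screenshot-plus-focus input concrete, the following is a minimal PyTorch sketch of a Spotlight-style interface: a UI screenshot tensor and a normalized focus bounding box are fused into a shared representation from which text tokens (e.g., a widget caption) are decoded. The class names, dimensions, patch-convolution vision encoder, and single-vector region embedding here are illustrative assumptions only and do not reproduce the paper's actual architecture or training setup.

```python
# Minimal sketch (not the authors' code) of a screenshot + focus-region
# input feeding a generic text decoder. All shapes and modules are
# simplified placeholders.
import torch
import torch.nn as nn


class FocusRegionEncoder(nn.Module):
    """Embeds the focus bounding box (normalized x1, y1, x2, y2)."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(4, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, bbox: torch.Tensor) -> torch.Tensor:
        # bbox: (batch, 4) with coordinates normalized to [0, 1]
        return self.proj(bbox)


class SpotlightStyleModel(nn.Module):
    """Screenshot + focus region -> text token logits (greatly simplified)."""

    def __init__(self, hidden_dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        # Stand-in vision encoder: patchify the screenshot with a strided conv.
        self.vision = nn.Conv2d(3, hidden_dim, kernel_size=16, stride=16)
        self.focus = FocusRegionEncoder(hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, screenshot, focus_bbox, target_tokens):
        # screenshot: (batch, 3, H, W); focus_bbox: (batch, 4)
        patches = self.vision(screenshot).flatten(2).transpose(1, 2)  # (B, P, D)
        focus = self.focus(focus_bbox).unsqueeze(1)                   # (B, 1, D)
        memory = torch.cat([patches, focus], dim=1)                   # fuse inputs
        tgt = self.token_embed(target_tokens)                         # (B, T, D)
        return self.lm_head(self.decoder(tgt, memory))                # (B, T, V)


if __name__ == "__main__":
    model = SpotlightStyleModel()
    screenshot = torch.rand(1, 3, 256, 256)            # dummy UI screenshot
    focus_bbox = torch.tensor([[0.1, 0.2, 0.4, 0.3]])  # region of interest
    tokens = torch.randint(0, 1000, (1, 8))            # target caption tokens
    print(model(screenshot, focus_bbox, tokens).shape)  # torch.Size([1, 8, 1000])
```

The key design point this sketch illustrates is that every UI task shares the same input signature, a screenshot and a focus region, so different tasks (captioning, grounding, summarization) differ only in the target text rather than in the model architecture.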