Abstract: Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces (GUIs) to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus predominantly on general-purpose software, relatively simple applications, and short-horizon tasks, so it remains largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark of long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, showing that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain consistency across long-horizon workflows, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.
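One failure mode named above, workflow stage omission, can be made concrete with a small check. The sketch below is illustrative only: the abstract does not describe how Workflow-GYM scores trajectories, so the function `omitted_stages` and the stage names are hypothetical. It assumes a task is annotated with an ordered list of required stages and that an agent's trajectory has already been mapped to stage labels; it then reports which required stages are not matched, in order, by the trajectory.

```python
from typing import Sequence


def omitted_stages(required: Sequence[str], observed: Sequence[str]) -> list[str]:
    """Return the required stages not matched, in order, by the observed trajectory.

    Hypothetical check: each required stage must appear after the previously
    matched stage (greedy subsequence matching). A stage that cannot be found
    is reported as omitted, and matching continues from the same position.
    """
    trajectory = list(observed)
    missing: list[str] = []
    pos = 0  # index in the trajectory where the next search starts
    for stage in required:
        try:
            pos = trajectory.index(stage, pos) + 1
        except ValueError:
            missing.append(stage)
    return missing


required = ["open_project", "import_data", "run_analysis", "export_report"]
observed = ["open_project", "run_analysis", "export_report"]
print(omitted_stages(required, observed))  # ['import_data']
```

Under this assumed scheme, a task would count as structurally complete only when the returned list is empty; a real benchmark would also need to verify the final state of the software, which this sketch does not attempt.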

Abstract: Many robotic tasks require heavy computation, which can easily exceed the capacity of the robot's onboard computer. A promising solution to this challenge is to outsource the computation to the cloud. However, exploiting cloud resources in robotic software is difficult because it involves complex code modification and extensive (re)configuration procedures. Moreover, quality-of-service (QoS) properties such as timeliness, which are critical to a robot's behavior, must be considered. In this paper, we propose Cloudroid, a transparent and QoS-aware software framework for cloud robotic applications. The framework supports direct deployment of existing robotic software packages to the cloud, transparently transforming them into Internet-accessible cloud services. With automatically generated service stubs, robotic applications can outsource their computation to the cloud without any code modification. Furthermore, the robot and the cloud can cooperate to maintain specific QoS properties, such as request response time, even in a highly dynamic and resource-competitive environment. We evaluated Cloudroid on a set of typical robotic scenarios and a set of software packages widely adopted in real-world robot practice. The results show that a robot's capabilities can be enhanced significantly without code modification and that specific QoS objectives can be guaranteed. In certain tasks, the "cloud + robot" setup improves performance by orders of magnitude compared with the robot-native setup.
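To illustrate the idea of a QoS-aware stub that keeps the calling code unchanged, here is a minimal sketch. It is not Cloudroid's mechanism, which the abstract does not detail: the function `qos_call`, the thread-pool transport, and the fallback policy are all hypothetical. The stub sends a request to a "cloud" callable and, if no result arrives within a response-time deadline (or the remote call raises), runs an on-board fallback instead.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for the network transport to a cloud service.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def qos_call(remote: Callable[[], T], local: Callable[[], T],
             deadline_s: float) -> tuple[T, str]:
    """Prefer the cloud result; fall back to local execution if it misses the deadline.

    Returns (result, source), where source is "cloud" or "local". The remote
    call is not forcibly stopped on timeout; it finishes in the background and
    its result is discarded. Worst-case latency is deadline_s plus local time.
    """
    future = _POOL.submit(remote)
    try:
        return future.result(timeout=deadline_s), "cloud"
    except Exception:  # includes concurrent.futures.TimeoutError and remote failures
        future.cancel()  # best effort: only cancels calls that have not started yet
        return local(), "local"


def slow_cloud() -> str:
    time.sleep(0.5)  # simulates a congested cloud service
    return "cloud-result"


print(qos_call(slow_cloud, lambda: "local-result", deadline_s=0.1))  # ('local-result', 'local')
print(qos_call(lambda: "fast", lambda: "local-result", deadline_s=1.0))  # ('fast', 'cloud')
```

A design like this trades some wasted remote work for a bounded response time. A framework in which the robot and the cloud actively cooperate, as described above, could instead adjust resource allocation on the cloud side rather than only falling back on the robot.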