Abstract: AI agents are becoming increasingly capable of handling tasks of growing duration and complexity, demonstrating strong performance on coding, deep-research, and complex problem-solving evaluations. In everyday scenarios, however, general users perceive little of this advanced capability. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks needed to cover the daily work, life, and learning activities of a broad user population. To address this, we propose AgentIF-OneDay, a benchmark that asks whether general users can complete a diverse array of daily tasks by giving AI agents natural-language instructions. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit, complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or extending work in progress. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate with Gemini-3-Pro as the judge. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. Benchmarking four leading general-purpose AI agents, we find that agent products built on top of LLM APIs and ChatGPT Agent, which is trained with agentic RL, jointly occupy the first tier. This suggests that leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to build cutting-edge agent products.
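To make the rubric-based verification concrete, the following is a minimal sketch of how instance-level rubric scoring with an LLM judge could work. It is an illustration under stated assumptions, not the paper's released pipeline: call_llm, RubricPoint, and the prompt wording are hypothetical placeholders (the abstract only states that the judge is Gemini-3-Pro and that agreement with humans is 80.1%).

# Minimal sketch of instance-level rubric scoring with an LLM judge.
# ASSUMPTIONS: `call_llm` stands in for any chat-completion client
# (e.g., a Gemini-3-Pro wrapper); the rubric schema and the YES/NO
# prompt format are illustrative, not taken from the paper.
from dataclasses import dataclass

@dataclass
class RubricPoint:
    description: str   # one verifiable scoring point, e.g. "deliverable is a .xlsx file"
    weight: float = 1.0

def call_llm(prompt: str) -> str:
    """Hypothetical judge call; replace with a real model client."""
    raise NotImplementedError

def judge_task(agent_output: str, rubric: list[RubricPoint]) -> float:
    """Score one task instance by asking the judge to verify each scoring point."""
    earned = 0.0
    total = sum(p.weight for p in rubric)
    for point in rubric:
        prompt = (
            "You are verifying an AI agent's deliverable.\n"
            f"Scoring point: {point.description}\n"
            f"Agent output:\n{agent_output}\n"
            "Answer strictly YES or NO: is the scoring point satisfied?"
        )
        verdict = call_llm(prompt).strip().upper()
        if verdict.startswith("YES"):
            earned += point.weight
    return earned / total  # fraction of weighted scoring points satisfied

Aligning such a pipeline with human judgment then amounts to tuning the rubric wording and judge prompts until the judge's per-point verdicts agree with human annotations at an acceptable rate, the 80.1% figure reported in the abstract.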
Abstract: The use of large language model (LLM)-based AI chatbots among college students has increased rapidly, yet little is known about how individual psychological attributes shape students' interaction patterns with these technologies. This qualitative study explored how college students with different attachment styles describe their interactions with ChatGPT. Using semi-structured interviews with seven undergraduate students and grounded theory analysis, we identified three main themes: (1) AI as a low-risk emotional space, where participants across attachment styles valued the non-judgmental, low-stakes nature of AI interactions; (2) attachment-congruent patterns of AI engagement, where securely attached students integrated AI as a supplementary tool within their existing support systems, while avoidantly attached students used AI to buffer vulnerability and maintain interpersonal boundaries; and (3) the paradox of AI intimacy, capturing the tension between students' willingness to disclose personal information to AI and their simultaneous recognition of its limitations as a relational partner. These findings suggest that attachment orientations play an important role in shaping how students experience and interpret their interactions with AI chatbots, extending attachment theory to the domain of human-AI interaction.