CSIRO Robotics, Clayton, Australia
Abstract:With increasing levels of robot autonomy, robots are increasingly being supervised by users with varying levels of robotics expertise. As the diversity of the user population increases, it is important to understand how users with different expertise levels approach the supervision task and how this impacts performance of the human-robot team. This exploratory study investigates how operators with varying expertise levels perceive information and make intervention decisions when supervising a remote robot. We conducted a user study (N=27) where participants supervised a robot autonomously exploring four unknown tunnel environments in a simulator, and provided waypoints to intervene when they believed the robot had encountered difficulties. By analyzing the interaction data and questionnaire responses, we identify differing patterns in intervention timing and decision-making strategies across novice, intermediate, and expert users.
Abstract:Robotic systems for household object rearrangement often rely on latent preference models inferred from human demonstrations. While effective at prediction, these models offer limited insight into the interpretable factors that guide human decisions. We introduce an explicit formulation of object arrangement preferences along four interpretable constructs: spatial practicality (putting items where they naturally fit best in the space), habitual convenience (making frequently used items easy to reach), semantic coherence (placing items together if they are used for the same task or are contextually related), and commonsense appropriateness (putting things where people would usually expect to find them). To capture these constructs, we designed and validated a self-report questionnaire through a 63-participant online study. Results confirm the psychological distinctiveness of these constructs and their explanatory power across two scenarios (kitchen and living room). We demonstrate the utility of these constructs by integrating them into a Monte Carlo Tree Search (MCTS) planner and show that when guided by participant-derived preferences, our planner can generate reasonable arrangements that closely align with those generated by participants. This work contributes a compact, interpretable formulation of object arrangement preferences and a demonstration of how it can be operationalized for robot planning.
Abstract:The growing use of service robots in hospitality highlights the need to understand how to effectively communicate with pre-occupied customers. This study investigates the efficacy of commonly used communication modalities by service robots, namely, acoustic/speech, visual display, and micromotion gestures in capturing attention and communicating intention with a user in a simulated restaurant scenario. We conducted a two-part user study (N=24) using a Temi robot to simulate delivery tasks, with participants engaged in a typing game (MonkeyType) to emulate a state of busyness. The participants' engagement in the typing game is measured by words per minute (WPM) and typing accuracy. In Part 1, we compared non-verbal acoustic cue versus baseline conditions to assess attention capture during a single-cup delivery task. In Part 2, we evaluated the effectiveness of speech, visual display, micromotion and their multimodal combination in conveying specific intentions (correct cup selection) during a two-cup delivery task. The results indicate that, while speech is highly effective in capturing attention, it is less successful in clearly communicating intention. Participants rated visual as the most effective modality for intention clarity, followed by speech, with micromotion being the lowest ranked.These findings provide insights into optimizing communication strategies for service robots, highlighting the distinct roles of attention capture and intention communication in enhancing user experience in dynamic hospitality settings.




Abstract:Robots are increasingly working alongside people, delivering food to patrons in restaurants or helping workers on assembly lines. These scenarios often involve object handovers between the person and the robot. To achieve safe and efficient human-robot collaboration (HRC), it is important to incorporate human context in a robot's handover strategies. Therefore, in this work, we develop a collaborative handover model trained on human teleoperation data collected in a naturalistic crafting task. To evaluate the performance of this model, we conduct cross-validation experiments on the training dataset as well as a user study in the same HRC crafting task. The handover episodes and user perceptions of the autonomous handover policy were compared with those of the human teleoperated handovers. While the cross-validation experiment and user study indicate that the autonomous policy successfully achieved collaborative handovers, the comparison with human teleoperation revealed avenues for further improvements.




Abstract:In human-robot teams, human situational awareness is the operator's conscious knowledge of the team's states, actions, plans and their environment. Appropriate human situational awareness is critical to successful human-robot collaboration. In human-robot teaming, it is often assumed that the best and required level of situational awareness is knowing everything at all times. This view is problematic, because what a human needs to know for optimal team performance varies given the dynamic environmental conditions, task context and roles and capabilities of team members. We explore this topic by interviewing 16 participants with active and repeated experience in diverse human-robot teaming applications. Based on analysis of these interviews, we derive a framework explaining the dynamic nature of required situational awareness in human-robot teaming. In addition, we identify a range of factors affecting the dynamic nature of required and actual levels of situational awareness (i.e., dynamic situational awareness), types of situational awareness inefficiencies resulting from gaps between actual and required situational awareness, and their main consequences. We also reveal various strategies, initiated by humans and robots, that assist in maintaining the required situational awareness. Our findings inform the implementation of accurate estimates of dynamic situational awareness and the design of user-adaptive human-robot interfaces. Therefore, this work contributes to the future design of more collaborative and effective human-robot teams.




Abstract:This paper investigates the application of Video Foundation Models (ViFMs) for generating robot data summaries to enhance intermittent human supervision of robot teams. We propose a novel framework that produces both generic and query-driven summaries of long-duration robot vision data in three modalities: storyboards, short videos, and text. Through a user study involving 30 participants, we evaluate the efficacy of these summary methods in allowing operators to accurately retrieve the observations and actions that occurred while the robot was operating without supervision over an extended duration (40 min). Our findings reveal that query-driven summaries significantly improve retrieval accuracy compared to generic summaries or raw data, albeit with increased task duration. Storyboards are found to be the most effective presentation modality, especially for object-related queries. This work represents, to our knowledge, the first zero-shot application of ViFMs for generating multi-modal robot-to-human communication in intermittent supervision contexts, demonstrating both the promise and limitations of these models in human-robot interaction (HRI) scenarios.
Abstract:Service robots are increasingly employed in the hospitality industry for delivering food orders in restaurants. However, in current practice the robot often arrives at a fixed location for each table when delivering orders to different patrons in the same dining group, thus requiring a human staff member or the customers themselves to identify and retrieve each order. This study investigates how to improve the robot's service behaviours to facilitate clear intention communication to a group of users, thus achieving accurate delivery and positive user experiences. Specifically, we conduct user studies (N=30) with a Temi service robot as a representative delivery robot currently adopted in restaurants. We investigated two factors in the robot's intent communication, namely visualisation and movement trajectories, and their influence on the objective and subjective interaction outcomes. A robot personalising its movement trajectory and stopping location in addition to displaying a visualisation of the order yields more accurate intent communication and successful order delivery, as well as more positive user perception towards the robot and its service. Our results also showed that individuals in a group have different interaction experiences.




Abstract:Social robots often rely on visual perception to understand their users and the environment. Recent advancements in data-driven approaches for computer vision have demonstrated great potentials for applying deep-learning models to enhance a social robot's visual perception. However, the high computational demands of deep-learning methods, as opposed to the more resource-efficient shallow-learning models, bring up important questions regarding their effects on real-world interaction and user experience. It is unclear how will the objective interaction performance and subjective user experience be influenced when a social robot adopts a deep-learning based visual perception model. We employed state-of-the-art human perception and tracking models to improve the visual perception function of the Pepper robot and conducted a controlled lab study and an in-the-wild human-robot interaction study to evaluate this novel perception function for following a specific user with other people present in the scene.




Abstract:Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propose to predict the user's future emotion based on its temporal relationship with the system's current emotion and its causal relationship with the system's current Dialogue Act (DA). In laughter, we propose to predict the occurrence and type of the user's laughter using the system's laughter behaviors in the current turn. Preliminary analysis of human-robot dialogue demonstrated synchronicity in the emotions and laughter displayed by the human and robot, as well as DA-emotion causality in their dialogue. This verifies that our architecture can contribute to the development of an anticipatory SDS.




Abstract:We study human-robot handovers in a naturalistic collaboration scenario, where a mobile manipulator robot assists a person during a crafting session by providing and retrieving objects used for wooden piece assembly (functional activities) and painting (creative activities). We collect quantitative and qualitative data from 20 participants in a Wizard-of-Oz study, generating the Functional And Creative Tasks Human-Robot Collaboration dataset (the FACT HRC dataset), available to the research community. This work illustrates how social cues and task context inform the temporal-spatial coordination in human-robot handovers, and how human-robot collaboration is shaped by and in turn influences people's functional and creative activities.