Abstract: Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate both language (e.g., through chain-of-thought) and low-level actions; however, current work does not enforce explicit alignment between these modalities during training. To address this gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions in the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and the corresponding action trajectories. This contrastive model directly ranks language-trajectory pairs by their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to LanguageTable, a benchmark dataset of human language-annotated trajectories, provide critical insights into multimodal grounding representations, and establish a strong baseline that achieves performance comparable to fully supervised fine-tuning while minimizing the need for costly data annotations.
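The core scoring-and-ranking idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy embeddings, the cosine-similarity scorer, the `temperature` value, and the `preference_pairs` helper are all placeholder assumptions standing in for learned language/trajectory encoders and the actual contrastive model.

```python
import numpy as np

def alignment_scores(lang_emb, traj_emb, temperature=0.1):
    """Scaled cosine similarities between language and trajectory embeddings.

    Rows of the result score one language description against every trajectory.
    """
    lang = lang_emb / np.linalg.norm(lang_emb, axis=1, keepdims=True)
    traj = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    return (lang @ traj.T) / temperature

def preference_pairs(scores):
    """For each language description, take the best- and worst-aligned
    trajectory as a (preferred, rejected) pair for offline preference learning."""
    return [(int(np.argmax(row)), int(np.argmin(row))) for row in scores]

# Toy example: three sub-task descriptions and three trajectories,
# where trajectory i is constructed to match description i.
lang = np.eye(3)
traj = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.9, 0.1],
                 [0.0, 0.1, 0.9]])

scores = alignment_scores(lang, traj)
pairs = preference_pairs(scores)  # each description prefers its own trajectory
```

In this sketch, the ranked pairs would then feed a standard offline preference-learning objective (e.g., a DPO-style loss) to fine-tune the hierarchical VLA's language head toward better-grounded sub-task descriptions.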



Abstract: A social robot acting as a 'mediator' can enhance interactions between humans in fields such as education and healthcare. A particularly promising research direction is the use of a social robot mediator in multiparty settings, which are the most common in real-world scenarios. However, research on social robot mediation for multiparty interactions is still emerging and faces numerous challenges. This paper provides an overview of social robotics and mediation research, highlighting relevant literature and some of the open problems. It also presents the importance of incorporating relevant psychological principles when developing social robot mediators. Finally, it explores the potential of implementing a Theory of Mind in a social robot mediator, since such a framework could greatly improve mediation by inferring individual and group mental states and interacting accordingly.