Tongfei Bian

Robust Understanding of Human-Robot Social Interactions through Multimodal Distillation

May 06, 2025

Tongfei Bian, Mathieu Chollet, Tanaya Guha

Abstract: The need for social robots and agents to interact with and assist humans is growing steadily. To interact with humans successfully, they need to understand and analyse socially interactive scenes from their own (the robot's) perspective. Few works model social situations between humans and agents, and those that exist are often too computationally intensive for real-time deployment or for real-world scenarios with limited available information. We propose a knowledge distillation framework that models social interactions through various multimodal cues, yet is robust against incomplete and noisy information during inference. Our teacher model is trained with multimodal input (body, face and hand gestures, gaze, raw images) and transfers knowledge to a student model that relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over relevant baselines on multiple downstream social understanding tasks, even with up to 51% of its input corrupted. The student model is highly efficient: it has less than 1% of the teacher model's parameters and uses about 0.5‰ of the teacher model's FLOPs. Our code will be made public upon publication.

* This paper has been submitted to ACM Multimedia 2025
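
The teacher-to-student transfer described in the abstract can be illustrated with a generic distillation objective. The sketch below is not the authors' method: it is the standard soft-target formulation (temperature-scaled KL divergence plus cross-entropy on the true label), and the function names, the temperature, and the weighting `alpha` are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label] + eps)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical combined loss: alpha weights the soft (teacher) term,
    (1 - alpha) weights the hard (ground-truth) term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    hard_term = cross_entropy(softmax(student_logits), label)
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Toy logits: a student that agrees with the teacher incurs a smaller loss.
teacher = [2.0, 0.0, -1.0]
loss_agree = distillation_loss([2.0, 0.0, -1.0], teacher, label=0)
loss_disagree = distillation_loss([-1.0, 0.0, 2.0], teacher, label=0)
```

In this toy setting the teacher logits would come from the multimodal model and the student logits from the pose-only model; how the paper actually couples the two is not specified in the abstract.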


Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions

Dec 21, 2024

Tongfei Bian, Yiming Ma, Mathieu Chollet, Victor Sanchez, Tanaya Guha

Abstract: For efficient human-agent interaction, an agent should proactively recognize its target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting a person's intent to interact with the agent, their attitude towards the agent and the action they will perform, from the agent's (egocentric) perspective. To this end, we propose SocialEgoNet, a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. SocialEgoNet uses whole-body skeletons (keypoints from face, hands and body) extracted from only 1 second of video input for high inference speed. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate real-time inference and superior performance (average accuracy across all tasks: 83.15%), with our model outperforming several competitive baselines. The additional annotations and code will be available upon acceptance.
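
One way to read "exploits task dependencies through a hierarchical multitask learning approach" is that each prediction head also receives the outputs of the heads before it. The sketch below is a toy illustration of that idea only, not SocialEgoNet itself: the head order (intent, then attitude, then action), the class counts, the feature size, and all function names are assumptions, and the graph-based spatiotemporal encoder is replaced by a plain feature vector.

```python
import math
import random

def softmax(logits):
    """Softmax shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_linear(in_dim, out_dim, rng):
    """Small randomly initialised linear layer: (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def linear(layer, x):
    """Apply a linear layer to the input vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def build_heads(feat_dim, n_intent, n_attitude, n_action, seed=0):
    """Each later head takes the shared features plus earlier predictions."""
    rng = random.Random(seed)
    return {
        "intent": make_linear(feat_dim, n_intent, rng),
        "attitude": make_linear(feat_dim + n_intent, n_attitude, rng),
        "action": make_linear(feat_dim + n_intent + n_attitude, n_action, rng),
    }

def hierarchical_forward(heads, features):
    """Predict intent first, then condition attitude and action on it."""
    intent = softmax(linear(heads["intent"], features))
    attitude = softmax(linear(heads["attitude"], features + intent))
    action = softmax(linear(heads["action"], features + intent + attitude))
    return {"intent": intent, "attitude": attitude, "action": action}

# Hypothetical sizes: 8 shared features, 2 intent / 3 attitude / 5 action classes.
heads = build_heads(feat_dim=8, n_intent=2, n_attitude=3, n_action=5)
outputs = hierarchical_forward(heads, [0.1] * 8)
```

Feeding earlier predictions forward lets a later head condition on them, for example allowing the action head to discount social actions when the predicted intent to interact is low.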
