Angelica Lim

Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning

Oct 30, 2023
Yasaman Etesam, Ozge Nilay Yalcin, Chuxuan Zhang, Angelica Lim

The emotional theory of mind problem in images is an emotion recognition task, specifically asking "How does the person in the bounding box feel?" Facial expressions, body pose, contextual information and implicit commonsense knowledge all contribute to the difficulty of the task, making it currently one of the hardest problems in affective computing. The goal of this work is to evaluate the emotional commonsense knowledge embedded in recent large vision language models (CLIP, LLaVA) and large language models (GPT-3.5) on the Emotions in Context (EMOTIC) dataset. In order to evaluate a purely text-based language model on images, we construct "narrative captions" relevant to emotion perception, using a set of 872 physical social signal descriptions related to 26 emotional categories, along with 224 labels for emotionally salient environmental contexts, sourced from writers' guides for character expressions and settings. We evaluate the use of the resulting captions in an image-to-language-to-emotion task. Experiments using zero-shot vision-language models on EMOTIC show that combining "fast" and "slow" reasoning is a promising way forward to improve emotion recognition systems. Nevertheless, a gap remains in the zero-shot emotional theory of mind task compared to prior work trained on the EMOTIC dataset.
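
As a rough illustration of the "fast" visual pathway evaluated here, the sketch below scores a cropped person image against EMOTIC-style emotion categories with zero-shot CLIP; the prompt template and the abbreviated category list are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of zero-shot CLIP emotion scoring (the "fast" pathway).
# The prompt template and abbreviated category list are illustrative assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A few emotion categories used as text prompts (a subset, for brevity).
categories = ["happiness", "sadness", "anger", "fear", "surprise", "confusion"]
prompts = [f"a photo of a person feeling {c}" for c in categories]

def score_emotions(image_path: str) -> dict:
    """Return a category -> probability mapping for the cropped person image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(categories, probs.tolist()))

# Example usage (the image path is hypothetical):
# print(score_emotions("person_crop.jpg"))
```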

* 16 pages (including references and appendix), 8 tables, 3 figures 

An MCTS-DRL Based Obstacle and Occlusion Avoidance Methodology in Robotic Follow-Ahead Applications

Sep 28, 2023
Sahar Leisiazar, Edward J. Park, Angelica Lim, Mo Chen

We propose a novel methodology for robotic follow-ahead applications that addresses the critical challenge of obstacle and occlusion avoidance. Our approach effectively navigates the robot while avoiding collisions and occlusions caused by surrounding objects. To achieve this, we developed a high-level decision-making algorithm that generates short-term navigational goals for the mobile robot. Monte Carlo Tree Search is integrated with a Deep Reinforcement Learning method to enhance the performance of the decision-making process and generate more reliable navigational goals. Through extensive experimentation and analysis, we demonstrate the effectiveness and superiority of our proposed approach in comparison to existing follow-ahead human-following robotic methods. Our code is available at https://github.com/saharLeisiazar/follow-ahead-ros.
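
The sketch below shows one generic way an MCTS-DRL hybrid of this kind can be wired together: UCB selection over candidate short-term goals, with a learned value estimate standing in for random rollouts at the leaves. The motion model, candidate goals, and value function are toy placeholders, not the authors' implementation.

```python
# Generic sketch of MCTS guided by a learned value function for goal selection.
# State, transition, and value definitions are illustrative placeholders.
import math, random
from dataclasses import dataclass, field

GOALS = [(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]  # candidate short-term goals (dx, dy)

def transition(state, goal):
    """Toy deterministic motion model: step the robot toward the chosen goal."""
    x, y = state
    return (x + 0.5 * goal[0], y + 0.5 * goal[1])

def value_net(state):
    """Stand-in for a trained DRL critic: prefer states near a hypothetical target (3, 3)."""
    return -math.hypot(state[0] - 3.0, state[1] - 3.0)

@dataclass
class Node:
    state: tuple
    parent: "Node" = None
    children: dict = field(default_factory=dict)  # goal -> Node
    visits: int = 0
    total_value: float = 0.0

def select_child(node, c=1.4):
    """UCB1 over fully expanded children."""
    return max(
        node.children.items(),
        key=lambda kv: kv[1].total_value / kv[1].visits
        + c * math.sqrt(math.log(node.visits) / kv[1].visits),
    )[1]

def mcts(root_state, n_sims=200, depth=3):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # Selection / expansion down to a fixed depth.
        for _ in range(depth):
            untried = [g for g in GOALS if g not in node.children]
            if untried:
                goal = random.choice(untried)
                child = Node(transition(node.state, goal), parent=node)
                node.children[goal] = child
                node = child
                break
            node = select_child(node)
        # Leaf evaluation with the learned value instead of a random rollout.
        value = value_net(node.state)
        # Backpropagation.
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent
    # The most visited goal becomes the next navigational target.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts((0.0, 0.0)))
```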

Contextual Emotion Estimation from Image Captions

Sep 22, 2023
Vera Yang, Archita Srivastava, Yasaman Etesam, Chuxuan Zhang, Angelica Lim

Emotion estimation in images is a challenging task, typically addressed with computer vision methods that directly estimate people's emotions from face, body pose and contextual cues. In this paper, we explore whether Large Language Models (LLMs) can support the contextual emotion estimation task, by first captioning images, then using an LLM for inference. First, we must understand: how well do LLMs perceive human emotions? And which parts of the information enable them to determine emotions? One initial challenge is to construct a caption that describes a person within a scene with information relevant for emotion perception. Towards this goal, we propose a set of natural language descriptors for faces, bodies, interactions, and environments. We use them to manually generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset. These captions offer an interpretable representation for emotion estimation, towards understanding how elements of a scene affect emotion perception in LLMs and beyond. Second, we test the capability of a large language model to infer an emotion from the resulting image captions. We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations, but accuracy can depend on the emotion concept. Overall, the results suggest promise in the image captioning and LLM approach.
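
A rough sketch of the caption-then-LLM pipeline described above: compose a narrative caption from face, body, interaction, and environment descriptors, then ask a language model to choose among EMOTIC categories. The descriptor strings, prompt wording, and the ask_llm callable are placeholders (any completion backend such as text-davinci-003 could be plugged in), not the paper's exact prompts.

```python
# Sketch of the caption -> LLM emotion inference pipeline.
# Descriptor strings, prompt wording, and the category subset are illustrative,
# not the paper's exact templates; ask_llm can wrap any text-completion backend.
from typing import Callable, List

EMOTIC_SUBSET = [  # a few of the 26 EMOTIC categories, abbreviated for brevity
    "Affection", "Anger", "Annoyance", "Anticipation", "Doubt/Confusion",
    "Engagement", "Fear", "Happiness", "Sadness", "Surprise",
]

def build_caption(face: str, body: str, interaction: str, environment: str) -> str:
    """Compose a narrative caption from the four descriptor groups."""
    return (f"The person {face}. Their body {body}. "
            f"They are {interaction}, in {environment}.")

def build_prompt(caption: str, categories: List[str]) -> str:
    return (f"{caption}\n\nFrom the following list, which emotions does this "
            f"person most likely feel? Options: {', '.join(categories)}.")

def infer_emotions(caption: str, ask_llm: Callable[[str], str]) -> str:
    """ask_llm is any text-in/text-out LLM call (e.g., a GPT-3.5 completion wrapper)."""
    return ask_llm(build_prompt(caption, EMOTIC_SUBSET))

# Example usage with a trivial stand-in for the LLM:
caption = build_caption(
    face="is smiling broadly with raised eyebrows",
    body="is leaning forward with open arms",
    interaction="greeting a friend",
    environment="a sunlit park",
)
print(infer_emotions(caption, ask_llm=lambda prompt: "Happiness, Affection"))
```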

* Accepted to ACII 2023. Project page: http://rosielab.github.io/emotion-captions/ 

Towards Inclusive HRI: Using Sim2Real to Address Underrepresentation in Emotion Expression Recognition

Aug 15, 2022
Saba Akhyani, Mehryar Abbasi Boroujeni, Mo Chen, Angelica Lim

Robots and artificial agents that interact with humans should be able to do so without bias and inequity, but facial perception systems have notoriously been found to perform worse for certain groups of people than for others. In our work, we aim to build a system that can perceive humans in a more transparent and inclusive manner. Specifically, we focus on dynamic expressions on the human face, which are difficult to collect for a broad set of people due to privacy concerns and the fact that faces are inherently identifiable. Furthermore, datasets collected from the Internet are not necessarily representative of the general population. We address this problem by offering a Sim2Real approach in which we use a suite of 3D simulated human models that enables us to create an auditable synthetic dataset covering 1) underrepresented facial expressions, outside of the six basic emotions, such as confusion; 2) ethnic or gender minority groups; and 3) a wide range of viewing angles at which a robot may encounter a human in the real world. By augmenting a small dynamic emotional expression dataset containing 123 samples with a synthetic dataset containing 4536 samples, we achieved an improvement in accuracy of 15% on our own dataset and 11% on an external benchmark dataset, compared to the performance of the same model architecture without synthetic training data. We also show that this additional step improves accuracy specifically for racial minorities when the architecture's feature extraction weights are trained from scratch.
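
The augmentation step (a small real dataset extended with a much larger synthetic one) can be sketched with PyTorch's ConcatDataset as below; the tensor shapes, class count, and toy classifier are placeholders, and only the 123 real / 4536 synthetic split comes from the abstract.

```python
# Sketch of training on a small real dataset augmented with synthetic Sim2Real data.
# Tensor shapes, the class count, and the toy classifier are illustrative placeholders.
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

NUM_CLASSES = 7          # hypothetical number of dynamic expression classes
SEQ_LEN, FEAT_DIM = 16, 128

real_data = TensorDataset(torch.randn(123, SEQ_LEN, FEAT_DIM),
                          torch.randint(0, NUM_CLASSES, (123,)))
synthetic_data = TensorDataset(torch.randn(4536, SEQ_LEN, FEAT_DIM),
                               torch.randint(0, NUM_CLASSES, (4536,)))

# Real and synthetic samples are simply pooled into one training set.
train_loader = DataLoader(ConcatDataset([real_data, synthetic_data]),
                          batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(SEQ_LEN * FEAT_DIM, 256),
                      nn.ReLU(), nn.Linear(256, NUM_CLASSES))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```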

* 8 pages, 10 figures, submitted to IROS2022 

Read the Room: Adapting a Robot's Voice to Ambient and Social Contexts

May 10, 2022
Emma Hughson, Paige Tuttosi, Akihiro Matsufuji, Angelica Lim

Adapting one's voice to different ambient environments and social situations is an essential part of human social interaction. In robotics, the ability to recognize speech in noisy and quiet environments has received significant attention, but considering ambient cues in the production of social speech features has been little explored. Our research aims to modify a robot's speech to maximize acceptability in various social and acoustic contexts, starting with a use case for service robots in varying restaurants. We created an original dataset collected over Zoom with participants conversing in scripted and unscripted tasks given 7 different ambient sounds and background images. Voice conversion methods, in addition to altered Text-to-Speech matched to ambient-specific data, were used for speech synthesis tasks. We conducted a subjective perception study that showed humans prefer synthetic speech that matches ambience and social context, ultimately preferring more human-like voices. This work provides three solutions to ambient and socially appropriate synthetic voices: (1) a novel protocol to collect real contextual audio voice data, (2) tools and directions to manipulate robot speech for appropriate social and ambient-specific interactions, and (3) insight into voice conversion's role in flexibly altering robot speech to match different ambient environments.
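
A minimal illustration of ambient-aware speech parameterization: measure background loudness and pick speaking parameters accordingly. The RMS thresholds and voice settings below are invented for illustration; the work itself relies on voice conversion and ambient-matched TTS rather than this simple heuristic.

```python
# Toy illustration of ambient-aware speech parameterization.
# The dB thresholds and voice settings are invented; the paper uses voice
# conversion and ambient-matched TTS rather than this simple heuristic.
import numpy as np

def ambient_level_db(samples: np.ndarray) -> float:
    """Approximate loudness of an ambient audio buffer as RMS in dBFS."""
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20.0 * np.log10(max(rms, 1e-12))

def choose_voice_settings(level_db: float) -> dict:
    """Map ambient loudness to hypothetical TTS parameters (rate, pitch, volume)."""
    if level_db < -40.0:      # quiet cafe
        return {"rate": 0.95, "pitch_shift": 0.0, "volume": 0.6}
    elif level_db < -25.0:    # moderately busy restaurant
        return {"rate": 1.0, "pitch_shift": 1.0, "volume": 0.8}
    else:                     # loud, bar-like ambience
        return {"rate": 1.05, "pitch_shift": 2.0, "volume": 1.0}

# Example with one second of synthetic ambient noise at 16 kHz:
noise = 0.05 * np.random.randn(16000)
print(choose_voice_settings(ambient_level_db(noise)))
```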

* 8 pages 

Data-driven emotional body language generation for social robotics

May 02, 2022
Mina Marmpena, Fernando Garcia, Angelica Lim, Nikolas Hemion, Thomas Wennekers

In social robotics, endowing humanoid robots with the ability to generate bodily expressions of affect can improve human-robot interaction and collaboration, since humans attribute, and perhaps subconsciously anticipate, such traces in order to perceive an agent as engaging, trustworthy, and socially present. Robotic emotional body language needs to be believable, nuanced and relevant to the context. We implemented a deep learning data-driven framework that learns from a few hand-designed robotic bodily expressions and can generate numerous new ones of similar believability and lifelikeness. The framework uses the Conditional Variational Autoencoder model and a sampling approach based on the geometric properties of the model's latent space to condition the generative process on targeted levels of valence and arousal. The evaluation study found that the anthropomorphism and animacy of the generated expressions were not perceived differently from those of the hand-designed ones, and that the emotional conditioning was adequately differentiable between most levels, except for the pairs of neutral-positive valence and low-medium arousal. Furthermore, an exploratory analysis of the results reveals a possible impact of the conditioning on the perceived dominance of the robot, as well as on the participants' attention.
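
A minimal sketch of a Conditional VAE whose generative process is conditioned on valence and arousal, in the spirit of the framework described above; the motion dimensionality, layer sizes, and two-dimensional condition vector are illustrative assumptions rather than the paper's configuration.

```python
# Minimal Conditional VAE sketch: generation conditioned on (valence, arousal).
# Dimensions and layer sizes are illustrative, not the paper's configuration.
import torch
from torch import nn

class CVAE(nn.Module):
    def __init__(self, motion_dim=60, cond_dim=2, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(motion_dim + cond_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
                                     nn.Linear(128, motion_dim))

    def forward(self, motion, cond):
        h = self.encoder(torch.cat([motion, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

def loss_fn(recon, motion, mu, logvar, beta=1.0):
    """Training objective: reconstruction error plus KL divergence."""
    recon_loss = nn.functional.mse_loss(recon, motion, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kld

# Generating a new expression for a target valence/arousal by sampling the latent space:
model = CVAE()
cond = torch.tensor([[0.8, 0.3]])            # (valence, arousal) target
z = torch.randn(1, 8)                        # latent sample
new_motion = model.decoder(torch.cat([z, cond], dim=-1))
print(new_motion.shape)                      # torch.Size([1, 60])
```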

* For the associated video of the generated animations, see https://youtu.be/wmLT8FARSk0 and for a repository of the training data, see https://github.com/minamar/rebl-pepper-data 

The Many Faces of Anger: A Multicultural Video Dataset of Negative Emotions in the Wild (MFA-Wild)

Dec 10, 2021
Roya Javadi, Angelica Lim

The portrayal of negative emotions such as anger can vary widely between cultures and contexts, depending on the acceptability of expressing full-blown emotions rather than suppressing them to maintain harmony. The majority of emotional datasets collect data under the broad label "anger", but social signals can range from annoyed and contemptuous to angry, furious, hateful, and more. In this work, we curated the first in-the-wild multicultural video dataset of emotions, and deeply explored anger-related emotional expressions by asking culture-fluent annotators to label the videos with 6 labels and 13 emojis in a multi-label framework. We provide a baseline multi-label classifier on our dataset, and show how emojis can be effectively used as a language-agnostic tool for annotation.
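
As a schematic of the multi-label baseline mentioned above (a video can carry several of the 13 emoji labels at once), the sketch below uses one sigmoid output per emoji with binary cross-entropy; the feature extractor and dimensions are placeholders.

```python
# Sketch of a multi-label classifier over 13 emoji annotations per video.
# The precomputed video embeddings and dimensions are placeholders; only the
# multi-label setup (one sigmoid per emoji, BCE loss) reflects the baseline.
import torch
from torch import nn

NUM_EMOJIS = 13
video_features = torch.randn(32, 512)                 # batch of precomputed video embeddings
labels = (torch.rand(32, NUM_EMOJIS) > 0.7).float()   # multi-hot emoji annotations

classifier = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, NUM_EMOJIS))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(classifier(video_features), labels)
    loss.backward()
    optimizer.step()

# At inference time, any emoji whose sigmoid score exceeds a threshold is predicted.
predicted = (torch.sigmoid(classifier(video_features)) > 0.5).int()
```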

* 8 pages, 13 figures, submitted to FG2021 

Developing a Data-Driven Categorical Taxonomy of Emotional Expressions in Real World Human Robot Interactions

Mar 07, 2021
Ghazal Saheb Jam, Jimin Rhim, Angelica Lim

Emotions are reactions that can be expressed through a variety of social signals. For example, anger can be expressed through a scowl, narrowed eyes, a long stare, or many other expressions. This complexity is problematic when attempting to recognize a human's expression in a human-robot interaction: categorical emotion models used in HRI typically use only a few prototypical classes, and do not cover the wide array of expressions in the wild. We propose a data-driven method for increasing the number of known emotion classes present in human-robot interactions to 28 classes or more. The method includes automatic segmentation of video streams into short (<10 s) clips and annotation using the large set of widely understood emojis as categories. In this work, we showcase our initial results using a large in-the-wild HRI dataset (UE-HRI), with 61 clips randomly sampled from the dataset and labeled with 28 different emojis. In particular, our results showed that the "skeptical" emoji was a common expression in our dataset, which is not often considered in typical emotion taxonomies. This is the first step in developing a rich taxonomy of emotional expressions that can be used in the future as labels for training machine learning models, towards more accurate perception of humans by robots.
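
The automatic segmentation step can be illustrated with simple fixed-window chunking; only the 10-second bound comes from the text above, and the boundary logic below is an illustrative stand-in for whatever segmentation the method actually uses.

```python
# Sketch of splitting a long interaction recording into short (<10 s) clips for annotation.
# Only the 10-second bound comes from the text; the boundary logic is illustrative.
def clip_boundaries(duration_s: float, max_clip_s: float = 10.0):
    """Return (start, end) times that partition the recording into clips of at most max_clip_s."""
    boundaries = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_clip_s, duration_s)
        boundaries.append((start, end))
        start = end
    return boundaries

# A 45-second stream yields five clips, the last one five seconds long.
print(clip_boundaries(45.0))
```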

SFU-Store-Nav: A Multimodal Dataset for Indoor Human Navigation

Oct 28, 2020
Zhitian Zhang, Jimin Rhim, Taher Ahmadi, Kefan Yang, Angelica Lim, Mo Chen

This article describes a dataset collected in a set of experiments involving human participants and a robot. The experiments were conducted in the computing science robotics lab at Simon Fraser University, Burnaby, BC, Canada, with the aim of gathering data containing common gestures, movements, and other behaviours that may indicate humans' navigational intent, which is relevant for autonomous robot navigation. The experiment simulates a shopping scenario where human participants come in to pick up items from their shopping lists and interact with a Pepper robot that is programmed to help them. We collected visual data and motion capture data from 108 human participants. The visual data contains live recordings of the experiments, and the motion capture data contains the position and orientation of the human participants in world coordinates. This dataset could be valuable for researchers in the robotics, machine learning and computer vision communities.

* 5 pages, paper submitted to Data In Brief Journal 

The OMG-Empathy Dataset: Evaluating the Impact of Affective Behavior in Storytelling

Aug 30, 2019
Pablo Barros, Nikhil Churamani, Angelica Lim, Stefan Wermter

Processing human affective behavior is important for developing intelligent agents that interact with humans in complex interaction scenarios. A large number of current approaches that address this problem focus on classifying emotion expressions by grouping them into known categories. Such strategies neglect, among other aspects, the impact of an individual's affective responses on their interaction partner, thus ignoring how people empathize with each other. This is also reflected in the datasets used to train models for affective processing tasks. Most recent datasets, in particular those that capture natural interactions ("in-the-wild" datasets), are designed, collected, and annotated based on the recognition of displayed affective reactions, ignoring how these displayed or expressed emotions are perceived. In this paper, we propose a novel dataset composed of dyadic interactions designed, collected and annotated with a focus on measuring the affective impact that eight different stories have on the listener. Each video of the dataset contains around 5 minutes of interaction where a speaker tells a story to a listener. After each interaction, the listener annotated, using a valence scale, how the story impacted their affective state, reflecting how they empathized with the speaker as well as the story. We also propose different evaluation protocols and a baseline that encourages participation in the advancement of the field of artificial empathy and emotion contagion.
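
For scoring predicted against self-reported valence traces, one commonly used agreement measure is the concordance correlation coefficient (CCC); the sketch below shows that computation for illustration only, without claiming it is the exact evaluation protocol proposed in the paper.

```python
# Concordance correlation coefficient (CCC), a common agreement measure for
# continuous valence traces; shown for illustration, not necessarily the exact
# evaluation protocol proposed in the paper.
import numpy as np

def ccc(predicted: np.ndarray, annotated: np.ndarray) -> float:
    """CCC between a predicted and a self-annotated valence sequence."""
    pred_mean, ann_mean = predicted.mean(), annotated.mean()
    covariance = np.mean((predicted - pred_mean) * (annotated - ann_mean))
    return (2 * covariance) / (
        predicted.var() + annotated.var() + (pred_mean - ann_mean) ** 2
    )

# Example with two toy valence traces spanning roughly five minutes:
t = np.linspace(0, 5 * 60, 300)
annotated = 0.5 * np.sin(t / 60.0)
predicted = annotated + 0.1 * np.random.randn(300)
print(f"CCC = {ccc(predicted, annotated):.3f}")
```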

* 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII) 