Abstract: Some speech processing tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance on many reported metrics. Yet they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, and specifically on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore covers at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructed placements and positions in a different room from the actor. The data is publicly available and can be used to evaluate a variety of speech recognition tasks, including ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for the ASR and SER tasks and find that ASR performance degrades as distance and shout level increase, and varies with the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER, and that continued work is needed to improve the robustness of such systems for more accurate real-world use.
Abstract: Acoustic context effects, where surrounding changes in pitch, rate, or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases surrounding different pairs of vowels for second-language (L2) speakers of English (/i/-/ɪ/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2 s before the target and a contrastive distal effect up to 1 s earlier; we also found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domains.
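As a rough illustration of how a reverse-correlation analysis recovers such prosodic profiles, the toy Python sketch below simulates random pre-target pitch perturbations and binary vowel responses, then computes the first-order kernel. Every quantity in it (trial and segment counts, listener weights, noise level, the response labels) is invented for the sketch and does not reflect the study's actual stimuli or analysis pipeline.

```python
# Toy reverse-correlation sketch: all numbers below are assumptions, not the study's data.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_segments = 500, 6                          # ~1 s of context split into 6 segments
pitch = rng.normal(0.0, 1.0, (n_trials, n_segments))   # random pitch perturbations (z-scored)

# Simulated listener: the segment just before the target acts congruently,
# earlier (distal) segments act contrastively, plus internal noise.
weights = np.array([-0.2, -0.2, -0.2, -0.2, -0.2, 0.6])
responses = (pitch @ weights + rng.normal(0.0, 1.0, n_trials)) > 0   # e.g. /i/ vs /ɪ/

# First-order kernel: mean pitch profile on one response minus the other.
# Its shape over time is the "prosodic profile" that biases perception.
kernel = pitch[responses].mean(axis=0) - pitch[~responses].mean(axis=0)
print(np.round(kernel, 2))
```

In this simulation, the recovered kernel is positive for the proximal segment and negative for the distal ones, mirroring the congruent-proximal / contrastive-distal pattern described in the abstract.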
Abstract: Emotions guide our decision-making, yet they have been little explored in practical ethical decision-making scenarios. In this challenge, we explore emotions and how they can influence ethical decision making in a home-robot context: which fetch requests should a robot execute, and why or why not? We discuss, in particular, two aspects of emotion: (1) somatic markers: objects to be retrieved are tagged as negative (dangerous, e.g., knives, or mind-altering, e.g., medicine with overdose potential), providing a quick heuristic for where to focus attention and thereby avoiding the classic Frame Problem of artificial intelligence; (2) emotion inference: users' valence and arousal levels are taken into account when defining how and when a robot should respond to a human's requests, e.g., carefully considering whether to hand dangerous items to users experiencing intense emotions. Our emotion-based approach builds a foundation for the primary consideration of Safety and is complemented by policies that support overrides based on Context (e.g., user age, allergies) and Privacy (e.g., administrator settings). Transparency is another key aspect of our solution. Our solution is defined using behaviour trees, towards an implementable design that can provide reasoning information in real time.
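As a hedged illustration of how such a behaviour-tree policy could be structured, the minimal Python sketch below encodes the somatic-marker check, a valence/arousal gate, a context override, and a logged reasoning trace for transparency. Every class, field name, and threshold here (FetchRequest, somatic_marker, the 0.7 arousal cutoff, etc.) is an assumption made for this sketch, not the authors' implementation.

```python
# Hypothetical behaviour-tree sketch of a fetch-request policy; names and
# thresholds are invented for illustration only.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class FetchRequest:
    item: str
    somatic_marker: Optional[str]    # e.g. "dangerous", "mind-altering", or None
    user_valence: float              # -1.0 (very negative) .. 1.0 (very positive)
    user_arousal: float              #  0.0 (calm) .. 1.0 (intense)
    context_override: bool = False   # e.g. an administrator/caregiver setting
    log: List[str] = field(default_factory=list)   # transparency: reasoning trace

class Sequence:
    """Behaviour-tree AND node: succeeds only if every child succeeds."""
    def __init__(self, *children): self.children = children
    def tick(self, req): return all(c.tick(req) for c in self.children)

class Selector:
    """Behaviour-tree OR node: succeeds as soon as one child succeeds."""
    def __init__(self, *children): self.children = children
    def tick(self, req): return any(c.tick(req) for c in self.children)

class Condition:
    """Leaf node that checks a predicate and records the result for transparency."""
    def __init__(self, name: str, fn: Callable[[FetchRequest], bool]):
        self.name, self.fn = name, fn
    def tick(self, req):
        ok = self.fn(req)
        req.log.append(f"{self.name}: {'pass' if ok else 'fail'}")
        return ok

# Somatic marker: unmarked items pass the quick safety heuristic immediately.
item_unmarked = Condition("item has no negative marker", lambda r: r.somatic_marker is None)
# Emotion inference: marked items are only handed over when the user appears calm.
user_calm = Condition("user calm enough", lambda r: r.user_arousal < 0.7 and r.user_valence > -0.5)
# Context: an explicit override (e.g. administrator approval) is also required.
override_set = Condition("context override present", lambda r: r.context_override)

fetch_policy = Selector(
    item_unmarked,                      # Safety first: no marker, fetch allowed
    Sequence(user_calm, override_set),  # marked item: needs a calm user AND an override
)

if __name__ == "__main__":
    req = FetchRequest(item="kitchen knife", somatic_marker="dangerous",
                       user_valence=-0.8, user_arousal=0.9)
    print("approved:", fetch_policy.tick(req))   # approved: False
    print(req.log)                               # reasoning trace for this tick
```

Because each Condition appends its result to the request's log as the tree is ticked, the reasons for approving or refusing a fetch are available immediately, in line with the real-time transparency goal described in the abstract.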