Training robots to perceive, act and communicate using multiple modalities still represents a challenging problem, particularly if robots are expected to learn efficiently from small sets of example interactions. We describe a learning approach as a step in this direction, where we teach a humanoid robot how to play the game of noughts and crosses. Given that multiple multimodal skills can be trained to play this game, we focus our attention to training the robot to perceive the game, and to interact in this game. Our multimodal deep reinforcement learning agent perceives multimodal features and exhibits verbal and non-verbal actions while playing. Experimental results using simulations show that the robot can learn to win or draw up to 98% of the games. A pilot test of the proposed multimodal system for the targeted game---integrating speech, vision and gestures---reports that reasonable and fluent interactions can be achieved using the proposed approach.
Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our large rerecorded noisy and far-field eval sets, we show that PCEN significantly improves recognition performance. Furthermore, we model PCEN as neural network layers and optimize high-dimensional PCEN parameters jointly with the keyword spotting acoustic model. The trained PCEN frontend demonstrates significant further improvements without increasing model complexity or inference-time cost.
Various algorithms for text-independent speaker recognition have been developed through the decades, aiming to improve both accuracy and efficiency. This paper presents a novel PCA/LDA-based approach that is faster than traditional statistical model-based methods and achieves competitive results. First, the performance based on only PCA and only LDA is measured; then a mixed model, taking advantages of both methods, is introduced. A subset of the TIMIT corpus composed of 200 male speakers, is used for enrollment, validation and testing. The best results achieve 100%; 96% and 95% classification rate at population level 50; 100 and 200, using 39-dimensional MFCC features with delta and double delta. These results are based on 12-second text-independent speech for training and 4-second data for test. These are comparable to the conventional MFCC-GMM methods, but require significantly less time to train and operate.
This article proposes an emotive lifelike robotic face, called ExpressionBot, that is designed to support verbal and non-verbal communication between the robot and humans, with the goal of closely modeling the dynamics of natural face-to-face communication. The proposed robotic head consists of two major components: 1) a hardware component that contains a small projector, a fish-eye lens, a custom-designed mask and a neck system with 3 degrees of freedom; 2) a facial animation system, projected onto the robotic mask, that is capable of presenting facial expressions, realistic eye movement, and accurate visual speech. We present three studies that compare Human-Robot Interaction with Human-Computer Interaction with a screen-based model of the avatar. The studies indicate that the robotic face is well accepted by users, with some advantages in recognition of facial expression and mutual eye gaze contact.
Deep neural networks have recently achieved state-of-the-art results in many machine learning problems, e.g., speech recognition or object recognition. Hitherto, work on rectified linear units (ReLU) provides empirical and theoretical evidence on performance increase of neural networks comparing to typically used sigmoid activation function. In this paper, we investigate a new manner of improving neural networks by introducing a bunch of copies of the same neuron modeled by the generalized Kumaraswamy distribution. As a result, we propose novel non-linear activation function which we refer to as Kumaraswamy unit which is closely related to ReLU. In the experimental study with MNIST image corpora we evaluate the Kumaraswamy unit applied to single-layer (shallow) neural network and report a significant drop in test classification error and test cross-entropy in comparison to sigmoid unit, ReLU and Noisy ReLU.
The advent of large scale neural computational platforms has highlighted the lack of algorithms for synthesis of neural structures to perform predefined cognitive tasks. The Neural Engineering Framework offers one such synthesis, but it is most effective for a spike rate representation of neural information, and it requires a large number of neurons to implement simple functions. We describe a neural network synthesis method that generates synaptic connectivity for neurons which process time-encoded neural signals, and which makes very sparse use of neurons. The method allows the user to specify, arbitrarily, neuronal characteristics such as axonal and dendritic delays, and synaptic transfer functions, and then solves for the optimal input-output relationship using computed dendritic weights. The method may be used for batch or online learning and has an extremely fast optimization process. We demonstrate its use in generating a network to recognize speech which is sparsely encoded as spike times.
This paper addresses the problem of classifying observations when features are context-sensitive, especially when the testing set involves a context that is different from the training set. The paper begins with a precise definition of the problem, then general strategies are presented for enhancing the performance of classification algorithms on this type of problem. These strategies are tested on three domains. The first domain is the diagnosis of gas turbine engines. The problem is to diagnose a faulty engine in one context, such as warm weather, when the fault has previously been seen only in another context, such as cold weather. The second domain is speech recognition. The context is given by the identity of the speaker. The problem is to recognize words spoken by a new speaker, not represented in the training set. The third domain is medical prognosis. The problem is to predict whether a patient with hepatitis will live or die. The context is the age of the patient. For all three domains, exploiting context results in substantially more accurate classification.
We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos available at https://evonneng.github.io/learning2listen/.
Artificial neural networks (ANN), typically referred to as neural networks, are a class of Machine Learning algorithms and have achieved widespread success, having been inspired by the biological structure of the human brain. Neural networks are inherently powerful due to their ability to learn complex function approximations from data. This generalization ability has been able to impact multidisciplinary areas involving image recognition, speech recognition, natural language processing, and others. Activation functions are a crucial sub-component of neural networks. They define the output of a node in the network given a set of inputs. This survey discusses the main concepts of activation functions in neural networks, including; a brief introduction to deep neural networks, a summary of what are activation functions and how they are used in neural networks, their most common properties, the different types of activation functions, some of the challenges, limitations, and alternative solutions faced by activation functions, concluding with the final remarks.
History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context brings latency for streaming ASR. In this paper, we propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition. A new simulation module is introduced to recursively simulate the future contextual frames, without waiting for future context. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF as used in our experiments. Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset.