Most of the parameters in large vocabulary models are used in embedding layer to map categorical features to vectors and in softmax layer for classification weights. This is a bottle-neck in memory constraint on-device training applications like federated learning and on-device inference applications like automatic speech recognition (ASR). One way of compressing the embedding and softmax layers is to substitute larger units such as words with smaller sub-units such as characters. However, often the sub-unit models perform poorly compared to the larger unit models. We propose WEST, an algorithm for encoding categorical features and output classes with a sequence of random or domain dependent sub-units and demonstrate that this transduction can lead to significant compression without compromising performance. WEST bridges the gap between larger unit and sub-unit models and can be interpreted as a MaxEnt model over sub-unit features, which can be of independent interest.
Social assistance robots in health and elderly care have the potential to support and ease human lives. Given the macrosocial trends of aging and long-lived populations, robotics-based care research mainly focused on helping the elderly live independently. In this paper, we introduce Robot DE NIRO, a research platform that aims to support the supporter (the caregiver) and also offers direct human-robot interaction for the care recipient. Augmented by several sensors, DE NIRO is capable of complex manipulation tasks. It reliably interacts with humans and can autonomously and swiftly navigate through dynamically changing environments. We describe preliminary experiments in a demonstrative scenario and discuss DE NIRO's design and capabilities. We put particular emphases on safe, human-centered interaction procedures implemented in both hardware and software, including collision avoidance in manipulation and navigation as well as an intuitive perception stack through speech and face recognition.
Conventional wisdom is that hand-crafted features are redundant for deep learning models, as they already learn adequate representations of text automatically from corpora. In this work, we test this claim by proposing a new method for exploiting handcrafted features as part of a novel hybrid learning approach, incorporating a feature auto-encoder loss component. We evaluate on the task of named entity recognition (NER), where we show that including manual features for part-of-speech, word shapes and gazetteers can improve the performance of a neural CRF model. We obtain a $F_1$ of 91.89 for the CoNLL-2003 English shared task, which significantly outperforms a collection of highly competitive baseline models. We also present an ablation study showing the importance of auto-encoding, over using features as either inputs or outputs alone, and moreover, show including the autoencoder components reduces training requirements to 60\%, while retaining the same predictive accuracy.
We introduce Latent Vector Grammars (LVeGs), a new framework that extends latent variable grammars such that each nonterminal symbol is associated with a continuous vector space representing the set of (infinitely many) subtypes of the nonterminal. We show that previous models such as latent variable grammars and compositional vector grammars can be interpreted as special cases of LVeGs. We then present Gaussian Mixture LVeGs (GM-LVeGs), a new special case of LVeGs that uses Gaussian mixtures to formulate the weights of production rules over subtypes of nonterminals. A major advantage of using Gaussian mixtures is that the partition function and the expectations of subtype rules can be computed using an extension of the inside-outside algorithm, which enables efficient inference and learning. We apply GM-LVeGs to part-of-speech tagging and constituency parsing and show that GM-LVeGs can achieve competitive accuracies. Our code is available at https://github.com/zhaoyanpeng/lveg.
We study the problem of analyzing tweets with Universal Dependencies. We extend the UD guidelines to cover special constructions in tweets that affect tokenization, part-of-speech tagging, and labeled dependencies. Using the extended guidelines, we create a new tweet treebank for English (Tweebank v2) that is four times larger than the (unlabeled) Tweebank v1 introduced by Kong et al. (2014). We characterize the disagreements between our annotators and show that it is challenging to deliver consistent annotation due to ambiguity in understanding and explaining tweets. Nonetheless, using the new treebank, we build a pipeline system to parse raw tweets into UD. To overcome annotation noise without sacrificing computational efficiency, we propose a new method to distill an ensemble of 20 transition-based parsers into a single one. Our parser achieves an improvement of 2.2 in LAS over the un-ensembled baseline and outperforms parsers that are state-of-the-art on other treebanks in both accuracy and speed.
This paper describes the MCV image labeling algorithm which is a (semi-) hierarchical algorithm commencing with a partition made up of single pixel regions and merging regions or subsets of regions using a Markov random field (MRF) image model. It is an example of a general approach to computer vision called concurrent vision in which the operations of image segmentation and image classification are carried out concurrently. The output of the MCV algorithm can be a simple segmentation partition or a sequence of partitions which can provide useful information to higher level vision systems. In the case of an autoregressive Gaussian MRF the evaluation of sub-images for homogeneity is computationally inexpensive and may be effected by a hardwired feed-forward neural network. The merge operation of the algorithm is massively parallelizable. While being applicable to images (i.e. 2D signals), the algorithm is equally applicable to 1D signals (e.g. speech) or 3D signals (e.g. video sequences)
In recent studies, it has shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional & time-delay deep neural network (CT-DNN) model. By enforcing the model to discriminate the speakers in the training data, frame-level speaker features can be derived from the last hidden layer. In spite of its good performance, a potential problem of the present model is that it involves a parametric classifier, i.e., the last affine layer, which may consume some discriminative knowledge, thus leading to `information leak' for the feature learning. This paper presents a full-info training approach that discards the parametric classifier and enforces all the discriminative knowledge learned by the feature net. Our experiments on the Fisher database demonstrate that this new training scheme can produce more coherent features, leading to consistent and notable performance improvement on the speaker verification task.
Vanishing long-term gradients are a major issue in training standard recurrent neural networks (RNNs), which can be alleviated by long short-term memory (LSTM) models with memory cells. However, the extra parameters associated with the memory cells mean an LSTM layer has four times as many parameters as an RNN with the same hidden vector size. This paper addresses the vanishing gradient problem using a high order RNN (HORNN) which has additional connections from multiple previous time steps. Speech recognition experiments using British English multi-genre broadcast (MGB3) data showed that the proposed HORNN architectures for rectified linear unit and sigmoid activation functions reduced word error rates (WER) by 4.2% and 6.3% over the corresponding RNNs, and gave similar WERs to a (projected) LSTM while using only 20%--50% of the recurrent layer parameters and computation.
In order to create effective storytelling agents three fundamental questions must be answered: first, is a physically embodied agent preferable to a virtual agent or a voice-only narration? Second, does a human voice have an advantage over a synthesised voice? Third, how should the emotional trajectory of the different characters in a story be related to a storyteller's facial expressions during storytelling time, and how does this correlate with the apparent emotions on the faces of the listeners? The results of two specially designed studies indicate that the physically embodied robot produces more attention to the listener as compared to a virtual embodiment, that a human voice is preferable over the current state of the art of text-to-speech, and that there is a complex yet interesting relation between the emotion lines of the story, the facial expressions of the narrating agent, and the emotions of the listener, and that the empathising of the listener is evident through its facial expressions. This work constitutes an important step towards emotional storytelling robots that can observe their listeners and adapt their style in order to maximise their effectiveness.
The proliferation of sensor devices monitoring human activity generates voluminous amount of temporal sequences needing to be interpreted and categorized. Moreover, complex behavior detection requires the personalization of multi-sensor fusion algorithms. Conditional random fields (CRFs) are commonly used in structured prediction tasks such as part-of-speech tagging in natural language processing. Conditional probabilities guide the choice of each tag/label in the sequence conflating the structured prediction task with the sequence classification task where different models provide different categorization of the same sequence. The claim of this paper is that CRF models also provide discriminative models to distinguish between types of sequence regardless of the accuracy of the labels obtained if we calibrate the class membership estimate of the sequence. We introduce and compare different neural network based linear-chain CRFs and we present experiments on two complex sequence classification and structured prediction tasks to support this claim.