Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Frozen Binomials on the Web: Word Ordering and Language Conventions in Online Text

Mar 07, 2020
Katherine Van Koevering, Austin R. Benson, Jon Kleinberg

There is inherent information captured in the order in which we write words in a list. The orderings of binomials --- lists of two words separated by `and' or `or' --- has been studied for more than a century. These binomials are common across many areas of speech, in both formal and informal text. In the last century, numerous explanations have been given to describe what order people use for these binomials, from differences in semantics to differences in phonology. These rules describe primarily `frozen' binomials that exist in exactly one ordering and have lacked large-scale trials to determine efficacy. Online text provides a unique opportunity to study these lists in the context of informal text at a very large scale. In this work, we expand the view of binomials to include a large-scale analysis of both frozen and non-frozen binomials in a quantitative way. Using this data, we then demonstrate that most previously proposed rules are ineffective at predicting binomial ordering. By tracking the order of these binomials across time and communities we are able to establish additional, unexplored dimensions central to these predictions. Expanding beyond the question of individual binomials, we also explore the global structure of binomials in various communities, establishing a new model for these lists and analyzing this structure for non-frozen and frozen binomials. Additionally, novel analysis of trinomials --- lists of length three --- suggests that none of the binomials analysis applies in these cases. Finally, we demonstrate how large data sets gleaned from the web can be used in conjunction with older theories to expand and improve on old questions.

* 12 pages, 10 figures 

  Access Paper or Ask Questions

Cellular automata as convolutional neural networks

Sep 09, 2018
William Gilpin

Deep learning techniques have recently demonstrated broad success in predicting complex dynamical systems ranging from turbulence to human speech, motivating broader questions about how neural networks encode and represent dynamical rules. We explore this problem in the context of cellular automata (CA), simple dynamical systems that are intrinsically discrete and thus difficult to analyze using standard tools from dynamical systems theory. We show that any CA may readily be represented using a convolutional neural network with a network-in-network architecture. This motivates our development of a general convolutional multilayer perceptron architecture, which we find can learn the dynamical rules for arbitrary CA when given videos of the CA as training data. In the limit of large network widths, we find that training dynamics are strongly stereotyped across replicates, and that common patterns emerge in the structure of networks trained on different CA rulesets. We train ensembles of networks on randomly-sampled CA, and we probe how the trained networks internally represent the CA rules using an information-theoretic technique based on distributions of layer activation patterns. We find that CA with simpler rule tables produce trained networks with hierarchical structure and layer specialization, while more complex CA tend to produce shallower representations---illustrating how the underlying complexity of the CA's rules influences the specificity of these internal representations. Our results suggest how the entropy of a physical process can affect its representation when learned by neural networks.

* 6 pages, 3 figures (+Appendix) 

  Access Paper or Ask Questions

A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA

May 07, 2017
Xinyu Zhang, Srinjoy Das, Ojash Neopane, Ken Kreutz-Delgado

In recent years deep learning algorithms have shown extremely high performance on machine learning tasks such as image classification and speech recognition. In support of such applications, various FPGA accelerator architectures have been proposed for convolutional neural networks (CNNs) that enable high performance for classification tasks at lower power than CPU and GPU processors. However, to date, there has been little research on the use of FPGA implementations of deconvolutional neural networks (DCNNs). DCNNs, also known as generative CNNs, encode high-dimensional probability distributions and have been widely used for computer vision applications such as scene completion, scene segmentation, image creation, image denoising, and super-resolution imaging. We propose an FPGA architecture for deconvolutional networks built around an accelerator which effectively handles the complex memory access patterns needed to perform strided deconvolutions, and that supports convolution as well. We also develop a three-step design optimization method that systematically exploits statistical analysis, design space exploration and VLSI optimization. To verify our FPGA deconvolutional accelerator design methodology we train DCNNs offline on two representative datasets using the generative adversarial network method (GAN) run on Tensorflow, and then map these DCNNs to an FPGA DCNN-plus-accelerator implementation to perform generative inference on a Xilinx Zynq-7000 FPGA. Our DCNN implementation achieves a peak performance density of 0.012 GOPs/DSP.

  Access Paper or Ask Questions

Scalable and Flexible Multiview MAX-VAR Canonical Correlation Analysis

May 04, 2017
Xiao Fu, Kejun Huang, Mingyi Hong, Nicholas D. Sidiropoulos, Anthony Man-Cho So

Generalized canonical correlation analysis (GCCA) aims at finding latent low-dimensional common structure from multiple views (feature vectors in different domains) of the same entities. Unlike principal component analysis (PCA) that handles a single view, (G)CCA is able to integrate information from different feature spaces. Here we focus on MAX-VAR GCCA, a popular formulation which has recently gained renewed interest in multilingual processing and speech modeling. The classic MAX-VAR GCCA problem can be solved optimally via eigen-decomposition of a matrix that compounds the (whitened) correlation matrices of the views; but this solution has serious scalability issues, and is not directly amenable to incorporating pertinent structural constraints such as non-negativity and sparsity on the canonical components. We posit regularized MAX-VAR GCCA as a non-convex optimization problem and propose an alternating optimization (AO)-based algorithm to handle it. Our algorithm alternates between {\em inexact} solutions of a regularized least squares subproblem and a manifold-constrained non-convex subproblem, thereby achieving substantial memory and computational savings. An important benefit of our design is that it can easily handle structure-promoting regularization. We show that the algorithm globally converges to a critical point at a sublinear rate, and approaches a global optimal solution at a linear rate when no regularization is considered. Judiciously designed simulations and large-scale word embedding tasks are employed to showcase the effectiveness of the proposed algorithm.

  Access Paper or Ask Questions

Towards Robust Waveform-Based Acoustic Models

Oct 16, 2021
Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu

We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).

  Access Paper or Ask Questions

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Jul 20, 2021
Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu

We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker in a large head motion while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our latter network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds and outperforms the state-of-the-art.

* IJCAI 2021 

  Access Paper or Ask Questions

Disentangling Homophemes in Lip Reading using Perplexity Analysis

Nov 28, 2020
Souheil Fenghour, Daqing Chen, Kun Guo, Perry Xiao

The performance of automated lip reading using visemes as a classification schema has achieved less success compared with the use of ASCII characters and words largely due to the problem of different words sharing identical visemes. The Generative Pre-Training transformer is an effective autoregressive language model used for many tasks in Natural Language Processing, including sentence prediction and text classification. This paper proposes a new application for this model and applies it in the context of lip reading, where it serves as a language model to convert visual speech in the form of visemes, to language in the form of words and sentences. The network uses the search for optimal perplexity to perform the viseme-to-word mapping and is thus a solution to the one-to-many mapping problem that exists whereby various words that sound different when spoken look identical. This paper proposes a method to tackle the one-to-many mapping problem when performing automated lip reading using solely visual cues in two separate scenarios: the first scenario is where the word boundary, that is, the beginning and the ending of a word, is unknown; and the second scenario is where the boundary is known. Sentences from the benchmark BBC dataset "Lip Reading Sentences in the Wild"(LRS2), are classified with a character error rate of 10.7% and a word error rate of 18.0%. The main contribution of this paper is to propose a method of predicting words through the use of perplexity analysis when only visual cues are present, using an autoregressive language model.

* 17 pages 

  Access Paper or Ask Questions

Paralinguistic Privacy Protection at the Edge

Nov 04, 2020
Ranya Aloufi, Hamed Haddadi, David Boyle

Voice user interfaces and digital assistants are rapidly entering our homes and becoming integrated with all our devices. These always-on services capture and transmit our audio data to powerful cloud services for further processing and subsequent actions. Our voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information that is transmitted to service providers regardless of deliberate or false triggers. As sensitive attributes like our identity, gender, indicators of mental health status, alongside moods, emotions and their temporal patterns, are easily inferred using deep acoustic models, we encounter a new generation of privacy risks by using these services. One approach to mitigate the risk of paralinguistic-based privacy breaches is to exploit a combination of cloud-based processing with privacy-preserving on-device paralinguistic information filtering prior to transmitting voice data. In this paper we introduce EDGY, a new lightweight disentangled representation learning model that transforms and filters high-dimensional voice data to remove sensitive attributes at the edge prior to offloading to the cloud. We evaluate EDGY's on-device performance, and explore optimization techniques, including model pruning and quantization, to enable private, accurate and efficient representation learning on resource-constrained devices. Our experimental results show that EDGY runs in tens of milliseconds with minimal performance penalties or accuracy losses in speech recognition using only a CPU and a single core ARM device without specialized hardware.

* 14 pages, 7 figures. arXiv admin note: text overlap with arXiv:2007.15064 

  Access Paper or Ask Questions

Hierarchical Transformer Network for Utterance-level Emotion Recognition

Feb 18, 2020
QingBiao Li, ChunHua Wu, KangFeng Zheng, Zhe Wang

While there have been significant advances in de-tecting emotions in text, in the field of utter-ance-level emotion recognition (ULER), there are still many problems to be solved. In this paper, we address some challenges in ULER in dialog sys-tems. (1) The same utterance can deliver different emotions when it is in different contexts or from different speakers. (2) Long-range contextual in-formation is hard to effectively capture. (3) Unlike the traditional text classification problem, this task is supported by a limited number of datasets, among which most contain inadequate conversa-tions or speech. To address these problems, we propose a hierarchical transformer framework (apart from the description of other studies, the "transformer" in this paper usually refers to the encoder part of the transformer) with a lower-level transformer to model the word-level input and an upper-level transformer to capture the context of utterance-level embeddings. We use a pretrained language model bidirectional encoder representa-tions from transformers (BERT) as the lower-level transformer, which is equivalent to introducing external data into the model and solve the problem of data shortage to some extent. In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the in-teraction between speakers. Experiments on three dialog emotion datasets, Friends, EmotionPush, and EmoryNLP, demonstrate that our proposed hierarchical transformer network models achieve 1.98%, 2.83%, and 3.94% improvement, respec-tively, over the state-of-the-art methods on each dataset in terms of macro-F1.

* 7 pages 

  Access Paper or Ask Questions

Harnessing Evolution of Multi-Turn Conversations for Effective Answer Retrieval

Jan 31, 2020
Mohammad Aliannejadi, Manajit Chakraborty, Esteban Andrés Ríssola, Fabio Crestani

With the improvements in speech recognition and voice generation technologies over the last years, a lot of companies have sought to develop conversation understanding systems that run on mobile phones or smart home devices through natural language interfaces. Conversational assistants, such as Google Assistant and Microsoft Cortana, can help users to complete various types of tasks. This requires an accurate understanding of the user's information need as the conversation evolves into multiple turns. Finding relevant context in a conversation's history is challenging because of the complexity of natural language and the evolution of a user's information need. In this work, we present an extensive analysis of language, relevance, dependency of user utterances in a multi-turn information-seeking conversation. To this aim, we have annotated relevant utterances in the conversations released by the TREC CaST 2019 track. The annotation labels determine which of the previous utterances in a conversation can be used to improve the current one. Furthermore, we propose a neural utterance relevance model based on BERT fine-tuning, outperforming competitive baselines. We study and compare the performance of multiple retrieval models, utilizing different strategies to incorporate the user's context. The experimental results on both classification and retrieval tasks show that our proposed approach can effectively identify and incorporate the conversation context. We show that processing the current utterance using the predicted relevant utterance leads to a 38% relative improvement in terms of [email protected] Finally, to foster research in this area, we have released the dataset of the annotations.

* To appear in ACM CHIIR 2020, Vancouver, BC, Canada 

  Access Paper or Ask Questions