"speech": models, code, and papers

A Review of Recent Advances of Binary Neural Networks for Edge Computing

Nov 24, 2020
Wenyu Zhao, Teli Ma, Xuan Gong, Baochang Zhang, David Doermann

Edge computing is poised to become one of the next hottest topics in artificial intelligence because it benefits various evolving domains such as real-time unmanned aerial systems, industrial applications, and the demand for privacy protection. This paper reviews recent advances in binary neural network (BNN) and 1-bit CNN technologies that are well suited to front-end, edge-based computing. We introduce and summarize existing work and classify it by gradient approximation, quantization, architecture, loss functions, optimization method, and binary neural architecture search. We also introduce applications in the areas of computer vision and speech recognition and discuss future applications for edge computing.
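To make the appeal of 1-bit networks for edge hardware concrete, here is a minimal sketch of the standard XNOR-popcount identity for dot products of {-1, +1} vectors. It is not taken from the reviewed paper; the function names `binarize`, `pack_bits`, and `xnor_popcount_dot` are hypothetical, and mapping zero to +1 is an assumption.

```python
def binarize(values):
    """Map real values to {-1, +1} with the sign function (assumption: 0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a {-1, +1} vector into an int, one bit per element (+1 -> bit set)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    Each matching sign contributes +1 and each mismatch -1,
    so dot = matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: bits where the signs match
    matches = bin(agree).count("1")    # popcount
    return 2 * matches - n


weights = binarize([0.3, -1.2, 0.0, 2.5])
inputs = binarize([-0.7, -0.1, 0.4, 1.0])
packed_dot = xnor_popcount_dot(pack_bits(weights), pack_bits(inputs), len(weights))
plain_dot = sum(w * x for w, x in zip(weights, inputs))
```

Here `packed_dot` and `plain_dot` agree, which is why multiply-accumulate can be replaced by bitwise operations once both weights and activations are binary.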

Text Augmentation for Language Models in High Error Recognition Scenario

Nov 11, 2020
Karel Beneš, Lukáš Burget

We examine the effect of data augmentation for training of language models for speech recognition. We compare augmentation based on global error statistics with one based on per-word unigram statistics of ASR errors and observe that it is better to pay attention only to the global substitution, deletion and insertion rates. This simple scheme also performs consistently better than label smoothing and its sampled variants. Additionally, we investigate the behavior of perplexity estimated on augmented data, but conclude that it gives no better prediction of the final error rate. Our best augmentation scheme increases the WER improvement from second-pass rescoring from 1.1 % to 1.9 % absolute on the CHiME-6 challenge.
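The abstract does not spell out the sampling procedure, so the following is only a rough sketch of what augmentation driven by global substitution, deletion and insertion rates could look like. The function `augment`, its parameters, and the uniform choice of replacement words are all assumptions made for illustration, not the authors' method.

```python
import random


def augment(tokens, vocab, p_sub, p_del, p_ins, rng):
    """Corrupt a token sequence using global (word-independent) error rates.

    Assumption: replacement and inserted words are drawn uniformly from
    `vocab`, so a substitution may occasionally pick the original word.
    """
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_del:
            pass                           # deletion: drop the token
        elif r < p_del + p_sub:
            out.append(rng.choice(vocab))  # substitution
        else:
            out.append(tok)                # keep the token unchanged
        if rng.random() < p_ins:
            out.append(rng.choice(vocab))  # insertion after this position
    return out


rng = random.Random(0)
vocab = ["the", "a", "cat", "dog", "sat"]
sentence = ["the", "cat", "sat"]
noisy = augment(sentence, vocab, p_sub=0.1, p_del=0.05, p_ins=0.05, rng=rng)
```

The three rates here are placeholder values; in practice they would be estimated from the recognizer's error statistics on held-out data.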

Textual Supervision for Visually Grounded Spoken Language Understanding

Oct 07, 2020
Bertrand Higy, Desmond Elliott, Grzegorz Chrupała

Visually-grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available. With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions but more data is needed to obtain similar results.

* Findings of EMNLP 2020 

Fuzzy Gesture Expression Model for an Interactive and Safe Robot Partner

Sep 26, 2019
Alexis Stoven-Dubois, Janos Botzheim, Naoyuki Kubota

Interaction with a robot partner requires many elements, including not only speech but also embodiment. Thus, gestural and facial expressions are important for communication. Furthermore, understanding human movements is essential for safe and natural interchange. This paper proposes an interactive fuzzy emotional model for the robot partner's gesture expression, following its facial emotional model. First, we describe the physical interaction between the user and the robot partner. Next, we propose a kinematic model for the robot partner based on the Denavit-Hartenberg convention and solve the inverse kinematic transformation using the Bacterial Memetic Algorithm. Then, the emotional model along with its interactivity with the user is discussed. Finally, we show experimental results of the proposed model.

* Journal of Network Intelligence Vol. 1 Number 4 (2016) pgs. 119-129 
* 11 pages, 8 figures, accepted for publication in Journal of Network Intelligence 

NullaNet: Training Deep Neural Networks for Reduced-Memory-Access Inference

Aug 27, 2018
Mahdi Nazemi, Ghasem Pasandi, Massoud Pedram

Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. However, the computational and storage complexity of these models has forced the majority of computations to be performed on high-end computing platforms or in the cloud. To cope with this complexity, this paper presents a training method that enables a radically different approach to the realization of deep neural networks through Boolean logic minimization. This realization completely removes the energy-hungry step of accessing memory to obtain model parameters, consumes about two orders of magnitude fewer computing resources than realizations that use floating-point operations, and has a substantially lower latency.

Attention-Based Models for Text-Dependent Speaker Verification

Jan 31, 2018
F A Rezaur Rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, Li Wan

Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that spans the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies of the attention layer and their variants, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.
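As an illustration of sequence summarization with attention, the sketch below collapses a sequence of per-frame vectors into one fixed-size vector using softmax-normalized weights. It is a generic example under assumed inputs: the scores are passed in directly, whereas the paper explores several learned scoring functions and pooling variants that the abstract does not detail. The names `softmax` and `attention_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, scores):
    """Weighted average of frame vectors, with weights softmax(scores).

    frames: list of equal-length vectors (e.g. LSTM outputs per time step)
    scores: one scalar per frame (assumed to come from a learned scorer)
    """
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


frames = [[1.0, 2.0], [3.0, 4.0]]
uniform = attention_pool(frames, [0.0, 0.0])   # equal scores: plain average
focused = attention_pool(frames, [0.0, 50.0])  # dominated by the last frame
```

With equal scores the result reduces to plain averaging of the frames, so a non-attention baseline that averages its outputs is a special case of this pooling.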

* Submitted to ICASSP 2018 

Limitations of Cross-Lingual Learning from Image Search

Sep 18, 2017
Mareike Hartmann, Anders Soegaard

Cross-lingual representation learning is an important step in making NLP scale to all the world's languages. Recent work on bilingual lexicon induction suggests that it is possible to learn cross-lingual representations of words based on similarities between images associated with these words. However, that work focused on the translation of selected nouns only. In our work, we investigate whether the meaning of other parts of speech, in particular adjectives and verbs, can be learned in the same way. We also experiment with combining the representations learned from visual data with embeddings learned from textual data. Our experiments across five language pairs indicate that previous work does not scale to the problem of learning cross-lingual representations beyond simple nouns.

Unsupervised Submodular Rank Aggregation on Score-based Permutations

Sep 06, 2017
Jun Qi, Xu Liu, Javier Tejedor, Shunsuke Kamijo

Unsupervised rank aggregation on score-based permutations, which is widely used in many applications, has not yet been deeply explored. This work studies the use of submodular optimization for rank aggregation on score-based permutations in an unsupervised way. Specifically, we propose an unsupervised approach based on the Lovász-Bregman divergence for setting up linear structured convex and nested structured concave objective functions. In addition, stochastic optimization methods are applied in the training process and efficient algorithms for inference can be guaranteed. The experimental results from Information Retrieval, Combining Distributed Neural Networks, Influencers in Social Networks, and Distributed Automatic Speech Recognition tasks demonstrate the effectiveness of the proposed methods.

Dynamic Bernoulli Embeddings for Language Evolution

Mar 23, 2017
Maja Rudolph, David Blei

Word embeddings are a powerful approach for unsupervised analysis of language. Recently, Rudolph et al. (2016) developed exponential family embeddings, which cast word embeddings in a probabilistic framework. Here, we develop dynamic embeddings, building on exponential family embeddings to capture how the meanings of words change over time. We use dynamic embeddings to analyze three large collections of historical texts: the U.S. Senate speeches from 1858 to 2009, the history of computer science ACM abstracts from 1951 to 2014, and machine learning papers on arXiv from 2007 to 2015. We find that dynamic embeddings provide better fits than classical embeddings and capture interesting patterns about how language changes.

Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model

Jan 22, 2016
Shi Feng, Shujie Liu, Mu Li, Ming Zhou

Neural machine translation has shown very promising results lately. Most NMT models follow the encoder-decoder framework. To make encoder-decoder models more flexible, the attention mechanism was introduced to machine translation and to other tasks such as speech recognition and image captioning. We observe that the quality of translation by an attention-based encoder-decoder can be significantly damaged when the alignment is incorrect. We attribute these problems to the lack of distortion and fertility models. Aiming to resolve these problems, we propose new variations of the attention-based encoder-decoder and compare them with other models on machine translation. Our proposed method achieved an improvement of 2 BLEU points over the original attention-based encoder-decoder.

* 11 pages, updated details 