Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings

Oct 01, 2019
Myunghun Jung, Hyungjun Lim, Jahyun Goo, Youngmoon Jung, Hoirin Kim

Acoustic word embeddings --- fixed-dimensional vector representations of arbitrary-length words --- have attracted increasing interest in query-by-example spoken term detection. Recently, on the fact that the orthography of text labels partly reflects the phonetic similarity between the words' pronunciation, a multi-view approach has been introduced that jointly learns acoustic and text embeddings. It showed that it is possible to learn discriminative embeddings by designing the objective which takes text labels as well as word segments. In this paper, we propose a network architecture that expands the multi-view approach by combining the Siamese multi-view encoders with a shared decoder network to maximize the effect of the relationship between acoustic and text embeddings in embedding space. Discriminatively trained with multi-view triplet loss and decoding loss, our proposed approach achieves better performance on acoustic word discrimination task with the WSJ dataset, resulting in 11.1% relative improvement in average precision. We also present experimental results on cross-view word discrimination and word level speech recognition tasks.

* Accepted at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) 

  Access Paper or Ask Questions

Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media

Jun 10, 2019
Gustavo Aguilar, A. Pastor López-Monroy, Fabio A. González, Thamar Solorio

Recognizing named entities in a document is a key task in many NLP applications. Although current state-of-the-art approaches to this task reach a high performance on clean text (e.g. newswire genres), those algorithms dramatically degrade when they are moved to noisy environments such as social media domains. We present two systems that address the challenges of processing social media data using character-level phonetics and phonology, word embeddings, and Part-of-Speech tags as features. The first model is a multitask end-to-end Bidirectional Long Short-Term Memory (BLSTM)-Conditional Random Field (CRF) network whose output layer contains two CRF classifiers. The second model uses a multitask BLSTM network as feature extractor that transfers the learning to a CRF classifier for the final prediction. Our systems outperform the current F1 scores of the state of the art on the Workshop on Noisy User-generated Text 2017 dataset by 2.45% and 3.69%, establishing a more suitable approach for social media environments.

* Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1401-1412 
* NAACL 2018 

  Access Paper or Ask Questions

Text-based Editing of Talking-head Video

Jun 04, 2019
Ohad Fried, Ayush Tewari, Michael Zollhöfer, Adam Finkelstein, Eli Shechtman, Dan B Goldman, Kyle Genova, Zeyu Jin, Christian Theobalt, Maneesh Agrawala

Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.

* A version with higher resolution images can be downloaded from the authors' website 

  Access Paper or Ask Questions

Face Video Generation from a Single Image and Landmarks

Apr 25, 2019
Kritaphat Songsri-in, Stefanos Zafeiriou

In this paper we are concerned with the challenging problem of producing a full image sequence of a deformable face given only an image and generic facial motions encoded by a set of sparse landmarks. To this end we build upon recent breakthroughs in image-to-image translation such as pix2pix, CycleGAN and StarGAN which learn Deep Convolutional Neural Networks (DCNNs) that learn to map aligned pairs or images between different domains (i.e., having different labels) and propose a new architecture which is not driven any more by labels but by spatial maps, facial landmarks. In particular, we propose the MotionGAN which transforms an input face image into a new one according to a heatmap of target landmarks. We show that it is possible to create very realistic face videos using a single image and a set of target landmarks. Furthermore, our method can be used to edit a facial image with arbitrary motions according to landmarks (e.g., expression, speech, etc.). This provides much more flexibility to face editing, expression transfer, facial video creation, etc. than models based on discrete expressions, audios or action units.

  Access Paper or Ask Questions

Discovery of Important Subsequences in Electrocardiogram Beats Using the Nearest Neighbour Algorithm

Jan 26, 2019
Ricards Marcinkevics, Steven Kelk, Carlo Galuzzi, Berthold Stegemann

The classification of time series data is a well-studied problem with numerous practical applications, such as medical diagnosis and speech recognition. A popular and effective approach is to classify new time series in the same way as their nearest neighbours, whereby proximity is defined using Dynamic Time Warping (DTW) distance, a measure analogous to sequence alignment in bioinformatics. However, practitioners are not only interested in accurate classification, they are also interested in why a time series is classified a certain way. To this end, we introduce here the problem of finding a minimum length subsequence of a time series, the removal of which changes the outcome of the classification under the nearest neighbour algorithm with DTW distance. Informally, such a subsequence is expected to be relevant for the classification and can be helpful for practitioners in interpreting the outcome. We describe a simple but optimized implementation for detecting these subsequences and define an accompanying measure to quantify the relevance of every time point in the time series for the classification. In tests on electrocardiogram data we show that the algorithm allows discovery of important subsequences and can be helpful in detecting abnormalities in cardiac rhythms distinguishing sick from healthy patients.

  Access Paper or Ask Questions

Computing Word Classes Using Spectral Clustering

Aug 16, 2018
Effi Levi, Saggy Herman, Ari Rappoport

Clustering a lexicon of words is a well-studied problem in natural language processing (NLP). Word clusters are used to deal with sparse data in statistical language processing, as well as features for solving various NLP tasks (text categorization, question answering, named entity recognition and others). Spectral clustering is a widely used technique in the field of image processing and speech recognition. However, it has scarcely been explored in the context of NLP; specifically, the method used in this (Meila and Shi, 2001) has never been used to cluster a general word lexicon. We apply spectral clustering to a lexicon of words, evaluating the resulting clusters by using them as features for solving two classical NLP tasks: semantic role labeling and dependency parsing. We compare performance with Brown clustering, a widely-used technique for word clustering, as well as with other clustering methods. We show that spectral clusters produce similar results to Brown clusters, and outperform other clustering methods. In addition, we quantify the overlap between spectral and Brown clusters, showing that each model captures some information which is uncaptured by the other.

  Access Paper or Ask Questions

OntoSenseNet: A Verb-Centric Ontological Resource for Indian Languages

Aug 02, 2018
Jyoti Jha, Sreekavitha Parupalli, Navjyoti Singh

Following approaches for understanding lexical meaning developed by Yaska, Patanjali and Bhartrihari from Indian linguistic traditions and extending approaches developed by Leibniz and Brentano in the modern times, a framework of formal ontology of language was developed. This framework proposes that meaning of words are in-formed by intrinsic and extrinsic ontological structures. The paper aims to capture such intrinsic and extrinsic meanings of words for two major Indian languages, namely, Hindi and Telugu. Parts-of-speech have been rendered into sense-types and sense-classes. Using them we have developed a gold- standard annotated lexical resource to support semantic understanding of a language. The resource has collection of Hindi and Telugu lexicons, which has been manually annotated by native speakers of the languages following our annotation guidelines. Further, the resource was utilised to derive adverbial sense-class distribution of verbs and karaka-verb sense- type distribution. Different corpora (news, novels) were compared using verb sense-types distribution. Word Embedding was used as an aid for the enrichment of the resource. This is a work in progress that aims at lexical coverage of language extensively.

* 14 pages, 8 tables, 2 figures 

  Access Paper or Ask Questions

From Nodes to Networks: Evolving Recurrent Neural Networks

Jun 07, 2018
Aditya Rawal, Risto Miikkulainen

Gated recurrent networks such as those composed of Long Short-Term Memory (LSTM) nodes have recently been used to improve state of the art in many sequential processing tasks such as speech recognition and machine translation. However, the basic structure of the LSTM node is essentially the same as when it was first conceived 25 years ago. Recently, evolutionary and reinforcement learning mechanisms have been employed to create new variations of this structure. This paper proposes a new method, evolution of a tree-based encoding of the gated memory nodes, and shows that it makes it possible to explore new variations more effectively than other methods. The method discovers nodes with multiple recurrent paths and multiple memory cells, which lead to significant improvement in the standard language modeling benchmark task. The paper also shows how the search process can be speeded up by training an LSTM network to estimate performance of candidate structures, and by encouraging exploration of novel solutions. Thus, evolutionary design of complex neural network structures promises to improve performance of deep learning architectures beyond human ability to do so.

  Access Paper or Ask Questions

A Non-Technical Survey on Deep Convolutional Neural Network Architectures

Mar 06, 2018
Felix Altenberger, Claus Lenz

Artificial neural networks have recently shown great results in many disciplines and a variety of applications, including natural language understanding, speech processing, games and image data generation. One particular application in which the strong performance of artificial neural networks was demonstrated is the recognition of objects in images, where deep convolutional neural networks are commonly applied. In this survey, we give a comprehensive introduction to this topic (object recognition with deep convolutional neural networks), with a strong focus on the evolution of network architectures. Therefore, we aim to compress the most important concepts in this field in a simple and non-technical manner to allow for future researchers to have a quick general understanding. This work is structured as follows: 1. We will explain the basic ideas of (convolutional) neural networks and deep learning and examine their usage for three object recognition tasks: image classification, object localization and object detection. 2. We give a review on the evolution of deep convolutional neural networks by providing an extensive overview of the most important network architectures presented in chronological order of their appearances.

* 17 pages (incl. references), 23 Postscript figures, uses IEEEtran 

  Access Paper or Ask Questions

Optimization Methods for Convolutional Sparse Coding

Jun 10, 2014
Hilton Bristow, Simon Lucey

Sparse and convolutional constraints form a natural prior for many optimization problems that arise from physical processes. Detecting motifs in speech and musical passages, super-resolving images, compressing videos, and reconstructing harmonic motions can all leverage redundancies introduced by convolution. Solving problems involving sparse and convolutional constraints remains a difficult computational problem, however. In this paper we present an overview of convolutional sparse coding in a consistent framework. The objective involves iteratively optimizing a convolutional least-squares term for the basis functions, followed by an L1-regularized least squares term for the sparse coefficients. We discuss a range of optimization methods for solving the convolutional sparse coding objective, and the properties that make each method suitable for different applications. In particular, we concentrate on computational complexity, speed to {\epsilon} convergence, memory usage, and the effect of implied boundary conditions. We present a broad suite of examples covering different signal and application domains to illustrate the general applicability of convolutional sparse coding, and the efficacy of the available optimization methods.

  Access Paper or Ask Questions