Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries

Mar 08, 2015
Evan R. Sparks, Ameet Talwalkar, Michael J. Franklin, Michael I. Jordan, Tim Kraska

The proliferation of massive datasets combined with the development of sophisticated analytical techniques have enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs). A major obstacle to supporting PAQs is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single node implementations and have assumed that model training itself is a black box, thus limiting the effectiveness of such approaches on large-scale problems. In this work, we build upon these recent efforts and propose an integrated PAQ planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The result is TuPAQ, a component of the MLbase system, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.

  Access Paper or Ask Questions

Centrality-as-Relevance: Support Sets and Similarity as Geometric Proximity

Jan 16, 2014
Ricardo Ribeiro, David Martins de Matos

In automatic summarization, centrality-as-relevance means that the most important content of an information source, or a collection of information sources, corresponds to the most central passages, considering a representation where such notion makes sense (graph, spatial, etc.). We assess the main paradigms, and introduce a new centrality-based relevance model for automatic summarization that relies on the use of support sets to better estimate the relevant content. Geometric proximity is used to compute semantic relatedness. Centrality (relevance) is determined by considering the whole input source (and not only local information), and by taking into account the existence of minor topics or lateral subjects in the information sources to be summarized. The method consists in creating, for each passage of the input source, a support set consisting only of the most semantically related passages. Then, the determination of the most relevant content is achieved by selecting the passages that occur in the largest number of support sets. This model produces extractive summaries that are generic, and language- and domain-independent. Thorough automatic evaluation shows that the method achieves state-of-the-art performance, both in written text, and automatically transcribed speech summarization, including when compared to considerably more complex approaches.

* Journal Of Artificial Intelligence Research, Volume 42, pages 275-308, 2011 

  Access Paper or Ask Questions

Predicting electrode array impedance after one month from cochlear implantation surgery

May 20, 2022
Yousef A. Alohali, Yassin Abdelsamad, Tamer Mesallam, Fida Almuhawas, Abdulrahman Hagr, Mahmoud S. Fayed

Sensorineural hearing loss can be treated using Cochlear implantation. After this surgery using the electrode array impedance measurements, we can check the stability of the impedance value and the dynamic range. Deterioration of speech recognition scores could happen because of increased impedance values. Medicines used to do these measures many times during a year after the surgery. Predicting the electrode impedance could help in taking decisions to help the patient get better hearing. In this research we used a dataset of 80 patients of children who did cochlear implantation using MED-EL FLEX28 electrode array of 12 channels. We predicted the electrode impedance on each channel after 1 month from the surgery date. We used different machine learning algorithms like neural networks and decision trees. Our results indicates that the electrode impedance can be predicted, and the best algorithm is different based on the electrode channel. Also, the accuracy level varies between 66% and 100% based on the electrode channel when accepting an error range between 0 and 3 KO. Further research is required to predict the electrode impedance after three months, six months and one year.

  Access Paper or Ask Questions

Inferring Pitch from Coarse Spectral Features

Apr 10, 2022
Danni Ma, Neville Ryant, Mark Liberman

Fundamental frequency (F0) has long been treated as the physical definition of "pitch" in phonetic analysis. But there have been many demonstrations that F0 is at best an approximation to pitch, both in production and in perception: pitch is not F0, and F0 is not pitch. Changes in the pitch involve many articulatory and acoustic covariates; pitch perception often deviates from what F0 analysis predicts; and in fact, quasi-periodic signals from a single voice source are often incompletely characterized by an attempt to define a single time-varying F0. In this paper, we find strong support for the existence of covariates for pitch in aspects of relatively coarse spectra, in which an overtone series is not available. Thus linear regression can predict the pitch of simple vocalizations, produced by an articulatory synthesizer or by human, from single frames of such coarse spectra. Across speakers, and in more complex vocalizations, our experiments indicate that the covariates are not quite so simple, though apparently still available for more sophisticated modeling. On this basis, we propose that the field needs a better way of thinking about speech pitch, just as celestial mechanics requires us to go beyond Newton's point mass approximations to heavenly bodies.

* Submitted to Interspeech 2022 

  Access Paper or Ask Questions

Isometric MT: Neural Machine Translation for Automatic Dubbing

Dec 20, 2021
Surafel M. Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico

Automatic dubbing (AD) is among the use cases where translations should fit a given length template in order to achieve synchronicity between source and target speech. For neural machine translation (MT), generating translations of length close to the source length (e.g. within +-10% in character count), while preserving quality is a challenging task. Controlling NMT output length comes at a cost to translation quality which is usually mitigated with a two step approach of generation of n-best hypotheses and then re-ranking them based on length and quality. This work, introduces a self-learning approach that allows a transformer model to directly learn to generate outputs that closely match the source length, in short isometric MT. In particular, our approach for isometric MT does not require to generate multiple hypotheses nor any auxiliary scoring function. We report results on four language pairs (English - French, Italian, German, Spanish) with a publicly available benchmark based on TED Talk data. Both automatic and manual evaluations show that our self-learning approach to performs on par with more complex isometric MT approaches.

* Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022 

  Access Paper or Ask Questions

Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

Oct 06, 2021
Jen-Hao Rick Chang, Ashish Shrivastava, Hema Swetha Koppula, Xiaoshuai Zhang, Oncel Tuzel

Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, typical training algorithms for these controllable sequence generative models suffer from the training-inference mismatch, where the same sample is used as content and style input during training but different samples are given during inference. In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. By introducing a style transformation module that we call style equalization, we enable training using different content and style samples and thereby mitigate the training-inference mismatch. To demonstrate its generality, we applied style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. Our models achieve state-of-the-art style replication with a similar mean style opinion score as the real data. Moreover, the proposed method enables style interpolation between sequences and generates novel styles.

  Access Paper or Ask Questions

Learner to learner fuzzy profiles similarity using a hybrid interaction analysis grid

Oct 01, 2021
Chabane Khentout, Khadidja Harbouche, Mahieddine Djoudi

The analysis of remote discussions is not yet at the same level as the face-to-face ones. The present paper aspires twofold. On the one hand, it attempts to establish a suitable environment of interaction and collaboration among learners by using the speech acts via a semi structured synchronous communication tool. On the other, it aims to define behavioral profiles and interpersonal skills hybrid grid by matching the BALES' IPA and PLETY's analysis system. By applying the fuzzy logic, we formalize human reasoning and, thus, giving very appreciable flexibility to the reasoning that use it, which makes it possible to take into account imprecisions and uncertainties. In addition, the educational data mining techniques are used to optimize the mapping of behaviors to learner's profile, with similarity-based clustering, using Eros and PCA measures. In order to show the validity of our system, we performed an experiment on real-world data. The results show, among others: (1) the usefulness of fuzzy logic to properly translate the profile text descriptions into a mathematical format, (2) an irregularity in the behavior of the learners, (3) the correlation between the profiles, (4) the superiority of Eros method to the PCA factor in precision.

* Revue des Sciences et Technologies de l'Information - S{\'e}rie ISI : Ing{\'e}nierie des Syst{\`e}mes d'Information, Lavoisier, 2021, 26 (4), pp.375-386 

  Access Paper or Ask Questions

Integrated Training for Sequence-to-Sequence Models Using Non-Autoregressive Transformer

Sep 27, 2021
Evgeniia Tokarchuk, Jan Rosendahl, Weiyue Wang, Pavel Petrushkov, Tomer Lancewicki, Shahram Khadivi, Hermann Ney

Complex natural language applications such as speech translation or pivot translation traditionally rely on cascaded models. However, cascaded models are known to be prone to error propagation and model discrepancy problems. Furthermore, there is no possibility of using end-to-end training data in conventional cascaded systems, meaning that the training data most suited for the task cannot be used. Previous studies suggested several approaches for integrated end-to-end training to overcome those problems, however they mostly rely on (synthetic or natural) three-way data. We propose a cascaded model based on the non-autoregressive Transformer that enables end-to-end training without the need for an explicit intermediate representation. This new architecture (i) avoids unnecessary early decisions that can cause errors which are then propagated throughout the cascaded models and (ii) utilizes the end-to-end training data directly. We conduct an evaluation on two pivot-based machine translation tasks, namely French-German and German-Czech. Our experimental results show that the proposed architecture yields an improvement of more than 2 BLEU for French-German over the cascaded baseline.

* IWSLT 2021 camera-ready 

  Access Paper or Ask Questions

How much pretraining data do language models need to learn syntax?

Sep 09, 2021
Laura Pérez-Mayos, Miguel Ballesteros, Leo Wanner

Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not always offer a better performance across the different syntactic phenomena and come at a higher financial and environmental cost.

* To be published in proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) 

  Access Paper or Ask Questions

ELIT: Emory Language and Information Toolkit

Sep 08, 2021
Han He, Liyan Xu, Jinho D. Choi

We introduce ELIT, the Emory Language and Information Toolkit, which is a comprehensive NLP framework providing transformer-based end-to-end models for core tasks with a special focus on memory efficiency while maintaining state-of-the-art accuracy and speed. Compared to existing toolkits, ELIT features an efficient Multi-Task Learning (MTL) model with many downstream tasks that include lemmatization, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic role labeling, and AMR parsing. The backbone of ELIT's MTL framework is a pre-trained transformer encoder that is shared across tasks to speed up their inference. ELIT provides pre-trained models developed on a remix of eight datasets. To scale up its service, ELIT also integrates a RESTful Client/Server combination. On the server side, ELIT extends its functionality to cover other tasks such as tokenization and coreference resolution, providing an end user with agile research experience. All resources including the source codes, documentation, and pre-trained models are publicly available at

  Access Paper or Ask Questions