"speech": models, code, and papers

HEAR 2021: Holistic Evaluation of Audio Representations

Mar 06, 2022
Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.

* to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track 

Detecting Anomalies within Time Series using Local Neural Transformations

Feb 08, 2022
Tim Schneider, Chen Qiu, Marius Kloft, Decky Aspandi Latif, Steffen Staab, Stephan Mandt, Maja Rudolph

We develop a new method to detect anomalies within time series, which is essential in many application domains, ranging from self-driving cars, finance, and marketing to medical diagnosis and epidemiology. The method is based on self-supervised deep learning, which has played a key role in facilitating deep anomaly detection on images, where powerful image transformations are available. However, such transformations are widely unavailable for time series. Addressing this, we develop Local Neural Transformations (LNT), a method that learns local transformations of time series from data. The method produces an anomaly score for each time step and thus can be used to detect anomalies within time series. We prove in a theoretical analysis that our novel training objective is more suitable for transformation learning than previous deep anomaly detection (AD) methods. Our experiments demonstrate that LNT can find anomalies in speech segments from the LibriSpeech data set and better detect interruptions to cyber-physical systems than previous work. Visualization of the learned transformations gives insight into the type of transformations that LNT learns.

Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Oct 12, 2021
Aida Mostafazadeh Davani, Mark Díaz, Vinodkumar Prabhakaran

Majority voting and averaging are common approaches employed to resolve annotator disagreements and derive single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in the case of subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored while aggregating annotations to a single ground truth. In order to address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task based approach treats predicting each annotator's judgements as separate subtasks, while sharing a common learned representation of the task. We show that this approach yields the same or better performance than aggregating labels in the data prior to training across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate correlates better with annotation disagreements than traditional methods. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.
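
The contrast between aggregating labels and keeping per-annotator information can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and using the spread of per-annotator output probabilities as the uncertainty estimate is an assumption made here for illustration only.

```python
from collections import Counter
from statistics import pstdev


def majority_vote(labels):
    """Aggregate binary labels into one label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def disagreement(labels):
    """Fraction of annotators whose label differs from the majority label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1 - top_count / len(labels)


def predictive_uncertainty(head_probs):
    """Assumed uncertainty proxy: spread of the per-annotator head probabilities."""
    return pstdev(head_probs)


# Hypothetical annotations for one item from three annotators.
labels = [1, 0, 1]
print(majority_vote(labels))   # single aggregated label
print(disagreement(labels))    # information lost by the aggregation

# Hypothetical outputs of three annotator-specific heads for two items.
print(predictive_uncertainty([0.9, 0.1, 0.8]))    # heads disagree
print(predictive_uncertainty([0.9, 0.85, 0.95]))  # heads agree
```

Majority voting keeps only the first printed value; a multi-annotator model keeps one prediction per annotator, so a disagreement signal remains available at prediction time.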

With One Voice: Composing a Travel Voice Assistant from Re-purposed Models

Aug 04, 2021
Shachaf Poran, Gil Amsalem, Amit Beka, Dmitri Goldenberg

Voice assistants provide users a new way of interacting with digital products, allowing them to retrieve information and complete tasks with an increased sense of control and flexibility. Such products are composed of several machine learning models, like Speech-to-Text transcription, Named Entity Recognition and Resolution, and Text Classification. Building a voice assistant from scratch takes the prolonged efforts of several teams constructing numerous models and orchestrating between components. Alternatives such as using third-party vendors or re-purposing existing models may be considered to shorten time-to-market and reduce development costs. However, each option has its benefits and drawbacks. We present key insights from building a voice search assistant for a search and recommendation system. Our paper compares the achieved performance and development efforts in dedicated tailor-made solutions against existing re-purposed models. We share and discuss our data-driven decisions about implementation trade-offs and their estimated outcomes in hindsight, showing that a fully functional machine learning product can be built from existing models.

* 2nd International Workshop on Industrial Recommendation Systems @ KDD 2021 

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

May 31, 2021
Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali

With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR system using a self-attention-based Conformer architecture. We trained the system using Arabic (Ar), English (En) and French (Fr). We evaluate the system's performance on: (i) monolingual (Ar, En and Fr); (ii) multi-dialectal (Modern Standard Arabic, along with dialectal variation such as Egyptian and Moroccan); (iii) code-switching -- cross-lingual (Ar-En/Fr) and dialectal (MSA-Egyptian dialect) test cases, and compare with current state-of-the-art systems. Furthermore, we investigate the influence of different embedding/character representations, including character vs word-piece and shared vs distinct input symbols per language. Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.

* Submitted to INTERSPEECH 2021, Multilingual ASR, Multi-dialectal ASR, Code-Switching ASR, Arabic ASR, Conformer, Transformer, E2E ASR, Speech Recognition, ASR, Arabic, English, French 

Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network

Apr 22, 2021
Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, Hannes Heikinheimo

Adaptation of end-to-end speech recognition systems to new tasks is known to be challenging. A number of solutions have been proposed which apply external language models with various fusion methods, possibly in combination with two-pass decoding. TTS systems have also been used to generate adaptation data for the end-to-end models. In this paper we show that RNN-transducer models can be effectively adapted to new domains using only small amounts of textual data. By taking advantage of the model's inherent structure, where the prediction network is interpreted as a language model, we can apply fast adaptation to the model. Adapting the model avoids the need for complicated decoding-time fusions and external language models. Using appropriate regularization, the prediction network can be adapted to new domains while still retaining good generalization capabilities. We show with multiple ASR evaluation tasks how this method can provide relative gains of 10-45% in target task WER. We also share insights into how the RNN-transducer prediction network performs as a language model.
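
For readers unfamiliar with how a relative gain in word error rate (WER) is computed, a minimal sketch follows. The function name and the WER values are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def relative_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction: the share of the baseline error that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical values: a baseline WER of 20.0% lowered to 15.0% by adaptation.
gain = relative_reduction(20.0, 15.0)
print(f"relative WER reduction: {gain:.0%}")
```

A relative gain is distinct from the absolute difference in WER points: in the hypothetical case above the absolute drop is 5 points, while the relative reduction is one quarter of the baseline error.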

* 5 pages, 2 figures. Submitted to Interspeech 2021 

Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept

Apr 13, 2021
Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like encoder-decoder attention models, transducer models and segmental models (direct HMM). While transducer models stay with a frame-level model definition, segmental models are defined directly on the level of label segments. While (soft-)attention-based models avoid explicit alignment, transducer and segmental approaches do model alignment internally, either by segment hypotheses or, more implicitly, by emitting so-called blank symbols. In this work, we prove that the widely used class of RNN-Transducer models and segmental models (direct HMM) are equivalent and therefore show equal modeling power. It is shown that blank probabilities translate into segment length probabilities and vice versa. In addition, we provide initial experiments investigating decoding and beam-pruning, comparing time-synchronous and label-/segment-synchronous search strategies and their properties using the same underlying model.
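
The statement that blank probabilities translate into segment length probabilities and vice versa can be illustrated with a simplified sketch. It assumes a strictly monotonic topology in which each frame emits either blank or exactly one label, and it treats the blank probabilities within one segment as given numbers; the paper's actual construction may differ, and the function names are hypothetical.

```python
def blank_to_length_probs(blank_probs):
    """Map per-frame blank probabilities to segment length probabilities.

    Entry k of the result is P(length = k + 1): blank on the first k frames,
    then a label on frame k. Also returns the mass left for longer segments.
    """
    length_probs = []
    survive = 1.0  # probability that no label has been emitted so far
    for b in blank_probs:
        length_probs.append(survive * (1.0 - b))
        survive *= b
    return length_probs, survive


def length_to_blank_probs(length_probs):
    """Inverse direction: recover per-frame blank probabilities from lengths."""
    blank_probs = []
    survive = 1.0
    for p in length_probs:
        # Probability of ending the segment here, given it has not ended yet.
        hazard = p / survive if survive > 0 else 1.0
        blank_probs.append(1.0 - hazard)
        survive -= p
    return blank_probs


# Hypothetical blank probabilities for the three frames of one segment.
blanks = [0.5, 0.25, 0.0]
lengths, tail = blank_to_length_probs(blanks)
print(lengths, tail)
print(length_to_blank_probs(lengths))  # round trip back to the blank probabilities
```

Under these assumptions the two parameterizations carry the same information: a product of blank probabilities followed by one label emission defines a length distribution, and the conditional probability of ending a segment at a given frame recovers the blank probability.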

* Submitted to Interspeech 2021 

Semantic Parsing to Manipulate Relational Database For a Management System

Feb 18, 2021
Muhammad Hamzah Mushtaq

Chatbots and AI assistants have become important in today's life. The main reason for adopting this technology is to connect with users, understand their requirements, and fulfill them. This has been achieved, but at the cost of large amounts of training data and complex learning models. This work proposes a simple algorithm, a model that can be implemented in different fields, each with its own work scope. The proposed model converts human-language text to computer-understandable SQL queries. The model requires only data related to the specific field, saving data space. The model performs linear computation, hence addressing the computational complexity. This work also defines the stages where a new methodology is implemented and which previous method was adopted to fulfill the requirement at each stage. Two datasets available online will be used in this work: the ATIS dataset and WikiSQL. This work compares the computation time and the accuracy across the two datasets. The paper builds on basic natural language processing tasks such as semantic parsing, NER, and part-of-speech tagging, and aims to achieve results through these simple methods.

* 5 pages. Figures, methodology and comparisons included 

Automated Video Labelling: Identifying Faces by Corroborative Evidence

Feb 10, 2021
Andrew Brown, Ernesto Coto, Andrew Zisserman

We present a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities (visual and audio). We target the problem of ever-growing online video archives, where an effective, scalable indexing solution cannot require a user to provide manual annotation or supervision. To this end, we make three key contributions: (1) We provide a novel, simple method for determining if a person is famous or not using image-search engines. In turn, this enables a face-identity model to be built reliably and robustly, and used for high-precision automatic labelling; (2) We show that even for less-famous people, image-search engines can then be used for corroborative evidence to accurately label faces that are named in the scene or the speech; (3) Finally, we quantitatively demonstrate the benefits of our approach on different video domains and test settings, such as TV shows and news broadcasts. Our method works across three disparate datasets without any explicit domain adaptation, and sets new state-of-the-art results on all the public benchmarks.

* IEEE 4th International Conference on Multimedia Information Processing and Retrieval (IEEE MIPR 2021) 

DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning

Dec 12, 2020
Mufan Sang, Wei Xia, John H. L. Hansen

Although speaker verification has achieved significant performance improvement with the development of deep neural networks, domain mismatch is still a challenging problem in this field. In this study, we propose a novel framework to disentangle speaker-related and domain-specific features and apply domain adaptation solely on the speaker-related feature space. Compared with performing domain adaptation directly on a feature space where domain information is not removed, using disentanglement can efficiently boost adaptation performance. To be specific, our model's input speech from the source and target domains is first encoded into different latent feature spaces. The adversarial domain adaptation is conducted on the shared speaker-related feature space to encourage the property of domain-invariance. Further, we minimize the mutual information between speaker-related and domain-specific features for both domains to enforce the disentanglement. Experimental results on the VOiCES dataset demonstrate that our proposed framework can effectively generate more speaker-discriminative and domain-invariant speaker representations, with a relative 20.3% reduction of EER compared to the original ResNet-based system.

* Submitted to ICASSP 2021 
