
"speech": models, code, and papers

Synchronising audio and ultrasound by learning cross-modal embeddings

Jul 01, 2019
Aciel Eshky, Manuel Sam Ribeiro, Korin Richmond, Steve Renals

Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to automatically synchronise compared to utterances containing natural variation in speech such as words, sentences, or conversations.
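
As a toy illustration of the offset search described above (not the paper's actual two-stream network; `best_offset`, the embeddings, and the candidate offsets are all hypothetical), each candidate offset can be scored by the average distance between paired cross-modal embeddings, keeping the best:

```python
# Score candidate frame offsets between audio and ultrasound embeddings;
# in practice the embeddings would come from the trained two-stream model.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_offset(audio_emb, ultra_emb, candidates):
    """Return the candidate frame offset minimising mean embedding distance."""
    def score(off):
        pairs = [(audio_emb[i], ultra_emb[i + off])
                 for i in range(len(audio_emb))
                 if 0 <= i + off < len(ultra_emb)]
        return sum(euclidean(a, u) for a, u in pairs) / len(pairs)
    return min(candidates, key=score)

# Toy data: the ultrasound stream is the audio stream delayed by 2 frames.
audio = [[float(i)] for i in range(10)]
ultra = [[-1.0], [-1.0]] + audio[:8]
print(best_offset(audio, ultra, candidates=[-3, -2, -1, 0, 1, 2, 3]))  # → 2
```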

* 5 pages, 1 figure, 4 tables; accepted to Interspeech 2019: the 20th Annual Conference of the International Speech Communication Association (ISCA) 


On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR

Apr 03, 2021
Tsz Kin Lam, Mayumi Ohta, Shigehiko Schamoni, Stefan Riezler

We propose an on-the-fly data augmentation method for automatic speech recognition (ASR) that uses alignment information to generate effective training samples. Our method, called Aligned Data Augmentation (ADA) for ASR, replaces transcribed tokens and the corresponding speech representations in an aligned manner to generate previously unseen training pairs. The speech representations are sampled from an audio dictionary extracted from the training corpus, and they inject speaker variations into the training examples. The transcribed tokens are either predicted by a language model, so that the augmented data pairs are semantically close to the original data, or randomly sampled. Both strategies result in training pairs that improve robustness in ASR training. Our experiments on a Seq-to-Seq architecture show that ADA can be applied on top of SpecAugment, and achieves relative WER improvements of about 9-23% and 4-15% over SpecAugment alone on the LibriSpeech 100h and LibriSpeech 960h test sets, respectively.
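
A minimal sketch of the aligned substitution idea (function and variable names are illustrative, not the authors' code): a transcript token and its aligned speech segment are replaced together, so the augmented pair stays consistent.

```python
import random

def ada_augment(tokens, segments, audio_dict, rng=random):
    """tokens: transcript words; segments: aligned frame lists, one per token;
    audio_dict: word -> candidate frame lists extracted from the corpus."""
    i = rng.randrange(len(tokens))                 # position to replace
    new_word = rng.choice(sorted(audio_dict))      # here: sampled at random
    new_frames = rng.choice(audio_dict[new_word])  # injects speaker variation
    return (tokens[:i] + [new_word] + tokens[i + 1:],
            segments[:i] + [new_frames] + segments[i + 1:])

random.seed(0)
toks, segs = ada_augment(["the", "cat", "sat"],
                         [[0.1], [0.2], [0.3]],
                         {"dog": [[0.9]], "mat": [[0.8]]})
print(toks, segs)
```

In the language-model variant described above, `new_word` would instead be predicted from the surrounding transcript context.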

* Submitted to INTERSPEECH 2021 


Libri-Light: A Benchmark for ASR with Limited or No Supervision

Dec 17, 2019
Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
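
The segmentation pass can be illustrated with a toy energy-based voice-activity detector (purely illustrative; Libri-Light's actual VAD pipeline differs): frames above an energy threshold are grouped into contiguous segments.

```python
def vad_segments(frames, threshold):
    """Return (start, end) index pairs of consecutive frames above threshold."""
    segments, start = [], None
    for i, energy in enumerate(frames):
        if energy >= threshold and start is None:
            start = i
        elif energy < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(frames)))
    return segments

print(vad_segments([0.1, 0.9, 0.8, 0.05, 0.7, 0.6], threshold=0.5))
# → [(1, 3), (4, 6)]
```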


Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models

Nov 01, 2018
Herman Kamper

We investigate unsupervised models that can map a variable-duration speech segment to a fixed-dimensional representation. In settings where unlabelled speech is the only available resource, such acoustic word embeddings can form the basis for "zero-resource" speech search, discovery and indexing systems. Most existing unsupervised embedding methods still use some supervision, such as word or phoneme boundaries. Here we propose the encoder-decoder correspondence autoencoder (EncDec-CAE), which, instead of true word segments, uses automatically discovered segments: an unsupervised term discovery system finds pairs of words of the same unknown type, and the EncDec-CAE is trained to reconstruct one word given the other as input. We compare it to a standard encoder-decoder autoencoder (AE), a variational AE with a prior over its latent embedding, and downsampling. EncDec-CAE outperforms its closest competitor by 24% relative in average precision on two languages in a word discrimination task.
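
How training pairs might be formed from the term-discovery output can be sketched as follows (illustrative; `cae_pairs` is not from the paper's code): each cluster of segments judged to be the same word type yields input/target pairs in both directions, and the correspondence autoencoder reconstructs one segment from the other.

```python
from itertools import permutations

def cae_pairs(clusters):
    """clusters: lists of segment IDs of the same (unknown) word type."""
    pairs = []
    for cluster in clusters:
        pairs.extend(permutations(cluster, 2))   # both directions per pair
    return pairs

print(cae_pairs([["seg1", "seg2"], ["seg3", "seg4"]]))
# → [('seg1', 'seg2'), ('seg2', 'seg1'), ('seg3', 'seg4'), ('seg4', 'seg3')]
```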

* 5 pages, 3 figures, 2 tables 


NL Understanding with a Grammar of Constructions

Jan 17, 1995
Wlodek Zadrozny, Marcin Szummer, Stanislaw Jarecki, David E. Johnson, Leora Morgenstern

We present an approach to natural language understanding based on a computable grammar of constructions. A "construction" consists of a set of features of form and a description of meaning in a context. A grammar is a set of constructions. This kind of grammar is the key element of Mincal, an implemented natural-language, speech-enabled interface to an on-line calendar system. The system consists of an NL grammar, a parser, an on-line calendar, a domain knowledge base (about dates, times and meetings), an application knowledge base (about the calendar), a speech recognizer, a speech generator, and the interfaces between those modules. We claim that this architecture should work in general for spoken interfaces in small domains. In this paper we present two novel aspects of the architecture: (a) the use of constructions, integrating descriptions of form, meaning and context into one whole; and (b) the separation of domain knowledge from application knowledge. We describe the data structures for encoding constructions, the structure of the knowledge bases, and the interactions of the key modules of the system.
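
One possible encoding of a "construction" as characterised above, pairing form features with a context-dependent meaning (field names are illustrative, not Mincal's actual data structures):

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Construction:
    name: str
    form: Dict[str, str]             # features of form
    meaning: Callable[[dict], dict]  # description of meaning in a context

def matches(construction, features):
    """A construction applies when all of its form features are present."""
    return all(features.get(k) == v for k, v in construction.form.items())

schedule = Construction(
    name="schedule-meeting",
    form={"verb": "schedule", "object": "meeting"},
    meaning=lambda ctx: {"action": "add_event", "date": ctx.get("date")},
)
print(matches(schedule, {"verb": "schedule", "object": "meeting"}))  # → True
```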

* appeared in Proc. Coling'94, Kyoto, Japan, 1994; 5 postscript pages; email to [email protected] 


Bridging the prosody GAP: Genetic Algorithm with People to efficiently sample emotional prosody

May 10, 2022
Pol van Rijn, Harin Lee, Nori Jacoby

The human voice effectively communicates a range of emotions with nuanced variations in acoustics. Existing emotional speech corpora are limited in that they are either (a) highly curated to induce specific emotions with predefined categories that may not capture the full extent of emotional experiences, or (b) entangled in their semantic and prosodic cues, limiting the ability to study these cues separately. To overcome this challenge, we propose a new approach called 'Genetic Algorithm with People' (GAP), which integrates human decision and production into a genetic algorithm. In our design, we allow creators and raters to jointly optimize the emotional prosody over generations. We demonstrate that GAP can efficiently sample from the emotional speech space and capture a broad range of emotions, and show comparable results to state-of-the-art emotional speech corpora. GAP is language-independent and supports large crowd-sourcing, thus can support future large-scale cross-cultural research.
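
A toy genetic-algorithm loop in the spirit of GAP (illustrative only): here `rate` is a stand-in fitness function where human raters would sit in the real design, and mutation stands in for human production of new prosody variants.

```python
import random

def evolve(population, rate, generations, sigma=0.15, seed=0):
    """Evolve parameter vectors by rating-based selection plus mutation."""
    rng = random.Random(seed)
    for _ in range(generations):
        ranked = sorted(population, key=rate, reverse=True)
        parents = ranked[: max(1, len(ranked) // 2)]   # selection by rating
        population = [[g + rng.gauss(0, sigma) for g in rng.choice(parents)]
                      for _ in range(len(population))]
    return max(population, key=rate)

# Stand-in rating that prefers a (hypothetical) pitch parameter near 1.0.
best = evolve([[0.0], [0.2], [0.5], [2.0]],
              rate=lambda v: -abs(v[0] - 1.0), generations=40)
print(round(best[0], 2))
```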

* Accepted to CogSci'22 


Prosody Labelled Dataset for Hindi using Semi-Automated Approach

Dec 11, 2021
Esha Banerjee, Atul Kr. Ojha, Girish Nath Jha

This study aims to develop a semi-automatically labelled prosody database for Hindi, for enhancing the intonation component in ASR and TTS systems, which is also helpful for building speech-to-speech machine translation systems. Although no single standard for prosody labelling exists in Hindi, researchers have in the past employed perceptual and statistical methods in the literature to draw inferences about the behaviour of prosody patterns in Hindi. Based on such existing research and largely agreed-upon theories of intonation in Hindi, this study first develops a manually annotated prosodic corpus of Hindi speech data, which is then used to train prediction models for generating automatic prosodic labels. A total of 5,000 sentences (23,500 words) of declarative and interrogative types have been labelled. The accuracy of the trained models for pitch accents, intermediate phrase boundaries and accentual phrase boundaries is 73.40%, 93.20%, and 43% respectively.
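
The reported figures reduce to per-label accuracy against the manual annotations, which can be sketched as (illustrative; the label inventory below is hypothetical):

```python
def label_accuracy(gold, predicted):
    """Percentage of positions where the predicted prosodic label matches."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)

gold = ["H*", "L*", "H*", "H*"]
pred = ["H*", "H*", "H*", "H*"]
print(label_accuracy(gold, pred))  # → 75.0
```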


Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation

Dec 06, 2019
Naveen Arivazhagan, Colin Cherry, Te I, Wolfgang Macherey, Pallavi Baljekar, George Foster

We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them.
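
The re-translation loop can be sketched as follows (`translate` is a stand-in for an MT system, not the authors' pipeline): as the recognised source grows, the full prefix is re-translated from scratch and the displayed caption is replaced wholesale.

```python
def retranslate_stream(source_words, translate):
    """Emit one caption per incoming source word by full re-translation."""
    captions = []
    for t in range(1, len(source_words) + 1):
        prefix = " ".join(source_words[:t])
        captions.append(translate(prefix))   # translated from scratch each step
    return captions

# Toy "translator" that just uppercases, to show the revision behaviour.
caps = retranslate_stream(["guten", "morgen"], translate=str.upper)
print(caps)  # → ['GUTEN', 'GUTEN MORGEN']
```

Each new caption may revise earlier output arbitrarily, which is exactly the instability the paper's heuristics aim to limit.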


StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

Jun 29, 2018
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across different attribute domains using a single generator network, (3) is able to generate converted speech signals quickly enough to allow real-time implementations and (4) requires only several minutes of training examples to generate reasonably realistic-sounding speech. Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.
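
The single-generator, many-to-many conditioning idea can be illustrated as follows (a conceptual sketch, not the actual StarGAN-VC network): a target-speaker code is appended to every input frame, so one network covers all attribute-domain mappings.

```python
def one_hot(index, size):
    """Target-attribute (e.g. speaker) code as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def condition(features, speaker_index, n_speakers):
    """Append the target-speaker code to each frame; a single generator
    network consuming these frames can map any source to any target."""
    code = one_hot(speaker_index, n_speakers)
    return [frame + code for frame in features]

frames = [[0.1, 0.2], [0.3, 0.4]]
print(condition(frames, speaker_index=1, n_speakers=3))
# → [[0.1, 0.2, 0.0, 1.0, 0.0], [0.3, 0.4, 0.0, 1.0, 0.0]]
```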


An Effective Dereverberation Algorithm by Fusing MVDR and MCLP

Mar 28, 2022
Fengqi Tan, Changchun Bao

Reverberation degrades the quality of human-machine interaction, and many dereverberation methods have been proposed to address it. In existing methods based on multichannel linear prediction (MCLP), updating the parameters of the Kalman filter remains a challenging task, in particular obtaining an accurate power spectral density (PSD) estimate of the target speech. In this paper, a minimum variance distortionless response (MVDR) beamformer and MCLP are effectively fused for dereverberation, where the PSD of the target speech used by the Kalman filter in MCLP is modified. To construct the MVDR beamformer, the PSD of the late reverberation and the PSD of the noise are estimated simultaneously by a blocking-based PSD estimator. The PSD of the target speech used by the Kalman filter is then obtained by subtracting the PSDs of the late reverberation and the noise from the PSD of the observed signal. Compared to the reference methods, the proposed method shows outstanding performance.
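
Per frequency bin, the target-speech PSD estimate described above amounts to a subtraction with a floor (an illustrative sketch; the Kalman-filter integration is not shown):

```python
def target_psd(psd_obs, psd_late, psd_noise, floor=1e-10):
    """PSD_target = max(PSD_obs - PSD_late - PSD_noise, floor), per bin.
    The floor keeps the estimate positive when the subtraction overshoots."""
    return [max(o - l - n, floor)
            for o, l, n in zip(psd_obs, psd_late, psd_noise)]

print(target_psd([1.0, 0.5, 0.2], [0.25, 0.125, 0.15], [0.25, 0.125, 0.1]))
# → [0.5, 0.25, 1e-10]
```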

* 5 pages, 3 figures; submitted to INTERSPEECH 2022 (confirmation number 546)
