Manuel Sam Ribeiro

TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

Nov 19, 2020

Manuel Sam Ribeiro, Jennifer Sanger, Jing-Xuan Zhang, Aciel Eshky, Alan Wrench, Korin Richmond, Steve Renals

Abstract: We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of parallel ultrasound, video, and audio data, of which approximately 13.5 hours are speech. This paper describes the corpus and presents benchmark results for the tasks of speech recognition, speech synthesis (articulatory-to-acoustic mapping), and automatic synchronisation of ultrasound to audio. The TaL corpus is publicly available under the CC BY-NC 4.0 license.

* 8 pages, 4 figures, Accepted to SLT2021, IEEE Spoken Language Technology Workshop

Ultrasound tongue imaging for diarization and alignment of child speech therapy sessions

Aug 15, 2019

Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

Abstract: We investigate the automatic processing of child speech therapy sessions using ultrasound visual biofeedback, with a specific focus on complementing acoustic features with ultrasound images of the tongue for the tasks of speaker diarization and time-alignment of target words. For speaker diarization, we propose an ultrasound-based time-domain signal which we call estimated tongue activity. For word-alignment, we augment an acoustic model with low-dimensional representations of ultrasound images of the tongue, learned by a convolutional neural network. We conduct our experiments using the Ultrasuite repository of ultrasound and speech recordings for child speech therapy sessions. For both tasks, we observe that systems augmented with ultrasound data outperform corresponding systems using only the audio signal.

* 5 pages, 3 figures, Accepted for publication at Interspeech 2019
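
The abstract above names an ultrasound-based time-domain signal, estimated tongue activity, but does not define it here. As a rough illustration only, one simple way to turn a sequence of ultrasound frames into a time-domain signal is the mean absolute pixel difference between consecutive frames. This is an assumption made for the sketch, not the paper's definition; the function names, the toy frames, and the threshold are all hypothetical.

```python
def frame_difference_signal(frames):
    """Return one value per consecutive frame pair: the mean absolute pixel difference.

    Assumption: this frame-difference measure is a stand-in for illustration,
    not the estimated tongue activity signal defined in the paper.
    """
    signal = []
    for prev, curr in zip(frames, frames[1:]):
        diffs = [
            abs(a - b)
            for row_prev, row_curr in zip(prev, curr)
            for a, b in zip(row_prev, row_curr)
        ]
        signal.append(sum(diffs) / len(diffs))
    return signal


def active_steps(signal, threshold):
    """Mark a step as active when the signal exceeds a (hypothetical) threshold."""
    return [value > threshold for value in signal]


# Toy 2x2 "ultrasound frames": two still frames followed by two changing ones.
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[4, 0], [0, 4]],
    [[0, 4], [4, 0]],
]

signal = frame_difference_signal(frames)
active = active_steps(signal, threshold=1.0)
```

A signal of this kind could in principle be thresholded or smoothed to suggest when the imaged speaker is articulating, which is the sort of cue a diarization system might combine with audio.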

UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions

Jul 01, 2019

Aciel Eshky, Manuel Sam Ribeiro, Joanne Cleland, Korin Richmond, Zoe Roxburgh, James Scobbie, Alan Wrench

Abstract: We introduce UltraSuite, a curated repository of ultrasound and acoustic data, collected from recordings of child speech therapy sessions. This release includes three data collections, one from typically developing children and two from children with speech sound disorders. In addition, it includes a set of annotations, some manual and some automatically produced, and software tools to process, transform and visualise the data.

* 5 pages, 1 figure, 3 tables; accepted to Interspeech 2018: 19th Annual Conference of the International Speech Communication Association (ISCA)

Synchronising audio and ultrasound by learning cross-modal embeddings

Jul 01, 2019

Aciel Eshky, Manuel Sam Ribeiro, Korin Richmond, Steve Renals

Abstract: Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to synchronise automatically than utterances containing natural variation in speech such as words, sentences, or conversations.

* 5 pages, 1 figure, 4 tables; accepted to Interspeech 2019: the 20th Annual Conference of the International Speech Communication Association (ISCA)
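
The offset search described in the abstract above can be illustrated with a minimal sketch. It assumes the two streams of the network have already mapped audio windows and ultrasound windows into a shared embedding space, and that the predicted offset is the candidate shift with the smallest average embedding distance. The abstract does not spell out the selection rule, so this rule, the function names, and the toy embeddings are all assumptions.

```python
import math


def mean_distance(audio_emb, ultra_emb, shift):
    """Average Euclidean distance between ultrasound window i and audio window i + shift.

    Pairs that fall outside the audio sequence are skipped.
    """
    dists = []
    for i, u in enumerate(ultra_emb):
        j = i + shift
        if 0 <= j < len(audio_emb):
            dists.append(math.dist(u, audio_emb[j]))
    return sum(dists) / len(dists) if dists else math.inf


def predict_offset(audio_emb, ultra_emb, candidate_shifts):
    """Pick the candidate shift (in windows) with the smallest mean distance."""
    return min(candidate_shifts, key=lambda s: mean_distance(audio_emb, ultra_emb, s))


# Toy embeddings standing in for the outputs of the two network streams.
audio_emb = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0], [5.0, 5.0]]
# The ultrasound windows line up with audio windows 2..4, i.e. a shift of 2.
ultra_emb = audio_emb[2:5]

best_shift = predict_offset(audio_emb, ultra_emb, range(-2, 4))
```

In a real system the shift in windows would be converted to a time offset using the window step, and the embeddings would come from the trained model rather than hand-written lists.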

Speaker-independent classification of phonetic segments from raw ultrasound in child speech

Jul 01, 2019

Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

Abstract: Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously unseen speakers. In this work, we investigate the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted. We observe that models underperform when applied to data from speakers not seen at training time. However, when provided with minimal additional speaker information, such as the mean ultrasound frame, the models generalize better to unseen speakers.

* 5 pages, 4 figures, published in ICASSP2019 (IEEE International Conference on Acoustics, Speech and Signal Processing, 2019)
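
The abstract above mentions supplying the mean ultrasound frame as minimal additional speaker information. A minimal sketch of that idea follows: compute the pixel-wise mean over a speaker's frames and pair it with each individual frame. How the paper feeds the mean frame to its models is not stated in the abstract; stacking it as a second input channel is an assumption, and the function names and toy frames are hypothetical.

```python
def mean_frame(frames):
    """Pixel-wise mean over a list of equally sized 2D frames (lists of rows)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[r][c] for frame in frames) / n for c in range(cols)]
        for r in range(rows)
    ]


def add_speaker_channel(frame, speaker_mean):
    """Pair a frame with its speaker's mean frame as a two-channel input.

    Assumption: channel stacking is one possible way to provide the mean frame.
    """
    return [frame, speaker_mean]


# Toy 2x2 frames from a single (hypothetical) speaker.
frames = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]

speaker_mean = mean_frame(frames)
model_input = add_speaker_channel(frames[0], speaker_mean)
```

For an unseen speaker, the mean frame can be computed from that speaker's own unlabeled recordings, which is what makes it a lightweight form of speaker information.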
