James R. Glass

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

May 17, 2023
Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units. The source code will be made available after the anonymity period.
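
The abstract outlines a three-stage loop: a teacher encodes the clean audio, online clustering discretizes the teacher's embeddings into tokens, and a student predicts those tokens at masked positions. The sketch below is a minimal toy rendition of that loop, not the authors' released code: the linear "encoders", codebook size, EMA decays, masking rate, and the helper names (`training_step`, `assign_and_update_codebook`) are all illustrative assumptions.

```python
# Toy sketch of a DinoSR-style training step (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

D, K = 256, 64                                     # embedding dim, codebook size (assumed)
codebook = F.normalize(torch.randn(K, D), dim=-1)  # online-clustered centroids

@torch.no_grad()
def update_teacher(teacher, student, decay=0.999):
    # The teacher is an exponential moving average of the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

@torch.no_grad()
def assign_and_update_codebook(embeddings, decay=0.9):
    # Online clustering: nearest-centroid assignment, then EMA centroid update.
    z = F.normalize(embeddings, dim=-1)            # (T, D)
    ids = (z @ codebook.T).argmax(dim=-1)          # (T,) discrete token per frame
    for k in ids.unique():
        mean_k = z[ids == k].mean(dim=0)
        codebook[k] = F.normalize(decay * codebook[k] + (1 - decay) * mean_k, dim=-1)
    return ids

def training_step(student, teacher, classifier, audio, mask):
    # Teacher sees the clean audio; its embeddings become discrete targets.
    with torch.no_grad():
        targets = assign_and_update_codebook(teacher(audio))
    # Student sees masked audio and predicts the teacher's token at each
    # masked position (a masked-language-modeling objective).
    logits = classifier(student(audio * (~mask).unsqueeze(-1)))   # (T, K)
    loss = F.cross_entropy(logits[mask], targets[mask])
    update_teacher(teacher, student)
    return loss

# Toy usage: linear encoders over 100 frames of 80-dim features (assumed shapes).
student = torch.nn.Linear(80, D)
teacher = torch.nn.Linear(80, D)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)
classifier = torch.nn.Linear(D, K)
audio = torch.randn(100, 80)
mask = torch.rand(100) < 0.3        # True at masked positions
loss = training_step(student, teacher, classifier, audio, mask)
loss.backward()
```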

Learning Word-Like Units from Joint Audio-Visual Analysis

May 24, 2017
David Harwath, James R. Glass

Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, our model is able to detect spoken instances of the word 'lighthouse' within an utterance and associate them with image regions containing lighthouses. We do not use any form of conventional automatic speech recognition, nor do we use any text transcriptions or conventional linguistic annotations. Our model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
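
The abstract describes associating acoustic segments with semantically relevant image regions. One common way to realize this kind of joint audio-visual analysis is a frame-by-region similarity "matchmap"; the sketch below is an illustrative take under that assumption, where the feature shapes, the score threshold, and the run-collapsing heuristic are invented for the example rather than taken from the paper.

```python
# Illustrative audio-visual grounding via a similarity "matchmap"
# (assumed shapes and heuristics; not the paper's exact architecture).
import torch
import torch.nn.functional as F

def matchmap(image_feats, audio_feats):
    # image_feats: (H, W, D) grid of image-region embeddings
    # audio_feats: (T, D) frame-level embeddings of the spoken caption
    H, W, D = image_feats.shape
    sim = image_feats.reshape(H * W, D) @ audio_feats.T   # (H*W, T)
    return sim.reshape(H, W, -1)                          # (H, W, T)

def word_like_segments(m, threshold=0.5):
    # Crude heuristic: audio frames whose best-matching image region scores
    # above an (arbitrary) threshold form candidate word-like units.
    score_per_frame = m.reshape(-1, m.shape[-1]).max(dim=0).values   # (T,)
    active = score_per_frame > threshold
    # Collapse runs of consecutive active frames into (start, end) spans.
    spans, start = [], None
    for t, a in enumerate(active.tolist() + [False]):
        if a and start is None:
            start = t
        elif not a and start is not None:
            spans.append((start, t))
            start = None
    return spans

# Toy usage: random embeddings standing in for CNN outputs.
img = F.normalize(torch.randn(7, 7, 512), dim=-1)
aud = F.normalize(torch.randn(128, 512), dim=-1)
spans = word_like_segments(matchmap(img, aud))
```

In the paper's setting, a span that consistently aligns with image regions containing, say, a lighthouse would be treated as a discovered spoken instance of that word category.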
