Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jennifer Cole

Probing neural audio codecs for distinctions among English nuclear tunes

Mar 14, 2026

Juan Pablo Vigneaux, Jennifer Cole

Abstract:State-of-the-art spoken dialogue models (Défossez et al. 2024; Schalkwyk et al. 2025) use neural audio codecs to "tokenize" audio signals into a lower-frequency stream of vectorial latent representations, each quantized using a hierarchy of vector codebooks. A transformer layer allows these representations to reflect some time- and context-dependent patterns. We train probes on labeled audio data from Cole et al. (2023) to test whether the pitch trajectories that characterize English phrase-final (nuclear) intonational tunes are among these patterns. Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy (TATA): 0.31) and the five clusters of these tunes that are robust in human speech production and perception (TATA: 0.45). Greater accuracy (TATAs: 0.74-0.89) is attained for binary distinctions between classes of rising vs. falling tunes, respectively used for questions and assertions. Information about tunes is spread among all codebooks, which calls into question a distinction between 'semantic' and 'acoustic' codebooks found in the literature. Accuracies improve with nonlinear probes, but discrimination among the five clusters remains far from human performance, suggesting a fundamental limitation of current codecs.

* 5 pages; 1 table; 3 figures. Accepted as conference paper at Speech Prosody 2026

Via

Access Paper or Ask Questions

Crowdsourced and Automatic Speech Prominence Estimation

Oct 12, 2023

Max Morrison, Pranav Pawar, Nathan Pruyne, Jennifer Cole, Bryan Pardo

Figure 1 for Crowdsourced and Automatic Speech Prominence Estimation

Figure 2 for Crowdsourced and Automatic Speech Prominence Estimation

Figure 3 for Crowdsourced and Automatic Speech Prominence Estimation

Figure 4 for Crowdsourced and Automatic Speech Prominence Estimation

Abstract:The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis, as well as training automated systems to perform emphasis-controlled text-to-speech or emotion recognition. Manually annotating prominence is time-consuming and expensive, which motivates the development of automated methods for speech prominence estimation. However, developing such an automated system using machine-learning methods requires human-annotated training data. Using our system for acquiring such human annotations, we collect and open-source crowdsourced annotations of a portion of the LibriTTS dataset. We use these annotations as ground truth to train a neural speech prominence estimator that generalizes to unseen speakers, datasets, and speaking styles. We investigate design decisions for neural prominence estimation as well as how neural prominence estimation improves as a function of two key factors of annotation cost: dataset size and the number of annotations per utterance.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Prosodic entrainment in dialog acts

Oct 30, 2018

Uwe D. Reichel, Katalin Mády, Jennifer Cole

Figure 1 for Prosodic entrainment in dialog acts

Figure 2 for Prosodic entrainment in dialog acts

Figure 3 for Prosodic entrainment in dialog acts

Figure 4 for Prosodic entrainment in dialog acts

Abstract:We examined prosodic entrainment in spoken dialogs separately for several dialog acts in cooperative and competitive games. Entrainment was measured for intonation features derived from a superpositional intonation stylization as well as for rhythm features. The found differences can be related to the cooperative or competitive nature of the game, as well as to dialog act properties as its intrinsic authority, supportiveness and distributional characteristics. In cooperative games dialog acts with a high authority given by knowledge and with a high frequency showed the most entrainment. The results are discussed amongst others with respect to the degree of active entrainment control in cooperative behavior.

* This manuscript is under revision. Please contact the authors for information about updates

Via

Access Paper or Ask Questions