
Rudolf Uher


Significance of Speaker Embeddings and Temporal Context for Depression Detection

Jul 24, 2021
Sri Harsha Dumpala, Sebastian Rodriguez, Sheri Rempel, Rudolf Uher, Sageev Oore


Depression detection from speech has attracted a lot of attention in recent years. However, the significance of speaker-specific information in depression detection has not yet been explored. In this work, we analyze the significance of speaker embeddings for the task of depression detection from speech. Experimental results show that the speaker embeddings provide important cues to achieve state-of-the-art performance in depression detection. We also show that combining conventional OpenSMILE and COVAREP features, which carry complementary information, with speaker embeddings further improves the depression detection performance. The significance of temporal context in the training of deep learning models for depression detection is also analyzed in this paper.
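The abstract describes fusing speaker embeddings with conventional OpenSMILE and COVAREP features. Below is a minimal sketch, not the authors' code, of that kind of feature-level fusion: the embeddings, feature dimensions, and classifier are placeholder assumptions (random arrays stand in for real extractor output).

```python
# Minimal sketch (assumptions, not the paper's implementation): concatenate
# precomputed speaker embeddings with OpenSMILE/COVAREP-style features and
# train a simple classifier for the binary depressed / non-depressed label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_utts = 200  # placeholder number of utterances

# Placeholder features; in practice these would come from a speaker-embedding
# extractor (e.g. x-vectors) and from OpenSMILE / COVAREP feature extraction.
spk_emb = rng.normal(size=(n_utts, 512))    # assumed 512-d speaker embeddings
opensmile = rng.normal(size=(n_utts, 88))   # assumed OpenSMILE functionals
covarep = rng.normal(size=(n_utts, 74))     # assumed COVAREP descriptors
labels = rng.integers(0, 2, size=n_utts)    # 0 = control, 1 = depressed

# Feature fusion by concatenation, as described in the abstract.
fused = np.concatenate([spk_emb, opensmile, covarep], axis=1)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, fused, labels, cv=5, scoring="f1")
print("5-fold F1:", scores.mean())
```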


Multimodal Deep Learning for Mental Disorders Prediction from Audio Speech Samples

Sep 12, 2019
Habibeh Naderi, Behrouz Haji Soleimani, Sheri Rempel, Stan Matwin, Rudolf Uher


Key features of mental illnesses are reflected in speech. Our research focuses on designing a multimodal deep learning structure that automatically extracts salient features from recorded speech samples for predicting various mental disorders, including depression, bipolar disorder, and schizophrenia. We adopt a variety of pre-trained models to extract embeddings from both audio and text segments. We use several state-of-the-art embedding techniques, including BERT, FastText, and Doc2VecC for text representation learning, and WaveNet and VGG-ish models for audio encoding. We also leverage large auxiliary emotion-labeled text and audio corpora to train emotion-specific embeddings and use transfer learning to address the limited availability of annotated multimodal data. All these embeddings are then combined into a joint representation in a multimodal fusion layer, and finally a recurrent neural network is used to predict the mental disorder. Our results show that mental disorders can be predicted with acceptable accuracy through multimodal analysis of clinical interviews.
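As a rough illustration of the fusion-plus-recurrence design the abstract outlines, here is a hedged sketch: the embedding dimensions (e.g. 768-d text, 128-d audio), hidden size, and number of classes are assumptions, and the model below is a generic stand-in rather than the authors' architecture.

```python
# Minimal sketch (assumptions, not the authors' implementation): fuse
# per-segment text and audio embeddings in a joint fusion layer, then run a
# recurrent network over the segment sequence to predict the disorder label.
import torch
import torch.nn as nn

class MultimodalRNN(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, hidden=256, n_classes=3):
        super().__init__()
        # Joint multimodal fusion layer over concatenated embeddings.
        self.fusion = nn.Linear(text_dim + audio_dim, hidden)
        # Recurrent model over the interview's segment sequence.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, text_emb, audio_emb):
        # text_emb: (batch, segments, text_dim), e.g. BERT sentence vectors
        # audio_emb: (batch, segments, audio_dim), e.g. VGG-ish clip vectors
        fused = torch.relu(self.fusion(torch.cat([text_emb, audio_emb], dim=-1)))
        _, h = self.rnn(fused)
        return self.out(h[-1])  # logits over disorder classes

# Toy forward pass with placeholder embeddings (dimensions are assumptions).
model = MultimodalRNN()
logits = model(torch.randn(4, 20, 768), torch.randn(4, 20, 128))
print(logits.shape)  # torch.Size([4, 3])
```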

* arXiv admin note: text overlap with arXiv:1811.09362 by other authors 