"speech": models, code, and papers

Spoken Speech Enhancement using EEG

Sep 13, 2019
Gautam Krishna, Yan Han, Co Tran, Mason Carnahan, Ahmed H Tewfik

In this paper we demonstrate spoken speech enhancement from electroencephalography (EEG) signals using a generative adversarial network (GAN) based model and a long short-term memory (LSTM) regression based model. Our results demonstrate that EEG features can be used to clean speech recorded in the presence of background noise.

* To be submitted to ICASSP 2020. arXiv admin note: text overlap with arXiv:1906.08044, arXiv:1906.08871, arXiv:1906.08045

Emotional Speaker Identification using a Novel Capsule Nets Model

Jan 09, 2022
Ali Bou Nassif, Ismail Shahin, Ashraf Elnagar, Divya Velayudhan, Adi Alhudhaif, Kemal Polat

Speaker recognition systems are widely used in various applications to identify a person by their voice; however, the high degree of variability in speech signals makes this a challenging task. Dealing with emotional variations is very difficult because emotions alter the voice characteristics of a person; thus, the acoustic features differ from those used to train models in a neutral environment. Therefore, speaker recognition models trained on neutral speech fail to correctly identify speakers under emotional stress. Although considerable advancements in speaker identification have been made using convolutional neural networks (CNNs), CNNs cannot exploit the spatial association between low-level features. Inspired by the recent introduction of capsule networks (CapsNets), which are based on deep learning to overcome the inadequacy of CNNs in preserving the pose relationship between low-level features with their pooling technique, this study investigates the performance of using CapsNets in identifying speakers from emotional speech recordings. A CapsNet-based speaker identification model is proposed and evaluated using three distinct speech databases, i.e., the Emirati Speech Database, SUSAS Dataset, and RAVDESS (open-access). The proposed model is also compared to baseline systems. Experimental results demonstrate that the novel proposed CapsNet model trains faster and provides better results than current state-of-the-art schemes. The effect of the routing algorithm on speaker identification performance was also studied by varying the number of iterations, both with and without a decoder network.

* 11 pages, 8 figures
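
The abstract above mentions varying the number of routing iterations but does not spell out the routing algorithm. As a rough illustration only, the sketch below assumes the commonly used routing-by-agreement scheme with a squash nonlinearity; the function names (`squash`, `route`) and the `num_iters` parameter are hypothetical and are not taken from the paper.

```python
import math


def squash(v):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0 for _ in v]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, num_iters=3):
    """Routing by agreement (assumed variant, not confirmed by the abstract).

    u_hat[i][j] is the prediction vector that lower capsule i makes for
    upper capsule j. Returns one output vector per upper capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    logits = [[0.0] * n_out for _ in range(n_in)]
    outputs = []
    for _ in range(num_iters):
        # Each lower capsule distributes its vote across the upper capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_out):
            s = [sum(coupling[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            outputs.append(squash(s))
        # Raise the logit where a prediction agrees with the resulting output.
        for i in range(n_in):
            for j in range(n_out):
                logits[i][j] += sum(u_hat[i][j][d] * outputs[j][d]
                                    for d in range(dim))
    return outputs


# Two lower capsules, two upper capsules, 2-dimensional vectors (toy values).
u_hat = [[[1.0, 0.0], [0.0, 0.1]],
         [[0.9, 0.1], [0.1, 0.0]]]
for iters in (1, 3):
    lengths = [math.sqrt(sum(x * x for x in v)) for v in route(u_hat, iters)]
    print(iters, lengths)
```

In this toy setting the length of each output vector can be read as a confidence score, so changing `num_iters` shows how additional routing passes sharpen or soften those scores.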

Quadrupedal Robotic Guide Dog with Vocal Human-Robot Interaction

Nov 25, 2021
Kavan Mehrizi

Guide dogs play a critical role in the lives of many; however, training them is a time- and labor-intensive process. We are developing a method to allow an autonomous robot to physically guide humans using direct human-robot communication. The proposed algorithm will be deployed on a Unitree A1 quadrupedal robot and will autonomously navigate the person to their destination while communicating with the person using a speech interface compatible with the robot. This speech interface utilizes cloud-based services such as Amazon Polly and Google Cloud to serve as the text-to-speech and speech-to-text engines.

* Hopper Dean & NSF REU: Transfer-to-Excellence Research Experiences for Undergraduates (TTE REU), University of California, Berkeley

EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition

Apr 13, 2021
Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Ramon Fernandez Astudillo, Jan "Honza" Černocký

Self-supervised ASR-TTS models suffer in out-of-domain data conditions. Here we propose an enhanced ASR-TTS (EAT) model that incorporates two main features: 1) The ASR$\rightarrow$TTS direction is equipped with a language model reward to penalize the ASR hypotheses before forwarding them to TTS. 2) In the TTS$\rightarrow$ASR direction, a hyper-parameter is introduced to scale the attention context from synthesized speech before sending it to ASR to handle out-of-domain data. Training strategies and the effectiveness of the EAT model are explored under out-of-domain data conditions. The results show that EAT significantly reduces the performance gap between supervised and self-supervised training, by an absolute 2.6% and 2.7% on Librispeech and BABEL, respectively.

The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates

Feb 24, 2021
Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J. M. Rothkrantz, Joeri Zwerts, Jelle Treep, Casper Kaandorp

The INTERSPEECH 2021 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the COVID-19 Cough and COVID-19 Speech Sub-Challenges, a binary classification on COVID-19 infection has to be made based on coughing sounds and speech; in the Escalation Sub-Challenge, a three-way assessment of the level of escalation in a dialogue is featured; and in the Primates Sub-Challenge, four species vs background need to be classified. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the 'usual' COMPARE and BoAW features as well as deep unsupervised representation learning using the AuDeep toolkit, and deep feature extraction from pre-trained CNNs using the Deep Spectrum toolkit; in addition, we add deep end-to-end sequential modelling, and partially linguistic analysis.

* 5 pages

HaGRID - HAnd Gesture Recognition Image Dataset

Jun 16, 2022
Alexander Kapitanov, Andrew Makhlyarchuk, Karina Kvanchiani

In this paper, we introduce an enormous dataset HaGRID (HAnd Gesture Recognition Image Dataset) for hand gesture recognition (HGR) systems. This dataset contains 552,992 samples divided into 18 classes of gestures. The annotations consist of bounding boxes of hands with gesture labels and markups of leading hands. The proposed dataset allows for building HGR systems, which can be used in video conferencing services, home automation systems, the automotive sector, services for people with speech and hearing impairments, etc. We are especially focused on interaction with devices to manage them. That is why all 18 chosen gestures are functional, familiar to the majority of people, and may be an incentive to take some action. In addition, we used crowdsourcing platforms to collect the dataset and took into account various parameters to ensure data diversity. We describe the challenges of using existing HGR datasets for our task and provide a detailed overview of them. Furthermore, the baselines for the hand detection and gesture classification tasks are proposed.

* 11 pages, 9 figures, open-source dataset for computer vision

Visually Guided Self Supervised Learning of Speech Representations

Jan 13, 2020
Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic

Self-supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone, and there has been very limited work that studies the interaction between the two modalities for learning self-supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations that we evaluate on emotion recognition and speech recognition. We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations as a novel way for self-supervised learning which has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of training data of unlabelled audiovisual speech and have a large number of potentially promising applications.

* Submitted to ICASSP 2020

Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

Feb 24, 2022
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw

VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.
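
To make the conditioning step concrete, here is a minimal sketch of feature-wise linear modulation (FiLM): a per-channel scale and shift, both predicted from a conditioning vector, applied to every frame of features. The helper names, the use of plain linear maps to produce the scale and shift, and all numeric values are illustrative assumptions; the abstract does not describe how the actual model computes them.

```python
def linear(vec, weight, bias):
    """Plain affine map: weight is a list of rows, one per output channel."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def film(features, gamma, beta):
    """Apply y[t][k] = gamma[k] * x[t][k] + beta[k] to every frame t."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in features]


def film_from_embedding(features, embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Predict scale and shift from a speaker embedding, then modulate.

    The two linear layers are hypothetical stand-ins for whatever network
    produces the modulation parameters in a real system.
    """
    gamma = linear(embedding, w_gamma, b_gamma)
    beta = linear(embedding, w_beta, b_beta)
    return film(features, gamma, beta)


# Toy example: two frames with two channels, and a 2-dimensional embedding.
features = [[1.0, 2.0], [3.0, 4.0]]
embedding = [1.0, 0.0]
modulated = film_from_embedding(
    features, embedding,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.5],
    w_beta=[[1.0, 0.0], [0.0, 0.0]], b_beta=[0.0, -1.0],
)
print(modulated)
```

Because the embedding only enters through the scale and shift, the same feature extractor can be reused for any enrolled speaker, which is one plausible reason this kind of conditioning suits a multi-user setting.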

Eigenresiduals for improved Parametric Speech Synthesis

Jan 02, 2020
Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately, the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. This basis contains a limited number of eigenresiduals and is computed on a relatively small speech database. A stream of PCA-based coefficients is added to our HMM-based synthesizer and makes it possible to generate the voiced excitation during synthesis. An improvement compared to the traditional excitation is reported while the synthesis engine footprint remains under about 1Mb.
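
The decomposition described above can be sketched in a few lines. The example below assumes the residual frames have already been extracted pitch-synchronously and resampled to a common length (a preprocessing step not shown here), and it finds the leading principal directions with power iteration and deflation rather than a full eigendecomposition. The function names and the toy data are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def eigenresiduals(frames, k, iters=200, seed=0):
    """Return (mean, basis): the mean frame and the k leading principal
    directions of the frames. k must not exceed the rank of the centered data.
    """
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    centered = [[x - m for x, m in zip(f, mean)] for f in frames]
    rng = random.Random(seed)
    basis = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Multiply by the covariance implicitly: C v = X^T (X v) / (n - 1).
            scores = [dot(row, v) for row in centered]
            w = [sum(s * row[d] for s, row in zip(scores, centered)) / (n - 1)
                 for d in range(dim)]
            # Deflation: remove the directions that were already found.
            for u in basis:
                p = dot(w, u)
                w = [x - p * y for x, y in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        basis.append(v)
    return mean, basis


def encode(frame, mean, basis):
    """Project one frame onto the basis; returns one coefficient per vector."""
    centered = [x - m for x, m in zip(frame, mean)]
    return [dot(centered, e) for e in basis]


def decode(coeffs, mean, basis):
    """Rebuild an approximate frame from its coefficients."""
    out = list(mean)
    for c, e in zip(coeffs, basis):
        out = [o + c * x for o, x in zip(out, e)]
    return out


# Toy frames that lie in a 2-dimensional subspace of a 4-sample frame.
frames = [[a, b, a, -b] for a, b in
          [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (-1.0, 3.0), (0.5, -2.0)]]
mean, basis = eigenresiduals(frames, k=2)
coeffs = encode(frames[2], mean, basis)
print(coeffs, decode(coeffs, mean, basis))
```

Only the mean frame and the `k` basis vectors need to be stored, with `k` coefficients per frame at synthesis time, which is consistent with the small footprint the abstract reports.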

Hierarchical Attention Network for Evaluating Therapist Empathy in Counseling Session

Mar 31, 2022
Dehua Tao, Tan Lee, Harold Chui, Sarah Luk

Counseling typically takes the form of spoken conversation between a therapist and a client. The empathy level expressed by the therapist is considered to be an essential quality factor of counseling outcome. This paper proposes a hierarchical recurrent network combined with two-level attention mechanisms to determine the therapist's empathy level solely from the acoustic features of conversational speech in a counseling session. The experimental results show that the proposed model can achieve an accuracy of 72.1% in classifying the therapist's empathy level as being "high" or "low". It is found that speech from both the therapist and the client contributes to predicting the empathy level that is subjectively rated by an expert observer. By analyzing speaker turns assigned high attention weights, it is observed that 2 to 6 consecutive turns should be considered together to provide useful clues for detecting empathy, and the observer tends to take the whole session into consideration when rating the therapist's empathy, instead of relying on a few specific speaker turns.

* Submitted to INTERSPEECH 2022
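
The two-level attention idea in the abstract above can be illustrated with a stripped-down sketch: frames are pooled into one vector per speaker turn, and turn vectors are pooled into one vector per session, with the turn-level weights available for inspection. The recurrent encoders of the proposed model are deliberately omitted, and the dot-product scoring, the query vectors, and all names are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of vectors; weights are a softmax over dot-product
    scores between each vector and the query."""
    scores = [sum(x * q for x, q in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors))
              for d in range(dim)]
    return pooled, weights


def session_embedding(session, frame_query, turn_query):
    """session is a list of turns; each turn is a list of frame vectors.

    Returns the session vector and the attention weight of each turn.
    """
    turn_vectors = [attention_pool(turn, frame_query)[0] for turn in session]
    return attention_pool(turn_vectors, turn_query)


# Toy session: two turns with 2-dimensional acoustic feature frames.
session = [[[1.0, 0.0], [0.0, 1.0]],
           [[2.0, 2.0]]]
vector, turn_weights = session_embedding(session, [0.5, -0.5], [1.0, 0.0])
print(vector, turn_weights)
```

Inspecting `turn_weights` is the kind of analysis the abstract refers to when it discusses which speaker turns receive high attention.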
