
"speech": models, code, and papers

An Online Multilingual Hate Speech Recognition System

Dec 22, 2020
Neeraj Vashistha, Arkaitz Zubiaga, Shanky Sharma

The exponential increase in the use of the Internet and social media over the last two decades has changed human interaction. This has led to many positive outcomes, but at the same time it has brought risks and harms. While the volume of harmful content online, such as hate speech, is not manageable by humans, interest in the academic community in investigating automated means of hate speech detection has increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset and classify their instances into three classes: abusive, hateful, or neither. We create a baseline model and improve its performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool which identifies and scores a page with an effective metric in near-real time and uses this as feedback to re-train our model. We demonstrate the competitive performance of our multilingual model on two languages, English and Hindi, leading to comparable or superior performance to most monolingual models.
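The dataset-combination step described above can be sketched as a label-harmonisation pass: each source dataset's native labels are mapped into the unified three-class scheme before merging. The dataset names and mapping tables below are illustrative assumptions, not the authors' actual scheme.

```python
# Hypothetical label maps from each source dataset's native labels to the
# unified 3-class scheme (abusive / hateful / neither). These mappings are
# assumptions for illustration, not the paper's exact tables.
LABEL_MAPS = {
    "dataset_a": {"hate_speech": "hateful", "offensive": "abusive", "neither": "neither"},
    "dataset_b": {"HATE": "hateful", "OFFN": "abusive", "NONE": "neither"},
}

def harmonise(dataset_name, rows):
    """Map (text, native_label) pairs into the unified 3-class scheme."""
    mapping = LABEL_MAPS[dataset_name]
    return [(text, mapping[label]) for text, label in rows]

# Merge two toy datasets into one homogeneous collection.
merged = []
merged += harmonise("dataset_a", [("example tweet", "offensive")])
merged += harmonise("dataset_b", [("another tweet", "NONE")])
print(merged)  # [('example tweet', 'abusive'), ('another tweet', 'neither')]
```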

* Information 12, no. 1: 5 (2021) 
* 11 pages, 5 figures, appear in Special Issue "Natural Language Processing for Social Media" on MDPI Information 2021, 12(1), 5 


FFC-SE: Fast Fourier Convolution for Speech Enhancement

Apr 06, 2022
Ivan Shchekotov, Pavel Andreev, Oleg Ivanov, Aibek Alanov, Dmitry Vetrov

Fast Fourier convolution (FFC) is a recently proposed neural operator that has shown promising performance in several computer vision problems. The FFC operator allows employing large receptive field operations within early layers of the neural network. It was shown to be especially helpful for inpainting of periodic structures, which are common in audio processing. In this work, we design neural network architectures which adapt FFC for speech enhancement. We hypothesize that a large receptive field allows these networks to produce more coherent phases than vanilla convolutional models, and validate this hypothesis experimentally. We found that neural networks based on Fast Fourier convolution outperform analogous convolutional models and show better or comparable results with other speech enhancement baselines.
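The global-receptive-field idea behind FFC can be sketched as its spectral path: a signal is transformed with a real FFT, multiplied point-wise by learnable complex weights, and transformed back, so every output sample depends on every input sample in a single layer. Shapes and weights below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def spectral_transform(x, w_real, w_imag):
    """Apply a learnable point-wise filter in the frequency domain
    (a minimal sketch of FFC's spectral branch, 1-D case)."""
    spec = np.fft.rfft(x)                 # (n//2 + 1,) complex spectrum
    spec = spec * (w_real + 1j * w_imag)  # learnable point-wise weights
    return np.fft.irfft(spec, n=len(x))   # back to the time domain

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
n_freq = 16 // 2 + 1
# With identity weights (1 + 0j) the transform reproduces the input,
# which is a quick sanity check of the round trip.
y = spectral_transform(x, np.ones(n_freq), np.zeros(n_freq))
print(np.allclose(y, x))  # True
```

Because the multiplication happens on the full spectrum, a single such layer mixes information across the entire signal, unlike a small spatial kernel.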

* Submitted to INTERSPEECH 2022 


A Comparative Study on End-to-end Speech to Text Translation

Nov 20, 2019
Parnia Bahar, Tobias Bieschke, Hermann Ney

Recent advances in deep learning show that end-to-end speech-to-text translation models are a promising approach to direct speech translation. In this work, we provide an overview of different end-to-end architectures, as well as the usage of an auxiliary connectionist temporal classification (CTC) loss for better convergence. We also investigate pre-training variants, such as initializing different components of a model using pre-trained models, and their impact on the final performance, which yields gains of up to 4% in BLEU and 5% in TER. Our experiments are performed on 270h IWSLT TED-talks En->De and 100h LibriSpeech Audiobooks En->Fr. We also show improvements over the current end-to-end state-of-the-art systems on both tasks.
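The auxiliary-CTC idea above is commonly realised by interpolating a CTC loss computed on the encoder output with the main decoder cross-entropy loss. A minimal sketch, where the interpolation weight `lam` is an assumption and not a value from the paper:

```python
def joint_loss(ce_loss, ctc_loss, lam=0.3):
    """Interpolate the decoder cross-entropy loss with an auxiliary CTC
    loss on the encoder (lam is a hypothetical weight)."""
    return (1.0 - lam) * ce_loss + lam * ctc_loss

# With lam = 0.3, a CE loss of 2.0 and a CTC loss of 4.0 blend to
# 0.7 * 2.0 + 0.3 * 4.0, i.e. close to 2.6.
total = joint_loss(2.0, 4.0)
print(total)
```

The auxiliary term gives the encoder a direct alignment signal early in training, which is what helps convergence before the attention decoder is reliable.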

* 8 pages, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, December 2019 


Who Are We Talking About? Handling Person Names in Speech Translation

May 13, 2022
Marco Gaido, Matteo Negri, Marco Turchi

Recent work has shown that systems for speech translation (ST) -- similarly to automatic speech recognition (ASR) -- handle person names poorly. This shortcoming not only leads to errors that can seriously distort the meaning of the input, but also hinders the adoption of such systems in application scenarios (like computer-assisted interpreting) where the translation of named entities, like person names, is crucial. In this paper, we first analyse the outputs of ASR/ST systems to identify the reasons for failures in person name transcription/translation. Besides the frequency in the training data, we pinpoint the nationality of the referred person as a key factor. We then mitigate the problem by creating multilingual models, and further improve our ST systems by forcing them to jointly generate transcripts and translations, prioritising the former over the latter. Overall, our solutions result in a relative improvement in token-level person name accuracy of 47.8% on average for three language pairs (en->es,fr,it).
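The token-level person-name accuracy used as the evaluation metric above can be sketched as the fraction of reference name tokens recovered in the system output. The exact matching protocol here (case-sensitive set membership) is an assumption for illustration, not necessarily the paper's.

```python
def name_token_accuracy(ref_names, hyp_tokens):
    """Fraction of reference person-name tokens found in the hypothesis
    (hypothetical, case-sensitive matching)."""
    hyp = set(hyp_tokens)
    correct = sum(1 for tok in ref_names if tok in hyp)
    return correct / len(ref_names) if ref_names else 1.0

# "Gaido" is matched; "Marco" was mistranscribed as lowercase "marco",
# so only 1 of 2 reference name tokens counts as correct.
acc = name_token_accuracy(["Marco", "Gaido"], ["marco", "Gaido", "spoke"])
print(acc)  # 0.5
```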

* Accepted at IWSLT2022 


Speech Emotion Recognition using Semantic Information

Mar 04, 2021
Panagiotis Tzirakis, Anh Nguyen, Stefanos Zafeiriou, Björn W. Schuller

Speech emotion recognition is a crucial problem manifesting in a multitude of applications such as human-computer interaction and education. Although several advancements have been made in recent years, especially with the advent of Deep Neural Networks (DNN), most of the studies in the literature fail to consider the semantic information in the speech signal. In this paper, we propose a novel framework that can capture both the semantic and the paralinguistic information in the signal. In particular, our framework comprises a semantic feature extractor, which captures the semantic information, and a paralinguistic feature extractor, which captures the paralinguistic information. Both semantic and paralinguistic features are then combined into a unified representation using a novel attention mechanism. The unified feature vector is passed through an LSTM to capture the temporal dynamics in the signal, before the final prediction. To validate the effectiveness of our framework, we use the popular SEWA dataset of the AVEC challenge series and compare with the three winning papers. Our model provides state-of-the-art results in the valence and liking dimensions.
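One way to picture the attention-based fusion of the two feature branches: a softmax over per-branch scores produces weights that blend the semantic and paralinguistic vectors into one unified representation. The dimensionality and the dot-product scoring function below are assumptions, not the paper's exact mechanism.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def fuse(semantic, paralinguistic, score_w):
    """Attention-weighted sum of the two feature branches
    (hypothetical dot-product scoring)."""
    feats = np.stack([semantic, paralinguistic])  # (2, d)
    scores = feats @ score_w                      # one scalar per branch
    alpha = softmax(scores)                       # attention weights
    return alpha @ feats                          # (d,) unified vector

d = 4
sem = np.ones(d)
par = np.zeros(d)
# Zero scoring weights give equal attention to both branches,
# so the unified vector is the element-wise mean.
unified = fuse(sem, par, np.zeros(d))
print(unified)  # [0.5 0.5 0.5 0.5]
```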

* ICASSP 2021 


Neural Architecture Search for Speech Recognition

Jul 27, 2020
Shoukang Hu, Xurong Xie, Shansong Liu, Mengzhe Geng, Xunying Liu, Helen Meng

Deep neural networks (DNNs) based automatic speech recognition (ASR) systems are often designed using expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two hyper-parameters that heavily affect the performance and model complexity of state-of-the-art factored time delay neural network (TDNN-F) acoustic models: i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These include the standard DARTS method, fully integrating the estimation of architecture weights and TDNN parameters in lattice-free MMI (LF-MMI) training; Gumbel-Softmax DARTS, which reduces the confusion between candidate architectures; Pipelined DARTS, which circumvents the overfitting of architecture weights using held-out data; and Penalized DARTS, which further incorporates resource constraints to adjust the trade-off between performance and system complexity. Parameter sharing among candidate architectures was also used to facilitate efficient search over up to $7^{28}$ different TDNN systems. Experiments conducted on a 300-hour Switchboard conversational telephone speech recognition task suggest the NAS auto-configured TDNN-F systems consistently outperform the baseline LF-MMI trained TDNN-F systems using manual expert configurations. Absolute word error rate reductions of up to 1.0% and a relative model size reduction of 28% were obtained.
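The Gumbel-Softmax relaxation mentioned above draws near-one-hot samples over candidate architectures, sharpening the choice compared with plain softmax architecture weights. A minimal sketch with illustrative logits and temperature (not values from the paper):

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed one-hot sample over architecture candidates.
    Lower tau pushes the sample closer to a hard one-hot choice."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    z = (logits + g) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
sample = gumbel_softmax(np.array([1.0, 0.5, 0.1]), tau=0.5, rng=rng)
# The sample is a valid probability distribution over the candidates.
print(sample.sum())
```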

* One of the authors disagrees with putting the paper on arXiv since it has not been published, so we would like to formally withdraw the paper. We hope you can understand our concerns 


Incorporating Wireless Communication Parameters into the E-Model Algorithm

Mar 05, 2021
Demóstenes Z. Rodríguez, Dick Carrillo Melgarejo, Miguel A. Ramírez, Pedro H. J. Nardelli, Sebastian Möller

Telecommunication service providers have to guarantee acceptable speech quality during a phone call to avoid a negative impact on the users' quality of experience. Currently, there are different speech quality assessment methods. ITU-T Recommendation G.107 describes the E-model algorithm, which is a computational model developed for network planning purposes focused on narrowband (NB) networks. Later, ITU-T Recommendations G.107.1 and G.107.2 were developed for wideband (WB) and fullband (FB) networks. These algorithms use different impairment factors, each one related to different speech communication steps. However, the NB, WB, and FB E-model algorithms do not consider wireless techniques used in these networks, such as Multiple-Input-Multiple-Output (MIMO) systems, which are used to improve the communication system robustness in the presence of different types of wireless channel degradation. In this context, the main objective of this study is to propose a general methodology to incorporate wireless network parameters into the NB and WB E-model algorithms. To accomplish this goal, MIMO and wireless channel parameters are incorporated into the E-model algorithms, specifically into the $I_{e,eff}$ and $I_{e,eff,WB}$ impairment factors. For performance validation, subjective tests were carried out, and the proposed methodology reached a Pearson correlation coefficient (PCC) and a root mean square error (RMSE) of $0.9732$ and $0.2351$, respectively. It is noteworthy that our proposed methodology does not affect the rest of the E-model input parameters, and it intends to be useful for wireless network planning in speech communication services.
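For context on the quantities being extended, the narrowband E-model (ITU-T G.107) computes an effective equipment impairment $I_{e,eff}$ from the codec impairment and packet loss, subtracts it from the rating factor R, and maps R to an estimated MOS. The sketch below shows those standard formulas only; the paper's wireless/MIMO extension is not reproduced, and the codec values used are illustrative.

```python
def ie_eff(ie, bpl, ppl, burst_r=1.0):
    """Effective equipment impairment under packet loss (G.107 form):
    ie = codec impairment, bpl = packet-loss robustness,
    ppl = packet-loss percentage, burst_r = burst ratio."""
    return ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)

def r_to_mos(r):
    """Standard E-model mapping from the rating factor R to MOS."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# Illustrative narrowband scenario: default R of about 93.2 with no other
# impairments, a loss-robust codec (bpl = 10) and 2% random packet loss.
r = 93.2 - ie_eff(ie=0.0, bpl=10.0, ppl=2.0)
mos = r_to_mos(r)
print(round(mos, 2))
```

The proposed methodology in the paper modifies $I_{e,eff}$ (and its wideband counterpart) to reflect wireless channel and MIMO parameters, leaving the other E-model inputs untouched.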

* 18 pages 
