With the multiplication of social media platforms, which offer anonymity, easy access and online community formation, and online debate, the issue of hate speech detection and tracking becomes a growing challenge to society, individual, policy-makers and researchers. Despite efforts for leveraging automatic techniques for automatic detection and monitoring, their performances are still far from satisfactory, which constantly calls for future research on the issue. This paper provides a systematic review of literature in this field, with a focus on natural language processing and deep learning technologies, highlighting the terminology, processing pipeline, core methods employed, with a focal point on deep learning architecture. From a methodological perspective, we adopt PRISMA guideline of systematic review of the last 10 years literature from ACM Digital Library and Google Scholar. In the sequel, existing surveys, limitations, and future research directions are extensively discussed.
In automatic speech recognition, many studies have shown performance improvements using language models (LMs). Recent studies have tried to use bidirectional LMs (biLMs) instead of conventional unidirectional LMs (uniLMs) for rescoring the $N$-best list decoded from the acoustic model. In spite of their theoretical benefits, the biLMs have not given notable improvements compared to the uniLMs in their experiments. This is because their biLMs do not consider the interaction between the two directions. In this paper, we propose a novel sentence scoring method considering the interaction between the past and the future words on the biLM. Our experimental results on the LibriSpeech corpus show that the biLM with the proposed sentence scoring outperforms the uniLM for the $N$-best list rescoring, consistently and significantly in all experimental conditions. The analysis of WERs by word position demonstrates that the biLM is more robust than the uniLM especially when a recognized sentence is short or a misrecognized word is at the beginning of the sentence.
Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a pre-trained residual network (ResNet) model including a statistics pooling layer from speaker recognition trained using large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation technique to generate additional training data samples by applying random time-frequency masks to log-mel spectrograms to mitigate overfitting and improve the generalization of emotion recognition models. We evaluate the effectiveness of our proposed approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that the transfer learning and spectrogram augmentation approaches improve the SER performance, and when combined achieve state-of-the-art results.
Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks. This problem is commonly tackled using machine learning methods, i.e., by training a POS tagger on a sufficiently large corpus of labeled data. While the problem of POS tagging can essentially be considered as solved for modern languages, historical corpora turn out to be much more difficult, especially due to the lack of native speakers and sparsity of training data. Moreover, most texts have no sentences as we know them today, nor a common orthography. These irregularities render the task of automated POS tagging more difficult and error-prone. Under these circumstances, instead of forcing the POS tagger to predict and commit to a single tag, it should be enabled to express its uncertainty. In this paper, we consider POS tagging within the framework of set-valued prediction, which allows the POS tagger to express its uncertainty via predicting a set of candidate POS tags instead of guessing a single one. The goal is to guarantee a high confidence that the correct POS tag is included while keeping the number of candidates small. In our experimental study, we find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings, especially for unknown words, i.e., words not occurring in the training data.
We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive enhancement performance compared to IID training on a single device and that we can further facilitate the convergence speed and the overall performance using transfer learning on the server-side. Moreover, we show that we can effectively combine updates from clients trained locally with supervised and unsupervised losses. We also release a new dataset LibriFSD50K and its creation recipe in order to facilitate FL research for source separation problems.
Emotion recognition has become an important field of research in the human-computer interactions domain. The latest advancements in the field show that combining visual with audio information lead to better results if compared to the case of using a single source of information separately. From a visual point of view, a human emotion can be recognized by analyzing the facial expression of the person. More precisely, the human emotion can be described through a combination of several Facial Action Units. In this paper, we propose a system that is able to recognize emotions with a high accuracy rate and in real time, based on deep Convolutional Neural Networks. In order to increase the accuracy of the recognition system, we analyze also the speech data and fuse the information coming from both sources, i.e., visual and audio. Experimental results show the effectiveness of the proposed scheme for emotion recognition and the importance of combining visual with audio data.
We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and also is more robust than other methods to sensor noise in any of the individual modalities. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learn to emphasize the more reliable cues and suppress others on a per-sample basis. By introducing a check step which uses Canonical Correlational Analysis to differentiate between ineffective and effective modalities, M3ER is robust to sensor noise. M3ER also generates proxy features in place of the ineffectual modalities. We demonstrate the efficiency of our network through experimentation on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.