Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

FinnSentiment -- A Finnish Social Media Corpus for Sentiment Polarity Annotation

Dec 04, 2020
Krister Lindén, Tommi Jauhiainen, Sam Hardwick

Sentiment analysis and opinion mining is an important task with obvious application areas in social media, e.g. when indicating hate speech and fake news. In our survey of previous work, we note that there is no large-scale social media data set with sentiment polarity annotations for Finnish. This publications aims to remedy this shortcoming by introducing a 27,000 sentence data set annotated independently with sentiment polarity by three native annotators. We had the same three annotators for the whole data set, which provides a unique opportunity for further studies of annotator behaviour over time. We analyse their inter-annotator agreement and provide two baselines to validate the usefulness of the data set.

  Access Paper or Ask Questions

Human Emotion Detection from Audio and Video Signals

Jun 21, 2020
Sai Nikhil Chennoor, B. R. K. Madhur, Moujiz Ali, T. Kishore Kumar

The primary objective is to teach a machine about human emotions, which has become an essential requirement in the field of social intelligence, also expedites the progress of human-machine interactions. The ability of a machine to understand human emotion and act accordingly has been a choice of great interest in today's world. The future generations of computers thus must be able to interact with a human being just like another. For example, people who have Autism often find it difficult to talk to someone about their state of mind. This model explicitly targets the userbase who are troubled and fail to express it. Also, this model's speech processing techniques provide an estimate of the emotion in the case of poor video quality and vice-versa.

  Access Paper or Ask Questions

Glottal Source Estimation using an Automatic Chirp Decomposition

May 16, 2020
Thomas Drugman, Baris Bozkurt, Thierry Dutoit

In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glottal Closure Instants (GCIs) was shown to be essential. This paper extends the formalism of ZZT by evaluating the Z-transform on a contour possibly different from the unit circle. A method is proposed for determining automatically this contour by inspecting the root distribution. The derived Zeros of the Chirp Z-Transform (ZCZT)-based technique turns out to be much more robust to GCI location errors.

  Access Paper or Ask Questions

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

Apr 20, 2020
Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius

Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization parameters and evaluate their choices on a wide range of neural network models for different application domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.

* 20 pages, 7 figures 

  Access Paper or Ask Questions

Improved Robust ASR for Social Robots in Public Spaces

Jan 14, 2020
Charles Jankowski, Vishwas Mruthyunjaya, Ruixi Lin

Social robots deployed in public spaces present a challenging task for ASR because of a variety of factors, including noise SNR of 20 to 5 dB. Existing ASR models perform well for higher SNRs in this range, but degrade considerably with more noise. This work explores methods for providing improved ASR performance in such conditions. We use the AiShell-1 Chinese speech corpus and the Kaldi ASR toolkit for evaluations. We were able to exceed state-of-the-art ASR performance with SNR lower than 20 dB, demonstrating the feasibility of achieving relatively high performing ASR with open-source toolkits and hundreds of hours of training data, which is commonly available.

  Access Paper or Ask Questions

Recurrent Neural Networks (RNNs): A gentle Introduction and Overview

Nov 23, 2019
Robin M. Schmidt

State-of-the-art solutions in the areas of "Language Modelling & Generating Text", "Speech Recognition", "Generating Image Descriptions" or "Video Tagging" have been using Recurrent Neural Networks as the foundation for their approaches. Understanding the underlying concepts is therefore of tremendous importance if we want to keep up with recent or upcoming publications in those areas. In this work we give a short overview over some of the most important concepts in the realm of Recurrent Neural Networks which enables readers to easily understand the fundamentals such as but not limited to "Backpropagation through Time" or "Long Short-Term Memory Units" as well as some of the more recent advances like the "Attention Mechanism" or "Pointer Networks". We also give recommendations for further reading regarding more complex topics where it is necessary.

  Access Paper or Ask Questions

Persian Signature Verification using Fully Convolutional Networks

Sep 20, 2019
Mohammad Rezaei, Nader Naderi

Fully convolutional networks (FCNs) have been recently used for feature extraction and classification in image and speech recognition, where their inputs have been raw signal or other complicated features. Persian signature verification is done using conventional convolutional neural networks (CNNs). In this paper, we propose to use FCN for learning a robust feature extraction from the raw signature images. FCN can be considered as a variant of CNN where its fully connected layers are replaced with a global pooling layer. In the proposed manner, FCN inputs are raw signature images and convolution filter size is fixed. Recognition accuracy on UTSig database, shows that FCN with a global average pooling outperforms CNN.

* 2nd National Conference on New Researches in Electrical and Computer Engineering, 2017 

  Access Paper or Ask Questions

VCWE: Visual Character-Enhanced Word Embeddings

Mar 25, 2019
Chi Sun, Xipeng Qiu, Xuanjing Huang

Chinese is a logographic writing system, and the shape of Chinese characters contain rich syntactic and semantic information. In this paper, we propose a model to learn Chinese word embeddings via three-level composition: (1) a convolutional neural network to extract the intra-character compositionality from the visual shape of a character; (2) a recurrent neural network with self-attention to compose character representation into word embeddings; (3) the Skip-Gram framework to capture non-compositionality directly from the contextual information. Evaluations demonstrate the superior performance of our model on four tasks: word similarity, sentiment analysis, named entity recognition and part-of-speech tagging.

* Accepted to NAACL 2019 

  Access Paper or Ask Questions

Speaker Diarization With Lexical Information

Nov 28, 2018
Tae Jin Park, Kyu Han, Ian Lane, Panayiotis Georgiou

This work presents a novel approach to leverage lexical information for speaker diarization. We introduce a speaker diarization system that can directly integrate lexical as well as acoustic information into a speaker clustering process. Thus, we propose an adjacency matrix integration technique to integrate word level speaker turn probabilities with speaker embeddings in a comprehensive way. Our proposed method works without any reference transcript. Words, and word boundary information are provided by an ASR system. We show that our proposed method improves a baseline speaker diarization system solely based on speaker embeddings, achieving a meaningful improvement on the CALLHOME American English Speech dataset.

* 5 pages, 6 figures 

  Access Paper or Ask Questions

Mind Your Language: Abuse and Offense Detection for Code-Switched Languages

Sep 23, 2018
Raghav Kapoor, Yaman Kumar, Kshitij Rajput, Rajiv Ratn Shah, Ponnurangam Kumaraguru, Roger Zimmermann

In multilingual societies like the Indian subcontinent, use of code-switched languages is much popular and convenient for the users. In this paper, we study offense and abuse detection in the code-switched pair of Hindi and English (i.e. Hinglish), the pair that is the most spoken. The task is made difficult due to non-fixed grammar, vocabulary, semantics and spellings of Hinglish language. We apply transfer learning and make a LSTM based model for hate speech classification. This model surpasses the performance shown by the current best models to establish itself as the state-of-the-art in the unexplored domain of Hinglish offensive text classification.We also release our model and the embeddings trained for research purposes

  Access Paper or Ask Questions