"speech": models, code, and papers

End-to-End Multimodal Emotion Recognition using Deep Neural Networks

Apr 27, 2017
Panagiotis Tzirakis, George Trigeorgis, Mihalis A. Nicolaou, Björn Schuller, Stefanos Zafeiriou

Automatic affect recognition is a challenging task due to the various modalities emotions can be expressed with. Applications can be found in many domains including multimedia retrieval and human computer interaction. In recent years, deep neural networks have been used with great success in determining emotional states. Inspired by this success, we propose an emotion recognition system using auditory and visual modalities. To capture the emotional content for various styles of speaking, robust features need to be extracted. To this end, we utilize a Convolutional Neural Network (CNN) to extract features from the speech, while for the visual modality we use a deep residual network (ResNet) of 50 layers. In addition to the importance of feature extraction, a machine learning algorithm also needs to be insensitive to outliers while being able to model the context. To tackle this problem, Long Short-Term Memory (LSTM) networks are utilized. The system is then trained in an end-to-end fashion where, by also taking advantage of the correlations of each of the streams, we manage to significantly outperform the traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.


Deep Multimodal Representation Learning from Temporal Data

Apr 11, 2017
Xitong Yang, Palghat Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, Jiebo Luo

In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.

* To appear in CVPR 2017 
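The CorrRNN abstract above mentions a maximum correlation loss term that encourages cross-modal information in the joint representation. As a rough illustration only, the sketch below computes a negative per-dimension Pearson correlation between two batches of modality representations. The function name `correlation_loss`, the `(batch, dim)` shape convention, and the choice of summing over dimensions are assumptions made here; the abstract does not give the authors' exact formulation.

```python
import numpy as np

def correlation_loss(h1, h2, eps=1e-8):
    """Negative summed per-dimension Pearson correlation.

    h1, h2: arrays of shape (batch, dim), one per modality (assumed layout).
    Minimizing the returned value maximizes the correlation between modalities.
    """
    # Center each dimension over the batch.
    h1c = h1 - h1.mean(axis=0)
    h2c = h2 - h2.mean(axis=0)
    # Per-dimension covariance and the product of standard deviations
    # (the batch-size factors cancel in the ratio).
    cov = (h1c * h2c).sum(axis=0)
    norm = np.sqrt((h1c ** 2).sum(axis=0) * (h2c ** 2).sum(axis=0)) + eps
    return -(cov / norm).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(32, 4))
    video = audio + 0.1 * rng.normal(size=(32, 4))  # strongly correlated toy data
    print(correlation_loss(audio, video))
```

In a training setup, a term like this would typically be added to the other loss terms with a weighting coefficient, which is a design choice the abstract does not specify.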


Using Non-invertible Data Transformations to Build Adversarial-Robust Neural Networks

Dec 13, 2016
Qinglong Wang, Wenbo Guo, Alexander G. Ororbia II, Xinyu Xing, Lin Lin, C. Lee Giles, Xue Liu, Peng Liu, Gang Xiong

Deep neural networks have proven to be quite effective in a wide variety of machine learning tasks, ranging from improved speech recognition systems to advancing the development of autonomous vehicles. However, despite their superior performance in many applications, these models have been recently shown to be susceptible to a particular type of attack possible through the generation of particular synthetic examples referred to as adversarial samples. These samples are constructed by manipulating real examples from the training data distribution in order to "fool" the original neural model, resulting in misclassification (with high confidence) of previously correctly classified samples. Addressing this weakness is of utmost importance if deep neural architectures are to be applied to critical applications, such as those in the domain of cybersecurity. In this paper, we present an analysis of this fundamental flaw lurking in all neural architectures to uncover limitations of previously proposed defense mechanisms. More importantly, we present a unifying framework for protecting deep neural models using a non-invertible data transformation, developing two adversary-resilient architectures utilizing both linear and nonlinear dimensionality reduction. Empirical results indicate that our framework provides better robustness compared to state-of-the-art solutions while having negligible degradation in accuracy.
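To make the idea of a non-invertible transformation via linear dimensionality reduction concrete, here is a minimal sketch that projects inputs onto their top-k principal directions before they would reach a classifier. Using PCA (computed with an SVD) is an assumption chosen for illustration; the abstract says only "linear and nonlinear dimensionality reduction", and the helper names `fit_projection` and `project` are hypothetical.

```python
import numpy as np

def fit_projection(X, k):
    """Fit a rank-k linear projection (PCA via SVD) on data X of shape (n, d)."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def project(X, mean, components):
    """Map (n, d) inputs to (n, k). Many inputs share one output when k < d,
    so the mapping cannot be inverted exactly."""
    return (X - mean) @ components.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 6))
    mean, comps = fit_projection(X, k=2)
    Z = project(X, mean, comps)
    print(Z.shape)
```

Because any perturbation lying entirely in the discarded directions leaves the projected input unchanged, the original input cannot be recovered exactly from the projection.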


Spectral decomposition method of dialog state tracking via collective matrix factorization

Jun 16, 2016
Julien Perez

The task of dialog management is commonly decomposed into two sequential subtasks: dialog state tracking and dialog policy learning. In an end-to-end dialog system, the aim of dialog state tracking is to accurately estimate the true dialog state from noisy observations produced by the speech recognition and the natural language understanding modules. The state tracking task is primarily meant to support a dialog policy. From a probabilistic perspective, this is achieved by maintaining a posterior distribution over hidden dialog states composed of a set of context dependent variables. Once a dialog policy is learned, it strives to select an optimal dialog act given the estimated dialog state and a defined reward function. This paper introduces a novel method of dialog state tracking based on a bilinear algebraic decomposition model that provides an efficient inference schema through collective matrix factorization. We evaluate the proposed approach on the second Dialog State Tracking Challenge (DSTC-2) dataset and we show that the proposed tracker gives encouraging results compared to the state-of-the-art trackers that participated in this standard benchmark. Finally, we show that the prediction schema is computationally efficient in comparison to the previous approaches.

* Dialogue & Discourse 7(3) (2016) 
* 13 pages, 3 figures, 1 Table. arXiv admin note: substantial text overlap with arXiv:1606.04052 


AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery

Oct 10, 2015
Izhar Wallach, Michael Dzamba, Abraham Heifets

Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.


Boosting Named Entity Recognition with Neural Character Embeddings

May 25, 2015
Cicero Nogueira dos Santos, Victor Guimarães

Most state-of-the-art named entity recognition (NER) systems rely on handcrafted features and on the output of other NLP tasks such as part-of-speech (POS) tagging and text chunking. In this work we propose a language-independent NER system that uses automatically learned features only. Our approach is based on the CharWNN deep neural network, which uses word-level and character-level representations (embeddings) to perform sequential classification. We perform an extensive number of experiments using two annotated corpora in two different languages: HAREM I corpus, which contains texts in Portuguese; and the SPA CoNLL-2002 corpus, which contains texts in Spanish. Our experimental results shed light on the contribution of neural character embeddings for NER. Moreover, we demonstrate that the same neural network which has been successfully applied to POS tagging can also achieve state-of-the-art results for language-independent NER, using the same hyperparameters, and without any handcrafted features. For the HAREM I corpus, CharWNN outperforms the state-of-the-art system by 7.9 points in the F1-score for the total scenario (ten NE classes), and by 7.2 points in the F1 for the selective scenario (five NE classes).

* 9 pages 


Maximum Likelihood Directed Enumeration Method in Piecewise-Regular Object Recognition

Nov 20, 2014
Andrey Savchenko

We explore the problems of classification of composite objects (images, speech signals) with a low number of models per class. We study the question of improving recognition performance for medium-sized databases (thousands of classes). The key issue of fast approximate nearest-neighbor methods widely applied in this task is their heuristic nature. Their efficiency can be rigorously proven using the theory of algorithms only for simple similarity measures and artificially generated tasks. In contrast, in this paper we propose an alternative, statistically optimal greedy algorithm. At each step of this algorithm, the joint density (likelihood) of distances to previously checked models is estimated for each class. The next model to check is selected from the class with the maximal likelihood. The latter is estimated based on the asymptotic properties of the Kullback-Leibler information discrimination and a mathematical model of a piecewise-regular object in which the distribution of each regular segment is of exponential type. Experimental results in face recognition for the FERET dataset show that the proposed method is much more effective than not only brute force and the baseline (directed enumeration method) but also approximate nearest neighbor methods from the FLANN and NonMetricSpaceLib libraries (randomized kd-tree, composite index, perm-sort).

* 13 pages, 6 figures, 20 references 


A Framework for On-Line Devanagari Handwritten Character Recognition

Oct 25, 2014
Sunil Kumar Kopparapu, Lajish V. L

The main challenge in on-line handwritten character recognition in Indian languages is the large size of the character set, the high similarity between different characters in the script, and the huge variation in writing style. In this paper we propose a framework for on-line handwritten script recognition taking cues from speech signal processing literature. The framework is based on identifying strokes, which in turn lead to recognition of handwritten on-line characters rather than the conventional character identification. Though the framework is described for Devanagari script, the framework is general and can be applied to any language. The proposed platform consists of pre-processing, feature extraction, recognition and post-processing like the conventional character recognition but applied to strokes. The on-line Devanagari character recognition reduces to one of recognizing one of 69 primitives and recognition of a character is performed by recognizing a sequence of such primitives. We further show the impact of noise removal on on-line raw data which is usually noisy. The use of Fuzzy Directional Features to enhance the accuracy of stroke recognition is also described. The recognition results are compared with commonly used directional features in literature using several classifiers.

* 29 pages 


Three studies of grammar-based surface-syntactic parsing of unrestricted English text. A summary and orientation

Jun 27, 1994
Atro Voutilainen

The dissertation addresses the design of parsing grammars for automatic surface-syntactic analysis of unconstrained English text. It consists of a summary and three articles. "Morphological disambiguation" documents a grammar for morphological (or part-of-speech) disambiguation of English, done within the Constraint Grammar framework proposed by Fred Karlsson. The disambiguator seeks to discard those of the alternative morphological analyses proposed by the lexical analyser that are contextually illegitimate. The 1,100 constraints express some 23 general, essentially syntactic statements as restrictions on the linear order of morphological tags. The error rate of the morphological disambiguator is about ten times smaller than that of another state-of-the-art probabilistic disambiguator, given that both are allowed to leave some of the hardest ambiguities unresolved. This accuracy suggests the viability of the grammar-based approach to natural language parsing, thus also contributing to the more general debate concerning the viability of probabilistic vs. linguistic techniques. "Experiments with heuristics" addresses the question of how to resolve those ambiguities that survive the morphological disambiguator. Two approaches are presented and empirically evaluated: (i) heuristic disambiguation constraints and (ii) techniques for learning from the fully disambiguated part of the corpus and then applying this information to resolving remaining ambiguities.

* PhD dissertation. 36pp, gzipped and uuencoded .ps file 


Vector Representations of Idioms in Conversational Systems

May 07, 2022
Tosin Adewumi, Foteini Liwicki, Marcus Liwicki

We demonstrate, in this study, that an open-domain conversational system trained on idioms or figurative language generates more fitting responses to prompts containing idioms. Idioms are part of everyday speech in many languages, across many cultures, but they pose a great challenge for many Natural Language Processing (NLP) systems that involve tasks such as Information Retrieval (IR) and Machine Translation (MT), besides conversational AI. We utilize the Potential Idiomatic Expression (PIE)-English idioms corpus for the two tasks that we investigate: classification and conversation generation. We achieve a state-of-the-art (SoTA) result of 98% macro F1 score on the classification task by using the SoTA T5 model. We experiment with three instances of the SoTA dialogue model, Dialogue Generative Pre-trained Transformer (DialoGPT), for conversation generation. Their performances are evaluated using the automatic metric perplexity and human evaluation. The results show that the model trained on the idiom corpus generates more fitting responses to prompts containing idioms 71.9% of the time, compared to a similar model not trained on the idioms corpus. We contribute the model checkpoint/demo and code on the HuggingFace hub for public access.

* 7 pages, 1 figure, 8 tables 
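For readers unfamiliar with the perplexity metric mentioned in the abstract above, the following minimal sketch shows the standard definition: the exponential of the mean negative log-probability that a model assigns to the observed tokens. The function name and the toy probabilities are illustrative assumptions; this is not the authors' evaluation code.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the model's probability for each
    observed token: exp of the mean negative log-probability."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

if __name__ == "__main__":
    # A model that is confident about each token scores lower (better)
    # than one that spreads its probability mass thinly.
    print(perplexity([0.9, 0.8, 0.95]))
    print(perplexity([0.2, 0.1, 0.25]))
```

Lower values indicate that the model found the reference text less surprising. A model that assigns every token probability 1/N has perplexity N.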
