
"speech": models, code, and papers

A Case Study of Deep Learning Based Multi-Modal Methods for Predicting the Age-Suitability Rating of Movie Trailers

Jan 26, 2021
Mahsa Shafaei, Christos Smailis, Ioannis A. Kakadiaris, Thamar Solorio

In this work, we explore different approaches to combining modalities for the problem of automated age-suitability rating of movie trailers. First, we introduce a new dataset of English-language movie trailers downloaded from IMDB and YouTube, along with their corresponding age-suitability rating labels. Second, we propose a multi-modal deep learning pipeline for the movie trailer age-suitability rating problem. This is the first attempt to combine video, audio, and speech information for this problem, and our experimental results show that multi-modal approaches significantly outperform the best monomodal and bimodal models on this task.
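A minimal sketch of the late-fusion idea described above, assuming pooled per-modality embeddings; the dimensions, random weights, and four-class head are invented for illustration and are not the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings for one trailer
# (dimensions are illustrative, not the paper's).
video_emb = rng.normal(size=256)   # e.g. pooled CNN frame features
audio_emb = rng.normal(size=128)   # e.g. pooled spectrogram features
text_emb = rng.normal(size=300)    # e.g. averaged word vectors of the transcript

# Late fusion: concatenate the modality embeddings into one feature
# vector and feed it to a shared classifier head.
fused = np.concatenate([video_emb, audio_emb, text_emb])

# Toy linear head over 4 age-rating classes (weights random for the sketch).
W = rng.normal(size=(4, fused.shape[0]))
logits = W @ fused
pred = int(np.argmax(logits))
print(fused.shape, pred)
```

In practice the head would be trained jointly, but the fused vector above is the essential interface between the three modality encoders and the classifier.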


Examining Racial Bias in an Online Abuse Corpus with Structural Topic Modeling

May 26, 2020
Thomas Davidson, Debasmita Bhattacharya

We use structural topic modeling to examine racial bias in data collected to train models to detect hate speech and abusive language in social media posts. We augment the abusive language dataset by adding an additional feature indicating the predicted probability of the tweet being written in African-American English. We then use structural topic modeling to examine the content of the tweets and how the prevalence of different topics is related to both abusiveness annotation and dialect prediction. We find that certain topics are disproportionately racialized and considered abusive. We discuss how topic modeling may be a useful approach for identifying bias in annotated data.

* Please cite the published version; see the proceedings of ICWSM 2020


Efficient Deep Learning of GMMs

Feb 15, 2019
Shirin Jalali, Carl Nuzman, Iraj Saniee

We show that a collection of Gaussian mixture models (GMMs) in $R^{n}$ can be optimally classified using $O(n)$ neurons in a neural network with two hidden layers (deep neural network), whereas in contrast, a neural network with a single hidden layer (shallow neural network) would require at least $O(\exp(n))$ neurons or possibly exponentially large coefficients. Given the universality of the Gaussian distribution in the feature spaces of data, e.g., in speech, image and text, our result sheds light on the observed efficiency of deep neural networks in practical classification problems.
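The scaling gap can be made concrete with a toy neuron count; the width factor `c` and the use of `exp(n)` as the shallow-network requirement are illustrative stand-ins for the paper's constants:

```python
import math

def deep_neurons(n, c=4):
    # Two hidden layers of width c*n each: O(n) neurons total,
    # as in the depth-2 construction described above.
    return 2 * c * n

def shallow_neurons(n):
    # The contrasting single-hidden-layer requirement grows
    # exponentially in the feature dimension n.
    return math.ceil(math.exp(n))

for n in (4, 8, 16):
    print(n, deep_neurons(n), shallow_neurons(n))
```

Even at modest dimensions the exponential term dominates, which is the intuition behind the observed efficiency of deeper networks here.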


Subword and Crossword Units for CTC Acoustic Models

Jun 18, 2018
Thomas Zenkel, Ramon Sanabria, Florian Metze, Alex Waibel

This paper proposes a novel approach to creating a unit set for CTC-based speech recognition systems. Using Byte Pair Encoding, we learn a unit set of arbitrary size on a given training text. In contrast to using characters or words as units, this allows us to find a good trade-off between the size of our unit set and the available training data. We evaluate both Crossword units, which may span multiple words, and Subword units. By combining this approach with decoding methods that use a separate language model, we achieve state-of-the-art results for grapheme-based CTC systems.
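A compact sketch of the BPE merge-learning step on a word-frequency vocabulary (illustrative code in the spirit of Sennrich et al.'s algorithm, not the authors' implementation):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn a BPE merge list from a {word: count} training vocabulary.

    Words are represented as tuples of symbols; each merge joins the
    most frequent adjacent symbol pair across the vocabulary.
    """
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
print(merges)
```

The number of merges directly controls the unit-set size, which is the knob the abstract's trade-off refers to.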

* Current version accepted at Interspeech 2018 


Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

Apr 04, 2018
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors.
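The audio-visual association can be sketched as a "matchmap" of dot products between image-grid features and audio-frame features; the shapes below are invented, and the pooling choice (max over space, mean over time) is one of several possibilities rather than the paper's exact objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network outputs: a spatial grid of image features and a
# sequence of audio-frame features sharing one embedding dimension d.
H, W, T, d = 8, 8, 32, 16
image_feats = rng.normal(size=(H, W, d))
audio_feats = rng.normal(size=(T, d))

# "Matchmap": similarity of every image cell with every audio frame.
matchmap = np.einsum('hwd,td->hwt', image_feats, audio_feats)

# One way to pool it into a single retrieval score: take the
# best-matching image location per audio frame, then average over frames.
score = matchmap.max(axis=(0, 1)).mean()
print(matchmap.shape, score)
```

High-scoring cells of the matchmap are exactly where the localizations described above can be read off, without any labels or alignments.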


A Supervised STDP-based Training Algorithm for Living Neural Networks

Mar 21, 2018
Yuan Zeng, Kevin Devincentis, Yao Xiao, Zubayer Ibne Ferdous, Xiaochen Guo, Zhiyuan Yan, Yevgeny Berdichevsky

Neural networks have shown great potential in many applications like speech recognition, drug discovery, image classification, and object detection. Neural network models are inspired by biological neural networks, but they are optimized to perform machine learning tasks on digital computers. The proposed work explores the possibilities of using living neural networks in vitro as basic computational elements for machine learning applications. A new supervised STDP-based learning algorithm is proposed in this work, which takes neuron engineering constraints into account. A 74.7% accuracy is achieved on the MNIST benchmark for handwritten digit recognition.
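A pair-based exponential STDP rule is commonly written as Δw = A₊ e^(−Δt/τ) when the presynaptic spike precedes the postsynaptic one and −A₋ e^(Δt/τ) otherwise; the sketch below uses illustrative constants, not the paper's engineered values or its supervised variant:

```python
import math

def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based exponential STDP weight update (illustrative constants).

    dt = t_post - t_pre in ms: potentiate when the presynaptic spike
    precedes the postsynaptic one (dt > 0), depress otherwise.
    """
    if dt > 0:
        return a_plus * math.exp(-dt / tau)
    return -a_minus * math.exp(dt / tau)

print(stdp_dw(5.0), stdp_dw(-5.0))
```

The supervised algorithm in the paper shapes when such updates occur so that the cultured network's weights move toward the desired input-output mapping.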

* 5 pages, 3 figures, Accepted by ICASSP 2018 


Comparing approaches for mitigating intergroup variability in personality recognition

Jan 31, 2018
Guozhen An, Rivka Levitan

Personality has been found to predict many life outcomes, and there has been great interest in automatic personality recognition from a speaker's utterance. Previously, we achieved accuracies between 37% and 44% for three-way classification (high, medium, or low) on each of the Big Five personality traits (Openness to Experience, Conscientiousness, Extraversion, Agreeableness, Neuroticism). We show here that we can improve performance on this task by accounting for the heterogeneity of gender and L1 in our data, which contains English speech from female and male native speakers of Chinese and Standard American English (SAE). We experiment with personalizing models by L1 and gender and with normalizing features by speaker, L1 group, and/or gender.
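Per-group feature normalization of the kind described can be sketched as a group-wise z-score; the pitch feature and group labels below are hypothetical:

```python
import numpy as np

def normalize_by_group(features, groups):
    """Z-score each feature within its group (e.g. speaker, L1, or gender).

    features: (n_samples, n_features) array-like; groups: length-n labels.
    """
    features = np.asarray(features, dtype=float)
    out = np.empty_like(features)
    for g in set(groups):
        idx = [i for i, lab in enumerate(groups) if lab == g]
        block = features[idx]
        mu = block.mean(axis=0)
        sd = block.std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant features
        out[idx] = (block - mu) / sd
    return out

X = [[100.0], [120.0], [200.0], [240.0]]   # e.g. mean pitch in Hz
g = ["male", "male", "female", "female"]   # hypothetical group labels
print(normalize_by_group(X, g))
```

Normalizing within each group removes group-level offsets (such as pitch differences between genders) so the classifier sees within-group variation only.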


Natural Language Multitasking: Analyzing and Improving Syntactic Saliency of Hidden Representations

Jan 18, 2018
Gino Brunner, Yuyi Wang, Roger Wattenhofer, Michael Weigelt

We train multi-task autoencoders on linguistic tasks and analyze the learned hidden sentence representations. The representations change significantly when translation and part-of-speech decoders are added. The more decoders a model employs, the better it clusters sentences according to their syntactic similarity, as the representation space becomes less entangled. We explore the structure of the representation space by interpolating between sentences, which yields interesting pseudo-English sentences, many of which have recognizable syntactic structure. Lastly, we point out an interesting property of our models: The difference-vector between two sentences can be added to change a third sentence with similar features in a meaningful way.
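The difference-vector property can be sketched directly on embedding vectors; the vectors below are random stand-ins for the autoencoder's hidden representations, and the tense example is hypothetical:

```python
import numpy as np

# Toy sentence embeddings (random stand-ins for the learned hidden
# representations; the dimension is illustrative).
rng = np.random.default_rng(0)
v_past = rng.normal(size=32)     # e.g. "the dog ran home"
v_present = rng.normal(size=32)  # e.g. "the dog runs home"
v_other = rng.normal(size=32)    # e.g. "the cat slept"

# The observed property: adding the difference vector between two
# sentences shifts a third sentence's representation along the same
# feature before decoding it back to text.
shifted = v_other + (v_present - v_past)
print(shifted.shape)
```

In the paper this arithmetic is performed in the trained representation space, where the shift changes the decoded sentence in a correspondingly meaningful way.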

* The 31st Annual Conference on Neural Information Processing (NIPS) - Workshop on Learning Disentangled Features: from Perception to Control, Long Beach, CA, December 2017 


Lexical-semantic resources: yet powerful resources for automatic personality classification

Nov 27, 2017
Xuan-Son Vu, Lucie Flekova, Lili Jiang, Iryna Gurevych

In this paper, we aim to reveal the impact of lexical-semantic resources, used in particular for word sense disambiguation and sense-level semantic categorization, on the automatic personality classification task. While stylistic features (e.g., part-of-speech counts) have shown their power in this task, the impact of semantics beyond targeted word lists is relatively unexplored. We propose and extract three types of lexical-semantic features, which capture high-level concepts and emotions, overcoming the lexical gap of word n-grams. Our experimental results are comparable to state-of-the-art methods, while requiring no personality-specific resources.

* GWC 2018, the 9th Global WordNet Conference


A comprehensive study of batch construction strategies for recurrent neural networks in MXNet

May 05, 2017
Patrick Doetsch, Pavel Golik, Hermann Ney

In this work we compare different batch construction methods for mini-batch training of recurrent neural networks. While popular implementations like TensorFlow and MXNet suggest a bucketing approach to improve the parallelization of the recurrent training process, we propose a simple ordering strategy that arranges the training sequences in a stochastic, alternately sorted way. We compare our method to sequence bucketing as well as various other batch construction strategies on the CHiME-4 noisy speech recognition corpus. The experiments show that our alternated sorting approach is competitive in both training time and recognition performance while being conceptually simpler to implement.
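One plausible reading of the stochastic, alternately sorted ordering (a sketch, not the authors' code): sort sequences by length so each mini-batch needs little padding, alternate the sort direction between consecutive batches, and shuffle the batch order for stochasticity:

```python
import random

def alternated_sorted_batches(seqs, batch_size, seed=0):
    """Sketch of a stochastic, alternately sorted batch ordering.

    Sorting by length keeps sequences of similar length together
    (little padding per batch); alternating the direction and shuffling
    the batch order reintroduce randomness across epochs.
    """
    rng = random.Random(seed)
    ordered = sorted(seqs, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    for k, b in enumerate(batches):
        if k % 2 == 1:
            b.reverse()  # alternate ascending/descending order
    rng.shuffle(batches)
    return batches

seqs = [[0] * n for n in (3, 9, 1, 7, 5, 2, 8, 4)]
batches = alternated_sorted_batches(seqs, batch_size=4)
print([[len(s) for s in b] for b in batches])
```

Unlike bucketing, this needs no bucket boundaries to be chosen, which is the conceptual simplicity the abstract points to.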
