
"speech": models, code, and papers

Applications of Recurrent Neural Network for Biometric Authentication & Anomaly Detection

Sep 13, 2021
Joseph M. Ackerson, Dave Rushit, Seliya Jim

Recurrent Neural Networks (RNNs) are powerful machine learning frameworks that allow data to be saved and referenced in a temporal sequence. This opens many new possibilities in fields such as handwriting analysis and speech recognition. This paper explores current research on RNNs in four important areas: biometric authentication, expression recognition, anomaly detection, and applications to aircraft. It reviews the methodology, purpose, results, and the benefits and drawbacks of each proposed method. These methodologies all focus on how to leverage distinct RNN architectures, such as the popular Long Short-Term Memory (LSTM) RNN or a Deep-Residual RNN. The paper also examines which frameworks work best in certain situations, and the advantages and disadvantages of each proposed model.
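The temporal state carry-over that makes RNNs suited to sequential data like speech can be sketched minimally. This is a generic Elman-style recurrent update with made-up scalar weights, not a model from the paper:

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    """One Elman-style recurrent update: the new hidden state mixes
    the previous state (memory) with the current input."""
    return math.tanh(w_h * h + w_x * x)

def run_sequence(xs):
    """Carry the hidden state across the whole sequence, so earlier
    inputs influence later states -- the property that makes RNNs
    applicable to temporal signals such as speech or keystroke timing."""
    h = 0.0
    states = []
    for x in xs:
        h = rnn_step(h, x)
        states.append(h)
    return states

# Even with zero inputs after the first step, the state decays
# gradually rather than resetting: the network "remembers" the input.
states = run_sequence([1.0, 0.0, 0.0])
```

LSTM and Deep-Residual variants replace this single update with gated or residual versions, but the sequential state reuse is the same.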

A Survey on Audio Synthesis and Audio-Visual Multimodal Processing

Aug 01, 2021
Zhaofeng Shi

With the development of deep learning and artificial intelligence, audio synthesis plays a pivotal role in machine learning and shows strong applicability in industry. Meanwhile, significant efforts have been dedicated by researchers to handling multimodal tasks such as audio-visual multimodal processing. In this paper, we conduct a survey on audio synthesis and audio-visual multimodal processing, which helps in understanding current research and future trends. This review focuses on text-to-speech (TTS), music generation, and tasks that combine visual and acoustic information. The corresponding technical methods are comprehensively classified and introduced, and their prospective development trends are discussed. This survey can provide guidance for researchers interested in areas such as audio synthesis and audio-visual multimodal processing.

Encoder-Decoder Neural Architecture Optimization for Keyword Spotting

Jun 04, 2021
Tong Mo, Bang Liu

Keyword spotting aims to identify specific keyword audio utterances. In recent years, deep convolutional neural networks have been widely utilized in keyword spotting systems. However, their model architectures are mainly based on off-the-shelf backbones such as VGG-Net or ResNet, rather than being specifically designed for the task. In this paper, we utilize neural architecture search to design convolutional neural network models that can boost the performance of keyword spotting while maintaining an acceptable memory footprint. Specifically, we search the model operators and their connections in a specific search space with Encoder-Decoder neural architecture optimization. Extensive evaluations on Google's Speech Commands Dataset show that the model architecture searched by our approach achieves a state-of-the-art accuracy of over 97%.
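The idea of searching over model operators can be illustrated with a toy random search. The candidate operator names, the scoring function, and the search strategy below are all illustrative stand-ins; the paper's actual search space and Encoder-Decoder optimizer are more elaborate:

```python
import random

# Hypothetical candidate operators for each layer of the network.
OPS = ["conv3x3", "conv5x5", "dw_conv3x3", "avg_pool", "skip"]

def sample_architecture(n_layers, rng):
    """An architecture here is just a choice of operator per layer."""
    return [rng.choice(OPS) for _ in range(n_layers)]

def proxy_score(arch):
    """Stand-in for validation accuracy on Speech Commands; a real
    search would train or estimate each candidate's performance."""
    prefs = {"conv3x3": 3, "conv5x5": 2, "dw_conv3x3": 4,
             "avg_pool": 1, "skip": 2}
    return sum(prefs[op] for op in arch)

def random_search(n_layers=4, trials=50, seed=0):
    """Keep the best-scoring architecture seen over random samples."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        arch = sample_architecture(n_layers, rng)
        s = proxy_score(arch)
        if s > best_score:
            best, best_score = arch, s
    return best, best_score

best, score = random_search()
```

A memory-footprint constraint would enter as a penalty or hard filter inside `proxy_score`.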

* Accepted for Interspeech 2021 

Towards Consistent Hybrid HMM Acoustic Modeling

Apr 28, 2021
Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney

High-performance hybrid automatic speech recognition (ASR) systems are often trained with clustered triphone outputs, and thus require a complex training pipeline to generate the clustering. The same complex pipeline is often utilized in order to generate an alignment for use in frame-wise cross-entropy training. In this work, we propose a flat-start factored hybrid model trained by modeling the full set of triphone states explicitly without relying on clustering methods. This greatly simplifies the training of new models. Furthermore, we study the effect of different alignments used for Viterbi training. Our proposed models achieve competitive performance on the Switchboard task compared to systems using clustered triphones and other flat-start models in the literature.
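The scale of "the full set of triphone states" that a flat-start factored model must handle can be seen by enumerating contexts over a toy phone inventory. The inventory and the bookkeeping below are illustrative, not the authors' implementation:

```python
from itertools import product

# Toy phone inventory; real ASR systems use roughly 40 phones
# plus silence and noise units.
phones = ["a", "b", "c", "d"]

# A factored hybrid model scores (left, center, right) triphone
# contexts explicitly, instead of first clustering them into tied
# states via a separately trained pipeline.
triphones = list(product(phones, repeat=3))

# The set grows cubically: 4 phones -> 64 contexts,
# 40 phones -> 64,000 contexts.
```

Factoring the output over left context, center phone, and right context is what keeps this cubic space tractable without clustering.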

* Submitted to Interspeech 2021 

Sequence-based Machine Learning Models in Jet Physics

Feb 09, 2021
Rafael Teixeira de Lima

Sequence-based modeling broadly refers to algorithms that act on data represented as an ordered set of input elements. In particular, Machine Learning algorithms with sequences as inputs have seen successful applications to important problems such as Natural Language Processing (NLP) and speech signal modeling. The usage of this class of models in collider physics leverages their ability to act on data with variable sequence lengths, such as constituents inside a jet. In this document, we explore the application of Recurrent Neural Networks (RNNs) and other sequence-based neural network architectures to classify jets, regress jet-related quantities, and build a physics-inspired jet representation, in connection to jet clustering algorithms. In addition, alternatives to sequential data representations are briefly discussed.
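Handling variable-length constituent lists typically means ordering and padding them into a fixed-shape batch before they reach a sequence model. A minimal sketch, where the ordering choice (descending, e.g. by pT) and the pad value are illustrative conventions rather than anything prescribed by the document:

```python
def pad_sequences(jets, pad_value=0.0):
    """Pad variable-length constituent lists (here, single feature
    values per constituent, sorted descending) to a common length so
    jets with different constituent counts can share one batch."""
    max_len = max(len(j) for j in jets)
    return [sorted(j, reverse=True) + [pad_value] * (max_len - len(j))
            for j in jets]

# Two jets with different numbers of constituents.
jets = [[30.0, 120.0, 45.0], [80.0, 10.0]]
batch = pad_sequences(jets)
```

A real pipeline would carry a mask alongside the batch so the model can ignore the padded positions.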

* To appear in Artificial Intelligence for Particle Physics, World Scientific Publishing 

ReINTEL: A Multimodal Data Challenge for Responsible Information Identification on Social Network Sites

Dec 16, 2020
Duc-Trong Le, Xuan-Son Vu, Nhu-Dung To, Huu-Quang Nguyen, Thuy-Trinh Nguyen, Linh Le, Anh-Tuan Nguyen, Minh-Duc Hoang, Nghia Le, Huyen Nguyen, Hoang D. Nguyen

This paper reports on the ReINTEL Shared Task for Responsible Information Identification on social network sites, hosted at the seventh annual workshop on Vietnamese Language and Speech Processing (VLSP 2020). Given a piece of news with its textual content, visual content, and metadata, participants are required to classify whether the news is 'reliable' or 'unreliable'. In order to generate a fair benchmark, we introduce a novel human-annotated dataset of over 10,000 news items collected from a social network in Vietnam. All models are evaluated in terms of AUC-ROC score, a typical evaluation metric for classification. The competition was run on the Codalab platform. Within two months, the challenge attracted over 60 participants and recorded nearly 1,000 submission entries.
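AUC-ROC, the metric the challenge uses, can be computed directly from its rank interpretation: the probability that a randomly chosen positive item is scored above a randomly chosen negative one, with ties counting half. A stdlib sketch with made-up labels and scores:

```python
def auc_roc(labels, scores):
    """Pairwise AUC-ROC: fraction of (positive, negative) pairs that
    the scores rank correctly, ties counted as 0.5. O(P*N), fine for
    small evaluation sets; ranking libraries use a sort-based form."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking gives 1.0; a constant score gives 0.5.
auc = auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```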

Augmenting BERT Carefully with Underrepresented Linguistic Features

Nov 12, 2020
Aparna Balagopalan, Jekaterina Novikova

Fine-tuned Bidirectional Encoder Representations from Transformers (BERT)-based sequence classification models have proven effective for detecting Alzheimer's Disease (AD) from transcripts of human speech. However, previous research shows it is possible to improve BERT's performance on various tasks by augmenting the model with additional information. In this work, we use probing tasks as introspection techniques to identify linguistic information that is not well represented in various layers of BERT but is important for the AD detection task. We externally supplement the model with hand-crafted features covering the linguistic information for which BERT's representations are found to be insufficient, and show that jointly fine-tuning BERT in combination with these features improves AD classification performance by up to 5% over fine-tuned BERT alone.
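The augmentation step amounts to joining the model's pooled representation with external features before the classification layer. The vector, dimensions, and feature names below are all hypothetical stand-ins, not values or features from the paper:

```python
def augment_with_features(pooled_embedding, linguistic_features):
    """Concatenate a pooled transformer representation with external
    hand-crafted linguistic features; the joint vector would then feed
    a classification head that is fine-tuned end to end."""
    return list(pooled_embedding) + list(linguistic_features)

pooled = [0.12, -0.53, 0.88]            # stand-in for a 768-d BERT vector
features = {"type_token_ratio": 0.41,   # example lexical-richness feature
            "pos_noun_rate": 0.27}      # example part-of-speech feature
joint = augment_with_features(pooled, features.values())
```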

* Machine Learning for Health (ML4H) at NeurIPS 2020 - Extended Abstract 

  Access Paper or Ask Questions

What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models

Apr 14, 2020
Wietse de Vries, Andreas van Cranenburgh, Malvina Nissim

Experiments with transfer learning on pre-trained language models such as BERT have shown that the layers of these models resemble the classical NLP pipeline, with progressively more complex tasks being concentrated in later layers of the network. We investigate to what extent these results also hold for a language other than English. For this we probe a Dutch BERT-based model and the multilingual BERT model on Dutch NLP tasks. In addition, by considering the task of part-of-speech tagging in more detail, we show that, even within a given task, information is spread over different parts of the network and the pipeline might not be as neat as it seems. Each layer has different specialisations, and it is therefore useful to combine information from different layers for best results, instead of selecting a single layer based on the best overall performance.

The Canonical Distortion Measure for Vector Quantization and Function Approximation

Nov 14, 2019
Jonathan Baxter

To measure the quality of a set of vector quantization points, a means of measuring the distance between a random point and its quantization is required. Common metrics such as the Hamming and Euclidean metrics, while mathematically simple, are inappropriate for comparing natural signals such as speech or images. In this paper it is shown how an environment of functions on an input space X induces a canonical distortion measure (CDM) on X. The designation 'canonical' is justified because it is shown that optimizing the reconstruction error of X with respect to the CDM gives rise to optimal piecewise constant approximations of the functions in the environment. The CDM is calculated in closed form for several different function classes. An algorithm for training neural networks to implement the CDM is presented along with some encouraging experimental results.
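The core idea — that the functions of interest, not raw geometry, should define distance — can be sketched with a squared-difference form averaged uniformly over a toy environment. Both the form and the weighting here are illustrative assumptions, not Baxter's exact construction:

```python
def canonical_distortion(x, x_prime, environment):
    """Distortion between two inputs induced by an environment of
    functions: inputs are 'close' exactly when every function of
    interest maps them to similar values (squared-difference form,
    uniformly weighted here for illustration)."""
    return sum((f(x) - f(x_prime)) ** 2 for f in environment) / len(environment)

# Toy environment of even functions: under them, -2 and 2 are
# indistinguishable even though their Euclidean distance is 4.
env = [lambda x: x * x, lambda x: abs(x)]
d_same = canonical_distortion(-2.0, 2.0, env)
d_diff = canonical_distortion(0.0, 2.0, env)
```

This is why Euclidean distance can be the wrong metric for signals: a quantizer built on this distortion may merge points that any Euclidean quantizer would keep apart.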

* In: Thrun S., Pratt L. (eds) Learning to Learn (1998). Pages 159-177 
