Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

A Machine Learning Application for Raising WASH Awareness in the Times of Covid-19 Pandemic

Mar 16, 2020
Rohan Pandey, Vaibhav Gautam, Kanav Bhagat, Tavpritesh Sethi

A proactive approach to raise awareness while preventing misinformation is a modern-day challenge in all domains including healthcare. Such awareness and sensitization approaches to prevention and containment are important components of a strong healthcare system, especially in the times of outbreaks such as the ongoing Covid-19 pandemic. However, there is a fine balance between continuous awareness-raising by providing new information and the risk of misinformation. In this work, we address this gap by creating a life-long learning application that delivers authentic information to users in Hindi, the most widely used local language in India. It does this by matching sources of verified and authentic information such as the WHO reports against daily news by using machine learning and natural language processing. It delivers the narrated content in Hindi by using state-of-the-art text to speech engines. Finally, the approach allows user input for continuous improvement of news feed relevance on a daily basis. We demonstrate a focused application of this approach for Water, Sanitation, Hygiene as it is critical in the containment of the currently raging Covid-19 pandemic through the WashKaro android application. Thirteen combinations of pre-processing strategies, word-embeddings, and similarity metrics were evaluated by eight human users via calculation of agreement statistics. The best performing combination achieved a Cohen's Kappa of 0.54 and was deployed in the WashKaro application back-end. Interventional studies for evaluating the effectiveness of the WashKaro application for preventing WASH-related diseases are planned to be carried out in the Mohalla clinics that provided 3.5 Million consults in 2019 in Delhi, India. Additionally, the application also features human-curated and vetted information to reach out to the community as audio-visual content in local languages.

* 7 pages, 5 figures, 3 tables 

  Access Paper or Ask Questions

Emo-CNN for Perceiving Stress from Audio Signals: A Brain Chemistry Approach

Jan 08, 2020
Anup Anand Deshmukh, Catherine Soladie, Renaud Seguier

Emotion plays a key role in many applications like healthcare, to gather patients emotional behavior. There are certain emotions which are given more importance due to their effectiveness in understanding human feelings. In this paper, we propose an approach that models human stress from audio signals. The research challenge in speech emotion detection is defining the very meaning of stress and being able to categorize it in a precise manner. Supervised Machine Learning models, including state of the art Deep Learning classification methods, rely on the availability of clean and labelled data. One of the problems in affective computation and emotion detection is the limited amount of annotated data of stress. The existing labelled stress emotion datasets are highly subjective to the perception of the annotator. We address the first issue of feature selection by exploiting the use of traditional MFCC features in Convolutional Neural Network. Our experiments show that Emo-CNN consistently and significantly outperforms the popular existing methods over multiple datasets. It achieves 90.2% categorical accuracy on the Emo-DB dataset. To tackle the second and the more significant problem of subjectivity in stress labels, we use Lovheim's cube, which is a 3-dimensional projection of emotions. The cube aims at explaining the relationship between these neurotransmitters and the positions of emotions in 3D space. The learnt emotion representations from the Emo-CNN are mapped to the cube using three component PCA (Principal Component Analysis) which is then used to model human stress. This proposed approach not only circumvents the need for labelled stress data but also complies with the psychological theory of emotions given by Lovheim's cube. We believe that this work is the first step towards creating a connection between Artificial Intelligence and the chemistry of human emotions.

* 2 pages, 2 tables and 2 figures 

  Access Paper or Ask Questions

CAGFuzz: Coverage-Guided Adversarial Generative Fuzzing Testing of Deep Learning Systems

Nov 14, 2019
Pengcheng Zhang, Qiyin Dai, Patrizio Pelliccione

Deep Learning systems (DL) based on Deep Neural Networks (DNNs) are more and more used in various aspects of our life, including unmanned vehicles, speech processing, and robotics. However, due to the limited dataset and the dependence on manual labeling data, DNNs often fail to detect their erroneous behaviors, which may lead to serious problems. Several approaches have been proposed to enhance the input examples for testing DL systems. However, they have the following limitations. First, they design and generate adversarial examples from the perspective of model, which may cause low generalization ability when they are applied to other models. Second, they only use surface feature constraints to judge the difference between the adversarial example generated and the original example. The deep feature constraints, which contain high-level semantic information, such as image object category and scene semantics are completely neglected. To address these two problems, in this paper, we propose CAGFuzz, a Coverage-guided Adversarial Generative Fuzzing testing approach, which generates adversarial examples for a targeted DNN to discover its potential defects. First, we train an adversarial case generator (AEG) from the perspective of general data set. Second, we extract the depth features of the original and adversarial examples, and constrain the adversarial examples by cosine similarity to ensure that the semantic information of adversarial examples remains unchanged. Finally, we retrain effective adversarial examples to improve neuron testing coverage rate. Based on several popular data sets, we design a set of dedicated experiments to evaluate CAGFuzz. The experimental results show that CAGFuzz can improve the neuron coverage rate, detect hidden errors, and also improve the accuracy of the target DNN.

  Access Paper or Ask Questions

Deep clustering: Discriminative embeddings for segmentation and separation

Aug 18, 2015
John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe

We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function that to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.

* Originally submitted on June 5, 2015 

  Access Paper or Ask Questions

Using Topological Framework for the Design of Activation Function and Model Pruning in Deep Neural Networks

Sep 03, 2021
Yogesh Kochar, Sunil Kumar Vengalil, Neelam Sinha

Success of deep neural networks in diverse tasks across domains of computer vision, speech recognition and natural language processing, has necessitated understanding the dynamics of training process and also working of trained models. Two independent contributions of this paper are 1) Novel activation function for faster training convergence 2) Systematic pruning of filters of models trained irrespective of activation function. We analyze the topological transformation of the space of training samples as it gets transformed by each successive layer during training, by changing the activation function. The impact of changing activation function on the convergence during training is reported for the task of binary classification. A novel activation function aimed at faster convergence for classification tasks is proposed. Here, Betti numbers are used to quantify topological complexity of data. Results of experiments on popular synthetic binary classification datasets with large Betti numbers(>150) using MLPs are reported. Results show that the proposed activation function results in faster convergence requiring fewer epochs by a factor of 1.5 to 2, since Betti numbers reduce faster across layers with the proposed activation function. The proposed methodology was verified on benchmark image datasets: fashion MNIST, CIFAR-10 and cat-vs-dog images, using CNNs. Based on empirical results, we propose a novel method for pruning a trained model. The trained model was pruned by eliminating filters that transform data to a topological space with large Betti numbers. All filters with Betti numbers greater than 300 were removed from each layer without significant reduction in accuracy. This resulted in faster prediction time and reduced memory size of the model.

  Access Paper or Ask Questions

SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification

Mar 02, 2021
Alireza Nasiri, Jianjun Hu

Environmental Sound Classification (ESC) is a challenging field of research in non-speech audio processing. Most of current research in ESC focuses on designing deep models with special architectures tailored for specific audio datasets, which usually cannot exploit the intrinsic patterns in the data. However recent studies have surprisingly shown that transfer learning from models trained on ImageNet is a very effective technique in ESC. Herein, we propose SoundCLR, a supervised contrastive learning method for effective environment sound classification with state-of-the-art performance, which works by learning representations that disentangle the samples of each class from those of other classes. Our deep network models are trained by combining a contrastive loss that contributes to a better probability output by the classification layer with a cross-entropy loss on the output of the classifier layer to map the samples to their respective 1-hot encoded labels. Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline and apply the augmentations on both the sound signals and their log-mel spectrograms before inputting them to the model. Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance. Our extensive benchmark experiments show that our hybrid deep network models trained with combined contrastive and cross-entropy loss achieved the state-of-the-art performance on three benchmark datasets ESC-10, ESC-50, and US8K with validation accuracies of 99.75\%, 93.4\%, and 86.49\% respectively. The ensemble version of our models also outperforms other top ensemble methods. The code is available at

  Access Paper or Ask Questions

Robust Attack Detection Approach for IIoT Using Ensemble Classifier

Jan 30, 2021
V. Priya, I. Sumaiya Thaseen, Thippa Reddy Gadekallu, Mohamed K. Aboudaif, Emad Abouel Nasr

Generally, the risks associated with malicious threats are increasing for the IIoT and its related applications due to dependency on the Internet and the minimal resource availability of IoT devices. Thus, anomaly-based intrusion detection models for IoT networks are vital. Distinct detection methodologies need to be developed for the IIoT network as threat detection is a significant expectation of stakeholders. Machine learning approaches are considered to be evolving techniques that learn with experience, and such approaches have resulted in superior performance in various applications, such as pattern recognition, outlier analysis, and speech recognition. Traditional techniques and tools are not adequate to secure IIoT networks due to the use of various protocols in industrial systems and restricted possibilities of upgradation. In this paper, the objective is to develop a two-phase anomaly detection model to enhance the reliability of an IIoT network. In the first phase, SVM and Naive Bayes are integrated using an ensemble blending technique. K-fold cross-validation is performed while training the data with different training and testing ratios to obtain optimized training and test sets. Ensemble blending uses a random forest technique to predict class labels. An Artificial Neural Network (ANN) classifier that uses the Adam optimizer to achieve better accuracy is also used for prediction. In the second phase, both the ANN and random forest results are fed to the model's classification unit, and the highest accuracy value is considered the final result. The proposed model is tested on standard IoT attack datasets, such as WUSTL_IIOT-2018, N_BaIoT, and Bot_IoT. The highest accuracy obtained is 99%. The results also demonstrate that the proposed model outperforms traditional techniques and thus improves the reliability of an IIoT network.

  Access Paper or Ask Questions

Next Wave Artificial Intelligence: Robust, Explainable, Adaptable, Ethical, and Accountable

Dec 11, 2020
Odest Chadwicke Jenkins, Daniel Lopresti, Melanie Mitchell

The history of AI has included several "waves" of ideas. The first wave, from the mid-1950s to the 1980s, focused on logic and symbolic hand-encoded representations of knowledge, the foundations of so-called "expert systems". The second wave, starting in the 1990s, focused on statistics and machine learning, in which, instead of hand-programming rules for behavior, programmers constructed "statistical learning algorithms" that could be trained on large datasets. In the most recent wave research in AI has largely focused on deep (i.e., many-layered) neural networks, which are loosely inspired by the brain and trained by "deep learning" methods. However, while deep neural networks have led to many successes and new capabilities in computer vision, speech recognition, language processing, game-playing, and robotics, their potential for broad application remains limited by several factors. A concerning limitation is that even the most successful of today's AI systems suffer from brittleness-they can fail in unexpected ways when faced with situations that differ sufficiently from ones they have been trained on. This lack of robustness also appears in the vulnerability of AI systems to adversarial attacks, in which an adversary can subtly manipulate data in a way to guarantee a specific wrong answer or action from an AI system. AI systems also can absorb biases-based on gender, race, or other factors-from their training data and further magnify these biases in their subsequent decision-making. Taken together, these various limitations have prevented AI systems such as automatic medical diagnosis or autonomous vehicles from being sufficiently trustworthy for wide deployment. The massive proliferation of AI across society will require radically new ideas to yield technology that will not sacrifice our productivity, our quality of life, or our values.

* A Computing Community Consortium (CCC) white paper, 5 pages 

  Access Paper or Ask Questions

Enhancing deep neural networks with morphological information

Nov 24, 2020
Matej Klemen, Luka Krsnik, Marko Robnik-Šikonja

Currently, deep learning approaches are superior in natural language processing due to their ability to extract informative features and patterns from languages. Two most successful neural architectures are LSTM and transformers, the latter mostly used in the form of large pretrained language models such as BERT. While cross-lingual approaches are on the rise, a vast majority of current natural language processing techniques is designed and applied to English, and less-resourced languages are lagging behind. In morphologically rich languages, plenty of information is conveyed through changes in morphology, e.g., through different prefixes and suffixes modifying stems of words. The existing neural approaches do not explicitly use the information on word morphology. We analyze the effect of adding morphological features to LSTM and BERT models. We use three tasks available in many less-resourced languages: named entity recognition (NER), dependency parsing (DP), and comment filtering (CF). We construct sensible baselines involving LSTM and BERT models, which we adjust by adding additional input in the form of part of speech (POS) tags and universal features. We compare the obtained models across subsets of eight languages. Our results suggest that adding morphological features has mixed effects depending on the quality of features and the task. The features improve the performance of LSTM-based models on the NER and DP tasks, while they do not benefit the performance on the CF task. For BERT-based models, the added morphological features only improve the performance on DP when they are of high quality, while they do not show any practical improvement when they are predicted. As in NER and CF datasets manually checked features are not available, we only experiment with the predicted morphological features and find that they do not cause any practical improvement in performance.

  Access Paper or Ask Questions

Learning Internal Representations (COLT 1995)

Dec 19, 2019
Jonathan Baxter

Probably the most important problem in machine learning is the preliminary biasing of a learner's hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for {\em automatically} learning or biasing the learner's hypothesis space is introduced. It works by first learning an appropriate {\em internal representation} for a learning environment and then using that representation to bias the learner's hypothesis space for the learning of future tasks drawn from the same environment. An internal representation must be learnt by sampling from {\em many similar tasks}, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples $m$ {\em per task} required to ensure good generalisation from a representation learner obeys $m = O(a+b/n)$ where $n$ is the number of tasks being learnt and $a$ and $b$ are constants. If the tasks are learnt independently ({\em i.e.} without a common representation) then $m=O(a+b)$. It is argued that for learning environments such as speech and character recognition $b\gg a$ and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if $n = O(b)$ (with $m=O(a+b/n)$) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to $O(a)$ (as opposed to $O(a+b)$ if no representation is used). It is shown that gradient descent can be used to train neural network representations and experiment results are reported providing strong qualitative support for the theoretical results.

* COLT '95 Proceedings of the eighth annual conference on Computational learning theory (1995) 311-320 

  Access Paper or Ask Questions