"speech": models, code, and papers

Laplacian Eigenmaps from Sparse, Noisy Similarity Measurements

Aug 16, 2016
Keith Levin, Vince Lyzinski

Manifold learning and dimensionality reduction techniques are ubiquitous in science and engineering, but can be computationally expensive procedures when applied to large data sets or when similarities are expensive to compute. To date, little work has been done to investigate the tradeoff between computational resources and the quality of learned representations. We present both theoretical and experimental explorations of this question. In particular, we consider Laplacian eigenmaps embeddings based on a kernel matrix, and explore how the embeddings behave when this kernel matrix is corrupted by occlusion and noise. Our main theoretical result shows that under modest noise and occlusion assumptions, we can (with high probability) recover a good approximation to the Laplacian eigenmaps embedding based on the uncorrupted kernel matrix. Our results also show how regularization can aid this approximation. Experimentally, we explore the effects of noise and occlusion on Laplacian eigenmaps embeddings of two real-world data sets, one from speech processing and one from neuroscience, as well as a synthetic data set.
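
As a rough illustration of the setting described above, the following sketch computes a Laplacian eigenmaps embedding from a kernel matrix and from a corrupted copy of it. It is not the authors' code. The corruption model (independent symmetric occlusion plus Gaussian noise), the regularizer (adding `reg / n` to every entry), and the names `corrupt_kernel`, `laplacian_eigenmaps`, `keep_prob`, `noise_std`, and `reg` are all assumptions made for this example.

```python
import numpy as np


def corrupt_kernel(K, keep_prob, noise_std, seed=0):
    """Return a symmetric, occluded and noisy copy of kernel matrix K.

    Assumed model: each off-diagonal pair is observed with probability
    keep_prob, observed entries get additive Gaussian noise, and the
    result is clipped at zero. The diagonal is kept as-is.
    """
    rng = np.random.default_rng(seed)
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)
    mask = np.zeros((n, n))
    mask[upper] = rng.random(upper.sum()) < keep_prob
    noise = np.zeros((n, n))
    noise[upper] = rng.normal(0.0, noise_std, upper.sum())
    mask = mask + mask.T + np.eye(n)  # symmetric mask, diagonal observed
    noise = noise + noise.T           # symmetric noise, zero on diagonal
    return np.clip(mask * (K + noise), 0.0, None)


def laplacian_eigenmaps(K, dim, reg=0.0):
    """Embed n points in `dim` dimensions from a similarity matrix K."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    W = K + reg / n  # hypothetical regularization: uniform weak edges
    deg = W.sum(axis=1)
    if np.any(deg <= 0):
        raise ValueError("every row needs positive total similarity")
    inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetrically normalized similarity D^{-1/2} W D^{-1/2}.
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    # Drop the trivial top eigenvector, keep the next `dim` ones.
    top = vecs[:, -(dim + 1):-1][:, ::-1]
    # Map back to generalized eigenvectors of L y = lambda D y.
    return top * inv_sqrt[:, None]


# Two well-separated 1-D clusters with a Gaussian kernel.
pts = np.array([0.0, 0.1, 0.2, 3.0, 3.1, 3.2])
K = np.exp(-0.5 * (pts[:, None] - pts[None, :]) ** 2)
clean = laplacian_eigenmaps(K, dim=1)
noisy = laplacian_eigenmaps(corrupt_kernel(K, 0.7, 0.05), dim=1, reg=1.0)
print(clean.ravel())
print(noisy.ravel())
```

Comparing the two printed embeddings (up to a sign flip) gives an informal sense of how much the corruption perturbs the result for a given `keep_prob`, `noise_std`, and `reg`.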

Aggressive actions and anger detection from multiple modalities using Kinect

Jul 05, 2016
Amol Patwardhan, Gerald Knapp

Prison facilities, mental correctional institutions, sports bars and places of public protest are prone to sudden violence and conflicts. Surveillance systems play an important role in mitigation of hostile behavior and improvement of security by detecting such provocative and aggressive activities. This research proposed using automatic aggressive behavior and anger detection to improve the effectiveness of the surveillance systems. An emotion and aggression aware component will make the surveillance system highly responsive and capable of alerting the security guards in real time. This research proposed facial expression, head, hand and body movement and speech tracking for detecting anger and aggressive actions. Recognition was achieved using support vector machines and rule based features. The multimodal affect recognition precision rate for anger improved by 15.2% and recall rate improved by 11.7% when behavioral rule based features were used in aggressive action detection.

* 11 pages, 2 figures, 5 tables, in peer review with ACM TIST, Key words: Aggression, multimodal anger recognition, Kinect 

Cross-Lingual Morphological Tagging for Low-Resource Languages

Jun 14, 2016
Jan Buys, Jan A. Botha

Morphologically rich languages often lack the annotated linguistic resources required to develop accurate natural language processing tools. We propose models suitable for training morphological taggers with rich tagsets for low-resource languages without using direct supervision. Our approach extends existing approaches of projecting part-of-speech tags across languages, using bitext to infer constraints on the possible tags for a given word type or token. We propose a tagging model using Wsabie, a discriminative embedding-based model with rank-based learning. In our evaluation on 11 languages, on average this model performs on par with a baseline weakly-supervised HMM, while being more scalable. Multilingual experiments show that the method performs best when projecting between related language pairs. Despite the inherently lossy projection, we show that the morphological tags predicted by our models improve the downstream performance of a parser by +0.6 LAS on average.

* 11 pages. ACL 2016 

Lipreading with Long Short-Term Memory

Jan 29, 2016
Michael Wand, Jan Koutník, Jürgen Schmidhuber

Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy than conventional methods. Feed-forward and recurrent neural network layers (namely Long Short-Term Memory; LSTM) are stacked to form a single structure which is trained by back-propagating error gradients through all the layers. The performance of such a stacked network was experimentally evaluated and compared to a standard Support Vector Machine classifier using conventional computer vision features (Eigenlips and Histograms of Oriented Gradients). The evaluation was performed on data from 19 speakers of the publicly available GRID corpus. With 51 different words to classify, we report a best word accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural network-based solution (11.6% improvement over the best feature-based solution evaluated).

* Accepted for publication at ICASSP 2016 

The Vicomtech Spoofing-Aware Biometric System for the SASV Challenge

Apr 04, 2022
Juan M. Martín-Doñas, Iván G. Torre, Aitor Álvarez, Joaquin Arellano

This paper describes our proposed integration system for the spoofing-aware speaker verification challenge. It consists of a robust spoofing-aware verification system that uses the speaker verification and antispoofing embeddings extracted from specialized neural networks. First, an integration network, fed with the test utterance's speaker verification and spoofing embeddings, is used to compute a spoof-based score. This score is then linearly combined with the cosine similarity between the speaker verification embeddings from the enrollment and test utterances, thus obtaining the final scoring decision. Moreover, the integration network is trained using a one-class loss function to discriminate between target trials and unauthorized accesses. Our proposed system is evaluated on the ASVspoof19 database, exhibiting competitive performance compared to other integration approaches. In addition, we test, along with our integration approach, state-of-the-art speaker verification and antispoofing systems based on self-supervised learning, yielding high-performance speech biometric systems.
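
The score fusion step described above can be sketched as follows. This is only an illustration, not the authors' implementation: the spoof-based score is treated as a number already produced by the integration network, and the function names and the mixing weight `alpha` are assumptions introduced for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def fused_score(enroll_emb, test_emb, spoof_score, alpha=0.5):
    """Linearly combine speaker similarity with a spoof-based score.

    alpha is a hypothetical mixing weight; the abstract only states
    that the combination is linear.
    """
    speaker_score = cosine_similarity(enroll_emb, test_emb)
    return alpha * speaker_score + (1.0 - alpha) * spoof_score


# Toy embeddings; real ones would come from the verification network.
enroll = [0.2, 0.9, 0.1]
test = [0.25, 0.85, 0.05]
print(fused_score(enroll, test, spoof_score=0.8, alpha=0.5))
```

A trial would then be accepted when the fused score exceeds a threshold tuned on development data.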

* Submitted to Interspeech 2022 

g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin

Mar 24, 2022
Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang, Yi-Ren Yeh

Polyphone disambiguation is the most crucial task in Mandarin grapheme-to-phoneme (g2p) conversion. Previous studies have approached this problem using pre-trained language models, restricted output, and extra information from Part-Of-Speech (POS) tagging. Inspired by these strategies, we propose a novel approach, called g2pW, which adapts learnable softmax-weights to condition the outputs of BERT with the polyphonic character of interest and its POS tagging. Rather than using the hard mask as in previous works, our experiments show that learning a soft-weighting function for the candidate phonemes benefits performance. In addition, our proposed g2pW does not require extra pre-trained POS tagging models while using POS tags as auxiliary features since we train the POS tagging model simultaneously with the unified encoder. Experimental results show that our g2pW outperforms existing methods on the public CPP dataset. All codes, model weights, and a user-friendly package are publicly available.
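
To make the contrast between a hard mask and a soft weighting concrete, the sketch below uses one common form of weighted softmax, in which each candidate's exponentiated logit is scaled by a non-negative weight before normalization. Whether this matches the exact formulation in g2pW is an assumption, and in the model the weights would be learned and conditioned on the character and its POS tag rather than given by hand. The function name `weighted_softmax` and the toy values are illustrative.

```python
import math


def weighted_softmax(logits, weights):
    """Softmax in which each class is scaled by a non-negative weight.

    p_i = w_i * exp(z_i) / sum_j w_j * exp(z_j)

    A hard mask is the special case where every weight is 0 or 1.
    """
    if len(logits) != len(weights):
        raise ValueError("logits and weights must have the same length")
    shift = max(logits)  # subtract the max for numerical stability
    terms = [w * math.exp(z - shift) for z, w in zip(logits, weights)]
    total = sum(terms)
    if total <= 0.0:
        raise ValueError("at least one class needs a positive weight")
    return [t / total for t in terms]


logits = [1.0, 2.0, 3.0]     # toy scores for three candidate phonemes
hard_mask = [1.0, 1.0, 0.0]  # third phoneme is not a valid candidate
soft = [0.9, 0.8, 0.05]      # illustrative soft weights
print(weighted_softmax(logits, hard_mask))
print(weighted_softmax(logits, soft))
```

With the hard mask the excluded phoneme receives exactly zero probability, while the soft weights only down-weight it.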

* submitted to Interspeech 2022 

Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models

Mar 15, 2022
Mark Chu, Bhargav Srinivasa Desikan, Ethan O. Nadler, Ruggerio L. Sardo, Elise Darragh-Ford, Douglas Guilbeault

Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that $n$-grams composed of random character sequences, or $garble$, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character $n$-grams lack meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that meaning and primitive information are intrinsically linked.

A Fast Network Exploration Strategy to Profile Low Energy Consumption for Keyword Spotting

Feb 04, 2022
Arnab Neelim Mazumder, Tinoosh Mohsenin

Keyword spotting is now an integral part of speech-oriented user interaction on smart devices. To this end, neural networks are extensively used for their flexibility and high accuracy. However, coming up with a suitable configuration for both accuracy requirements and hardware deployment is a challenge. We propose a regression-based network exploration technique that considers the scaling of the network filters ($s$) and quantization ($q$) of the network layers, leading to a friendly and energy-efficient configuration for FPGA hardware implementation. We experiment with different combinations of $\mathcal{NN}\scriptstyle\langle q,\,s\rangle \displaystyle$ on the FPGA to profile the energy consumption of the deployed network so that the user can choose the most energy-efficient network configuration promptly. Our accelerator design is deployed on the Xilinx AC 701 platform and has at least 2.1$\times$ and 4$\times$ improvements on energy and energy efficiency results, respectively, compared to recent hardware implementations for keyword spotting.

* accepted in tinyML Research Symposium 2022 

A deep complex network with multi-frame filtering for stereophonic acoustic echo cancellation

Feb 03, 2022
Linjuan Cheng, Chengshi Zheng, Andong Li, Renhua Peng, Xiaodong Li

In hands-free communication systems, the coupling between the loudspeaker and the microphone generates an echo signal, which can severely impair the quality of communication. Meanwhile, various types of noise in the communication environment further degrade speech quality and intelligibility. It is hard to extract the near-end signal from the microphone input signal within one step, especially at low signal-to-noise ratios. In this paper, we propose a multi-stage approach to address this issue. On the one hand, we decompose the echo cancellation into two stages: a linear echo cancellation module and a residual echo suppression module. A multi-frame filtering strategy is introduced to improve linear echo estimation by utilizing more inter-frame information. On the other hand, we decouple the complex spectral mapping into magnitude estimation and complex spectrum refinement. Experimental results demonstrate that our proposed approach achieves state-of-the-art performance over previous advanced algorithms under various conditions.

Towards Relatable Explainable AI with the Perceptual Process

Dec 28, 2021
Wencan Zhang, Brian Y. Lim

Machine learning models need to provide contrastive explanations, since people often seek to understand why a puzzling prediction occurred instead of some expected outcome. Current contrastive explanations are rudimentary comparisons between examples or raw features, which remain difficult to interpret, since they lack semantic meaning. We argue that explanations must be more relatable to other concepts, hypotheticals, and associations. Inspired by the perceptual process from cognitive psychology, we propose the XAI Perceptual Processing Framework and RexNet model for relatable explainable AI with Contrastive Saliency, Counterfactual Synthetic, and Contrastive Cues explanations. We investigated the application of vocal emotion recognition, and implemented a modular multi-task deep neural network to predict and explain emotions from speech. From think-aloud and controlled studies, we found that counterfactual explanations were useful and further enhanced with semantic cues, but not saliency explanations. This work provides insights into providing and evaluating relatable contrastive explainable AI for perception applications.

* 14 pages, 7 figures, 4 tables, accepted by CHI 2022 
