
"speech": models, code, and papers

Multi-scale Geometric Summaries for Similarity-based Sensor Fusion

Oct 13, 2018
Christopher J. Tralie, Paul Bendich, John Harer

In this work, we address fusion of heterogeneous sensor data using wavelet-based summaries of fused self-similarity information from each sensor. The technique we develop is quite general, does not require domain-specific knowledge or physical models, and requires no training. Nonetheless, it can perform surprisingly well at the general task of differentiating classes of time-ordered behavior sequences which are sensed by more than one modality. As a demonstration of our capabilities in the audio-to-video context, we focus on the differentiation of speech sequences. Data from two or more modalities are first represented using self-similarity matrices (SSMs) corresponding to time-ordered point clouds in feature spaces of each of these data sources; we note that these feature spaces can be of entirely different scale and dimensionality. A fused similarity template is then derived from the modality-specific SSMs using a technique called similarity network fusion (SNF). We investigate pipelines using SNF as both an upstream (feature-level) and a downstream (ranking-level) fusion technique. Multiscale geometric features of this template are then extracted using a recently developed technique called the scattering transform, and these features are then used to differentiate speech sequences. This method outperforms unsupervised techniques which operate directly on the raw data, and it also outperforms stovepiped methods which operate on SSMs separately derived from the distinct modalities. The benefits of this method become even more apparent as the simulated peak signal-to-noise ratio decreases.
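
As a minimal illustration of the SSM stage, the sketch below builds a self-similarity matrix for each modality from a time-ordered feature point cloud and fuses the two; the elementwise average is only a placeholder for SNF's cross-diffusion iterations, the scattering-transform stage is omitted, and all array sizes and feature types are hypothetical.

```python
# Hedged sketch: per-modality self-similarity matrices (SSMs) from
# time-ordered feature point clouds, plus a simplified fusion step.
import numpy as np
from scipy.spatial.distance import cdist

def ssm(features, sigma=1.0):
    """SSM for a (T, d) time-ordered point cloud. Each modality's feature
    space may differ in scale and dimensionality, so distances are
    normalized by their median before the Gaussian kernel is applied."""
    D = cdist(features, features)          # (T, T) pairwise distances
    D /= np.median(D) + 1e-12              # scale-invariant normalization
    return np.exp(-D**2 / (2 * sigma**2))  # similarities in (0, 1]

# Toy stand-ins for audio and video feature trajectories on a common
# time grid of T frames (sizes are illustrative).
T = 200
audio_feats = np.random.randn(T, 13)   # e.g., MFCC-like features
video_feats = np.random.randn(T, 128)  # e.g., frame embeddings

W_audio, W_video = ssm(audio_feats), ssm(video_feats)

# Placeholder fusion: elementwise average instead of SNF's iterative
# cross-diffusion of similarity networks.
W_fused = 0.5 * (W_audio + W_video)
```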

* 9 pages, 13 figures 


A Sparse Non-negative Matrix Factorization Framework for Identifying Functional Units of Tongue Behavior from MRI

Sep 29, 2018
Jonghye Woo, Jerry L. Prince, Maureen Stone, Fangxu Xing, Arnold Gomez, Jordan R. Green, Christopher J. Hartnick, Thomas J. Brady, Timothy G. Reese, Van J. Wedeen, Georges El Fakhri

Muscle coordination patterns of lingual behaviors are synergies generated by deforming local muscle groups in a variety of ways. Functional units are functional muscle groups of local structural elements within the tongue that compress, expand, and move in a cohesive and consistent manner. Identifying the functional units using tagged Magnetic Resonance Imaging (MRI) sheds light on the mechanisms of normal and pathological muscle coordination patterns, yielding improvements in surgical planning, treatment, and rehabilitation procedures. Here, to mine this information, we propose a matrix factorization and probabilistic graphical model framework that produces building blocks and their associated weighting map using motion quantities extracted from tagged-MRI. Our tagged-MRI acquisition and accurate voxel-level tracking provide previously unavailable internal tongue motion patterns, thus revealing the inner workings of the tongue during speech and other lingual behaviors. We then employ spectral clustering on the weighting map to identify the cohesive regions defined by tongue motion, which may involve multiple or undocumented regions. To evaluate our method, we perform a series of experiments. We first use two-dimensional images and synthetic data to demonstrate the accuracy of our method. We then use three-dimensional synthetic and in vivo tongue motion data from protrusion and simple speech tasks to identify subject-specific and data-driven functional units of the tongue in localized regions.
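
A rough sketch of the factorization stage, assuming scikit-learn's generic NMF and spectral clustering rather than the authors' graphical-model formulation: a sparse NMF yields building blocks and a weighting map, and clustering the weighting map groups voxels into candidate functional units. The toy matrix sizes below are made up.

```python
# Hedged sketch: sparse NMF on voxelwise motion features, then spectral
# clustering of the weighting map (scikit-learn >= 1.0 API assumed).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import SpectralClustering

# Toy motion matrix: rows = motion quantities, columns = voxels tracked
# in tagged-MRI (sizes are illustrative).
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((40, 500)))

# Sparse NMF: X ~ W @ H, with an L1 penalty on H so each voxel loads on
# only a few building blocks.
model = NMF(n_components=5, init="nndsvda", alpha_H=0.1, l1_ratio=1.0,
            max_iter=500, random_state=0)
W = model.fit_transform(X)   # building blocks, shape (40, 5)
H = model.components_        # weighting map,  shape (5, 500)

# Spectral clustering on per-voxel weighting vectors identifies cohesive
# regions (candidate functional units).
labels = SpectralClustering(n_clusters=5, random_state=0).fit_predict(H.T)
```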

* Accepted at IEEE TMI (https://ieeexplore.ieee.org/document/8467354)


WEMAC: Women and Emotion Multi-modal Affective Computing dataset

Mar 01, 2022
Jose A. Miranda, Esther Rituerto-González, Laura Gutiérrez-Martín, Clara Luis-Mingueza, Manuel F. Canabal, Alberto Ramírez Bárcenas, Jose M. Lanza-Gutiérrez, Carmen Peláez-Moreno, Celia López-Ongil

Among the seventeen Sustainable Development Goals (SDGs) proposed within the 2030 Agenda and adopted by all United Nations member states, the Fifth SDG is a call for action to turn Gender Equality into a fundamental human right and an essential foundation for a better world. It includes the eradication of all types of violence against women. Within this context, the UC3M4Safety research team aims to develop Bindi, a cyber-physical system with embedded Artificial Intelligence algorithms for real-time user monitoring and the detection of affective states, with the ultimate goal of achieving early detection of risk situations for women. On this basis, we make use of wearable affective computing, including smart sensors, data encryption for secure and accurate collection of presumed crime evidence, and remote connection to protecting agents. Towards the development of such a system, recordings of several laboratory and in-the-wild datasets are in progress; these are contained within the UC3M4Safety Database. This paper presents and details the first release of WEMAC, a novel multi-modal dataset comprising a laboratory-based experiment in which 47 women volunteers were exposed to validated audio-visual stimuli, delivered through a virtual-reality headset to induce real emotions, while physiological signals, speech signals, and self-reports were acquired. We believe this dataset will support research on multi-modal affective computing using physiological and speech information.



EmoWrite: A Sentiment Analysis-Based Thought to Text Conversion

Mar 03, 2021
A. Shahid, I. Raza, S. A. Hussain

Brain-Computer Interfaces (BCIs) help process and extract useful information from acquired brain signals, with applications in diverse fields such as the military, medicine, neuroscience, and rehabilitation. BCIs have been used to support paralyzed patients with severe disabilities and speech impediments. To help such patients communicate with ease, BCI-based systems convert silent speech (thoughts) to text. However, these systems have inconvenient graphical user interfaces, high latency, limited typing speed, and low accuracy. Apart from these limitations, existing systems do not incorporate the inevitable factor of a patient's emotional state or sentiment analysis. The proposed system, EmoWrite, implements a dynamic keyboard with contextualized appearance of characters, reducing traversal time and improving the utilization of screen space. The proposed system has been evaluated and compared with existing systems for accuracy, convenience, sentiment analysis, and typing speed. It achieves 6.58 Words Per Minute (WPM) and 31.92 Characters Per Minute (CPM) with an accuracy of 90.36 percent. EmoWrite also gives remarkable results when it comes to the integration of emotional states. Its Information Transfer Rate (ITR) is also high compared to other systems, i.e., 87.55 bits per minute for commands and 72.52 bits per minute for letters. Furthermore, it provides an easy-to-use interface with a latency of 2.685 seconds.
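
For context, BCI typing studies conventionally report ITR using the Wolpaw formula; the sketch below implements that standard definition. Whether EmoWrite computes ITR exactly this way, and the 30-key layout assumed in the example, are assumptions on our part.

```python
# Hedged sketch: the standard Wolpaw information transfer rate (ITR)
# for a BCI with n_targets equally likely selections.
import math

def itr_bits_per_min(n_targets, accuracy, seconds_per_selection):
    """Bits per selection, scaled to selections per minute."""
    p, n = accuracy, n_targets
    bits = math.log2(n)
    if 0 < p < 1:
        bits += p * math.log2(p) + (1 - p) * math.log2((1 - p) / (n - 1))
    return bits * (60.0 / seconds_per_selection)

# Illustrative numbers: a 30-key layout (assumed), the paper's 90.36%
# accuracy, and its 2.685 s latency per selection.
print(round(itr_bits_per_min(30, 0.9036, 2.685), 2))
```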



Parameters Optimization for Improving ASR Performance in Adverse Real World Noisy Environmental Conditions

Mar 22, 2013
Urmila Shrawankar, Vilas Thakare

Existing research shows that many techniques and methodologies are available for every step of an Automatic Speech Recognition (ASR) system, but performance (minimizing the Word Error Rate, WER, and maximizing the Word Accuracy Rate, WAR) does not depend only on the technique applied within a given method. The research indicates that performance depends mainly on the category and level of the noise and on the variable sizes of the window, frame, frame overlap, etc. considered in existing methods. The main aim of the work presented in this paper is to vary parameters such as window size, frame size, and frame-overlap percentage, to observe the performance of algorithms for various noise categories and levels, and to train the system across all parameter sizes and categories of real-world noisy environments so as to improve the performance of the speech recognition system. This paper presents the results of Signal-to-Noise Ratio (SNR) and accuracy tests under these variable parameter sizes. We observe that it is very hard to evaluate the test results and to choose a parameter size that optimizes ASR performance. Hence, this study further suggests feasible and optimal parameter sizes, obtained using a Fuzzy Inference System (FIS), for enhancing accuracy in adverse real-world noisy environmental conditions. This work will be helpful for the discriminative training of ubiquitous ASR systems for better Human-Computer Interaction (HCI).
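
To make the varied parameters concrete, here is a minimal framing routine with adjustable frame size and overlap percentage, sketched under common defaults rather than the paper's optimized settings.

```python
# Hedged sketch: splitting a speech signal into overlapping, windowed
# frames, with the frame size and overlap the paper treats as variables.
import numpy as np

def frame_signal(x, sr, frame_ms=25.0, overlap_pct=50.0):
    """Return a (n_frames, frame_len) array of Hamming-windowed frames."""
    frame_len = int(sr * frame_ms / 1000.0)
    hop = max(1, int(frame_len * (1.0 - overlap_pct / 100.0)))
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    win = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * win
                     for i in range(n_frames)])

# Sweep the overlap to observe its effect on the number of frames.
sr = 16000
x = np.random.randn(sr)  # 1 s of noise as a stand-in for speech
for ov in (25.0, 50.0, 75.0):
    print(ov, frame_signal(x, sr, overlap_pct=ov).shape)
```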

* International Journal of Human Computer Interaction (IJHCI) 3(3), 58-70, 2012 
* 13 pages, 3 figures, 5 tables 


Classification of ALS patients based on acoustic analysis of sustained vowel phonations

Jan 11, 2021
Maxim Vashkevich, Yulia Rushkevich

Amyotrophic lateral sclerosis (ALS) is an incurable neurological disorder with a rapidly progressive course. Common early symptoms of ALS are difficulty in swallowing and speech. However, the early acoustic manifestation of speech and voice symptoms is highly variable, which makes their detection very challenging, both for human specialists and for automatic systems. This study presents an approach to voice assessment for an automatic system that separates healthy people from patients with ALS. In particular, this work focuses on analysing sustained phonation of the vowels /a/ and /i/ to perform automatic classification of ALS patients. A wide range of acoustic features, such as MFCCs, formants, jitter, shimmer, vibrato, PPE, GNE, and HNR, was analysed. We also propose a new set of acoustic features for characterizing the harmonic structure of the vowels; calculation of these features is based on pitch-synchronized voice analysis. Linear discriminant analysis (LDA) was used to classify the phonation produced by patients with ALS and that produced by healthy individuals. Several feature-selection algorithms were tested to find the optimal feature subset for the LDA model. The experiments show that the most successful LDA model, based on 32 features picked out by the LASSO feature-selection algorithm, attains 99.7% accuracy with 99.3% sensitivity and 99.9% specificity. Among the classifiers with a small number of features, we can highlight the LDA model with 5 features, which has 89.0% accuracy (87.5% sensitivity and 90.4% specificity).
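
A minimal sketch of the classification recipe, assuming scikit-learn: an L1-penalized selector followed by LDA. L1-penalized logistic regression stands in for the LASSO selection step here, and the synthetic features are stand-ins; nothing below reproduces the reported accuracies.

```python
# Hedged sketch: L1-based feature selection + LDA for a binary
# (ALS vs. healthy) classification task on synthetic features.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for per-recording acoustic features (jitter, shimmer, MFCCs, ...).
X, y = make_classification(n_samples=120, n_features=100, n_informative=15,
                           random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", C=0.1,
                                       solver="liblinear")),
    LinearDiscriminantAnalysis(),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```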

* Biomedical Signal Processing and Control, Volume 65, March 2021, 102350 


Physiological-Physical Feature Fusion for Automatic Voice Spoofing Detection

Sep 01, 2021
Junxiao Xue, Hao Zhou, Yabo Wang

Speaker verification systems have been used in many production scenarios in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks, such as voice conversion and speech synthesis. In this paper, we propose a new method based on physiological-physical feature fusion to deal with voice spoofing attacks. The method involves feature extraction, a densely connected convolutional neural network with squeeze-and-excitation blocks (SE-DenseNet), a multi-scale residual neural network with squeeze-and-excitation blocks (SE-Res2Net), and feature fusion strategies. We first pre-train a convolutional neural network using the speaker's voice and face in video as supervision signals; it extracts physiological features from speech. We then use SE-DenseNet and SE-Res2Net to extract physical features. Such a dense connection pattern has high parameter efficiency, and the squeeze-and-excitation block enhances the transmission of features. Finally, we integrate the two features into the SE-DenseNet to identify spoofing attacks. Experimental results on the ASVspoof 2019 dataset show that our model is effective for voice spoofing detection. In the logical access scenario, our model improves the tandem detection cost function (t-DCF) and equal error rate (EER) scores by 4% and 7%, respectively, compared with other methods. In the physical access scenario, our model improves t-DCF and EER scores by 8% and 10%, respectively.
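
The squeeze-and-excitation block attached to both branches is standard enough to sketch. Below is the usual SE recipe (global average pooling, a two-layer bottleneck, sigmoid channel gating) in PyTorch; the paper's exact module and hyperparameters may differ.

```python
# Hedged sketch: a standard squeeze-and-excitation (SE) block of the
# kind SE-DenseNet and SE-Res2Net attach to their feature maps.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))             # squeeze: global average pool
        w = self.fc(s)[:, :, None, None]   # excitation: channel weights
        return x * w                       # recalibrate the feature map

# Example on a spectrogram-like feature map.
feat = torch.randn(2, 64, 40, 100)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 40, 100])
```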



Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning

Jun 14, 2021
Haibin Wu, Xu Li, Andy T. Liu, Zhiyong Wu, Helen Meng, Hung-yi Lee

Previous works have shown that automatic speaker verification (ASV) is seriously vulnerable to malicious spoofing attacks, such as replay, synthetic speech, and the recently emerged adversarial attacks. Great efforts have been dedicated to defending ASV against replay and synthetic speech; however, only a few approaches have been explored to deal with adversarial attacks. All existing approaches to tackling adversarial attacks on ASV require knowledge of how the adversarial samples are generated, but it is impractical for defenders to know the exact attack algorithms applied by in-the-wild attackers. This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms. Inspired by self-supervised learning models (SSLMs), which possess the merits of alleviating superficial noise in the inputs and reconstructing clean samples from corrupted ones, this work regards adversarial perturbations as one kind of noise and conducts adversarial defense for ASV with SSLMs. Specifically, we propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection. Experimental results show that our detection module effectively shields ASV by detecting adversarial samples with an accuracy of around 80%. Moreover, since there is no common metric for evaluating adversarial defense performance for ASV, this work also formalizes evaluation metrics for adversarial defense, taking both purification- and detection-based approaches into account. We sincerely encourage future works to benchmark their approaches based on the proposed evaluation framework.
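
A minimal sketch of the two defense perspectives, assuming a pretrained self-supervised model that exposes a hypothetical reconstruct() method: detection thresholds the input-reconstruction gap, and purification feeds the input through the model repeatedly. The stub class and threshold are placeholders, not the authors' implementation.

```python
# Hedged sketch: SSLM-based adversarial detection and purification.
import numpy as np

class IdentityDenoiser:
    """Stub for a pretrained self-supervised model; a real SSLM would
    reconstruct a clean estimate of its (possibly perturbed) input."""
    def reconstruct(self, x):
        return np.clip(x, -1.0, 1.0)  # placeholder "denoising"

def detect_adversarial(x, sslm, threshold):
    """Flag x as adversarial when the reconstruction error is large."""
    err = float(np.mean((x - sslm.reconstruct(x)) ** 2))
    return err > threshold

def purify(x, sslm, n_passes=3):
    """Purification: repeated reconstruction attenuates the perturbation."""
    for _ in range(n_passes):
        x = sslm.reconstruct(x)
    return x

# Toy check with the stub model and an illustrative threshold.
x = np.random.randn(16000) * 2.0
print(detect_adversarial(x, IdentityDenoiser(), threshold=0.05))
```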

* Submitted to TASLP on 19 April 2021 


Adversarial Defense for Automatic Speaker Verification by Self-Supervised Learning

Jun 01, 2021
Haibin Wu, Xu Li, Andy T. Liu, Zhiyong Wu, Helen Meng, Hung-yi Lee

Previous works have shown that automatic speaker verification (ASV) is seriously vulnerable to malicious spoofing attacks, such as replay, synthetic speech, and the recently emerged adversarial attacks. Great efforts have been dedicated to defending ASV against replay and synthetic speech; however, only a few approaches have been explored to deal with adversarial attacks. All existing approaches to tackling adversarial attacks on ASV require knowledge of how the adversarial samples are generated, but it is impractical for defenders to know the exact attack algorithms applied by in-the-wild attackers. This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms. Inspired by self-supervised learning models (SSLMs), which possess the merits of alleviating superficial noise in the inputs and reconstructing clean samples from corrupted ones, this work regards adversarial perturbations as one kind of noise and conducts adversarial defense for ASV with SSLMs. Specifically, we propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection. Experimental results show that our detection module effectively shields ASV by detecting adversarial samples with an accuracy of around 80%. Moreover, since there is no common metric for evaluating adversarial defense performance for ASV, this work also formalizes evaluation metrics for adversarial defense, taking both purification- and detection-based approaches into account. We sincerely encourage future works to benchmark their approaches based on the proposed evaluation framework.

* Submitted to TASLP on 03 May 2021 

