Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Geometric deep learning on graphs and manifolds using mixture model CNNs

Dec 06, 2016
Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, Michael M. Bronstein

Deep learning has achieved a remarkable performance breakthrough in several fields, most notably in speech recognition, natural language processing, and computer vision. In particular, convolutional neural network (CNN) architectures currently produce state-of-the-art performance on a variety of image analysis tasks such as object detection and recognition. Most of deep learning research has so far focused on dealing with 1D, 2D, or 3D Euclidean-structured data such as acoustic signals, images, or videos. Recently, there has been an increasing interest in geometric deep learning, attempting to generalize deep learning methods to non-Euclidean structured data such as graphs and manifolds, with a variety of applications from the domains of network analysis, computational social science, or computer graphics. In this paper, we propose a unified framework allowing to generalize CNN architectures to non-Euclidean domains (graphs and manifolds) and learn local, stationary, and compositional task-specific features. We show that various non-Euclidean CNN methods previously proposed in the literature can be considered as particular instances of our framework. We test the proposed method on standard tasks from the realms of image-, graph- and 3D shape analysis and show that it consistently outperforms previous approaches.

  Access Paper or Ask Questions

A Multiple Kernel Learning Approach for Human Behavioral Task Classification using STN-LFP Signal

Jul 27, 2016
Hosein M. Golshan, Adam O. Hebb, Sara J. Hanrahan, Joshua Nedrud, Mohammad H. Mahoor

Deep Brain Stimulation (DBS) has gained increasing attention as an effective method to mitigate Parkinsons disease (PD) disorders. Existing DBS systems are open-loop such that the system parameters are not adjusted automatically based on patients behavior. Classification of human behavior is an important step in the design of the next generation of DBS systems that are closed-loop. This paper presents a classification approach to recognize such behavioral tasks using the subthalamic nucleus (STN) Local Field Potential (LFP) signals. In our approach, we use the time-frequency representation (spectrogram) of the raw LFP signals recorded from left and right STNs as the feature vectors. Then these features are combined together via Support Vector Machines (SVM) with Multiple Kernel Learning (MKL) formulation. The MKL-based classification method is utilized to classify different tasks: button press, mouth movement, speech, and arm movement. Our experiments show that the lp-norm MKL significantly outperforms single kernel SVM-based classifiers in classifying behavioral tasks of five subjects even using signals acquired with a low sampling rate of 10 Hz. This leads to a lower computational cost.

* 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Scociety 

  Access Paper or Ask Questions

Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques

Mar 20, 2012
Tarek El-shishtawy, Abdulwahab Al-sammak

In this paper, a supervised learning technique for extracting keyphrases of Arabic documents is presented. The extractor is supplied with linguistic knowledge to enhance its efficiency instead of relying only on statistical information such as term frequency and distance. During analysis, an annotated Arabic corpus is used to extract the required lexical features of the document words. The knowledge also includes syntactic rules based on part of speech tags and allowed word sequences to extract the candidate keyphrases. In this work, the abstract form of Arabic words is used instead of its stem form to represent the candidate terms. The Abstract form hides most of the inflections found in Arabic words. The paper introduces new features of keyphrases based on linguistic knowledge, to capture titles and subtitles of a document. A simple ANOVA test is used to evaluate the validity of selected features. Then, the learning model is built using the LDA - Linear Discriminant Analysis - and training documents. Although, the presented system is trained using documents in the IT domain, experiments carried out show that it has a significantly better performance than the existing Arabic extractor systems, where precision and recall values reach double their corresponding values in the other systems especially for lengthy and non-scientific articles.

* Proceedings of the Second International Conference on Arabic Language Resources and Tools, 2009 

  Access Paper or Ask Questions

Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge

May 14, 2022
Tanel Alumäe, Kunnar Kukk

This paper investigates different pretraining approaches to spoken language identification. The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. We participated in two tracks of the challenge: constrained and unconstrained language recognition. For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition (ASR), using the provided training data that had transcripts available. The shared encoder of the multilingual ASR model was then finetuned for the language identification task. For the unconstrained task, we relied on both externally available pretrained models as well as external data: the multilingual XLSR-53 wav2vec2.0 model was finetuned on the VoxLingua107 corpus for the language recognition task, and finally finetuned on the provided target language training data, augmented with CommonVoice data. Our primary metric $C_{\rm avg}$ values on the Test set are 0.0079 for the constrained task and 0.0119 for the unconstrained task which resulted in the second place in both rankings. In post-evaluation experiments, we study the amount of target language data needed for training an accurate backend model, the importance of multilingual pretraining data, and compare different models as finetuning starting points.

* Accepted to Speaker Odyssey 2022 

  Access Paper or Ask Questions

Robust End-to-end Speaker Diarization with Generic Neural Clustering

Apr 18, 2022
Chenyu Yang, Yu Wang

End-to-end speaker diarization approaches have shown exceptional performance over the traditional modular approaches. To further improve the performance of the end-to-end speaker diarization for real speech recordings, recently works have been proposed which integrate unsupervised clustering algorithms with the end-to-end neural diarization models. However, these methods have a number of drawbacks: 1) The unsupervised clustering algorithms cannot leverage the supervision from the available datasets; 2) The K-means-based unsupervised algorithms that are explored often suffer from the constraint violation problem; 3) There is unavoidable mismatch between the supervised training and the unsupervised inference. In this paper, a robust generic neural clustering approach is proposed that can be integrated with any chunk-level predictor to accomplish a fully supervised end-to-end speaker diarization model. Also, by leveraging the sequence modelling ability of a recurrent neural network, the proposed neural clustering approach can dynamically estimate the number of speakers during inference. Experimental show that when integrating an attractor-based chunk-level predictor, the proposed neural clustering approach can yield better Diarization Error Rate (DER) than the constrained K-means-based clustering approaches under the mismatched conditions.

* submitted to INTERSPEECH 2022 

  Access Paper or Ask Questions

Fusion of Self-supervised Learned Models for MOS Prediction

Apr 11, 2022
Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li, Raj Dabre, Raphael Rubino, Yi Zhao

We participated in the mean opinion score (MOS) prediction challenge, 2022. This challenge aims to predict MOS scores of synthetic speech on two tracks, the main track and a more challenging sub-track: out-of-domain (OOD). To improve the accuracy of the predicted scores, we have explored several model fusion-related strategies and proposed a fused framework in which seven pretrained self-supervised learned (SSL) models have been engaged. These pretrained SSL models are derived from three ASR frameworks, including Wav2Vec, Hubert, and WavLM. For the OOD track, we followed the 7 SSL models selected on the main track and adopted a semi-supervised learning method to exploit the unlabeled data. According to the official analysis results, our system has achieved 1st rank in 6 out of 16 metrics and is one of the top 3 systems for 13 out of 16 metrics. Specifically, we have achieved the highest LCC, SRCC, and KTAU scores at the system level on main track, as well as the best performance on the LCC, SRCC, and KTAU evaluation metrics at the utterance level on OOD track. Compared with the basic SSL models, the prediction accuracy of the fused system has been largely improved, especially on OOD sub-track.

* MOS 2022 shared task system description paper 

  Access Paper or Ask Questions

Deep Embeddings for Robust User-Based Amateur Vocal Percussion Classification

Apr 10, 2022
Alejandro Delgado, Emir Demirel, Vinod Subramanian, Charalampos Saitis, Mark Sandler

Vocal Percussion Transcription (VPT) is concerned with the automatic detection and classification of vocal percussion sound events, allowing music creators and producers to sketch drum lines on the fly. Classifier algorithms in VPT systems learn best from small user-specific datasets, which usually restrict modelling to small input feature sets to avoid data overfitting. This study explores several deep supervised learning strategies to obtain informative feature sets for amateur vocal percussion classification. We evaluated the performance of these sets on regular vocal percussion classification tasks and compared them with several baseline approaches including feature selection methods and a speech recognition engine. These proposed learning models were supervised with several label sets containing information from four different levels of abstraction: instrument-level, syllable-level, phoneme-level, and boxeme-level. Results suggest that convolutional neural networks supervised with syllable-level annotations produced the most informative embeddings for classification, which can be used as input representations to fit classifiers with. Finally, we used back-propagation-based saliency maps to investigate the importance of different spectrogram regions for feature learning.

* Accepted at Sound and Music Computing (SMC) conference 2022 

  Access Paper or Ask Questions

Query Expansion and Entity Weighting for Query Reformulation Retrieval in Voice Assistant Systems

Feb 22, 2022
Zhongkai Sun, Sixing Lu, Chengyuan Ma, Xiaohu Liu, Chenlei Guo

Voice assistants such as Alexa, Siri, and Google Assistant have become increasingly popular worldwide. However, linguistic variations, variability of speech patterns, ambient acoustic conditions, and other such factors are often correlated with the assistants misinterpreting the user's query. In order to provide better customer experience, retrieval based query reformulation (QR) systems are widely used to reformulate those misinterpreted user queries. Current QR systems typically focus on neural retrieval model training or direct entities retrieval for the reformulating. However, these methods rarely focus on query expansion and entity weighting simultaneously, which may limit the scope and accuracy of the query reformulation retrieval. In this work, we propose a novel Query Expansion and Entity Weighting method (QEEW), which leverages the relationships between entities in the entity catalog (consisting of users' queries, assistant's responses, and corresponding entities), to enhance the query reformulation performance. Experiments on Alexa annotated data demonstrate that QEEW improves all top precision metrics, particularly 6% improvement in top10 precision, compared with baselines not using query expansion and weighting; and more than 5% improvement in top10 precision compared with other baselines using query expansion and weighting.

  Access Paper or Ask Questions

Adaptive Discounting of Implicit Language Models in RNN-Transducers

Feb 21, 2022
Vinit Unni, Shreya Khare, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi, Samarth Bharadwaj

RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique AdaptLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. AdaptLMD uses a two-pronged approach: 1) Randomly mask the prediction network output to encourage the RNN-T to not be overly reliant on it's outputs. 2) Dynamically choose when to discount the implicit LM (ILM) based on rarity of recently predicted tokens and divergence between ILM and implicit acoustic model (IAM) scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.

* Proceedings for ICASSP 2022 

  Access Paper or Ask Questions

Semantic Similarity Computing Model Based on Multi Model Fine-Grained Nonlinear Fusion

Feb 05, 2022
Peiying Zhang, Xingzhe Huang, Yaqi Wang, Chunxiao Jiang, Shuqing He, Haifeng Wang

Natural language processing (NLP) task has achieved excellent performance in many fields, including semantic understanding, automatic summarization, image recognition and so on. However, most of the neural network models for NLP extract the text in a fine-grained way, which is not conducive to grasp the meaning of the text from a global perspective. To alleviate the problem, the combination of the traditional statistical method and deep learning model as well as a novel model based on multi model nonlinear fusion are proposed in this paper. The model uses the Jaccard coefficient based on part of speech, Term Frequency-Inverse Document Frequency (TF-IDF) and word2vec-CNN algorithm to measure the similarity of sentences respectively. According to the calculation accuracy of each model, the normalized weight coefficient is obtained and the calculation results are compared. The weighted vector is input into the fully connected neural network to give the final classification results. As a result, the statistical sentence similarity evaluation algorithm reduces the granularity of feature extraction, so it can grasp the sentence features globally. Experimental results show that the matching of sentence similarity calculation method based on multi model nonlinear fusion is 84%, and the F1 value of the model is 75%.

  Access Paper or Ask Questions