"speech": models, code, and papers

Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing?

Mar 06, 2020
Yu Zhang, Zhenghua Li, Houquan Zhou, Min Zhang

In the pre-deep-learning era, part-of-speech tags were considered indispensable ingredients for feature engineering in dependency parsing due to their important role in alleviating the data sparseness of purely lexical features, and quite a few works focused on joint tagging and parsing models to avoid error propagation. In contrast, recent studies suggest that POS tagging becomes much less important or even useless for neural parsing, especially when using character-based word representations such as CharLSTM. Yet a full and systematic investigation of this interesting issue, both empirical and linguistic, is still lacking. To answer this, we design four typical multi-task learning frameworks (i.e., Share-Loose, Share-Tight, Stack-Discrete, Stack-Hidden) for joint tagging and parsing based on the state-of-the-art biaffine parser. Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS-tag data. We conduct experiments on both English and Chinese datasets, and the results clearly show that POS tagging (both homogeneous and heterogeneous) can still significantly improve parsing performance when using the Stack-Hidden joint framework. We conduct detailed analysis and gain more insights from the linguistic aspect.

Integrating Discrete and Neural Features via Mixed-feature Trans-dimensional Random Field Language Models

Feb 14, 2020
Silin Gao, Zhijian Ou, Wei Yang, Huifang Xu

It has long been recognized that discrete features (n-gram features) and neural network based features have complementary strengths for language models (LMs). Improved performance can be obtained by model interpolation, which is, however, a suboptimal two-step integration of discrete and neural features. The trans-dimensional random field (TRF) framework has the potential advantage of being able to flexibly integrate a richer set of features. However, previous TRF LMs use either discrete or neural features alone. This paper develops a mixed-feature TRF LM and demonstrates its advantage in integrating discrete and neural features. Various LMs are trained over PTB and Google one-billion-word datasets, and evaluated in N-best list rescoring experiments for speech recognition. Among all single LMs (i.e. without model interpolation), the mixed-feature TRF LMs perform the best, improving over both discrete TRF LMs and neural TRF LMs alone, and also being significantly better than LSTM LMs. Compared to interpolating two separately trained models with discrete and neural features respectively, the mixed-feature TRF LMs match the performance of the best interpolated model while offering a simplified one-step training process and reduced training time.

* 5 pages, 2 figures 
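
The two-step baseline named in this abstract, model interpolation followed by N-best rescoring, can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's TRF model: linear interpolation is assumed as the mixing rule, and the function names, the weight `lam`, the `lm_weight` scale, and all scores below are hypothetical.

```python
import math


def interpolate(p_discrete, p_neural, lam):
    """Mix two per-word probability lists; lam is the weight of the discrete LM."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * pd + (1.0 - lam) * pn for pd, pn in zip(p_discrete, p_neural)]


def lm_log_prob(word_probs):
    """Sentence log-probability from per-word probabilities."""
    return sum(math.log(p) for p in word_probs)


def rescore_nbest(nbest, lam, lm_weight=1.0):
    """Return the hypothesis with the best acoustic + weighted LM log-score."""
    def total(hyp):
        mixed = interpolate(hyp["p_discrete"], hyp["p_neural"], lam)
        return hyp["am_score"] + lm_weight * lm_log_prob(mixed)
    return max(nbest, key=total)


# Toy, made-up numbers purely for illustration (assumption, not from the paper).
nbest = [
    {"text": "recognize speech", "am_score": -10.0,
     "p_discrete": [0.2, 0.1], "p_neural": [0.3, 0.2]},
    {"text": "wreck a nice beach", "am_score": -9.5,
     "p_discrete": [0.05, 0.1, 0.1, 0.02], "p_neural": [0.02, 0.1, 0.2, 0.01]},
]
best = rescore_nbest(nbest, lam=0.5)
print(best["text"])
```

In this setup the two LMs are trained separately and `lam` is tuned afterwards, which is the two-step integration the abstract contrasts with single-model training.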

Emotion Detection and Analysis on Social Media

Jan 24, 2019
Bharat Gaind, Varun Syal, Sneha Padgalwar

In this paper, we address the problem of detection, classification and quantification of emotions of text in any form. We consider English text collected from social media like Twitter, which can provide information having utility in a variety of ways, especially opinion mining. Social media like Twitter and Facebook are full of emotions, feelings and opinions of people all over the world. However, analyzing and classifying text on the basis of emotions is a big challenge and can be considered an advanced form of Sentiment Analysis. This paper proposes a method to classify text into six different Emotion-Categories: Happiness, Sadness, Fear, Anger, Surprise and Disgust. In our model, we use two different approaches and combine them to effectively extract these emotions from text. The first approach is based on Natural Language Processing, and uses several textual features like emoticons, degree words and negations, parts of speech and other grammatical analysis. The second approach is based on Machine Learning classification algorithms. We have also successfully devised a method to automate the creation of the training set itself, so as to eliminate the need for manual annotation of large datasets. Moreover, we have managed to create a large bag of emotional words, along with their emotion intensities. On testing, it is shown that our model provides significant accuracy in classifying tweets taken from Twitter.

* In the proceedings of International Conference on Recent Trends In Computational Engineering and Technologies (ICTRCET'18), May 17-18, 2018, Bengaluru, India. ISBN: 978-93-88775-00-7 

DASPS: A Database for Anxious States based on a Psychological Stimulation

Jan 09, 2019
Asma Baghdadi, Yassine Aribi, Rahma Fourati, Najla Halouani, Patrick Siarry, Adel M. Alimi

Anxiety affects human capabilities and behavior as much as it affects productivity and quality of life. It can be considered as the main cause of depression and suicide. Anxious states are easily detectable by humans due to their acquired cognition: humans interpret the interlocutor's tone of speech, gestures and facial expressions and recognize their mental state. There is a need for non-invasive, reliable techniques that perform the complex task of anxiety detection. In this paper, we present the DASPS database containing recorded Electroencephalogram (EEG) signals of 23 participants during anxiety elicitation by means of face-to-face psychological stimuli. EEG signals were captured with the Emotiv Epoc headset because it is a wireless, wearable, low-cost device. In our study, we investigate the impact of different parameters, notably: trial duration, feature type, feature combination and number of anxiety levels. Our findings showed that anxiety is well elicited in 1 second. For instance, a stacked sparse autoencoder with different types of features achieves 83.50% and 74.60% for 2 and 4 anxiety levels detection, respectively. The presented results prove the benefits of the use of a low-cost EEG headset instead of medical non-wireless devices and create a starting point for new research in the field of anxiety detection.

* 12 pages, IEEE TAFFC 

Understanding and Controlling User Linkability in Decentralized Learning

May 15, 2018
Tribhuvanesh Orekondy, Seong Joon Oh, Bernt Schiele, Mario Fritz

Machine Learning techniques are widely used by online services (e.g. Google, Apple) in order to analyze and make predictions on user data. As many of the provided services are user-centric (e.g. personal photo collections, speech recognition, personal assistance), user data generated on personal devices is key to providing the service. In order to protect the data and the privacy of the user, federated learning techniques have been proposed where the data never leaves the user's device and "only" model updates are communicated back to the server. In our work, we propose a new threat model that is not concerned with learning about the content, but rather with the linkability of users during such decentralized learning scenarios. We show that model updates are characteristic for users and therefore lend themselves to linkability attacks. We show identification and matching of users across devices in closed and open world scenarios. In our experiments, we find our attacks to be highly effective, achieving 20x-175x chance-level performance. In order to mitigate the risks of linkability attacks, we study various strategies. As adding random noise does not offer convincing operation points, we propose strategies based on using calibrated domain-specific data; we find that these strategies offer substantial protection against linkability threats with little effect on utility.

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

Mar 03, 2018
Zhehuai Chen, Qi Liu, Hao Li, Kai Yu

End-to-end (E2E) automatic speech recognition (ASR) systems directly map acoustics to words using a unified model. Previous works mostly focus on E2E training of a single model which integrates the acoustic and language models into a whole. Although E2E training benefits from sequence modeling and simplified decoding pipelines, a large amount of transcribed acoustic data is usually required, and traditional acoustic and language modelling techniques cannot be utilized. In this paper, a novel modular training framework of E2E ASR is proposed to separately train neural acoustic and language models during the training stage, while still performing end-to-end inference in the decoding stage. Here, an acoustics-to-phoneme model (A2P) and a phoneme-to-word model (P2W) are trained using acoustic data and text data respectively. A phone synchronous decoding (PSD) module is inserted between A2P and P2W to reduce sequence lengths without precision loss. Finally, modules are integrated into an acoustics-to-word model (A2W) and jointly optimized using acoustic data to retain the advantage of sequence modeling. Experiments on a 300-hour Switchboard task show significant improvement over the direct A2W model. The efficiency in both training and decoding also benefits from the proposed method.

* accepted by ICASSP2018 

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

Aug 13, 2017
Amirsina Torfi, Seyed Mehdi Iranmanesh, Nasser M. Nasrabadi, Jeremy Dawson

Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one modality to improve the recognition ability of the other modality by complementing the missing information. The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this work. We propose the use of a coupled 3D Convolutional Neural Network (3D-CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio-visual streams using the learned multimodal features. The proposed architecture will incorporate both spatial and temporal information jointly to effectively find the correlation between temporal information for different modalities. By using a relatively small network architecture and a much smaller dataset for training, our proposed method surpasses the performance of existing similar methods for audio-visual matching which use 3D CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase the performance. The proposed method achieves relative improvements over 20% on the Equal Error Rate (EER) and over 7% on the Average Precision (AP) in comparison to the state-of-the-art method.

* IEEE Access (Year: 2017, Volume: PP, Issue: 99 ) 

Neural Architecture Search with Reinforcement Learning

Feb 15, 2017
Barret Zoph, Quoc V. Le

Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.

A General Framework for Density Based Time Series Clustering Exploiting a Novel Admissible Pruning Strategy

Dec 02, 2016
Nurjahan Begum, Liudmila Ulanova, Hoang Anh Dau, Jun Wang, Eamonn Keogh

Time Series Clustering is an important subroutine in many higher-level data mining analyses, including data editing for classifiers, summarization, and outlier detection. It is well known that for similarity search the superiority of Dynamic Time Warping (DTW) over Euclidean distance gradually diminishes as we consider ever larger datasets. However, as we shall show, the same is not true for clustering. Clustering time series under DTW remains a computationally expensive operation. In this work, we address this issue in two ways. We propose a novel pruning strategy that exploits both the upper and lower bounds to prune off a very large fraction of the expensive distance calculations. This pruning strategy is admissible and gives us provably identical results to the brute force algorithm, but is at least an order of magnitude faster. For datasets where even this level of speedup is inadequate, we show that we can use a simple heuristic to order the unavoidable calculations in a most-useful-first ordering, thus casting the clustering into an anytime framework. We demonstrate the utility of our ideas with both single and multidimensional case studies in the domains of astronomy, speech physiology, medicine and entomology. In addition, we show the generality of our clustering framework to other domains by efficiently obtaining semantically significant clusters in protein sequences using the Edit Distance, the discrete data analogue of DTW.
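
To make the idea of admissible pruning concrete, here is a minimal sketch of how an upper and a lower bound can settle a "within radius r?" question without running full DTW. It is not the paper's pruning strategy: the particular bounds (first/last-point lower bound, diagonal-alignment upper bound for equal-length series), the absolute-difference cost, and all names are assumptions chosen for brevity.

```python
def dtw(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(b)]


def lower_bound(a, b):
    """Every warping path aligns first with first and last with last."""
    if len(a) == 1 and len(b) == 1:
        return abs(a[0] - b[0])
    return abs(a[0] - b[0]) + abs(a[-1] - b[-1])


def upper_bound(a, b):
    """The diagonal is a valid warping path, so its cost bounds DTW from above."""
    if len(a) != len(b):
        raise ValueError("this upper bound assumes equal-length series")
    return sum(abs(x - y) for x, y in zip(a, b))


def within_radius(a, b, r, stats):
    """Decide dtw(a, b) <= r, computing full DTW only when the bounds disagree."""
    if lower_bound(a, b) > r:
        return False          # pruned: cannot be a neighbour
    if upper_bound(a, b) <= r:
        return True           # pruned: certainly a neighbour
    stats["dtw_calls"] += 1
    return dtw(a, b) <= r


a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]      # shifted copy of a
c = [5, 5, 6, 7, 6, 5]      # far from a
d = [0, 0, 1, 2, 1, 0.5]    # nearly identical to a
stats = {"dtw_calls": 0}
results = [within_radius(a, s, 1.0, stats) for s in (b, c, d)]
print(results, stats)
```

Because both bounds are valid, the answers are identical to brute force; only the number of full DTW evaluations changes, which is the property the abstract calls admissible.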

Large-Scale Approximate Kernel Canonical Correlation Analysis

Feb 29, 2016
Weiran Wang, Karen Livescu

Kernel canonical correlation analysis (KCCA) is a nonlinear multi-view representation learning technique with broad applicability in statistics and machine learning. Although there is a closed-form solution for the KCCA objective, it involves solving an $N\times N$ eigenvalue system where $N$ is the training set size, making its computational requirements in both memory and time prohibitive for large-scale problems. Various approximation techniques have been developed for KCCA. A commonly used approach is to first transform the original inputs to an $M$-dimensional random feature space so that inner products in the feature space approximate kernel evaluations, and then apply linear CCA to the transformed inputs. In many applications, however, the dimensionality $M$ of the random feature space may need to be very large in order to obtain a sufficiently good approximation; it then becomes challenging to perform the linear CCA step on the resulting very high-dimensional data matrices. We show how to use a stochastic optimization algorithm, recently proposed for linear CCA and its neural-network extension, to further alleviate the computation requirements of approximate KCCA. This approach allows us to run approximate KCCA on a speech dataset with $1.4$ million training samples and a random feature space of dimensionality $M=100000$ on a typical workstation.

* Published as a conference paper at International Conference on Learning Representations (ICLR) 2016 
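
The random-feature step described in this abstract can be sketched in a few lines. The example below assumes a Gaussian (RBF) kernel and the standard random Fourier feature construction; the names `make_rff`, `rbf`, the bandwidth `sigma`, and the small value of `num_features` (standing in for $M$) are illustrative assumptions, and the subsequent linear CCA step is not shown.

```python
import math
import random


def make_rff(dim, num_features, sigma, seed=0):
    """Build a feature map phi so that phi(x) . phi(y) approximates an RBF kernel."""
    rng = random.Random(seed)
    # Frequencies drawn from N(0, 1/sigma^2), phases from U[0, 2*pi].
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
          for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def phi(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return phi


def rbf(x, y, sigma):
    """Exact kernel value exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


phi = make_rff(dim=2, num_features=2000, sigma=1.0)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = dot(phi(x), phi(y))
exact = rbf(x, y, 1.0)
print(approx, exact)
```

The approximation error shrinks roughly as one over the square root of the number of features, which is why $M$ may need to be very large and why the downstream linear CCA step becomes the bottleneck.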
