We present a corpus-based analysis of the effects of age, gender, and region of origin on the production of "netspeak" (or "chatspeak") features and regional speech features in Flemish Dutch posts collected from a Belgian online social network platform. The present study shows that combining quantitative and qualitative approaches is essential for understanding non-standard linguistic variation in a CMC corpus. It also presents a methodology that enables the systematic study of this variation by including all non-standard words in the corpus. The analyses provide a convincing illustration of the Adolescent Peak Principle. In addition, our approach revealed an intriguing correlation between the use of regional speech features and chatspeak features.
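A minimal sketch of the counting step such a methodology implies: flag every token absent from a standard-language lexicon, then aggregate non-standard token rates per demographic group. The toy lexicon, the post data, and the age-group labels below are hypothetical placeholders, not the study's actual resources.

```python
# Hedged sketch: per-group non-standard token rates over a toy corpus.
from collections import defaultdict

# Hypothetical standard-Dutch word list; a real study would use a full
# lexicon (e.g. a spell-checker dictionary).
STANDARD_LEXICON = {"dag", "hoe", "gaat", "het", "met", "jou"}

def nonstandard_rate(tokens):
    """Fraction of tokens not found in the standard lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() not in STANDARD_LEXICON)
    return hits / len(tokens)

# posts: (age_group, tokenized_post) pairs -- illustrative stand-in data.
posts = [
    ("13-16", ["dagse", "hoe", "gaat", "t", "me", "jou"]),
    ("17-20", ["dag", "hoe", "gaat", "het", "met", "jou"]),
]

rates = defaultdict(list)
for age_group, tokens in posts:
    rates[age_group].append(nonstandard_rate(tokens))

for group, vals in sorted(rates.items()):
    print(group, sum(vals) / len(vals))
```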
In cocktail party listening scenarios, the human brain is able to separate competing speech signals. However, the signal processing the brain implements to perform cocktail party listening is not well understood. Here, we trained two separate convolutive autoencoder deep neural networks (DNNs) to separate monaural and binaural mixtures of two concurrent speech streams. We then used these DNNs as convolutive deep transform (CDT) devices to perform probabilistic re-synthesis. The CDTs operated directly in the time domain. Our simulations demonstrate that very simple neural networks are capable of exploiting the monaural and binaural information available in a cocktail party listening scenario.
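A minimal sketch of a time-domain convolutive autoencoder for two-speaker separation, assuming PyTorch. The layer sizes, training objective, and toy data are illustrative, not the paper's exact configuration, and the probabilistic re-synthesis step is omitted; a binaural variant would simply use two input channels.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, channels=64, kernel=33):
        super().__init__()
        pad = kernel // 2
        # Encoder: convolve the raw mixture waveform.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad), nn.ReLU(),
        )
        # Decoder: map back to a single-channel source estimate.
        self.decoder = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel, padding=pad),
        )

    def forward(self, mixture):           # mixture: (batch, 1, samples)
        return self.decoder(self.encoder(mixture))

# One training step: regress the target speaker's waveform from the mixture.
model = ConvAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mixture = torch.randn(4, 1, 16000)        # toy 1-second mixtures @ 16 kHz
target = torch.randn(4, 1, 16000)         # toy target-source waveforms
loss = nn.functional.mse_loss(model(mixture), target)
loss.backward()
opt.step()
```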
Speech feature extraction has been a key focus of robust speech recognition research, as it significantly affects recognition performance. In this paper, we first study a set of feature extraction methods, such as linear predictive coding (LPC), mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP), together with several feature normalization techniques, such as RASTA filtering and cepstral mean subtraction (CMS). On this basis, a comparative evaluation of these features is performed on the task of text-independent speaker identification using a combination of Gaussian mixture models (GMMs) and support vector machines (SVMs) with linear and non-linear kernels.
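A minimal sketch of one feature pipeline from this comparison: MFCC extraction with cepstral mean subtraction, then per-speaker GMMs scored on test features. It assumes librosa and scikit-learn; parameter values and file paths are illustrative, and the SVM stage of the GMM/SVM combination is omitted for brevity.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_cms(wav_path, n_mfcc=13):
    """MFCCs with per-utterance cepstral mean subtraction (CMS)."""
    y, sr = librosa.load(wav_path, sr=16000)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, coeffs)
    return feats - feats.mean(axis=0, keepdims=True)           # CMS

def train_speaker_model(train_paths, n_components=32):
    """Fit one diagonal-covariance GMM on a speaker's pooled features."""
    feats = np.vstack([mfcc_cms(p) for p in train_paths])
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(feats)

def identify(test_path, speaker_models):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    feats = mfcc_cms(test_path)
    return max(speaker_models, key=lambda s: speaker_models[s].score(feats))
```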
Stress is a major threat to well-being that manifests in a variety of physiological and mental symptoms. Utilising speech samples collected while the subject is undergoing an induced stress episode has recently shown promising results for the automatic characterisation of individual stress responses. In this work, we introduce new findings that shed light on whether speech signals are suited to modelling physiological biomarkers, as obtained via cortisol measurements, or self-assessed appraisal and affect measurements. Our results show that the different indicators impact acoustic features in diverse ways, but that their complementary information can nevertheless be effectively harnessed by a multi-task architecture to improve prediction performance for all of them.
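A minimal sketch of such a multi-task regressor over acoustic features, assuming PyTorch: a shared encoder with one head per stress indicator, so that complementary label information is pooled in the shared layers. The feature dimension (88, as in an eGeMAPS-style functional set), hidden sizes, and the head names are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiTaskStressModel(nn.Module):
    def __init__(self, feat_dim=88, hidden=128, tasks=("cortisol", "affect")):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One regression head per indicator; all heads share the encoder.
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, 1) for t in tasks})

    def forward(self, x):
        h = self.shared(x)
        return {t: head(h).squeeze(-1) for t, head in self.heads.items()}

model = MultiTaskStressModel()
x = torch.randn(8, 88)                     # toy batch of acoustic functionals
targets = {"cortisol": torch.randn(8), "affect": torch.randn(8)}
preds = model(x)
# Total loss = sum of per-task losses; the weighting is a tunable choice.
loss = sum(nn.functional.mse_loss(preds[t], targets[t]) for t in targets)
loss.backward()
```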
Automated speech and text interfaces are continuing to improve, resulting in increased research on dialogue systems. Moreover, conferences and workshops in various fields are focusing more on language in speech and text modalities as candidates for interaction with applications such as search interfaces and robots. In this paper, we explore how visible the SigDial conference is to outside communities by analysing papers from top Natural Language Processing conferences since 2015 to determine the popularity of certain SigDial-related topics, as well as analysing which SigDial papers are being cited by others outside of SigDial. We find that despite a dramatic increase in dialogue-related research, SigDial's visibility has not increased. We conclude by offering some suggestions.
We propose a method to reduce false voice triggers of a speech-enabled personal assistant by post-processing the hypothesis lattice of a server-side large-vocabulary continuous speech recognizer (LVCSR) with a neural network. We first discuss how an estimate of the posterior probability of the trigger phrase can be obtained from the hypothesis lattice using known techniques and used to perform detection; we then investigate a statistical model that processes the lattice in a more explicitly data-driven, discriminative manner. We propose using a bidirectional lattice recurrent neural network (LatticeRNN) for the task, and show that it can significantly improve detection accuracy over using the 1-best result or the posterior.
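A minimal sketch of the posterior-based baseline: estimate the trigger-phrase posterior as the probability mass of hypotheses containing the phrase. The lattice is simplified here to an explicit list of (hypothesis, log-score) pairs; a real LVCSR lattice would be traversed with forward-backward rather than enumerated, and the phrase and scores below are invented for illustration.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def trigger_posterior(lattice, phrase="hey assistant"):
    """P(phrase | audio), estimated from normalized lattice path scores."""
    total = logsumexp([s for _, s in lattice])
    match = [s for hyp, s in lattice if phrase in hyp]
    if not match:
        return 0.0
    return math.exp(logsumexp(match) - total)

# Toy lattice: competing hypotheses with combined acoustic+LM log-scores.
lattice = [
    ("hey assistant set a timer", -10.2),
    ("hey a sister set a timer", -11.5),
    ("play a distant set a timer", -13.0),
]
print(f"trigger posterior: {trigger_posterior(lattice):.3f}")
# Detection: compare the posterior against a threshold. The LatticeRNN
# replaces this fixed formula with a learned, discriminative score
# computed over the same lattice structure.
```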
In most current approaches to speech processing, information is extracted from the magnitude spectrum. However, recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of phase-based features for automatically detecting voice disorders. It is shown that group delay functions are appropriate for characterizing irregularities in phonation. In addition, compliance with the mixed-phase model of speech is discussed. The proposed phase-based features are evaluated and compared to other parameters derived from the magnitude spectrum. The two feature streams are shown to be complementary. Furthermore, phase-based features turn out to convey a substantial amount of relevant information, leading to high discrimination performance.
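A minimal sketch of the group delay function for one speech frame, computed without explicit phase unwrapping via the standard identity tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, where Y is the DFT of n*x[n]. The frame length, window, and synthetic input are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-10):
    """Group delay (in samples) of a windowed speech frame."""
    x = frame * np.hamming(len(frame))
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)          # DFT of the time-weighted frame
    num = X.real * Y.real + X.imag * Y.imag
    return num / (np.abs(X) ** 2 + eps)    # = -d(phase)/d(omega), numerically

# Toy usage on a synthetic 25 ms frame at 16 kHz.
t = np.arange(400) / 16000.0
frame = np.sin(2 * np.pi * 150 * t)        # stand-in for a voiced frame
tau = group_delay(frame)
print(tau.shape)                            # one value per rfft bin
```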
We introduce a data-driven approach for unsupervised video retargeting that translates content from one domain to another while preserving the style native to a domain; i.e., if the content of John Oliver's speech were transferred to Stephen Colbert, the generated content/speech should be in Stephen Colbert's style. Our approach combines spatial and temporal information with adversarial losses for content translation and style preservation. In this work, we first study the advantages of using spatiotemporal constraints over spatial constraints for effective retargeting. We then demonstrate the proposed approach on problems where information in both space and time matters, such as face-to-face translation, flower-to-flower translation, wind and cloud synthesis, and sunrise and sunset.
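A minimal sketch contrasting the purely spatial cycle constraint with a spatiotemporal ("recycle"-style) constraint, assuming PyTorch. The generators G_xy/G_yx, the temporal predictor P_y, and all tensor shapes are placeholder assumptions for illustration, not the paper's actual networks.

```python
import torch
import torch.nn as nn

G_xy = nn.Conv2d(3, 3, 3, padding=1)   # X -> Y generator (placeholder)
G_yx = nn.Conv2d(3, 3, 3, padding=1)   # Y -> X generator (placeholder)
P_y = nn.Conv2d(6, 3, 3, padding=1)    # predicts next Y frame from two past

x_t, x_t1, x_t2 = (torch.randn(1, 3, 64, 64) for _ in range(3))  # frames t, t+1, t+2

# Spatial-only cycle loss: translate and translate back, frame by frame.
cycle = nn.functional.l1_loss(G_yx(G_xy(x_t)), x_t)

# Spatiotemporal loss: translate a pair of frames, predict the next frame
# in the target domain, translate back, and match the true future frame --
# this couples style transfer with the temporal dynamics of the domain.
y_t, y_t1 = G_xy(x_t), G_xy(x_t1)
y_t2_pred = P_y(torch.cat([y_t, y_t1], dim=1))
recycle = nn.functional.l1_loss(G_yx(y_t2_pred), x_t2)

loss = cycle + recycle    # adversarial terms would be added during training
```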
As more users across the world interact with dialog agents in their daily lives, there is a need for better speech understanding, which calls for renewed attention to the dynamics between research in automatic speech recognition (ASR) and natural language understanding (NLU). We briefly review these research areas and lay out the current relationship between them. In light of the observations we make in this paper, we argue that (1) NLU should be cognizant of the presence of ASR models used upstream in a dialog system's pipeline, (2) ASR should be able to learn from errors found in NLU, (3) there is a need for end-to-end datasets that provide semantic annotations on spoken input, and (4) there should be stronger collaboration between the ASR and NLU research communities.
This system description covers our submission to the Third DIHARD Speech Diarization Challenge. Beyond the traditional clustering-based system, the innovation of our system lies in combining various front-end techniques to solve the diarization problem, including speech separation and target-speaker voice activity detection (TS-VAD), together with iterative data purification. We also adopted audio domain classification to design domain-dependent processing. Finally, we performed post-processing for system fusion and selection. Our best system achieved DERs of 11.30% in track 1 and 16.78% in track 2 on the evaluation set.
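A minimal sketch of the traditional clustering-based diarization backbone such systems build on: segment the audio, embed each segment, and cluster the embeddings into speakers. The embedding function below is a random placeholder; real systems use a trained x-vector/d-vector extractor, and stages like TS-VAD, separation, and purification would refine these initial labels.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def embed_segment(segment):
    """Placeholder speaker embedding (real systems use a neural extractor)."""
    rng = np.random.default_rng(abs(hash(segment.tobytes())) % (2**32))
    return rng.standard_normal(128)

def cluster_speakers(segments, distance_threshold=1.0):
    """Agglomerative clustering over segment embeddings -> speaker labels."""
    X = np.stack([embed_segment(s) for s in segments])
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize embeddings
    ahc = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    )
    return ahc.fit_predict(X)

# Toy usage: ten fake 1-second segments at 16 kHz.
segments = [np.random.randn(16000) for _ in range(10)]
print(cluster_speakers(segments))   # one speaker label per segment
```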