Brain-computer interface (BCI) is the technology that enables the communication between humans and devices by reflecting status and intentions of humans. When conducting imagined speech, the users imagine the pronunciation as if actually speaking. In the case of decoding imagined speech-based EEG signals, complex task can be conducted more intuitively, but decoding performance is lower than that of other BCI paradigms. We modified our previous model for decoding imagined speech-based EEG signals. Ten subjects participated in the experiment. The average accuracy of our proposed method was 0.5648 for classifying four words. In other words, our proposed method has significant strength in learning local features. Hence, we demonstrated the feasibility of decoding imagined speech-based EEG signals with robust performance.
Designing reliable Speech Emotion Recognition systems is a complex task that inevitably requires sufficient data for training purposes. Such extensive datasets are currently available in only a few languages, including English, German, and Italian. In this paper, we present SEMOUR, the first scripted database of emotion-tagged speech in the Urdu language, to design an Urdu Speech Recognition System. Our gender-balanced dataset contains 15,040 unique instances recorded by eight professional actors eliciting a syntactically complex script. The dataset is phonetically balanced, and reliably exhibits a varied set of emotions as marked by the high agreement scores among human raters in experiments. We also provide various baseline speech emotion prediction scores on the database, which could be used for various applications like personalized robot assistants, diagnosis of psychological disorders, and getting feedback from a low-tech-enabled population, etc. On a random test sample, our model correctly predicts an emotion with a state-of-the-art 92% accuracy.
There are a few reasons for the recent increased interest in the study of local features of speech files. It is stated that many essential features of the speaker language used can appear in the form of the speech signal. The traditional instruments - short Fourier transform, wavelet transform, Hadamard transforms, autocorrelation, and the like can detect not all particular properties of the language. In this paper, we suggest a new approach to the exploration of such properties. The source signal is approximated by a new one that has its values taken from a finite set. Then we construct a new sequence of vectors of a fixed size on the base of those approximations. Examination of the distribution of the produced vectors provides a new method for a description of speech files local characteristics. Finally, the developed technique is applied to the problem of the automatic distinguishing of two known languages used in speech files. For this purpose, a simple neural net is consumed.
Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where speaker activity obtained from a diarization system is used as a speaker clue for ADEnet. We show that this simple yet practical approach can successfully extract speakers after diarization, which leads to improved ASR performance when using a single microphone, especially in high overlapping conditions, with a relative word error rate reduction of up to 25 %.
This study proposes a trainable adaptive window switching (AWS) method and apply it to a deep-neural-network (DNN) for speech enhancement in the modified discrete cosine transform domain. Time-frequency (T-F) mask processing in the short-time Fourier transform (STFT)-domain is a typical speech enhancement method. To recover the target signal precisely, DNN-based short-time frequency transforms have recently been investigated and used instead of the STFT. However, since such a fixed-resolution short-time frequency transform method has a T-F resolution problem based on the uncertainty principle, not only the short-time frequency transform but also the length of the windowing function should be optimized. To overcome this problem, we incorporate AWS into the speech enhancement procedure, and the windowing function of each time-frame is manipulated using a DNN depending on the input signal. We confirmed that the proposed method achieved a higher signal-to-distortion ratio than conventional speech enhancement methods in fixed-resolution frequency domains.
In this paper, we present a data set and methods to compare speech processing models and human behaviour on a phone discrimination task. We provide Perceptimatic, an open data set which consists of French and English speech stimuli, as well as the results of 91 English- and 93 French-speaking listeners. The stimuli test a wide range of French and English contrasts, and are extracted directly from corpora of natural running read speech, used for the 2017 Zero Resource Speech Challenge. We provide a method to compare humans' perceptual space with models' representational space, and we apply it to models previously submitted to the Challenge. We show that, unlike unsupervised models and supervised multilingual models, a standard supervised monolingual HMM-GMM phone recognition system, while good at discriminating phones, yields a representational space very different from that of human native listeners.
Speech enhancement aims to obtain speech signals with high intelligibility and quality from noisy speech. Recent work has demonstrated the excellent performance of time-domain deep learning methods, such as Conv-TasNet. However, these methods can be degraded by the arbitrary scales of the waveform induced by the scale-invariant signal-to-noise ratio (SI-SNR) loss. This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN), which is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem, and provide model training stability, thus achieving performance improvement. In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN, and explain why it is better than the Wasserstein GAN. Experiments conducted demonstrate the effectiveness of our proposed method, and illustrate the advantage of Metric GAN.
Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the acoustic signal is corrupted. Multi-modal speech recognition however has not yet found wide-spread use, mostly because the temporal alignment and fusion of the different information sources is challenging. This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function. CTC creates sparse "peaky" output activations, and we analyze the differences in the alignments of output targets (phonemes or visemes) between audio-only, video-only, and audio-visual feature representations. We present the first such experiments on the large vocabulary IBM ViaVoice database, which outperform previously published approaches on phone accuracy in clean and noisy conditions.
This paper discusses the dominancy of local features (LFs), as input to the multilayer neural network (MLN), extracted from a Bangla input speech over mel frequency cepstral coefficients (MFCCs). Here, LF-based method comprises three stages: (i) LF extraction from input speech, (ii) phoneme probabilities extraction using MLN from LF and (iii) the hidden Markov model (HMM) based classifier to obtain more accurate phoneme strings. In the experiments on Bangla speech corpus prepared by us, it is observed that the LFbased automatic speech recognition (ASR) system provides higher phoneme correct rate than the MFCC-based system. Moreover, the proposed system requires fewer mixture components in the HMMs.
State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on either limited or noisy datasets. Further, different latent variables start encoding the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer with Information reduction VAE) where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that RTI-VAE reduces the cluster overlap of speaker attributes by at least 30\% over LSTM-VAE and by at least 7\% over vanilla Transformer-VAE.