The state-of-the-art speech enhancement has limited performance in speech estimation accuracy. Recently, in deep learning, the Transformer shows the potential to exploit the long-range dependency in speech by self-attention. Therefore, it is introduced in speech enhancement to improve the speech estimation accuracy from a noise mixture. However, to address the computational cost issue in Transformer with self-attention, the axial attention is the option i.e., to split a 2D attention into two 1D attentions. Inspired by the axial attention, in the proposed method we calculate the attention map along both time- and frequency-axis to generate time and frequency sub-attention maps. Moreover, different from the axial attention, the proposed method provides two parallel multi-head attentions for time- and frequency-axis. Furthermore, it is proven in the literature that the lower frequency-band in speech, generally, contains more desired information than the higher frequency-band, in a noise mixture. Therefore, the frequency-band aware attention is proposed i.e., high frequency-band attention (HFA), and low frequency-band attention (LFA). The U-shaped Transformer is also first time introduced in the proposed method to further improve the speech estimation accuracy. The extensive evaluations over four public datasets, confirm the efficacy of the proposed method.
Research has shown that trust is an essential aspect of human-computer interaction directly determining the degree to which the person is willing to use a system. An automatic prediction of the level of trust that a user has on a certain system could be used to attempt to correct potential distrust by having the system take relevant actions like, for example, apologizing or explaining its decisions. In this work, we explore the feasibility of automatically detecting the level of trust that a user has on a virtual assistant (VA) based on their speech. We developed a novel protocol for collecting speech data from subjects induced to have different degrees of trust in the skills of a VA. The protocol consists of an interactive session where the subject is asked to respond to a series of factual questions with the help of a virtual assistant. In order to induce subjects to either trust or distrust the VA's skills, they are first informed that the VA was previously rated by other users as being either good or bad; subsequently, the VA answers the subjects' questions consistently to its alleged abilities. All interactions are speech-based, with subjects and VAs communicating verbally, which allows the recording of speech produced under different trust conditions. Using this protocol, we collected a speech corpus in Argentine Spanish. We show clear evidence that the protocol effectively succeeded in influencing subjects into the desired mental state of either trusting or distrusting the agent's skills, and present results of a perceptual study of the degree of trust performed by expert listeners. Finally, we found that the subject's speech can be used to detect which type of VA they were using, which could be considered a proxy for the user's trust toward the VA's abilities, with an accuracy up to 76%, compared to a random baseline of 50%.
Applications which use human speech as an input require a speech interface with high recognition accuracy. The words or phrases in the recognised text are annotated with a machine-understandable meaning and linked to knowledge graphs for further processing by the target application. These semantic annotations of recognised words can be represented as a subject-predicate-object triples which collectively form a graph often referred to as a knowledge graph. This type of knowledge representation facilitates to use speech interfaces with any spoken input application, since the information is represented in logical, semantic form, retrieving and storing can be followed using any web standard query languages. In this work, we develop a methodology for linking speech input to knowledge graphs and study the impact of recognition errors in the overall process. We show that for a corpus with lower WER, the annotation and linking of entities to the DBpedia knowledge graph is considerable. DBpedia Spotlight, a tool to interlink text documents with the linked open data is used to link the speech recognition output to the DBpedia knowledge graph. Such a knowledge-based speech recognition interface is useful for applications such as question answering or spoken dialog systems.
Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
Recent advancement in Generative Adversarial Networks in speech synthesis domain, have shown, that it's possible to train GANs  in a reliable manner for high quality coherent waveform generation from mel-spectograms. We propose that it is possible to transfer the MelGAN's  robustness in learning speech features to speech enhancement and noise reduction domain without any model modification tasks. Our proposed method generalizes over multi-speaker speech dataset and is able to robustly handle unseen background noises during the inference. Also, we show that by increasing the batch size for this particular approach not only yields better speech results, but generalizes over multi-speaker dataset easily and leads to faster convergence. Additionally, it outperforms previous state of the art GAN approach for speech enhancement SEGAN  in two domains: 1. quality ; 2. speed. Proposed method runs at more than 100x faster than realtime on GPU and more than 2x faster than real time on CPU without any hardware optimization tasks, right at the speed of MelGAN .
An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus which is suitable for conducting the speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including audio capturing devices and environments are presented in details. The preparation of the related resources, including transcriptions and lexicon are described. The corpus is released with a Kaldi recipe. Experimental results implies that the quality of audio recordings and transcriptions are promising.
In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascaded based approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech transcripts as input.
A speech emotion recognition algorithm based on multi-feature and Multi-lingual fusion is proposed in order to resolve low recognition accuracy caused by lack of large speech dataset and low robustness of acoustic features in the recognition of speech emotion. First, handcrafted and deep automatic features are extracted from existing data in Chinese and English speech emotions. Then, the various features are fused respectively. Finally, the fused features of different languages are fused again and trained in a classification model. Distinguishing the fused features with the unfused ones, the results manifest that the fused features significantly enhance the accuracy of speech emotion recognition algorithm. The proposed solution is evaluated on the two Chinese corpus and two English corpus, and is shown to provide more accurate predictions compared to original solution. As a result of this study, the multi-feature and Multi-lingual fusion algorithm can significantly improve the speech emotion recognition accuracy when the dataset is small.