"speech": models, code, and papers

Comparing Performance of Different Linguistically-Backed Word Embeddings for Cyberbullying Detection

Jun 04, 2022
Juuso Eronen, Michal Ptaszynski, Fumito Masui

In most cases, word embeddings are learned only from raw tokens or, in some cases, lemmas. This includes pre-trained language models like BERT. To investigate the potential of capturing deeper relations between lexical items and structures and to filter out redundant information, we propose to preserve morphological, syntactic and other types of linguistic information by combining them with the raw tokens or lemmas. This means, for example, including parts-of-speech or dependency information within the lexical features used. The word embeddings can then be trained on the combinations instead of just raw tokens. It is also possible to later apply this method to the pre-training of huge language models and possibly enhance their performance. This would aid in tackling problems that are more sophisticated from the point of view of linguistic representation, such as detection of cyberbullying.

* Proceedings of the 2021 International Workshop on Modern Science and Technology, September 29, 2021
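
The idea of combining tokens or lemmas with linguistic annotations can be illustrated with a minimal sketch. The `Token` record, the `combine` helper, the underscore separator and the hand-written tags are all assumptions made for illustration; the listing does not state the exact feature format the authors used. The combined strings would then be fed to an embedding trainer in place of raw tokens.

```python
from typing import List, NamedTuple, Sequence


class Token(NamedTuple):
    """Hypothetical pre-annotated token (tags written by hand here)."""
    text: str
    lemma: str
    pos: str
    dep: str


def combine(tokens: Sequence[Token], base: str = "text",
            extras: Sequence[str] = ("pos",), sep: str = "_") -> List[str]:
    """Join a base form (text or lemma) with the chosen annotation fields."""
    features = []
    for tok in tokens:
        parts = [getattr(tok, base)] + [getattr(tok, name) for name in extras]
        features.append(sep.join(parts))
    return features


sentence = [
    Token("You", "you", "PRON", "nsubj"),
    Token("are", "be", "AUX", "cop"),
    Token("awful", "awful", "ADJ", "ROOT"),
]

token_pos = combine(sentence)
lemma_pos_dep = combine(sentence, base="lemma", extras=("pos", "dep"))
```

Each combined string becomes one vocabulary item, so the same surface word with different tags gets a separate embedding.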

Multimodal Representation Learning With Text and Images

Apr 30, 2022
Aishwarya Jayagopal, Ankireddy Monica Aiswarya, Ankita Garg, Srinivasan Kolumam Nandakumar

In recent years, multimodal AI has seen an upward trend as researchers integrate data of different types, such as text, images, and speech, into modelling to get the best results. This project leverages multimodal AI and matrix factorization techniques for representation learning on text and image data simultaneously, thereby employing the widely used techniques of Natural Language Processing (NLP) and Computer Vision. The learnt representations are evaluated using downstream classification and regression tasks. The methodology adopted can be extended beyond the scope of this project as it uses Auto-Encoders for unsupervised representation learning.

GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

Oct 23, 2020
Rui Liu, Berrak Sisman, Haizhou Li

Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways. Transformer-based TTS is one such successful implementation. While Transformer TTS models the speech frame sequence well with a self-attention mechanism, it does not associate input text with output utterances from a syntactic point of view at sentence level. We propose a novel neural TTS model, denoted as GraphSpeech, that is formulated within a graph neural network framework. GraphSpeech explicitly encodes the syntactic relations of input lexical tokens in a sentence, and incorporates such information to derive syntactically motivated character embeddings for the TTS attention mechanism. Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.

* This paper was submitted to ICASSP2021
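
To make "encoding the syntactic relation of input tokens" concrete, the sketch below turns a dependency parse into an undirected adjacency matrix of the kind a graph attention layer could use as a mask. This is an illustrative assumption, not the GraphSpeech architecture: the abstract does not describe how the graph is built, the example works at word level rather than character level, and the `heads` encoding (index of each word's head, `-1` for the root) is a hypothetical convention.

```python
from typing import List


def dependency_adjacency(heads: List[int]) -> List[List[int]]:
    """Build a symmetric adjacency matrix with self-loops from head indices.

    heads[i] is the index of the syntactic head of word i, or -1 for the root.
    """
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for child, head in enumerate(heads):
        adj[child][child] = 1          # self-loop: a word always sees itself
        if head >= 0:
            adj[child][head] = 1       # child -> head edge
            adj[head][child] = 1       # head -> child edge
    return adj


# "She reads books": "reads" is the root; the other two words attach to it.
heads = [1, -1, 1]
adjacency = dependency_adjacency(heads)
```

With such a mask, attention between "She" and "books" would be blocked because they are not directly connected in the parse.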

Gaze-Vergence-Controlled See-Through Vision in Augmented Reality

Jul 06, 2022
Zhimin Wang, Yuxin Zhao, Feng Lu

Augmented Reality (AR) see-through vision is an interesting research topic since it enables users to see through a wall and see the occluded objects. Most existing research focuses on the visual effects of see-through vision, while the interaction method is less studied. However, we argue that using common interaction modalities, e.g., midair click and speech, may not be the optimal way to control see-through vision. This is because when we want to see through something, it is physically related to our gaze depth/vergence and thus should be naturally controlled by the eyes. Following this idea, this paper proposes a novel gaze-vergence-controlled (GVC) see-through vision technique in AR. Since gaze depth is needed, we build a gaze tracking module with two infrared cameras and the corresponding algorithm and assemble it into the Microsoft HoloLens 2 to achieve gaze depth estimation. We then propose two different GVC modes for see-through vision to fit different scenarios. Extensive experimental results demonstrate that our gaze depth estimation is efficient and accurate. By comparing with conventional interaction modalities, our GVC techniques are also shown to be superior in terms of efficiency and more preferred by users. Finally, we present four example applications of gaze-vergence-controlled see-through vision.

* 11 pages, 13 figures
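
The link between vergence and gaze depth can be sketched with textbook geometry. Assuming symmetric fixation straight ahead, two eyes separated by an interpupillary distance `ipd` and converging with a total vergence angle `theta` fixate at depth d = ipd / (2 tan(theta / 2)). The function names, the 63 mm interpupillary distance and the `margin` threshold are assumptions for illustration; the paper's own estimation algorithm and GVC modes are not described in this abstract.

```python
import math


def gaze_depth(ipd_m: float, vergence_rad: float) -> float:
    """Fixation depth in metres under symmetric vergence (simplified model)."""
    if vergence_rad <= 0:
        return math.inf                # parallel gaze rays: looking at infinity
    return ipd_m / (2.0 * math.tan(vergence_rad / 2.0))


def should_see_through(depth_m: float, wall_m: float, margin_m: float = 0.2) -> bool:
    """Hypothetical trigger: reveal occluded content once gaze passes the wall."""
    return depth_m > wall_m + margin_m


ipd = 0.063                                    # assumed interpupillary distance
angle_at_1m = 2.0 * math.atan((ipd / 2.0) / 1.0)
depth = gaze_depth(ipd, angle_at_1m)
```

The relation is steep at long range: small angular errors translate into large depth errors far from the viewer, which is one reason accurate gaze tracking matters for this kind of control.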

Adversarial Text Normalization

Jun 08, 2022
Joanna Bitton, Maya Pavlova, Ivan Evtimov

Text-based adversarial attacks are becoming more commonplace and accessible to general internet users. As these attacks proliferate, the need to address the gap in model robustness becomes imminent. While retraining on adversarial data may increase performance, there remains an additional class of character-level attacks on which these models falter. Additionally, the process to retrain a model is time and resource intensive, creating a need for a lightweight, reusable defense. In this work, we propose the Adversarial Text Normalizer, a novel method that restores baseline performance on attacked content with low computational overhead. We evaluate the efficacy of the normalizer on two problem areas prone to adversarial attacks, i.e. Hate Speech and Natural Language Inference. We find that text normalization provides a task-agnostic defense against character-level attacks that can be implemented supplementary to adversarial retraining solutions, which are more suited for semantic alterations.
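
A lightweight character-level defense of this general kind can be sketched with the standard library. This is not the authors' Adversarial Text Normalizer, whose rules are not given in the abstract; the `normalize` function, the zero-width character set and the small leetspeak table are assumptions chosen for illustration.

```python
import unicodedata

# Invisible characters sometimes inserted to split words (assumed list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

# Small, hypothetical digit/symbol substitution table.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})


def normalize(text: str) -> str:
    """Undo a few common character-level perturbations before classification."""
    # NFKD maps compatibility forms (e.g. fullwidth letters) to plain ones
    # and splits accented letters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text
                   if not unicodedata.combining(ch) and ch not in ZERO_WIDTH)
    return text.lower().translate(LEET)


cleaned = normalize("h4t\u200be")
```

The digit mapping also rewrites legitimate numbers, and cross-script homoglyphs (for example Cyrillic letters) are not handled, so a real normalizer would need a more careful table.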

Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures

Apr 12, 2021
Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

Recent publications on automatic speech recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures, which work well for large datasets but tend to overfit when applied in low-resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech (TTS) system if additional text is available. This was successfully applied in many publications with AED systems. We present a novel approach of silence correction in the data pre-processing for TTS systems which increases the robustness when training on corpora targeted for ASR applications. In this work we not only show the successful application of synthetic data for AED systems, but also test the same method on a highly optimized state-of-the-art Hybrid ASR system and a competitive monophone-based system using connectionist temporal classification (CTC). We show that for the latter systems the addition of synthetic data only has a minor effect, but they still outperform the AED systems by a large margin on LibriSpeech-100h. We achieve a final word error rate of 3.3%/10.0% with a Hybrid system on the clean/noisy test sets, surpassing any previous state-of-the-art systems that do not include unlabeled audio data.

* Submitted to Interspeech 2021
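
For readers unfamiliar with the metric quoted above, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is the standard definition, not the authors' scoring tool; the function name and whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.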

Continuous Speech Recognition using EEG and Video

Dec 19, 2019
Gautam Krishna, Mason Carnahan, Co Tran, Ahmed H Tewfik

In this paper we investigate whether electroencephalography (EEG) features can be used to improve the performance of continuous visual speech recognition systems. We implemented a connectionist temporal classification (CTC) based end-to-end automatic speech recognition (ASR) model for performing recognition.

* In preparation for submission to EUSIPCO 2020. arXiv admin note: text overlap with arXiv:1911.11610, arXiv:1911.04261
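
As background on CTC-based recognition, the simplest decoding rule takes the most likely label per frame, merges consecutive repeats and then drops the blank symbol. The sketch below shows that generic greedy rule; it is not taken from the paper, and the blank index `0` and the integer label sequence are assumptions.

```python
from typing import List


def ctc_greedy_decode(frame_labels: List[int], blank: int = 0) -> List[int]:
    """Collapse repeated frame labels, then remove blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# A blank between the two runs of label 1 keeps them as separate outputs.
labels = ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])
```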

Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis

Nov 06, 2020
Guanghui Xu, Wei Song, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou

Although prosody is related to linguistic information up to the discourse structure, most text-to-speech (TTS) systems only take into account the information within each sentence, which makes it challenging to convert a paragraph of text into natural and expressive speech. In this paper, we propose to use the text embeddings of the neighboring sentences to improve the prosody generation for each utterance of a paragraph in an end-to-end fashion without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which leads to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the LJ-Speech English audiobook dataset demonstrate that the use of CU information can improve the naturalness and expressiveness of the synthesized speech. Subjective listening tests show that most of the participants prefer the voice generated using the CU encoder over that generated using standard Tacotron2. It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.

* 5 pages, 4 figures

Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

Oct 14, 2020
Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson, Chenfang Li, Tatiana Merkulova, Yin May Oo, Knot Pipatsrisawat, Clara Rivera, Supheakmungkol Sarin, Pasindu de Silva, Keshan Sodimana, Richard Sproat, Theeraphol Wattanavekin, Jaka Aris Eko Wibawa

This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.

* Appeared in 2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4-6 December, Paris, France
