
"speech": models, code, and papers

TuGeBiC: A Turkish German Bilingual Code-Switching Corpus

May 02, 2022
Jeanine Treffers-Daller, Ozlem Çetinoğlu

In this paper we describe the process of collection, transcription, and annotation of recordings of spontaneous speech samples from Turkish-German bilinguals, and the compilation of a corpus called TuGeBiC. Participants in the study were adult Turkish-German bilinguals living in Germany or Turkey at the time of recording in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportions of words from each language. The corpus is roughly balanced between both languages. We also present quantitative information about the number of code-switches, and give examples of different types of code-switching found in the data. The resulting corpus has been made freely available to the research community.
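The token-level language identification step described above, which yields the per-language word proportions, can be sketched as follows. This is a minimal illustration, not the corpus authors' pipeline: the tiny Turkish and German lexicons and the lookup-based labelling are placeholder assumptions, whereas a real system would use trained identification models.

```python
from collections import Counter

# Hypothetical miniature lexicons; a real system would use trained
# language-ID models or full Turkish and German word lists.
TURKISH = {"ben", "ev", "geldim", "okula"}
GERMAN = {"ich", "haus", "schule", "dann"}

def identify_token(token):
    """Assign a language label to a single token by lexicon lookup."""
    t = token.lower()
    if t in TURKISH:
        return "tr"
    if t in GERMAN:
        return "de"
    return "other"

def language_proportions(tokens):
    """Return the share of tokens per language label."""
    counts = Counter(identify_token(t) for t in tokens)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# A toy code-switched utterance: German and Turkish tokens mixed.
props = language_proportions(["ich", "geldim", "dann", "okula"])
```

Summing these proportions over the whole corpus is what lets the authors report that TuGeBiC is roughly balanced between the two languages.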

VScript: Controllable Script Generation with Audio-Visual Presentation

Mar 01, 2022
Ziwei Ji, Yan Xu, I-Tsun Cheng, Samuel Cahyawijaya, Rita Frieske, Etsuko Ishii, Min Zeng, Andrea Madotto, Pascale Fung

Automatic script generation could save a considerable amount of resources and offer inspiration to professional scriptwriters. We present VScript, a controllable pipeline that generates complete scripts, including dialogues and scene descriptions, and presents them visually using video retrieval and aurally using text-to-speech for spoken dialogue. With an interactive interface, our system allows users to select genres and input starting words that control the theme and development of the generated script. We adopt a hierarchical structure that generates the plot first, then the script and its audio-visual presentation. We also introduce a novel approach to plot-guided dialogue generation by treating it as inverse dialogue summarization. Experimental results show that our approach outperforms the baselines on both automatic and human evaluations, especially in terms of genre control.

L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment

Feb 21, 2022
Eric Guizzo, Christian Marinoni, Marco Pennese, Xinlei Ren, Xiguang Zheng, Chen Zhang, Bruno Masiero, Aurelio Uncini, Danilo Comminiello

The L3DAS22 Challenge is aimed at encouraging the development of machine learning strategies for 3D speech enhancement and 3D sound localization and detection in office-like environments. This challenge improves and extends the tasks of the L3DAS21 edition. We generated a new dataset, which maintains the same general characteristics as the L3DAS21 datasets but contains more data points and adds constraints that improve the baseline model's efficiency and overcome the major difficulties encountered by participants in the previous challenge. We updated the baseline model of Task 1, using the architecture that ranked first in the previous challenge edition. We wrote a new supporting API, improving its clarity and ease of use. Finally, we present and discuss the results submitted by all participants. L3DAS22 Challenge website:

* Accepted to 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022). arXiv admin note: substantial text overlap with arXiv:2104.05499 

Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition

Jan 17, 2022
Pengfei Liu, Kun Li, Helen Meng

Emotion recognition is a challenging and actively studied research area that plays a critical role in emotion-aware human-computer interaction systems. In a multimodal setting, temporal alignment between different modalities has not yet been well investigated. This paper presents a new model named the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states to explicitly capture the alignment relationship between speech and text, and a novel group gated fusion (GGF) layer to integrate the representations of the different modalities. We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM, and that the proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
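The two ingredients named in the abstract, attention pooling over LSTM hidden states and a gated fusion of the resulting modality representations, can be sketched in miniature. This is a hand-rolled toy with made-up two-dimensional vectors, not the paper's learned architecture: GBAN's attention and group gates are trained parameters, whereas here the query vectors and the mean-based gate are illustrative stand-ins.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(query, states):
    """Dot-product attention: weight each hidden state by its
    similarity to the query, then return the weighted sum."""
    scores = [sum(q * h for q, h in zip(query, s)) for s in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]

def gated_fusion(reps):
    """Toy gate: sigmoid of each representation's mean scales it
    before summation; a stand-in for the learned GGF layer."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    gates = [sigmoid(sum(r) / len(r)) for r in reps]
    dim = len(reps[0])
    return [sum(g * r[i] for g, r in zip(gates, reps)) for i in range(dim)]

# Two tiny "modality" sequences of 2-D hidden states.
speech = attention_pool([1.0, 0.0], [[0.2, 0.1], [0.9, 0.4]])
text = attention_pool([0.0, 1.0], [[0.3, 0.8], [0.1, 0.2]])
fused = gated_fusion([speech, text])
```

The attention output is a convex combination of the hidden states, which is why it can land anywhere within their span rather than being stuck at the final time step.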

* Published in INTERSPEECH-2020 

Dataset of Spatial Room Impulse Responses in a Variable Acoustics Room for Six Degrees-of-Freedom Rendering and Analysis

Nov 23, 2021
Thomas McKenzie, Leo McCormack, Christoph Hold

Room acoustics measurements are used in many areas of audio research, from physical acoustics modelling and speech enhancement to virtual reality applications. This paper documents the technical specifications and choices made in the measurement of a dataset of spatial room impulse responses (SRIRs) in a variable acoustics room. Two spherical microphone arrays are used: the mh Acoustics Eigenmike em32 and the Zylia ZM-1, capable of up to fourth- and third-order Ambisonic capture, respectively. The dataset consists of three source and seven receiver positions, repeated with five configurations of the room's acoustics with varying levels of reverberation. Possible applications of the dataset include six degrees-of-freedom (6DoF) analysis and rendering, SRIR interpolation methods, and spatial dereverberation techniques.

* 3 pages, 3 figures, 2 tables 

Vowel-based Meeteilon dialect identification using a Random Forest classifier

Jul 26, 2021
Thangjam Clarinda Devi, Kabita Thaoroijam

This paper presents a vowel-based dialect identification system for Meeteilon. For this work, a vowel dataset is created using the Meeteilon Speech Corpora available at the Linguistic Data Consortium for Indian Languages (LDC-IL). Spectral features such as formant frequencies (F1, F2 and F3) and prosodic features such as pitch (F0), energy, intensity and segment duration are extracted from monophthong vowel sounds. A random forest classifier, a decision-tree-based ensemble algorithm, is used to classify three major dialects of Meeteilon, namely Imphal, Kakching and Sekmai. The model achieves an average dialect identification accuracy of around 61.57%. Both spectral and prosodic features are found to play a significant role in Meeteilon dialect classification.
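The classification setup described above can be sketched with scikit-learn. The feature values and the query vector below are fabricated for illustration; real inputs would be per-vowel measurements extracted from the LDC-IL Meeteilon Speech Corpora, and the hyperparameters are arbitrary, not the paper's.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-vowel feature vectors:
# [F1 (Hz), F2 (Hz), F3 (Hz), F0 (Hz), energy, duration (s)]
X = [
    [300, 2200, 2900, 190, 0.6, 0.11],  # Imphal
    [310, 2150, 2950, 185, 0.5, 0.12],  # Imphal
    [520, 1500, 2600, 210, 0.7, 0.09],  # Kakching
    [540, 1450, 2650, 205, 0.8, 0.10],  # Kakching
    [700, 1100, 2500, 230, 0.4, 0.14],  # Sekmai
    [690, 1150, 2450, 235, 0.5, 0.13],  # Sekmai
]
y = ["Imphal", "Imphal", "Kakching", "Kakching", "Sekmai", "Sekmai"]

# A small decision-tree ensemble; majority vote across trees gives the label.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
pred = clf.predict([[305, 2180, 2920, 188, 0.55, 0.115]])
```

With only a handful of well-separated toy points the forest fits the training data perfectly; the paper's ~61.57% accuracy reflects the much harder real task.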

* 5 pages, double column, 8 figures, 1 table. Already presented as a poster at OCOCOSDA 2020 but not yet published 

NLP is Not enough -- Contextualization of User Input in Chatbots

May 13, 2021
Nathan Dolbir, Triyasha Dastidar, Kaushik Roy

AI chatbots have made vast strides in technology improvement in recent years and are already operational in many industries. Advanced natural language processing techniques, based on deep networks, efficiently process user requests to carry out their functions. As chatbots gain traction, their applicability in healthcare is an attractive proposition because they reduce the economic and personnel costs of an overburdened system. However, healthcare bots require safe and medically accurate information capture, which deep networks cannot yet guarantee given variations in user text and speech. Knowledge in symbolic structures is better suited for accurate reasoning but cannot handle natural language processing directly. Thus, in this paper, we study the effects of combining knowledge and neural representations on chatbot safety, accuracy, and understanding.

Personalized Keyphrase Detection using Speaker and Environment Information

Apr 28, 2021
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw

In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model. To address the challenge of detecting these keyphrases under various noisy conditions, a speaker separation model is added to the feature frontend of the speaker verification model, and an adaptive noise cancellation (ANC) algorithm is included to exploit cross-microphone noise coherence. Our experiments show that the text-independent speaker verification model largely reduces the false triggering rate of the keyphrase detection, while the speaker separation model and adaptive noise cancellation largely reduce false rejections.

Data Quality as Predictor of Voice Anti-Spoofing Generalization

Mar 26, 2021
Bhusan Chettri, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

Voice anti-spoofing aims at classifying a given speech input either as a bonafide human sample, or a spoofing attack (e.g. a synthetic or replayed sample). Numerous voice anti-spoofing methods have been proposed, but most of them fail to generalize across domains (corpora) -- and we do not know \emph{why}. We outline a novel interpretative framework for gauging the impact of data quality upon anti-spoofing performance. Our within- and between-domain experiments pool data from seven public corpora and three anti-spoofing methods based on Gaussian mixture and convolutional neural network models. We assess the impacts of long-term spectral information, speaker population (through x-vector speaker embeddings), signal-to-noise ratio, and selected voice quality features.

* Submitted to INTERSPEECH 2021 

Commonsense Knowledge Mining from Term Definitions

Feb 01, 2021
Zhicheng Liang, Deborah L. McGuinness

Commonsense knowledge has proven to be beneficial to a variety of application areas, including question answering and natural language understanding. Previous work explored collecting commonsense knowledge triples automatically from text to increase the coverage of current commonsense knowledge graphs. We investigate several machine learning approaches to mining commonsense knowledge triples using dictionary term definitions as inputs and provide an initial evaluation of the results. We start by extracting candidate triples from text using part-of-speech tag patterns, and then compare the performance of three existing models for triple scoring. Our experiments show that term definitions contain some valid and novel commonsense knowledge triples for some semantic relations, and also reveal some challenges with using existing triple scoring models.
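The candidate-extraction step described above, matching part-of-speech tag patterns over a definition, can be sketched as follows. This is an illustrative simplification, not the paper's exact patterns: the input is hand-tagged here (a real pipeline would run a POS tagger), and the single NOUN-VERB-NOUN window is an assumed example pattern.

```python
def extract_triples(tagged):
    """Scan (token, tag) pairs for NOUN-VERB-NOUN windows and emit
    each match as a candidate (head, relation, tail) triple."""
    triples = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        if t1 == "NOUN" and t2 == "VERB" and t3 == "NOUN":
            triples.append((w1, w2, w3))
    return triples

# A toy pre-tagged definition fragment.
definition = [("knife", "NOUN"), ("cuts", "VERB"), ("bread", "NOUN"),
              ("in", "ADP"), ("kitchens", "NOUN")]
triples = extract_triples(definition)
```

Candidates produced this way are noisy, which is exactly why the paper passes them to separate triple-scoring models rather than accepting every match.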

* In the Commonsense Knowledge Graphs (CSKGs) Workshop of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021) 
