"speech": models, code, and papers

Coupling Phonology and Phonetics in a Constraint-Based Gestural Model

Dec 23, 1994
Markus Walther, Bernd J. Kroeger

An implemented approach which couples a constraint-based phonology component with an articulatory speech synthesizer is proposed. Articulatory gestures ensure a tight connection between both components, as they comprise both physical-phonetic and phonological aspects. The phonological modelling of e.g. syllabification and phonological processes such as German final devoicing is expressed in the constraint logic programming language CUF. Extending CUF by arithmetic constraints allows the simultaneous description of both phonology and phonetics. Thus declarative lexicalist theories of grammar such as HPSG may be enriched up to the level of detailed phonetic realisation. Initial acoustic demonstrations show that our approach is in principle capable of synthesizing full utterances in a linguistically motivated fashion.

* English version of the German original: Walther, Markus and Kroeger, Bernd J. (1994): Phonologie-Phonetikkopplung in einem constraint-basierten gesturalen Modell. In: Harald Trost (ed.), Proceedings KONVENS '94, Vienna. 10 pages, gzip'ed, uuencoded postscript


SubER: A Metric for Automatic Evaluation of Subtitle Quality

May 11, 2022
Patrick Wilken, Panayota Georgakopoulou, Evgeny Matusov

This paper addresses the problem of evaluating the quality of automatically generated subtitles, which includes not only the quality of the machine-transcribed or translated speech, but also the quality of line segmentation and subtitle timing. We propose SubER - a single novel metric based on edit distance with shifts that takes all of these subtitle properties into account. We compare it to existing metrics for evaluating transcription, translation, and subtitle quality. A careful human evaluation in a post-editing scenario shows that the new metric has a high correlation with the post-editing effort and direct human assessment scores, outperforming baseline metrics considering only the subtitle text, such as WER and BLEU, and existing methods to integrate segmentation and timing features.

* IWSLT 2022 
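
For orientation, the text-only baseline named in the abstract, WER, can be sketched as a word-level edit distance normalized by the reference length. This is a minimal illustration, not SubER itself: it ignores shifts, line segmentation, and timing, and the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because a score like this looks only at the words, two subtitle files with identical text but different line breaks or timings receive the same value, which is the gap the abstract says SubER is meant to close.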


Prosody-Aware Neural Machine Translation for Dubbing

Dec 16, 2021
Derek Tam, Surafel M. Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico

We introduce the task of prosody-aware machine translation, which aims at generating translations suitable for dubbing. Dubbing of a spoken sentence requires transferring the content as well as the prosodic structure of the source into the target language to preserve timing information. Practically, this implies correctly projecting pauses from the source to the target and ensuring that target speech segments have roughly the same duration as the corresponding source segments. In this work, we propose implicit and explicit modeling approaches to integrate prosody information into neural machine translation. Experiments on English-German/French with automatic metrics show that the simplest of the considered approaches works best. Results are confirmed by human evaluations of translations and dubbed videos.

* Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022 


Battling Hateful Content in Indic Languages HASOC '21

Nov 05, 2021
Aditya Kadam, Anmol Goel, Jivitesh Jain, Jushaan Singh Kalra, Mallika Subramanian, Manvith Reddy, Prashant Kodali, T. H. Arjun, Manish Shrivastava, Ponnurangam Kumaraguru

The extensive rise in consumption of online social media (OSMs) by a large number of people poses a critical problem of curbing the spread of hateful content on these platforms. With the growing usage of OSMs in multiple languages, the task of detecting and characterizing hate becomes more complex. The subtle variations of code-mixed texts along with switching scripts only add to the complexity. This paper presents a solution for the HASOC 2021 Multilingual Twitter Hate-Speech Detection challenge by team PreCog IIIT Hyderabad. We adopt a multilingual transformer based approach and describe our architecture for all 6 subtasks as part of the challenge. Out of the 6 teams that participated in all the subtasks, our submissions rank 3rd overall.

* 12 pages, 6 figures, 2 tables, Accepted at FIRE 2021, CEUR Workshop Proceedings (


BotNet Detection On Social Media

Oct 12, 2021
Aniket Chandrakant Devle, Julia Ann Jose, Abhay Shrinivas Saraswathula, Shubham Mehta, Siddhant Srivastava, Sirisha Kona, Sudheera Daggumalli

Given the popularity of social media and the notion of it being a platform encouraging free speech, it has become an open playground for user (bot) accounts trying to manipulate other users using these platforms. Social bots not only learn human conversations, manners, and presence but also manipulate public opinion, act as scammers, manipulate stock markets, etc. There has been evidence of bots manipulating the election results which can be a great threat to the whole nation and hence the whole world. So identification and prevention of such campaigns that release or create the bots have become critical to tackling it at its source of origin. Our goal is to leverage semantic web mining techniques to identify fake bots or accounts involved in these activities.


Improving Real-time Score Following in Opera by Combining Music with Lyrics Tracking

Oct 06, 2021
Charles Brazier, Gerhard Widmer

Fully automatic opera tracking is challenging because of the acoustic complexity of the genre, combining musical and linguistic information (singing, speech) in complex ways. In this paper, we propose a new pipeline for complete opera tracking. The pipeline is based on two trackers. A music tracker that has proven to be effective at tracking orchestral parts will lead the tracking process. In addition, a lyrics tracker that has recently been shown to reliably track the lyrics of opera songs will correct the music tracker when tracking parts in which the text dominates the music. We will demonstrate the efficiency of this method on the opera Don Giovanni, showing that this technique helps improve the accuracy and robustness of a complete opera tracker.

* 5 pages, In Proceedings of the 2nd Workshop on NLP for Music and Audio (NLP4MusA), Online, 2021 


Fine-tuning wav2vec2 for speaker recognition

Sep 30, 2021
Nik Vaessen, David A. van Leeuwen

This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant, w2v2-aam, achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at

* under review for ICASSP 2022 
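
The pooling step mentioned in the abstract turns a variable-length sequence of frame vectors into one fixed-length speaker embedding. A minimal sketch of one common choice, mean pooling over time, is shown here in plain Python. Whether this is the pooling the authors settle on is not stated in the abstract, and the name `mean_pool` and the list-of-lists representation are assumptions made for illustration.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame vectors over time into a single fixed-length vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimension")

    n_frames = len(frames)
    # One average per feature dimension, independent of sequence length.
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
```

The output length depends only on the feature dimension, so utterances of different durations map to embeddings of the same size and can be compared or classified directly.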


COMBO: State-of-the-Art Morphosyntactic Analysis

Sep 11, 2021
Mateusz Klimaszewski, Alina Wróblewska

We introduce COMBO - a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features whilst also exposing their vector representations, extracted from hidden layers. COMBO is an easy-to-install Python package with automatically downloadable pre-trained models for over 40 languages. It maintains a balance between efficiency and quality. As it is an end-to-end system and its modules are jointly trained, its training is competitively fast. As its models are optimised for accuracy, they often achieve better prediction quality than SOTA. The COMBO library is available at:

* Accepted at EMNLP 2021 Demonstrations Program 


Word-Free Spoken Language Understanding for Mandarin-Chinese

Jul 01, 2021
Zhiyuan Guo, Yuexin Li, Guo Chen, Xingyu Chen, Akshat Gupta

Spoken dialogue systems such as Siri and Alexa provide great convenience in people's everyday lives. However, current spoken language understanding (SLU) pipelines largely depend on automatic speech recognition (ASR) modules, which require a large amount of language-specific training data. In this paper, we propose a Transformer-based SLU system that works directly on phones. This acoustic-based SLU system consists of only two blocks and does not require an ASR module. The first block is a universal phone recognition system, and the second block is a Transformer-based language model for phones. We verify the effectiveness of the system on an intent classification dataset in Mandarin Chinese.


Annotating Hate and Offenses on Social Media

Mar 27, 2021
Francielle Alves Vargas, Isabelle Carvalho, Fabiana Rodrigues de Góes, Fabrício Benevenuto de Souza, Thiago Alexandre Salgueiro Pardo

This paper describes a corpus annotation process to support the identification of hate speech and offensive language in social media. The corpus was collected from Instagram pages of political personalities and manually annotated. It is composed of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), the level of the offense (highly offensive, moderately offensive and slightly offensive messages), and the identification of the target of the discriminatory content (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism and fat phobia). Each comment was annotated by three different annotators, who achieved high inter-annotator agreement.
