"speech": models, code, and papers

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

Apr 19, 2021
Nikola Ljubešić, Davor Lauc

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation, we introduce COPA-HR -- a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian. The BERTić model is made available for free usage and further task-specific fine-tuning through HuggingFace.

What is Multimodality?

Mar 10, 2021
Letitia Parcalabescu, Nils Trost, Anette Frank

The last years have shown rapid developments in the field of multimodal machine learning, combining e.g., vision, text or speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on representations and information that are relevant for a given machine learning task. With our new definition of multimodality we aim to provide a missing foundation for multimodal research, an important component of language grounding and a crucial milestone towards NLU.

* 9 pages, 3 figures 

Detecting Audio Attacks on ASR Systems with Dropout Uncertainty

Jun 02, 2020
Tejas Jayashankar, Jonathan Le Roux, Pierre Moulin

Various adversarial audio attacks have recently been developed to fool automatic speech recognition (ASR) systems. We here propose a defense against such attacks based on the uncertainty introduced by dropout in neural networks. We show that our defense is able to detect attacks created through optimized perturbations and frequency masking on a state-of-the-art end-to-end ASR system. Furthermore, the defense can be made robust against attacks that are immune to noise reduction. We test our defense on Mozilla's CommonVoice dataset, the UrbanSound dataset, and an excerpt of the LibriSpeech dataset, showing that it achieves high detection accuracy in a wide range of scenarios.
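
The abstract does not give implementation details, so the following is only a toy illustration of the general idea of dropout uncertainty: keep dropout active at inference time, run several stochastic forward passes, and treat a high output variance as a warning sign. The linear scorer, the function names, and the threshold are all hypothetical and are not the authors' detector.

```python
import random
import statistics


def dropout_forward(weights, features, drop_prob, rng):
    """One stochastic pass of a toy linear scorer with dropout left on."""
    keep_prob = 1.0 - drop_prob
    total = 0.0
    for weight, value in zip(weights, features):
        if rng.random() < keep_prob:                # this unit survives
            total += weight * value / keep_prob     # inverted-dropout scaling
    return total


def dropout_uncertainty(weights, features, drop_prob=0.5, passes=200, seed=0):
    """Mean and population variance of the score over repeated passes."""
    rng = random.Random(seed)
    outputs = [dropout_forward(weights, features, drop_prob, rng)
               for _ in range(passes)]
    return statistics.mean(outputs), statistics.pvariance(outputs)


def looks_suspicious(variance, threshold):
    """Hypothetical decision rule: flag inputs whose variance is too high."""
    return variance > threshold
```

In a real system the threshold would presumably be tuned on held-out benign and adversarial audio; here it is left as a free parameter.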

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Apr 22, 2020
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

* LREC 2020 
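
The annotation layers listed in the abstract (word forms, lemmas, universal part-of-speech tags, morphological features, and syntactic relations) are distributed in the tab-separated, ten-column CoNLL-U format. The sketch below reads one sentence into simple records. The `Token` type, the function names, and the sample sentence are made up for illustration and are not taken from any treebank; multiword-token ranges and empty nodes are kept as plain string IDs rather than handled specially.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: str        # kept as a string: IDs may be ranges such as "1-2"
    form: str
    lemma: str
    upos: str      # universal part-of-speech tag
    feats: dict    # morphological features, e.g. {"Number": "Plur"}
    head: str      # ID of the syntactic head ("0" for the root)
    deprel: str    # dependency relation to the head


def parse_feats(raw):
    """Turn 'Mood=Ind|Tense=Pres' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu_sentence(block):
    """Parse the token lines of one CoNLL-U sentence, skipping comments."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        tokens.append(Token(cols[0], cols[1], cols[2], cols[3],
                            parse_feats(cols[5]), cols[6], cols[7]))
    return tokens


# Invented sample sentence, columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "Mood=Ind|Tense=Pres", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])
```

Calling `parse_conllu_sentence(SAMPLE)` yields three tokens, with the verb as the root and the noun attached to it as `nsubj`.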

Database Meets Deep Learning: Challenges and Opportunities

Jun 21, 2019
Wei Wang, Meihui Zhang, Gang Chen, H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan

Deep learning has recently become very popular on account of its incredible success in many complex data-driven applications, such as image classification and speech recognition. The database community has worked on data-driven applications for many years, and therefore should be playing a lead role in supporting this new wave. However, databases and deep learning are different in terms of both techniques and applications. In this paper, we discuss research problems at the intersection of the two fields. In particular, we discuss possible improvements for deep learning systems from a database perspective, and analyze database applications that may benefit from deep learning techniques.

* SIGMOD Rec., 45(2):17-22, Sept. 2016 
* The previous version of this paper has appeared in SIGMOD Record. In this version, we extend it to include the recent developments in this field and references to recent work 

Data Efficient Voice Cloning for Neural Singing Synthesis

Feb 19, 2019
Merlijn Blaauw, Jordi Bonada, Ryunosuke Daido

There are many use cases in singing synthesis where creating voices from small amounts of data is desirable. In text-to-speech there have been several promising results that apply voice cloning techniques to modern deep learning based models. In this work, we adapt one such technique to the case of singing synthesis. By leveraging data from many speakers to first create a multispeaker model, small amounts of target data can then efficiently adapt the model to new unseen voices. We evaluate the system using listening tests across a number of different use cases, languages and kinds of data.

* Accepted to ICASSP 2019 

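Diseño de un espacio semántico sobre la base de la Wikipedia. Una propuesta de análisis de la semántica latente para el idioma español [Design of a Semantic Space Based on Wikipedia: A Proposal for Latent Semantic Analysis of the Spanish Language]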

Jan 28, 2019
Dalina Aidee Villa, Igor Barahona, Luis Javier Álvarez

Latent Semantic Analysis (LSA) was first developed within cognitive psychology in the 1990s. Since its emergence, LSA has been used to model cognitive processes, grade academic texts, compare literary works and analyse political speeches, among other applications. Taking a multivariate method for dimensionality reduction as the starting point, this paper proposes a semantic space for the Spanish language. Our results include a document-term matrix with dimensions 1.3 × 10^6 by 5.9 × 10^6, which is then decomposed into singular values. Those singular values are used to compare words or texts semantically.

* 14 pages, in Spanish, 4 figures 
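
The decomposition step mentioned in the abstract can be illustrated on a tiny count matrix. This is a minimal sketch, assuming rows are documents and columns are terms. It swaps in plain power iteration with deflation for the large-scale sparse solver that a matrix of the reported size would need, and all function names and the toy counts are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a dense row-major matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Return (unit vector, original Euclidean norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector], norm


def top_singular_triplet(matrix, iterations=200):
    """Leading (sigma, u, v) found by power iteration on A^T A."""
    transposed = [list(col) for col in zip(*matrix)]
    # Arbitrary deterministic start; it must not be orthogonal to the answer.
    v, _ = normalize([1.0 / (j + 1) for j in range(len(matrix[0]))])
    for _ in range(iterations):
        v, _ = normalize(matvec(transposed, matvec(matrix, v)))
    u, sigma = normalize(matvec(matrix, v))
    return sigma, u, v


def truncated_svd(matrix, k, iterations=200):
    """Top-k singular triplets; assumes the matrix has rank of at least k."""
    residual = [row[:] for row in matrix]
    triplets = []
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual, iterations)
        triplets.append((sigma, u, v))
        # Deflate: subtract the rank-1 component that was just found.
        residual = [[residual[i][j] - sigma * u[i] * v[j]
                     for j in range(len(v))] for i in range(len(u))]
    return triplets


def term_vectors(triplets):
    """One k-dimensional vector per term (column) in the latent space."""
    n_terms = len(triplets[0][2])
    return [[sigma * v[j] for sigma, _, v in triplets] for j in range(n_terms)]


def cosine(a, b):
    """Cosine similarity between two latent vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


# Toy counts: terms 0 and 1 co-occur in documents 0 and 1, terms 2 and 3 in document 2.
COUNTS = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
```

With `term_vectors(truncated_svd(COUNTS, 2))`, terms that always co-occur end up with nearly identical latent vectors, while terms that never share a document end up nearly orthogonal.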

A Novel Approach for Effective Learning in Low Resourced Scenarios

Dec 15, 2017
Sri Harsha Dumpala, Rupayan Chakraborty, Sunil Kumar Kopparapu

Deep learning based discriminative methods, although state of the art among machine learning techniques, are ill-suited for learning from small amounts of data. In this paper, we propose a novel framework, called simultaneous two sample learning (s2sL), to effectively learn the class discriminative characteristics, even from very small amounts of data. In s2sL, more than one sample (here, two samples) is considered simultaneously both to train and to test the classifier. We demonstrate our approach for speech/music discrimination and emotion classification through experiments. Further, we also show the effectiveness of the s2sL approach for classification in low-resource scenarios and for imbalanced data.

* Presented at NIPS 2017 Machine Learning for Audio Signal Processing (ML4Audio) Workshop, Dec. 2017 
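
The abstract does not spell out how the two samples are combined, so the following is only one plausible reading: a minimal sketch, assuming the two feature vectors are concatenated and each ordered pair keeps both labels. The function name and the pairing scheme are hypothetical, not the authors' exact procedure. The sketch mainly shows why pairing helps when data is scarce: N samples yield N(N-1) paired training examples.

```python
from itertools import permutations


def make_pairs(samples, labels):
    """Build paired examples from every ordered pair of distinct samples.

    samples: list of feature vectors (lists of floats)
    labels:  list of class labels, one per sample
    Returns (paired_inputs, paired_labels).
    """
    paired_inputs, paired_labels = [], []
    for i, j in permutations(range(len(samples)), 2):
        paired_inputs.append(samples[i] + samples[j])   # concatenate features
        paired_labels.append((labels[i], labels[j]))    # keep both labels
    return paired_inputs, paired_labels
```

Under this assumption, three samples with two features each become six paired examples with four features each.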

Combining Residual Networks with LSTMs for Lipreading

Sep 08, 2017
Themos Stafylakis, Georgios Tzimiropoulos

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database with 500 target words, consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.

* Submitted to Interspeech 2017 

Face Recognition with Machine Learning in OpenCV: Fusion of the results with the Localization Data of an Acoustic Camera for Speaker Identification

Jul 04, 2017
Johannes Reschke, Armin Sehr

This contribution gives an overview of face recognition algorithms, their implementation and practical uses. First, a training set of different persons' faces has to be collected and used to train a face recognizer. The resulting face model can be used to classify people as specific individuals or as unknown. After tracking the recognized face and estimating the position of the acoustic sound source, both can be combined to give detailed information about possible speakers and whether they are talking or not. This leads to a precise real-time description of the situation, which can be used for further applications, e.g. for multi-channel speech enhancement by adaptive beamformers.

* Applied Research Conference 2017 (Munich) 
