"speech": models, code, and papers

[email protected]: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection

Oct 05, 2020
Gaurav Arora

This paper describes the system submitted to Dravidian-Codemix-HASOC2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English). The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media. We participated in both Sub-task A, which aims to identify offensive content in mixed-script (mixture of Native and Roman script), and Sub-task B, which aims to identify offensive content in Roman script, for Dravidian languages. To address these tasks, we proposed pre-training ULMFiT on synthetically generated code-mixed data, generated by modelling code-mixed data generation as a Markov process using Markov chains. Our model achieved a 0.88 weighted F1-score for code-mixed Tamil-English in Sub-task B and ranked 2nd on the leaderboard. Additionally, our model achieved a 0.91 weighted F1-score (4th rank) for mixed-script Malayalam-English in Sub-task A and a 0.74 weighted F1-score (5th rank) for code-mixed Malayalam-English in Sub-task B.

* System description paper for 2nd ranked system in Sub-task B at [email protected] 
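The abstract above says synthetic code-mixed text was generated by treating generation as a Markov process. The listing gives no implementation details, so the following is only a minimal sketch of a word-level Markov chain text generator. The function names `build_chain` and `generate`, the boundary tokens `<s>` and `</s>`, the chain order, and the toy romanized sentences are all illustrative assumptions, not the author's code or data.

```python
import random
from collections import defaultdict


def build_chain(sentences, order=1):
    """Map each state (a tuple of `order` tokens) to the tokens seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Pad with hypothetical boundary markers so generation can start and stop.
        tokens = ["<s>"] * order + sentence.split() + ["</s>"]
        for i in range(len(tokens) - order):
            state = tuple(tokens[i:i + order])
            chain[state].append(tokens[i + order])
    return chain


def generate(chain, order=1, max_len=20, rng=None):
    """Sample one synthetic sentence by walking the chain from the start state."""
    rng = rng or random.Random()
    state = ("<s>",) * order
    out = []
    while len(out) < max_len:
        candidates = chain.get(state)
        if not candidates:
            break
        # Duplicates in the list make frequent transitions proportionally more likely.
        nxt = rng.choice(candidates)
        if nxt == "</s>":
            break
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)


# Toy, made-up code-mixed sentences used purely for illustration.
toy_corpus = [
    "intha movie super ah irukku",
    "this movie trailer super",
    "intha trailer vera level",
    "movie vera level ah irukku",
]

toy_chain = build_chain(toy_corpus, order=1)
sample = generate(toy_chain, order=1, rng=random.Random(0))
print(sample)
```

Sentences sampled this way only recombine transitions observed in the seed text, which is why such output is usually treated as language-model pre-training material rather than as labelled data.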

Part of Speech Tagging (POST) of a Low-resource Language using another Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged Persian (Farsi) Corpus)

Jan 30, 2022
Hossein Hassani

Tagged corpora play a crucial role in a wide range of Natural Language Processing tasks. Part of Speech Tagging (POST) is essential in developing tagged corpora. It is time-and-effort-consuming and costly, and therefore it could be more affordable if it were automated. The Kurdish language currently lacks publicly available tagged corpora of adequate size. Tagging the publicly available Kurdish corpora can raise the usefulness of those resources beyond what raw or segmented corpora can provide. Developing POS-tagged lexicons can assist this task. We use a tagged corpus (Bijankhan corpus) in Persian (Farsi), a language close to Kurdish, to develop a POS-tagged lexicon. This paper presents the approach of leveraging the resources of a close language to enrich those of Kurdish. A partial dataset of the results is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license. We plan to make the whole tagged corpus available after further investigation of the outcome. The dataset can help in developing POS-tagged lexicons for other Kurdish dialects and in automated Kurdish corpora tagging.

* 7 pages, 2 tables, 3 figures 

An Artificial Intelligence Browser Architecture (AIBA) For Our Kind and Others: A Voice Name System Speech implementation with two warrants, Wake Neutrality and Value Preservation of Personally Identifiable Information

Apr 01, 2022
Brian Subirana

Conversational commerce, first pioneered by Apple's Siri, is the first of many applications based on always-on artificial intelligence systems that decide on their own when to interact with the environment, potentially collecting 24x7 longitudinal training data that is often Personally Identifiable Information (PII). A large body of scholarly papers, on the order of a million according to a simple Google Scholar search, suggests that the treatment of many health conditions, including COVID-19 and dementia, can be vastly improved by this data if the dataset is large enough, as has happened in other domains (e.g. GPT3). In contrast, current dominant systems are closed garden solutions without wake neutrality that can't fully exploit the PII data they have because of IRB and Cohues-type constraints. We present a voice browser-and-server architecture that aims to address these two limitations by offering wake neutrality and the possibility of handling PII so as to maximize its value. We have implemented this browser for the collection of speech samples and have successfully demonstrated it can capture over 200,000 samples of COVID-19 coughs. The architecture we propose is designed so it can grow beyond our kind into other domains such as collecting sound samples from vehicles, video images from nature, ingestible robotics, multi-modal signals (EEG, EKG,...), or even interacting with other kinds such as dogs and cats.

The "Whiteboard" Architecture: a way to integrate heterogeneous components of NLP systems

Nov 04, 1994
Christian Boitet, Mark Seligman

We present a new software architecture for NLP systems made of heterogeneous components, and demonstrate an architectural prototype we have built at ATR in the context of Speech Translation.

* COLING-94 
* Postscript, 6 pages 

Identifying Phrasemes via Interlingual Association Measures -- A Data-driven Approach on Dependency-parsed and Word-aligned Parallel Corpora

Sep 24, 2017
Johannes Graën

This is a preprint of the article "Identifying Phrasemes via Interlingual Association Measures" that was presented in February 2016 at the LeKo (Lexical combinations and typified speech in a multilingual context) conference in Innsbruck.

A procedure for unsupervised lexicon learning

Nov 30, 2001
Anand Venkataraman

We describe an incremental unsupervised procedure to learn words from transcribed continuous speech. The algorithm is based on a conservative and traditional statistical model, and results of empirical tests show that it is competitive with other algorithms that have been proposed recently for this task.

* Proceedings of the eighteenth international conference on machine learning, ICML-01, pp.569--576, 2001 
* Expanded version of this paper appears in Computational Linguistics 27(3) 

Weakly Supervised Multi-Embeddings Learning of Acoustic Models

Apr 20, 2015
Gabriel Synnaeve, Emmanuel Dupoux

We trained a Siamese network with multi-task same/different information on a speech dataset, and found that it was possible to share a network for both tasks without a loss in performance. The first task was to discriminate between two same or different words, and the second was to discriminate between two same or different talkers.

* 6 pages, 3 figures 

Tagging the Teleman Corpus

May 11, 1995
Thorsten Brants, Christer Samuelsson

Experiments were carried out comparing the Swedish Teleman and the English Susanne corpora using an HMM-based and a novel reductionistic statistical part-of-speech tagger. They indicate that tagging the Teleman corpus is the more difficult task, and that the performance of the two different taggers is comparable.

* 14 pages, LaTeX, to appear in Proceedings of the 10th Nordic Conference of Computational Linguistics, Helsinki, Finland, 1995 

N-dimensional nonlinear prediction with MLP

Feb 24, 2022
Marcos Faundez-Zanuy

In this paper we propose a Non-Linear Predictive Vector quantizer (PVQ) for speech coding, based on Multi-Layer Perceptrons. With this scheme we have improved the results of our previous ADPCM coder with nonlinear prediction, and we have reduced the bit rate up to 1 bit per sample.

* 2002 11th European Signal Processing Conference, 2002, pp. 1-4 
* 4 pages 
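The abstract above refers to predictive coding at about 1 bit per sample. As a rough illustration of the general idea only, the sketch below implements a 1-bit scalar differential coder (delta modulation) with a pluggable predictor. It is not the paper's vector quantizer or its MLP predictor; the names `dpcm_1bit`, `last_sample`, the step size, and the test signal are assumptions made for this example. A nonlinear predictor such as a trained MLP would be passed in place of `last_sample`.

```python
import math


def last_sample(history):
    """Trivial predictor: assume the next sample equals the previous reconstruction."""
    return history[-1]


def dpcm_1bit(signal, predict=last_sample, step=0.1, order=1):
    """Encode each sample with one bit: the sign of the prediction error.

    The predictor only sees reconstructed samples, so a decoder holding the
    same predictor and step size can rebuild `recon` from `bits` alone.
    """
    history = [0.0] * order          # shared encoder/decoder state
    bits, recon = [], []
    for x in signal:
        pred = predict(history[-order:])
        bit = 1 if x >= pred else 0  # 1-bit quantization of the error
        value = pred + (step if bit else -step)
        bits.append(bit)
        recon.append(value)
        history.append(value)
    return bits, recon


# Slowly varying test tone: 200 samples per period, amplitude 1.
tone = [math.sin(2 * math.pi * n / 200) for n in range(400)]
bits, recon = dpcm_1bit(tone, step=0.1)
max_err = max(abs(x - r) for x, r in zip(tone, recon))
print(len(bits), "bits for", len(tone), "samples; max error", round(max_err, 3))
```

With this step size the per-sample change of the tone stays below the step, so the reconstruction tracks the input without slope overload; a better predictor reduces the error that the single bit has to correct.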

VLSI Systems for signal processing and Communications

Jun 10, 2021
Aditya Kulkarni, Atharva Kulkarni, Ankit Lad, Laksh Maheshwari, Jayant Majji

The growing advances in VLSI technology and design tools have exponentially expanded the application domain of digital signal processing over the past 10 years. This survey emphasises the architectural and performance parameters of VLSI for DSP applications such as speech processing, wireless communication, analog-to-digital converters, etc.
