This paper describes the system submitted to Dravidian-Codemix-HASOC2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English). The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media. We participated in both Sub-task A, which aims to identify offensive content in mixed-script text (a mixture of native and Roman scripts), and Sub-task B, which aims to identify offensive content in Roman-script text, for Dravidian languages. To address these tasks, we propose pre-training ULMFiT on synthetically generated code-mixed data, produced by modelling code-mixed data generation as a Markov process using Markov chains. Our model achieved a weighted F1-score of 0.88 for code-mixed Tamil-English in Sub-task B, ranking 2nd on the leaderboard. Additionally, it achieved a weighted F1-score of 0.91 (4th rank) for mixed-script Malayalam-English in Sub-task A and 0.74 (5th rank) for code-mixed Malayalam-English in Sub-task B.
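To illustrate the Markov-chain idea mentioned above, here is a minimal sketch of generating synthetic code-mixed text from bigram transition counts; the seed sentences, token conventions, and function names are illustrative assumptions, not the submitted system's implementation.

```python
import random
from collections import defaultdict

# Hypothetical seed corpus of code-mixed (Tamil-English, Roman script) comments.
# In the actual system, real social-media data would be used; these lines are placeholders.
seed_corpus = [
    "<s> indha movie semma mass </s>",
    "<s> trailer vera level bro </s>",
    "<s> indha trailer semma </s>",
]

# First-order Markov chain: count token-to-token transitions.
transitions = defaultdict(list)
for line in seed_corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        transitions[prev].append(nxt)

def generate(max_len=20):
    """Sample one synthetic code-mixed sentence from the Markov chain."""
    token, out = "<s>", []
    for _ in range(max_len):
        token = random.choice(transitions[token])
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)

print(generate())
```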
Tagged corpora play a crucial role in a wide range of Natural Language Processing tasks. Part-of-Speech Tagging (POST) is essential in developing tagged corpora. It is time-consuming, effort-intensive, and costly, and it would therefore be more affordable if automated. The Kurdish language currently lacks publicly available tagged corpora of adequate size. Tagging the publicly available Kurdish corpora can raise the usefulness of those resources beyond what raw or segmented corpora can provide, and developing POS-tagged lexicons can assist that task. We use a tagged corpus in Persian (Farsi), the Bijankhan corpus, as a language close to Kurdish to develop a POS-tagged lexicon. This paper presents our approach of leveraging the resources of a closely related language to enrich Kurdish resources. A partial dataset of the results is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/. We plan to make the whole tagged corpus available after further investigation of the outcome. The dataset can help in developing POS-tagged lexicons for other Kurdish dialects and in automated tagging of Kurdish corpora.
Conversational commerce, first pioneered by Apple's Siri, is the first of many applications based on always-on artificial intelligence systems that decide on their own when to interact with the environment, potentially collecting 24x7 longitudinal training data that is often Personally Identifiable Information (PII). A large body of scholarly papers, on the order of a million according to a simple Google Scholar search, suggests that the treatment of many health conditions, including COVID-19 and dementia, can be vastly improved by this data if the dataset is large enough, as has happened in other domains (e.g. GPT-3). In contrast, the currently dominant systems are closed-garden solutions that lack wake neutrality and cannot fully exploit the PII data they hold because of IRB- and COUHES-type constraints. We present a voice browser-and-server architecture that aims to address these two limitations by offering wake neutrality and the possibility of handling PII so as to maximize its value. We have implemented this browser for the collection of speech samples and have successfully demonstrated that it can capture over 200,000 samples of COVID-19 coughs. The architecture we propose is designed so it can grow beyond our kind into other domains, such as collecting sound samples from vehicles, video images from nature, ingestible robotics, multi-modal signals (EEG, EKG, ...), or even interacting with other kinds such as dogs and cats.
We present a new software architecture for NLP systems made of heterogeneous components, and demonstrate an architectural prototype we have built at ATR in the context of Speech Translation.
This is a preprint of the article "Identifying Phrasemes via Interlingual Association Measures" that was presented in February 2016 at the LeKo (Lexical combinations and typified speech in a multilingual context) conference in Innsbruck.
We describe an incremental unsupervised procedure to learn words from transcribed continuous speech. The algorithm is based on a conservative and traditional statistical model, and results of empirical tests show that it is competitive with other algorithms that have been proposed recently for this task.
We trained a Siamese network with multi-task same/different information on a speech dataset and found that it was possible to share a network across both tasks without a loss in performance. The first task was to discriminate whether two spoken words were the same or different, and the second was to discriminate whether two talkers were the same or different.
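As an illustration of the shared-network setup described above, below is a minimal PyTorch-style sketch of a Siamese encoder with two same/different heads (word and talker); the input representation, layer sizes, and loss weighting are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SiameseMultiTask(nn.Module):
    """Shared encoder with two same/different heads: one for words, one for talkers.

    A minimal sketch: inputs are assumed to be fixed-length acoustic feature
    vectors (dimension `feat_dim`); the real model likely uses variable-length
    speech and different layer sizes.
    """
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(            # shared between both tasks
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.word_head = nn.Linear(hidden, 1)    # same/different word
        self.talker_head = nn.Linear(hidden, 1)  # same/different talker

    def forward(self, x1, x2):
        # Encode both items with the same weights, then compare embeddings.
        diff = torch.abs(self.encoder(x1) - self.encoder(x2))
        return self.word_head(diff), self.talker_head(diff)

model = SiameseMultiTask()
x1, x2 = torch.randn(8, 40), torch.randn(8, 40)
word_logits, talker_logits = model(x1, x2)
word_labels = torch.randint(0, 2, (8, 1)).float()    # 1 = same word
talker_labels = torch.randint(0, 2, (8, 1)).float()  # 1 = same talker
criterion = nn.BCEWithLogitsLoss()
loss = criterion(word_logits, word_labels) + criterion(talker_logits, talker_labels)
loss.backward()
```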
Experiments were carried out comparing the Swedish Teleman and the English Susanne corpora using an HMM-based and a novel reductionistic statistical part-of-speech tagger. They indicate that tagging the Teleman corpus is the more difficult task, and that the performance of the two different taggers is comparable.
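For reference, a toy illustration of the HMM-based tagging approach mentioned above follows; the tag set, probability tables, and example sentence are invented for illustration and are not estimated from Teleman or Susanne.

```python
import numpy as np

# Toy HMM part-of-speech tagger: transition and emission probabilities would
# normally be estimated from a tagged corpus; the tiny tables below are made up.
tags = ["DET", "NOUN", "VERB"]
vocab = {"the": 0, "dog": 1, "barks": 2}
trans = np.array([[0.1, 0.8, 0.1],      # P(next tag | DET)
                  [0.1, 0.2, 0.7],      # P(next tag | NOUN)
                  [0.4, 0.4, 0.2]])     # P(next tag | VERB)
emit = np.array([[0.9, 0.05, 0.05],     # P(word | DET)
                 [0.05, 0.9, 0.05],     # P(word | NOUN)
                 [0.05, 0.05, 0.9]])    # P(word | VERB)
start = np.array([0.6, 0.3, 0.1])

def viterbi(words):
    """Return the most probable tag sequence under the HMM."""
    obs = [vocab[w] for w in words]
    n, k = len(obs), len(tags)
    delta = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, n):
        scores = delta[t - 1][:, None] * trans * emit[:, obs[t]]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(["the", "dog", "barks"]))  # expected: ['DET', 'NOUN', 'VERB']
```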
In this paper we propose a Nonlinear Predictive Vector Quantizer (PVQ) for speech coding based on Multi-Layer Perceptrons. With this scheme we have improved on the results of our previous ADPCM coder with nonlinear prediction, and we have reduced the bit rate down to 1 bit per sample.
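As a rough illustration of nonlinear (MLP-based) prediction for speech coding in the spirit of the scheme above, the sketch below trains a small MLP to predict the next sample from past samples and coarsely quantizes the residual; the vector-quantization stage is omitted, and the architecture, prediction order, and signal are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# A small MLP predicts the next sample from the previous `order` samples; only
# the prediction residual would be quantized and transmitted (as in ADPCM).
order = 10                       # number of past samples fed to the predictor
predictor = nn.Sequential(nn.Linear(order, 16), nn.Tanh(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(predictor.parameters(), lr=0.01)

signal = torch.sin(torch.linspace(0, 50, 2000))   # stand-in for a speech signal
frames = signal.unfold(0, order + 1, 1)           # sliding windows of order+1 samples
inputs, targets = frames[:, :order], frames[:, order:]

for _ in range(100):                               # train the nonlinear predictor
    optimizer.zero_grad()
    residual = targets - predictor(inputs)         # prediction error
    loss = residual.pow(2).mean()                  # minimise residual energy
    loss.backward()
    optimizer.step()

# A coarse residual quantizer (e.g. 1 bit/sample keeps only the sign).
quantized_residual = residual.detach().sign() * residual.detach().abs().mean()
```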
The growing advances in VLSI technology and design tools have exponentially expanded the application domain of digital signal processing over the past 10 years. This survey emphasises the architectural and performance parameters of VLSI for DSP applications such as speech processing, wireless communication, and analog-to-digital converters.