Why is our language vague? We argue that in contexts in which a cooperative speaker is not perfectly informed about the world, the use of vague expressions can offer an optimal tradeoff between truthfulness (Gricean Quality) and informativeness (Gricean Quantity). Focusing on expressions of approximation such as "around", which are semantically vague, we show that they allow the speaker to convey indirect probabilistic information, in a way that gives the listener a more accurate representation of the information available to the speaker than any more precise expression would (intervals of the form "between"). We give a probabilistic treatment of the interpretation of "around", and offer a model for the interpretation and use of "around"-statements within the Rational Speech Act (RSA) framework. Our model differs in substantive ways from the Lexical Uncertainty model often used within the RSA framework for vague predicates.
Morphological analysis is the study of the formation and structure of words. It plays a crucial role in various tasks in Natural Language Processing (NLP) and Computational Linguistics (CL) such as machine translation and text and speech generation. Kurdish is a less-resourced multi-dialect Indo-European language with highly inflectional morphology. In this paper, as the first attempt of its kind, the morphology of the Kurdish language (Sorani dialect) is described from a computational point of view. We extract morphological rules which are transformed into finite-state transducers for generating and analyzing words. The result of this research assists in conducting studies on language generation for Kurdish and enhances the Information Retrieval (IR) capacity for the language while leveraging the Kurdish NLP and CL into a more advanced computational level.
As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability of language resources for English. To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD). OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive. Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.
Audio features have been proven useful for increasing the performance of automated topic segmentation systems. This study explores the novel task of using audio embeddings for automated, topically coherent segmentation of radio shows. We created three different audio embedding generators using multi-class classification tasks on three datasets from different domains. We evaluate topic segmentation performance of the audio embeddings and compare it against a text-only baseline. We find that a set-up including audio embeddings generated through a non-speech sound event classification task significantly outperforms our text-only baseline by 32.3% in F1-measure. In addition, we find that different classification tasks yield audio embeddings that vary in segmentation performance.
We propose NovoGrad, a first-order stochastic gradient method with layer-wise gradient normalization via second moment estimators and with decoupled weight decay for a better regularization. The method requires half as much memory as Adam/AdamW. We evaluated NovoGrad on the diverse set of problems, including image classification, speech recognition, neural machine translation and language modeling. On these problems, NovoGrad performed equal to or better than SGD and Adam/AdamW. Empirically we show that NovoGrad (1) is very robust during the initial training phase and does not require learning rate warm-up, (2) works well with the same learning rate policy for different problems, and (3) generally performs better than other optimizers for very large batch sizes
We describe a method for unmixing mixtures of freely independent random variables in a manner analogous to the independent component analysis (ICA) based method for unmixing independent random variables from their additive mixtures. Random matrices play the role of free random variables in this context so the method we develop, which we call Free component analysis (FCA), unmixes matrices from additive mixtures of matrices. We describe the theory, the various algorithms, and compare FCA to ICA. We show that FCA performs comparably to, and often better than, ICA in every application, such as image and speech unmixing, where ICA has been known to succeed. Our computational experiments suggest that not-so-random matrices, such as images and spectrograms of waveforms are (closer to being) freer "in the wild" than we might have theoretically expected.
We present a neural transducer model with visual attention that learns to generate LaTeX markup of a real-world math formula given its image. Applying sequence modeling and transduction techniques that have been very successful across modalities such as natural language, image, handwriting, speech and audio; we construct an image-to-markup model that learns to produce syntactically and semantically correct LaTeX markup code over 150 words long and achieves a BLEU score of 89%; improving upon the previous state-of-art for the Im2Latex problem. We also demonstrate with heat-map visualization how attention helps in interpreting the model and can pinpoint (detect and localize) symbols on the image accurately despite having been trained without any bounding box data.
In this study, we propose advancing all-neural speech recognition by directly incorporating attention modeling within the Connectionist Temporal Classification (CTC) framework. In particular, we derive new context vectors using time convolution features to model attention as part of the CTC network. To further improve attention modeling, we utilize content information extracted from a network representing an implicit language model. Finally, we introduce vector based attention weights that are applied on context vectors across both time and their individual components. We evaluate our system on a 3400 hours Microsoft Cortana voice assistant task and demonstrate that our proposed model consistently outperforms the baseline model achieving about 20% relative reduction in word error rates.
This paper presents Centre for Development of Advanced Computing Mumbai's (CDACM) submission to the NLP Tools Contest on Part-Of-Speech (POS) Tagging For Code-mixed Indian Social Media Text (POSCMISMT) 2015 (collocated with ICON 2015). We submitted results for Hindi (hi), Bengali (bn), and Telugu (te) languages mixed with English (en). In this paper, we have described our approaches to the POS tagging techniques, we exploited for this task. Machine learning has been used to POS tag the mixed language text. For POS tagging, distributed representations of words in vector space (word2vec) for feature extraction and Log-linear models have been tried. We report our work on all three languages hi, bn, and te mixed with en.
Speech recognition, especially name recognition, is widely used in phone services such as company directory dialers, stock quote providers or location finders. It is usually challenging due to pronunciation variations. This paper proposes an efficient and robust data-driven technique which automatically learns acceptable word pronunciations and updates the pronunciation dictionary to build a better lexicon without affecting recognition of other words similar to the target word. It generalizes well on datasets with various sizes, and reduces the error rate on a database with 13000+ human names by 42%, compared to a baseline with regular dictionaries already covering canonical pronunciations of 97%+ words in names, plus a well-trained spelling-to-pronunciation (STP) engine.