Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

When did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party Dialogues

Mar 12, 2022
Shivani Kumar, Atharva Kulkarni, Md Shad Akhtar, Tanmoy Chakraborty

Indirect speech such as sarcasm achieves a constellation of discourse goals in human communication. While the indirectness of figurative language warrants speakers to achieve certain pragmatic goals, it is challenging for AI agents to comprehend such idiosyncrasies of human communication. Though sarcasm identification has been a well-explored topic in dialogue analysis, for conversational systems to truly grasp a conversation's innate meaning and generate appropriate responses, simply detecting sarcasm is not enough; it is vital to explain its underlying sarcastic connotation to capture its true essence. In this work, we study the discourse structure of sarcastic conversations and propose a novel task - Sarcasm Explanation in Dialogue (SED). Set in a multimodal and code-mixed setting, the task aims to generate natural language explanations of satirical conversations. To this end, we curate WITS, a new dataset to support our task. We propose MAF (Modality Aware Fusion), a multimodal context-aware attention and global information fusion module to capture multimodality and use it to benchmark WITS. The proposed attention module surpasses the traditional multimodal fusion baselines and reports the best performance on almost all metrics. Lastly, we carry out detailed analyses both quantitatively and qualitatively.

* Accepted in ACL 2022. 13 pages, 4 figures, 12 tables 

  Access Paper or Ask Questions

Spiking Cochlea with System-level Local Automatic Gain Control

Feb 14, 2022
Ilya Kiselev, Chang Gao, Shih-Chii Liu

Including local automatic gain control (AGC) circuitry into a silicon cochlea design has been challenging because of transistor mismatch and model complexity. To address this, we present an alternative system-level algorithm that implements channel-specific AGC in a silicon spiking cochlea by measuring the output spike activity of individual channels. The bandpass filter gain of a channel is adapted dynamically to the input amplitude so that the average output spike rate stays within a defined range. Because this AGC mechanism only needs counting and adding operations, it can be implemented at low hardware cost in a future design. We evaluate the impact of the local AGC algorithm on a classification task where the input signal varies over 32 dB input range. Two classifier types receiving cochlea spike features were tested on a speech versus noise classification task. The logistic regression classifier achieves an average of 6% improvement and 40.8% relative improvement in accuracy when the AGC is enabled. The deep neural network classifier shows a similar improvement for the AGC case and achieves a higher mean accuracy of 96% compared to the best accuracy of 91% from the logistic regression classifier.

* Accepted for publication at the IEEE Transactions on Circuits and Systems I - Regular Papers, 2022 

  Access Paper or Ask Questions

Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model

Feb 14, 2022
Keisuke Kinoshita, Marc Delcroix, Tomoharu Iwata

Speaker diarization has been investigated extensively as an important central task for meeting analysis. Recent trend shows that integration of end-to-end neural (EEND)-and clustering-based diarization is a promising approach to handle realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and achieved state-of-the-art results on various tasks. However, the approaches proposed so far have not realized {\it tight} integration yet, because the clustering employed therein was not optimal in any sense for clustering the speaker embeddings estimated by the EEND module. To address this problem, this paper introduces a {\it trainable} clustering algorithm into the integration framework, by deep-unfolding a non-parametric Bayesian model called the infinite Gaussian mixture model (iGMM). Specifically, the speaker embeddings are optimized during training such that it better fits iGMM clustering, based on a novel clustering loss based on Adjusted Rand Index (ARI). Experimental results based on CALLHOME data show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, that indeed reflects the effectiveness of the proposed iGMM integration.

* Accepted to IEEE ICASSP-2022, 5 pages, 2 figures 

  Access Paper or Ask Questions

An Enactivist account of Mind Reading in Natural Language Understanding

Nov 11, 2021
Peter Wallis, Bruce Edmonds

In this paper we apply our understanding of the radical enactivist agenda to a classic AI-hard problem. Natural Language Understanding is a sub-field of AI research that looked easy to the pioneers. Thus the Turing Test, in its original form, assumed that the computer could use language and the challenge was to fake human intelligence. It turned out that playing chess and formal logic were easy compared to the necessary language skills. The techniques of good old-fashioned AI (GOFAI) assume symbolic representation is the core of reasoning and human communication consisted of transferring representations from one mind to another. But by this model one finds that representations appear in another's mind, without appearing in the intermediary language. People communicate by mind reading it seems. Systems with speech interfaces such as Alexa and Siri are of course common but they are limited. Rather than adding mind reading skills, we introduced a "cheat" that enabled our systems to fake it. The cheat is simple and only slightly interesting to computer scientists and not at all interesting to philosophers. However, reading about the enactivist idea that we "directly perceive" the intentions of others, our cheat took on a new light and in this paper look again at how natural language understanding might actually work between humans.

  Access Paper or Ask Questions

Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

Oct 26, 2021
Arij Riabi, Benoît Sagot, Djamé Seddah

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fined-tuned on a small treebank of this language leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results a on much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability set-tings.

* Camera ready version. Accepted to WNUT 2021 

  Access Paper or Ask Questions

BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text

Sep 29, 2021
Hadeel Saadany, Constantin Orasan

Social media companies as well as authorities make extensive use of artificial intelligence (AI) tools to monitor postings of hate speech, celebrations of violence or profanity. Since AI software requires massive volumes of data to train computers, Machine Translation (MT) of the online content is commonly used to process posts written in several languages and hence augment the data needed for training. However, MT mistakes are a regular occurrence when translating sentiment-oriented user-generated content (UGC), especially when a low-resource language is involved. The adequacy of the whole process relies on the assumption that the evaluation metrics used give a reliable indication of the quality of the translation. In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors which can cause serious misunderstanding of the affect message. We compare the performance of three canonical metrics on meaningless translations where the semantic content is seriously impaired as compared to meaningful translations with a critical error which exclusively distorts the sentiment of the source text. We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.

* TRITON (2021) 48-56 
* Accepted for TRITON (TRanslation and Interpreting Technology ONline) 2021 

  Access Paper or Ask Questions

GERNERMED -- An Open German Medical NER Model

Sep 24, 2021
Johann Frei, Frank Kramer

The current state of adoption of well-structured electronic health records and integration of digital methods for storing medical patient data in structured formats can often considered as inferior compared to the use of traditional, unstructured text based patient data documentation. Data mining in the field of medical data analysis often needs to rely solely on processing of unstructured data to retrieve relevant data. In natural language processing (NLP), statistical models have been shown successful in various tasks like part-of-speech tagging, relation extraction (RE) and named entity recognition (NER). In this work, we present GERNERMED, the first open, neural NLP model for NER tasks dedicated to detect medical entity types in German text data. Here, we avoid the conflicting goals of protection of sensitive patient data from training data extraction and the publication of the statistical model weights by training our model on a custom dataset that was translated from publicly available datasets in foreign language by a pretrained neural machine translation model. The sample code and the statistical model is available at:

  Access Paper or Ask Questions

Sharing Cognition: Human Gesture and Natural Language Grounding Based Planning and Navigation for Indoor Robots

Aug 14, 2021
Gourav Kumar, Soumyadip Maity, Ruddra dev Roychoudhury, Brojeshwar Bhowmick

Cooperation among humans makes it easy to execute tasks and navigate seamlessly even in unknown scenarios. With our individual knowledge and collective cognition skills, we can reason about and perform well in unforeseen situations and environments. To achieve a similar potential for a robot navigating among humans and interacting with them, it is crucial for it to acquire the ability for easy, efficient and natural ways of communication and cognition sharing with humans. In this work, we aim to exploit human gestures which is known to be the most prominent modality of communication after the speech. We demonstrate how the incorporation of gestures for communicating spatial understanding can be achieved in a very simple yet effective way using a robot having the vision and listening capability. This shows a big advantage over using only Vision and Language-based Navigation, Language Grounding or Human-Robot Interaction in a task requiring the development of cognition and indoor navigation. We adapt the state-of-the-art modules of Language Grounding and Human-Robot Interaction to demonstrate a novel system pipeline in real-world environments on a Telepresence robot for performing a set of challenging tasks. To the best of our knowledge, this is the first pipeline to couple the fields of HRI and language grounding in an indoor environment to demonstrate autonomous navigation.

  Access Paper or Ask Questions

Learning Word-Level Confidence For Subword End-to-End ASR

Mar 11, 2021
David Qiu, Qiujia Li, Yanzhang He, Yu Zhang, Bo Li, Liangliang Cao, Rohit Prabhavalkar, Deepti Bhatia, Wei Li, Ke Hu, Tara N. Sainath, Ian McGraw

We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.

* To appear in ICASSP 2021 

  Access Paper or Ask Questions

Robustness of on-device Models: Adversarial Attack to Deep Learning Models on Android Apps

Feb 04, 2021
Yujin Huang, Han Hu, Chunyang Chen

Deep learning has shown its power in many applications, including object detection in images, natural-language understanding, and speech recognition. To make it more accessible to end users, many deep learning models are now embedded in mobile apps. Compared to offloading deep learning from smartphones to the cloud, performing machine learning on-device can help improve latency, connectivity, and power consumption. However, most deep learning models within Android apps can easily be obtained via mature reverse engineering, while the models' exposure may invite adversarial attacks. In this study, we propose a simple but effective approach to hacking deep learning models using adversarial attacks by identifying highly similar pre-trained models from TensorFlow Hub. All 10 real-world Android apps in the experiment are successfully attacked by our approach. Apart from the feasibility of the model attack, we also carry out an empirical study that investigates the characteristics of deep learning models used by hundreds of Android apps on Google Play. The results show that many of them are similar to each other and widely use fine-tuning techniques to pre-trained models on the Internet.

* Accepted to the 43rd International Conference on Software Engineering, Software Engineering in Practice Track. This is a preprint version, the copyright belongs to The Institute of Electrical and Electronics Engineers 

  Access Paper or Ask Questions