"Text": models, code, and papers

Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate

Aug 31, 2021
Hannah Rose Kirk, Bertram Vidgen, Paul Röttger, Tristan Thrush, Scott A. Hale

Detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is a key emerging challenge for automated detection. We present HatemojiCheck, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji. Using the test suite, we expose weaknesses in existing hate detection models. To address these weaknesses, we create the HatemojiTrain dataset using a human-and-model-in-the-loop approach. Models trained on these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. Both HatemojiCheck and HatemojiTrain are made publicly available.


Levi Graph AMR Parser using Heterogeneous Attention

Jul 09, 2021
Han He, Jinho D. Choi

Coupled with biaffine decoders, transformers have been effectively adapted to text-to-graph transduction and achieved state-of-the-art performance on AMR parsing. Many prior works, however, rely on the biaffine decoder for either or both arc and label predictions although most features used by the decoder may be learned by the transformer already. This paper presents a novel approach to AMR parsing that combines heterogeneous data (tokens, concepts, labels) as one input to a transformer to learn attention, and uses only attention matrices from the transformer to predict all elements in AMR graphs (concepts, arcs, labels). Although our models use significantly fewer parameters than the previous state-of-the-art graph parser, they show similar or better accuracy on AMR 2.0 and 3.0.

* Accepted in IWPT 2021: The 17th International Conference on Parsing Technologies 


Challenges and Considerations with Code-Mixed NLP for Multilingual Societies

Jun 15, 2021
Vivek Srivastava, Mayank Singh

Multilingualism refers to a high degree of proficiency in two or more languages in the written and oral communication modes. It often results in language mixing, a.k.a. code-mixing, when a multilingual speaker switches between multiple languages in a single utterance of text or speech. This paper discusses the current state of NLP research, its limitations, and foreseeable pitfalls in addressing five real-world applications for social good in multilingual societies: crisis management, healthcare, political campaigning, fake news, and hate speech. We also propose futuristic datasets, models, and tools that can significantly advance the current research in multilingual NLP applications for the societal good. As a representative example, we consider English-Hindi code-mixing but draw similar inferences for other language pairs.


NLP is Not enough -- Contextualization of User Input in Chatbots

May 13, 2021
Nathan Dolbir, Triyasha Dastidar, Kaushik Roy

AI chatbots have made vast strides in technology improvement in recent years and are already operational in many industries. Advanced Natural Language Processing techniques, based on deep networks, efficiently process user requests to carry out their functions. As chatbots gain traction, their applicability in healthcare is an attractive proposition due to the reduced economic and people costs of an overburdened system. However, healthcare bots require safe and medically accurate information capture, which deep networks aren't yet capable of due to user text and speech variations. Knowledge in symbolic structures is more suited for accurate reasoning but cannot handle natural language processing directly. Thus, in this paper, we study the effects of combining knowledge and neural representations on chatbot safety, accuracy, and understanding.


A novel segmentation dataset for signatures on bank checks

Apr 28, 2021
Muhammad Saif Ullah Khan

The dataset presented provides high-resolution images of real, filled out bank checks containing various complex backgrounds, and handwritten text and signatures in the respective fields, along with both pixel-level and patch-level segmentation masks for the signatures on the checks. The images of bank checks were obtained from different sources, including other publicly available check datasets, publicly available images on the internet, as well as scans and images of real checks. Using the GIMP graphics software, pixel-level segmentation masks for signatures on these checks were manually generated as binary images. An automated script was then used to generate patch-level masks. The dataset was created to train and test networks for extracting signatures from bank checks and other similar documents with very complex backgrounds.
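The abstract does not describe the automated script, so the following is only a minimal sketch of one plausible way to derive a patch-level mask from a pixel-level binary mask. The function name `patch_level_mask`, the `patch_size` parameter, and the rule "a patch is marked if any pixel in it is marked" are assumptions for illustration, not the author's method.

```python
def patch_level_mask(pixel_mask, patch_size):
    """Return a mask where every patch containing a signature pixel is filled.

    pixel_mask: list of rows, each a list of 0/1 values (1 = signature pixel).
    patch_size: side length of the square patches (hypothetical parameter).
    """
    height = len(pixel_mask)
    width = len(pixel_mask[0]) if height else 0
    out = [[0] * width for _ in range(height)]

    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Clip the patch at the image border.
            rows = range(top, min(top + patch_size, height))
            cols = range(left, min(left + patch_size, width))
            # Assumed rule: one marked pixel is enough to mark the patch.
            if any(pixel_mask[r][c] for r in rows for c in cols):
                for r in rows:
                    for c in cols:
                        out[r][c] = 1
    return out


example = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The single marked pixel fills only the top-left 2x2 patch.
print(patch_level_mask(example, 2))
```

A stricter rule (for example, requiring a minimum fraction of marked pixels per patch) would only change the `any(...)` condition.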


A Comprehensive Attempt to Research Statement Generation

Apr 25, 2021
Wenhao Wu, Sujian Li

For a researcher, writing a good research statement is crucial but costs a lot of time and effort. To help researchers, in this paper, we propose the research statement generation (RSG) task which aims to summarize one's research achievements and help prepare a formal research statement. For this task, we conduct a comprehensive attempt including corpus construction, method design, and performance evaluation. First, we construct an RSG dataset with 62 research statements and the corresponding 1,203 publications. Due to the limitation of our resources, we propose a practical RSG method which identifies a researcher's research directions by topic modeling and clustering techniques and extracts salient sentences by a neural text summarizer. Finally, experiments show that our method outperforms all the baselines with better content coverage and coherence.


DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels

Apr 18, 2021
Zhaohao Zeng, Tetsuya Sakai

We introduce a data set called DCH-2, which contains 4,390 real customer-helpdesk dialogues in Chinese and their English translations. DCH-2 also contains dialogue-level annotations and turn-level annotations obtained independently from either 19 or 20 annotators. The data set was built through our effort as organisers of the NTCIR-14 Short Text Conversation and NTCIR-15 Dialogue Evaluation tasks, to help researchers understand what constitutes an effective customer-helpdesk dialogue, and thereby build efficient and helpful helpdesk systems that are available to customers at all times. In addition, DCH-2 may be utilised for other purposes, for example, as a repository for retrieval-based dialogue systems, or as a parallel corpus for machine translation in the helpdesk domain.

* 6 pages, 3 figures 
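To illustrate what a "distribution of annotators' labels" can look like, here is a minimal sketch that turns independent votes into a normalised distribution. The function name `label_distribution` and the label names "A", "B", and "C" are hypothetical; the abstract does not specify the label set or the file format of DCH-2.

```python
from collections import Counter


def label_distribution(votes, labels):
    """Map each label to the fraction of annotators who chose it."""
    if not votes:
        raise ValueError("need at least one annotator vote")
    counts = Counter(votes)
    total = len(votes)
    # Labels nobody chose get probability 0.0 rather than being dropped.
    return {label: counts[label] / total for label in labels}


# Hypothetical example: 20 annotators voting on one dialogue turn.
votes = ["A"] * 10 + ["B"] * 5 + ["C"] * 5
distribution = label_distribution(votes, ["A", "B", "C"])
print(distribution)
```

Keeping the full distribution, rather than a majority label, preserves how much the annotators disagreed.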


DAGN: Discourse-Aware Graph Network for Logical Reasoning

Mar 26, 2021
Yinya Huang, Meng Fang, Yu Cao, Liwei Wang, Xiaodan Liang

Recent QA with logical reasoning questions requires passage-level relations among the sentences. However, current approaches still focus on sentence-level relations interacting among tokens. In this work, we explore aggregating passage-level clues for solving logical reasoning QA by using discourse-based information. We propose a discourse-aware graph network (DAGN) that reasons relying on the discourse structure of the texts. The model encodes discourse information as a graph with elementary discourse units (EDUs) and discourse relations, and learns the discourse-aware features via a graph network for downstream QA tasks. Experiments are conducted on two logical reasoning QA datasets, ReClor and LogiQA, and our proposed DAGN achieves competitive results.

* Accepted by NAACL 2021 


Is the User Enjoying the Conversation? A Case Study on the Impact on the Reward Function

Jan 13, 2021
Lina M. Rojas-Barahona

The impact of user satisfaction on policy learning for task-oriented dialogue systems has long been a subject of research interest. Most current models for estimating user satisfaction either (i) treat out-of-context short texts, such as product reviews, or (ii) rely on turn features instead of on distributed semantic representations. In this work we adopt deep neural networks that use distributed semantic representation learning for estimating user satisfaction in conversations. We evaluate the impact of modelling context length in these networks. Moreover, we show that the proposed hierarchical network outperforms state-of-the-art quality estimators. Furthermore, we show that applying these networks to infer the reward function in a Partially Observable Markov Decision Process (POMDP) yields a great improvement in the task success rate.

* Accepted at the Human in the Loop Dialogue Systems, 34th Conference on Neural Information Processing Systems (NeurIPS 2020). Paper updated with minor changes 
