
"Text": models, code, and papers

Mining Commonsense Facts from the Physical World

Feb 11, 2020
Yanyan Zou, Wei Lu, Xu Sun

Textual descriptions of the physical world implicitly mention commonsense facts, while commonsense knowledge bases explicitly represent such facts as triples. Compared with the rapid growth of text data, the coverage of existing knowledge bases is far from complete. Most prior studies on populating knowledge bases focus on Freebase; automatically completing commonsense knowledge bases to improve their coverage remains under-explored. In this paper, we propose a new task of mining commonsense facts from raw text that describes the physical world. We build an effective new model that fuses information from both sequential text and existing knowledge base resources. We also create two large annotated datasets, each with approximately 200k instances, for commonsense knowledge base completion. Empirical results demonstrate that our model significantly outperforms baselines.
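The abstract leaves the fusion mechanism abstract; as a rough illustration only, the sketch below scores a candidate (head, relation, tail) triple by combining a TransE-style knowledge-base term with a text-compatibility term. All names, dimensions, and the encoder stub are assumptions, not the authors' model.

    # Illustrative sketch (not the paper's model): score a candidate
    # commonsense triple by fusing a KB-style embedding score with a
    # text-derived compatibility score.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 64

    # Hypothetical embedding tables; in practice these are learned.
    entity_emb = {"snow": rng.normal(size=dim), "cold": rng.normal(size=dim)}
    relation_emb = {"HasProperty": rng.normal(size=dim)}

    def sentence_embedding(text):
        # Stand-in for a learned encoder over the raw sentence.
        return rng.normal(size=dim)

    def score_triple(head, rel, tail, text):
        h, r, t = entity_emb[head], relation_emb[rel], entity_emb[tail]
        kb_score = -np.linalg.norm(h + r - t)        # TransE-style KB term
        text_score = float(sentence_embedding(text) @ (h + t))
        return kb_score + text_score

    print(score_triple("snow", "HasProperty", "cold", "The snow felt cold."))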



Dividing and Conquering Cross-Modal Recipe Retrieval: from Nearest Neighbours Baselines to SoTA

Nov 28, 2019
Mikhail Fain, Andrey Ponikar, Ryan Fox, Danushka Bollegala

We propose a novel non-parametric method for cross-modal retrieval which is applied on top of precomputed image and text embeddings. By combining our method with standard approaches for building image and text encoders, trained independently with a self-supervised classification objective, we create a baseline model which outperforms most existing methods on a challenging image-to-recipe task. We also use our method to compare image and text encoders trained with different modern approaches, thus addressing issues that hinder the development of novel methods for cross-modal recipe retrieval. We demonstrate how to use the insights from this model comparison to extend our baseline model with a standard triplet loss, improving the SoTA on the Recipe1M dataset by a large margin while using only precomputed features and with much less complexity than existing methods.
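As context for the nearest-neighbours baseline in the title, here is a minimal sketch of image-to-recipe retrieval over precomputed embeddings: L2-normalize both sides and rank recipes by dot product. The embeddings below are random placeholders, not the paper's encoders.

    import numpy as np

    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Placeholder precomputed embeddings: 1000 recipes, one query image.
    recipe_emb = normalize(np.random.randn(1000, 512))
    image_emb = normalize(np.random.randn(512))

    # After L2 normalization, cosine similarity is a dot product.
    scores = recipe_emb @ image_emb
    top5 = np.argsort(-scores)[:5]   # indices of the 5 closest recipes
    print(top5)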

* 12 pages, 4 figures 


Robustness to Capitalization Errors in Named Entity Recognition

Nov 13, 2019
Sravan Bodapati, Hyokun Yun, Yaser Al-Onaizan

Robustness to capitalization errors is a highly desirable characteristic of named entity recognizers, yet we find standard models for the task are surprisingly brittle to such noise. Existing methods to improve robustness to this noise completely discard the given orthographic information, which significantly degrades their performance on well-formed text. We propose a simple alternative approach based on data augmentation, which allows the model to learn to utilize or ignore orthographic information depending on its usefulness in the context. It achieves competitive robustness to capitalization errors while making a negligible compromise in performance on well-formed text and significantly improving generalization on noisy user-generated text. Our experiments clearly and consistently validate our claim across different types of machine learning models, languages, and dataset sizes.
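The augmentation idea is concrete enough to sketch: alongside each original sentence, add a case-stripped copy with the same labels so the tagger can learn when orthography is informative. The exact recipe (e.g. whether upper-cased copies are also added) is an assumption here, not the paper's specification.

    def augment_casing(sentences):
        # sentences: list of (tokens, labels) pairs for NER training.
        augmented = []
        for tokens, labels in sentences:
            augmented.append((tokens, labels))                       # original casing
            augmented.append(([t.lower() for t in tokens], labels))  # case-stripped copy
        return augmented

    train = [(["Paris", "is", "in", "France"], ["B-LOC", "O", "O", "B-LOC"])]
    print(augment_casing(train))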

* http://noisy-text.github.io/2019/ 
* Accepted at W-NUT 2019 (5th Workshop on Noisy User-generated Text), an EMNLP 2019 workshop 


Is artificial data useful for biomedical Natural Language Processing algorithms?

Aug 07, 2019
Zixu Wang, Julia Ive, Sumithra Velupillai, Lucia Specia

A major obstacle to the development of Natural Language Processing (NLP) methods in the biomedical domain is data accessibility. This problem can be addressed by generating medical data artificially. Most previous studies have focused on the generation of short clinical text, and evaluation of the data utility has been limited. We propose a generic methodology to guide the generation of clinical text with key phrases. We use the artificial data as additional training data in two key biomedical NLP tasks: text classification and temporal relation extraction. We show that artificially generated training data used in conjunction with real training data can lead to performance boosts for data-greedy neural network algorithms. We also demonstrate the usefulness of the generated data for NLP setups where it fully replaces real training data.
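A minimal sketch of the training setup described above, with hypothetical names: pool real and artificially generated examples into one training set, controlling how much synthetic data is mixed in.

    import random

    def build_training_set(real_data, synthetic_data, synthetic_ratio=1.0):
        # Mix a fraction of the synthetic pool into the real training data.
        k = int(len(synthetic_data) * synthetic_ratio)
        mixed = real_data + random.sample(synthetic_data, k)
        random.shuffle(mixed)
        return mixed

    real = [("patient reports fever", "positive")] * 100
    synthetic = [("generated note mentions fever", "positive")] * 200
    print(len(build_training_set(real, synthetic, synthetic_ratio=0.5)))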

* BioNLP 2019 



Semi-supervised Stochastic Multi-Domain Learning using Variational Inference

Jun 07, 2019
Yitong Li, Timothy Baldwin, Trevor Cohn

Supervised NLP models rely on large collections of text that closely resemble the intended testing setting. Unfortunately, matching text is often not available in sufficient quantity, and moreover, within any domain of text, data is often highly heterogeneous. In this paper we propose a method to distill the important domain signal as part of a multi-domain learning system, using a latent variable model in which parts of a neural model are stochastically gated based on the inferred domain. We compare the use of discrete versus continuous latent variables, operating in a domain-supervised or a domain semi-supervised setting, where the domain is known only for a subset of training inputs. We show that our model leads to substantial performance improvements over competitive benchmark domain adaptation methods, including methods using adversarial learning.
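To make the gating idea concrete, a rough sketch under stated assumptions (not the paper's architecture): hidden units are scaled by a gate vector derived from a distribution over domains, so a soft domain posterior yields an expected gate, while a discrete variant would sample Bernoulli gates instead.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_dim, num_domains = 8, 3

    # Hypothetical per-domain gate probabilities (learned in practice).
    gate_probs = rng.uniform(0.2, 0.9, size=(num_domains, hidden_dim))

    def gated_forward(h, domain_posterior):
        # Expected gate under a (possibly uncertain) domain assignment.
        gate = domain_posterior @ gate_probs
        return h * gate

    h = rng.normal(size=hidden_dim)
    posterior = np.array([0.7, 0.2, 0.1])  # semi-supervised: domain inferred
    print(gated_forward(h, posterior))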

* ACL 2019 (9 pages + 2 pages of references + 1 appendix) 


Semi-automatic System for Title Construction

May 01, 2019
Swagata Duari, Vasudha Bhatnagar

In this paper, we propose a semi-automatic system for title construction from scientific abstracts. The system extracts and recommends impactful words from the text, which the author can creatively use to construct an appropriate title for the manuscript. The work is based on the hypothesis that keywords are good candidates for title construction. We extract important words from the document by inducing a supervised keyword extraction model. The model is trained on novel features extracted from a graph-of-text representation of the document. We empirically show that these graph-based features are capable of discriminating keywords from non-keywords. We further establish empirically that the proposed approach can be applied to any text irrespective of the training domain and corpus. We evaluate the proposed system by computing the overlap between extracted keywords and the list of title-words for documents, observing a macro-averaged precision of 82%.
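As a rough illustration of a graph-of-text representation (the paper's actual features are richer), one can build a co-occurrence graph over words within a sliding window and rank words by weighted degree as candidate keywords.

    from collections import defaultdict

    def graph_of_text(tokens, window=3):
        # Edge weight = number of co-occurrences within the window.
        edges = defaultdict(int)
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                edges[frozenset((tokens[i], tokens[j]))] += 1
        return edges

    def degree_scores(edges):
        deg = defaultdict(int)
        for pair, w in edges.items():
            for word in pair:
                deg[word] += w
        return sorted(deg.items(), key=lambda kv: -kv[1])

    tokens = "supervised keyword extraction from graph of text representation".split()
    print(degree_scores(graph_of_text(tokens))[:3])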

* 12 pages, 2 figures, conference paper, accepted for publication 


Strategies for Structuring Story Generation

Feb 04, 2019
Angela Fan, Mike Lewis, Yann Dauphin

Writers generally rely on plans or sketches to write long stories, but most current language models generate word by word from left to right. We explore coarse-to-fine models for creating narrative texts of several hundred words, and introduce new models which decompose stories by abstracting over actions and entities. The model first generates the predicate-argument structure of the text, where different mentions of the same entity are marked with placeholder tokens. It then generates a surface realization of the predicate-argument structure, and finally replaces the entity placeholders with context-sensitive names and references. Human judges prefer the stories from our models to those from a wide range of previous approaches to hierarchical text generation. Extensive analysis shows that our methods can help improve the diversity and coherence of events and entities in generated stories.
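The final stage is mechanical enough to sketch: placeholder tokens in the generated surface form are swapped for context-sensitive names. The placeholder scheme below is invented for illustration; the generation stages themselves are neural models.

    def realize_entities(surface, name_table):
        # Replace entity placeholder tokens with chosen names/references.
        return " ".join(name_table.get(tok, tok) for tok in surface.split())

    surface = "ent0 drew her sword as ent1 circled the tower"
    names = {"ent0": "Mara", "ent1": "the dragon"}
    print(realize_entities(surface, names))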



Automatic Annotation of Locative and Directional Expressions in Arabic

May 26, 2018
Rita Hijazi, Amani Sabra, Moustafa Al-Hajj

In this paper, we introduce a rule-based approach to annotating Locative and Directional Expressions in Arabic natural language text. The annotation is based on a constructed semantic map of the spatiality domain. Challenges are twofold: first, we need to study how locative and directional expressions are expressed linguistically in these texts; and second, we need to automatically annotate the relevant textual segments accordingly. The research method used in this article is analytic-descriptive. We validate the approach on a specific novel rich in these expressions and show that it yields very promising results. We use NOOJ as a software tool to implement finite-state transducers that annotate linguistic elements corresponding to Locative and Directional Expressions. In conclusion, NOOJ allowed us to write linguistic rules for the automatic annotation of Locative and Directional Expressions in Arabic text.
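NOOJ compiles such rules into finite-state transducers; as a loose, hypothetical analogue only (invented rules, English examples, not the authors' Arabic grammar), a pattern-based annotator might mark spans like this:

    import re

    # Invented illustration rules; the actual grammar consists of NOOJ
    # transducers over Arabic text.
    RULES = [
        (re.compile(r"\b(north|south|east|west) of\b"), "DIRECTIONAL"),
        (re.compile(r"\b(inside|under|above|near)\b"), "LOCATIVE"),
    ]

    def annotate(text):
        spans = []
        for pattern, label in RULES:
            for m in pattern.finditer(text):
                spans.append((m.start(), m.end(), label))
        return sorted(spans)

    print(annotate("The village lies north of the river, near the old bridge."))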

* 20 pages, in French 


Multilingual Language Processing From Bytes

Apr 02, 2016
Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya

We describe an LSTM-based model, which we call Byte-to-Span (BTS), that reads text as bytes and outputs span annotations of the form [start, length, label], where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, yet they produce results similar to or better than the state-of-the-art in part-of-speech tagging and named entity recognition using only the provided training datasets (no external data sources). Our models learn "from scratch" in that they do not rely on any elements of the standard Natural Language Processing pipeline (including tokenization), and thus can run in standalone fashion on raw text.
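The input/output convention is concrete enough to sketch: text is consumed as raw UTF-8 bytes, and annotations are [start, length, label] spans over byte positions (the LSTM itself is omitted; the helper below is illustrative).

    text = "Köln is in Germany"
    data = text.encode("utf-8")      # 'ö' occupies two bytes
    print(list(data)[:6])            # the byte-level input sequence

    def byte_span(substring):
        # Convert a character-level match into a byte-level [start, length].
        start_char = text.index(substring)
        start = len(text[:start_char].encode("utf-8"))
        return [start, len(substring.encode("utf-8"))]

    print(byte_span("Köln") + ["LOC"])     # [0, 5, 'LOC']
    print(byte_span("Germany") + ["LOC"])  # [12, 7, 'LOC']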


