
"Information Extraction": models, code, and papers

Slot Filling for Biomedical Information Extraction

Sep 17, 2021
Yannis Papanikolaou, Francine Bennett

Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity- and relation-type-specific training data is a major bottleneck in these sub-tasks. In this work we present a slot filling approach to the task of biomedical IE, effectively replacing the need for entity- and relation-specific training data and allowing us to deal with zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end in standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks in the absence of relevant training data. Our code, models and pretrained data are available at
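The retrieve-then-read setup described above can be sketched with a toy encoder. A minimal sketch, not the paper's method: the bag-of-words encoder, example passages, and query below are illustrative stand-ins for the Transformer-based bi-encoder, which would produce dense embeddings instead.

```python
import math
from collections import Counter

def encode(text):
    """Toy stand-in for a Transformer encoder: an L2-normalised bag of words.
    A real DPR bi-encoder would produce a dense contextual embedding."""
    tokens = text.lower().replace(".", "").replace("?", "").split()
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def retrieve(query, passages, k=1):
    """Rank passages by the inner product of query and passage embeddings,
    as in Dense Passage Retrieval; a reader model would then extract the
    slot filler from the top-ranked passage."""
    q = encode(query)
    sim = lambda p: sum(w * encode(p).get(tok, 0.0) for tok, w in q.items())
    return sorted(passages, key=sim, reverse=True)[:k]

# Hypothetical mini passage collection:
passages = [
    "Aspirin irreversibly inhibits the enzyme cyclooxygenase.",
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Imatinib targets the BCR-ABL tyrosine kinase.",
]
top = retrieve("Which enzyme does aspirin inhibit?", passages)[0]
```

Because the retriever scores every passage against the query, no relation-specific training data is needed at query time, which is what makes the zero-shot setting possible.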


Joint Extraction of Events and Entities within a Document Context

Sep 12, 2016
Bishan Yang, Tom Mitchell

Events and entities are closely related; entities are often actors or participants in events and events without entities are uncommon. The interpretation of events and entities is highly contextually dependent. Existing work in information extraction typically models events separately from entities, and performs inference at the sentence level, ignoring the rest of the document. In this paper, we propose a novel approach that models the dependencies among variables of events, entities, and their relations, and performs joint inference of these variables across a document. The goal is to enable access to document-level contextual information and facilitate context-aware predictions. We demonstrate that our approach substantially outperforms the state-of-the-art methods for event extraction as well as a strong baseline for entity extraction.

* Proceedings of NAACL-HLT 2016, pages 289-299 
* 11 pages, 2 figures, published at NAACL 2016 

MRZ code extraction from visa and passport documents using convolutional neural networks

Sep 11, 2020
Yichuan Liu, Hailey James, Otkrist Gupta, Dan Raviv

Detecting and extracting information from the Machine-Readable Zone (MRZ) on passports and visas is becoming increasingly important for verifying document authenticity. However, computer vision methods for performing similar tasks, such as optical character recognition (OCR), fail to extract the MRZ from digital images of passports with reasonable accuracy. We present a specially designed model based on convolutional neural networks that is able to successfully extract MRZ information from digital images of passports of arbitrary orientation and size. Our model achieved a 100% MRZ detection rate and a 98.36% character recognition macro-F1 score on a passport and visa dataset.
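Once the MRZ characters have been recognised, the extracted fields can be validated with the standard ICAO Doc 9303 check-digit scheme (this validation step is standard MRZ practice, not the paper's CNN model): weights 7, 3, 1 repeat across the field, digits count as themselves, letters A–Z as 10–35, and the filler '<' as 0.

```python
def mrz_check_digit(field: str) -> int:
    """ICAO Doc 9303 check digit: multiply each character value by the
    repeating weights 7, 3, 1 and take the sum modulo 10."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch == "<":
            val = 0                      # filler character counts as zero
        elif ch.isdigit():
            val = int(ch)                # digits count as themselves
        else:
            val = ord(ch) - ord("A") + 10  # A=10, B=11, ..., Z=35
        total += val * weights[i % 3]
    return total % 10

# Document number from the ICAO Doc 9303 specimen passport, check digit 6:
assert mrz_check_digit("L898902C3") == 6
```

Running the recognised document number, birth date, and expiry date through this check lets a pipeline flag OCR errors before accepting the extraction.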


An evaluation of keyword extraction from online communication for the characterisation of social relations

Feb 11, 2014
Jan Hauffa, Tobias Lichtenberg, Georg Groh

The set of interpersonal relationships on a social network service or a similar online community is usually highly heterogeneous. The concept of tie strength captures only one aspect of this heterogeneity. Since the unstructured text content of online communication artefacts is a salient source of information about a social relationship, we investigate the utility of keywords extracted from the message body as a representation of the relationship's characteristics as reflected by the conversation topics. Keyword extraction is performed using standard natural language processing methods. Communication data and human assessments of the extracted keywords are obtained from Facebook users via a custom application. The overall positive quality assessment provides evidence that the keywords indeed convey relevant information about the relationship.


PET: A new Dataset for Process Extraction from Natural Language Text

Mar 09, 2022
Patrizio Bellan, Han van der Aa, Mauro Dragoni, Chiara Ghidini, Simone Paolo Ponzetto

Although there is a long tradition of work in NLP on extracting entities and relations from text, to date there exists little work on the acquisition of business processes from unstructured data such as textual corpora of process descriptions. With this work we aim at filling this gap and establishing the first steps towards bridging the data-driven information extraction methodologies of Natural Language Processing and the model-based formalization aimed at in Business Process Management. To this end, we develop the first corpus of business process descriptions annotated with activities, gateways, actors and flow information. We present our new resource, including a detailed overview of the annotation schema and guidelines, as well as a variety of baselines to benchmark the difficulty and challenges of business process extraction from text.


Now You See Me (CME): Concept-based Model Extraction

Oct 25, 2020
Dmitry Kazhdan, Botty Dimanov, Mateja Jamnik, Pietro Liò, Adrian Weller

Deep Neural Networks (DNNs) have achieved remarkable performance on a range of tasks. A key step to further empowering DNN-based approaches is improving their explainability. In this work we present CME: a concept-based model extraction framework, used for analysing DNN models via concept-based extracted models. Using two case studies (dSprites and Caltech UCSD Birds), we demonstrate how CME can be used to (i) analyse the concept information learned by a DNN model, (ii) analyse how a DNN uses this concept information when predicting output labels, and (iii) identify key concept information that can further improve DNN predictive performance (for one of the case studies, we show how model accuracy can be improved by over 14%, using only 30% of the available concepts).

* Presented at the AIMLAI workshop at the 29th ACM International Conference on Information and Knowledge Management (CIKM 2020) 

PerKey: A Persian News Corpus for Keyphrase Extraction and Generation

Sep 25, 2020
Ehsan Doostmohammadi, Mohammad Hadi Bokaei, Hossein Sameti

Keyphrases provide an extremely dense summary of a text. Such information can be used in many Natural Language Processing tasks, such as information retrieval and text summarization. Since previous studies on Persian keyword or keyphrase extraction have not published their data, the field suffers from the lack of a human-extracted keyphrase dataset. In this paper, we introduce PerKey, a corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality author-extracted keyphrases, which are then filtered and cleaned to achieve higher-quality keyphrases. The resulting data was subjected to human assessment to ensure the quality of the keyphrases. We also measured the performance of different supervised and unsupervised techniques, e.g. TFIDF, MultipartiteRank, KEA, etc. on the dataset using precision, recall, and F1-score.
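The TFIDF baseline named in the abstract can be sketched in a few lines. A minimal sketch over a made-up English toy corpus (not the PerKey data): each candidate term is scored by its in-document frequency times a smoothed inverse document frequency.

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, k=3):
    """Score each term of `doc` by tf * smoothed idf over `corpus`
    and return the k highest-scoring terms."""
    tokenised = [d.lower().split() for d in corpus]
    n = len(tokenised)
    tf = Counter(doc.lower().split())
    def idf(term):
        # Smoothed idf: rare terms across the corpus score higher.
        df = sum(term in d for d in tokenised)
        return math.log((1 + n) / (1 + df)) + 1
    return sorted(tf, key=lambda t: -tf[t] * idf(t))[:k]

# Hypothetical toy corpus; a real run would tokenise Persian text properly.
corpus = [
    "keyphrase extraction ranks keyphrase candidates",
    "the stock market fell sharply today",
    "the weather today is sunny and warm",
]
top_terms = tfidf_keywords(corpus[0], corpus)
```

Production baselines would add stopword filtering and multi-word candidate phrases, but the scoring idea is exactly this.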


From POS tagging to dependency parsing for biomedical event extraction

Aug 11, 2018
Dat Quoc Nguyen, Karin Verspoor

Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance. In this paper, we perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core NLP tasks of POS tagging and dependency parsing on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically no detailed analysis of neural models on this data is available. We also perform a task-oriented evaluation to investigate the influences of these models in a downstream application on biomedical event extraction.
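The syntactic information that event extraction systems draw from a dependency parse is often the shortest dependency path between a trigger and its argument. A minimal sketch of that feature, using a hand-built toy parse rather than the output of the parsers evaluated in the paper:

```python
def dependency_path(heads, a, b):
    """Shortest path between tokens a and b in a dependency tree,
    where heads[i] is the head index of token i and -1 marks the root."""
    def chain_to_root(i):
        chain = [i]
        while heads[i] != -1:
            i = heads[i]
            chain.append(i)
        return chain
    up_a, up_b = chain_to_root(a), chain_to_root(b)
    lca = next(i for i in up_a if i in up_b)   # lowest common ancestor
    return up_a[:up_a.index(lca) + 1] + up_b[:up_b.index(lca)][::-1]

# Hand-built toy parse of "IL-2 activates NF-kB in T cells":
# tokens:  0:IL-2  1:activates  2:NF-kB  3:in  4:T  5:cells
heads = [1, -1, 1, 5, 5, 1]
path = dependency_path(heads, 0, 2)  # IL-2 -> activates -> NF-kB
```

A downstream event extractor would read the trigger ("activates") directly off this path, which is why parsing quality propagates so strongly into event extraction performance.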


Using Neural Networks for Relation Extraction from Biomedical Literature

May 27, 2019
Diana Sousa, Andre Lamurias, Francisco M. Couto

Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been proved to enhance previous state-of-the-art results.
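The "ancestry information" an ontology channel feeds to such a model is simply the chain of ancestors of an entity. A minimal sketch with a hypothetical mini-ontology (the entity names and parent links below are illustrative, not taken from a real resource; real biomedical ontologies such as ChEBI are DAGs with multiple parents):

```python
def ancestors(term, parents):
    """All ancestors of `term` in an ontology given as a child -> parent map
    (single inheritance here for simplicity)."""
    chain = []
    while term in parents:
        term = parents[term]
        chain.append(term)
    return chain

# Hypothetical mini-ontology:
parents = {
    "aspirin": "benzoic acids",
    "benzoic acids": "aromatic compound",
    "aromatic compound": "chemical entity",
}
lineage = ancestors("aspirin", parents)
```

Encoding this chain as an extra input channel is what lets the network generalise from an unseen entity to its known ancestors.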

* Preprint 

Plumber: A Modular Framework to Create Information Extraction Pipelines

Jun 03, 2022
Mohamad Yaser Jaradeh, Kuldeep Singh, Markus Stocker, Sören Auer

Information Extraction (IE) tasks are commonly studied topics in various domains of research. Hence, the community continuously produces multiple techniques, solutions, and tools to perform such tasks. However, running those tools and integrating them within existing infrastructure requires time, expertise, and resources. One pertinent task here is triple extraction and linking, where structured triples are extracted from a text and aligned to an existing Knowledge Graph (KG). In this paper, we present PLUMBER, the first framework that allows users to manually and automatically create suitable IE pipelines from a community-created pool of tools to perform triple extraction and alignment on unstructured text. Our approach provides an interactive medium to alter the pipelines and perform IE tasks. A short video demonstrating the framework on different use cases is available online under:
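The pipeline idea can be sketched as a chain of interchangeable components that each transform a running state. A minimal sketch in the spirit of PLUMBER, not its actual API: the stub components and the toy KG lookup below are hypothetical; a real deployment would plug in actual coreference, triple-extraction, and entity-linking tools from the community pool.

```python
from typing import Callable, Dict, List

class Pipeline:
    """Chain IE components; each one takes and returns the state dict."""
    def __init__(self, components: List[Callable[[Dict], Dict]]):
        self.components = components
    def run(self, text: str) -> Dict:
        state = {"text": text}
        for component in self.components:
            state = component(state)
        return state

def triple_extractor(state):
    # Hypothetical stub: treat the sentence as "subject predicate object".
    subj, pred, obj = state["text"].rstrip(".").split(" ", 2)
    state["triples"] = [(subj, pred, obj)]
    return state

def entity_linker(state):
    # Hypothetical stub: align surface forms to toy Wikidata-style IDs.
    kg = {"Berlin": "Q64", "Germany": "Q183"}
    state["linked"] = [(kg.get(s, s), p, kg.get(o, o))
                       for s, p, o in state["triples"]]
    return state

result = Pipeline([triple_extractor, entity_linker]).run("Berlin capitalOf Germany.")
```

Because every component shares the same state-in, state-out contract, swapping one extractor for another is a one-line change, which is the point of a modular IE framework.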

* pre-print for WWW'21 demo of ICWE PLUMBER publication 