"Information Extraction": models, code, and papers

Mining Measured Information from Text

May 05, 2015
Arun S. Maiya, Dale Visser, Andrew Wan

We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2). Such extractions are critically important across a wide range of domains, especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 μm"). Finally, we present MQSearch: the realization of a search engine with full support for measured information.

* 4 pages; 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15) 
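
The idea of a rule-based extractor for measured quantities (a numeric value paired with a unit) can be illustrated with a small regular-expression sketch. This is not the authors' extractor: the unit list `UNITS`, the pattern `QUANTITY_RE`, and the function `extract_quantities` are hypothetical names chosen for illustration, and the handful of units shown stands in for the much larger set the abstract describes.

```python
import re

# Hypothetical, deliberately tiny unit list (an assumption for illustration);
# the extractor described in the abstract supports far more units.
UNITS = ["degrees C", "kg/m^2", "μm", "mm", "cm", "kg", "m"]

# Try longer units first so that "mm" is preferred over "m".
_unit_alternatives = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)

# A number, optional whitespace, then a known unit that is not
# immediately followed by another letter.
QUANTITY_RE = re.compile(
    r"(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?P<unit>" + _unit_alternatives + r")(?![A-Za-z])"
)


def extract_quantities(text):
    """Return (value, unit) pairs for every measured quantity found in text."""
    return [
        (float(m.group("value")), m.group("unit"))
        for m in QUANTITY_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2"
    print(extract_quantities(sample))
```

A pattern like this handles only clean text; the robustness to PDF-conversion errors and the extraction of the measured property (e.g., "pixel pitch") mentioned in the abstract are outside the scope of this sketch.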

High Order Local Directional Pattern Based Pyramidal Multi-structure for Robust Face Recognition

Dec 12, 2020
Almabrok Essa, Vijayan Asari

Derived from a general definition of texture in a local neighborhood, local directional pattern (LDP) encodes the directional information in the small local 3x3 neighborhood of a pixel, which may fail to extract detailed information, especially when the input image changes due to illumination variations. Therefore, in this paper we introduce a novel feature extraction technique that calculates the nth order direction variation patterns, named high order local directional pattern (HOLDP). The proposed HOLDP can capture more detailed discriminative information than the conventional LDP. Unlike the LDP operator, our proposed technique extracts nth order local information by encoding various distinctive spatial relationships from each neighborhood layer of a pixel in a pyramidal multi-structure way. Then we concatenate the feature vector of each neighborhood layer to form the final HOLDP feature vector. We evaluate the proposed HOLDP algorithm on several publicly available face databases and observe the superiority of HOLDP under extreme illumination conditions.

* 9 pages, 10 figures 
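
For readers unfamiliar with the baseline, the conventional LDP code of a single 3x3 neighborhood can be sketched as follows. This illustrates only the first-order LDP operator as it is commonly described in the literature, not the proposed HOLDP. The use of Kirsch edge masks, the choice of the k = 3 strongest absolute responses, and the bit ordering are assumptions not stated in the abstract; `kirsch_masks` and `ldp_code` are hypothetical helper names.

```python
import numpy as np

# Border positions of a 3x3 window, starting at the east neighbour and
# moving counter-clockwise: E, NE, N, NW, W, SW, S, SE.
RING = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]


def kirsch_masks():
    """Build the eight 3x3 Kirsch masks by rotating the east-facing mask."""
    base = np.array([5, 5, -3, -3, -3, -3, -3, 5])  # 5s at E, NE and SE
    masks = []
    for i in range(8):
        mask = np.zeros((3, 3), dtype=int)
        for (row, col), weight in zip(RING, np.roll(base, i)):
            mask[row, col] = weight
        masks.append(mask)
    return masks


def ldp_code(patch, k=3):
    """Return the 8-bit LDP code of one 3x3 patch (assumed definition)."""
    patch = np.asarray(patch)
    responses = np.array([np.sum(mask * patch) for mask in kirsch_masks()])
    # Set the bits of the k directions with the largest absolute response.
    top = np.argsort(-np.abs(responses), kind="stable")[:k]
    return int(sum(1 << int(i) for i in top))


if __name__ == "__main__":
    patch = [[40, 90, 50],
             [2, 7, 1],
             [3, 4, 5]]
    print(ldp_code(patch))
```

Applying `ldp_code` to every pixel's neighborhood and histogramming the codes gives the usual LDP descriptor; HOLDP, per the abstract, extends this to higher-order patterns over several neighborhood layers.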

Slot Filling for Biomedical Information Extraction

Sep 17, 2021
Yannis Papanikolaou, Francine Bennett

Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity and relation type specific training data is a major bottleneck in the above sub-tasks. In this work we present a slot filling approach to the task of biomedical IE, effectively replacing the need for entity and relation-specific training data and making it possible to deal with zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end for standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks, in the absence of relevant training data. Our code, models and pretrained data are available at


Joint Extraction of Events and Entities within a Document Context

Sep 12, 2016
Bishan Yang, Tom Mitchell

Events and entities are closely related; entities are often actors or participants in events and events without entities are uncommon. The interpretation of events and entities is highly contextually dependent. Existing work in information extraction typically models events separately from entities, and performs inference at the sentence level, ignoring the rest of the document. In this paper, we propose a novel approach that models the dependencies among variables of events, entities, and their relations, and performs joint inference of these variables across a document. The goal is to enable access to document-level contextual information and facilitate context-aware predictions. We demonstrate that our approach substantially outperforms the state-of-the-art methods for event extraction as well as a strong baseline for entity extraction.

* Proceedings of NAACL-HLT 2016, pages 289-299 
* 11 pages, 2 figures, published at NAACL 2016 

MRZ code extraction from visa and passport documents using convolutional neural networks

Sep 11, 2020
Yichuan Liu, Hailey James, Otkrist Gupta, Dan Raviv

Detecting and extracting information from Machine-Readable Zone (MRZ) on passports and visas is becoming increasingly important for verifying document authenticity. However, computer vision methods for performing similar tasks, such as optical character recognition (OCR), fail to extract the MRZ given digital images of passports with reasonable accuracy. We present a specially designed model based on convolutional neural networks that is able to successfully extract MRZ information from digital images of passports of arbitrary orientation and size. Our model achieved 100% MRZ detection rate and 98.36% character recognition macro-f1 score on a passport and visa dataset.


An evaluation of keyword extraction from online communication for the characterisation of social relations

Feb 11, 2014
Jan Hauffa, Tobias Lichtenberg, Georg Groh

The set of interpersonal relationships on a social network service or a similar online community is usually highly heterogeneous. The concept of tie strength captures only one aspect of this heterogeneity. Since the unstructured text content of online communication artefacts is a salient source of information about a social relationship, we investigate the utility of keywords extracted from the message body as a representation of the relationship's characteristics as reflected by the conversation topics. Keyword extraction is performed using standard natural language processing methods. Communication data and human assessments of the extracted keywords are obtained from Facebook users via a custom application. The overall positive quality assessment provides evidence that the keywords indeed convey relevant information about the relationship.


PET: A new Dataset for Process Extraction from Natural Language Text

Mar 09, 2022
Patrizio Bellan, Han van der Aa, Mauro Dragoni, Chiara Ghidini, Simone Paolo Ponzetto

Although there is a long tradition of work in NLP on extracting entities and relations from text, to date there exists little work on the acquisition of business processes from unstructured data such as textual corpora of process descriptions. With this work we aim at filling this gap and establishing the first steps towards bridging data-driven information extraction methodologies from Natural Language Processing and the model-based formalization that is the aim of Business Process Management. For this, we develop the first corpus of business process descriptions annotated with activities, gateways, actors and flow information. We present our new resource, including a detailed overview of the annotation schema and guidelines, as well as a variety of baselines to benchmark the difficulty and challenges of business process extraction from text.


Now You See Me (CME): Concept-based Model Extraction

Oct 25, 2020
Dmitry Kazhdan, Botty Dimanov, Mateja Jamnik, Pietro Liò, Adrian Weller

Deep Neural Networks (DNNs) have achieved remarkable performance on a range of tasks. A key step to further empowering DNN-based approaches is improving their explainability. In this work we present CME: a concept-based model extraction framework, used for analysing DNN models via concept-based extracted models. Using two case studies (dSprites, and Caltech UCSD Birds), we demonstrate how CME can be used to (i) analyse the concept information learned by a DNN model; (ii) analyse how a DNN uses this concept information when predicting output labels; and (iii) identify key concept information that can further improve DNN predictive performance (for one of the case studies, we show how model accuracy can be improved by over 14%, using only 30% of the available concepts).

* Presented at the AIMLAI workshop at the 29th ACM International Conference on Information and Knowledge Management (CIKM 2020) 

PerKey: A Persian News Corpus for Keyphrase Extraction and Generation

Sep 25, 2020
Ehsan Doostmohammadi, Mohammad Hadi Bokaei, Hossein Sameti

Keyphrases provide an extremely dense summary of a text. Such information can be used in many Natural Language Processing tasks, such as information retrieval and text summarization. Since previous studies on Persian keyword or keyphrase extraction have not published their data, the field suffers from the lack of a human-extracted keyphrase dataset. In this paper, we introduce PerKey, a corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality author-extracted keyphrases, which is then filtered and cleaned to achieve higher-quality keyphrases. The resulting data was subjected to human assessment to ensure the quality of the keyphrases. We also measured the performance of different supervised and unsupervised techniques (e.g., TFIDF, MultipartiteRank, and KEA) on the dataset using precision, recall, and F1-score.
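
The evaluation metrics named above can be sketched for a single document as follows. This is a minimal illustration assuming exact matching after lowercasing and whitespace trimming; the abstract does not state the matching criterion used for PerKey, and `keyphrase_prf1` is a hypothetical helper name.

```python
def keyphrase_prf1(predicted, gold):
    """Precision, recall and F1 for one document's keyphrases.

    Assumes exact match after lowercasing and trimming whitespace,
    which is an illustrative choice, not the paper's stated protocol.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    predicted = ["economy", "inflation", "oil price", "election"]
    gold = ["Economy", "inflation", "central bank"]
    print(keyphrase_prf1(predicted, gold))
```

Corpus-level scores are typically obtained by averaging these per-document values or by pooling counts across documents; which of the two applies here is not specified in the abstract.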


From POS tagging to dependency parsing for biomedical event extraction

Aug 11, 2018
Dat Quoc Nguyen, Karin Verspoor

Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance. In this paper, we perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core NLP tasks of POS tagging and dependency parsing on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically no detailed analysis of neural models on this data is available. We also perform a task-oriented evaluation to investigate the influences of these models in a downstream application on biomedical event extraction.