
"Information Extraction": models, code, and papers

Document Intelligence Metrics for Visually Rich Document Evaluation

May 23, 2022
Jonathan DeGange, Swapnil Gupta, Zhuoyu Han, Krzysztof Wilkosz, Adam Karwan

The processing of Visually-Rich Documents (VRDs) is highly important in information extraction tasks associated with Document Intelligence. We introduce DI-Metrics, a Python library devoted to VRD model evaluation, comprising text-based, geometric-based and hierarchical metrics for information extraction tasks. We apply DI-Metrics to evaluate information extraction performance on the publicly available CORD dataset, comparing the performance of three SOTA models and one industry model. The open-source library is available on GitHub.
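As an illustration of the kind of text-based metric such a library computes (this is a hedged sketch, not the DI-Metrics API), a normalized edit-distance similarity can score a predicted field value against its gold annotation:

```python
# Illustrative sketch (not the DI-Metrics API): a text-based metric of the
# kind described above -- normalized edit-distance similarity between a
# predicted field value and its gold annotation.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def field_similarity(pred: str, gold: str) -> float:
    """1.0 for an exact match, 0.0 for a completely different string."""
    if not pred and not gold:
        return 1.0
    return 1.0 - edit_distance(pred, gold) / max(len(pred), len(gold))

print(field_similarity("TOTAL 12.50", "TOTAL 12.50"))  # exact match -> 1.0
```

Geometric and hierarchical metrics would additionally compare bounding boxes and nested field structures, respectively.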


Robust Topological Feature Extraction for Mapping of Environments using Bio-Inspired Sensor Networks

Oct 17, 2014
Alireza Dirafzoon, Edgar Lobaton

In this paper, we exploit minimal sensing information gathered from biologically inspired sensor networks to perform exploration and mapping in an unknown environment. A probabilistic motion model of mobile sensing nodes, inspired by the motion characteristics of cockroaches, is utilized to extract weak encounter information in order to build a topological representation of the environment. Neighbor-to-neighbor interactions among the nodes are exploited to build point clouds representing spatial features of the manifold characterizing the environment based on the sampled data. To extract dominant features from the sampled data, topological data analysis is used to produce persistence intervals for features, to be used for topological mapping. In order to improve the robustness of the sampled data with respect to outliers, density-based subsampling algorithms are employed. Moreover, a robust scale-invariant classification algorithm for persistence diagrams is proposed to provide a quantitative representation of desired features in the data. Furthermore, various strategies for defining encounter metrics with different degrees of information regarding agents' motion are suggested to enhance the precision of the estimation and the classification performance of the topological method.

* 14 pages, 7 figures 
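The density-based subsampling step mentioned above can be sketched as follows (a hedged illustration, not the authors' code): points lying in low-density regions, as measured by the distance to their k-th nearest neighbour, are discarded before persistence is computed.

```python
# Hedged sketch of density-based outlier removal for a sampled point
# cloud: keep only points whose k-NN distance is within a threshold.

import math

def knn_distance(p, cloud, k):
    """Distance from point p to its k-th nearest neighbour in cloud."""
    dists = sorted(math.dist(p, q) for q in cloud if q is not p)
    return dists[k - 1]

def density_subsample(cloud, k=2, threshold=1.0):
    """Discard points in low-density regions (likely outliers)."""
    return [p for p in cloud if knn_distance(p, cloud, k) <= threshold]

cloud = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]  # (5, 5) is an outlier
print(density_subsample(cloud, k=2, threshold=0.5))
```

A production pipeline would then feed the cleaned cloud to a persistent-homology library to obtain the persistence intervals.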

TIFTI: A Framework for Extracting Drug Intervals from Longitudinal Clinic Notes

Dec 03, 2018
Monica Agrawal, Griffin Adams, Nathan Nussbaum, Benjamin Birnbaum

Oral drugs are becoming increasingly common in oncology care. In contrast to intravenous chemotherapy, which is administered in the clinic and carefully tracked via structured electronic health records (EHRs), oral drug treatment is self-administered and therefore not tracked as well. Often, the details of oral cancer treatment occur only in unstructured clinic notes. Extracting this information is critical to understanding a patient's treatment history. Yet, this is a challenging task because treatment intervals must be inferred longitudinally from both explicit mentions in the text as well as from document timestamps. In this work, we present TIFTI (Temporally Integrated Framework for Treatment Intervals), a robust framework for extracting oral drug treatment intervals from a patient's unstructured notes. TIFTI leverages distinct sources of temporal information by breaking the problem down into two separate subtasks: document-level sequence labeling and date extraction. On a labeled dataset of metastatic renal-cell carcinoma (RCC) patients, it exactly matched the labeled start date in 46% of the examples (86% of the examples within 30 days), and it exactly matched the labeled end date in 52% of the examples (78% of the examples within 30 days). Without retraining, the model achieved a similar level of performance on a labeled dataset of advanced non-small-cell lung cancer (NSCLC) patients.

* Machine Learning for Health (ML4H) Workshop at NeurIPS 2018, arXiv:1811.07216 
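The two temporal signals the framework combines can be sketched minimally (this is not the TIFTI implementation; the date format and drug name are illustrative): explicit dates mentioned in the note text, with the document timestamp as a fallback.

```python
# Minimal sketch of combining an explicit in-text date mention with the
# document timestamp as a fallback (illustrative, not TIFTI itself).

import re
from datetime import date

DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")  # assumes MM/DD/YYYY

def extract_start_date(note_text: str, doc_date: date) -> date:
    """Prefer an explicit mention; otherwise fall back to the timestamp."""
    m = DATE_RE.search(note_text)
    if m:
        month, day, year = map(int, m.groups())
        return date(year, month, day)
    return doc_date

print(extract_start_date("Started sunitinib on 03/15/2018.", date(2018, 4, 1)))
# -> 2018-03-15 (explicit mention wins over the document timestamp)
```

TIFTI's document-level sequence labeling subtask then decides which notes fall inside the treatment interval at all.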

OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction

Oct 07, 2020
Keshav Kolluru, Vaibhav Adlakha, Samarth Aggarwal, Mausam, Soumen Chakrabarti

A recent state-of-the-art neural open information extraction (OpenIE) system generates extractions iteratively, requiring repeated encoding of partial outputs. This comes at a significant computational cost. On the other hand, sequence labeling approaches for OpenIE are much faster, but worse in extraction quality. In this paper, we bridge this trade-off by presenting an iterative labeling-based system that establishes a new state of the art for OpenIE, while extracting 10x faster. This is achieved through a novel Iterative Grid Labeling (IGL) architecture, which treats OpenIE as a 2-D grid labeling task. We improve its performance further by applying coverage (soft) constraints on the grid at training time. Moreover, on observing that the best OpenIE systems falter at handling coordination structures, our OpenIE system also incorporates a new coordination analyzer built with the same IGL architecture. This IGL based coordination analyzer helps our OpenIE system handle complicated coordination structures, while also establishing a new state of the art on the task of coordination analysis, with a 12.3 pts improvement in F1 over previous analyzers. Our OpenIE system, OpenIE6, beats the previous systems by as much as 4 pts in F1, while being much faster.

* EMNLP 2020 (Long) 
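The 2-D grid view of OpenIE can be illustrated with a toy example (the labels below are assumed for illustration, not OpenIE6's exact tag set): each row of the grid labels every token of the sentence for one extraction, so several extractions share one encoding of the sentence.

```python
# Toy illustration of 2-D grid labeling for OpenIE: rows are extractions,
# columns are sentence tokens. Label set (assumed): S = subject,
# R = relation, O = object, N = none.

tokens = ["Alice", "founded", "and", "leads", "Acme"]

grid = [
    ["S", "R", "N", "N", "O"],   # (Alice; founded; Acme)
    ["S", "N", "N", "R", "O"],   # (Alice; leads; Acme)
]

def decode_row(row, toks):
    """Collect the tokens carrying each role into an (S, R, O) triple."""
    pick = lambda tag: " ".join(t for t, l in zip(toks, row) if l == tag)
    return (pick("S"), pick("R"), pick("O"))

for row in grid:
    print(decode_row(row, tokens))
```

Note how the coordination "founded and leads" yields two grid rows; handling such structures well is exactly what the paper's coordination analyzer targets.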

AnnIE: An Annotation Platform for Constructing Complete Open Information Extraction Benchmark

Sep 15, 2021
Niklas Friedrich, Kiril Gashteovski, Mingying Yu, Bhushan Kotnis, Carolin Lawrence, Mathias Niepert, Goran Glavaš

Open Information Extraction (OIE) is the task of extracting facts from sentences in the form of relations and their corresponding arguments in a schema-free manner. Intrinsic performance of OIE systems is difficult to measure due to the incompleteness of existing OIE benchmarks: the ground truth extractions do not group all acceptable surface realizations of the same fact that can be extracted from a sentence. To measure the performance of OIE systems more realistically, it is necessary to manually annotate complete facts (i.e., clusters of all acceptable surface realizations of the same fact) from input sentences. We propose AnnIE: an interactive annotation platform that facilitates such challenging annotation tasks and supports the creation of complete fact-oriented OIE evaluation benchmarks. AnnIE is modular and flexible in order to support different use case scenarios (i.e., benchmarks covering different types of facts). We use AnnIE to build two complete OIE benchmarks: one with verb-mediated facts and another with facts encompassing named entities. Finally, we evaluate several OIE systems on our complete benchmarks created with AnnIE. Our results suggest that existing incomplete benchmarks are overly lenient, and that OIE systems are not as robust as previously reported. We publicly release AnnIE under a non-restrictive license.
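The cluster-based evaluation idea can be sketched as follows (a minimal illustration, not AnnIE's scorer): a system triple is correct if it matches any acceptable surface realization in a gold cluster, and each cluster is credited at most once.

```python
# Hedged sketch of fact-cluster evaluation: match a system triple against
# any acceptable realization of a gold fact; credit each cluster once.

def score(predictions, gold_clusters):
    """Return (num_correct, recall) against complete fact clusters."""
    matched = set()
    correct = 0
    for triple in predictions:
        for i, cluster in enumerate(gold_clusters):
            if i not in matched and triple in cluster:
                matched.add(i)
                correct += 1
                break
    return correct, len(matched) / len(gold_clusters)

gold = [
    {("Bell", "makes", "products"), ("Bell", "is making", "products")},
    {("Bell", "is based in", "LA"), ("Bell", "based in", "LA")},
]
print(score([("Bell", "is making", "products")], gold))  # -> (1, 0.5)
```

Against an incomplete benchmark listing only one realization per fact, the same prediction could be unfairly scored as wrong, which is the leniency/robustness gap the paper measures.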


User preference extraction using dynamic query sliders in conjunction with UPS-EMO algorithm

Oct 27, 2011
Timo Aittokoski, Suvi Tarkkanen

One drawback of evolutionary multiobjective optimization algorithms (EMOAs) has traditionally been the high computational cost of creating an approximation of the Pareto front: the number of required objective function evaluations usually grows large. On the other hand, it may be difficult for the decision maker (DM) to select one of the many produced solutions as the final one, especially in the case of more than two objectives. To overcome the above-mentioned drawbacks, a number of EMOAs incorporating the decision maker's preference information have been proposed. In this case, it is possible to save objective function evaluations by generating only the part of the front the DM is interested in, thus also narrowing down the pool of possible selections for the final solution. Unfortunately, most of the current EMO approaches utilizing preferences are not very intuitive to use, i.e., they may require tweaking of unintuitive parameters, and it is not always clear what kind of results one can get with a given set of parameters. In this study we propose a new approach to visually inspect produced solutions and to extract preference information from the DM to further guide the search. Our approach is based on the intuitive use of dynamic query sliders, which serve as a means to extract preference information and are part of the graphical user interface implemented for the efficient UPS-EMO algorithm.
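The slider mechanism can be sketched abstractly (a hedged illustration, not the UPS-EMO implementation): each dynamic query slider places an upper bound on one objective, and only solutions inside every bound stay visible, which is the preference signal that guides further search.

```python
# Hedged sketch of dynamic query sliders as objective upper bounds over a
# set of candidate solutions (minimization assumed for all objectives).

def filter_solutions(solutions, upper_bounds):
    """Keep solutions whose every objective is within its slider bound."""
    return [s for s in solutions
            if all(f <= b for f, b in zip(s, upper_bounds))]

# Three solutions on a bi-objective front (minimize both objectives).
front = [(0.2, 0.9), (0.5, 0.5), (0.9, 0.2)]
print(filter_solutions(front, upper_bounds=(0.6, 0.6)))  # -> [(0.5, 0.5)]
```

In the interactive setting, the surviving region of the front tells the optimizer where to concentrate subsequent objective function evaluations.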


Performance Effectiveness of Multimedia Information Search Using the Epsilon-Greedy Algorithm

Nov 22, 2019
Nikki Lijing Kuang, Clement H. C. Leung

In the search and retrieval of multimedia objects, it is impractical to either manually or automatically extract the contents for indexing, since most multimedia contents are not machine extractable, while manual extraction tends to be highly laborious and time-consuming. However, by systematically capturing and analyzing the feedback patterns of human users, vital information concerning the multimedia contents can be harvested for effective indexing and subsequent search. By learning from the human judgment and mental evaluation of users, effective search indices can be gradually developed and built up, and subsequently be exploited to find the most relevant multimedia objects. To avoid hovering around a local maximum, we apply the epsilon-greedy method to systematically explore the search space. Through such methodic exploration, we show that the proposed approach is able to guarantee that the most relevant objects can always be discovered, even if they were initially overlooked or not regarded as relevant. The search behavior of the present approach is quantitatively analyzed, and closed-form expressions are obtained for the performance of two variants of the epsilon-greedy algorithm, namely EGSE-A and EGSE-B. Simulations and experiments on a real data set have been performed, which show good agreement with the theoretical findings. The present method is able to leverage exploration in an effective way to significantly raise the performance of multimedia information search, and enables the reliable discovery of relevant objects that would otherwise remain undiscoverable.

* 8 pages, 10 figures. IEEE ICMLA 2019 
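The core selection rule is the generic epsilon-greedy trade-off (the EGSE-A/B specifics are not reproduced here): with probability epsilon, explore a random object; otherwise, exploit the object whose accumulated user-feedback score is currently highest.

```python
# Generic epsilon-greedy selection sketch: explore with probability
# epsilon, otherwise exploit the best-scored object so far.

import random

def epsilon_greedy_select(scores, epsilon, rng=random):
    """Pick the index of the object to show the user next."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))                       # explore
    return max(range(len(scores)), key=scores.__getitem__)      # exploit

scores = [0.1, 0.7, 0.3]          # relevance estimates from past feedback
chosen = epsilon_greedy_select(scores, epsilon=0.0)
print(chosen)  # epsilon = 0 always exploits -> index 1
```

Sustained exploration is what guarantees that an object with an initially low score still gets shown, and can accumulate feedback, eventually.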

A Simple yet Effective Relation Information Guided Approach for Few-Shot Relation Extraction

May 19, 2022
Yang Liu, Jinpeng Hu, Xiang Wan, Tsung-Hui Chang

Few-Shot Relation Extraction aims at predicting the relation for a pair of entities in a sentence by training with a few labelled examples of each relation. Some recent works have introduced relation information (i.e., relation labels or descriptions) to assist model learning based on Prototype Network. However, most of them constrain the prototypes of each relation class implicitly with relation information, generally through designing complex network structures, like generating hybrid features, combining with contrastive learning or attention networks. We argue that relation information can be introduced more explicitly and effectively into the model. Thus, this paper proposes a direct addition approach to introduce relation information. Specifically, for each relation class, the relation representation is first generated by concatenating two views of relations (i.e., the [CLS] token embedding and the mean value of the embeddings of all tokens) and then directly added to the original prototype for both training and prediction. Experimental results on the benchmark dataset FewRel 1.0 show significant improvements and achieve results comparable to the state of the art, which demonstrates the effectiveness of our proposed approach. Besides, further analyses verify that direct addition is a much more effective way to integrate the relation representations and the original prototypes.

* Accepted to Findings of ACL 2022 
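The direct-addition step can be shown numerically with toy vectors (the values below are hypothetical, not the paper's encoder outputs): the relation representation is the [CLS] embedding concatenated with the mean token embedding, added element-wise to the prototype.

```python
# Toy numeric sketch of the paper's direct-addition step with
# hypothetical 2-dimensional embeddings (not real encoder outputs).

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def relation_representation(cls_emb, token_embs):
    """Concatenate the two views of the relation description."""
    return cls_emb + mean_vec(token_embs)

def enhanced_prototype(prototype, rel_rep):
    """Direct element-wise addition of relation representation and prototype."""
    return [p + r for p, r in zip(prototype, rel_rep)]

cls_emb = [1.0, 0.0]
token_embs = [[0.0, 2.0], [0.0, 4.0]]    # mean token embedding: [0.0, 3.0]
prototype = [0.5, 0.5, 0.5, 0.5]         # same dimension as the concatenation
print(enhanced_prototype(prototype, relation_representation(cls_emb, token_embs)))
# -> [1.5, 0.5, 0.5, 3.5]
```

The appeal of the approach is exactly this simplicity: one addition replaces the hybrid-feature, contrastive, or attention machinery of prior work.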

Inscriptis -- A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

Jul 12, 2021
Albert Weichselbraun

Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium. In contrast to existing software packages such as HTML2text, jusText and Lynx, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. Inscriptis excels in terms of conversion quality, since it correctly converts complex HTML constructs such as nested tables and also interprets a subset of HTML attributes that determine the text alignment. In addition, it (ii) supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document, if annotation support has been enabled.
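The underlying task can be sketched with the standard library alone (this is NOT Inscriptis itself, which does far more layout-aware work, including table rendering and annotation rules): convert HTML to plain text while inserting line breaks at block-level elements so some document structure survives.

```python
# Stdlib sketch of basic HTML-to-text conversion: emit a newline at
# block-level elements so the plain text keeps some of the layout.
# Inscriptis goes much further (nested tables, alignment, annotations).

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    BLOCK = {"p", "div", "br", "li", "tr", "table", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

parser = TextExtractor()
parser.feed("<h1>Title</h1><p>First paragraph.</p><p>Second.</p>")
print(parser.text())
```

A naive tag-stripping approach would instead run all the text together, which is precisely the failure mode layout-aware converters like Inscriptis avoid.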


Multi-Modality Information Fusion for Radiomics-based Neural Architecture Search

Jul 12, 2020
Yige Peng, Lei Bi, Michael Fulham, Dagan Feng, Jinman Kim

'Radiomics' is a method that extracts mineable quantitative features from radiographic images. These features can then be used to determine prognosis, for example, predicting the development of distant metastases (DM). Existing radiomics methods, however, require complex manual effort including the design of hand-crafted radiomic features and their extraction and selection. Recent radiomics methods, based on convolutional neural networks (CNNs), also require manual input in network architecture design and hyper-parameter tuning. Radiomic complexity is further compounded when there are multiple imaging modalities, for example, combined positron emission tomography - computed tomography (PET-CT), where there is functional information from PET and complementary anatomical localization information from computed tomography (CT). Existing multi-modality radiomics methods manually fuse the data that are extracted separately. Reliance on manual fusion often results in sub-optimal fusion because it depends on an expert's understanding of medical images. In this study, we propose a multi-modality neural architecture search method (MM-NAS) to automatically derive optimal multi-modality image features for radiomics and thus negate the dependence on a manual process. We evaluated our MM-NAS on the ability to predict DM using a public PET-CT dataset of patients with soft-tissue sarcomas (STSs). Our results show that our MM-NAS had a higher prediction accuracy when compared to state-of-the-art radiomics methods.

* Accepted by MICCAI 2020
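The search idea can be reduced to a toy sketch (the scorer and fusion operations below are hypothetical stand-ins, not MM-NAS): treat the choice of multi-modality fusion operation as a searchable decision and keep whichever candidate validates best, rather than fixing the fusion manually.

```python
# Toy sketch of searching over fusion operations for two modality
# feature vectors (hypothetical scorer; MM-NAS searches a far larger
# architecture space with learned evaluation).

def fuse(pet, ct, op):
    """Combine PET and CT feature vectors with the chosen operation."""
    if op == "add":
        return [a + b for a, b in zip(pet, ct)]
    if op == "max":
        return [max(a, b) for a, b in zip(pet, ct)]
    return pet + ct                              # "concat"

def search(pet, ct, score):
    """Exhaustively score each fusion op and return the best one."""
    return max(("add", "max", "concat"), key=lambda op: score(fuse(pet, ct, op)))

pet, ct = [1, 3], [2, 2]
best = search(pet, ct, score=max)                # stand-in validation metric
print(best)  # -> "add"
```

The point of NAS is to make such fusion choices (and many others) automatically from validation performance instead of expert intuition.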