"Information Extraction": models, code, and papers

Explaining black-box text classifiers for disease-treatment information extraction

Oct 21, 2020
Milad Moradi, Matthias Samwald

Deep neural networks and other intricate Artificial Intelligence (AI) models have reached high levels of accuracy on many biomedical natural language processing tasks. However, their applicability in real-world use cases may be limited due to their vague inner working and decision logic. A post-hoc explanation method can approximate the behavior of a black-box AI model by extracting relationships between feature values and outcomes. In this paper, we introduce a post-hoc explanation method that utilizes confident itemsets to approximate the behavior of black-box classifiers for medical information extraction. Incorporating medical concepts and semantics into the explanation process, our explanator finds semantic relations between inputs and outputs in different parts of the decision space of a black-box classifier. The experimental results show that our explanation method can outperform perturbation and decision set based explanators in terms of fidelity and interpretability of explanations produced for predictions on a disease-treatment information extraction task.


Distantly-Supervised Neural Relation Extraction with Side Information using BERT

May 10, 2020
Johny Moreira, Chaina Oliveira, David Macêdo, Cleber Zanchettin, Luciano Barbosa

Relation extraction (RE) consists in categorizing the relationship between entities in a sentence. A recent paradigm to develop relation extractors is Distant Supervision (DS), which allows the automatic creation of new datasets by taking an alignment between a text corpus and a Knowledge Base (KB). KBs can sometimes also provide additional information to the RE task. One of the methods that adopt this strategy is the RESIDE model, which proposes a distantly-supervised neural relation extraction using side information from KBs. Considering that this method outperformed state-of-the-art baselines, in this paper, we propose a related approach to RESIDE also using additional side information, but simplifying the sentence encoding with BERT embeddings. Through experiments, we show the effectiveness of the proposed method in Google Distant Supervision and Riedel datasets concerning the BGWA and RESIDE baseline methods. Although Area Under the Curve is decreased because of unbalanced datasets, P@N results have shown that the use of BERT as sentence encoding allows superior performance to baseline methods.
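The abstract above reports a precision-at-N style ranking metric (commonly written P@N) alongside area under the curve. As a rough illustration of how such a metric is usually computed, here is a minimal Python sketch. The function name, the input layout, and the toy scores are assumptions made for this example; they are not taken from the paper or its code.

```python
def precision_at_n(scored_predictions, n):
    """Fraction of correct predictions among the n most confident ones.

    scored_predictions: list of (confidence, is_correct) pairs.
    Assumption: we divide by n even if fewer than n predictions exist,
    which is one common convention (others divide by the number returned).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Rank predictions from most to least confident.
    ranked = sorted(scored_predictions, key=lambda pair: pair[0], reverse=True)
    top = ranked[:n]
    correct = sum(1 for _, is_correct in top if is_correct)
    return correct / n


# Hypothetical toy data: (model confidence, whether the relation was correct).
toy_predictions = [
    (0.9, True),
    (0.5, False),
    (0.8, True),
    (0.6, True),
    (0.7, False),
]

for n in (1, 3, 5):
    print(f"P@{n} = {precision_at_n(toy_predictions, n):.3f}")
```

Because the metric only looks at the top of the ranking, it is less affected by a long tail of low-confidence predictions than an area-based measure computed over the full ranking.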


Video Summarization using Keyframe Extraction and Video Skimming

Oct 10, 2019
Shruti Jadon, Mahmood Jasim

Video is one of the robust sources of information and the consumption of online and offline videos has reached an unprecedented level in the last few years. A fundamental challenge of extracting information from videos is that a viewer has to go through the complete video to understand the context, as opposed to an image, where the viewer can extract information from a single frame. In this work, we attempt to employ different algorithmic methodologies, including local features and deep neural networks, along with multiple clustering methods to find an effective way of summarizing a video by interesting keyframe extraction.

* 5 pages, 3 figures. Technical Report 

Research on the pixel-based and object-oriented methods of urban feature extraction with GF-2 remote-sensing images

Mar 08, 2019
Dong-dong Zhang, Lei Zhang, Vladimir Zaborovsky, Feng Xie, Yan-wen Wu, Ting-ting Lu

During the rapid urbanization construction of China, acquisition of urban geographic information and timely data updating are important and fundamental tasks for the refined management of cities. With the development of domestic remote sensing technology, the application of Gaofen-2 (GF-2) high-resolution remote sensing images can greatly improve the accuracy of information extraction. This paper introduces an approach using object-oriented classification methods for urban feature extraction based on GF-2 satellite data. A combination of spectral, spatial attributes and membership functions was employed for mapping the urban features of Qinhuai District, Nanjing. The data preprocessing is carried out by ENVI software, and the subsequent data is exported into the eCognition software for object-oriented classification and extraction of urban feature information. Finally, the obtained raster image classification results are vectorized using the ARCGIS software, and the vector graphics are stored in the library, which can be used for further analysis and modeling. Accuracy assessment was performed using ground truth data acquired by visual interpretation and from other reliable secondary data sources. Compared with the result of pixel-based supervised (neural net) classification, the developed object-oriented method can significantly improve extraction accuracy, and after manual interpretation, an overall accuracy of 95.44% can be achieved, with a Kappa coefficient of 0.9405, which objectively confirmed the superiority of the object-oriented method and the feasibility of the utilization of GF-2 satellite data.
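The abstract above reports an overall accuracy and a Kappa coefficient. For readers unfamiliar with these measures, the following minimal Python sketch shows how both are typically derived from a confusion matrix. The function name and the toy matrix are hypothetical and purely illustrative; they do not reproduce the paper's data or its reported figures.

```python
def accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples whose reference class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")

    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total

    # Agreement expected by chance, from row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical 3-class example (rows: reference, columns: predicted).
toy_confusion = [
    [50, 2, 1],
    [3, 45, 2],
    [0, 4, 43],
]

accuracy, kappa = accuracy_and_kappa(toy_confusion)
print(f"overall accuracy = {accuracy:.4f}, kappa = {kappa:.4f}")
```

Kappa discounts the agreement that would occur by chance given the class frequencies, which is why it is usually somewhat lower than the overall accuracy for the same matrix.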


Relational Learning and Feature Extraction by Querying over Heterogeneous Information Networks

Jul 25, 2017
Parisa Kordjamshidi, Sameer Singh, Daniel Khashabi, Christos Christodoulopoulos, Mark Sammons, Saurabh Sinha, Dan Roth

Many real-world systems need to operate on heterogeneous information networks that consist of numerous interacting components of different types. Examples include systems that perform data analysis on biological information networks; social networks; and information extraction systems processing unstructured data to convert raw text to knowledge graphs. Many previous works describe specialized approaches to perform specific types of analysis, mining and learning on such networks. In this work, we propose a unified framework consisting of a data model (a graph with a first-order schema) along with a declarative language for constructing, querying and manipulating such networks in ways that facilitate relational and structured machine learning. In particular, we provide an initial prototype for a relational and graph traversal query language where queries are directly used as relational features for structured machine learning models. Feature extraction is performed by making declarative graph traversal queries. Learning and inference models can directly operate on this relational representation and augment it with new data and knowledge that, in turn, is integrated seamlessly into the relational structure to support new predictions. We demonstrate this system's capabilities by showcasing tasks in natural language processing and computational biology domains.

* Seventh International Workshop on Statistical Relational AI, 2017 

MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction

Sep 30, 2021
Tanishq Gupta, Mohd Zaki, N. M. Anoop Krishnan, Mausam

An overwhelmingly large amount of knowledge in the materials domain is generated and stored as text published in peer-reviewed scientific literature. Recent developments in natural language processing, such as bidirectional encoder representations from transformers (BERT) models, provide promising tools to extract information from these texts. However, direct application of these models in the materials domain may yield suboptimal results as the models themselves may not be trained on notations and jargon that are specific to the domain. Here, we present a materials-aware language model, namely, MatSciBERT, which is trained on a large corpus of scientific literature published in the materials domain. We further evaluate the performance of MatSciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction, on different materials datasets. We show that MatSciBERT outperforms SciBERT, a language model trained on science corpus, on all the tasks. Further, we discuss some of the applications of MatSciBERT in the materials domain for extracting information, which can, in turn, contribute to materials discovery or optimization. Finally, to make the work accessible to the larger materials community, we make the pretrained and finetuned weights and the models of MatSciBERT freely accessible.


PFAx: Predictable Feature Analysis to Perform Control

Dec 02, 2017
Stefan Richthofer, Laurenz Wiskott

Predictable Feature Analysis (PFA) (Richthofer, Wiskott, ICMLA 2015) is an algorithm that performs dimensionality reduction on a high-dimensional input signal. It extracts those subsignals that are most predictable according to a certain prediction model. We refer to these extracted signals as predictable features. In this work we extend the notion of PFA to take supplementary information into account for improving its predictions. Such information can be a multidimensional signal like the main input to PFA, but is regarded as external. That means it won't participate in the feature extraction: no features get extracted or composed of it. Features will be exclusively extracted from the main input such that they are most predictable based on themselves and the supplementary information. We refer to this enhanced PFA as PFAx (PFA extended). Even more important than improving prediction quality is to observe the effect of supplementary information on feature selection. PFAx transparently provides insight into how the supplementary information adds to prediction quality and whether it is valuable at all. Finally, we show how to invert that relation and generate the supplementary information such that it would yield a certain desired outcome of the main signal. We apply this to a setting inspired by reinforcement learning and let the algorithm learn how to control an agent in an environment. With this method it is feasible to locally optimize the agent's state, i.e. reach a certain goal that is near enough. We are preparing a follow-up paper that extends this method such that global optimization is also feasible.


Dynamic Visual Analytics for Elicitation Meetings with ELICA

Jul 10, 2018
Zahra Shakeri Hossein Abad, Munib Rahman, Abdullah Cheema, Vincenzo Gervasi, Didar Zowghi, Ken Barker

Requirements elicitation can be very challenging in projects that require deep domain knowledge about the system at hand. As analysts have full control over the elicitation process, their lack of knowledge about the system under study inhibits them from asking related questions and reduces the accuracy of requirements provided by stakeholders. We present ELICA, a generic interactive visual analytics tool to assist analysts during the requirements elicitation process. ELICA uses a novel information extraction algorithm based on a combination of Weighted Finite State Transducers (WFSTs) (generative model) and SVMs (discriminative model). ELICA presents the extracted relevant information in an interactive GUI (including zooming, panning, and pinching) that allows analysts to explore which parts of the ongoing conversation (or specification document) match with the extracted information. In this demonstration, we show that ELICA is usable and effective in practice, and is able to extract the related information in real-time. We also demonstrate how carefully designed features in ELICA facilitate the interactive and dynamic process of information extraction.


An Information Extraction Approach to Prescreen Heart Failure Patients for Clinical Trials

Sep 06, 2016
Abhishek Kalyan Adupa, Ravi Prakash Garg, Jessica Corona-Cox, Sanjiv J. Shah, Siddhartha R. Jonnalagadda

To reduce the large amount of time spent screening, identifying, and recruiting patients into clinical trials, we need prescreening systems that are able to automate the data extraction and decision-making tasks that are typically relegated to clinical research study coordinators. However, a major obstacle is the vast amount of patient data available as unstructured free-form text in electronic health records. Here we propose an information extraction-based approach that first automatically converts unstructured text into a structured form. The structured data are then compared against a list of eligibility criteria using a rule-based system to determine which patients qualify for enrollment in a heart failure clinical trial. We show that we can achieve highly accurate results, with recall and precision values of 0.95 and 0.86, respectively. Our system allowed us to significantly reduce the time needed for prescreening patients from a few weeks to a few minutes. Our open-source information extraction modules are available for researchers and could be tested and validated in other cardiovascular trials. An approach such as the one we demonstrate here may decrease costs and expedite clinical trials, and could enhance the reproducibility of trials across institutions and populations.
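The abstract above describes comparing structured patient data against eligibility criteria with a rule-based system, and evaluating the result with precision and recall. A minimal Python sketch of that general pattern follows. Every field name, threshold, and record here is a hypothetical placeholder chosen for illustration; none of them are the trial's actual criteria or the authors' implementation.

```python
# Hypothetical eligibility rules: each rule maps a structured record to True/False.
# These thresholds are illustrative assumptions, not real clinical criteria.
HYPOTHETICAL_RULES = [
    lambda record: record["age"] >= 18,
    lambda record: record["ejection_fraction"] >= 50,
    lambda record: not record["on_dialysis"],
]


def is_eligible(record, rules):
    """A patient qualifies only if every rule is satisfied."""
    return all(rule(record) for rule in rules)


def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy structured records, as an extraction step might produce them.
toy_records = [
    {"age": 64, "ejection_fraction": 55, "on_dialysis": False},
    {"age": 17, "ejection_fraction": 55, "on_dialysis": False},
    {"age": 70, "ejection_fraction": 35, "on_dialysis": False},
    {"age": 58, "ejection_fraction": 60, "on_dialysis": True},
    {"age": 45, "ejection_fraction": 52, "on_dialysis": False},
]
# Toy reference labels, standing in for a coordinator's manual decisions.
toy_manual_labels = [True, False, True, False, True]

toy_predicted = [is_eligible(r, HYPOTHETICAL_RULES) for r in toy_records]
precision, recall = precision_recall(toy_predicted, toy_manual_labels)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Keeping the rules as separate predicates makes it straightforward to audit which criterion excluded a given patient, which matters when the output feeds a human prescreening workflow.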