Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Information Extraction": models, code, and papers

Learning with Scope, with Application to Information Extraction and Classification

Dec 12, 2012
David Blei, J Andrew Bagnell, Andrew McCallum

In probabilistic approaches to classification and information extraction, one typically builds a statistical model of words under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data. For example, in information extraction from web pages, word formatting may be indicative of extraction category in different ways on different web pages. The difficulty with using such features is capturing and exploiting the new regularities encountered in previously unseen data. In this paper, we propose a hierarchical probabilistic model that uses both local/scope-limited features, such as word formatting, and global features, such as word content. The local regularities are modeled as an unobserved random parameter which is drawn once for each local data set. This random parameter is estimated during the inference process and then used to perform classification with both the local and global features--- a procedure which is akin to automatically retuning the classifier to the local regularities on each newly encountered web page. Exact inference is intractable and we present approximations via point estimates and variational methods. Empirical results on large collections of web data demonstrate that this method significantly improves performance from traditional models of global features alone.

* Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI2002) 

Extracting Factual Min/Max Age Information from Clinical Trial Studies

Apr 05, 2019
Yufang Hou, Debasis Ganguly, Lea A. Deleris, Francesca Bonin

Population age information is an essential characteristic of clinical trials. In this paper, we focus on extracting minimum and maximum (min/max) age values for the study samples from clinical research articles. Specifically, we investigate the use of a neural network model for question answering to address this information extraction task. The min/max age QA model is trained on the massive structured clinical study records from For each article, based on multiple min and max age values extracted from the QA model, we predict both actual min/max age values for the study samples and filter out non-factual age expressions. Our system improves the results over (i) a passage retrieval based IE system and (ii) a CRF-based system by a large margin when evaluated on an annotated dataset consisting of 50 research papers on smoking cessation.

* 10 pages, accepted in ClinicalNLP2019 

Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction Mention Extraction

Sep 06, 2017
Shashank Gupta, Sachin Pawar, Nitin Ramrakhiyani, Girish Palshikar, Vasudeva Varma

Social media is an useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from twitter. Medical information extraction from social media is challenging, mainly due to short and highly information nature of text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction relies on supervised learning methods, which suffers from labeled data scarcity problem. The State-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN) which are Long-Short-Term-Memory networks (LSTMs) \cite{hochreiter1997long}. Deep neural networks, due to their large number of free parameters relies heavily on large annotated corpora for learning the end task. But in real-world, it is hard to get large labeled data, mainly due to heavy cost associated with manual annotation. Towards this end, we propose a novel semi-supervised learning based RNN model, which can leverage unlabeled data also present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction.

* Accepted at DTMBIO workshop, CIKM 2017. To appear in BMC Bioinformatics. Pls cite that version 

On-Device Information Extraction from SMS using Hybrid Hierarchical Classification

Feb 03, 2020
Shubham Vatsal, Naresh Purre, Sukumar Moharana, Gopi Ramena, Debi Prasanna Mohanty

Cluttering of SMS inbox is one of the serious problems that users today face in the digital world where every online login, transaction, along with promotions generate multiple SMS. This problem not only prevents users from searching and navigating messages efficiently but often results in users missing out the relevant information associated with the corresponding SMS like offer codes, payment reminders etc. In this paper, we propose a unique architecture to organize and extract the appropriate information from SMS and further display it in an intuitive template. In the proposed architecture, we use a Hybrid Hierarchical Long Short Term Memory (LSTM)-Convolutional Neural Network (CNN) to categorize SMS into multiple classes followed by a set of entity parsers used to extract the relevant information from the classified message. The architecture using its preprocessing techniques not only takes into account the enormous variations observed in SMS data but also makes it efficient for its on-device (mobile phone) functionalities in terms of inference timing and size.

* to be published in IEEE ICSC 2020 proceedings 

CDM: Combining Extraction and Generation for Definition Modeling

Nov 14, 2021
Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang

Definitions are essential for term understanding. Recently, there is an increasing interest in extracting and generating definitions of terms automatically. However, existing approaches for this task are either extractive or abstractive - definitions are either extracted from a corpus or generated by a language generation model. In this paper, we propose to combine extraction and generation for definition modeling: first extract self- and correlative definitional information of target terms from the Web and then generate the final definitions by incorporating the extracted definitional information. Experiments demonstrate our framework can generate high-quality definitions for technical terms and outperform state-of-the-art models for definition modeling significantly.


Joint Biomedical Entity and Relation Extraction with Knowledge-Enhanced Collective Inference

Jun 01, 2021
Tuan Lai, Heng Ji, ChengXiang Zhai, Quan Hung Tran

Compared to the general news domain, information extraction (IE) from biomedical text requires much broader domain knowledge. However, many previous IE methods do not utilize any external knowledge during inference. Due to the exponential growth of biomedical publications, models that do not go beyond their fixed set of parameters will likely fall behind. Inspired by how humans look up relevant information to comprehend a scientific text, we present a novel framework that utilizes external knowledge for joint entity and relation extraction named KECI (Knowledge-Enhanced Collective Inference). Given an input text, KECI first constructs an initial span graph representing its initial understanding of the text. It then uses an entity linker to form a knowledge graph containing relevant background knowledge for the the entity mentions in the text. To make the final predictions, KECI fuses the initial span graph and the knowledge graph into a more refined graph using an attention mechanism. KECI takes a collective approach to link mention spans to entities by integrating global relational information into local representations using graph convolutional networks. Our experimental results show that the framework is highly effective, achieving new state-of-the-art results in two different benchmark datasets: BioRelEx (binding interaction detection) and ADE (adverse drug event extraction). For example, KECI achieves absolute improvements of 4.59% and 4.91% in F1 scores over the state-of-the-art on the BioRelEx entity and relation extraction tasks.

* Accepted by ACL 2021 

Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction

Dec 17, 2021
Guangyan Chen, Meiling Wang, Yufeng Yue, Qingxiang Zhang, Li Yuan

Recent Transformer-based methods have achieved advanced performance in point cloud registration by utilizing advantages of the Transformer in order-invariance and modeling dependency to aggregate information. However, they still suffer from indistinct feature extraction, sensitivity to noise, and outliers. The reasons are: (1) the adoption of CNNs fails to model global relations due to their local receptive fields, resulting in extracted features susceptible to noise; (2) the shallow-wide architecture of Transformers and lack of positional encoding lead to indistinct feature extraction due to inefficient information interaction; (3) the omission of geometrical compatibility leads to inaccurate classification between inliers and outliers. To address above limitations, a novel full Transformer network for point cloud registration is proposed, named the Deep Interaction Transformer (DIT), which incorporates: (1) a Point Cloud Structure Extractor (PSE) to model global relations and retrieve structural information with Transformer encoders; (2) a deep-narrow Point Feature Transformer (PFT) to facilitate deep information interaction across two point clouds with positional encoding, such that Transformers can establish comprehensive associations and directly learn relative position between points; (3) a Geometric Matching-based Correspondence Confidence Evaluation (GMCCE) method to measure spatial consistency and estimate inlier confidence by designing the triangulated descriptor. Extensive experiments on clean, noisy, partially overlapping point cloud registration demonstrate that our method outperforms state-of-the-art methods.

* 10pages, 7figures 

Boosting Span-based Joint Entity and Relation Extraction via Squence Tagging Mechanism

May 21, 2021
Bin Ji, Shasha Li, Jie Yu, Jun Ma, Huijun Liu

Span-based joint extraction simultaneously conducts named entity recognition (NER) and relation extraction (RE) in text span form. Recent studies have shown that token labels can convey crucial task-specific information and enrich token semantics. However, as far as we know, due to completely abstain from sequence tagging mechanism, all prior span-based work fails to use token label in-formation. To solve this problem, we pro-pose Sequence Tagging enhanced Span-based Network (STSN), a span-based joint extrac-tion network that is enhanced by token BIO label information derived from sequence tag-ging based NER. By stacking multiple atten-tion layers in depth, we design a deep neu-ral architecture to build STSN, and each atten-tion layer consists of three basic attention units. The deep neural architecture first learns seman-tic representations for token labels and span-based joint extraction, and then constructs in-formation interactions between them, which also realizes bidirectional information interac-tions between span-based NER and RE. Fur-thermore, we extend the BIO tagging scheme to make STSN can extract overlapping en-tity. Experiments on three benchmark datasets show that our model consistently outperforms previous optimal models by a large margin, creating new state-of-the-art results.

* 10pages, 6 figures, 4 tables 

A Novel Biologically Mechanism-Based Visual Cognition Model--Automatic Extraction of Semantics, Formation of Integrated Concepts and Re-selection Features for Ambiguity

Mar 25, 2016
Peijie Yin, Hong Qiao, Wei Wu, Lu Qi, YinLin Li, Shanlin Zhong, Bo Zhang

Integration between biology and information science benefits both fields. Many related models have been proposed, such as computational visual cognition models, computational motor control models, integrations of both and so on. In general, the robustness and precision of recognition is one of the key problems for object recognition models. In this paper, inspired by features of human recognition process and their biological mechanisms, a new integrated and dynamic framework is proposed to mimic the semantic extraction, concept formation and feature re-selection in human visual processing. The main contributions of the proposed model are as follows: (1) Semantic feature extraction: Local semantic features are learnt from episodic features that are extracted from raw images through a deep neural network; (2) Integrated concept formation: Concepts are formed with local semantic information and structural information learnt through network. (3) Feature re-selection: When ambiguity is detected during recognition process, distinctive features according to the difference between ambiguous candidates are re-selected for recognition. Experimental results on hand-written digits and facial shape dataset show that, compared with other methods, the new proposed model exhibits higher robustness and precision for visual recognition, especially in the condition when input samples are smantic ambiguous. Meanwhile, the introduced biological mechanisms further strengthen the interaction between neuroscience and information science.


Text Extraction and Retrieval from Smartphone Screenshots: Building a Repository for Life in Media

Jan 04, 2018
Agnese Chiatti, Mu Jung Cho, Anupriya Gagneja, Xiao Yang, Miriam Brinberg, Katie Roehrick, Sagnik Ray Choudhury, Nilam Ram, Byron Reeves, C. Lee Giles

Daily engagement in life experiences is increasingly interwoven with mobile device use. Screen capture at the scale of seconds is being used in behavioral studies and to implement "just-in-time" health interventions. The increasing psychological breadth of digital information will continue to make the actual screens that people view a preferred if not required source of data about life experiences. Effective and efficient Information Extraction and Retrieval from digital screenshots is a crucial prerequisite to successful use of screen data. In this paper, we present the experimental workflow we exploited to: (i) pre-process a unique collection of screen captures, (ii) extract unstructured text embedded in the images, (iii) organize image text and metadata based on a structured schema, (iv) index the resulting document collection, and (v) allow for Image Retrieval through a dedicated vertical search engine application. The adopted procedure integrates different open source libraries for traditional image processing, Optical Character Recognition (OCR), and Image Retrieval. Our aim is to assess whether and how state-of-the-art methodologies can be applied to this novel data set. We show how combining OpenCV-based pre-processing modules with a Long short-term memory (LSTM) based release of Tesseract OCR, without ad hoc training, led to a 74% character-level accuracy of the extracted text. Further, we used the processed repository as baseline for a dedicated Image Retrieval system, for the immediate use and application for behavioral and prevention scientists. We discuss issues of Text Information Extraction and Retrieval that are particular to the screenshot image case and suggest important future work.