
"Information Extraction": models, code, and papers

Learning from Noisy Labels for Entity-Centric Information Extraction

Apr 17, 2021
Wenxuan Zhou, Muhao Chen

Recent efforts for information extraction have relied on many deep neural models. However, such models can easily overfit noisy labels and suffer from performance degradation. While it is very costly to filter noisy labels in large learning resources, recent studies show that such labels take more training steps to be memorized and are more frequently forgotten than clean labels, and are therefore identifiable during training. Motivated by these properties, we propose a simple co-regularization framework for entity-centric information extraction, which consists of several neural models with different parameter initializations. These models are jointly optimized with a task-specific loss and are regularized to generate similar predictions through an agreement loss, which prevents overfitting to noisy labels. At inference time, any of the trained models can be used. Extensive experiments on two widely used but noisy benchmarks for information extraction, TACRED and CoNLL03, demonstrate the effectiveness of our framework.
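The core of the described framework is the agreement loss: each model is penalized for diverging from the ensemble's average prediction. A minimal sketch of one plausible form of such a loss (mean KL divergence from the mean distribution; the paper's exact formulation may differ):

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def agreement_loss(model_logits):
    """Mean KL divergence of each model's prediction from the ensemble mean.

    model_logits: one logit vector per model for the same instance.
    Returns 0.0 when all models agree exactly, and grows as they diverge.
    """
    probs = [softmax(l) for l in model_logits]
    k = len(probs[0])
    mean = [sum(p[j] for p in probs) / len(probs) for j in range(k)]
    kl = 0.0
    for p in probs:
        kl += sum(pj * math.log(pj / mj) for pj, mj in zip(p, mean) if pj > 0)
    return kl / len(probs)
```

Adding this term to each model's task loss discourages any single model from memorizing a noisy label that the others do not fit.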


Using Neighborhood Context to Improve Information Extraction from Visual Documents Captured on Mobile Phones

Aug 23, 2021
Kalpa Gunaratna, Vijay Srinivasan, Sandeep Nama, Hongxia Jin

Information extraction from visual documents enables convenient and intelligent assistance to end users. We present a Neighborhood-based Information Extraction (NIE) approach that uses contextual language models and attends to the local neighborhood context in a visual document to improve information extraction accuracy. We collect two different visual document datasets and show that our approach outperforms the state-of-the-art global context-based IE technique. In fact, NIE outperforms existing approaches at both small and large model sizes. Our on-device implementation of NIE on a mobile platform, which generally requires small models, showcases NIE's usefulness in practical real-world applications.

* accepted at CIKM 2021, pre-print version 
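The neighborhood idea can be illustrated with a toy selection of the spatially closest OCR'd words around a candidate field. This is only a sketch of the concept, not the authors' model; the word/bounding-box representation is an assumption:

```python
import math

def neighbors(words, target_idx, k=3):
    """Return the k words spatially closest to the target word.

    words: list of (text, (x0, y0, x1, y1)) pairs from an OCR'd document.
    The selected neighbors would serve as local context for a model
    deciding what the target word means (e.g. a price next to "Total").
    """
    def center(box):
        x0, y0, x1, y1 = box
        return ((x0 + x1) / 2, (y0 + y1) / 2)

    tx, ty = center(words[target_idx][1])
    scored = []
    for i, (text, box) in enumerate(words):
        if i == target_idx:
            continue
        cx, cy = center(box)
        scored.append((math.hypot(cx - tx, cy - ty), text))
    scored.sort()
    return [text for _, text in scored[:k]]
```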

IoT Virtualization with ML-based Information Extraction

Jun 10, 2021
Martin Bauer

For IoT to reach its full potential, the sharing and reuse of information in different applications and across verticals is of paramount importance. However, there is a plethora of IoT platforms using different representations, protocols and interaction patterns. To address this issue, the Fed4IoT project has developed an IoT virtualization platform that, on the one hand, integrates information from many different source platforms and, on the other hand, makes the information required by the respective users available in the target platform of choice. To enable this, information is translated into a common, neutral exchange format. The format of choice is NGSI-LD, which is being standardized by the ETSI Industry Specification Group on Context Information Management (ETSI ISG CIM). ThingVisors are the components that translate the source information to NGSI-LD, which is then delivered to the target platform and translated into the target format. ThingVisors can be implemented by hand, but this requires significant human effort, especially considering the heterogeneity of low-level information produced by a multitude of sensors. Thus, supporting the human developer and, ideally, fully automating the process of extracting and enriching data and translating it to NGSI-LD is a crucial step. Machine learning is a promising approach for this, but it typically requires large amounts of hand-labelled data for training, an effort that makes it unrealistic in many IoT scenarios. A programmatic labelling approach called knowledge infusion, which encodes expert knowledge, is used to match a schema or ontology extracted from the data with a target schema or ontology, providing the basis for annotating the data and facilitating the translation to NGSI-LD.
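To make the translation step concrete, here is a minimal sketch of what a ThingVisor-style mapping into an NGSI-LD entity might look like. The source field names (`sensor_id`, `value`, `timestamp`) and the entity type are hypothetical; the entity shape (`id`/`type` plus `Property` attributes) follows the NGSI-LD convention:

```python
def to_ngsi_ld(reading):
    """Translate a hypothetical source-platform temperature reading into
    an NGSI-LD entity dictionary (serializable as JSON-LD)."""
    return {
        "id": f"urn:ngsi-ld:Sensor:{reading['sensor_id']}",
        "type": "TemperatureSensor",
        "temperature": {
            "type": "Property",
            "value": reading["value"],
            "unitCode": "CEL",
            "observedAt": reading["timestamp"],
        },
        "@context": [
            "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"
        ],
    }
```

Automating the discovery of such field mappings (rather than writing them by hand, as above) is exactly what the knowledge-infusion approach targets.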


Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

Jan 14, 2022
Ramon Pires, Fábio C. de Souza, Guilherme Rosa, Roberto A. Lotufo, Rodrigo Nogueira

A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction from legal and registration documents. We finetune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is a viable alternative to classical pipelines.
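One simple way to ground a generated value back in the source document is fuzzy substring matching. The sketch below uses the standard library's `difflib` for this; it illustrates the auditing idea only and is not the paper's alignment method:

```python
from difflib import SequenceMatcher

def align_span(source, extracted):
    """Locate the source span that best matches a generated value, so an
    auditor can see where in the document the model "read" it.

    Returns the matched source text and its character offset.
    """
    matcher = SequenceMatcher(None, source.lower(), extracted.lower(),
                              autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(extracted))
    return source[match.a:match.a + match.size], match.a
```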


Hierarchical RNN for Information Extraction from Lawsuit Documents

Apr 25, 2018
Xi Rao, Zhenxing Ke

Every lawsuit document contains information about the party's claims, the court's analysis, the decision, and more, and all of this information is helpful for understanding the case better and predicting the judge's decision on similar cases in the future. However, extracting this information from the document is difficult because the language is complicated and sentences vary greatly in length. We treat this problem as a sequence labeling task, and this paper presents the first research to extract relevant information from civil lawsuit documents in China with a hierarchical RNN framework.

* IMECS2018 

Musical Information Extraction from the Singing Voice

Apr 07, 2022
Preeti Rao

Music information retrieval is currently an active research area that addresses the extraction of musically important information from audio signals, and the applications of such information. The extracted information can be used for search and retrieval of music in recommendation systems, or to aid musicological studies or even in music learning. Sophisticated signal processing techniques are applied to convert low-level acoustic signal properties to musical attributes which are further embedded in a rule-based or statistical classification framework to link with high-level descriptions such as melody, genre, mood and artist type. Vocal music comprises a large and interesting category of music where the lead instrument is the singing voice. The singing voice is more versatile than many musical instruments and therefore poses interesting challenges to information retrieval systems. In this paper, we provide a brief overview of research in vocal music processing followed by a description of related work at IIT Bombay leading to the development of an interface for melody detection of singing voice in polyphony.
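A classic example of converting a low-level signal property into a musical attribute, as described above, is pitch detection. The sketch below estimates the fundamental frequency of a monophonic signal by autocorrelation; real melody extraction from polyphonic singing is far harder, so this only illustrates the signal-to-attribute step:

```python
import math

def estimate_f0(signal, sr):
    """Estimate the fundamental frequency (Hz) of a monophonic signal by
    picking the autocorrelation lag with the strongest self-similarity.

    signal: list of samples; sr: sample rate in Hz.
    """
    n = len(signal)
    best_lag, best_corr = 0, 0.0
    for lag in range(20, n // 2):  # skip tiny lags (implausibly high pitch)
        corr = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sr / best_lag if best_lag else 0.0
```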


Dimension Reduction by Mutual Information Feature Extraction

Jul 14, 2012
Ali Shadvar

During the past decades, researchers have proposed many feature extraction algorithms to study high-dimensional data in a wide variety of problems. One of the most effective approaches to optimal feature extraction is based on mutual information (MI). However, it is not always easy to obtain an accurate estimate of high-dimensional MI. In terms of MI, optimal feature extraction creates a feature set from the data that jointly has the largest dependency on the target class and minimum redundancy. In this paper, a component-by-component gradient ascent method based on one-dimensional MI estimates is proposed for feature extraction. We refer to this algorithm as Mutual Information Feature Extraction (MIFX). The performance of the proposed method is evaluated on UCI databases. The results indicate that MIFX provides robust performance across different data sets, being almost always the best or comparable to the best.

* International Journal of Computer Science & Information Technology (IJCSIT). arXiv admin note: substantial text overlap with arXiv:1206.2058 
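The building block of such methods is the one-dimensional MI estimate itself. A minimal plug-in estimator for two discrete (e.g. histogram-binned) variables, shown only to make the quantity concrete; MIFX's gradient-ascent construction of features is not reproduced here:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two discrete variables,
    computed from joint and marginal empirical frequencies.
    """
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi
```

For identical variables this returns the entropy (log 2 for a balanced binary variable); for independent variables it returns 0.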

Annotating Social Determinants of Health Using Active Learning, and Characterizing Determinants Using Neural Event Extraction

Apr 11, 2020
Kevin Lybarger, Mari Ostendorf, Meliha Yetisgen

Social determinants of health (SDOH) affect health outcomes, and knowledge of SDOH can inform clinical decision-making. Automatically extracting SDOH information from clinical text requires data-driven information extraction models trained on annotated corpora that are heterogeneous and frequently include critical SDOH. This work presents a new corpus with SDOH annotations, a novel active learning framework, and the first extraction results on the new corpus. The Social History Annotation Corpus (SHAC) includes 4,480 social history sections with detailed annotation for 12 SDOH characterizing the status, extent, and temporal information of 18K distinct events. We introduce a novel active learning framework that selects samples for annotation using a surrogate text classification task as a proxy for a more complex event extraction task. The active learning framework successfully increases the frequency of health risk factors and improves automatic detection of these events over undirected annotation. An event extraction model trained on SHAC achieves high extraction performance for substance use status (0.82-0.93 F1), employment status (0.81-0.86 F1), and living status type (0.81-0.93 F1) on data from three institutions.

* 29 pages, 14 figures, 4 tables 
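The selection step of an active learning loop like the one described can be sketched with generic margin-based uncertainty sampling over the surrogate classifier's outputs. The function below is an illustrative stand-in; the paper's actual selection strategy may differ:

```python
def select_for_annotation(pool, probs, budget):
    """Pick the `budget` pool samples whose surrogate-classifier
    predictions are least confident, by smallest top-2 probability margin.

    pool: list of sample ids; probs: per-sample class-probability lists.
    """
    def margin(p):
        top2 = sorted(p, reverse=True)[:2]
        return top2[0] - top2[1]  # small margin = high uncertainty

    ranked = sorted(zip(pool, probs), key=lambda sp: margin(sp[1]))
    return [s for s, _ in ranked[:budget]]
```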

Informative Causality Extraction from Medical Literature via Dependency-tree based Patterns

Mar 13, 2022
Md. Ahsanul Kabir, AlJohara Almulhim, Xiao Luo, Mohammad Al Hasan

Extracting cause-effect entities from medical literature is an important task in medical information retrieval. A solution to this task can be used to compile various causality relations, such as causality between diseases and symptoms, between medications and side effects, or between genes and diseases. Existing solutions for extracting cause-effect entities work well for sentences where the cause and effect phrases are named entities, single-word nouns, or noun phrases of two to three words. Unfortunately, in medical literature, cause and effect phrases in a sentence are not simply nouns or noun phrases; rather, they are complex phrases consisting of several words, and existing methods fail to correctly extract the cause and effect entities in such sentences. Partial extraction of cause and effect entities conveys poor-quality, uninformative, and often contradictory facts compared to those intended in the given sentence. In this work, we solve this problem by designing an unsupervised method for cause and effect phrase extraction, PatternCausality, which is specifically suited to the medical literature. Our proposed approach first uses a collection of cause-effect dependency patterns as templates to extract the head words of cause and effect phrases, and then uses a novel phrase extraction method to obtain complete and meaningful cause and effect phrases from a sentence. Experiments on a cause-effect dataset built from sentences from PubMed articles show that for extracting cause and effect entities, PatternCausality is substantially better than existing methods, with an order-of-magnitude improvement in F-score over the best of them.

* Journal of Healthcare Informatics Research 2022 
* 22 pages without comment 
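The "dependency pattern as template" idea can be illustrated with a toy matcher over precomputed dependency triples: a causative verb's `nsubj` yields the cause head word and its `obj` the effect head. This is a deliberately simplified illustration, not the PatternCausality system, and the hard part the paper addresses (growing head words into full phrases) is omitted:

```python
def extract_cause_effect(tokens, deps):
    """Match a single toy pattern: for a causative verb ("causes"/"caused"),
    take its nsubj dependent as the cause head and its obj dependent as
    the effect head.

    tokens: list of words; deps: per-token (head_index, relation) pairs.
    Returns (cause_head, effect_head) or None if the pattern is absent.
    """
    for i, word in enumerate(tokens):
        if word.lower() in {"causes", "caused"}:
            cause = effect = None
            for j, (head, rel) in enumerate(deps):
                if head == i and rel == "nsubj":
                    cause = tokens[j]
                if head == i and rel == "obj":
                    effect = tokens[j]
            if cause and effect:
                return cause, effect
    return None
```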

Multimodal Learning on Graphs for Disease Relation Extraction

Mar 16, 2022
Yucong Lin, Keming Lu, Sheng Yu, Tianxi Cai, Marinka Zitnik

Objective: Disease knowledge graphs are a way to connect, organize, and access disparate information about diseases, with numerous benefits for artificial intelligence (AI). To create knowledge graphs, it is necessary to extract knowledge from multimodal datasets in the form of relationships between disease concepts and to normalize both concepts and relationship types. Methods: We introduce REMAP, a multimodal approach for disease relation extraction and classification. The REMAP machine learning approach jointly embeds a partial, incomplete knowledge graph and a medical language dataset into a compact latent vector space, then aligns the multimodal embeddings for optimal disease relation extraction. Results: We apply the REMAP approach to a disease knowledge graph with 96,913 relations and a text dataset of 1.24 million sentences. On a dataset annotated by human experts, REMAP improves text-based disease relation extraction by 10.0% (accuracy) and 17.2% (F1-score) by fusing disease knowledge graphs with text information. Further, REMAP leverages text information to recommend new relationships in the knowledge graph, outperforming graph-based methods by 8.4% (accuracy) and 10.4% (F1-score). Discussion: Systematized knowledge is becoming the backbone of AI, creating opportunities to inject semantics into AI and fully integrate it into machine learning algorithms. While prior semantic knowledge can assist in extracting disease relationships from text, existing methods cannot fully leverage multimodal datasets. Conclusion: REMAP is a multimodal approach for extracting and classifying disease relationships by fusing structured knowledge and text information. REMAP provides a flexible neural architecture to easily find, access, and validate AI-driven relationships between disease concepts.
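To show how embeddings can rank candidate relations at all, here is the classic translational (TransE-style) triple score, where a relation is modeled as a vector offset between entity embeddings. This is a generic illustration of knowledge-graph embedding scoring, not REMAP's actual model:

```python
import math

def transe_score(head, rel, tail):
    """Score a (head, relation, tail) triple as -||head + rel - tail||:
    a smaller translation error means a more plausible triple."""
    dist = math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(head, rel, tail)))
    return -dist

def rank_tails(head, rel, candidates):
    """Rank candidate (name, embedding) tail entities for (head, rel),
    most plausible first."""
    return sorted(candidates,
                  key=lambda nv: -transe_score(head, rel, nv[1]))
```

In a fused system, text-derived and graph-derived embeddings would share (or be aligned into) one such space before scoring.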