"Information Extraction": models, code, and papers

A Novel Framework to Expedite Systematic Reviews by Automatically Building Information Extraction Training Corpora

Jun 21, 2016
Tanmay Basu, Shraman Kumar, Abhishek Kalyan, Priyanka Jayaswal, Pawan Goyal, Stephen Pettifer, Siddhartha R. Jonnalagadda

A systematic review identifies and collates various clinical studies and compares data elements and results in order to provide an evidence-based answer for a particular clinical question. The process is manual and involves a lot of time. A tool to automate this process is lacking. The aim of this work is to develop a framework using natural language processing and machine learning to build information extraction algorithms to identify data elements in a new primary publication, without having to go through the expensive task of manual annotation to build gold standards for each data element type. The system is developed in two stages. Initially, it uses information contained in existing systematic reviews to identify the sentences from the PDF files of the included references that contain specific data elements of interest using a modified Jaccard similarity measure. These sentences have been treated as labeled data. A Support Vector Machine (SVM) classifier is trained on this labeled data to extract data elements of interest from a new article. We conducted experiments on Cochrane Database systematic reviews related to congestive heart failure using inclusion criteria as an example data element. The empirical results show that the proposed system automatically identifies sentences containing the data element of interest with a high recall (93.75%) and reasonable precision (27.05%, which means the reviewers have to read only 3.7 sentences on average). The empirical results suggest that the tool is retrieving valuable information from the reference articles, even when it is time-consuming to identify them manually. Thus we hope that the tool will be useful for automatic data extraction from biomedical research publications. The future scope of this work is to generalize this information framework for all types of systematic reviews.
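
The sentence-labeling step and the precision figure above can be illustrated with a minimal sketch. The abstract does not specify how the Jaccard measure is modified, so the code below uses plain Jaccard similarity as a stand-in; the tokenizer, the `threshold` value, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import re


def tokens(text):
    """Lowercase alphanumeric word tokens as a set (a simplifying assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Plain Jaccard similarity |A & B| / |A | B| between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def label_sentences(review_snippet, sentences, threshold=0.3):
    """Flag sentences whose similarity to the review text meets the threshold.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    ref = tokens(review_snippet)
    return [(s, jaccard(ref, tokens(s)) >= threshold) for s in sentences]


def sentences_per_hit(precision):
    """Average number of retrieved sentences read per relevant sentence."""
    return 1.0 / precision


labeled = label_sentences(
    "patients with congestive heart failure",
    [
        "Patients with congestive heart failure were included.",
        "Statistical analysis used SPSS.",
    ],
)
print(labeled)
print(round(sentences_per_hit(0.2705), 1))  # 3.7, matching the abstract
```

The last line shows where the "3.7 sentences" figure comes from: at 27.05% precision, roughly one in every 1 / 0.2705 retrieved sentences is relevant.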


Automatic extraction of requirements expressed in industrial standards: a way towards machine readable standards?

Dec 24, 2021
Helene de Ribaupierre, Anne-Francoise Cutting-Decelle, Nathalie Baumier, Serge Blumental

The project, under industrial funding, presented in this publication aims at the semantic analysis of a normative document describing requirements applicable to electrical appliances. The objective of the project is to build a semantic approach to extract and automatically process information related to the requirements contained in the standard. To this end, the project has been divided into three parts, covering the analysis of the requirements document, the extraction of relevant information and creation of the ontology and the comparison with other approaches. The first part of our work deals with the analysis of the requirements document under study. The study focuses on the specificity of the sentence structure, the use of particular words and vocabulary related to the representation of the requirements. The aim is to propose a representation facilitating the extraction of information, used in the second part of the study. In the second part, the extraction of relevant information is conducted in two ways: manual (the ontology being built by hand), semi-automatic (using semantic annotation software and natural language processing techniques). Whatever the method used, the aim of this extraction is to create the concept dictionary, then the ontology, enriched as the document is scanned and understood by the system. Once the relevant terms have been identified, the work focuses on identifying and representing the requirements, separating the textual writing from the information given in the tables. The automatic processing of requirements involves the extraction of sentences containing terms identified as relevant to a requirement. The identified requirement is then indexed and stored in a representation that can be used for query processing.


An Intellectual Property Entity Recognition Method Based on Transformer and Technological Word Information

Mar 21, 2022
Yuhui Wang, Junping Du, Yingxia Shao

Patent texts contain a large amount of entity information. Through named entity recognition, intellectual property entity information containing key information can be extracted from them, helping researchers to understand the patent content faster. However, it is difficult for existing named entity extraction methods to make full use of the semantic information at the word level brought about by professional vocabulary changes. This paper proposes a method for extracting intellectual property entities based on Transformer and technical word information, and provides accurate word vector representation in combination with the BERT language method. In the process of word vector generation, the technical word information extracted by IDCNN is added to improve the representation ability for intellectual property entities. Finally, the Transformer encoder that introduces relative position encoding is used to learn the deep semantic information of the text from the sequence of word vectors, and realize entity label prediction. Experimental results on public datasets and annotated patent datasets show that the method improves the accuracy of entity recognition.


IMoJIE: Iterative Memory-Based Joint Open Information Extraction

May 17, 2020
Keshav Kolluru, Samarth Aggarwal, Vipul Rathore, Mausam, Soumen Chakrabarti

While traditional systems for Open Information Extraction were statistical and rule-based, recently neural models have been introduced for the task. Our work builds upon CopyAttention, a sequence generation OpenIE model (Cui et al., 2018). Our analysis reveals that CopyAttention produces a constant number of extractions per sentence, and its extracted tuples often express redundant information. We present IMoJIE, an extension to CopyAttention, which produces the next extraction conditioned on all previously extracted tuples. This approach overcomes both shortcomings of CopyAttention, resulting in a variable number of diverse extractions per sentence. We train IMoJIE on training data bootstrapped from extractions of several non-neural systems, which have been automatically filtered to reduce redundancy and noise. IMoJIE outperforms CopyAttention by about 18 F1 pts, and a BERT-based strong baseline by 2 F1 pts, establishing a new state of the art for the task.

* ACL 2020, Long paper 

Use of 'off-the-shelf' information extraction algorithms in clinical informatics: a feasibility study of MetaMap annotation of Italian medical notes

Apr 02, 2021
Emma Chiaramello, Francesco Pinciroli, Alberico Bonalumi, Angelo Caroli, Gabriella Tognola

Information extraction from narrative clinical notes is useful for patient care, as well as for secondary use of medical data, for research or clinical purposes. Many studies focused on information extraction from English clinical texts, but fewer dealt with clinical notes in languages other than English. This study tested the feasibility of using 'off the shelf' information extraction algorithms to identify medical concepts from Italian clinical notes. We used MetaMap to map medical concepts to the Unified Medical Language System (UMLS). The study addressed two questions: (Q1) to understand if it would be possible to properly map medical terms found in clinical notes and related to the semantic group of 'Disorders' to the Italian UMLS resources; (Q2) to investigate if it would be feasible to use MetaMap as it is to extract these medical concepts from Italian clinical notes. Results in EXP1 showed that the Italian UMLS Metathesaurus sources covered 91% of the medical terms of the 'Disorders' semantic group, as found in the studied dataset. Even though MetaMap was built to analyze texts written in English, it also worked properly with texts written in Italian. MetaMap correctly identified about half of the concepts in the Italian clinical notes. Using MetaMap's annotation on Italian clinical notes instead of a simple text search improved our results by about 15 percentage points. MetaMap showed recall, precision and F-measure of 0.53, 0.98 and 0.69, respectively. Most of the failures were due to MetaMap's inability to generate meaningful Italian variants. MetaMap's performance in annotating automatically translated English clinical notes was in line with findings in the literature, with similar recall (0.75), F-measure (0.83) and even higher precision (0.95).

* Journal of biomedical informatics, Volume 63, October 2016, Pages 22-32 
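
As a quick check on the figures reported for the Italian notes, the F-measure can be recomputed from precision and recall. This sketch assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 0.98 and recall 0.53, as reported for the Italian clinical notes.
print(round(f_measure(0.98, 0.53), 2))  # 0.69
```

The high precision and low recall mean that MetaMap's annotations were almost always correct when produced, but about half of the concepts were missed.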

Knowledge-guided Open Attribute Value Extraction with Reinforcement Learning

Oct 19, 2020
Ye Liu, Sheng Zhang, Rui Song, Suo Feng, Yanghua Xiao

Open attribute value extraction for emerging entities is an important but challenging task. Many previous works formulate the problem as a question-answering (QA) task. While collections of articles from web corpora provide updated information about emerging entities, the retrieved texts can be noisy or irrelevant, thus leading to inaccurate answers. Effectively filtering out noisy articles as well as bad answers is the key to improving extraction accuracy. A knowledge graph (KG), which contains rich, well-organized information about entities, provides a good resource to address the challenge. In this work, we propose a knowledge-guided reinforcement learning (RL) framework for open attribute value extraction. Informed by relevant knowledge in the KG, we trained a deep Q-network to sequentially compare extracted answers to improve extraction accuracy. The proposed framework is applicable to different information extraction systems. Our experimental results show that our method outperforms the baselines by 16.5 - 27.8%.

* EMNLP 2020 

Approximate Grammar for Information Extraction

May 06, 2003
V. Sriram, B. Ravi Sekar Reddy, R. Sangal

In this paper, we present the concept of approximate grammar and how it can be used to extract information from a document. As the structure of informational strings cannot be defined well in a document, we cannot use conventional grammar rules to represent the information. Hence, the need arises to design an approximate grammar that can be used effectively to accomplish the task of information extraction. Approximate grammars are a novel step in this direction. The rules of an approximate grammar can be given by a user, or the machine can learn the rules from an annotated document. We have performed our experiments in both the above areas and the results have been impressive.

* Conference on Universal Knowledge and Language, Goa'2002 
* 10 pages, 3 figures, 2 tables, Presented at "International Conference on Universal Knowledge and Language, Goa'2002" 

An Olfactory EEG Signal Classification Network Based on Frequency Band Feature Extraction

Feb 05, 2022
Biao Sun, Zhigang Wei, Pei Liang, Huirang Hou

Classification of olfactory-induced electroencephalogram (EEG) signals has shown great potential in many fields. Since different frequency bands within the EEG signals contain different information, extracting specific frequency bands for classification performance is important. Moreover, due to the large inter-subject variability of the EEG signals, extracting frequency bands with subject-specific information rather than general information is crucial. Considering these, the focus of this letter is to classify the olfactory EEG signals by exploiting the spectral-domain information of specific frequency bands. In this letter, we present an olfactory EEG signal classification network based on frequency band feature extraction. A frequency band generator is first designed to extract frequency bands via the sliding window technique. Then, a frequency band attention mechanism is proposed to optimize frequency bands for a specific subject adaptively. Last, a convolutional neural network (CNN) is constructed to extract the spatio-spectral information and predict the EEG category. Comparison experiment results reveal that the proposed method outperforms a series of baseline methods in terms of both classification quality and inter-subject robustness. Ablation experiment results demonstrate the effectiveness of each component of the proposed method.
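
The sliding-window idea behind the frequency band generator can be sketched as follows. The abstract gives no window width, step, or frequency range, so every parameter and name here is an assumption for illustration; the attention mechanism and the CNN are not shown.

```python
def sliding_bands(f_min, f_max, width, step):
    """Return (low, high) bands in Hz produced by sliding a fixed-width window.

    Only windows that fit entirely inside [f_min, f_max] are kept.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    bands = []
    low = f_min
    while low + width <= f_max:
        bands.append((low, low + width))
        low += step
    return bands


# Hypothetical settings: 4 Hz wide windows moved in 2 Hz steps over 4-16 Hz.
print(sliding_bands(4, 16, 4, 2))
# [(4, 8), (6, 10), (8, 12), (10, 14), (12, 16)]
```

Overlapping windows give a subject-adaptive weighting step, such as the attention mechanism the abstract describes, several candidate bands to choose among instead of a few fixed ones.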


Business Document Information Extraction: Towards Practical Benchmarks

Jun 20, 2022
Matyáš Skalický, Štěpán Šimsa, Michal Uřičář, Milan Šulc

Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data.

* Accepted to CLEF 2022 

Improving Channel Decorrelation for Multi-Channel Target Speech Extraction

Jun 06, 2021
Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long

Target speech extraction has attracted widespread attention. When microphone arrays are available, the additional spatial information can be helpful in extracting the target speech. We have recently proposed a channel decorrelation (CD) mechanism to extract the inter-channel differential information to enhance the reference channel encoder representation. Although the proposed mechanism has shown promising results for extracting the target speech from mixtures, the extraction performance is still limited by the nature of the original decorrelation theory. In this paper, we propose two methods to broaden the horizon of the original channel decorrelation, by replacing the original softmax-based inter-channel similarity between encoder representations, using an unrolled probability and a normalized cosine-based similarity at the dimensional-level. Moreover, new combination strategies of the CD-based spatial information and target speaker adaptation of parallel encoder outputs are also investigated. Experiments on the reverberant WSJ0 2-mix show that the improved CD can result in more discriminative differential information and the new adaptation strategy is also very effective to improve the target speech extraction.

* accepted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2010.09191 