Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Information Extraction": models, code, and papers

TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

Jul 14, 2022
Zhanzhan Cheng, Peng Zhang, Can Li, Qiao Liang, Yunlu Xu, Pengfei Li, Shiliang Pu, Yi Niu, Fei Wu

Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two subparts: the text reading part for obtaining the plain text from the original document images and the information extraction part for extracting key contents. These methods mainly focus on improving the second, while neglecting that the two parts are highly correlated. This paper proposes a unified end-to-end information extraction framework from visually rich documents, where text reading and information extraction can reinforce each other via a well-designed multi-modal context block. Specifically, the text reading part provides multi-modal features like visual, textual and layout features. The multi-modal context block is developed to fuse the generated multi-modal features and even the prior knowledge from the pre-trained language model for better semantic representation. The information extraction part is responsible for generating key contents with the fused context features. The framework can be trained in an end-to-end trainable manner, achieving global optimization. What is more, we define and group visually rich documents into four categories across two dimensions, the layout and text type. For each document category, we provide or recommend the corresponding benchmarks, experimental settings and strong baselines for remedying the problem that this research area lacks the uniform evaluation standard. Extensive experiments on four kinds of benchmarks (from fixed layout to variable layout, from full-structured text to semi-unstructured text) are reported, demonstrating the proposed method's effectiveness. Data, source code and models are available.


CitationIE: Leveraging the Citation Graph for Scientific Information Extraction

Jun 03, 2021
Vijay Viswanathan, Graham Neubig, Pengfei Liu

Automatically extracting key information from scientific documents has the potential to help scientists work more efficiently and accelerate the pace of scientific progress. Prior work has considered extracting document-level entity clusters and relations end-to-end from raw scientific text, which can improve literature search and help identify methods and materials for a given problem. Despite the importance of this task, most existing works on scientific information extraction (SciIE) consider extraction solely based on the content of an individual paper, without considering the paper's place in the broader literature. In contrast to prior work, we augment our text representations by leveraging a complementary source of document context: the citation graph of referential links between citing and cited papers. On a test set of English-language scientific documents, we show that simple ways of utilizing the structure and content of the citation graph can each lead to significant gains in different scientific information extraction tasks. When these tasks are combined, we observe a sizable improvement in end-to-end information extraction over the state-of-the-art, suggesting the potential for future work along this direction. We release software tools to facilitate citation-aware SciIE development.

* ACL-IJCNLP 2021 camera-ready (long paper in main conference) 

Open Information Extraction from Question-Answer Pairs

Apr 06, 2019
Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Halevy, H. V. Jagadish

Open Information Extraction (OpenIE) extracts meaningful structured tuples from free-form text. Most previous work on OpenIE considers extracting data from one sentence at a time. We describe NeurON, a system for extracting tuples from question-answer pairs. Since real questions and answers often contain precisely the information that users care about, such information is particularly desirable to extend a knowledge base with. NeurON addresses several challenges. First, an answer text is often hard to understand without knowing the question, and second, relevant information can span multiple sentences. To address these, NeurON formulates extraction as a multi-source sequence-to-sequence learning task, wherein it combines distributed representations of a question and an answer to generate knowledge facts. We describe experiments on two real-world datasets that demonstrate that NeurON can find a significant number of new and interesting facts to extend a knowledge base compared to state-of-the-art OpenIE methods.

* NAACL 2019 

Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks

Oct 14, 2020
Yue Wang, Zhuo Xu, Lu Bai, Yao Wan, Lixin Cui, Qian Zhao, Edwin R. Hancock, Philip S. Yu

Joint-event-extraction, which extracts structural information (i.e., entities or triggers of events) from unstructured real-world corpora, has attracted more and more research attention in natural language processing. Most existing works do not fully address the sparse co-occurrence relationships between entities and triggers, which loses this important information and thus deteriorates the extraction performance. To mitigate this issue, we first define the joint-event-extraction as a sequence-to-sequence labeling task with a tag set composed of tags of triggers and entities. Then, to incorporate the missing information in the aforementioned co-occurrence relationships, we propose a Cross-Supervised Mechanism (CSM) to alternately supervise the extraction of either triggers or entities based on the type distribution of each other. Moreover, since the connected entities and triggers naturally form a heterogeneous information network (HIN), we leverage the latent pattern along meta-paths for a given corpus to further improve the performance of our proposed method. To verify the effectiveness of our proposed method, we conduct extensive experiments on four real-world datasets as well as compare our method with state-of-the-art methods. Empirical results and analysis show that our approach outperforms the state-of-the-art methods in both entity and trigger extraction.

* Accepted by ICPR 2020 

DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction

Mar 10, 2021
Freddy C. Chua, Nigel P. Duffy

We combine deep learning and Conditional Probabilistic Context Free Grammars (CPCFG) to create an end-to-end system for extracting structured information from complex documents. For each class of documents, we create a CPCFG that describes the structure of the information to be extracted. Conditional probabilities are modeled by deep neural networks. We use this grammar to parse 2-D documents to directly produce structured records containing the extracted information. This system is trained end-to-end with (Document, Record) pairs. We apply this approach to extract information from scanned invoices achieving state-of-the-art results.


Doc2Dict: Information Extraction as Text Generation

May 16, 2021
Benjamin Townsend, Eamon Ito-Fisher, Lily Zhang, Madison May

Typically, information extraction (IE) requires a pipeline approach: first, a sequence labeling model is trained on manually annotated documents to extract relevant spans; then, when a new document arrives, a model predicts spans which are then post-processed and standardized to convert the information into a database entry. We replace this labor-intensive workflow with a transformer language model trained on existing database records to directly generate structured JSON. Our solution removes the workload associated with producing token-level annotations and takes advantage of a data source which is generally quite plentiful (e.g. database records). As long documents are common in information extraction tasks, we use gradient checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single GPU. Our Doc2Dict approach is competitive with more complex, hand-engineered pipelines and offers a simple but effective baseline for document-level information extraction. We release our Doc2Dict model and code to reproduce our experiments and facilitate future work.


On the Effects of Low-Quality Training Data on Information Extraction from Clinical Reports

Mar 04, 2015
Diego Marcheggiani, Fabrizio Sebastiani

In the last five years there has been a flurry of work on information extraction from clinical documents, i.e., on algorithms capable of extracting, from the informal and unstructured texts that are generated during everyday clinical practice, mentions of concepts relevant to such practice. Most of this literature is about methods based on supervised learning, i.e., methods for training an information extraction system from manually annotated examples. While a lot of work has been devoted to devising learning methods that generate more and more accurate information extractors, no work has been devoted to investigating the effect of the quality of training data on the learning process. Low quality in training data often derives from the fact that the person who has annotated the data is different from the one against whose judgment the automatically annotated data must be evaluated. In this paper we test the impact of such data quality issues on the accuracy of information extraction systems as applied to the clinical domain. We do this by comparing the accuracy deriving from training data annotated by the authoritative coder (i.e., the one who has also annotated the test data, and by whose judgment we must abide), with the accuracy deriving from training data annotated by a different coder. The results indicate that, although the disagreement between the two coders (as measured on the training set) is substantial, the difference is (surprisingly enough) not always statistically significant.

* Submitted for publication 

Monitoring Energy Trends through Automatic Information Extraction

Jan 05, 2022
Dilek Küçük

Energy research is of crucial public importance but the use of computer science technologies like automatic text processing and data management for the energy domain is still rare. Employing these technologies in the energy domain will be a significant contribution to the interdisciplinary topic of ``energy informatics", just like the related progress within the interdisciplinary area of ``bioinformatics". In this paper, we present the architecture of a Web-based semantic system called EneMonIE (Energy Monitoring through Information Extraction) for monitoring up-to-date energy trends through the use of automatic, continuous, and guided information extraction from diverse types of media available on the Web. The types of media handled by the system will include online news articles, social media texts, online news videos, and open-access scholarly papers and technical reports as well as various numeric energy data made publicly available by energy organizations. The system will utilize and contribute to the energy-related ontologies and its ultimate form will comprise components for (i) text categorization, (ii) named entity recognition, (iii) temporal expression extraction, (iv) event extraction, (v) social network construction, (vi) sentiment analysis, (vii) information fusion and summarization, (viii) media interlinking, and (ix) Web-based information retrieval and visualization. Wits its diverse data sources, automatic text processing capabilities, and presentation facilities open for public use; EneMonIE will be an important source of distilled and concise information for decision-makers including energy generation, transmission, and distribution system operators, energy research centres, related investors and entrepreneurs as well as for academicians, students, other individuals interested in the pace of energy events and technologies.

* 5 pages 

Extraction of Protein Sequence Motif Information using PSO K-Means

Apr 09, 2015
R. Gowri, R. Rathipriya

The main objective of the paper is to find the motif information.The functionalities of the proteins are ideally found from their motif information which is extracted using various techniques like clustering with k-means, hybrid k-means, self-organising maps, etc., in the literature. In this work protein sequence information is extracted using optimised k-means algorithm. The particle swarm optimisation technique is one of the frequently used optimisation method. In the current work the PSO k-means is used for motif information extraction. This paper also deals with the comparison between the motif information obtained from clusters and biclustersusing PSO k-means algorithm. The motif information acquired is based on the structure homogeneity of the protein sequence.