"Information Extraction": models, code, and papers

Schema-Driven Information Extraction from Heterogeneous Tables

May 23, 2023
Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Alan Ritter

In this paper, we explore whether large language models (LLMs) can support cost-efficient information extraction from complex tables. We introduce schema-driven information extraction, a new task that uses LLMs to transform tabular data into structured records following a human-authored schema. To assess various LLMs' capabilities on this task, we develop a benchmark composed of tables from three diverse domains: machine learning papers, chemistry tables, and webpages. Accompanying the benchmark, we present InstrucTE, a table extraction method based on instruction-tuned LLMs. The method requires only a human-constructed extraction schema and incorporates an error-recovery strategy. Notably, InstrucTE demonstrates competitive performance without task-specific labels, achieving F1 scores ranging from 72.3 to 95.7. Moreover, we validate the feasibility of distilling more compact table extraction models to minimize extraction costs and reduce API reliance. This study paves the way for the future development of instruction-following models for cost-efficient table extraction.
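
To make the schema-driven setup concrete, here is a minimal sketch of how a prompt might be assembled from a human-authored schema and a serialized table. The schema fields, table contents, and prompt wording are illustrative assumptions, not InstrucTE's actual prompt:

```python
schema = {
    "record_type": "MLResult",
    "attributes": ["model", "dataset", "metric", "value"],
}

table = [
    ["Model", "SQuAD (F1)", "CoNLL03 (F1)"],
    ["BERT-large", "93.2", "92.8"],
    ["T5-11B", "95.1", "93.1"],
]

def build_prompt(schema, table):
    """Serialize the table and ask an instruction-tuned LLM to emit one
    structured record per result cell, following the schema."""
    rows = "\n".join(" | ".join(cells) for cells in table)
    attrs = ", ".join(schema["attributes"])
    return (
        f"Extract one {schema['record_type']} record per result cell.\n"
        f"Each record must contain the attributes: {attrs}.\n"
        f"Table:\n{rows}\nRecords:"
    )

print(build_prompt(schema, table))
# The LLM response would then be parsed cell by cell; InstrucTE additionally
# applies an error-recovery step, re-querying cells whose predicted values
# cannot be aligned back to the table.
```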

PIVOINE: Instruction Tuning for Open-world Information Extraction

May 24, 2023
Keming Lu, Xiaoman Pan, Kaiqiang Song, Hongming Zhang, Dong Yu, Jianshu Chen

We consider the problem of Open-world Information Extraction (Open-world IE), which extracts comprehensive entity profiles from unstructured texts. Different from the conventional closed-world setting of Information Extraction (IE), Open-world IE considers a more general situation where entities and relations could be beyond a predefined ontology. More importantly, we seek to develop a large language model (LLM) that is able to perform Open-world IE to extract desirable entity profiles characterized by (possibly fine-grained) natural language instructions. We achieve this by finetuning LLMs using instruction tuning. In particular, we construct INSTRUCTOPENWIKI, a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions. We finetune the pretrained BLOOM models on INSTRUCTOPENWIKI and obtain PIVOINE, an LLM for Open-world IE with strong instruction-following capabilities. Our experiments demonstrate that PIVOINE significantly outperforms traditional closed-world methods and other LLM baselines, displaying impressive generalization capabilities on both unseen instructions and out-of-ontology cases. Consequently, PIVOINE emerges as a promising solution to tackle the open-world challenge in IE effectively.
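
As a rough illustration of what an instruction-tuning instance for Open-world IE might look like, consider the following hypothetical example; the field names and instruction wording are assumptions for illustration, not INSTRUCTOPENWIKI's released format:

```python
import json

# A hypothetical INSTRUCTOPENWIKI-style instance; every field name below is
# an assumption for illustration.
instance = {
    "instruction": "Extract an entity profile for every person mentioned, "
                   "including occupation and date of birth.",
    "input": "Marie Curie (born 7 November 1867) was a physicist and chemist.",
    "output": [
        {
            "entity": "Marie Curie",
            "type": "person",  # open-world: types need not come from a fixed ontology
            "occupation": ["physicist", "chemist"],
            "date_of_birth": "1867-11-07",
        }
    ],
}

# Instruction tuning maximizes the likelihood of the serialized output given
# the instruction and input, i.e., standard causal-LM finetuning of BLOOM.
print(json.dumps(instance, indent=2))
```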

Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction

May 23, 2023
Ji Qi, Chuchun Zhang, Xiaozhi Wang, Kaisheng Zeng, Jifan Yu, Jinxin Liu, Jiuding Sun, Yuxiang Chen, Lei Hou, Juanzi Li, Bin Xu

Robustness to distribution changes ensures that NLP models can be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks focus on validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions of the same knowledge can drift in many ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique consisting of sentences that convey the same structured knowledge but differ in syntactic and expressive form. We further define a robustness metric under which a model is judged robust only if its performance is consistently accurate across an entire clique. We perform experiments on representative models published over the last decade as well as a popular large language model; the results show that existing successful models exhibit frustrating degradation, with a maximum drop of 23.43 F1 points. Our resources and code will be publicly available.
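
As a sketch of how such a clique-level metric could be computed, the snippet below credits a model only with its worst score within each clique; this worst-case aggregation is one plausible reading of "consistently accurate across an entire clique," not necessarily the paper's exact formula:

```python
def clique_robust_f1(clique_scores):
    """clique_scores: one list of per-sentence F1 values per knowledge-
    invariant clique. A model is credited only with its worst score within
    each clique, then scores are averaged across cliques."""
    return sum(min(scores) for scores in clique_scores) / len(clique_scores)

# Three cliques, each holding paraphrases of the same underlying fact.
scores = [
    [0.90, 0.88, 0.52],  # one bad paraphrase drags down the whole clique
    [0.95, 0.93, 0.94],
    [0.80, 0.41, 0.78],
]
flat = [s for clique in scores for s in clique]
print(f"mean sentence F1: {sum(flat) / len(flat):.3f}")     # 0.790
print(f"clique-robust F1: {clique_robust_f1(scores):.3f}")  # 0.620
```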

Leveraging Open Information Extraction for Improving Few-Shot Trigger Detection Domain Transfer

May 23, 2023
David Dukić, Kiril Gashteovski, Goran Glavaš, Jan Šnajder

Event detection is a crucial information extraction task in many domains, such as Wikipedia or news. The task typically relies on trigger detection (TD) -- identifying token spans in the text that evoke specific events. While the notion of triggers should ideally be universal across domains, domain transfer for TD from high- to low-resource domains results in significant performance drops. We address the problem of negative transfer for TD by coupling triggers between domains using subject-object relations obtained from a rule-based open information extraction (OIE) system. We demonstrate that relations injected through multi-task training can act as mediators between triggers in different domains, enhancing zero- and few-shot TD domain transfer and reducing negative transfer, in particular when transferring from a high-resource source Wikipedia domain to a low-resource target news domain. Additionally, we combine the extracted relations with masked language modeling on the target domain and obtain further TD performance gains. Finally, we demonstrate that the results are robust to the choice of the OIE system.
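
A minimal sketch of the multi-task idea follows: a shared encoder feeds two token-level heads, one predicting trigger tags and one predicting OIE subject/object relation tags, and the two losses are summed. The architecture, dimensions, and equal loss weighting are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TriggerOIEModel(nn.Module):
    """Shared encoder with two token-level heads: trigger detection (TD)
    and OIE subject/object relation tagging."""

    def __init__(self, emb=300, hidden=768, n_trigger_tags=2, n_oie_tags=5):
        super().__init__()
        # BiLSTM as a stand-in for the transformer encoder used in practice.
        self.encoder = nn.LSTM(emb, hidden // 2, batch_first=True,
                               bidirectional=True)
        self.trigger_head = nn.Linear(hidden, n_trigger_tags)
        self.oie_head = nn.Linear(hidden, n_oie_tags)

    def forward(self, embeddings):
        states, _ = self.encoder(embeddings)
        return self.trigger_head(states), self.oie_head(states)

model = TriggerOIEModel()
x = torch.randn(4, 16, 300)  # a batch of 4 sentences, 16 embedded tokens each
trigger_logits, oie_logits = model(x)

loss_fn = nn.CrossEntropyLoss()
trigger_gold = torch.randint(2, (4 * 16,))  # dummy gold tags
oie_gold = torch.randint(5, (4 * 16,))
# Summing the two losses lets the OIE relations act as a domain-general
# mediator signal alongside trigger supervision.
loss = (loss_fn(trigger_logits.reshape(-1, 2), trigger_gold)
        + loss_fn(oie_logits.reshape(-1, 5), oie_gold))
loss.backward()
```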

InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

May 24, 2023
Ishani Mondal, Michelle Yuan, Anandhavelu N, Aparna Garimella, Francis Ferraro, Andrew Blair-Stanek, Benjamin Van Durme, Jordan Boyd-Graber

Learning template-based information extraction from documents is a crucial yet difficult task. Prior template-based IE approaches assume foreknowledge of the domain's templates; however, real-world IE rarely has pre-defined schemas, and templates must be discovered as you go. To quickly bootstrap templates in a real-world setting, we need to induce template slots from documents with zero or minimal supervision. Since the purpose of question answering intersects with the goal of information extraction, we use automatic question generation to induce template slots from documents and investigate how a small amount of on-the-fly proxy human supervision (termed InteractiveIE) can further boost performance. Extensive experiments on biomedical and legal documents, where obtaining training data is expensive, reveal encouraging trends of performance improvement using InteractiveIE over AI-only baselines.
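
The following toy sketch shows the question-driven slot induction loop with the question generator stubbed out; the slot-naming heuristic and example document are invented for illustration:

```python
def generate_questions(sentence):
    """Stand-in for an automatic question generation model."""
    return ["Who administered the drug?", "What dosage was given?"]

def question_to_slot(question):
    """Collapse a generated question into a candidate template slot name."""
    q = question.lower().strip("?")
    return (q.replace("who ", "agent_")
             .replace("what ", "value_")
             .replace(" ", "_"))

doc = "The physician administered 50mg of drug X."
slots = {question_to_slot(q) for q in generate_questions(doc)}
print(slots)  # e.g. {'agent_administered_the_drug', 'value_dosage_was_given'}

# InteractiveIE's addition: a human inspects a handful of induced slots on
# the fly (merging duplicates, renaming unclear ones), and these lightweight
# corrections steer subsequent slot induction.
```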

InstructIE: A Chinese Instruction-based Information Extraction Dataset

May 19, 2023
Honghao Gui, Jintian Zhang, Hongbin Ye, Ningyu Zhang

We introduce a new Information Extraction (IE) task dubbed Instruction-based IE, which requires the system to follow specific instructions or guidelines when extracting information. To facilitate research in this area, we construct a dataset called InstructIE, consisting of 270,000 weakly supervised instances from Chinese Wikipedia and 1,000 high-quality crowdsourced annotated instances. We further evaluate the performance of various baseline models on the InstructIE dataset. The results reveal that although current models exhibit promising performance, there is still room for improvement. Furthermore, we conduct a comprehensive case study analysis, underlining the challenges inherent in the Instruction-based IE task. Code and dataset are available at https://github.com/zjunlp/DeepKE/tree/main/example/llm.

* Work in progress 
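
To give a flavor of instruction-based IE data, here is a hypothetical InstructIE-style instance, rendered in English; the released dataset is in Chinese, and its exact field names and relation inventory may differ:

```python
import json

example = {
    "instruction": "Extract all (head, relation, tail) triples about "
                   "geographic containment from the text.",
    "text": "West Lake is a freshwater lake located in Hangzhou, Zhejiang.",
    "triples": [
        {"head": "West Lake", "relation": "located in", "tail": "Hangzhou"},
        {"head": "Hangzhou", "relation": "located in", "tail": "Zhejiang"},
    ],
}
print(json.dumps(example, indent=2, ensure_ascii=False))
```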

Easy-to-Hard Learning for Information Extraction

May 19, 2023
Chang Gao, Wenxuan Zhang, Wai Lam, Lidong Bing

Information extraction (IE) systems aim to automatically extract structured information, such as named entities, relations between entities, and events, from unstructured texts. While most existing work addresses a particular IE task, universally modeling various IE tasks with one model has recently achieved great success. Despite this success, such models employ a one-stage learning strategy, i.e., directly learning to extract the target structure given the input text, which contradicts the human learning process. In this paper, we propose a unified easy-to-hard learning framework consisting of three stages, i.e., the easy stage, the hard stage, and the main stage, for IE by mimicking the human learning process. By breaking down the learning process into multiple stages, our framework helps the model acquire general IE task knowledge and improves its generalization ability. Extensive experiments across four IE tasks demonstrate the effectiveness of our framework. We achieve new state-of-the-art results on 13 out of 17 datasets. Our code is available at https://github.com/DAMO-NLP-SG/IE-E2H.

* Findings of ACL 2023 
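
The staged schedule can be pictured with a toy training loop like the one below; how the easy and hard examples are constructed here (single-element targets vs. full structures) is our assumption about the stages, and the model is a stub:

```python
class StubModel:
    """Stand-in for the seq2seq IE model; actual training is omitted."""
    def step(self, source, target):
        pass

def train_stage(model, examples, stage):
    for source, target in examples:
        model.step(source, target)
    print(f"finished {stage} stage on {len(examples)} example(s)")

sentence = "Barack Obama was born in Hawaii."
easy = [(sentence, "entity: Barack Obama")]                 # one skill at a time
hard = [(sentence, "entity: Barack Obama | relation: born_in "
                   "| entity: Hawaii")]                     # composed skills
main = hard                                                 # the target task

model = StubModel()
for stage, data in [("easy", easy), ("hard", hard), ("main", main)]:
    train_stage(model, data, stage)
```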

WebIE: Faithful and Robust Information Extraction on the Web

May 23, 2023
Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, Christos Christodoulopoulos, Andrea Pierleoni

Extracting structured and grounded fact triples from raw text is a fundamental task in Information Extraction (IE). Existing IE datasets are typically collected from Wikipedia articles, using hyperlinks to link entities to the Wikidata knowledge base. However, models trained only on Wikipedia have limitations when applied to web domains, which often contain noisy text or text without any factual information. We present WebIE, the first large-scale, entity-linked closed IE dataset, consisting of 1.6M sentences automatically collected from the English Common Crawl corpus. WebIE also includes negative examples, i.e., sentences without fact triples, to better reflect the data found on the web. We annotate ~25K triples from WebIE through crowdsourcing and introduce mWebIE, a translation of the annotated set into four other languages: French, Spanish, Portuguese, and Hindi. We evaluate the in-domain, out-of-domain, and zero-shot cross-lingual performance of generative IE models and find that models trained on WebIE show better generalisability. We also propose three training strategies that use entity linking as an auxiliary task. Our experiments show that adding entity-linking objectives improves the faithfulness of our generative IE models.

* ACL 2023 Main Conference 
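
To illustrate the dataset design, here are two hypothetical WebIE-style instances, one positive sentence with an entity-linked triple and one negative sentence without factual content; the field names are illustrative, not the released schema:

```python
import json

positive = {
    "sentence": "Berlin is the capital of Germany.",
    "triples": [{
        "subject": {"surface": "Berlin", "wikidata_id": "Q64"},
        "relation": "capital of",
        "object": {"surface": "Germany", "wikidata_id": "Q183"},
    }],
}
negative = {
    "sentence": "Click here to subscribe to our weekly newsletter!",
    "triples": [],  # negative example: web text carrying no fact triple
}
print(json.dumps([positive, negative], indent=2))
```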

Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution

May 12, 2023
Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Yu Zhou, Xiang Bai

Visual information extraction (VIE), which aims to perform OCR and information extraction simultaneously in a unified framework, has drawn increasing attention due to its essential role in applications such as understanding receipts, goods, and traffic signs. However, because existing benchmark datasets for VIE consist mainly of document images without adequate diversity in layout structures, background disturbances, and entity categories, they cannot fully reveal the challenges of real-world applications. In this paper, we propose a large-scale dataset of camera images for VIE, which contains not only greater variance in layout, backgrounds, and fonts but also many more entity types. Besides, we propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion. Unlike previous end-to-end approaches that directly adopt OCR features as the input of an information extraction module, we propose to use contrastive learning to narrow the semantic gap caused by the difference between the OCR and information extraction tasks. We evaluate existing end-to-end methods for VIE on the proposed dataset and observe that their performance drops noticeably from SROIE (a widely used English dataset) to our proposed dataset due to the greater variance in layout and entities. These results demonstrate that our dataset is more practical for promoting advanced VIE algorithms. In addition, experiments demonstrate that the proposed VIE method consistently achieves significant performance gains on both the proposed and SROIE datasets.

* 15 pages, 6 figures, ICDAR2023 
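
A minimal sketch of the contrastive bridging idea is given below: each OCR-branch feature is pulled toward the text-encoder feature of the same token and pushed away from others via an InfoNCE-style loss. The encoders, dimensions, and temperature are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def ocr_text_contrastive_loss(ocr_feats, text_feats, temperature=0.07):
    """InfoNCE over N aligned token pairs: the i-th OCR feature should be
    most similar to the i-th text feature and dissimilar to the rest."""
    ocr = F.normalize(ocr_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = ocr @ txt.T / temperature       # (N, N) cosine-similarity matrix
    targets = torch.arange(len(ocr))         # matching pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

ocr_feats = torch.randn(32, 256, requires_grad=True)  # from the OCR branch
text_feats = torch.randn(32, 256)                     # from a semantic text encoder
loss = ocr_text_contrastive_loss(ocr_feats, text_feats)
loss.backward()
```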

Teamwork Is Not Always Good: An Empirical Study of Classifier Drift in Class-incremental Information Extraction

May 26, 2023
Minqian Liu, Lifu Huang

Class-incremental learning (CIL) aims to develop a learning system that can continually learn new classes from a data stream without forgetting previously learned classes. When learning classes incrementally, the classifier must be constantly updated to incorporate new classes, and the drift in the decision boundary may lead to severe forgetting. This fundamental challenge, however, has not yet been studied extensively, especially in the setting where no samples from old classes are stored for rehearsal. In this paper, we take a closer look at how classifier drift leads to forgetting, and accordingly design four simple yet (super-) effective solutions to alleviate it: an Individual Classifiers with Frozen Feature Extractor (ICE) framework, in which we train a separate classifier for each learning session, and its three variants ICE-PL, ICE-O, and ICE-PL&O, which further take the logits of previously learned classes from old sessions or a constant logit of an "Other" class as a constraint on the learning of new classifiers. Extensive experiments and analysis on 6 class-incremental information extraction tasks demonstrate that our solutions, especially ICE-O, consistently show significant improvement over the previous state-of-the-art approaches, with up to 44.7% absolute F-score gain, providing a strong baseline and insights for future research on class-incremental learning.

* ACL'23 (Findings). 15 pages, 3 figures, 7 tables
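
A compact sketch of the ICE-O recipe follows: freeze a shared feature extractor, train one small head per session over that session's new classes plus an "Other" logit, and concatenate the heads (minus their "Other" logits) at inference. The sizes are illustrative, and the per-session training itself is omitted:

```python
import torch
import torch.nn as nn

feature_extractor = nn.Linear(128, 64)  # stand-in encoder, frozen after session 1
for p in feature_extractor.parameters():
    p.requires_grad = False

session_heads = []

def train_session(n_new_classes):
    """One head per session: the session's new classes plus one 'Other'
    logit that absorbs everything outside the session."""
    head = nn.Linear(64, n_new_classes + 1)
    # ... train `head` on this session's data only (omitted) ...
    session_heads.append(head)

def predict(x):
    feats = feature_extractor(x)
    # Drop each head's trailing 'Other' logit, then take the best class
    # across all sessions; no rehearsal data from old classes is needed.
    logits = torch.cat([h(feats)[:, :-1] for h in session_heads], dim=-1)
    return logits.argmax(dim=-1)

train_session(3)  # session 1: global class ids 0-2
train_session(2)  # session 2: global class ids 3-4
print(predict(torch.randn(5, 128)))
```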