
"Information Extraction": models, code, and papers

Sentimental Content Analysis and Knowledge Extraction from News Articles

Aug 09, 2018
Mohammad Kamel, Neda Keyvani, Hadi Sadoghi Yazdi

In the web era, technology has revolutionized daily life, and vast amounts of data and information are published on the Internet every day. For instance, news agencies all over the world publish news on their websites. These raw data can be an important resource for knowledge extraction. The shared data carry emotions (i.e., positive, neutral, or negative) toward various topics; therefore, sentimental content extraction can be beneficial in many respects. Extracting the sentiment of news reveals highly valuable information about events over a period of time and about the viewpoint of a media outlet or news agency on those events. In this paper, we propose an approach for analyzing news and extracting useful knowledge from it. First, we extract a noise-robust sentiment for news documents; to this end, news associated with six countries (United States, United Kingdom, Germany, Canada, France, and Australia) in five news categories (Politics, Sports, Business, Entertainment, and Technology) is downloaded. We then compare the condition of the different countries in each of the five news topics based on the sentiments and emotional content extracted from the news documents. Moreover, we propose an approach to reduce the bulky news data and extract the hottest topics and news titles as knowledge. Finally, we build a word model with Word2Vec that maps each word to a fixed-size vector in order to capture the relations between words in our collected news database.
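The per-country, per-category comparison the abstract describes boils down to tallying sentiment labels over two grouping keys. A minimal sketch of that aggregation step, assuming hypothetical dictionary keys `country`, `category`, and `sentiment` for each article record (the paper's actual pipeline and data schema are not given here):

```python
from collections import Counter, defaultdict

def aggregate_sentiment(articles):
    """Tally sentiment labels per (country, category) pair.

    `articles` is a list of dicts with hypothetical keys 'country',
    'category', and 'sentiment' ('positive', 'neutral', or 'negative').
    """
    tallies = defaultdict(Counter)
    for a in articles:
        tallies[(a["country"], a["category"])][a["sentiment"]] += 1
    return tallies

articles = [
    {"country": "Germany", "category": "Sports", "sentiment": "positive"},
    {"country": "Germany", "category": "Sports", "sentiment": "negative"},
    {"country": "France", "category": "Politics", "sentiment": "neutral"},
]
counts = aggregate_sentiment(articles)
```

With tallies like `counts[("Germany", "Sports")]`, the relative shares of positive, neutral, and negative labels can then be compared across countries within each topic.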


Time Difference on Arrival Extraction from Two-Way Ranging

Apr 12, 2022
Patrick Rathje, Olaf Landsiedel

Two-Way Ranging enables distance estimation between two active parties and allows time-of-flight measurements despite relative clock offset and drift. Limited by the number of messages, scalable solutions build on Time Difference on Arrival to infer timing information at passive listeners. However, the demand for accurate distance estimates dictates a tight bound on time synchronization, thus limiting scalability to the localization of passive tags relative to static, synchronized anchors. This work describes the extraction of Time Difference on Arrival information from a Two-Way Ranging process, enabling the extraction of distance information at passive listeners and thus allowing scalable tag localization without the need for static or synchronized anchors. The expected error is formally derived. The extension allows the extraction of the timing difference despite relative clock offset and drift for Double-Sided Two-Way Ranging, and for Single-Sided Two-Way Ranging with additional carrier frequency offset estimation.
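For context, the textbook single-sided Two-Way Ranging exchange that this work extends computes time of flight from four timestamps, two per clock, so that a constant clock offset cancels. A minimal sketch of that standard formula (not the paper's TDoA extraction or its error derivation):

```python
def ss_twr_tof(t1, t4, t2, t3):
    """Textbook single-sided TWR time of flight.

    The initiator sends at t1 and receives the reply at t4 (its own
    clock); the responder receives at t2 and replies at t3 (its own
    clock). Each difference is taken within one clock, so a constant
    offset between the two clocks cancels.
    """
    round_trip = t4 - t1    # measured on the initiator's clock
    reply_delay = t3 - t2   # measured on the responder's clock
    return (round_trip - reply_delay) / 2.0

C = 299_792_458.0  # speed of light in m/s
tof = ss_twr_tof(t1=0.0, t4=2.001e-6, t2=5.0e-7, t3=2.5e-6)
distance = tof * C  # roughly 0.15 m for these illustrative timestamps
```

Relative clock drift is what this cancellation does not remove, which is why the double-sided variant and carrier-frequency-offset estimation mentioned in the abstract matter in practice.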


Open Information Extraction

Jul 10, 2016
Duc-Thuan Vo, Ebrahim Bagheri

Open Information Extraction (Open IE) systems aim to obtain relation tuples through highly scalable extraction that is portable across domains, by identifying a variety of relation phrases and their arguments in arbitrary sentences. The first generation of Open IE learns linear-chain models based on unlexicalized features such as Part-of-Speech (POS) or shallow tags, labeling the intermediate words between pairs of potential arguments to identify extractable relations. The current, second generation of Open IE is able to extract instances of the most frequently observed relation types, such as Verb, Noun and Prep, Verb and Prep, and Infinitive, using deep linguistic analysis. These systems expose simple yet principled ways in which verbs express relationships in language, such as verb-phrase-based extraction or clause-based extraction, and they achieve significantly higher performance than the first-generation systems. In this paper, we give an overview of the two Open IE generations, including their strengths, weaknesses, and application areas.
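The idea of labeling the words between two candidate arguments can be illustrated with a toy pattern over (token, POS) pairs: a noun, followed by a run of verbs and prepositions, followed by a noun, yields an (arg1, relation, arg2) tuple. This is only a hand-written stand-in for the shallow-feature sequence models the first generation actually learns:

```python
def extract_triples(tagged):
    """Toy verb-mediated Open IE over (token, POS) pairs: emit
    (arg1, relation, arg2) when a noun is linked to a later noun by
    a contiguous run of verbs (VERB) and prepositions (ADP)."""
    triples = []
    for i, (tok, pos) in enumerate(tagged):
        if pos != "NOUN":
            continue
        j = i + 1
        rel = []
        while j < len(tagged) and tagged[j][1] in ("VERB", "ADP"):
            rel.append(tagged[j][0])
            j += 1
        if rel and j < len(tagged) and tagged[j][1] == "NOUN":
            triples.append((tok, " ".join(rel), tagged[j][0]))
    return triples

sent = [("Einstein", "NOUN"), ("was", "VERB"), ("born", "VERB"),
        ("in", "ADP"), ("Ulm", "NOUN")]
triples = extract_triples(sent)  # [("Einstein", "was born in", "Ulm")]
```

Second-generation systems replace such shallow patterns with clause- and verb-phrase-level analysis of a full parse, which is what buys the accuracy gains the abstract mentions.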

* This paper will appear in the Encyclopedia for Semantic Computing 

Feature Extraction Framework based on Contrastive Learning with Adaptive Positive and Negative Samples

Jan 11, 2022
Hongjie Zhang

In this study, we propose a feature extraction framework based on contrastive learning with adaptive positive and negative samples (CL-FEFA) that is suitable for unsupervised, supervised, and semi-supervised single-view feature extraction. CL-FEFA adaptively constructs the positive and negative samples from the results of feature extraction itself, which makes them more appropriate and accurate. The discriminative features are then re-extracted according to an InfoNCE loss based on the previous positive and negative samples, which makes intra-class samples more compact and inter-class samples more dispersed. At the same time, using the potential structure information of subspace samples to dynamically construct positive and negative samples makes our framework more robust to noisy data. Furthermore, CL-FEFA considers the mutual information between positive samples, that is, between similar samples in the potential structure, which provides theoretical support for its advantages in feature extraction. Numerical experiments show that the proposed framework has a strong advantage over traditional feature extraction methods and contrastive learning methods.
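The InfoNCE loss the framework optimizes scores one positive against a set of negatives under a temperature-scaled softmax. A generic single-anchor sketch with cosine similarity (the adaptive sample construction that distinguishes CL-FEFA is not shown here):

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: negative log of the positive's
    softmax score against positive plus negatives, using cosine
    similarity scaled by temperature tau."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(a * a for a in v))
        return dot / (nu * nv)

    pos = math.exp(cos(anchor, positive) / tau)
    neg = sum(math.exp(cos(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# A well-aligned positive and dissimilar negatives give a small loss.
loss = info_nce([1.0, 0.0], [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
```

Minimizing this loss pulls the anchor toward its positive and pushes it away from the negatives, which is exactly the intra-class compactness and inter-class dispersion the abstract describes.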


Zero-shot Learning for Relation Extraction

Nov 13, 2020
Jiaying Gong, Hoda Eldardiry

Most existing supervised and few-shot relation extraction methods rely on labeled training data. However, in real-world scenarios there are many relations for which no training data is available. We address this issue from the perspective of zero-shot learning (ZSL), which resembles the way humans learn and recognize new concepts without prior exposure. We propose a zero-shot learning relation extraction (ZSLRE) framework that focuses on recognizing novel relations for which no labeled data is available for training. The ZSLRE model recognizes new relations using prototypical networks that are modified to utilize side (auxiliary) information. This side information allows the modified prototypical networks to recognize novel relations in addition to previously known relations. We construct side information from labels and their synonyms, hypernyms of named entities, and keywords, and we build an automatic hypernym extraction framework to obtain hypernyms of various named entities directly from the web. Extensive experiments on two public datasets (NYT and FewRel) demonstrate that our proposed model significantly outperforms state-of-the-art methods on supervised, few-shot, and zero-shot learning tasks. Our results also demonstrate the effectiveness and robustness of the model in a combined scenario. Once accepted for publication, we will publish ZSLRE's source code and datasets to enable reproducibility and encourage further research.
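At their core, prototypical networks average the embeddings of each class's support examples into a prototype and classify a query by its nearest prototype. A bare-bones sketch of that mechanism with made-up 2-D embeddings (ZSLRE's contribution, building prototypes from side information for unseen relations, is on top of this):

```python
import math

def prototypes(support):
    """Class prototype = elementwise mean of that class's support vectors."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[d] for v in vecs) / len(vecs)
                         for d in range(dim)]
    return protos

def classify(query, protos):
    """Assign the query to the nearest prototype (Euclidean distance)."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(protos, key=lambda lbl: dist(query, protos[lbl]))

support = {
    "capitalOf": [[1.0, 0.0], [0.8, 0.2]],
    "bornIn":    [[0.0, 1.0], [0.1, 0.9]],
}
protos = prototypes(support)
label = classify([0.9, 0.1], protos)  # nearest prototype wins
```

For a truly novel relation there are no support embeddings at all, which is why the side information (labels, synonyms, hypernyms, keywords) is needed to stand in for them.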

* 11 pages, 7 figures, submitted to WWW 2021 

REMOD: Relation Extraction for Modeling Online Discourse

Feb 22, 2021
Matthew Sumpter, Giovanni Luca Ciampaglia

The enormous amount of discourse taking place online poses challenges to the functioning of a civil and informed public sphere. Efforts to standardize online discourse data, such as ClaimReview, are making available a wealth of new data about potentially inaccurate claims, reviewed by third-party fact-checkers. These data could help shed light on the nature of online discourse, the role of political elites in amplifying it, and its implications for the integrity of the online information ecosystem. Unfortunately, the semi-structured nature of much of this data presents significant challenges when it comes to modeling and reasoning about online discourse. A key challenge is relation extraction, which is the task of determining the semantic relationships between named entities in a claim. Here we develop a novel supervised learning method for relation extraction that combines graph embedding techniques with path traversal on semantic dependency graphs. Our approach is based on the intuitive observation that knowledge of the entities along the path between the subject and object of a triple (e.g., Washington,_D.C. and United_States_of_America) provides useful information that can be leveraged for extracting its semantic relation (i.e., capitalOf). As an example of a potential application of this technique for modeling online discourse, we show that our method can be integrated into a pipeline to reason about potential misinformation claims.
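The path-traversal intuition can be illustrated with a plain breadth-first search for the node path between the subject and object in a dependency graph given as adjacency lists. This toy graph and BFS are illustrative only; REMOD's actual pipeline embeds such paths with graph embedding techniques:

```python
from collections import deque

def entity_path(graph, source, target):
    """Breadth-first search for the shortest node path between two
    entities in a dependency graph given as adjacency lists."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical dependency graph for "Washington D.C. is the capital
# of the United States".
graph = {
    "Washington_D.C.": ["capital"],
    "capital": ["Washington_D.C.", "is", "United_States"],
    "is": ["capital"],
    "United_States": ["capital"],
}
path = entity_path(graph, "Washington_D.C.", "United_States")
```

The intermediate nodes on such a path ("capital" here) are precisely the signal the method exploits to predict the relation label (capitalOf).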

* 11 pages, 5 figures 

A Library Perspective on Nearly-Unsupervised Information Extraction Workflows in Digital Libraries

May 02, 2022
Hermann Kroll, Jan Pirklbauer, Florian Plötzky, Wolf-Tilo Balke

Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper tackles the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), pharmacy, and political sciences. We report on opportunities and limitations, and finally discuss best practices for unsupervised extraction workflows.

* Accepted at JCDL2022, 11 pages, 1 figure 

Closing the Gap: Joint De-Identification and Concept Extraction in the Clinical Domain

May 19, 2020
Lukas Lange, Heike Adel, Jannik Strötgen

Exploiting natural language processing in the clinical domain requires de-identification, i.e., anonymization of personal information in texts. However, current research considers de-identification and downstream tasks, such as concept extraction, only in isolation and does not study the effects of de-identification on other tasks. In this paper, we close this gap by reporting concept extraction performance on automatically anonymized data and investigating joint models for de-identification and concept extraction. In particular, we propose a stacked model with restricted access to privacy-sensitive information and a multitask model. We set the new state of the art on benchmark datasets in English (96.1% F1 for de-identification and 88.9% F1 for concept extraction) and Spanish (91.4% F1 for concept extraction).
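The stacked setting, where concept extraction sees only anonymized input, can be pictured as a two-stage pipeline: mask privacy-sensitive tokens first, then extract concepts from the masked text. The following is a deliberately simplistic lexicon-based toy with hypothetical term lists; the paper's models are neural sequence labelers, not lookups:

```python
def de_identify(tokens, phi_terms):
    """Stage 1: replace privacy-sensitive tokens with a placeholder."""
    return ["[MASK]" if t in phi_terms else t for t in tokens]

def extract_concepts(tokens, lexicon):
    """Stage 2: tag tokens found in a (hypothetical) clinical lexicon."""
    return [(t, lexicon[t]) for t in tokens if t in lexicon]

tokens = ["John", "Smith", "was", "given", "aspirin", "for", "angina"]
phi = {"John", "Smith"}                         # hypothetical PHI list
lexicon = {"aspirin": "DRUG", "angina": "PROBLEM"}

anonymized = de_identify(tokens, phi)
concepts = extract_concepts(anonymized, lexicon)
```

The point of the paper's stacked model is exactly this restriction: the downstream extractor never observes the raw identifiers, so the study measures how much that costs in concept extraction performance.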

* ACL 2020 

Biographical: A Semi-Supervised Relation Extraction Dataset

May 02, 2022
Alistair Plum, Tharindu Ranasinghe, Spencer Jones, Constantin Orasan, Ruslan Mitkov

Extracting biographical information from online documents is a popular research topic in the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction (RE) are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.
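The alignment step follows the familiar distant-supervision idea: a sentence is labeled with a relation when both entities of a structured fact occur in it. A minimal sketch with invented example data, ignoring the Wikipedia-structure and NER filters the authors use to keep precision high:

```python
def align(sentences, facts):
    """Distant-supervision alignment: pair a sentence with a
    (subject, relation, object) fact when both entity strings
    occur in the sentence."""
    pairs = []
    for sent in sentences:
        for subj, rel, obj in facts:
            if subj in sent and obj in sent:
                pairs.append((sent, subj, rel, obj))
    return pairs

sentences = ["Ada Lovelace was born in London in 1815."]
facts = [("Ada Lovelace", "birthplace", "London"),
         ("Ada Lovelace", "occupation", "mathematician")]
pairs = align(sentences, facts)  # only the birthplace fact aligns
```

Naive string matching like this is noisy (a sentence may mention both entities without expressing the relation), which is exactly why the dataset pipeline layers article structure and robust NER on top of it.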

* Accepted to ACM SIGIR 2022 

A Span Extraction Approach for Information Extraction on Visually-Rich Documents

Jun 02, 2021
Tuan-Anh D. Nguyen, Hieu M. Vu, Nguyen Hong Son, Minh-Tien Nguyen

Information extraction (IE) from visually-rich documents (VRDs) has recently achieved state-of-the-art performance thanks to the adaptation of Transformer-based language models, which demonstrates the great potential of pre-training methods. In this paper, we present a new approach to improve the capability of language model pre-training on VRDs. Firstly, we introduce a new IE model that is query-based and employs the span extraction formulation instead of the commonly used sequence labelling approach. Secondly, to further extend the span extraction formulation, we propose a new training task which focuses on modelling the relationships between semantic entities within a document. This task enables spans to be extracted recursively and can be used both as a pre-training objective and as an IE downstream task. Evaluation on various datasets of popular business documents (invoices, receipts) shows that our proposed method improves the performance of existing models significantly, while providing a mechanism to accumulate model knowledge from multiple downstream IE tasks.
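In a span extraction formulation, a model scores each token as a potential span start and span end for a given query, and decoding picks the best valid (start, end) pair. A sketch of that standard decoding step with invented token scores (the paper's model, which produces such scores from a pre-trained Transformer, is not reproduced here):

```python
def best_span(start_scores, end_scores, max_len=8):
    """Pick the (start, end) pair maximizing start + end score,
    subject to start <= end and a bounded span length - the usual
    decoding step for a query-based span extraction head."""
    best, best_score = None, float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical scores for the query "total amount" over an invoice line.
tokens = ["Invoice", "total", ":", "$", "42.00"]
start = [0.1, 0.2, 0.0, 2.0, 1.0]
end   = [0.0, 0.1, 0.0, 0.5, 3.0]

span = best_span(start, end)
answer = tokens[span[0]: span[1] + 1]
```

Unlike sequence labelling, this formulation naturally ties each extraction to a query, which is what allows the recursive extraction of related entity spans described in the abstract.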