Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Information Extraction": models, code, and papers

Wrapper Maintenance: A Machine Learning Approach

Jun 24, 2011
C. A. Knoblock, K. Lerman, S. N. Minton

The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.

* Journal Of Artificial Intelligence Research, Volume 18, pages 149-181, 2003 

FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Jun 10, 2021
Rami Aly, Zhijiang Guo, Michael Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal

Fact verification has attracted a lot of attention in the machine learning and natural language processing communities, as it is one of the key methods for detecting misinformation. Existing large-scale benchmarks for this task have focused mostly on textual sources, i.e. unstructured information, and thus ignored the wealth of information available in structured formats, such as tables. In this paper we introduce a novel dataset and benchmark, Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict. Furthermore, we detail our efforts to track and minimize the biases present in the dataset and could be exploited by models, e.g. being able to predict the label without using evidence. Finally, we develop a baseline for verifying claims against text and tables which predicts both the correct evidence and verdict for 18% of the claims.


Probability Map Guided Bi-directional Recurrent UNet for Pancreas Segmentation

Apr 07, 2019
Jun Li, Xiaozhu Lin, Hui Che, Hao Li, Xiaohua Qian

Pancreatic cancer is one of the most lethal cancers as morbidity approximates mortality. A method for accurately segmenting the pancreas can assist doctors in the diagnosis and treatment of pancreatic cancer, but huge differences in shape and volume bring difficulties in segmentation. Among the current widely used approaches, the 2D method ignores the spatial information, and the 3D model is limited by high resource consumption and GPU memory occupancy. To address these issues, we propose a bi-directional recurrent UNet based on probabilistic map guidance (PBR-UNet). PBR-UNet includes a feature extraction module for extracting pixel-level probabilistic maps and a bi-directional recurrent module for fine segmentation. The extracted probabilistic map will be used to guide the fine segmentation and bi-directional recurrent module integrates contextual information into the entire network to avoid the loss of spatial information in propagation. By combining the probabilistic map of the adjacent slices with the bi-directional recurrent segmentation of intermediary slice, this paper solves the problem that the 2D network loses three-dimensional information and the 3D model leads to large computational resource consumption. We used Dice similarity coefficients (DSC) to evaluate our approach on NIH pancreatic datasets and eventually achieved a competitive result of 83.35%.

* Under review 

CorefDRE: Document-level Relation Extraction with coreference resolution

Feb 22, 2022
Zhongxuan Xue, Rongzhen Li, Qizhu Dai, Zhong Jiang

Document-level relation extraction is to extract relation facts from a document consisting of multiple sentences, in which pronoun crossed sentences are a ubiquitous phenomenon against a single sentence. However, most of the previous works focus more on mentions coreference resolution except for pronouns, and rarely pay attention to mention-pronoun coreference and capturing the relations. To represent multi-sentence features by pronouns, we imitate the reading process of humans by leveraging coreference information when dynamically constructing a heterogeneous graph to enhance semantic information. Since the pronoun is notoriously ambiguous in the graph, a mention-pronoun coreference resolution is introduced to calculate the affinity between pronouns and corresponding mentions, and the noise suppression mechanism is proposed to reduce the noise caused by pronouns. Experiments on the public dataset, DocRED, DialogRE and MPDD, show that Coref-aware Doc-level Relation Extraction based on Graph Inference Network outperforms the state-of-the-art.


Unsupervised Word and Dependency Path Embeddings for Aspect Term Extraction

May 25, 2016
Yichun Yin, Furu Wei, Li Dong, Kaimeng Xu, Ming Zhang, Ming Zhou

In this paper, we develop a novel approach to aspect term extraction based on unsupervised learning of distributed representations of words and dependency paths. The basic idea is to connect two words (w1 and w2) with the dependency path (r) between them in the embedding space. Specifically, our method optimizes the objective w1 + r = w2 in the low-dimensional space, where the multi-hop dependency paths are treated as a sequence of grammatical relations and modeled by a recurrent neural network. Then, we design the embedding features that consider linear context and dependency context information, for the conditional random field (CRF) based aspect term extraction. Experimental results on the SemEval datasets show that, (1) with only embedding features, we can achieve state-of-the-art results; (2) our embedding method which incorporates the syntactic information among words yields better performance than other representative ones in aspect term extraction.

* IJCAI 2016 

CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information

Feb 01, 2019
Shikhar Vashishth, Prince Jain, Partha Talukdar

Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manuallydefined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI) - a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness.

* International World Wide Web Conferences Steering Committee 2018 
* Accepted at WWW 2018 

BERE: An accurate distantly supervised biomedical entity relation extraction network

Jun 22, 2019
Lixiang Hong, JinJian Lin, Jiang Tao, Jianyang Zeng

Automated entity relation extraction (RE) from literature provides an important source for constructing biomedical database, which is more efficient and extensible than manual curation. However, existing RE models usually ignore the information contained in sentence structures and target entities. In this paper, we propose BERE, a deep learning based model which uses Gumbel Tree-GRU to learn sentence structures and joint embedding to incorporate entity information. It also employs word-level attention for improved relation extraction and sentence-level attention to suit the distantly supervised dataset. Because the existing dataset are relatively small, we further construct a much larger drug-target interaction extraction (DTIE) dataset by distant supervision. Experiments conducted on both DDIExtraction 2013 task and DTIE dataset show our model's effectiveness over state-of-the-art baselines in terms of F1 measures and PR curves.

* My tutor told me to withdraw this paper at once 

Per-run Algorithm Selection with Warm-starting using Trajectory-based Features

Apr 20, 2022
Ana Kostovska, Anja Jankovic, Diederick Vermetten, Jacob de Nobel, Hao Wang, Tome Eftimov, Carola Doerr

Per-instance algorithm selection seeks to recommend, for a given problem instance and a given performance criterion, one or several suitable algorithms that are expected to perform well for the particular setting. The selection is classically done offline, using openly available information about the problem instance or features that are extracted from the instance during a dedicated feature extraction step. This ignores valuable information that the algorithms accumulate during the optimization process. In this work, we propose an alternative, online algorithm selection scheme which we coin per-run algorithm selection. In our approach, we start the optimization with a default algorithm, and, after a certain number of iterations, extract instance features from the observed trajectory of this initial optimizer to determine whether to switch to another optimizer. We test this approach using the CMA-ES as the default solver, and a portfolio of six different optimizers as potential algorithms to switch to. In contrast to other recent work on online per-run algorithm selection, we warm-start the second optimizer using information accumulated during the first optimization phase. We show that our approach outperforms static per-instance algorithm selection. We also compare two different feature extraction principles, based on exploratory landscape analysis and time series analysis of the internal state variables of the CMA-ES, respectively. We show that a combination of both feature sets provides the most accurate recommendations for our test cases, taken from the BBOB function suite from the COCO platform and the YABBOB suite from the Nevergrad platform.


Estimating Visual Information From Audio Through Manifold Learning

Aug 03, 2022
Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, Kwang Moo Yi

We propose a new framework for extracting visual information about a scene only using audio signals. Audio-based methods can overcome some of the limitations of vision-based methods i.e., they do not require "line-of-sight", are robust to occlusions and changes in illumination, and can function as a backup in case vision/lidar sensors fail. Therefore, audio-based methods can be useful even for applications in which only visual information is of interest Our framework is based on Manifold Learning and consists of two steps. First, we train a Vector-Quantized Variational Auto-Encoder to learn the data manifold of the particular visual modality we are interested in. Second, we train an Audio Transformation network to map multi-channel audio signals to the latent representation of the corresponding visual sample. We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset. In particular, we consider the prediction of the following visual modalities from audio: depth and semantic segmentation. We hope the findings of our work can facilitate further research in visual information extraction from audio. Code is available at: