A recent state-of-the-art neural open information extraction (OpenIE) system generates extractions iteratively, requiring repeated encoding of partial outputs. This comes at a significant computational cost. On the other hand, sequence labeling approaches for OpenIE are much faster, but worse in extraction quality. In this paper, we bridge this trade-off by presenting an iterative labeling-based system that establishes a new state of the art for OpenIE, while extracting 10x faster. This is achieved through a novel Iterative Grid Labeling (IGL) architecture, which treats OpenIE as a 2-D grid labeling task. We improve its performance further by applying coverage (soft) constraints on the grid at training time. Moreover, on observing that the best OpenIE systems falter at handling coordination structures, our OpenIE system also incorporates a new coordination analyzer built with the same IGL architecture. This IGL based coordination analyzer helps our OpenIE system handle complicated coordination structures, while also establishing a new state of the art on the task of coordination analysis, with a 12.3 pts improvement in F1 over previous analyzers. Our OpenIE system, OpenIE6, beats the previous systems by as much as 4 pts in F1, while being much faster.
Open Information Extraction (OIE) is the task of extracting facts from sentences in the form of relations and their corresponding arguments in schema-free manner. Intrinsic performance of OIE systems is difficult to measure due to the incompleteness of existing OIE benchmarks: the ground truth extractions do not group all acceptable surface realizations of the same fact that can be extracted from a sentence. To measure performance of OIE systems more realistically, it is necessary to manually annotate complete facts (i.e., clusters of all acceptable surface realizations of the same fact) from input sentences. We propose AnnIE: an interactive annotation platform that facilitates such challenging annotation tasks and supports creation of complete fact-oriented OIE evaluation benchmarks. AnnIE is modular and flexible in order to support different use case scenarios (i.e., benchmarks covering different types of facts). We use AnnIE to build two complete OIE benchmarks: one with verb-mediated facts and another with facts encompassing named entities. Finally, we evaluate several OIE systems on our complete benchmarks created with AnnIE. Our results suggest that existing incomplete benchmarks are overly lenient, and that OIE systems are not as robust as previously reported. We publicly release AnnIE under non-restrictive license.
One drawback of evolutionary multiobjective optimization algorithms (EMOA) has traditionally been high computational cost to create an approximation of the Pareto front: number of required objective function evaluations usually grows high. On the other hand, for the decision maker (DM) it may be difficult to select one of the many produced solutions as the final one, especially in the case of more than two objectives. To overcome the above mentioned drawbacks number of EMOA's incorporating the decision makers preference information have been proposed. In this case, it is possible to save objective function evaluations by generating only the part of the front the DM is interested in, thus also narrowing down the pool of possible selections for the final solution. Unfortunately, most of the current EMO approaches utilizing preferences are not very intuitive to use, i.e. they may require tweaking of unintuitive parameters, and it is not always clear what kind of results one can get with given set of parameters. In this study we propose a new approach to visually inspect produced solutions, and to extract preference information from the DM to further guide the search. Our approach is based on intuitive use of dynamic query sliders, which serve as a means to extract preference information and are part of the graphical user interface implemented for the efficient UPS-EMO algorithm.
In the search and retrieval of multimedia objects, it is impractical to either manually or automatically extract the contents for indexing since most of the multimedia contents are not machine extractable, while manual extraction tends to be highly laborious and time-consuming. However, by systematically capturing and analyzing the feedback patterns of human users, vital information concerning the multimedia contents can be harvested for effective indexing and subsequent search. By learning from the human judgment and mental evaluation of users, effective search indices can be gradually developed and built up, and subsequently be exploited to find the most relevant multimedia objects. To avoid hovering around a local maximum, we apply the epsilon-greedy method to systematically explore the search space. Through such methodic exploration, we show that the proposed approach is able to guarantee that the most relevant objects can always be discovered, even though initially it may have been overlooked or not regarded as relevant. The search behavior of the present approach is quantitatively analyzed, and closed-form expressions are obtained for the performance of two variants of the epsilon-greedy algorithm, namely EGSE-A and EGSE-B. Simulations and experiments on real data set have been performed which show good agreement with the theoretical findings. The present method is able to leverage exploration in an effective way to significantly raise the performance of multimedia information search, and enables the certain discovery of relevant objects which may be otherwise undiscoverable.
Few-Shot Relation Extraction aims at predicting the relation for a pair of entities in a sentence by training with a few labelled examples in each relation. Some recent works have introduced relation information (i.e., relation labels or descriptions) to assist model learning based on Prototype Network. However, most of them constrain the prototypes of each relation class implicitly with relation information, generally through designing complex network structures, like generating hybrid features, combining with contrastive learning or attention networks. We argue that relation information can be introduced more explicitly and effectively into the model. Thus, this paper proposes a direct addition approach to introduce relation information. Specifically, for each relation class, the relation representation is first generated by concatenating two views of relations (i.e., [CLS] token embedding and the mean value of embeddings of all tokens) and then directly added to the original prototype for both train and prediction. Experimental results on the benchmark dataset FewRel 1.0 show significant improvements and achieve comparable results to the state-of-the-art, which demonstrates the effectiveness of our proposed approach. Besides, further analyses verify that the direct addition is a much more effective way to integrate the relation representations and the original prototypes.
Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium. In contrast to existing software packages such as HTML2text, jusText and Lynx, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. Inscriptis excels in terms of conversion quality, since it correctly converts complex HTML constructs such as nested tables and also interprets a subset of HTML attributes that determine the text alignment. In addition, it (ii) supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document, if annotation support has been enabled.
'Radiomics' is a method that extracts mineable quantitative features from radiographic images. These features can then be used to determine prognosis, for example, predicting the development of distant metastases (DM). Existing radiomics methods, however, require complex manual effort including the design of hand-crafted radiomic features and their extraction and selection. Recent radiomics methods, based on convolutional neural networks (CNNs), also require manual input in network architecture design and hyper-parameter tuning. Radiomic complexity is further compounded when there are multiple imaging modalities, for example, combined positron emission tomography - computed tomography (PET-CT) where there is functional information from PET and complementary anatomical localization information from computed tomography (CT). Existing multi-modality radiomics methods manually fuse the data that are extracted separately. Reliance on manual fusion often results in sub-optimal fusion because they are dependent on an 'expert's' understanding of medical images. In this study, we propose a multi-modality neural architecture search method (MM-NAS) to automatically derive optimal multi-modality image features for radiomics and thus negate the dependence on a manual process. We evaluated our MM-NAS on the ability to predict DM using a public PET-CT dataset of patients with soft-tissue sarcomas (STSs). Our results show that our MM-NAS had a higher prediction accuracy when compared to state-of-the-art radiomics methods.
High quality energy systems information is a crucial input to energy systems research, modeling, and decision-making. Unfortunately, precise information about energy systems is often of limited availability, incomplete, or only accessible for a substantial fee or through a non-disclosure agreement. Recently, remotely sensed data (e.g., satellite imagery, aerial photography) have emerged as a potentially rich source of energy systems information. However, the use of these data is frequently challenged by its sheer volume and complexity, precluding manual analysis. Recent breakthroughs in machine learning have enabled automated and rapid extraction of useful information from remotely sensed data, facilitating large-scale acquisition of critical energy system variables. Here we present a systematic review of the literature on this emerging topic, providing an in-depth survey and review of papers published within the past two decades. We first taxonomize the existing literature into ten major areas, spanning the energy value chain. Within each research area, we distill and critically discuss major features that are relevant to energy researchers, including, for example, key challenges regarding the accessibility and reliability of the methods. We then synthesize our findings to identify limitations and trends in the literature as a whole, and discuss opportunities for innovation.
Fine-grained aspect extraction is an essential sub-task in aspect based opinion analysis. It aims to identify the aspect terms (a.k.a. opinion targets) of a product or service in each sentence. However, expensive annotation process is usually involved to acquire sufficient token-level labels for each domain. To address this limitation, some previous works propose domain adaptation strategies to transfer knowledge from a sufficiently labeled source domain to unlabeled target domains. But due to both the difficulty of fine-grained prediction problems and the large domain gap between domains, the performance remains unsatisfactory. This work conducts a pioneer study on leveraging sentence-level aspect category labels that can be usually available in commercial services like review sites to promote token-level transfer for the extraction purpose. Specifically, the aspect category information is used to construct pivot knowledge for transfer with assumption that the interactions between sentence-level aspect category and token-level aspect terms are invariant across domains. To this end, we propose a novel multi-level reconstruction mechanism that aligns both the fine-grained and coarse-grained information in multiple levels of abstractions. Comprehensive experiments demonstrate that our approach can fully utilize sentence-level aspect category labels to improve cross-domain aspect extraction with a large performance gain.
As an unsupervised dimensionality reduction method, principal component analysis (PCA) has been widely considered as an efficient and effective preprocessing step for hyperspectral image (HSI) processing and analysis tasks. It takes each band as a whole and globally extracts the most representative bands. However, different homogeneous regions correspond to different objects, whose spectral features are diverse. It is obviously inappropriate to carry out dimensionality reduction through a unified projection for an entire HSI. In this paper, a simple but very effective superpixelwise PCA approach, called SuperPCA, is proposed to learn the intrinsic low-dimensional features of HSIs. In contrast to classical PCA models, SuperPCA has four main properties. (1) Unlike the traditional PCA method based on a whole image, SuperPCA takes into account the diversity in different homogeneous regions, that is, different regions should have different projections. (2) Most of the conventional feature extraction models cannot directly use the spatial information of HSIs, while SuperPCA is able to incorporate the spatial context information into the unsupervised dimensionality reduction by superpixel segmentation. (3) Since the regions obtained by superpixel segmentation have homogeneity, SuperPCA can extract potential low-dimensional features even under noise. (4) Although SuperPCA is an unsupervised method, it can achieve competitive performance when compared with supervised approaches. The resulting features are discriminative, compact, and noise resistant, leading to improved HSI classification performance. Experiments on three public datasets demonstrate that the SuperPCA model significantly outperforms the conventional PCA based dimensionality reduction baselines for HSI classification. The Matlab source code is available at https://github.com/junjun-jiang/SuperPCA