During the rapid urbanization construction of China, acquisition of urban geographic information and timely data updating are important and fundamental tasks for the refined management of cities. With the development of domestic remote sensing technology, the application of Gaofen-2 (GF-2) high-resolution remote sensing images can greatly improve the accuracy of information extraction. This paper introduces an approach using object-oriented classification methods for urban feature extraction based on GF-2 satellite data. A combination of spectral, spatial attributes and membership functions was employed for mapping the urban features of Qinhuai District, Nanjing. The data preprocessing is carried out by ENVI software, and the subsequent data is exported into the eCognition software for object-oriented classification and extraction of urban feature information. Finally, the obtained raster image classification results are vectorized using the ARCGIS software, and the vector graphics are stored in the library, which can be used for further analysis and modeling. Accuracy assessment was performed using ground truth data acquired by visual interpretation and from other reliable secondary data sources. Compared with the result of pixel-based supervised (neural net) classification, the developed object-oriented method can significantly improve extraction accuracy, and after manual interpretation, an overall accuracy of 95.44% can be achieved, with a Kappa coefficient of 0.9405, which objectively confirmed the superiority of the object-oriented method and the feasibility of the utilization of GF-2 satellite data.
Many real world systems need to operate on heterogeneous information networks that consist of numerous interacting components of different types. Examples include systems that perform data analysis on biological information networks; social networks; and information extraction systems processing unstructured data to convert raw text to knowledge graphs. Many previous works describe specialized approaches to perform specific types of analysis, mining and learning on such networks. In this work, we propose a unified framework consisting of a data model -a graph with a first order schema along with a declarative language for constructing, querying and manipulating such networks in ways that facilitate relational and structured machine learning. In particular, we provide an initial prototype for a relational and graph traversal query language where queries are directly used as relational features for structured machine learning models. Feature extraction is performed by making declarative graph traversal queries. Learning and inference models can directly operate on this relational representation and augment it with new data and knowledge that, in turn, is integrated seamlessly into the relational structure to support new predictions. We demonstrate this system's capabilities by showcasing tasks in natural language processing and computational biology domains.
An overwhelmingly large amount of knowledge in the materials domain is generated and stored as text published in peer-reviewed scientific literature. Recent developments in natural language processing, such as bidirectional encoder representations from transformers (BERT) models, provide promising tools to extract information from these texts. However, direct application of these models in the materials domain may yield suboptimal results as the models themselves may not be trained on notations and jargon that are specific to the domain. Here, we present a materials-aware language model, namely, MatSciBERT, which is trained on a large corpus of scientific literature published in the materials domain. We further evaluate the performance of MatSciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction, on different materials datasets. We show that MatSciBERT outperforms SciBERT, a language model trained on science corpus, on all the tasks. Further, we discuss some of the applications of MatSciBERT in the materials domain for extracting information, which can, in turn, contribute to materials discovery or optimization. Finally, to make the work accessible to the larger materials community, we make the pretrained and finetuned weights and the models of MatSciBERT freely accessible.
Predictable Feature Analysis (PFA) (Richthofer, Wiskott, ICMLA 2015) is an algorithm that performs dimensionality reduction on high dimensional input signal. It extracts those subsignals that are most predictable according to a certain prediction model. We refer to these extracted signals as predictable features. In this work we extend the notion of PFA to take supplementary information into account for improving its predictions. Such information can be a multidimensional signal like the main input to PFA, but is regarded external. That means it won't participate in the feature extraction - no features get extracted or composed of it. Features will be exclusively extracted from the main input such that they are most predictable based on themselves and the supplementary information. We refer to this enhanced PFA as PFAx (PFA extended). Even more important than improving prediction quality is to observe the effect of supplementary information on feature selection. PFAx transparently provides insight how the supplementary information adds to prediction quality and whether it is valuable at all. Finally we show how to invert that relation and can generate the supplementary information such that it would yield a certain desired outcome of the main signal. We apply this to a setting inspired by reinforcement learning and let the algorithm learn how to control an agent in an environment. With this method it is feasible to locally optimize the agent's state, i.e. reach a certain goal that is near enough. We are preparing a follow-up paper that extends this method such that also global optimization is feasible.
Requirements elicitation can be very challenging in projects that require deep domain knowledge about the system at hand. As analysts have the full control over the elicitation process, their lack of knowledge about the system under study inhibits them from asking related questions and reduces the accuracy of requirements provided by stakeholders. We present ELICA, a generic interactive visual analytics tool to assist analysts during requirements elicitation process. ELICA uses a novel information extraction algorithm based on a combination of Weighted Finite State Transducers (WFSTs) (generative model) and SVMs (discriminative model). ELICA presents the extracted relevant information in an interactive GUI (including zooming, panning, and pinching) that allows analysts to explore which parts of the ongoing conversation (or specification document) match with the extracted information. In this demonstration, we show that ELICA is usable and effective in practice, and is able to extract the related information in real-time. We also demonstrate how carefully designed features in ELICA facilitate the interactive and dynamic process of information extraction.
To reduce the large amount of time spent screening, identifying, and recruiting patients into clinical trials, we need prescreening systems that are able to automate the data extraction and decision-making tasks that are typically relegated to clinical research study coordinators. However, a major obstacle is the vast amount of patient data available as unstructured free-form text in electronic health records. Here we propose an information extraction-based approach that first automatically converts unstructured text into a structured form. The structured data are then compared against a list of eligibility criteria using a rule-based system to determine which patients qualify for enrollment in a heart failure clinical trial. We show that we can achieve highly accurate results, with recall and precision values of 0.95 and 0.86, respectively. Our system allowed us to significantly reduce the time needed for prescreening patients from a few weeks to a few minutes. Our open-source information extraction modules are available for researchers and could be tested and validated in other cardiovascular trials. An approach such as the one we demonstrate here may decrease costs and expedite clinical trials, and could enhance the reproducibility of trials across institutions and populations.
State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers, openings or footers; complex page layout or presence of multiple pages. To encourage progress on deeper and more complex Information Extraction (IE) we introduce a new task (named Kleister) with two new datasets. Utilizing both textual and structural layout features, an NLP system must find the most important information, about various types of entities, in long formal documents. We propose Pipeline method as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa). Moreover, we checked the most popular PDF processing tools for text extraction (pdf2djvu, Tesseract and Textract) in order to analyze behavior of IE system in presence of errors introduced by these tools.
Event extraction has long been treated as a sentence-level task in the IE community. We argue that this setting does not match human information-seeking behavior and leads to incomplete and uninformative extraction results. We propose a document-level neural event argument extraction model by formulating the task as conditional generation following event templates. We also compile a new document-level event extraction benchmark dataset WikiEvents which includes complete event and coreference annotation. On the task of argument extraction, we achieve an absolute gain of 7.6% F1 and 5.7% F1 over the next best model on the RAMS and WikiEvents datasets respectively. On the more challenging task of informative argument extraction, which requires implicit coreference reasoning, we achieve a 9.3% F1 gain over the best baseline. To demonstrate the portability of our model, we also create the first end-to-end zero-shot event extraction framework and achieve 97% of fully supervised model's trigger extraction performance and 82% of the argument extraction performance given only access to 10 out of the 33 types on ACE.
We curated WikiPII, an automatically labeled dataset composed of Wikipedia biography pages, annotated for personal information extraction. Although automatic annotation can lead to a high degree of label noise, it is an inexpensive process and can generate large volumes of annotated documents. We trained a BERT-based NER model with WikiPII and showed that with an adequately large training dataset, the model can significantly decrease the cost of manual information extraction, despite the high level of label noise. In a similar approach, organizations can leverage text mining techniques to create customized annotated datasets from their historical data without sharing the raw data for human annotation. Also, we explore collaborative training of NER models through federated learning when the annotation is noisy. Our results suggest that depending on the level of trust to the ML operator and the volume of the available data, distributed training can be an effective way of training a personal information identifier in a privacy-preserved manner. Research material is available at https://github.com/ratmcu/wikipiifed.
Document-level relation extraction (RE), which requires reasoning on multiple entities in different sentences to identify complex inter-sentence relations, is more challenging than sentence-level RE. To extract the complex inter-sentence relations, previous studies usually employ graph neural networks (GNN) to perform inference upon heterogeneous document-graphs. Despite their great successes, these graph-based methods, which normally only consider the words within the mentions in the process of building graphs and reasoning, tend to ignore the non-entity clue words that are not in the mentions but provide important clue information for relation reasoning. To alleviate this problem, we treat graph-based document-level RE models as an encoder-decoder framework, which typically uses a pre-trained language model as the encoder and a GNN model as the decoder, and propose a novel graph-based model NC-DRE that introduces decoder-to-encoder attention mechanism to leverage Non-entity Clue information for Document-level Relation Extraction.