Accurate diagnosis of autism spectrum disorder (ASD) based on neuroimaging data has significant implications, but extracting useful information from neuroimaging data for ASD detection is challenging. Even though machine learning techniques have been leveraged to improve information extraction from neuroimaging data, the varying data quality caused by different meta-data conditions (i.e., data collection strategies) limits the effective information that can be extracted. This leads to data-dependent predictive accuracies in ASD detection, which can be worse than random guessing in some cases. In this work, we systematically investigate the impact of three kinds of meta-data on the predictive accuracy of classifying ASD based on structural MRI collected from 20 different sites, where meta-data conditions vary.
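As an illustration of what a site-dependent accuracy looks like in practice, the following minimal sketch groups predictions by acquisition site and flags sites whose accuracy falls below chance. The function names, the record layout, and the toy records are assumptions made for this example; they are not the study's pipeline or data.

```python
from collections import defaultdict

def accuracy_by_site(records):
    """Return {site: accuracy} from (site, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for site, y_true, y_pred in records:
        total[site] += 1
        correct[site] += int(y_true == y_pred)
    return {site: correct[site] / total[site] for site in total}

def below_chance(site_accuracy, chance=0.5):
    """List sites whose accuracy is below the chance level.

    The default of 0.5 assumes balanced binary labels (ASD vs. control).
    """
    return sorted(site for site, acc in site_accuracy.items() if acc < chance)

# Hypothetical toy predictions, not data from the study.
records = [
    ("site_A", 1, 1), ("site_A", 0, 0), ("site_A", 1, 0), ("site_A", 0, 0),
    ("site_B", 1, 0), ("site_B", 0, 1), ("site_B", 1, 1),
]
site_acc = accuracy_by_site(records)
flagged = below_chance(site_acc)
```

Reporting accuracy per site rather than pooled makes it visible when a classifier that looks acceptable overall performs worse than guessing on data collected under particular conditions.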
Classic information extraction techniques consist of building questions and answers about facts. However, it remains a challenge for subjective information extraction systems to identify opinions and feelings in context. In sentiment-based NLP tasks, there are few resources for information extraction, especially for offensive or hateful opinions in context. To fill this important gap, this short paper provides a new cross-lingual and contextual offensive lexicon, which consists of explicit and implicit offensive and swearing expressions of opinion, annotated in two classes: context-dependent and context-independent offensive. In addition, we provide markers to identify hate speech. The annotation approach was evaluated at the expression level and achieved high human inter-annotator agreement. The offensive lexicon is available in Portuguese and English.
We build a reference for the task of Open Information Extraction, on five documents. We tentatively resolve a number of issues that arise, including inference and granularity. We seek to better pinpoint the requirements for the task. We produce our annotation guidelines specifying what is correct to extract and what is not. In turn, we use this reference to score existing Open IE systems. We address the non-trivial problem of evaluating the extractions produced by systems against the reference tuples, and share our evaluation script. Among seven compared extractors, we find the MinIE system to perform best.
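To make the scoring step concrete, here is a minimal sketch that compares extracted tuples with reference tuples and reports precision, recall, and F1. It assumes exact matching after simple normalization, which is a deliberate simplification: the abstract describes matching as non-trivial, and this is not the shared evaluation script. The functions `normalize` and `score` and the example tuples are hypothetical.

```python
def normalize(tuple_):
    """Lowercase and collapse whitespace in every slot of a tuple."""
    return tuple(" ".join(part.lower().split()) for part in tuple_)

def score(extracted, reference):
    """Exact-match precision, recall and F1 of extracted tuples against a reference."""
    pred = {normalize(t) for t in extracted}
    gold = {normalize(t) for t in reference}
    matched = len(pred & gold)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# Hypothetical tuples for illustration only.
reference = [
    ("Marie Curie", "won", "the Nobel Prize"),
    ("Marie Curie", "was born in", "Warsaw"),
]
extracted = [
    ("marie curie", "won", "the  Nobel Prize"),  # matches after normalization
    ("Marie Curie", "lived in", "Paris"),        # not in the reference
]
precision, recall, f1 = score(extracted, reference)
```

Exact matching penalizes extractions that differ from the reference only in granularity or wording, which is one reason a more careful matching procedure is needed in a real evaluation.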
Image haze removal is highly desirable for computer vision applications. This paper proposes a novel Context Guided Generative Adversarial Network (CGGAN) for single image dehazing, in which a new encoder-decoder is employed as the generator. The generator consists of a feature-extraction-net, a context-extraction-net, and a fusion-net in sequence. The feature-extraction-net acts as an encoder and is used for extracting haze features. The context-extraction-net is a multi-scale parallel pyramid decoder and is used for extracting the deep features of the encoder and generating a coarse dehazed image. The fusion-net is a decoder and is used for obtaining the final haze-free image. To obtain better results, multi-scale information obtained during the decoding process of the context-extraction decoder is used to guide the fusion decoder. By introducing an extra coarse decoder to the original encoder-decoder, the CGGAN can make better use of the deep feature information extracted by the encoder. To ensure that our CGGAN works effectively for different haze scenarios, different loss functions are employed for the two decoders. Experimental results show the advantage and effectiveness of our proposed CGGAN, with evident improvements over existing state-of-the-art methods.
Identifying informative tweets is an important step when building information extraction systems based on social media. WNUT-2020 Task 2 was organised to distinguish informative tweets from noisy ones. In this paper, we present our transformer-based approach to the task. Overall, our approach achieves 10th place in the final rankings, with an F1 score of 0.9004 on the test set.
In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporates information from relation extraction into entity extraction. This means that the system can be trained even on data sets where only a subset of all valid entity spans is annotated. We provide an extensive evaluation of the proposed system and its strengths and weaknesses. Our approach, which can be scaled dynamically in computational complexity at inference time, produces predictions with high precision and reaches 3rd place in the leaderboard of SemEval-2022 Task 12. For inputs in the domains of physics and math, it achieves high relation extraction macro F1 scores of 95.43% and 79.17%, respectively. The code used for training and evaluating our models is available at: https://github.com/nicpopovic/RE1st
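One way to picture relation information feeding into entity extraction is a decoding step that first selects likely symbol-description pairs and then reads the entities off the selected pairs. The sketch below is only an illustration under that assumption and is not the authors' model; the function `decode_relation_first`, the threshold, and all scores and spans are hypothetical.

```python
def decode_relation_first(pair_scores, threshold=0.5):
    """Keep span pairs scoring at or above the threshold, then derive entities.

    pair_scores maps (symbol_span, description_span) -> score in [0, 1],
    where each span is a (start, end) token offset pair.
    """
    relations = sorted(pair for pair, s in pair_scores.items() if s >= threshold)
    # Entities are exactly the spans that take part in a kept relation.
    symbols = sorted({sym for sym, _ in relations})
    descriptions = sorted({desc for _, desc in relations})
    return relations, symbols, descriptions

# Hypothetical scores for two symbol spans and two description spans.
pair_scores = {
    ((0, 1), (3, 6)): 0.92,
    ((0, 1), (8, 10)): 0.10,
    ((12, 13), (8, 10)): 0.81,
    ((12, 13), (3, 6)): 0.05,
}
relations, symbols, descriptions = decode_relation_first(pair_scores)
```

Because entities are derived from relations in this sketch, a span that was never annotated as an entity does not have to be labeled as a negative example, which is in the spirit of training on partially annotated data.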
In today's digital age, at the dawn of the era of big data analytics, it is not the information itself but the linking of information through entities and actions that defines the discourse. Any textual data, whether available on the Internet or offline (such as newspaper data, Wikipedia dumps, etc.), is essentially connected information that cannot be treated in isolation if its full semantics are to be preserved. There is a need for an automated retrieval process with proper information extraction to structure the data for relevant and fast text analytics. The first big challenge is the conversion of unstructured textual data to structured data. Unlike other databases, graph databases handle relationships and connections elegantly. Our project aims at developing a graph-based information extraction and retrieval system.
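The idea of storing extracted facts as a graph can be sketched with a small in-memory structure in which entities are nodes and relations are labeled edges. This is a stand-in for a real graph database, which the abstract does not name; the class `TripleGraph`, its methods, and the example triples are assumptions made for illustration.

```python
from collections import defaultdict

class TripleGraph:
    """Tiny in-memory graph: nodes are entities, labeled edges are relations."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # subject -> [(relation, object)]
        self.in_edges = defaultdict(list)   # object -> [(relation, subject)]

    def add(self, subject, relation, obj):
        """Store one extracted (subject, relation, object) triple."""
        self.out_edges[subject].append((relation, obj))
        self.in_edges[obj].append((relation, subject))

    def neighbors(self, entity):
        """All entities directly connected to `entity`, in either direction."""
        linked = {o for _, o in self.out_edges.get(entity, [])}
        linked |= {s for _, s in self.in_edges.get(entity, [])}
        return sorted(linked)

    def objects(self, subject, relation):
        """Objects linked from `subject` by the given relation label."""
        return [o for r, o in self.out_edges.get(subject, []) if r == relation]

# Hypothetical triples, as an extraction step might produce them.
graph = TripleGraph()
for triple in [
    ("Ada Lovelace", "worked_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Ada Lovelace", "wrote_about", "Analytical Engine"),
]:
    graph.add(*triple)

engine_links = graph.neighbors("Analytical Engine")
```

Keeping both outgoing and incoming edges lets retrieval start from any entity, which is the kind of relationship-centric query that a graph store is meant to make cheap.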
We present a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to be particularly useful when users repeatedly have to sift through large collections of documents such as those downloaded automatically from the internet. The proposed system takes a whole document collection as input. It first carries out some automatic analysis tasks (named entity recognition, geo-coding, clustering, term extraction), annotates the texts with the generated meta-information, and stores the meta-information in a database. The system then generates a zoomable and hyperlinked geographic map enhanced with information on the entities and terms found. When the system is used on a regular basis, it builds up a historical database that contains information on which names have been mentioned together with which other names or places, and users can query this database to retrieve information extracted in the past.
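The historical database of names mentioned together can be approximated by counting entity co-occurrences per document. The following is a minimal sketch that assumes named entity recognition has already produced a list of names and places for each document; the function names and the toy entity lists are hypothetical and do not describe the prototype's actual storage.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs_entities):
    """Count how many documents mention each unordered pair of entities together."""
    counts = Counter()
    for entities in docs_entities:
        # Sort and deduplicate so each pair is counted once per document.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

def mentioned_with(counts, name):
    """Entities seen together with `name`, most frequent first."""
    partners = Counter()
    for (a, b), n in counts.items():
        if a == name:
            partners[b] += n
        elif b == name:
            partners[a] += n
    return partners.most_common()

# Hypothetical entity lists, one per analyzed document.
docs_entities = [
    ["Alice", "Paris", "Bob"],
    ["Alice", "Paris"],
    ["Bob", "Berlin"],
]
counts = cooccurrence_counts(docs_entities)
alice_partners = mentioned_with(counts, "Alice")
```

Updating the counts each time a new collection is processed would let the index grow over time, so that later queries can retrieve associations extracted in earlier runs.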
The recent rapid increase in the generation of clinical data and the rapid development of computational science enable us to extract new insights from massive datasets in the healthcare industry. Oncological clinical notes are creating rich databases that document patients' histories, and they potentially contain many patterns that could help in better management of the disease. However, these patterns are locked within the free-text (unstructured) portions of clinical documents, which limits the ability of health professionals to extract useful information from them and, ultimately, to perform the Query and Answering (QA) process accurately. The Information Extraction (IE) process requires Natural Language Processing (NLP) techniques to assign semantics to these patterns. Therefore, in this paper, we analyze the design of annotators for specific lung cancer concepts that can be integrated into the Apache Unstructured Information Management Architecture (UIMA) framework. In addition, we explain the details of the generation and storage of annotation outcomes.