Information distributed through the Web keeps growing faster day by day, and for this reason, several techniques for extracting Web data have been suggested during last years. Often, extraction tasks are performed through so called wrappers, procedures extracting information from Web pages, e.g. implementing logic-based techniques. Many fields of application today require a strong degree of robustness of wrappers, in order not to compromise assets of information or reliability of data extracted. Unfortunately, wrappers may fail in the task of extracting data from a Web page, if its structure changes, sometimes even slightly, thus requiring the exploiting of new techniques to be automatically held so as to adapt the wrapper to the new structure of the page, in case of failure. In this work we present a novel approach of automatic wrapper adaptation based on the measurement of similarity of trees through improved tree edit distance matching techniques.
Interest point detection is one of the most fundamental and critical problems in computer vision and image processing. In this paper, we carry out a comprehensive review on image feature information (IFI) extraction techniques for interest point detection. To systematically introduce how the existing interest point detection methods extract IFI from an input image, we propose a taxonomy of the IFI extraction techniques for interest point detection. According to this taxonomy, we discuss different types of IFI extraction techniques for interest point detection. Furthermore, we identify the main unresolved issues related to the existing IFI extraction techniques for interest point detection and any interest point detection methods that have not been discussed before. The existing popular datasets and evaluation standards are provided and the performances for eighteen state-of-the-art approaches are evaluated and discussed. Moreover, future research directions on IFI extraction techniques for interest point detection are elaborated.
Extracting temporal relations among events from unstructured text has extensive applications, such as temporal reasoning and question answering. While it is difficult, recent development of Neural-symbolic methods has shown promising results on solving similar tasks. Current temporal relation extraction methods usually suffer from limited expressivity and inconsistent relation inference. For example, in TimeML annotations, the concept of intersection is absent. Additionally, current methods do not guarantee the consistency among the predicted annotations. In this work, we propose SMARTER, a neural semantic parser, to extract temporal information in text effectively. SMARTER parses natural language to an executable logical form representation, based on a custom typed lambda calculus. In the training phase, dynamic programming on denotations (DPD) technique is used to provide weak supervision on logical forms. In the inference phase, SMARTER generates a temporal relation graph by executing the logical form. As a result, our neural semantic parser produces logical forms capturing the temporal information of text precisely. The accurate logical form representations of an event given the context ensure the correctness of the extracted relations.
In recent years, many methods have been developed to identify important portions of text documents. Summarization tools can utilize these methods to extract summaries from large volumes of textual information. However, to identify concepts representing central ideas within a text document and to extract the most informative sentences that best convey those concepts still remain two crucial tasks in summarization methods. In this paper, we introduce a graph-based method to address these two challenges in the context of biomedical text summarization. We show that how a summarizer can discover meaningful concepts within a biomedical text document using the Helmholtz principle. The summarizer considers the meaningful concepts as the main topics and constructs a graph based on the topics that the sentences share. The summarizer can produce an informative summary by extracting those sentences having higher values of the degree. We assess the performance of our method for summarization of biomedical articles using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) toolkit. The results show that the degree can be a useful centrality measure to identify important sentences in this type of graph-based modelling. Our method can improve the performance of biomedical text summarization compared to some state-of-the-art and publicly available summarizers. Combining a concept-based modelling strategy and a graph-based approach to sentence extraction, our summarizer can produce summaries with the highest scores of informativeness among the comparison methods. This research work can be regarded as a start point to the study of small-world networks in summarization of biomedical texts.
Protein-protein interaction (PPI) extraction from published scientific literature provides additional support for precision medicine efforts. However, many of the current PPI extraction methods need extensive feature engineering and cannot make full use of the prior knowledge in knowledge bases (KB). KBs contain huge amounts of structured information about entities and relationships, therefore plays a pivotal role in PPI extraction. This paper proposes a knowledge-aware attention network (KAN) to fuse prior knowledge about protein-protein pairs and context information for PPI extraction. The proposed model first adopts a diagonal-disabled multi-head attention mechanism to encode context sequence along with knowledge representations learned from KB. Then a novel multi-dimensional attention mechanism is used to select the features that can best describe the encoded context. Experiment results on the BioCreative VI PPI dataset show that the proposed approach could acquire knowledge-aware dependencies between different words in a sequence and lead to a new state-of-the-art performance.
Event extraction, the technology that aims to automatically get the structural information from documents, has attracted more and more attention in many fields. Most existing works discuss this issue with the token-level multi-label classification framework by distinguishing the tokens as different roles while ignoring the writing styles of documents. The writing style is a special way of content organizing for documents and it is relative fixed in documents with a special field (e.g. financial, medical documents, etc.). We argue that the writing style contains important clues for judging the roles for tokens and the ignorance of such patterns might lead to the performance degradation for the existing works. To this end, we model the writing style in documents as a distribution of argument roles, i.e., Role-Rank Distribution, and propose an event extraction model with the Role-Rank Distribution based Supervision Mechanism to capture this pattern through the supervised training process of an event extraction task. We compare our model with state-of-the-art methods on several real-world datasets. The empirical results show that our approach outperforms other alternatives with the captured patterns. This verifies the writing style contains valuable information that could improve the performance of the event extraction task.
The past several years have witnessed significant progress in modeling the Cocktail Party Problem in terms of speech separation and speaker extraction. In recent years, multi-modal cues, including spatial information, facial expression and voiceprint, are introduced to speaker extraction task to serve as complementary information to each other to achieve better performance. However, the front-end model, for speaker extraction, become large and hard to deploy on a resource-constrained device. In this paper, we address the aforementioned problem with novel model architectures and model compression techniques, and propose a lightweight multi-modal framework for speaker extraction (dubbed LiMuSE), which adopts group communication (GC) to split multi-modal high-dimension features into groups of low-dimension features with smaller width which could be run in parallel, and further uses an ultra-low bit quantization strategy to achieve lower model size. The experiments on the GRID dataset show that incorporating GC into the multi-modal framework achieves on par or better performance with 24.86 times fewer parameters, and applying the quantization strategy to the GC-equipped model further obtains about 9 times compression ratio while maintaining a comparable performance compared with baselines. Our code will be available at https://github.com/aispeech-lab/LiMuSE.
Protein-protein interaction (PPI) extraction from published scientific literature provides additional support for precision medicine efforts. Meanwhile, knowledge bases (KBs) contain huge amounts of structured information of protein entities and their relations, which can be encoded in entity and relation embeddings to help PPI extraction. However, the prior knowledge of protein-protein pairs must be selectively used so that it is suitable for different contexts. This paper proposes a Knowledge Selection Model (KSM) to fuse the selected prior knowledge and context information for PPI extraction. Firstly, two Transformers encode the context sequence of a protein pair according to each protein embedding, respectively. Then, the two outputs are fed to a mutual attention to capture the important context features towards the protein pair. Next, the context features are used to distill the relation embedding by a knowledge selector. Finally, the selected relation embedding and the context features are concatenated for PPI extraction. Experiments on the BioCreative VI PPI dataset show that KSM achieves a new state-of-the-art performance (38.08% F1-score) by adding knowledge selection.
We present an Open Information Extraction (IE) approach that uses a two-layered transformation stage consisting of a clausal disembedding layer and a phrasal disembedding layer, together with rhetorical relation identification. In that way, we convert sentences that present a complex linguistic structure into simplified, syntactically sound sentences, from which we can extract propositions that are represented in a two-layered hierarchy in the form of core relational tuples and accompanying contextual information which are semantically linked via rhetorical relations. In a comparative evaluation, we demonstrate that our reference implementation Graphene outperforms state-of-the-art Open IE systems in the construction of correct n-ary predicate-argument structures. Moreover, we show that existing Open IE approaches can benefit from the transformation process of our framework.