In this work, we analyze a pseudo-relevance retrieval method based on the results of web search engines. By enriching topics with text data from web search engine result pages and linked contents, we train topic-specific and cost-efficient classifiers that can be used to search test collections for relevant documents. Building upon attempts initially made at TREC Common Core 2018 by Grossman and Cormack, we address questions of system performance over time considering different search engines, queries, and test collections. Our experimental results show how and to which extent the considered components affect the retrieval performance. Overall, the analyzed method is robust in terms of average retrieval performance and a promising way to use web content for the data enrichment of relevance feedback methods.
Videos are a commonly-used type of content in learning during Web search. Many e-learning platforms provide quality content, but sometimes educational videos are long and cover many topics. Humans are good in extracting important sections from videos, but it remains a significant challenge for computers. In this paper, we address the problem of assigning importance scores to video segments, that is how much information they contain with respect to the overall topic of an educational video. We present an annotation tool and a new dataset of annotated educational videos collected from popular online learning platforms. Moreover, we propose a multimodal neural architecture that utilizes state-of-the-art audio, visual and textual features. Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features on importance prediction.
The work presented in this paper attempts to evaluate and quantify the use of discourse relations in the context of blog summarization and compare their use to more traditional and factual texts. Specifically, we measured the usefulness of 6 discourse relations - namely comparison, contingency, illustration, attribution, topic-opinion, and attributive for the task of text summarization from blogs. We have evaluated the effect of each relation using the TAC 2008 opinion summarization dataset and compared them with the results with the DUC 2007 dataset. The results show that in both textual genres, contingency, comparison, and illustration relations provide a significant improvement on summarization content; while attribution, topic-opinion, and attributive relations do not provide a consistent and significant improvement. These results indicate that, at least for summarization, discourse relations are just as useful for informal and affective texts as for more traditional news articles.
Blind source separation (BSS), i.e., the decoupling of unknown signals that have been mixed in an unknown way, has been a topic of great interest in the signal processing community for the last decade, covering a wide range of applications in such diverse fields as digital communications, pattern recognition, biomedical engineering, and financial data analysis, among others. This course aims at an introduction to the BSS problem via an exposition of well-known and established as well as some more recent approaches to its solution. A unified way is followed in presenting the various results so as to more easily bring out their similarities/differences and emphasize their relative advantages/disadvantages. Only a representative sample of the existing knowledge on BSS will be included in this course. The interested readers are encouraged to consult the list of bibliographical references for more details on this exciting and always active research topic.
Despite the prevalence of sentiment-related content on the Web, there has been limited work on focused crawlers capable of effectively collecting such content. In this study, we evaluated the efficacy of using sentiment-related information for enhanced focused crawling of opinion-rich web content regarding a particular topic. We also assessed the impact of using sentiment-labeled web graphs to further improve collection accuracy. Experimental results on a large test bed encompassing over half a million web pages revealed that focused crawlers utilizing sentiment information as well as sentiment-labeled web graphs are capable of gathering more holistic collections of opinion-related content regarding a particular topic. The results have important implications for business and marketing intelligence gathering efforts in the Web 2.0 era.
In this paper, we present our Alexa Prize Grand Challenge 4 socialbot: Proto. Leveraging diverse sources of world knowledge, and powered by a suite of neural and rule-based natural language understanding modules, state-of-the-art neural generators, novel state-based deterministic generators, an ensemble of neural re-rankers, a robust post-processing algorithm, and an efficient overall conversation strategy, Proto strives to be able to converse coherently about a diverse range of topics of interest to humans, and provide a memorable experience to the user. In this paper we dissect and analyze the different components and conversation strategies implemented by our socialbot, which enables us to generate colloquial, empathetic, engaging, self-rectifying, factually correct, and on-topic response, which has helped us achieve consistent scores throughout the competition.
This paper presents a preliminary experimentation study using the CLEF 2017 eHealth Task 2 collection for evaluating the effectiveness of different indexing methodologies of documents and query parsing techniques. Furthermore, it is an attempt to advance and share the efforts of observing the characteristics and helpfulness of various methodologies for indexing PubMed documents and for different topic parsing techniques to produce queries. For this purpose, my research includes experimentation with different document indexing methodologies, by utilising existing tools, such as the Lucene4IR (L4IR) information retrieval system, the Technology Assisted Reviews for Empirical Medicine tool for parsing topics of the CLEF collection and the TREC evaluation tool to appraise system's performance. The results showed that including a greater number of fields to the PubMed indexer of L4IR is a decisive factor for the retrieval effectiveness of L4IR.
There are many discussions held during political meetings, and a large number of utterances for various topics is included in their transcripts. We need to read all of them if we want to follow speakers\' intentions or opinions about a given topic. To avoid such a costly and time-consuming process to grasp often longish discussions, NLP researchers work on generating concise summaries of utterances. Summarization subtask in QA Lab-PoliInfo-2 task of the NTCIR-15 addresses this problem for Japanese utterances in assembly minutes, and our team (SKRA) participated in this subtask. As a first step for summarizing utterances, we created a new pre-trained sentence embedding model, i.e. the Japanese Political Sentence-BERT. With this model, we summarize utterances without labelled data. This paper describes our approach to solving the task and discusses its results.
Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.
Developing new ideas and algorithms in the fields of graph processing and relational learning requires datasets to work with and WikiData is the largest open source knowledge graph involving more than fifty millions entities. It is larger than needed in many cases and even too large to be processed easily but it is still a goldmine of relevant facts and subgraphs. Using this graph is time consuming and prone to task specific tuning which can affect reproducibility of results. Providing a unified framework to extract topic-specific subgraphs solves this problem and allows researchers to evaluate algorithms on common datasets. This paper presents various topic-specific subgraphs of WikiData along with the generic Python code used to extract them. These datasets can help develop new methods of knowledge graph processing and relational learning.