ROUGE is a widely adopted, automatic evaluation measure for text summarization. While it has been shown to correlate well with human judgements, it is biased towards surface lexical similarities. This makes it unsuitable for the evaluation of abstractive summarization, or summaries with substantial paraphrasing. We study the effectiveness of word embeddings to overcome this disadvantage of ROUGE. Specifically, instead of measuring lexical overlaps, word embeddings are used to compute the semantic similarity of the words used in summaries instead. Our experimental results show that our proposal is able to achieve better correlations with human judgements when measured with the Spearman and Kendall rank coefficients.
In this paper, we motivate the need for a publicly available, generic software framework for question-answering (QA) systems. We present an open-source QA framework QANUS which researchers can leverage on to build new QA systems easily and rapidly. The framework implements much of the code that will otherwise have been repeated across different QA systems. To demonstrate the utility and practicality of the framework, we further present a fully functioning factoid QA system QA-SYS built on top of QANUS.