Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox


Why We Need New Evaluation Metrics for NLG

Add code

Jul 21, 2017
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser


Share this with someone who'll enjoy it:


The majority of NLG evaluation relies on automatic metrics, such as BLEU . In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

* Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2231-2242, Copenhagen, Denmark, September 7-11, 2017 
* accepted to EMNLP 2017 


   Access Paper Source



Share this with someone who'll enjoy it: