



Abstract:In this paper, we propose a new metric for Machine Translation (MT) evaluation, based on bi-directional entailment. We show that machine generated translation can be evaluated by determining paraphrasing with a reference translation provided by a human translator. We hypothesize, and show through experiments, that paraphrasing can be detected by evaluating entailment relationship in the forward and backward direction. Unlike conventional metrics, like BLEU or METEOR, our approach uses deep learning to determine the semantic similarity between candidate and reference translation for generating scores rather than relying upon simple n-gram overlap. We use BERT's pre-trained implementation of transformer networks, fine-tuned on MNLI corpus, for natural language inferencing. We apply our evaluation metric on WMT'14 and WMT'17 dataset to evaluate systems participating in the translation task and find that our metric has a better correlation with the human annotated score compared to the other traditional metrics at system level.
