Reference resolution on extended texts (several thousand references) cannot be evaluated manually. An evaluation algorithm based on equivalence classes of the coreference relation has been proposed for the MUC tests. However, we show here that this algorithm is too lenient, yielding good scores even for poor resolution strategies. Building on the same formalism, we propose two new evaluation algorithms, first comparing them with the MUC algorithm and then giving results on a variety of examples. Finally, we describe a third algorithm that uses only a distributional comparison of equivalence classes; it assesses the relative importance of recall versus precision errors.
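
To make the concern about the MUC algorithm concrete, the following is a minimal sketch. It assumes the MUC algorithm is the usual link-based scoring over equivalence classes, where recall sums, over each key class, its size minus the number of parts into which the response splits it, divided by the sum of class sizes minus one; precision swaps key and response. The function names and the toy data are illustrative and are not taken from the paper.

```python
def muc_recall(key, response):
    """Link-based recall; key and response are lists of sets of mention ids."""
    # Map every mention in the response to the index of its class.
    mention_to_class = {m: i for i, cls in enumerate(response) for m in cls}
    numerator = 0
    denominator = 0
    for cls in key:
        # Count the parts into which the response splits this key class.
        # Mentions absent from the response each count as a singleton part.
        parts = set()
        missing = 0
        for m in cls:
            if m in mention_to_class:
                parts.add(mention_to_class[m])
            else:
                missing += 1
        numerator += len(cls) - (len(parts) + missing)
        denominator += len(cls) - 1
    return numerator / denominator if denominator else 0.0


def muc_precision(key, response):
    """Precision is recall with the roles of key and response swapped."""
    return muc_recall(response, key)


# Hypothetical toy data: three true entities over eight mentions.
key = [{1, 2, 3}, {4, 5}, {6, 7, 8}]
# A poor strategy: merge every mention into a single class.
response = [{1, 2, 3, 4, 5, 6, 7, 8}]

print(muc_recall(key, response))     # 1.0
print(muc_precision(key, response))  # 5/7, about 0.714
```

Under these assumptions, the degenerate "merge everything" strategy obtains perfect recall and a precision of 5/7, which illustrates how a poor resolution strategy can still score well.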

Title: Three New Methods for Evaluating Reference Resolution
