In the previous entry I described the basic concept of measuring MT quality with BLEU scores. However, BLEU has drawn considerable criticism, and its limitations should be understood if you are to use the metric effectively.
BLEU only measures direct word-for-word similarity: it looks for the extent to which word clusters in the MT output and the human reference are identical. Accurate translations that use different words may score poorly simply because there is no match in the human reference.
There is no understanding of paraphrases or synonyms, so scores can be misleading as a measure of overall accuracy. You have to use exactly the same words as the human reference translation to get credit, e.g.
"Wander" gets no partial credit for "stroll," nor "sofa" for "couch."
Also, nonsensical output that contains the right phrases in the wrong order can score well, e.g.
"Appeared calm when he was taken to the American plane, which will to Miami, Florida" would get the very same score as "was being led to the calm as he was would take carry him seemed quite when taken". These and other problems are described in this article.
This paper from the University of Edinburgh shows that BLEU may not correlate with human judgment as well as was previously believed, and that higher scores do not always mean better translation quality.
A more recent criticism identifies the following problems:
-- It is an intrinsically meaningless score
-- It admits too many variations – meaningless and syntactically incorrect variations can score the same as good variations
-- It admits too few variations – it treats synonyms as incorrect
-- More reference translations do not necessarily help
-- Poor correlation with human judgments
They also point out several problems with BLEU in the English-Hindi language combination. Thus it is clear that while BLEU can be useful in some circumstances, it is increasingly being called into question.
The research community has recently focused on developing metrics that overcome at least some of these shortcomings, and a few new measures are promising. The objective is to develop automated measures that track human judgments very closely and thus provide a useful proxy during the development of any MT system. The most promising of these include:
METEOR: “I think it would be worth pointing out that while BLEU has been most suitable for "parameter tuning" of SMT systems, METEOR is particularly better suited for assessing MT translation quality of individual sentences, collections of sentences, and documents. This may be quite important to LSPs and to users of MT technology.” METEOR can also be trained to correlate closely with particular types of human quality judgments and customized for specific project needs.
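As a rough illustration of the synonym handling, NLTK ships an implementation of METEOR; the snippet below is only a sketch with my own example sentences (it needs the WordNet data installed, and recent NLTK versions expect pre-tokenized input).

```python
# A minimal sketch of METEOR's stem and synonym matching using NLTK.
# Requires the WordNet data: nltk.download('wordnet')
from nltk.translate.meteor_score import meteor_score

reference  = "the old man went for a stroll near the couch".split()
hypothesis = "the old man went for a wander near the sofa".split()

# METEOR aligns stems and WordNet synonyms as well as exact words
# ("sofa" and "couch" share a synset, for instance), so this paraphrase
# is credited more generously than BLEU's exact n-gram matching allows.
print(meteor_score([reference], hypothesis))
```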
TERp: takes as input a set of reference translations and a set of machine translation outputs for that same data. It aligns the MT output to the reference translations and measures the number of 'edits' needed to transform the MT output into the reference translation.
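To make the edit-counting idea concrete, here is a deliberately simplified sketch of my own: plain word-level edit distance normalized by reference length. Real TER also treats block shifts as single edits, and TERp adds stemming, synonym and paraphrase matching on top of that, so this only approximates how the score behaves (lower is better).

```python
# A simplified, TER-like score: word-level Levenshtein edits divided by
# the reference length. (Real TER adds block shifts; TERp adds stemming,
# synonyms and paraphrases.)
def ter_like(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete a hypothesis word
                          d[i][j - 1] + 1,          # insert a reference word
                          d[i - 1][j - 1] + cost)   # substitute or match
    return d[len(hyp)][len(ref)] / len(ref)

# Two substitutions ("wander" -> "stroll", "sofa" -> "couch") over a
# 10-word reference gives 0.2, i.e. roughly 20% of the words need editing.
print(ter_like("the old man went for a wander near the sofa",
               "the old man went for a stroll near the couch"))
```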
Some suggest that it is best to use a combination of these measures to get the best overall sense of quality and maximize accuracy. NIST and the Euromatrix project have studied the correlation of various metrics with human judgments, for those who want the gory details.
The basic objective of any automated quality measurement metric is to provide the same judgment that a competent, consistent and objective human would provide, very quickly and efficiently during the engine development process, and thus to guide the development of MT engines in an accurate and useful way.
In the Automated Language Translation group on LinkedIn there is a very interesting discussion of these metrics. If you read the thread all the way through you will see there is much confusion about what is meant by “translation quality”. The dialog between developers and translation professionals has not been productive because of this confusion: process quality and linguistic quality are often equated, and confusion ensues.
This discussion is often difficult because of conflation, i.e. very different concepts being equated and assumed to be the same. I think at least three different concepts are being referenced and confused as the same thing in many discussions of “quality”.
1. End-to-End Process Standards: ISO 9001, EN 15038, Microsoft QA and LISA QA 3.1. These have a strong focus on administrative, documentation, review and revision processes, not just the quality assessment of the final translation.
2. Automated SMT System Output Metrics (TQM): BLEU, METEOR, TERp, F-Measure, ROUGE and several others that focus solely on rapidly scoring MT output by assessing precision and recall against one or more human translations of the exact same source material (a simple word-level illustration of precision, recall and F-measure is sketched after this list).
3. Human Evaluation of Translations: Error categorization and subjective human quality assessment, usually at a sentence level. SAE J2450, the LISA Quality Metric and perhaps the Butler Hill TQ Metric (that Microsoft uses extensively and TAUS advocates) are examples of this. The screenshot below shows the Asia Online human evaluation tool.
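For readers unfamiliar with the precision and recall terminology used by the automated metrics in category 2, here is a minimal word-level sketch of my own (not any particular toolkit's implementation): precision asks how much of the MT output appears in the reference, recall asks how much of the reference the MT output covers, and the F-measure combines the two.

```python
# A minimal word-level precision/recall/F-measure against a single reference.
from collections import Counter

def word_f_measure(hypothesis, reference):
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())        # each word credited at most as often as it appears in the reference
    precision = overlap / sum(hyp.values())    # share of the MT output found in the reference
    recall = overlap / sum(ref.values())       # share of the reference covered by the MT output
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(word_f_measure("the old man went for a wander near the sofa",
                     "the old man went for a stroll near the couch"))
```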
There is also ASTM F2575, a translation quality assurance standard published by ASTM International (an ANSI-accredited standards body), which defines quality as "the degree to which the characteristics of a translation fulfill the requirements of the agreed-upon specifications."
For the professional translation industry the most important question is: is the quality good enough to give to a human post-editor? There is an interesting discussion on “Weighting Machine Translation Quality with Cost of Correction” by Tex Texin and commenters. Work is also being done to see whether METEOR and TERp can be adapted to give professionals more meaningful input on the degree of post-editing effort likely for different engines. This will get easier and the process will become better informed.
For most MT projects it is recommended that "Test Sets" be developed with great care as a first step, to reflect the reality of the system's production use. There should also always be a human quality assessment step, usually done after BLEU/METEOR have been used to develop the basic engine. This human error analysis is used to develop corrective strategies and to assess the scope of the post-editing task. I think this step is critical BEFORE you start post-editing, as it will indicate how easy or difficult the effort is likely to be. Successful post-editing has to start with an assessment of the general MT engine quality. We are still in the early days of this, and the best examples I have seen of this kind of quality assessment are at Microsoft and Asia Online.
So ideally an iterative process chain would be something like:
Clean Training Data > Build Engine with BLEU/METEOR as TQM > Human Quality and Error Evaluation > Create New Correction Data > Improve & Refine the MT Engine > Release Production Engine > Post-Edit > Feed Corrected Data back to Engine, and finally make an LQA measurement on the delivered output to understand the quality in L10N terms.
There is much evidence that an a priori step to clean up/simplify the source content would generally improve overall quality and efficiency and help one get to higher quality sooner.
It is my opinion that we are just now learning how to do this well, and processes and tools are still quite primitive, though some in the RbMT community claim they have been doing it (badly?) for years. As the dialog between developers and localization professionals gets clearer, I think we will see tools emerge that provide the critical information LSPs need to understand the production value of any MT engine before they begin a post-editing project. I am optimistic and believe we will see many interesting advances in the near future.