The Challenge of Defining Translation Quality

The subject of "translation quality" has always been a challenging communication issue for the translation industry. It is particularly difficult to explain in a straightforward way to an industry outsider, or to a customer whose primary focus is building business momentum in international markets and who is not familiar with the localization industry's translation-quality-speak. Since every LSP claims to deliver the "best quality" or "high quality" translations, it is difficult for customers to tell one service provider from another on this basis. The quality claims of competing vendors thus essentially cancel each other out.

Comparing different human translations of the same source material is often an exercise in frustration, or in subjective preference at best. Every sentence can have multiple correct, accurate translations, so how do we determine which is the best translation?

The industry response to this problem is colored by a localization mindset that we see in approaches like the Dynamic Quality Framework (DQF). Many consider DQF too cumbersome and overly detailed to implement when translating modern fast-flowing content streams. While DQF can be useful in some limited localization use-case scenarios, the ability to rapidly and cost-effectively translate large volumes of DX-relevant content is increasingly a higher priority, and it needs a new and different view on monitoring quality. The linguistic quality of the translation does matter, but it has a lower priority than speed, cost, and digital agility.

Today, MT solutions are essential to the global enterprise mission. More and more dynamic content is translated for target customers without EVER going through any post-editing modification. The business value of a translation is often defined by its utility to the consumer in a digital journey, basic understandability, availability-on-demand, and the overall CX impact, rather than by linguistic perfection. Generally, usable accuracy delivered in time matters more than perfect grammar and fluency. The phrase "good enough" is used both disparagingly and as a positive attribute for translation output that is useful to a customer even in a less than "perfect" state.

Defining Machine Translation Output Quality

The MT development community has had less difficulty establishing a meaningful and widely useful comparative measurement for translation quality. Fortunately, they had assistance from NIST, which developed a methodology to compare the translation quality of multiple competing MT systems under carefully controlled evaluation protocols. NIST used a variant of BLEU scores and other measures of precision, recall, adequacy, and fluency to compare different MT systems rapidly in a standardized and transparent manner.
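
For readers who have not seen what a BLEU-style score actually computes, here is a minimal sketch. It is a simplified, single-reference, sentence-level version with no smoothing, and it is not the NIST variant; the function names and the whitespace tokenization are my own assumptions for illustration. It shows the two core ideas: clipped n-gram precision against a reference translation, and a brevity penalty for output that is too short.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(simple_bleu("the cat sat on the mat", reference))
    print(simple_bleu("the cat sat on a mat", reference))
```

Note that the score only measures overlap with one particular reference. A perfectly good translation that words things differently is penalized, which is one reason these scores are better for comparing systems than for judging absolute quality.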

The competitive evaluation approach works when multiple systems are compared under carefully monitored test protocols, but becomes less useful when an individual developer announces "huge improvements" in BLEU scores, as it is easy to make extravagant claims of improvement that are not easily validated. The independent evaluations that many rely on today provide comparisons where several systems may actually have trained on the test sets. This is the equivalent of giving a student the exam with the answers before a formal test. Other reference test set-based measurements like hLEPOR, METEOR, chrF, ROUGE, etc. are plagued by similar problems. These automated measurements are all useful, but they are unreliable indicators of absolute quality.

Best practices today suggest that a combination of multiple automated measures needs to be used together with human assessments of MT output to really get a handle on the relative quality of different MT systems. This becomes difficult as soon as we start asking basic questions like:

•    What are we testing on?
•    Are we sure that these MT systems have not trained on the test data? 
•    What kind of translators are evaluating the output?  
•    How do these evaluators determine what is better and worse when comparing different correct translations?
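
The second question above can be partly checked when the training data is available, which is often not the case for third-party systems. The following is a minimal sketch of such a check, assuming both the training corpus and the test set are available as plain lists of sentences. The function names and the normalization rules are hypothetical, and exact matching after normalization will miss paraphrased or lightly edited leaks.

```python
import re


def normalize(sentence):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", sentence.lower()).split())


def contamination_rate(train_sentences, test_sentences):
    """Fraction of test sentences that also appear in the training data."""
    if not test_sentences:
        return 0.0
    seen = {normalize(s) for s in train_sentences}
    leaked = sum(1 for s in test_sentences if normalize(s) in seen)
    return leaked / len(test_sentences)


if __name__ == "__main__":
    train = ["The cat sat.", "Hello, world!"]
    test = ["hello world", "A new sentence.", "the cat sat", "Unseen text"]
    print(contamination_rate(train, test))
```

Any rate meaningfully above zero means the scores on that test set say more about memorization than about translation ability.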

Conducting an accurate evaluation is difficult, and it is easy to draw wrong conclusions stemming from easy-to-make errors in the evaluation process. However, in the last few years, several MT developers have claimed to produce MT systems that have achieved human parity. This has been especially true with the advent of neural MT. These claims are useful for creating a publicity buzz among ignorant journalists, and fear amongst some translators, but usually disappoint anybody who looks more closely.

I have challenged the first of these broad human parity claims here: The Google Neural Machine Translation Marketing Deception. A few years later, Microsoft claimed to have reached human parity on a much narrower focus with their Chinese-to-English news system, but they were more restrained in their claim.

Many who are less skeptical than I am will assume that an MT engine claimed to have achieved human parity can produce translations of equal quality to those produced by a human translator. This can indeed be true on a small subset of carefully selected test material, but alas, we find that it is not usually true for a broader test.

We should understand that, at least among some MT experts, there is no deliberate intent to deceive, and that it is possible to do these evaluations with enough rigor and competence to make a reasonable claim of breakthrough progress, even if it falls short of the blessed state of human parity.

There are basically two definitions of human parity generally used to make a claim.

  • Definition 1. If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.
  • Definition 2. If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
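
To make Definition 2 concrete, here is a minimal sketch of one way such a comparison could be run. It assumes each test sentence has one quality score for the human translation and one for the machine translation, and it uses a paired sign-flip permutation test, which is my choice for illustration rather than the method used in any particular parity claim. The function name and parameters are hypothetical.

```python
import random


def paired_permutation_test(human_scores, mt_scores, trials=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired quality scores.

    Returns an approximate p-value for the null hypothesis that the
    human and machine translations receive the same scores on average.
    """
    if len(human_scores) != len(mt_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [h - m for h, m in zip(human_scores, mt_scores)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Under the null, each paired difference is equally likely to be + or -.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (trials + 1)


if __name__ == "__main__":
    human = [90] * 20  # made-up scores, for illustration only
    machine = [60] * 20
    print(paired_permutation_test(human, machine))
```

Keep in mind that a high p-value only means no difference was detected. With a small test set or noisy evaluators, a real difference can easily go undetected, so "no statistically significant difference" is a much weaker statement than "equivalent."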

Again, the devil is in the details, as the data and the people used in making the determination can vary quite dramatically. The most challenging issue is that human judges and evaluators are at the heart of the assessment process. These evaluators can vary in competence and expertise and can range from bilingual subject matter experts and professionals to low-cost crowdsourced workers who earn pennies per evaluation. The other problem is the messy, inconsistent, irrelevant, biased data underlying the assessments.

Objective, consistent human evaluation is necessary but difficult to do on the continuous, ongoing basis that is required. Additionally, if the underlying data used in an evaluation are fuzzy and unclear, we actually move toward obfuscation and confusion rather than clarity.

Useful Issues to Understand

While the parity claims can be true for a small carefully selected sample of evaluated sentences, it is difficult to extrapolate parity to a broader range of content because it is simply not possible to do machine translation output evaluation on an MT scale (millions of sentences). If we cannot define what a "good translation" is for a human, how is it possible to do this for a mindless, common-sense-free machine, where instruction and direction need to be explicit and clear?

Here are some questions that can help an observer understand the extent to which parity has been reached, or expose the deceptive marketing spin that may motivate the claims:

What was the test data used in the assessments?

MT systems are often tested and scored on news-domain data, which is the most plentiful. A broad range of different types of content should be included before making claims as extravagant as having reached human parity.

What is the quality of the reference test set?

Ideally, only expert human-created test sets should be used.

Who produced the human translations being used and compared?

The reference translations against which all judgments will be made should be "good" translations. Easily said but not so easily done.

How much data was used in the test to make the claim?

Often human assessments are done with as few as 50 sentences, and automated scoring is rarely done with more than 2,000 sentences. Thus, drawing conclusions on how any MT system will handle the next million sentences it will process is risky, and likely to be overly optimistic.
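
A quick calculation shows why small samples are risky. The sketch below computes a Wilson score confidence interval for a proportion, a standard formula, and applies it to a hypothetical result in which evaluators judge 90% of the sentences acceptable. The 90% figure and the function name are assumptions for illustration, and the interval also assumes the test sentences are a random sample of what the system will see in production, which is rarely true.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical: 90% of sentences judged acceptable at two sample sizes.
    for good, total in [(45, 50), (1800, 2000)]:
        low, high = wilson_interval(good, total)
        print(f"{good}/{total}: {low:.3f} to {high:.3f}")
```

With 50 sentences the plausible range runs from roughly 0.79 to 0.96, which is far too wide to support a parity claim. With 2,000 sentences it narrows to roughly 0.89 to 0.91, but only for content that resembles the test set.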

Who is making the judgments and what are their credentials?

It is usually cost-prohibitive to use expert professional translators to make the judgments, and thus evaluators are often acquired on crowdsourcing platforms, where evaluator and translator competence is not easily ascertained.

Doing an evaluation properly is a significant and expensive task, but MT developers have to do this continuously while building the system. This is why BLEU and other "imperfect" automated quality scores are so widely used. These scores provide the developers with continuous feedback in a fast and cost-efficient manner, especially if they are done with care and rigor.

Source: A Set of Recommendations for Assessing Human–Machine Parity in Language Translation

What Would Human Parity MT Look Like?

MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims of parity are based on laughably small samples of 100 or 200 sentences. It would be useful if MT developers held off on such claims until they can show that their systems do all of the following:
•    Produce output that is accurate, fluent, and truly looks like the work of a competent human for 90% or more of a large sample (>100,000 or even 1M sentences)
•    Catch obvious errors in the source and possibly even correct these before attempting to translate
•    Handle variations in the source with consistency and dexterity
•    Have at least some nominal contextual referential capability
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine?

Until we reach the point where all of the above is true, a claim that explicitly states the key parameters below would certainly create less drama and more clarity on the true extent of the accomplishment:
•    Test set size
•    Descriptions of source material  
•    Who judged, scored, and compared the translations

I am skeptical that we will achieve human parity by 2029, as some "singularity" enthusiasts have been saying for over a decade. The issues to be resolved are many and will likely take longer to work through than most of today's best guesses suggest, and solving them is unlikely to render humans obsolete.

"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not pixie dust that magically solves all your problems." -- Steven Pinker 

Recently, some in the singularity community have admitted that "language is hard," as you can see in this attempt to explain why AI has not mastered translation yet. Language does not have binary outcomes, and there are few clear-cut and defined rules.

Michael Housman, a faculty member of Singularity University, states the need for more labeled data, but notes that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”

Perhaps we need to finally admit that human parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?

Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Post-editing (MTPE) today is generally NOT a positive experience for most translators. But human interaction with the machine can be a significantly better and positive experience. Developing interactive and highly responsive MT systems that can assist, learn, and improve the humdrum elements of the translation task instantaneously might be a better research focus. This may be a more worthwhile goal than aiming for a God-like machine that can translate anything and everything at human parity.

Maybe we need more focus on improving the man-machine interaction and on finding more elegant and natural collaborative models. Getting to a point where the large majority of translators always want to use MT because it simply makes the work easier and more efficient is perhaps a better goal for future MT.