Can We Predict Post-Editing Effort?

Entire PhD research careers are dedicated to the question of how to measure machine translation quality, whether against a set of criteria or against a reference translation. These metrics can be subjective and complicated, and they are never perfect.

Measuring Post-Editing Effort

In the translation industry today, the simplest segment-level metric is: was this machine translation approved as-is or not?

For past projects, we’re lucky to have post-editing data: the source text, the machine translation and the final human translation. Note that the reference translation here is the final human translation, which is itself based on the machine translation in question rather than produced independently.

We can consider a machine translation “approved” if the human translator reviewed it and did not touch it, and “rejected” if the human translator post-edited it — even by one character. The key document-level or project-level metric follows from this.
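In code, the check is trivial. Here is a minimal sketch in Python; the strict character-for-character comparison follows the definition above (whether to normalize trivial whitespace first is a separate judgment call):

```python
def is_approved(machine_translation: str, post_edited_translation: str) -> bool:
    """A segment counts as approved only if the post-editor left it untouched."""
    return machine_translation == post_edited_translation

# Even a one-character edit counts as rejected:
is_approved("Click OK to continue.", "Click OK to continue.")   # True  -> approved
is_approved("Click OK to continue.", "Click OK to continue!")   # False -> rejected
```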

The Key Metric: Human-Approval Rate

What percent of machine-translated segments are approved as-is? At ModelFront, we see the human-approval rate varying between 10% and 90%, depending on the content type, language pair, machine translation setup and quality goals. For a typical stream of enterprise technical content or commerce content, the approval rate is 20% to 40%. It’s rising slowly but steadily as machine translation and machine translation setups improve.
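Building on the check above, the human-approval rate for a document or project is simply the share of untouched segments. A minimal sketch:

```python
def human_approval_rate(segments: list[tuple[str, str]]) -> float:
    """segments: (machine_translation, post_edited_translation) pairs.
    Returns the percentage of segments approved as-is."""
    approved = sum(1 for mt, pe in segments if mt == pe)
    return 100.0 * approved / len(segments)
```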

Whether the workflow is full post-editing or hybrid translation (safely auto-approving raw machine translation), the approval rate drives savings and quality.

Predicting Post-Editing Effort

To give or get a quote for a new project, translation companies, buyers and translators would love to know the approval rate for its documents before they are passed on to human translators for post-editing.

To this end, a system like ModelFront provides a risk score for each translation.

Translation Risk Prediction

Known in the research world as “machine translation quality estimation” or “QE”, risk prediction provides a line-level score from 0% to 100%. It’s based on the original text and the translated text — it does not require a reference translation.

João Graça, the CTO of Unbabel, has called quality estimation “the missing link for machine translation adoption.” The approach has been researched and developed inside Unbabel, KantanMT, Microsoft, eBay, Amazon, Facebook and VMware. It’s the topic of the QE shared task at WMT and of open libraries and models.

ModelFront made risk prediction technology available as an API. A ModelFront risk prediction score is like a bet on whether a translation will be human-approved or not.
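To illustrate the shape of such an integration, here is a hedged sketch in Python. The endpoint URL and the request and response fields are hypothetical placeholders, not ModelFront’s documented API; only the overall exchange (source and translation in, a 0% to 100% risk score per row out) follows the description above.

```python
import requests

# Hypothetical endpoint and schema, for illustration only;
# consult the provider's documentation for the real API.
API_URL = "https://api.example.com/v1/predict"

def predict_risks(pairs, source_lang, target_lang, api_key):
    """pairs: list of (source_text, translated_text) tuples.
    Returns one risk score per pair, from 0 (safe) to 100 (risky)."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "source_language": source_lang,
            "target_language": target_lang,
            "rows": [{"source": s, "translation": t} for s, t in pairs],
        },
        timeout=30,
    )
    response.raise_for_status()
    return [row["risk"] for row in response.json()["rows"]]
```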

Aggregating into a Document-Level Quality Score

We want to aggregate the line-level risk prediction scores into a document-level quality score.

At ModelFront, we use the length-weighted average — each line-level score is weighted by the length of the line, to better reflect the post-editing effort. And finally we flip it — 0% average risk is 100% average quality, and vice-versa.
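A minimal sketch of that aggregation, assuming line length is measured in characters (tokens would work just as well):

```python
def document_quality_score(lines: list[str], risks: list[float]) -> float:
    """lines: the translated segments; risks: line-level risk scores from 0 to 100.
    Returns a document-level quality score from 0 to 100."""
    weights = [len(line) for line in lines]        # weight each line by its length
    weighted_risk = sum(w * r for w, r in zip(weights, risks)) / sum(weights)
    return 100.0 - weighted_risk                   # flip average risk into quality
```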

Graphing the Document-Level Correlation

Can we use document-level quality scores to predict the actual document-level approval rates? Yes, but we can’t assume a 1:1 relationship.

We first need to graph the relationship between the document-level predicted quality scores and the document-level human-approval rates.
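Here is a sketch of that graph, using scipy and matplotlib. The axis orientation is an assumption, chosen so that documents above the fitted curve are the ones that turned out harder than predicted, as discussed below:

```python
import matplotlib.pyplot as plt
from scipy.stats import spearmanr

def plot_projects(approval_rates, quality_scores, training_sizes):
    """One dot per project: observed human-approval rate (x) vs.
    predicted document-level quality score (y), sized by training data volume."""
    rho, _ = spearmanr(quality_scores, approval_rates)
    plt.scatter(approval_rates, quality_scores,
                s=[n / 1000 for n in training_sizes],  # scale circle sizes to taste
                alpha=0.6)
    plt.xlabel("Human-approval rate (%)")
    plt.ylabel("Predicted quality score (%)")
    plt.title(f"Spearman correlation = {rho:.2f}")
    plt.show()
```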

A mid-size language service provider tested the accuracy using post-editing data for e-commerce content across 6 fashion retail client brands and 5 language pairs. Each project is a test set of the most recent 500 translations for the client and language pair. The circle size corresponds to the amount of training data for that client and language pair. Across the two dozen projects, the human-approval rate varies from 17% to 51%. The Spearman correlation between the predicted quality score and the human-approval rate is 0.47.

[Interactive scatter plot: predicted document-level quality score vs. human-approval rate, one dot per project, sized by the amount of training data.]
Now, given a predicted quality score for a new document, we can estimate roughly where it would fall on the spectrum of human-approval rates. For example, a predicted quality score below 30% is almost always bad news.

We can also use shape, color or size to visualize other variables like content type, language pair, document size or training dataset size that have an impact on the predicted quality score or human-approval rate.

Zooming in on the Outliers

At the document level, there are documents that are far below or far above the curve. Documents above the curve turned out to be harder to translate than predicted. Documents below the curve turned out to be easier than predicted. We can click into each document to understand what’s going on at the line level.
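As a rough way to flag those outliers automatically, we can fit a simple trend between approval rate and predicted quality and look at the largest residuals. The linear fit and the threshold below are illustrative simplifications, not a prescribed method:

```python
import numpy as np

def flag_outliers(approval_rates, quality_scores, threshold=15.0):
    """Returns the indices of documents whose predicted quality deviates from
    the fitted trend by more than the threshold (in percentage points).
    A positive residual (above the curve) means harder to translate than predicted."""
    x = np.asarray(approval_rates, dtype=float)
    y = np.asarray(quality_scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)     # simple linear fit
    residuals = y - (slope * x + intercept)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold]
```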

False Negative

The human post-edited the translation, but ModelFront predicted it was low-risk. In some cases, this may point to a needed fix in the risk prediction system. In other cases, the human translators made unnecessary stylistic edits or inconsistent edits.

False Positive

The human approved the translation, but ModelFront predicted it was high-risk. We accept some false positives as the price of safety. For example, a very short segment may be ambiguous. We don’t want our risk prediction system to approve it, even if it happens to be a good translation in this document. In other cases, the human translator was just asleep at the wheel.

In machine learning, the extreme false positives and extreme false negatives are just as often due to human error or process issues as to machine prediction error. In any case, it’s worth taking a look at them. For example, in the test sets above, the outlier is caused by Client D’s specific instruction to drop the brand name from item titles in the German translations. The custom machine translation system was trained to implement that, but the risk prediction system was not.

Ultimately, the correlation between the machine predictions and human judgement is significant, and we’re also interested in the outliers they both agree on. What is going on in the project with the very low predicted quality score and very low human-approval rate?

Conclusions

The deep learning revolution has grown the arsenal of tools for making predictions on text. Now early-mover enterprises and translation companies are integrating production-ready tools into their own platforms, understanding what the predictions mean in practice and adjusting their project terms and processes accordingly.

The first step is to save the data. Post-editing data — the source text, the machine translation and the final human translation — is the key to tracking and predicting post-editing metrics.

The next step is to define line-level, document-level and project-level metrics. We’ve found a simple binary classification — approved vs. post-edited — to be intuitive and robust for determining post-editing effort. To make predictions and find the outliers, we can graph the correlations across the data.

To improve prediction accuracy and savings, we can zoom in on the outliers — both documents and actual lines of text — to see what is driving the scores.

More and more machine translations are getting a risk prediction, the human-approval rate is increasing and the terms for buyers and translators reflect that rate more fairly.

For more resources on post-editing and quality estimation, please visit the GALA Knowledge Center.