Calling for Transparency: Automated Language Quality Metrics in the Translation Industry
The debate about translation is probably as old as writing itself – in many languages, competing translations of holy texts and major literary works were accompanied by fierce disputes over translation quality. We can safely conclude that measuring translation quality is a historic demand in our industry.
This demand has increased dramatically in the last couple of years as translation volumes exploded and turnaround times grew ever shorter. For this very reason, automated language quality metrics and methods play an increasingly important role in the translation industry. We analyze translation volumes to calculate costs and effort. We also analyze the performance of our MT engines. Although we use the results of these analyses on a daily basis and, by now, they are considered fundamentals of the industry, we know surprisingly little about how they operate.
In this post, I will give a short overview of the different algorithms, outline the difficulties and problems these methods entail, and argue that we need more transparency and industry standards for these metrics than we have today.
It’s All About Similarity
Fuzzy match and BLEU score are well-known terms to those working in the language industry. Fuzzy matching is the process in which the content to be translated is analyzed against a database (a translation memory) to identify identical and similar texts from previously translated projects. BLEU is a metric originally created to evaluate the quality of machine translation by comparing the MT (machine translated) segments to human reference translations. What do these two methods have in common? A lot.
Both methods are used to measure the similarity between two strings. To be more precise: fuzzy match is an elastic concept, and the concrete method behind a fuzzy match algorithm is in many cases the BLEU score itself. The main difference is perception: people accept fuzzy matching because they are used to using it, while they dislike the BLEU score because, in their minds, it belongs to machine translation.
We can say that the BLEU score is basically one method – used typically in the world of machine translation and in fuzzy match algorithms – to judge how close two segments are to each other. But there are a couple of other methods as well. Here is an incomplete list:
- Edit distance (Levenshtein distance)
- TER (Translation Edit Rate)
- METEOR
- chrF (character n-gram F-score)
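Edit distance is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not any particular tool's actual algorithm, and the helper names are my own: it computes a word-level Levenshtein distance and converts it into a fuzzy-match-style percentage.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance: minimum number of word
    insertions, deletions, and substitutions to turn a into b."""
    wa, wb = a.split(), b.split()
    prev = list(range(len(wb) + 1))
    for i, x in enumerate(wa, 1):
        curr = [i]
        for j, y in enumerate(wb, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,        # delete a word
                            curr[-1] + 1,       # insert a word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]

def match_rate(candidate, reference):
    """Turn the distance into a 0-100 fuzzy-match-style percentage."""
    dist = word_edit_distance(candidate, reference)
    longest = max(len(candidate.split()), len(reference.split()))
    return round(100 * (1 - dist / longest))

# One substituted word out of six gives a match rate of 83
print(match_rate("The printer is out of ink.",
                 "The printer is out of paper."))  # → 83
```

Character-level variants and different scaling formulas yield different percentages for the very same pair of segments – which is exactly where the tools start to diverge.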
But why would people develop so many seemingly competing methods for such a small portion of our industry? Are these people out of their minds, or are they just extremely bored? Well, there is a good reason for all this fuss – let us take a closer look at the problems all these methods must somehow tackle!
Drowning by Numbers
Let’s take the following simple sentences, where the first one is the reference (see also the related post on GALA Connect by Gergely Horváth).
- I want to be a doctor.
- I want to be a translator.
- I want to be a velociraptor.
Now, imagine that you must provide a number between 0 and 100 that tells us how close each candidate is to the reference. You say 80%? Why not 81%? Or why not 75% or 95%? Do we have any standard for it? This is where the problem starts: we have no reference or gold standard to rely on.
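To make the arbitrariness concrete, here is a sketch of two equally defensible scoring formulas using only Python's standard library (the function names are mine, and no CAT tool necessarily works this way): a character-level similarity via `difflib` and a word-overlap (Jaccard) score.

```python
import difflib

def char_similarity(a, b):
    """Character-level similarity via difflib's SequenceMatcher, 0-100."""
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

def word_overlap(a, b):
    """Jaccard overlap of the two word sets, 0-100. Ignores order."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return round(100 * len(wa & wb) / len(wa | wb))

ref = "I want to be a doctor."
for cand in ("I want to be a translator.", "I want to be a velociraptor."):
    print(f"{cand!r}: chars {char_similarity(ref, cand)}%, "
          f"words {word_overlap(ref, cand)}%")
```

Both formulas are reasonable, yet they disagree with each other – and the character-level score even ranks "velociraptor" above "translator", simply because the longer word happens to share more letters with "doctor".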
To continue our investigation, the figures our CAT tools have been providing all along seem to be the obvious place to look. So let’s check what they tell us about the problem at hand.
Below you can see the fuzzy match indexes that four different CAT tools produce for these simple sentences.
Another example. This is the original segment:
Please let me go.
And these are the two sentences to which we are comparing it:
Let me go please.
Me go please let.
If you look at the words only, both are nearly perfect matches. If you also consider word order, then some penalty must be applied to both (say, a fuzzy match index of 80%). And if you look at language quality, then you should assign two very different match rates. Obviously, we face some basic questions: what exactly should the penalty for a word order change be?
In some languages word order is strictly fixed; in others you can play around with it quite freely. A slightly changed word order may result in an acceptable sentence or in an unacceptable one. In languages like Arabic, a small change within the segment can completely change its meaning. This means that besides the technical factor there is a language factor to consider when we make decisions about similarity or quality.
So why do the CAT tool results differ? Each CAT tool uses a different algorithm, or combination of algorithms, to measure the difference between segments, and the different algorithms calculate things in slightly different ways. Some calculate on the basis of words, some use character-based methods, while others base their calculations on character or word groups (so-called n-grams). Some algorithms penalize word order changes; some don’t. This is one of the reasons why different CAT tool analyses produce different results.
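The word-order question above can be sketched directly. The snippet below uses two hypothetical formulas of my own (assuming a simple lowercase word tokenizer that strips punctuation): a bag-of-words score that ignores order entirely, and a sequence-based score that penalizes it.

```python
import difflib
import re

def tokens(s):
    """Lowercase word tokenizer; punctuation is stripped."""
    return re.findall(r"\w+", s.lower())

def bag_of_words(a, b):
    """Jaccard overlap of the word sets, 0-100: word order is irrelevant."""
    ta, tb = set(tokens(a)), set(tokens(b))
    return round(100 * len(ta & tb) / len(ta | tb))

def order_sensitive(a, b):
    """Sequence similarity over the token lists, 0-100:
    shuffled words no longer line up, so the score drops."""
    sm = difflib.SequenceMatcher(None, tokens(a), tokens(b))
    return round(100 * sm.ratio())

ref = "Please let me go."
for cand in ("Let me go please.", "Me go please let."):
    print(f"{cand!r}: bag {bag_of_words(ref, cand)}%, "
          f"ordered {order_sensitive(ref, cand)}%")
```

The bag-of-words score calls both reorderings perfect matches, while the sequence-based score punishes the polite reordering mildly and the scrambled one heavily – the penalty depends entirely on which formula the tool happens to use.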
What is a Word?
Just to make things even more complicated, there is yet another startling problem. Let’s look at the results of the analyses by two popular CAT tools. We ran both on the very same document, with an empty translation memory and with the internal consistency check enabled (checking similarity within the document, not just repetitions).
Isn’t it surprising? Not even the number of words and characters is the same, let alone the number of repetitions and the internal consistency (= internal fuzzy match analysis). The reason for the difference is simple: different tools have different definitions of a word and use different default segmentation rules. And, of course, there is the problem of the different fuzzy match methods explained above.
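The effect of differing word definitions is easy to demonstrate. The sketch below (an illustration only; real tools use far more elaborate segmentation rules) counts the same segment with three defensible tokenizers and gets three different word counts.

```python
import re

text = "It's a state-of-the-art CAT-tool (v2.0)."

# Three defensible definitions of "word":
whitespace = text.split()                       # split on spaces only
alnum = re.findall(r"\w+", text)                # runs of letters/digits
with_marks = re.findall(r"[\w'-]+", text)       # keep ' and - inside words

print(len(whitespace), len(alnum), len(with_marks))  # → 5 11 6
```

Multiply such differences across a whole document and the totals – and every figure derived from them, such as repetition counts – drift apart between tools.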
So how does this problem take shape from the perspective of machine translation? The most common way to measure MT quality is to compare the MT output with a human translation. As mentioned above, the algorithms are basically the same as the ones used for fuzzy match analyses, which means the problems are the same, too. Different methods deliver different values, and how these figures relate to real language quality can vary greatly. While MT scientists are typically happy about an increased BLEU score for an engine, translators are not always as enthusiastic about the results, even if the score has gone up.
In the MT world, there is another method for automated quality measurement called Quality Estimation (QE). The difference compared to the BLEU score and similar metrics is that QE tries to measure quality without any reference translation – which is quite a challenge.
The QE model is created in a machine learning process based on different linguistic features. These can be very simple, like the length of a segment, or rather complex, like the perplexity of a language model. The advantage of QE is that these figures are available already at project start, as no reference translation is required. One problem with QE is that the reliability of the results can vary greatly, even within a single project. Another issue is that even though QE values may look like a fuzzy match analysis, the technology behind them is very different and the results are hardly comparable.
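As an illustration of the kind of features QE works with, here is a sketch of a reference-free feature extractor. The feature set is invented and deliberately simplified; real QE systems combine many more features and learn their weights from training data.

```python
def qe_features(source, mt_output):
    """Toy reference-free features for a source/MT segment pair.
    A real QE model would feed such features into a trained model."""
    src, tgt = source.split(), mt_output.split()
    return {
        "src_length": len(src),                              # source word count
        "tgt_length": len(tgt),                              # target word count
        "length_ratio": len(tgt) / max(len(src), 1),         # suspicious if far from ~1
        "type_token_ratio": len(set(tgt)) / max(len(tgt), 1),  # lexical variety
        "avg_word_len": sum(map(len, tgt)) / max(len(tgt), 1),
    }

print(qe_features("Bitte lassen Sie mich gehen.", "Please let me go."))
```

Note that nothing here looks at a reference translation – the features describe only the source and the MT output themselves, which is what makes QE figures available before any human translation exists.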
The Bottom Line
There is a lot of room for improvement regarding the reliability, transparency, and standardization of automated language quality methods. It is nearly impossible to convert language quality into a single figure that satisfies everybody. On the other hand, it is obvious that we have to deal with the issue – and, even though they are not perfect, we need automated methods in our daily work.
The following action items could improve the situation for every stakeholder of the translation industry (translation providers, translation buyers and tool providers):
- Creating Standards: We need gold standards which can be used as a reference. Diversity among the tools is not a problem, but we should be able to compare the results of our tools against common standards.
- More Simplicity: The level of detail in the match rate tables and scores is unnecessarily complicated. We could simplify quality assignment to three basic categories: good, usable, and useless.
I would encourage all GALA members to use the opportunities provided by this great industry organization to start these discussions and perhaps take the first steps in a direction that brings us closer to these goals.
See you in March in Amsterdam!