Comparing Errors: Neural MT vs. Traditional Phrase-based and Rule-based MT

By: Aljoscha Burchardt (Lab Manager)

07 June 2017

Machine translation systems come in two different flavors. The first class of systems – rule-based machine translation (“RBMT”) – is based on hand-written rules and vocabulary lists. These systems emerged in the 1940s, became somewhat popular in the 1980s, and are still in use in certain applications.

The second class of systems – statistical machine translation (“SMT”) – has no access to explicit linguistic knowledge. It works with statistical probabilities derived from parallel corpora of previously translated text.

In the last 10 years, the open-source system Moses has become very popular. A traditional phrase-based SMT system, it implements a pipeline architecture that comprises several separate processing steps (e.g., reordering, phrase translation, “smoothing”). Starting in 2016, a new class of statistical systems based on so-called “deep learning” in neural networks (NMT) has gained a lot of attention. These systems work in an end-to-end fashion, i.e., they go from input to output in one step.

The question we want to answer with our evaluation reported here is: What can we expect from these systems? More concretely: What errors do they make?

[Also, check out the free GALA webinar by Aljoscha on this same topic.]

Excursus: Current SMT Development and Quality Evaluation

For quite some time, research in SMT had locked itself in an academic ivory tower, detached from the knowledge and needs of the language industry. Most research focused on “gisting” (information-only translation) using readily available news data, and it sought to improve average scores. Improvement has been measured mostly by automatic comparison of MT output with human reference translations using shallow, surface-based measures like BLEU. This naïve approach to evaluating MT quality has been criticized frequently.

For the language industry, the automatic scores are practically useless for several reasons. One is that they cannot be used to compare different systems; they can only track changes within a single system. A related problem is that the numbers have no absolute interpretation, e.g., one system with a BLEU score of 25 can produce the same translation quality as another with a BLEU score of 35. Finally, the automatic scores give no indication of the nature of the issues in the translations.
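To see why surface-based scores behave this way, here is a minimal, self-contained sketch of sentence-level BLEU (with crude smoothing; not the exact reference implementation). An adequate paraphrase scores far below a verbatim match because BLEU only rewards n-gram overlap with the reference:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty. Zero counts get a small epsilon
    so short sentences do not collapse to 0 (crude smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the court rejected the appeal".split()
print(bleu(ref, ref))                                             # 1.0
print(bleu("the appeal was rejected by the court".split(), ref))  # far lower
```

The second candidate preserves the meaning, yet its higher-order n-grams barely overlap with the reference, so its score collapses; this is the kind of blindness to adequacy that the criticism above refers to.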

Towards a Human-Informed HQMT Development Cycle

Starting in the EC-funded project QTLaunchPad and continuing in the ongoing project QT21, we have promoted and implemented a human-informed paradigm for High-Quality MT in cooperation with GALA and several professional translators and LSPs.

The main idea is very simple. It is illustrated in Figure 1: at certain intervals, the SMT developers perform testing rounds together with language experts, and the results are fed back into the well-known “in vitro” development cycle.

How Language Professionals Can Provide Informed Feedback

There are several ways in which professional feedback can be used in MT R&D.

In this article, we want to quickly mention analytic error annotation with MQM and then focus on the use of test suites for systematically checking the coverage of linguistic phenomena in texts. Other methods that are actively being researched include the use of post-edits to improve system performance.

The Multidimensional Quality Metrics (MQM) framework is one we have developed as a unified way of (manually) assessing translation quality. The MQM error hierarchy makes it possible to provide a detailed breakdown of the error types found in a given corpus.
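As a rough illustration of how such annotations can be used downstream, the sketch below counts toy span-level annotations by MQM dimension. The two branches used here (Accuracy > Mistranslation, Fluency > Grammar) are actual MQM dimensions, but the data layout is our own simplification, not an official MQM serialization:

```python
from collections import Counter

# Toy span-level annotations: (segment id, character span, issue path).
# Invented examples for illustration only.
annotations = [
    {"segment": 1, "span": (4, 9),   "issue": ("Accuracy", "Mistranslation")},
    {"segment": 1, "span": (15, 21), "issue": ("Fluency", "Grammar")},
    {"segment": 2, "span": (0, 3),   "issue": ("Fluency", "Grammar")},
]

# Roll fine-grained issue types up to their top-level MQM dimension.
by_dimension = Counter(a["issue"][0] for a in annotations)
by_type = Counter(a["issue"] for a in annotations)
print(by_dimension)  # Counter({'Fluency': 2, 'Accuracy': 1})
```

Because issues are stored as paths through the hierarchy, the same records support both coarse reports (per dimension) and the detailed breakdowns mentioned above (per error type).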

How Can We Systematically Find Errors?

While MQM allows implementers to spot errors in the translations (which is vital for the translation industry), in the context of R&D we also want to learn what triggers those errors. This means that we have to shift our focus to the source segments.

Our idea was to use test suites: selected sets of source–target pairs that reflect interesting or difficult cases such as multi-word expressions (MWEs), long-distance dependencies, negation, and terminology. In contrast to a “real-life” corpus with reference translations, the input in a test suite can include segments that have been edited or even made up to isolate and illustrate issues.

One advantage is that testing can be local/partial. For example, when checking lexical ambiguity (German “Gericht”; English “court” vs. “dish”), evaluators only need to see whether the expected word appears; they do not need to check the rest of the sentence. When checking separable-prefix verbs (English “picked up …”; German “hob … auf”), they can check whether the two expected parts (or an equivalent formulation) are present.
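Such local checks lend themselves to automation. Below is a small sketch of this idea in Python; the suite entries, regex checks, and the toy MT lookup are invented for illustration, not taken from our actual test suite:

```python
import re

# Hypothetical test-suite entries: each pairs a tricky German source with
# a regex that passes iff the expected target-side evidence is present.
# Checks are local: only the pattern matters, not the rest of the sentence.
suite = [
    # lexical ambiguity: "Gericht" must become "court" here, not "dish"
    {"source": "Das Gericht hat die Klage abgewiesen.",
     "check": re.compile(r"\bcourt\b", re.IGNORECASE)},
    # separable-prefix verb: "hob ... auf" -> both parts of "picked ... up"
    {"source": "Sie hob den Brief auf.",
     "check": re.compile(r"\bpicked\b.*\bup\b", re.IGNORECASE)},
]

def evaluate(translate, suite):
    """Return a pass/fail judgment per test-suite segment."""
    return [bool(item["check"].search(translate(item["source"])))
            for item in suite]

# Toy stand-in for an MT system (a fixed lookup, for illustration only):
fake_mt = {
    "Das Gericht hat die Klage abgewiesen.": "The court dismissed the suit.",
    "Sie hob den Brief auf.": "She picked the letter up.",
}
print(evaluate(fake_mt.get, suite))  # [True, True]
```

In practice such regexes only approximate a human judgment (an equivalent formulation like “took … up” would need its own alternative in the pattern), which is why our actual evaluation keeps language experts in the loop.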

When comparing different systems with a test suite, the goal is to gain quantitative and qualitative insights into their strengths and weaknesses (e.g., system X gets all 20 imperatives right, but only 50% of the negations).
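Aggregating the per-segment pass/fail judgments into per-phenomenon accuracies per system is straightforward; the records below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical per-segment judgments: (system, phenomenon, passed).
results = [
    ("X", "imperative", True),  ("X", "imperative", True),
    ("X", "negation", True),    ("X", "negation", False),
    ("Y", "imperative", False), ("Y", "imperative", True),
    ("Y", "negation", True),    ("Y", "negation", True),
]

def accuracy_table(results):
    """Map (system, phenomenon) to the fraction of segments passed."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for system, phenomenon, passed in results:
        cell = counts[(system, phenomenon)]
        cell[0] += passed
        cell[1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}

table = accuracy_table(results)
print(table)  # e.g. system X: all imperatives right, half the negations
```

A table like this is exactly the kind of numerical comparison referred to below: it does not crown a “winning system”, but shows where each system is strong or weak per phenomenon.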

In the QT21 project, we have built a test suite comprising about 5,000 manually constructed segments for the language pair German–English. In our recent GALA webinar we reported a first study, to be published at EAMT 2017[1], that included several different online and research systems. Table 1 shows the numerical comparison.

For reasons explained in the paper, this experiment was based on a smaller, unbalanced version of our test suite (ca. 800 segments). Note that the distribution of phenomena is not representative of real corpora and that the goal is not to find the “winning system”, but to analyze strengths and weaknesses in terms of linguistic phenomena. For example, consider the following:

Here one can see that the older phrase-based systems cannot deal with this inserted relative clause. When we find consistent performance across multiple examples, we can discover avenues for improving systems, or constructions to avoid in source sentences when we know that particular systems will be used. More examples are discussed in the webinar and in the paper.

One of the conclusions we have drawn is that there has been a tremendous quality improvement from phrase-based to neural MT, which is now on par with the best RBMT systems. However, we also find cases where the high fluency of NMT translations “masks” errors, making them more difficult for proofreaders to spot.


The current evaluation workflow, based on reference translations (and scores like BLEU), provides little insight into MT quality and the nature of errors.

Alternatives are being actively researched:

  • Learning from post-edits
  • Target analytics: Error annotation with MQM
  • Source-driven testing: Test suites
  • Quality estimation, better automatic metrics, etc.

Test suites can also be adapted to the technical domain. The interested reader is referred to the publications below:

  • Eleftherios Avramidis, Aljoscha Burchardt, Vivien Macketanz, and Ankit Srivastava. 2016. DFKI’s System for WMT 16 IT-domain Task, including Analysis of Systematic Errors. In Proceedings of the First Conference on Machine Translation, Association for Computational Linguistics (ACL 2016), Berlin, Germany. pp. 415-422.
  • Anne Beyer, Vivien Macketanz, Aljoscha Burchardt, and Philip Williams. 2017. Can Out-of-the-box NMT Beat a Domain-trained Moses on Technical Data? In Proceedings of EAMT 2017, Prague, Czech Republic.

The latter paper describes a comparison we performed between a domain-adapted Moses system used by an LSP and an unadapted NMT system.

A striking conclusion is that the NMT system outperforms Moses in almost all respects except terminology (the NMT system was not domain-trained) and tag handling (the NMT system did no explicit tag handling). From this study, we see no reason to stick with the “traditional” PBSMT technology. Post-editing is definitely still needed, also for collecting data that can help improve the engines. Only technology that is in use can get better.

Want to learn more? Check out the GALA webinar on this topic.


The work presented above is joint work with Vivien Macketanz, Kim Harris, Silvia Hansen-Schirra, Hans Uszkoreit, and many others. Part of this work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 645452 (“QT21”).

Aljoscha is lab manager at the Language Technology Lab of the German Research Center for Artificial Intelligence (DFKI GmbH). His interests include the evaluation of (machine) translation quality and the inclusion of language professionals in the MT R&D workflow. Burchardt is co-developer of the MQM framework for measuring translation quality. He has a background in semantic Language Technology.


[1] Aljoscha Burchardt, Vivien Macketanz, Jon Dehdari, Georg Heigold, Jan-Thorsten Peter and Philip Williams. A Linguistic Evaluation of Rule-based, Phrase-based, and Neural MT Engines. In Proceedings of EAMT 2017, Prague, Czech Republic.