Which Approach to Human Translation Quality Evaluation and Why?
By: Stephen Doherty and Federico Gaspari (Centre for Next Generation Localisation)
05 December 2013
Under mounting pressure to deliver high quality at a fast and consistent pace while remaining competitive in the market, more and more translation and localization service providers face the challenge of managing and evaluating translation quality effectively.
Industry data reveal a widely felt need to enhance current translation quality evaluation methodologies, but different production workflows require specific solutions and tools (see the recent survey). However, given the wide variety of approaches and metrics that are available, it is not easy to make the right choices in this area. This article offers an overview of proven industry-leading options for human translation quality evaluation, with a view to helping translation and localization service providers find the right solutions for their needs.
A rather traditional and still very common scenario involves manual evaluation, whereby a reviewer scores a translation by checking it for errors. The concept of “error” is more complex to define than it might appear at first glance, as several factors must be taken into account. One is the severity of the error, which is to some extent subjective and depends on its perceived or anticipated impact on the target-text users. In addition, errors that are severe in abstract terms (e.g. a missing negation or a typo in a single word, resulting in a seriously misleading meaning in an instruction manual) might be relatively easy to fix, say by adding a few letters or overwriting a single wrong character. On the other hand, ostensibly less dramatic issues in terms of meaning and content might be very time-consuming to rectify (e.g. the use of the informal mode of address to the reader when the formal variant would be more appropriate, with the required changes involving honorifics, pronouns, verb forms, etc.). To take these rather subtle factors into account, human evaluation methodologies normally assign errors “weights” according to their type and severity, i.e. a numerical multiplier corresponding to how important the error is.
There are two main categories of human translation quality evaluation approaches, namely error rate models and rubric models. Error rate models are more common, and are based on error counting, corresponding to a subtractive model of translation quality; in this scenario, a score (usually expressed as a percentage) indicates the quality of the translation: assuming that a “perfect” translation would be scored 100%, errors lead to point deductions (negative scores are possible, in principle). Minimum quality thresholds can be set for specific translation projects: any quality score falling under that level would lead to the translation being rejected on the basis of its poor, i.e. unacceptable, quality. Popular examples of translation quality models adopting the error rate approach are the SDL TMS Classic Model, the SAE J2450 Model and the LISA QA Model (although the latter is no longer officially available, its structure still serves as a basis for a number of tools).
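The subtractive logic of an error rate model can be sketched in a few lines of code. The error categories, severity weights, word-based normalisation, and pass threshold below are illustrative assumptions for the sake of the example, not values taken from SAE J2450, the LISA QA Model, or any other published metric.

```python
# Sketch of a subtractive (error rate) quality score: start from 100% and
# deduct weighted penalties for each error the reviewer flags.
# All weights and the threshold are hypothetical.

WEIGHTS = {
    ("terminology", "minor"): 1.0,
    ("terminology", "serious"): 2.0,
    ("meaning", "minor"): 2.0,
    ("meaning", "serious"): 4.0,
    ("spelling", "minor"): 0.5,
    ("spelling", "serious"): 1.0,
}

def error_rate_score(errors, word_count, pass_threshold=95.0):
    """Deduct weighted penalties, normalised per 100 words of the sample.

    `errors` is a list of (category, severity) tuples found by the reviewer.
    Note that a negative score is possible in principle.
    """
    penalty = sum(WEIGHTS[e] for e in errors)
    score = 100.0 - (penalty / word_count) * 100.0
    return score, score >= pass_threshold

# A 250-word sample with one serious meaning error and one minor typo.
score, passed = error_rate_score(
    [("meaning", "serious"), ("spelling", "minor")], word_count=250
)
```

A real metric would define its categories and weights much more carefully, but the mechanism shown (weighted deductions against a ceiling, compared with a minimum threshold) is the essence of the subtractive approach.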
The other broad category of metrics and tools for human translation quality evaluation is represented by rubric models. These have an additive nature: starting from zero, points are incrementally added whenever the translation meets a requirement specified in the metric, so the overall quality score is achieved by adding up positive features. At the moment, rubric models remain relatively uncommon in commercial translation and localization, being largely confined to academia; however, they might become more popular in the future, and hybrid models combining the error rate and rubric approaches are possible in theory, though not yet widespread. It is quite common in the industry to apply these translation quality models to samples of the text being evaluated, to mitigate the cost and time entailed by having to rely on human evaluators. Clearly, this involves some level of approximation, in spite of the claimed objectivity and reliability of these approaches.
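The additive logic of a rubric model can be sketched in the same way. The criteria and point values below are illustrative assumptions, not taken from any published rubric.

```python
# Sketch of an additive (rubric) quality score: starting from zero, points
# are added for each criterion the translation satisfies.
# The rubric itself is hypothetical.

RUBRIC = {
    "accuracy": 40,      # meaning of the source is fully preserved
    "fluency": 30,       # target text reads naturally
    "terminology": 20,   # approved glossary terms used consistently
    "style": 10,         # register and tone match the brief
}

def rubric_score(criteria_met):
    """Sum the points for every rubric criterion the translation meets."""
    return sum(points for name, points in RUBRIC.items() if name in criteria_met)

# A translation judged accurate, fluent, and terminologically consistent,
# but falling short on style.
score = rubric_score({"accuracy", "fluency", "terminology"})
```

Contrasting the two sketches makes the structural difference clear: the error rate model subtracts from a perfect ceiling, while the rubric model builds the score up from zero; a hybrid model would simply combine both movements.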
Although there is no space to discuss them in detail here, it is worth mentioning the formal criteria used by national professional translators’ associations to evaluate certification exams, as they might be of some interest to translation and localization industry players. These specifications are aimed at assessing, under exam conditions, the performance of individuals aspiring to become certified translators, so there is some overlap with the considerations that apply in the business world. One such example is provided by the American Translators Association (ATA), which has recently issued its own Framework for Standardized Error Marking: Explanation of Error Categories. Up-to-date and accessible discussions of the issues involved in this and similar evaluation schemes used for certification and accreditation by reputable professional bodies for translators and interpreters around the world are available here.
To sum up the main pros and cons of human approaches to translation quality evaluation, they all rely on assessors detecting discrepancies in meaning and/or form between the source and target text, and judging their impact on the translation product. While this should guarantee accuracy and reliability (provided that the evaluators are well trained and receive clear evaluation guidelines), it also entails relatively high costs and time-consuming processes; in addition, judgements on the nature and severity, and therefore on the “weight”, of errors can be to some extent subjective. One common remark in this respect is that it is often not easy for translation and localization service providers to explain to their clients the implications and extra costs associated with having robust quality assurance policies in place, even though quality is universally seen as paramount.
A comparison of the most widely used human translation quality evaluation metrics and models reveals that many of the error categories considered actually correspond to general language errors (e.g. misspellings, grammatical errors, etc.) in the target text that do not require bilingual knowledge; hence one might investigate the possibility of optimising the evaluation process by first having the translation checked by a monolingual assessor (looking exclusively at problems in the target language), with other issues linked to the source text being referred to an additional evaluator with bilingual expertise. While this looks like an elegant solution in principle, the overheads of managing such a two-stage process effectively would probably make it infeasible in practice.
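The two-stage routing idea described above can be sketched as a simple split of flagged issue categories between the two reviewer profiles. Which categories count as monolingual or bilingual is an illustrative assumption here; a real workflow would derive these lists from the metric in use.

```python
# Sketch of two-stage evaluation routing: target-language-only issues go to
# a monolingual reviewer first; source-dependent issues are escalated to a
# bilingual evaluator. The category lists are hypothetical.

MONOLINGUAL = {"spelling", "grammar", "punctuation", "fluency"}
BILINGUAL = {"mistranslation", "omission", "terminology"}

def route_issues(issues):
    """Split flagged issue categories between the two reviewer profiles."""
    mono = [i for i in issues if i in MONOLINGUAL]
    bi = [i for i in issues if i in BILINGUAL]
    return mono, bi

mono, bi = route_issues(["spelling", "omission", "grammar"])
```

The code is trivial; the practical difficulty the article points to lies in coordinating the hand-off between the two reviewers, not in the classification itself.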
In addressing the shortcomings of the above metrics, QTLaunchPad has been working with a wide range of players, including GALA, translation trainers, user groups, public and corporate users, freelancers, and tech doc creators, to develop the Multidimensional Quality Metrics (MQM) framework. This approach offers users adaptability and flexibility across different project types and workflows. MQM applies to both source and target texts, promoting full integration of the document-production lifecycle and greater fairness to all stakeholders, from content creators to translators and post-editors. It is suitable for human and machine translation workflows, and combinations thereof, thus allowing greater comparability across domains, projects, and even language pairs.
MQM provides a free, open, and flexible platform that supports requirements specification, bidding, translation quality assessment/assurance, and other business processes in one uniform model, as well as in-line mark-up with issue resolution and auditing trails. Its standardization builds on existing ISO specifications and the popular models described above, and it remains open to and compatible with these existing models (e.g. full compatibility with legacy systems such as the LISA QA Model and SAE J2450), which allows existing workflows to be kept while still taking advantage of MQM’s features and extensions.
Finally, a review of the most popular human translation quality evaluation models and metrics used in the industry shows that, perhaps surprisingly, they have very few error categories in common, the notable exception being the correct and consistent use of terminology, which is a standard feature. This in turn raises problems related to how errors should be effectively corrected, once they have been identified, to ensure an incremental and cost-effective quality improvement of the final translation that is delivered to the client.
A recent one-hour QTLaunchPad webinar provides a more comprehensive discussion of the above and is available for free via GALA. Further information about the Multidimensional Quality Metrics framework and all of the above metrics can be found in QTLaunchPad’s training section, which also includes similar training information and materials for automatic and semi-automatic evaluation approaches, links to QA and quality estimation tools, and related translation and machine translation topics.
Stephen Doherty is a postdoctoral researcher and lecturer at the Centre for Next Generation Localisation in Dublin City University. He has a PhD in translation technology and, coming from a background in technical translation, lectures undergraduate and postgraduate students in this field, in addition to providing industry-based translator training. He is currently working on the European Commission-funded QTLaunchPad project, which tackles the barriers to high-quality human and machine translation.
Federico Gaspari has a background in translation studies and holds a PhD in machine translation from the University of Manchester. He has more than 10 years’ experience as a university lecturer in specialised translation and translation technology (Universities of Manchester, Salford, Bologna at Forlì and Macerata). He is a postdoctoral researcher in the Centre for Next Generation Localisation in Dublin City University, specialising in (human and machine) translation quality evaluation as part of the QTLaunchPad project.