A Holistic Approach to Reusing Terminological Data in Software Localization
By: Silvia Cerrella Bauer (SIS SegaInterSettle) and Detlef Reineke (University of Las Palmas de Gran Canaria)
06 September 2006
This article shows how an organization's internal process for producing and localizing software interface texts (SITs) can be developed into a proactive, contextualization-oriented workflow spanning several departments. Storing these SITs together with an appropriate context description in a universal format such as XML ensures continuous data security and data exchange.
When it comes to using language resources (LRs), the main challenge is not linguistic but logistical. Dynamic work processes aimed at increasing efficiency have gained such importance that technical solutions, know-how and skills are valued at least as highly as language and translation competencies. The main goal in this context is to avoid repeated, deep-structured knowledge remodeling and to allow central language and translation management that is proactive and independent of particular staff and tools. To achieve this, knowledge must be made as explicit as possible and seamless data exchange must be guaranteed, especially when handling de-contextualized text fragments in extensive projects.
Former and current workflow for the translation of software interface texts
Our industry example is a financial organization, SIS SegaInterSettle AG, Zurich, which decided to make the SITs of one of its executable Java-based applications available to clients in different languages. Its visual user interface is string-based: every data type is a character string, which may represent dialogue texts, menu entries, message texts, etc. Each string to be localized averages about 1.8 words. Some string examples are given below:
Figure 1: String examples from the proprietary application
The strings range from simple one-word action buttons to whole sentences, such as:
Table 1: String examples from the proprietary application
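Such a string-based interface might externalize its texts as a table of string IDs and values. The following sketch is purely illustrative: the IDs and most of the texts are invented (the long sentence is quoted from the article), and the helper simply computes the kind of per-string word average the article reports (about 1.8 words).

```python
# Hypothetical sketch of externalized UI strings keyed by string ID.
# IDs and short texts are invented for illustration; the long string
# is the example sentence quoted in the article.
ui_strings = {
    "BTN_OK": "OK",
    "MNU_INSTR": "Settlement instructions",
    "MSG_LINK_MANUAL": (
        "Click on the following link and you will have access "
        "to the complete webMAX User Manual"
    ),
}

def average_word_count(strings):
    """Average number of words per string (the article reports ~1.8)."""
    counts = [len(text.split()) for text in strings.values()]
    return sum(counts) / len(counts)
```

A real resource table would be far larger; the point is only that each SIT is an addressable unit with an identifier, which is what later makes context meta-data attachable.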
Before the in-house language services department (LSD) started using a localization tool for translating the SITs, disparate and erroneous data (in terms of consistency, language and terminological correctness) were integrated into the program by non-native speakers in the programming department and then reused in the documentation by the technical writers. Only when the program-related documentation was passed on for translation were the many inconsistencies and errors identified, and the LSD had to initiate time-consuming text identification and correction work.
Automating this process and re-allocating responsibilities within it already yielded enormous time savings: the processing time per string was cut to one third of what it had been. In addition, the initial cost of the localization tool and of training the LSD was fully amortized only four months after launch. The programming and technical writing teams benefited the most from the new process, as they could devote more time to other core, value-added tasks.
Proposed enhancements of the current localization process
Most of the SITs are small units, i.e. nouns (simple or compound) or short phrases. From a typological point of view, these text strings can be classified as terms and should therefore be stored in a terminology database (TDB) together with the corresponding meta-data. Even strings like "Click on the following link and you will have access to the complete webMAX User Manual" could be stored in a TDB, because the textual representation of this concept can easily be shortened to term size (menu item: Complete webMAX User Manual).
Yet, despite their terminological character, the SITs are usually exported from the localization tool into a translation memory system (TMS). For translation reuse, this solution offers advantages in terms of translation speed. With regard to a better understanding of the concepts and the relations between the SITs, however, TMSs are not very helpful, because they generally lack features for language- or software-specific contextualization or for conceptual systematization of the texts.
Let's have a look at some examples that demonstrate the need for a more context-oriented approach:
Figure 2: String example extracted from the localization tool environment
Figure 3: String example imported into the TMS
Alternatively, comprehensive context meta-data such as definitions, term relations, etc. could be included in a TDB. One might object that a context-oriented TMS such as Star Transit already exists, but this tool only allows for the contextualization of conventional texts and cannot trace links (for example, from a dialog option to an error message), variable values or other program-specific issues. In a terminology management system like Trados MultiTerm, for example, the SIT in figure 3 could be defined as shown in figure 4:
Figure 4: Screenshot from the XML-based terminology management system
Apart from language-related data like "Source", "Definition" or "Term type", this entry model also includes program-specific information like "Code", "Name" and "String ID". This data can be helpful when identifying or tracking down possible doubts or program errors together with the responsible software developer.
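The combination of language-related and program-specific fields can be sketched as a small XML fragment. Note that the field names ("Definition", "Code", "String ID") are taken from the article, while the element structure and all content values are invented for illustration; this is not the actual MultiTerm entry schema.

```python
import xml.etree.ElementTree as ET

# Simplified sketch of a terminological entry combining language-related
# fields with program-specific ones, as named in the article. The element
# structure and all values are invented; this is not the MultiTerm schema.
entry = ET.Element("termEntry")
ET.SubElement(entry, "descrip", type="Definition").text = (
    "Message shown when a settlement instruction cannot be matched"  # invented
)
ET.SubElement(entry, "admin", type="String ID").text = "SIT_00123"   # invented
ET.SubElement(entry, "admin", type="Code").text = "ERR"              # invented
ET.SubElement(entry, "term").text = "unmatched instruction"          # invented

# The program-specific meta-data lets a translator trace the string
# back to its exact place in the code base.
string_id = entry.find("admin[@type='String ID']").text
```

In a query like the last line, the "String ID" field is what connects a doubt raised during translation to a concrete resource the developer can look up.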
Terminology standards for software localization
Trados MultiTerm is an XML-based terminology management system that allows terminology entries to be structured along the lines of ISO 16642 (2003): "Computer applications in terminology – Terminological markup framework (TMF)". This standard was defined to provide a common basis for terminological data modeling and lossless data exchange between different systems.
Figure 5: Meta-model (ISO 16642, p. 18)
This meta-model is based on the MARTIF standard (ISO 12200), with the exception of the TCS level, which is not part of the MARTIF structure. The upper levels of the meta-model (TDC, GI, CI) may contain information that applies to the whole TDB or file, such as the validating schema (if XML is used), encoding information, the title of the file, address information, copyright information, update information or information about the author of the database. Additionally, full bibliographical or administrative texts, static or dynamic graphic images, video, audio, or other terminological resources or ontologies may be stored in or external to the file and be pointed to by hyperlink from the individual terminological entries.
The terminological entry (TE) contains all the information pertaining to a single concept. One entry can be made up of one or more language sections (LS) and term sections (TS). Furthermore, the components of compound terms can be described in the term component section (TCS).
The meta-model is expressed by a terminological markup language (TML), which in most cases will use XML to describe the terminological information, although other modeling languages like RDF or UML can also be used. One of the predefined TMLs of the ISO 16642 standard is MSC (MARTIF with Specified Constraints), which in turn is used to define the TermBase eXchange format (TBX), a format intended to play the same role in terminology data interchange as TMX plays for TMSs or XLIFF for localization tools.
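The meta-model's nesting can be illustrated with a minimal TBX-style fragment: one terminological entry (TE) holding one language section (LS) per language, each with a term section (TS). The element names (termEntry, langSet, tig, term) follow the published TBX format; the sample terms themselves are invented.

```python
import xml.etree.ElementTree as ET

# Minimal TBX-style fragment mirroring the ISO 16642 meta-model:
# termEntry (TE) > langSet (LS) > tig/term (TS). Sample terms invented.
tbx = """<termEntry id="te-1">
  <langSet xml:lang="en">
    <tig><term>settlement instruction</term></tig>
  </langSet>
  <langSet xml:lang="de">
    <tig><term>Abwicklungsinstruktion</term></tig>
  </langSet>
</termEntry>"""

entry = ET.fromstring(tbx)
XML_NS = "{http://www.w3.org/XML/1998/namespace}"  # namespace of xml:lang
# One concept, one language section per language, one term each.
terms = {
    ls.get(XML_NS + "lang"): ls.find("tig/term").text
    for ls in entry.findall("langSet")
}
```

Because the whole entry describes a single concept, a tool reading this fragment can treat the English and German terms as interchangeable designations rather than as unrelated translation-memory segments.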
To allow more flexible interchange between the different entry structures of different terminology management systems, the data categories are not directly tied to the nodes of the meta-model. Instead, each data category is generally instantiated by a type attribute within a meta-data category tag (see figure 6).
The names of the data categories in our example are user-defined and do not comply with ISO 12620 (1999): "Terminology in computer applications – Data categories". Some of them, namely software-specific data categories like "Code" or "String ID", do not even have equivalents in ISO 12620. Most ISO 12620-compliant data categories are likely to be used in most common TDB entry structures, but given the specific context needs of software localization, some additional data categories would be necessary to design a more granular entry model (see Reineke 2004, Schmitz 2005).
Resource IDs and categories, resource types and other structure-related software data should be described within specific data categories to help translators and localizers contextualize the SITs directly during the translation process or to speed up context checks with the software designers.
Figure 6: XML-based Trados MultiTerm format
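Carrying data categories as type attributes rather than as element names has a practical consequence: a tool can read, preserve and round-trip categories it does not recognize, including user-defined ones. The following sketch shows that mechanism; the sample entry and its values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Because data categories live in type attributes, user-defined categories
# such as "Code" or "String ID" survive exchange between tools that do not
# know them by name. Sample entry and values invented for illustration.
entry = ET.fromstring(
    "<termEntry>"
    "<descrip type='Definition'>Invented definition text</descrip>"
    "<admin type='String ID'>SIT_00123</admin>"
    "<admin type='Code'>ERR</admin>"
    "</termEntry>"
)

def data_categories(elem):
    """Map every type attribute to its content, whatever the category name."""
    return {child.get("type"): child.text for child in elem if child.get("type")}
```

A receiving system can thus store "Code" and "String ID" alongside ISO 12620-compliant categories without any schema change, which is exactly what makes the meta-model flexible across entry structures.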
A systematic, standards-based approach to digital content development and translation can help reduce costs and increase product quality. Generally speaking, the transition from loose, uncoordinated workflows and data flows to more efficient methods and formats requires knowledge, persuasiveness and patience; it is not likely to happen in quantum leaps. In most cases, the need for more translation- and localization-oriented workflows and data flows originates in the LSD, because unclear SITs are detected there first. Software developers should therefore make good use of their LSD, treating it as the first user of their products, and establish adequate feedback channels for readjustment.
Another important aspect in the implementation of sophisticated processes is the interoperability of tools and formats, which allows the lossless migration of information. As we have seen above, SITs can be considered concepts and therefore be stored in a TDB. Apart from the theoretical foundations that justify this approach, there are also practical, context-related reasons for it: today, only TDBs allow for a context-oriented description of SITs; TMSs do not.
Apart from the aspects mentioned above, the general question is whether language and terminology management should be a re-engineering process (introducing context data during translation/localization) or whether part of the language management tasks can be performed during the software design phase. Terminologists, or even suitably trained software developers, could create source-language TDBs, which would then be sent to the translators as external TDBs connected to a localization tool. Another possible scenario is the inclusion of context-oriented meta-data in the application's source code; in this case, compilers and localization tool interfaces would have to be developed that allow the extraction of both the SITs and the meta-data.
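The second scenario, meta-data embedded next to each interface string in the source code, can be sketched as a toy extractor. The annotation syntax (a `/* l10n: ... */` comment) and the sample source line are invented for illustration; they do not correspond to any existing tool's format.

```python
import re

# Hypothetical sketch: context meta-data embedded beside an interface
# string in source code and extracted together with it. The annotation
# syntax and sample line are invented, not an existing tool's format.
SOURCE_LINE = (
    'label.setText("Unmatched instruction");'
    ' /* l10n: id=SIT_00123; type=error message */'
)

PATTERN = re.compile(r'"([^"]*)"\);\s*/\*\s*l10n:\s*(.*?)\s*\*/')

def extract(line):
    """Return (string, metadata dict) for an annotated source line, or None."""
    match = PATTERN.search(line)
    if not match:
        return None
    text, raw_meta = match.groups()
    meta = dict(item.split("=", 1) for item in raw_meta.split("; "))
    return text, meta
```

A real implementation would live in the build chain rather than in a regular expression, but the principle is the same: the SIT and its context travel together from the design phase onward, so the translator never receives the string in isolation.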
References
CERRELLA BAUER, SILVIA (2005): Processing multilingual deliverables in an organisation: the role of terminology management in software localisation. In: Tekom-Tagungen – Jahrestagung 2005 in Wiesbaden, Zusammenfassung der Referate, TEKOM Gesellschaft für Technische Kommunikation e.V. – tekom, Stuttgart, 202-205.
FREIGANG, KARL-HEINZ/REINEKE, DETLEF (2004): Kontextualisierung und Datenfluss in der Softwarelokalisierung. In: Lebende Sprachen 4/2004, 159-167.
ISO 16642 (2003): Computer applications in terminology – Terminological markup framework (TMF). Geneva: International Organization for Standardization.
REINEKE, DETLEF (2004): fFpPtTdDcClLrR??? – Wissenseinheiten in der Softwarelokalisierung (und deren Verwaltung in Terminologieverwaltungssystemen). In: Mayer, F.; Schmitz, K.-D.; Zeumer, I. (Hrsg.): Terminologie und Wissensmanagement, Akten des Symposiums Deutscher Terminologie-Tag, Köln, 26.-27.03.2004. Köln: SDK Systemdruck, 193-207.
SCHMITZ, KLAUS-DIRK (2005): Terminologieverwaltung für die Softwarelokalisierung. In: Reineke, Detlef/Schmitz, Klaus-Dirk (eds.): Einführung in die Softwarelokalisierung. Tübingen: Narr, 39-53.
Silvia Cerrella Bauer is a certified conference interpreter and a certified terminologist. She holds a post-graduate degree in Corporate Communications. She has gathered experience as a freelance interpreter and translator and has been working for the past seven years as a translator-terminologist at SIS SegaInterSettle AG, a major Swiss service provider for the securities industry based in Zurich. She has been responsible for this company's translation department since 2001. Some of her special interests are knowledge management, software localization, controlled languages and engineering of document production processes. She has participated as a speaker at various international forums and events on translation, terminology and technical documentation and has published a number of articles related to these subjects and her professional practice.
Detlef Reineke obtained a degree in specialized translation (mechanical engineering and electrical engineering – French, English and Spanish) at the University of Hildesheim and holds a PhD on data modeling in software localization. Since 1994, he has been teaching and researching at the Faculty of Translation and Interpreting of the University of Las Palmas de Gran Canaria (Spain) (subjects: specialized translation, software localization, German language and culture). He is the author of various articles and books on software localization and has participated in localization-related projects. He has been General Secretary of the German terminology association "Deutscher Terminologie-Tag e.V." since 2004.