UTX (universal terminology eXchange): a simple terminology format

By: Yuji Yamamoto (CosmosHouse)


06 May 2013

UTX is a simple, tab-delimited terminology format established by the Asia-Pacific Association for Machine Translation (AAMT). UTX helps improve translation accuracy by letting users accumulate, share, and reuse glossary information.

Why UTX?
In 2009, AAMT established the first UTX specification, which has since been revised and updated. Based on this specification, anyone can create, publish, and share a UTX glossary (also called a "UTX dictionary"). With UTX, a user can easily create, share, and reuse glossaries to improve translation quality. For human translators, UTX is a concise, easy-to-build glossary that cuts the time and cost of checking terminology. For terminology-based machine translation and terminology tools, UTX serves as ready-to-use glossary data.

Have you ever thought that translation software can produce only strange translations? When translation software fails to translate correctly, the problem is often that it doesn’t have sufficient translation knowledge of certain words and phrases that should be translated. You can greatly improve the accuracy of translation software (rule-based machine translation) by accumulating translation knowledge as a UTX glossary, and then converting it into a user dictionary of the translation software.

Until now, preparing effective user dictionaries has required a huge effort from individual users of translation software. Also, even an Excel glossary or a simple plain text file is difficult to share or reuse if the entry format is not standardized. Many glossaries are available on the Internet, but their formats are not readily usable out of the box. Time-consuming corrections and fine-tuning are required to use them in actual tools. However, if you use a standard format such as UTX, you can share a glossary among various tools and quickly reuse it.

What does it look like?
The figure below shows the basic components of a UTX glossary. The detailed specification is also available.

Figure 1: Basic components of a UTX glossary

Who creates and uses UTX?
UTX is specifically designed to be created and used by translators and end-users of translation software. Creating or using it does not require advanced technical knowledge of linguistics, grammar, or machine translation software. A glossary can be built from minimal data, such as the basic part of speech (noun, verb, etc.) and, if the entry is a noun, its plural form.

Figure 2: UTX enables users to easily share glossaries

Figure 3: UTX enables users to share glossaries across different tools

In which domains?
UTX can be used in any specialized domain that has technical terms, such as ICT, medicine, law, and engineering.

What kind of words should we include?
A UTX glossary contains only technical terms of specific domains, such as names of products, parts, diseases, medicines, and laws. It also contains proper nouns, such as names of people, places, and facilities. In many cases, entries are nouns, especially compound nouns. For example, a term like "XML declaration" can be correctly translated into its Japanese equivalent, "XML 宣言", just by registering it in a user dictionary. Basic vocabulary like "window" should not be included, because such words are already contained in the system dictionaries of translation software. Translation accuracy can be improved by collecting, sharing, and reusing fine-tuned bilingual term pairs that translation software does not include out of the box.

Sentences should not be included, except when it is appropriate to treat them as "words." As a rule, a UTX glossary should be kept separate from translation memory, which is a bilingual database of sentences rather than words.

Multilingual glossary and term management
Since the character code of UTX is Unicode, it can handle almost any language. Normally, a UTX glossary includes only entries of the single source language A, and their translations in the single target language B. Starting with UTX 1.20, you can specify multiple target languages.

Here is an example of a glossary containing multiple target languages.
 

Figure 4: UTX glossary sample with multiple target languages

With UTX, you can manage terminological quality and ensure that the correct terms are used. You can assign one of four statuses (provisional, forbidden, approved, and non-standard) to each entry. When multiple users contribute new terms, the initial term status is "provisional" (or left blank). The term administrator then checks each term and, if it is suitable, changes its status to "approved." A term with "approved" status can also be used for translation in the reverse direction (from language B to A). A "forbidden" status forbids the use of a specific term. A "non-standard" status means that even though the word is not the best translation, it needs to be included for processing purposes (an example is an alternative spelling).
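The status workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: entries are held as hypothetical (source, target, status) tuples, and the helper names `normalize_status`, `approve`, and `find_forbidden` are invented for this example rather than taken from any UTX tool.

```python
# The four statuses named in the text; a blank status counts as provisional.
STATUSES = {"provisional", "forbidden", "approved", "non-standard"}


def normalize_status(raw):
    """Map a raw status cell to one of the four statuses."""
    status = raw.strip().lower() or "provisional"
    if status not in STATUSES:
        raise ValueError(f"unknown term status: {raw!r}")
    return status


def approve(entries, src, tgt):
    """Return a copy of entries with the matching (src, tgt) pair approved."""
    return [
        (s, t, "approved" if (s, t) == (src, tgt) else normalize_status(st))
        for s, t, st in entries
    ]


def find_forbidden(entries, translation):
    """List forbidden target terms that occur in a translated text."""
    return [
        t for _, t, st in entries
        if normalize_status(st) == "forbidden" and t in translation
    ]


# Hypothetical entries: (source term, target term, term status).
entries = [
    ("fast", "高速な", ""),
    ("optional", "省略可能な", "approved"),
    ("optional", "オプショナルな", "forbidden"),
]
reviewed = approve(entries, "fast", "高速な")
hits = find_forbidden(reviewed, "これはオプショナルな引数です")
print(reviewed[0], hits)
```

Here the administrator's review is modeled as `approve`, and `find_forbidden` shows how a "forbidden" entry could be used to flag an unwanted term in a finished translation.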

How do we make a UTX glossary?
A UTX glossary can be easily created, edited, and viewed with any spreadsheet application or text editor. Some tools are available to convert between UTX and various other formats.

Figure 5 shows errors commonly found in a UTX glossary.

Figure 6 shows how to write a UTX glossary's first few lines, which indicate the information about the glossary.

Tips for making a UTX dictionary (glossary)

  • Create a separate glossary for each specific domain
  • Each entry has one and only one source term
  • Choose only the single, most appropriate translation corresponding to a source word
  • Only use upper case for proper nouns
  • The basic form of the word should be entered (singular form for a noun, root form for a verb - as you would see in a commercial dictionary)
  • Any comments should be noted separately in the comment field, not as a part of the entry

Please also refer to the Quick Guide and the UTX specifications for details.
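Some of the tips above lend themselves to a rough automated check. The sketch below is a minimal illustration, not part of the UTX specification: the function name `lint_entry` and both character patterns are assumptions chosen for this example, and they will produce false positives (a legitimate term may contain a comma or a slash).

```python
import re

# Hypothetical heuristics; neither pattern comes from the UTX specification.
COMMENT_PATTERN = re.compile(r"[(（\[［]")   # brackets hint at an inline comment
MULTI_PATTERN = re.compile(r"[/;,、；／]")    # separators hint at several terms


def lint_entry(src, tgt):
    """Return a list of warnings for one source/target pair."""
    warnings = []
    if not src.strip() or not tgt.strip():
        warnings.append("empty source or target term")
    if COMMENT_PATTERN.search(src) or COMMENT_PATTERN.search(tgt):
        warnings.append("possible comment inside the entry")
    if MULTI_PATTERN.search(src) or MULTI_PATTERN.search(tgt):
        warnings.append("possible multiple terms in one field")
    return warnings


print(lint_entry("early adopter", "アーリー アドプター"))
print(lint_entry("save (a file)", "保存する"))
print(lint_entry("optional", "省略可能な/オプショナルな"))
```

A check like this only points a human reviewer at suspicious entries; deciding whether a term is a proper noun or in its basic form still needs a person.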

In what kind of scenarios do we use UTX?

  • Creating a glossary from scratch
  • Collecting translated terms during translation 
  • As an intermediate format for converting between various terminology formats

How do we use UTX?
Since UTX is a simple format, a UTX glossary can be easily converted and imported into various tools. In tools such as OmegaT (a translation memory tool) and ApSIC Xbench (a terminology reference tool), it can be used with very few changes.
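As a rough sketch of such a conversion, the Python below turns UTX text into a plain three-column, tab-delimited glossary (source, target, comment). The three-column layout is an assumption made for this example, so check the documentation of the tool you actually use. The function name `utx_to_simple_glossary` is hypothetical, and dropping "forbidden" entries on export is a design choice of this sketch, not a UTX rule.

```python
def utx_to_simple_glossary(utx_text):
    """Convert UTX text to 'source<TAB>target<TAB>comment' lines."""
    lines = [ln for ln in utx_text.splitlines() if ln.strip()]
    # The second line holds the tab-delimited field names after a leading "#".
    fields = lines[1].lstrip("#").split("\t")
    src_i, tgt_i = fields.index("src"), fields.index("tgt")
    status_i = fields.index("term status") if "term status" in fields else None
    out = []
    for ln in lines[2:]:
        if ln.startswith("#"):
            continue  # skip any further comment lines
        cols = ln.split("\t")
        cols += [""] * (len(fields) - len(cols))  # pad missing trailing cells
        status = cols[status_i].strip() if status_i is not None else ""
        if status == "forbidden":
            continue  # do not hand forbidden terms to the target tool
        # The term status is carried over as the comment column.
        out.append("\t".join([cols[src_i], cols[tgt_i], status]))
    return "\n".join(out) + "\n"


UTX_TEXT = (
    "#UTX 1.11; en-US/ja-JP; 2013-04-01T10:00:00Z+09:00; "
    "copyright: AAMT (2013); license: CC-by 3.0\n"
    "#src\ttgt\tsrc:pos\tterm status\tsrc:plural\n"
    "optional\t省略可能な\tadjective\tapproved\t\n"
    "optional\tオプショナルな\tadjective\tforbidden\t\n"
    "save\t保存する\tverb\tapproved\t\n"
)
converted = utx_to_simple_glossary(UTX_TEXT)
print(converted, end="")
```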

What does it cost to create a UTX glossary?
You can download the UTX specifications for free. Many UTX glossaries are also available for free, although some creators may charge a fee.

To open source developers and translators - Why not release your glossary in UTX format, and share it with others?
By converting your UI strings or a bilingual glossary into UTX format, and then publishing and sharing it, you can make your software multilingual quickly and accurately. Many potential users around the world will then be able to use your software with minimal effort.

UTX mailing list
Anyone can participate in the discussions on UTX through the UTX mailing list.

More information
Download the UTX specifications, sample glossaries, and the UTX Quick Guide for free.

UTX example (ICT glossary)

#UTX 1.11; en-US/ja-JP; 2013-04-01T10:00:00Z+09:00; copyright: AAMT (2013); license: CC-by 3.0
#src	tgt	src:pos	term status	src:plural
early adopter	アーリー アドプター	noun	approved	early adopters
fast	高速な	adjective	provisional	
optional	省略可能な	adjective	approved	
optional	オプショナルな	adjective	forbidden	
save	保存する	verb	approved	

First line (a comment line, denoted by "#"): basic information about the glossary. Items are separated by a semicolon and a space.

#UTX <version number>; <source language>/<target language>; <last update date/timestamp>; copyright: <copyright holder name (year)>; license: <license>; <additional information> (if needed)

Second line (comment line): field names. Each item is tab-delimited.
In the above example:
#<source word> <target word> <part of speech of the source word> <term status> <plural form of the source word>

The third and subsequent lines contain actual entries in this example. Each item is tab-delimited.
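The three-part layout described above (header line, field-name line, entry lines) can be read with a short Python sketch. It is only a minimal illustration based on the description in this article, not a full implementation of the specification. The function name `parse_utx` and the dictionary keys are hypothetical.

```python
def parse_utx(text):
    """Parse UTX text into (header, fields, entries)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # First line: "#UTX <version>; <source>/<target>; ..." split on "; ".
    header_items = lines[0].lstrip("#").split("; ")
    source_lang, target_lang = header_items[1].split("/")
    header = {
        "version": header_items[0].split()[1],
        "source": source_lang,
        "target": target_lang,
        "other": header_items[2:],  # timestamp, copyright, license, ...
    }
    # Second line: tab-delimited field names after a leading "#".
    fields = lines[1].lstrip("#").split("\t")
    # Third and subsequent lines: tab-delimited entries.
    entries = []
    for ln in lines[2:]:
        if ln.startswith("#"):
            continue  # skip any further comment lines
        values = ln.split("\t")
        values += [""] * (len(fields) - len(values))  # pad missing cells
        entries.append(dict(zip(fields, values)))
    return header, fields, entries


SAMPLE = (
    "#UTX 1.11; en-US/ja-JP; 2013-04-01T10:00:00Z+09:00; "
    "copyright: AAMT (2013); license: CC-by 3.0\n"
    "#src\ttgt\tsrc:pos\tterm status\tsrc:plural\n"
    "early adopter\tアーリー アドプター\tnoun\tapproved\tearly adopters\n"
    "fast\t高速な\tadjective\tprovisional\t\n"
)
header, fields, entries = parse_utx(SAMPLE)
print(header["version"], header["source"], header["target"])
print(entries[0]["src"], "->", entries[0]["tgt"])
```

Keeping each entry as a dictionary keyed by field name means the code does not depend on the column order, which can differ between glossaries.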
 

Yamamoto Yuji (CosmosHouse) et al. (UTX team, AAMT (Asia-Pacific Association for Machine Translation))

AAMT comprises three groups: researchers, manufacturers, and users of machine translation systems. AAMT members are volunteers. The association endeavors to develop machine translation technologies to expand the scope of effective global communications. For this purpose, the association is engaged in machine translation system development, improvement, education, and publicity.

Yamamoto Yuji, language/translation consultant at CosmosHouse, is the leader of the UTX team at AAMT. He has contributed over 230 technical articles for translators through various publications. The topics include machine translation for translators, CAT tools, and terminology. In his recent book, Practical Japanese, he shares best practices of practical Japanese writing.

Members of the UTX team (in no particular order)
YAMAMOTO Yuji (leader)  CosmosHouse
MURATA Toshiki          Oki Electric Industry Co., Ltd.
Francis Bond            Nanyang Technological University
SHIMAZU Miwako          Toshiba Solutions Corporation
OKURA Seiji             Fujitsu Laboratories Limited
Michael Konin Kato      Learning Consultant
AKIMOTO Kei             Cross Language Inc.
METSUGI Yumiko          STAR Japan Co., Ltd.
KAMEYA Hiroshi          SunFlare Co., Ltd.
HIRABAYASHI Takeshi     Inter Group Corp.