A “Legs, Bums and Tums” Workout for Machine Translation Engines
Summer is here, and many people are starting to take exercise again. At itl, we find that machine translation (MT) engines need regular training, too. And what’s the basis of any training program? A good plan. For a machine translation engine, this training plan consists of the following:
- Data-set preparation (the warm-up phase)
- The workout for the MT engine
- Post-editing (the cool-down phase)
So Why Train Machine Translation Engines?
Because you reap benefits not just in terms of linguistic quality but in terms of your corporate language and terminology, as well. And when you give your translation engine the right workout, not only do you improve your translations; you also save time and money.
Data-Set Preparation as a Warm-up Routine
A first, important step is to develop a training plan based on the data in your existing translation memory system. Scripts and automated tool solutions can help to analyze your data and deal with some of the problem areas automatically. In the manual noise reduction and automated normalization phases, the focus is on the following problem areas:
- Incorrect language pairs
- Unicode blocks
- Whitespace between words (tabs, line breaks, spaces)
- Segments that are too long (we recommend no more than 40 words per segment)
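The warm-up checks above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling itl uses: the function names, the tuple-based segment format and the script check are all assumptions made for the example. The script check in particular is only a crude stand-in for spotting incorrect language pairs; real language identification needs a dedicated tool.

```python
import re
import unicodedata

MAX_WORDS = 40  # the per-segment limit recommended above


def normalize_whitespace(text: str) -> str:
    """Collapse tabs, line breaks and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def uses_script(text: str, script_prefix: str) -> bool:
    """Return True if every letter's Unicode name starts with script_prefix.

    Hypothetical helper: a rough proxy for "expected Unicode block" that
    also catches some wrong language pairs (e.g. Cyrillic in a German TM).
    """
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith(script_prefix) for ch in letters
    )


def clean_segments(pairs, source_script="LATIN", target_script="LATIN"):
    """Filter (source, target) pairs exported from a translation memory."""
    kept = []
    for source, target in pairs:
        source = normalize_whitespace(source)
        target = normalize_whitespace(target)
        if not source or not target:
            continue  # empty after normalization
        if len(source.split()) > MAX_WORDS or len(target.split()) > MAX_WORDS:
            continue  # segment too long
        if not uses_script(source, source_script):
            continue  # unexpected characters on the source side
        if not uses_script(target, target_script):
            continue  # unexpected characters on the target side
        kept.append((source, target))
    return kept
```

For example, `clean_segments([("Hello\t world\n", "Hallo  Welt"), ("Hello", "Привет")])` keeps only the first pair, with its whitespace normalized. Segments rejected by a filter like this are good candidates for the manual noise-reduction pass.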
The Workout for Highly Tuned MT Engines
Now that the warm-up routine is over, the workout can begin. A large batch of text segments (workout exercises) is prepared and fed into the MT engine. How extensive the data set has to be (how many exercises there are in the workout, in other words) depends on the MT engine involved. A minimum of 100,000 segments should be available. This minimum amount can be augmented with a basic stock of translations from a variety of suppliers to provide a good foundation (much like you might get protein powders from a supplement shop). Without this basic stock, as many as a million segments are recommended.
Putting Our Workout to the Test
Test the effectiveness of your "workout", for example together with a few of your main customers. Take some existing catalogue content (short, fragmented, terminology-intensive texts) and perform the following steps:
- Step 1: Translate the existing data using a generic MT engine
Outcome: inconsistent translations that do not use the corporate terminology
- Step 2: Prepare the data from the translation memory and add a basic stock of translations from elsewhere
Outcome: the data set for training the MT engine
- Step 3: Feed the data set into a trainable MT engine
- Step 4: Check the results using automated evaluation metrics
- Step 5: Verify the results using post-editors and proofreaders
In our case, the translations of the trainable MT engine were completely convincing in the test: the use of the customer’s specialist terminology was much improved.
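One simple automated check for Step 4 is to measure how often the engine uses the approved term. The sketch below is an illustration only and not the metric used in our test: the function name, the glossary format and the plain substring matching are assumptions, and established MT metrics would be used alongside it in practice.

```python
def terminology_adherence(segments, glossary):
    """Share of glossary term occurrences rendered with the approved term.

    segments: list of (source_text, mt_output) pairs
    glossary: dict mapping a source term to its approved target term
    Returns None if no glossary term occurs in any source segment.
    """
    expected = 0
    matched = 0
    for source, output in segments:
        source_lower = source.lower()
        output_lower = output.lower()
        for source_term, target_term in glossary.items():
            if source_term.lower() in source_lower:
                expected += 1  # the approved term should appear in the output
                if target_term.lower() in output_lower:
                    matched += 1
    return matched / expected if expected else None
```

Running the same catalogue segments through the generic engine and the trained engine, and comparing the two scores, gives a quick first indication of whether the corporate terminology is being picked up before post-editors and proofreaders take over in Step 5.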
Custom Training Is the Key to Success
Generic MT engines such as Google Translate or DeepL work best with large, general texts where the aim is to get an accurate translation that reads well. The specialist terminology distributed throughout the text can be dealt with subsequently in the post-editing phase. In the long run, however, an all-round athlete like this will not manage to rise above mediocrity. If you want to get on the winners’ rostrum for different kinds of text, you need to put together a team of specialists: a weightlifter for catalogues, a sprinter for advertising copy and so on. A specific program of training enables you to use each MT engine where it will be successful and give you the best results. Or would you expect a weightlifter to win the gold medal in the 100 meters?
That All Sounds Like a Bit Too Much for Me...
Not ready for a full data workout? Or is a good all-rounder all you need? No problem. A generic MT engine followed by post-editing will also give you good results. Regardless of whether you use a generic MT engine or a trainable one, itl’s localization engineering experts will be with you every step of the way. Automated metrics are a valuable tool for initially assessing how suitable your content is for machine translation. Your expert guide will help you choose the right MT engine, and will also keep an eye on data security and server locations.