Investigating the Scalability of Alignment in Okapi
Earlier this year I got an email from a freelance translator who was looking for advice on how to solve an l10n engineering problem he had. He had a huge quantity of text that he had machine translated using Moses, the open source machine translation engine. The source and translated files were in the format Moses expects — plain text, one segment per line. Now he wanted to know if there was an easy way he could use Okapi to align these segments and write the pairs out as a TMX file. He had been experimenting with Rainbow, but hadn’t had any luck in accomplishing his task.
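To make the input format concrete: Moses-format data is just a pair of parallel plain-text files, where line N of the source file corresponds to line N of the target file. A tiny illustrative sketch in Python (invented content and file names, not the translator's data):

```python
# Minimal illustration of the Moses plain-text format: two parallel
# files, one segment per line, aligned implicitly by line number.
source_lines = [
    "Hello, world.",
    "How are you?",
]
target_lines = [
    "Bonjour, le monde.",
    "Comment allez-vous ?",
]

with open("corpus.en", "w", encoding="utf-8") as f:
    f.write("\n".join(source_lines) + "\n")
with open("corpus.fr", "w", encoding="utf-8") as f:
    f.write("\n".join(target_lines) + "\n")
```

There is no markup and no metadata in these files; the only alignment information is the line ordering itself, which matters later in this story.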
I hadn’t tried this before, but I knew that Okapi included pipeline steps for alignment, so I opened up Rainbow and started experimenting with a small bit of test data. As it turns out, Okapi has three separate steps that can perform alignment: “Sentence Alignment”, “Paragraph Alignment”, and “Id-Based Aligner”. Since the input was pre-segmented, I chose “Paragraph Alignment”.
Within a few minutes, I had a setup that worked on a pair of small test documents I’d made. I added the “source” and “target” documents to separate input lists, each configured to use the Moses Text Filter. Then I built a 3-step pipeline to process the data:
- Raw Document to Filter Events, to extract the segments from each of the documents.
- Paragraph Alignment, to align the segments.
- Format Conversion, to write out the TMX.
I emailed the solution to the translator, and went back to work. The next day, I heard back — the solution worked, he said, but only for very small files. Okapi seemed to have a hard limit on how many segments it would align, and even if he reduced the file below that, he was having trouble getting the pipeline to execute. He sent me his data to help me reproduce the problem.
The Plot Thickens
Using his data, I confirmed that Okapi had a hard limit, and went digging in the source code, where I found an ominous warning about memory usage. I also confirmed that if I took a subset of the data (about 50,000 segments), the pipeline would fail with a memory error. I increased Rainbow’s max heap to 4 GB and tried again. Still out of memory.
At this point I decided to actually read the code, and discovered the problem quickly. The Paragraph Alignment and Sentence Alignment steps are implemented using a dynamic programming algorithm over a structure called a checkerboard. The code allocates a huge grid and then fills it in, one square at a time, based on the results of the surrounding squares. Computationally, the algorithm is efficient, but you need enough memory to hold the grid, whose size grows with the square of the number of segments.
With my 50,000-segment test file, that meant a grid containing 2.5 billion cells, which is probably 50 GB or more of memory once you factor in the size of the cells themselves. And that was just for my test file: the translator’s full corpus had about 2 million segments. That would be a grid containing… 4 trillion cells.
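The back-of-the-envelope numbers are easy to check. The per-cell size here is a guess for illustration; the exact figure depends on Okapi's implementation:

```python
# Rough cost of a full n x n alignment grid, assuming ~20 bytes per
# cell (an assumption for illustration, not Okapi's actual cell size).
BYTES_PER_CELL = 20

def grid_cells(n_segments: int) -> int:
    """Number of cells in a square grid over n_segments."""
    return n_segments * n_segments

def grid_gb(n_segments: int) -> float:
    """Approximate grid memory in gigabytes."""
    return grid_cells(n_segments) * BYTES_PER_CELL / 1e9

print(grid_cells(50_000))      # 2.5 billion cells for the test file
print(grid_gb(50_000))         # about 50 GB at 20 bytes per cell
print(grid_cells(2_000_000))   # 4 trillion cells for the full corpus
```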
It would be possible, I suspected, to rewrite the checkerboard implementation to recycle rows as it went along, so that its memory usage could be reduced to something linear. But that meant a nasty rewrite, and if I was going to do that, there might be a better algorithm to use in the first place.
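Recycling rows is the standard trick for dynamic programming over a grid: when each cell depends only on the current and previous rows, you never need the whole grid in memory at once. A sketch using edit distance as a stand-in for the aligner's scoring (the real alignment scoring is more involved):

```python
# Row-recycling dynamic programming: edit distance between two
# sequences using two rows of memory instead of a full grid.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # cost of deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr  # recycle: only the last completed row is kept
    return prev[-1]
```

One reason the rewrite would be nasty: this trick recovers only the final score. Recovering the actual alignment path in linear memory needs a further technique such as Hirschberg's divide-and-conquer, and an aligner needs the path, not just the score.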
Was there another way I could be aligning? Looking at the code, the Sentence Alignment step appeared to have the same problem, but the Id-Based Aligner had a much simpler implementation that didn’t need a grid at all; it should be able to align as many segments as would fit into memory.
But there was just one problem. In Okapi’s extracted text model, a TextUnit contains two identifiers:
- An id, assigned by the filter that extracts the text. The id is only guaranteed to be unique within the document, so if two TextUnits from different documents have the same id, it doesn’t necessarily mean anything.
- A name, which is a more permanent identifier that is present in some native formats. An example of a name is the key from a Java properties file or other software string format.
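A minimal model of the distinction (a hypothetical Python sketch; Okapi's real TextUnit is a Java class with a much richer API):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the two identifiers described above.
@dataclass
class TextUnit:
    id: str              # assigned by the filter; unique only within one document
    name: Optional[str]  # native identifier, if the format has one
    text: str

# A Java properties filter can expose the key as the name...
props_unit = TextUnit(id="tu1", name="login.button.label", text="Sign in")
# ...but a plain-text format carries no metadata, so name stays None.
moses_unit = TextUnit(id="tu1", name=None, text="Sign in")
```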
The Id-Based Aligner aligns based on name, since it’s the more dependable identifier. But while all Okapi filters generate ids, names are only generated when the native file format supports it. The Moses text format contains no metadata, so no name is exposed.
Finding a Solution
At this point, it seemed I had hit the limit of what I could do without committing serious time to the project. But after some discussion on the okapitools mailing list, we settled on a solution: add an option to the Id-Based Aligner to let it align based on id instead of name. This would let the aligner handle a lot of simple cases like this one, where the ordering of text is known to be stable between the two files. Of course, the option needs to be used with care: Okapi is essentially aligning blindly based on the order of the text, which is dangerous unless the user is confident that the files are already perfectly parallel.
This turned out to be an easy option to add, so it’s now available in the latest Okapi snapshot builds and will be part of the M30 release. Since the Id-Based Aligner step supports TMX creation directly, the original problem can now be solved using an even simpler pipeline:
- Raw Document to Filter Events, to extract the text from the native files.
- Id-Based Aligner Step, to perform the alignment (based on TextUnit Id) and generate a TMX.
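Conceptually, what this pipeline does for Moses-format input reduces to pairing segments by position and emitting them as TMX translation units. A minimal sketch, not using Okapi and with a hand-rolled TMX writer that covers only the basics:

```python
# Sketch of id-based (positional) alignment: pair line N of the source
# with line N of the target and emit minimal TMX 1.4. Assumes the two
# inputs are already perfectly parallel -- exactly the caveat above.
from xml.sax.saxutils import escape

def align_to_tmx(source_lines, target_lines, src_lang="en", trg_lang="fr"):
    if len(source_lines) != len(target_lines):
        raise ValueError("files are not parallel: segment counts differ")
    tus = []
    for src, trg in zip(source_lines, target_lines):
        tus.append(
            "    <tu>\n"
            f'      <tuv xml:lang="{src_lang}"><seg>{escape(src)}</seg></tuv>\n'
            f'      <tuv xml:lang="{trg_lang}"><seg>{escape(trg)}</seg></tuv>\n'
            "    </tu>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<tmx version="1.4">\n'
        f'  <header creationtool="sketch" creationtoolversion="0.1" segtype="sentence"\n'
        f'          o-tmf="plain" adminlang="en" srclang="{src_lang}" datatype="plaintext"/>\n'
        "  <body>\n" + "\n".join(tus) + "\n  </body>\n</tmx>\n"
    )
```

The length check is the sketch's stand-in for the caveat above: if the segment counts differ, blind positional pairing would silently misalign everything after the first gap, so failing loudly is the only safe behavior.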
This story demonstrates a few aspects of the reality of open source. Obviously, without the ability to dig into the code and implement a fix directly, this tale wouldn’t have had a happy ending.
But it also shows that the code isn’t always perfect, especially for features that aren’t used that frequently. Okapi is a large codebase, and there are only so many people who can contribute their time to improving it. We tend to prioritize the code we rely on most heavily. Alignment is an uncommon task for many people.
This is why feedback from the community is valuable to help identify and prioritize issues. Even if not everything can be fixed immediately, it’s very helpful to the project to know what works well, what could work better, and what might just be broken. Because of this, if you’ve got questions about using the tools, I would encourage you to join the okapitools mailing list and start a discussion. You can also always file issues in the issue tracker.
If you’re interested in helping out, Okapi can always use your assistance. If you’re interested in working with the code, the okapi-devel list can be a good resource, and you are welcome to file a pull request on bitbucket. However, you don’t need to be a programmer to help out! There is always testing that needs to be done, and there are large areas of the knowledge base that need additional work — updating old tutorials, documenting pipeline steps, and providing more guidance in general.