Big Data and the Translation Industry: Three Technology Challenges
By: Andrew Joscelyne (LT Innovate)
19 May 2015
LT Innovate’s Andrew Joscelyne breaks down three major trends poised to transform the language industry and prescribes his solution to the challenges ahead — collaboration.
At LT-Innovate, a language tech business support organization, we see powerful synergy between the various application fields of language technology (e.g. multilingual/cross-lingual speech/text processing) as big data, connectedness, and machine intelligence gradually morph the sector from a traditional IT field into a more intelligence-centric technology landscape. We shall look here at three problem points where evolving technology will almost certainly challenge or compete with current design solutions for translation.
Push to Pull Translation
The translation industry is largely concerned with delivering push – in other words, with providing services for “big content” publishers who want to control the quality of their messages when broadcasting to end-users around the world in multiple languages. But who will provide translation for pull — for the potentially billions of connected individuals who need to translate the foreign-language content they encounter as they journey through the physical and virtual worlds? Theoretically, the content producers have already anticipated these end users. (They, surely, are the potential targets of the “big content” messages referred to above.) By pushing out translation upstream, they ostensibly anticipate many potential pull demands downstream.
Yet the language coverage of a content campaign (50 or so languages) rarely covers the myriad individual needs of the pull demographic. This group requires up to 250 languages (and counting), and has rapidly evolving definitions of what it means to “be online” and interact with content. “Hey,” says a Maltese consumer in Europe’s upcoming digital single market, “what about accessing that content in my native language?” This same user (and others in the pull demographic) will also be using smartphones, earbuds, wearables, and even AR/VR glasses to access content online, and will demand a rich, interactive, and multilingual relationship with content.
Many online content publishers around the world will anticipate a pull population and will continue to localize their content to at least English. Why? Because big consumer data will probably demonstrate that non-EN users with access to EN content will use an app on their mobile devices that delivers free translation from an online translation service – and vice versa. Content publishers will probably wager that low-quality English-to-user-language translation via the machine is better than not reaching their users at all.
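To make that wager concrete, here is a minimal Python sketch of the fallback logic: serve professionally localized content when it exists, otherwise fall back to raw machine translation. The content, language codes, and `translate` function are all invented for illustration; `translate` merely stands in for a call to any online MT service.

```python
# Professionally localized ("push") content, keyed by (content id, language).
LOCALIZED = {
    ("welcome", "fr"): "Bienvenue",
    ("welcome", "de"): "Willkommen",
}

# English source content to fall back on.
SOURCE_EN = {"welcome": "Welcome"}

def translate(text, target_lang):
    # Placeholder for a real online MT service call.
    return f"[MT:{target_lang}] {text}"

def serve(content_id, user_lang):
    # Prefer the localized version when the campaign covered this language...
    if (content_id, user_lang) in LOCALIZED:
        return LOCALIZED[(content_id, user_lang)]
    # ...otherwise low-quality MT beats not reaching the user at all.
    return translate(SOURCE_EN[content_id], user_lang)

print(serve("welcome", "fr"))  # localized: Bienvenue
print(serve("welcome", "mt"))  # MT fallback for a Maltese user
```

The point of the sketch is the asymmetry: the 50-language campaign covers the first call, while the long tail of 250-plus languages is served only by the fallback branch.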
A similar situation could play out between users, their devices, and local data sources. Data about users’ digital profiles, geolocation, and personalized habits could all be recruited to build a more user-centric translation solution to solve a particular language problem. And it would automatically grow better as users go through life. In other words, push could emerge when and if computing evolves into a set of data and algorithms that know how to perform knowledge-based tasks.
Data to Knowledge
‘Knowledge’ is a weasel word in language processing, yet it is increasingly relevant to the cognitive dimension of everything we do with technology. Philosophers note the distinction between “know that” (intellectual knowledge) and “know how” (practical knowledge). Linked open data, semantic graphs, and other emerging meaning formats all play a role in this nascent ecosystem, and help produce know-that type knowledge – i.e. facts, and relations between facts, that can then be used to feed other actions such as predicting, inferring, proposing, and arguing. This sort of knowledge apparatus is gradually being downloaded to our machines and will eventually be built into the IT infrastructure. Know-how is literally knowing how to perform a task: translation, for example.
In the case of translation, we need both an IT-based knowledge platform (see Heyn and Wetzel’s article in this issue) to handle interoperable semantics, and an openness to cognitive computing as a new knowledge technology, in the hope that it will spawn far smarter offspring than today’s rather limited efforts. The more computers know (that), the more accurately and relevantly they can help us process massively data-based events and utterances.
But what about know-how? What about the craft, skill, and inalienable savoir-faire of homo artifex? You probably noted the recent buzz about a robot that closely watched a video of a chef preparing a meal and then, using clever neural-net gadgetry, managed to build a good-enough internal data model to imitate the chef’s gestures and prepare the same meal. Except for one crucial point: the poor little robot had a lot of know-that, but no know-how about chopping up the veggies (all that complex joint and muscle engineering embedded in human somatic systems!).
Will some translating device in due course transform knowing how to do something into simply knowing that the correct output is X or Y, by virtue of machine learning from zillions of examples, and then modelling the result? After all, a computer will never learn how a human translates something (all that neural cogitating and fantasizing); it will only know that phrases X and Y are pieces of statistically acceptable information produced by a translation process that humans seem to accept.
If you want to go down that (ultimately big data) road, you could have a robot analyze the keyboard work of hundreds of translators transmuting a text into a number of languages, and use cognitive technology to come up with a fairly accurate and repeatable model of how translators seem to make decisions on translation cruces and get their work done. You could then use these data to model a similar automated process. Translation know-how, yet not quite machine translation as we know it.
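As a toy illustration of that big-data road, the sketch below assumes the keyboard work has already been logged as (source phrase, chosen translation) pairs, and simply reproduces the majority decision at each crux. The logged data are invented; a real cognitive model would of course be far richer than a frequency count.

```python
from collections import Counter, defaultdict

# Invented log of decisions harvested from many translators' keyboard work:
# each entry pairs a source-language crux with the translation chosen for it.
logged_decisions = [
    ("bank", "banque"), ("bank", "banque"), ("bank", "rive"),
    ("spring", "printemps"), ("spring", "ressort"), ("spring", "printemps"),
]

# Build the know-that layer: for each crux, count what translators chose.
model = defaultdict(Counter)
for source, target in logged_decisions:
    model[source][target] += 1

def predict(source):
    # Simulated know-how: replay the majority decision.
    return model[source].most_common(1)[0][0]

print(predict("bank"))    # banque
print(predict("spring"))  # printemps
```

Even this crude version shows the shape of the argument: the machine never learns *how* a translator decides, only *that* certain outputs recur often enough to be repeatable.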
The same surely goes for post-editing: the robot will note the millions of vocabulary and grammar decisions and actions taken digitally by post-editors and learn from these at each linguistic juncture (word, phrase, sentence, and paragraph) how to duplicate the process. Knowledge, then, is slowly being offloaded to smart machines that in turn can transmogrify know-that facts derived from data about processes into a simulation of practical know-how. In other words, these machines can learn from us.
Data to Text
Once upon a time, natural language generation (NLG) was a key research project in the language technology Umwelt, and it remains one of the modules in a standard-issue translation automation system: somehow you have to generate the target-language output of a translation from grammatical and lexical information encoded as a ‘language model’ for the machine.
A related, non-trivial language engineering challenge has been automatic summarization, a technique that would enable a machine to summarize a long article, or (more creatively) a whole bevy of texts on a given topic. Ideally it would even be able to sum up the arguments for and against a given issue on the basis of the contents. Tech solutions have not yet materialized, but expect to see more effort in this area soon. People have naturally tried to prune news articles to generate simple summaries small enough to fit a smartphone window, but there’s not much deep know-that or know-how in these apps.
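A minimal sketch of the shallow kind of summarizer those news-pruning apps rely on (extractive and frequency-based, with no deep know-that or know-how) might look like this in Python; the sample article is invented.

```python
import re
from collections import Counter

def summarize(text, n=1):
    # Naive extractive summarization: score each sentence by the frequency
    # of its words across the whole text, then keep the n highest-scoring
    # sentences in their original order. (Longer sentences score higher,
    # which is one of many reasons this is only a toy.)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
    )
    keep = sorted(ranked[:n])
    return " ".join(sentences[i] for i in keep)

article = (
    "Translation demand is growing. "
    "Machine translation handles much of that translation demand. "
    "The weather was pleasant."
)
print(summarize(article, n=1))
```

Note what is missing: no understanding of arguments for or against anything, just surface statistics, which is exactly the gap the text describes.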
Yet suddenly, language generation of this sort looks set to enjoy a rosier future. The reason is that the breakneck growth in big data propagation due to the combination of sensors, cloud storage, and connectedness is forcing businesses and other organizations to develop solutions that can produce useful knowledge automatically from the data tsunami. Decision-makers don’t have enough time to decode spreadsheets of numbers; they want simple, compelling reports that boil down the meaning of the data analytics into clear phrases that are maximally relevant to C-level people.
So-called robot-journalism is another emanation of the same technology: formulaic reports of sports events, stock market behavior, weather patterns, surveillance observations, election results, medical examinations and more can be produced automatically by collecting the data and transforming them into the sort of boiler-plate texts that back-office clerks or trainee journalists once used to produce. Eventually robot ‘writers’ will be able to watch videos and summarize their content for a human agent to read (the reverse of the robot cook). More tellingly for our community, these reports will be generated in various languages from a single base language – yet another example of how technology is now trying to capture know-how and apply it to data sets. The problem this will raise is that “machine” text will circulate more widely and be scraped from the web as a translation resource. Smart solutions may be required to ensure that machines also know about this potential quality issue!
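The core mechanism of such robot journalism can be sketched in a few lines: structured data in, boilerplate text out, with one template per language so that reports are generated in various languages from a single data record. The templates, field names, and match data below are invented for illustration.

```python
# Template-based data-to-text generation, robot-journalism style.
# One template per output language, filled from a single data record.
TEMPLATES = {
    "en": "{home} beat {away} {home_score}-{away_score} on {date}.",
    "fr": "{home} a battu {away} {home_score}-{away_score} le {date}.",
}

def report(match, lang="en"):
    # Pick the language's boilerplate and slot in the data.
    return TEMPLATES[lang].format(**match)

match = {
    "home": "Valletta", "away": "Birkirkara",
    "home_score": 2, "away_score": 1, "date": "18 May 2015",
}

print(report(match, "en"))  # Valletta beat Birkirkara 2-1 on 18 May 2015.
print(report(match, "fr"))
```

It is precisely this kind of machine-authored text, multiplied across languages and scraped back off the web, that raises the quality concern flagged above.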
It is not hard to see where this knowledge technology will go. By collecting lots of data about customers, marketers will be able to craft personalized messages and content for individual consumers, rapidly drafted using NLG straight into the preferred language of the target. Obviously much of the meaning that can be squeezed out of big data will be displayable in some enhanced visual form – maybe marketers will be able to have personalized videos shot automatically from data about products, people, weather, news, etc. using a database of stock footage, exactly as an NLG system assembles sentences. But many people will prefer readable content as well: short texts using familiar language adapted to their cultural profile.
"...the task ahead will be to collaborate and innovate, both sustainably and sometimes disruptively on top of the emerging infrastructure."
How should the industry react to cognitive technology and the shift to knowledge as a platform? Resistance would be a natural corporatist reflex. However, we suggest that the translation industry should invent new types of smart cooperation between translators and the machineries of knowledge management. As in many jobs in content-rich industries likely to be impacted by machine learning, the task ahead will be to collaborate and innovate, both sustainably and sometimes disruptively on top of the emerging infrastructure. We shall be eagerly watching out for new mergers and acquisitions, partnerships and the odd unicorn in this space in the years ahead!
Andrew Joscelyne has worked in the language technology industry for many years, as a journalist, consultant and analyst. He has (co)authored a number of surveys and reports on the state of the industry in Europe. He is currently an advisor with LT-Innovate.