Language Technology has always been AI
By: Andrew Joscelyne (Senior Advisor) - LT-Innovate
27 September 2017
LT-Innovate, the business association for the language technology (LT) industry, is holding its annual Language Technology Industry Summit (LTI17) in Brussels from 9 to 11 October. In other words, slap bang in the middle of the current boom in Artificial Intelligence (AI) applications.
A year ago, AI would have been familiar to the cognoscenti but largely absent from the industry, business and consumer radar. Now it has taken near-center stage in driving what organizations are calling their digital transformation. It is even helping reinvent parts of industries such as healthcare, finance, insurance, and legal. What does this shift mean for the language technology agenda in particular, and how can we address it…err… intelligently?
The AI story is largely about using algorithmic technology to better leverage all that data that’s been piling up since the dawn of the database. For our purposes, AI is best embodied in Machine Learning (ML), a software engineering discipline that enables the learning of patterns to successfully automate such processes as speech recognition, machine translation, and image recognition and deliver results that are sometimes uncannily similar to human capabilities. That is why these LT applications, including summarization and dialogue modelling, are currently being repackaged for marketing purposes as AI components.
LT-Innovate partially anticipated this move back in 2013 by calling one of the three application pillars of the LT industry “intelligent content”. We considered that any computing process that renders digital text or spoken content more tractable to software treatment made that content more inherently “intelligent.” That covers generating, summarizing, analyzing, searching, and discovering rich information in any content by using natural language processing (NLP). In other words, these tools radically reduce the human effort typically needed to understand and use such content. The blooming of AI now positions this intelligence in a broader constellation of smart applications.
Perhaps the most palpable advantage of this new ML stage in tech development is the considerable number of researchers all over the world who are exploring the complex machine learning space, and testing which tools and algorithms appear to work better than others on which data sets. Some of this knowledge is currently being commoditized into open source kits that enable developers to use NLP, for example, to build simple chatbots or other applications in a couple of weeks. Other results could help avoid costly experiments in developing solutions that don’t really work. Yet all these advances are just the first glimmer of dawn of a potentially long day of extensive software and hardware fabrication.
For all that, we do not expect AI developments to radically change the way commercial translation is carried out in the immediate future, except that AI-type learning systems might help rationalize certain pipelines by providing new insights from all the quality analytics implemented in a fully-connected translation job. In other words, AI will probably impact management performance more than linguistic details. Nor will AI-improved speech recognition transform the interpretation business in the immediate future, even though remote interpretation, aided by the streamed speaker images that are so vital to an interpreter, will become increasingly common. Long-tail languages will also continue to suffer from data scarcity, making it harder to pull them into the neural MT dynamic, even though some ML-driven translation solutions are said to need less data. LTI17 will of course touch on a number of translation tech issues, ranging from improving terminology systems to automatically generating multilingual documents.
Focus on Verticals
LTI17 will also be looking at the next chapter of language technologies in this story: How, for example, will the combination of symbolic processing, which treats language as hierarchical words and phrases, and ML, which encodes language as numerical vectors, provide a more powerful, interpretable mechanism for building effective LT?
If this merger of methods proves feasible, then LT can start designing platforms for specific vertical industries that help end-users achieve their business and technical targets by using deep linguistic knowledge (semantics and syntax) to get the machine to learn more efficiently. During the conference, we will be looking in detail at how the defense, media monitoring, and finance/insurance industries in particular are benefitting from tailored solutions that could form part of their broader AI agenda.
Three Key Challenges
There are obviously multiple challenges facing the ML development path involving language technology as we know it, and LTI17 will no doubt reference many of these. Here, to start with, are three issues that will eventually need to be addressed by the LT industry.
1. Language neutrality. If the project to build the EU’s digital single market is to be fully executed, something like a “translate” layer needs to be built into the technology stack that underpins this economic network. Users can then engage naturally with the market in their own language. No one yet knows what such a layer will look like, but it should at least enable unrestrained linguistic interaction (speech and text) between consumers and e-tailers of all sorts via websites, bots, virtual assistants, cars, and other devices on a massive scale. Other countries such as India and South Africa (the two best-known highly multilingual sovereign states) will probably need to build successful language-neutral marketplaces in the same way. Working out exactly which mix of ML, semantic networks, and translation procedures can activate such a layer will be critical to their success.
2. Data markets. Yes, of course data are the fuel for both AI and language processing in general. In the LT space, language data are exploited for their inherent “languaginess,” not as time series or for their numerical or personal value. To ensure the effective availability of the right data to accelerate system building (e.g. in MT and speech applications), it will be necessary to develop better pipelines for data collection, cleaning and sharing, and build an ecosystem around appropriately sized and “domained” data sets and flows that can benefit the whole LT industry. We also need to find adequate solutions to overcome the legal strictures on the use of data mining that currently plague the data economy in the European Union and elsewhere.
3. Multimodal fusion. ML seems to be telling us that different modalities of data, e.g. text, speech, and image, can learn about each other in complex forms of entanglement across data types. This suggests that rich multimodal processing of all sorts (including digital video, sound tracks, conversations, and virtual reality sessions) could form the bedrock of a new generation of content management technologies. If this materializes, data from different modalities could be used to reinforce LT pipelines, and vice versa, to ultimately synthesize new kinds of knowledge and experience that humans will need to appreciate new types of engagement.
The etymology of the word “intelligence” is revealing: the Latin plugs inter (between) into legere (choose, pick out a word, hence read), itself derived from an Indo-European root *leg- (collect, gather). Reading and speaking were in fact imagined as “picking out words that have been gathered.” This suggests that intelligence was felt to be reading between the picked-out words, extracting the linked meanings. It all sounds rather like an AI process. Rather as Molière’s Monsieur Jourdain was happy to learn he’d always been speaking prose, perhaps language technology practitioners have always been doing AI without realizing it!
Join us at LTI17 in Brussels on 9-11 October for a unique learning, networking, and deal-making opportunity! Follow us on Twitter @LTInnovate #LTI17.