Unlocking Language Resource Assets
By: Christian Galinski (Infoterm)
15 December 2015
The language industry has seen immense growth that shows no sign of slowing, but tool developers are facing a roadblock when attempting to connect with much-needed language resources. Why are some players swimming in abundant resources while others struggle? Author Christian Galinski demystifies this challenge and introduces new opportunities for leveraging untapped language resources.
As is widely known, the language industry – covering language technology tools (LTT), language services (LS), and language resources (LR) – is a field that has seen high double-digit growth in recent years, a trend that is set to continue and even accelerate.
Language service providers (LSPs) need LTTs and LRs in order to provide increasingly efficient services to both the marketplace and to the public sector. LTT developers need LRs in order to develop appropriate technology solutions.
However, in spite of this highly positive general trend, there is an underlying paradox (depending on the language used or needed): some players seem to be swimming in an abundance of LRs, while others struggle to find the ‘right’ LRs for their intended purposes. How can we resolve this issue?
The Language Technology Observatory
A new EU project called “Language Technology Observatory” (LTO) has been launched, tasked with identifying and inventorying existing LRs in any language of the EU, and evaluating them from a usability perspective with a view to supporting and informing a more diversified community of MT users.
Although the main focus is on LRs needed to support machine translation (MT), the LTO’s interest in LRs is not confined to text corpora. In general, there appears to be a fundamental disconnect between the theoretical availability of millions of LRs filed away in academic and other repositories, and the more industry-sensitive need for the optimum usability (access, relevance, interoperability) of any putative resource.
There have been numerous R&D projects in Europe and elsewhere over the years focused on mono-, bi-, or multilingual LRs, mostly in conjunction with the development of language technologies and tools. Some of these have created resources for purposes other than translation, such as in the fields of e-health or e-commerce. This means that there are large-scale repositories of LRs intended for other data purposes that could be reused as material for LTT development. Last but not least, many large-scale industries – especially in tandem with the behemoths among Internet Service Providers (ISPs) – make use of large amounts of (proprietary) LRs for internal purposes or for rendering efficient services.
So why another project on this topic? We must address the roadblocks along the LR boulevard that are still hampering the smoother development of language technology data solutions.
Bumps on the Road to Efficiency
The fact is that big LSPs or LT developers almost certainly have a very large set of LRs in multiple languages at their fingertips. For individuals, small expert groups, and SMBs, however, the situation is very different:
- This community is prevented from accessing large-scale LRs because they are proprietary and/or confidential or require high membership/subscription fees.
- LRs may have been developed for certain purposes in formats which do not make them easily reusable.
- For many distinct language communities, LRs are fairly rare or do not exist in an easily processable form.
For these potential users, the field of LRs – with a few exceptions – looks like a desert. This leads to a situation where a large number of potential users or technology innovators face enormous difficulties in:
- Tracking down LRs and evaluating whether they are fit for purpose.
- Converting any existing LRs into the most usable form for their purposes.
- Transforming existing, but not yet digitized LRs into accessible digital resources.
Furthermore, the smaller the language community targeted by LT developers, the harder it is to find LRs. Needless to say, this poses a major barrier not only to potential LT users and innovators in their own development, but also to addressing the “Digital Single Market” (DSM) in Europe in general. Europe needs advanced communication and information technologies that are able to process spoken and written language and overcome language barriers in a fast, robust, reliable and ubiquitous way. And for this it needs appropriate LRs – especially for the low-speaker-density languages.
There is a second area where much work has been done but which is still a roadblock to translation automation in the most general sense: terminology. Some people thought that terminology issues would disappear once translation memories and parallel corpora became popular. But the fact is that terminology is a resource that changes all the time and is, in certain technical cases, hard to find – especially in 24 official languages. That is one of the reasons why Sue Ellen Wright likes to quote her colleague Kelly Washbourne: “Terminology management: There is unfortunately no cure for terminology; you can only hope to manage it.” Do we manage terminology effectively?
Recent studies estimate the volume of terminological entities across all sciences and subject fields at 150 million or more. In May 2011, the number of identifiable chemical substances hit 60 million (according to the Chemical Abstracts Service, CAS). But do nomenclatures belong to terminology? What about domain-specific taxonomies, thesauri, ontologies, product master data and other kinds of factual data, bilingual dictionaries of specialized lexicography etc.? Not to mention proper names of all sorts, which may take different forms in other languages, may have abbreviated forms just as long terms do, and often occur as elements in technical terms. ‘Terminology’ clearly needs to be seen in a wider perspective.
In terminology work, a ‘designation’ – mostly referring to ‘terms’ – is a representation of a ‘concept’ by a sign which denotes it. But in ISO 10241-1:2011 two notes are added to the definition:
- “In terminology work three types of designation are distinguished: terms, symbols and appellations.”
- “Designations can be verbal or non-verbal or a combination thereof.”
Thus a new dimension to ‘terminology’ is added – and there may be more dimensions on the Internet. For instance, the definition of ‘microcontent’ as “a more general term indicating content that conveys one primary idea or concept, is accessible through a single definitive URL or permalink, and is appropriately written and formatted for presentation in email clients, web browsers, or on handheld devices as needed”* hints at another extension of semantically structured content (or structured data) in the direction of ‘terminology and other language and content resources’ (TLCR).
Common to all is their vertical orientation with a focus on one or a few uses, which reduces the potential for interoperability and re-usability. Hardly any of them indicate the degree of reliability of the data, which poses a quality problem: a re-user might become liable for consequences arising from using individual entities of a TLCR without thorough checking. Probably several tens of thousands of TLCR exist on the Internet, but they are difficult to trace. In view of this multitude of TLCR on the Internet, ‘one-stop-shop’ access to such resources – such as the Online Browsing Platform of the International Organization for Standardization (ISO) – is rare. So there are quantitative and qualitative issues impeding easy access to and re-use of TLCR.
*A day’s weather forecast, the arrival and departure times for an airplane flight, an abstract from a long publication, or a single instant message can all be examples of microcontent.
What does all this teach us?
- Terminology is part of a zoo of quite similar animals; different scientific traditions, approaches and applications form cages preventing the recognition of similarities and the construction of truly generic approaches.
- Terminology which was and still is difficult to trace in texts may well be around in different guises: from traditional data banks with their terminological entries via new kinds of collections of structured content/data, down to communication platforms for experts, such as specialized blogs, etc.
- Tools for handling all kinds of TLCR exist, but what may be missing is the right (adaptation and) combination of them in order to access and re-use TLCR on the Internet.
TLCR are ‘big data’ in themselves. In this connection, terminology (understood in a broader sense and properly designed and managed) would have an additional potential: terminological entries can:
- Serve as a means for semantically structuring all kinds of TLCR.
- Point exactly to semantically meaningful contexts – whether in texts or other data resources.
- Pave the way, through their language-independent approach, for making other TLCR multilingual.
Thus they could greatly facilitate the disambiguation of TLCR entities, which would ease the task of a human translator as well as facilitate MT.
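To make the idea of concept-oriented, language-independent entries more concrete, here is a minimal sketch in Python. It is purely illustrative and assumes a simplified data model: the class names, field names, concept identifiers and the two sample entries are hypothetical and are not taken from the LTO project or from any ISO standard. The only element borrowed from the text above is the distinction between terms, symbols and appellations as types of designation.

```python
from dataclasses import dataclass, field
from enum import Enum


class DesignationType(Enum):
    # The three types of designation named in the ISO 10241-1:2011 note.
    TERM = "term"
    SYMBOL = "symbol"
    APPELLATION = "appellation"


@dataclass(frozen=True)
class Designation:
    text: str
    language: str  # language code such as "en" or "de" (assumed convention)
    kind: DesignationType = DesignationType.TERM


@dataclass
class ConceptEntry:
    # One entry per concept; designations in any language attach to it.
    concept_id: str  # hypothetical identifier
    domain: str      # hypothetical subject-field label
    designations: list[Designation] = field(default_factory=list)

    def in_language(self, language: str) -> list[str]:
        """Return all designations of this concept in the given language."""
        return [d.text for d in self.designations if d.language == language]


def find_concepts(entries, text, language):
    """Return every entry that has `text` as a designation in `language`."""
    needle = text.casefold()
    return [
        e for e in entries
        if any(d.language == language and d.text.casefold() == needle
               for d in e.designations)
    ]


# Two invented entries sharing the same English term.
entries = [
    ConceptEntry("C1", "construction",
                 [Designation("crane", "en"), Designation("Kran", "de")]),
    ConceptEntry("C2", "zoology",
                 [Designation("crane", "en"), Designation("Kranich", "de")]),
]

# The term alone is ambiguous; the subject field narrows it to one concept.
candidates = find_concepts(entries, "crane", "en")
chosen = [e for e in candidates if e.domain == "construction"][0]
german = chosen.in_language("de")
```

In this sketch the English term matches two concepts, and only the subject field decides which German designation applies. A real resource would also need reliability indicators, sources and definitions, which are deliberately left out here.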
Toward a Solution
The LTO project, therefore, will take a fresh look at the LR requirements of a new generation of MT users in order to propose more practical solutions for accessing relevant resources. It is starting with a number of scenarios:
- For ‘horizontal’ uses of MT and LRs more or less universal to many different applications.
- For ‘vertical’ uses within one industry (e.g. construction or finance) or for specific services (eProcurement, media, etc.).
In these scenarios, users will be classified as stakeholders in an industry or as actors in a workflow or value chain.
In addition, new applications and sources of LRs will be investigated, such as websites, blogs, apps, Q&A systems and the like. Many of these, especially if they are bi- or multilingual, could provide a source of LRs and/or need LRs to help in the localization process, usually from a limited set of domains for dedicated MT solutions. At the same time, operations such as testing and quality control, production, post-editing, and delivery will be common to all solutions.
“Usability” here plays a key role in several ways:
- LRs should be reusable without requiring too much adaptation.
- The access-for-use process should be as user-friendly as possible (including the quality of the system documentation).
- Training for new users should be as generic and shareable as possible.
It may well be that the workflows identified will differ from standard LSP type workflows, but LSPs should also benefit from the above-mentioned approach, which is designed to “democratize” LR access and use.
Call for Contributions
As LTO is geared towards providing solutions for non-specialist users and small and/or under-privileged language communities, we would like to hear from anyone in the community with ideas and suggestions, especially with respect to:
- Non-standard technologies and LRs supporting the development or enhancement of MT systems.
- Atypical LRs which could nevertheless be reused for MT tomorrow.
- Unorthodox approaches to managing, evaluating or improving LRs.
- Efficiently adapting MT engines using ‘small’ LR assets for low-density languages.
A first workshop will take place on 25 June 2015 in the framework of the LT-Innovate Summit. Interested parties are invited to participate.
For more information and/or contributions, please contact us at [email protected]
Christian Galinski is the CEO of Infoterm.