The world now gives us ample opportunity to test and use a wide range of tools for language analysis and understanding. With the rise of neural network techniques fueled by large-scale text corpora, we now have what it takes to develop just about anything. Yet the sheer power of the data is often not enough to build usable functionality that brings measurable gains in production scenarios.

Why is that? The data typically stores a multitude of real-life examples. These examples are drawn from different scenarios, different functionalities - one might even say from different “lives”. For instance, when we analyze the word “skin”, its typical interpretation is a noun referring to the outer tissue of an organism. This meaning prevails in linguistic data from the domains of biology and cosmetology, and it also dominates in general texts. In the narrow domain of taxidermy, however, the word “skin” is just as naturally used as a verb meaning “to remove the skin”. For that reason, when we build a language understanding module driven by broad linguistic data, we create something that works for everybody except taxidermists. And what if our main client is a taxidermist?

The problem described above arises from the fact that any natural language processing mechanism is forced to make a decision at some point. Being a soulless machine, it will always choose a 95% probability over a 5% one. The sad part is that the 5% is then forgotten. When the processing pipeline consists of multiple steps - tokenization, lemmatization, part-of-speech tagging - each module makes its own approximate, “95%” decision and discards all the alternatives.
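To make this concrete, here is a minimal sketch in Python of what such a hard-decision stage does. The probabilities and the pos_tag_scores function are invented for illustration, not taken from any real tagger:

```python
# A toy illustration of a hard-decision ("95%") pipeline stage.
# The scores and the domain_hint parameter are hypothetical.

def pos_tag_scores(token, domain_hint=None):
    """Toy scoring function: returns pseudo-probabilities P(tag | token)."""
    if token.lower() == "skin":
        if domain_hint == "taxidermy":
            # In taxidermy texts the verb reading is perfectly natural.
            return {"NOUN": 0.40, "VERB": 0.60}
        # In broad corpora the noun reading dominates.
        return {"NOUN": 0.95, "VERB": 0.05}
    return {"NOUN": 1.00}

def hard_decision(scores):
    """Arg-max step: keep the single best tag, forget every alternative."""
    return max(scores, key=scores.get)

print(hard_decision(pos_tag_scores("skin")))               # NOUN - the 5% verb reading is lost
print(hard_decision(pos_tag_scores("skin", "taxidermy")))  # VERB - only if the domain is known upfront
```

Once the arg-max step has run, the downstream modules never see the losing hypothesis again, no matter how relevant it may turn out to be.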

This is why at XTM International we introduced a different approach. The data collected by different language processing modules is aggregated and weighed on a virtual scale. This way we don’t miss a thing!
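The sketch below shows one way to picture this kind of late aggregation; it is a hedged illustration under our own assumptions, not XTM's actual implementation, and all module names, scores, and weights are hypothetical. Each module emits weighted hypotheses instead of a single answer, and the scores are only combined at the end, so low-probability readings survive until the final decision:

```python
from collections import defaultdict

# Hypothetical sketch: modules return weighted hypotheses, and the scores are
# combined ("weighed on a virtual scale") only at the very end.

def tagger_hypotheses(token):
    # A general-domain tagger strongly prefers the noun reading of "skin".
    return {("skin", "NOUN"): 0.95, ("skin", "VERB"): 0.05}

def domain_model_hypotheses(token, domain):
    # A domain model: in taxidermy texts the verb reading is far more likely.
    if domain == "taxidermy":
        return {("skin", "VERB"): 0.8, ("skin", "NOUN"): 0.2}
    return {("skin", "NOUN"): 0.8, ("skin", "VERB"): 0.2}

def aggregate(*hypothesis_sets, weights=None):
    """Combine weighted hypotheses from several modules; nothing is discarded early."""
    weights = weights or [1.0] * len(hypothesis_sets)
    combined = defaultdict(float)
    for w, hyps in zip(weights, hypothesis_sets):
        for hypothesis, score in hyps.items():
            combined[hypothesis] += w * score
    return max(combined, key=combined.get)

# For a taxidermy client, the domain evidence can outweigh the general model.
print(aggregate(tagger_hypotheses("skin"),
                domain_model_hypotheses("skin", "taxidermy"),
                weights=[1.0, 2.0]))   # ('skin', 'VERB')
```

The design choice illustrated here is simply to defer the decision: because every module's alternatives are kept and weighted, domain-specific evidence can still tip the scale at the end instead of being thrown away at the first step.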
 
