Machine Learning (ML) begins to outperform humans in many tasks which seemingly require intelligence. The hype about ML makes it even into mass media. ML can read lips, recognizes faces, or transform speech to text. But when ML has to deal with the ambiguity, variety and richness of language, when it has to understand text or extract knowledge, ML continues to need human experts.
Knowledge is Stored as Text
The Web is certainly our greatest knowledge source. However, the Web has been designed for being consumed by humans, not by machines. The Web’s knowledge is mostly stored in text and spoken language, enriched with images and video. It is not a structured relational database storing numeric data in machine processable form.
Text is Multilingual
The Web is also very multilingual. Recent statistics show that surprisingly only 27% of the Web’s content is English and only 21% in the next 5 most used languages. That means more than half of its knowledge is expressed in a long tail of other languages.
Constraints of Machine Learning
ML faces some serious challenges. Even with today’s availability of hardware, the demand for computing power can become astronomical when input and desired output are rather fuzzy (see the great NYT article “The Great A.I. Awakening“).
ML is great for 80/20 problems, but it is dangerous in contexts with high accuracy needs: “Digital assistants on personal smartphones can get away with mistakes, but for some business applications the tolerance for error is close to zero”, emphasizes Nikita Ivanov, from Datalingvo, a Silicon Valley startup.
ML performs good on n-to-1 questions. For instance, in face recognition “all these pixel show which person?” has only one correct answer. However, ML is struggling in n-to-many or in gradual circumstances … there are many ways to translate a text correctly or express a certain piece of knowledge.
ML is only as good as its available relevant training material. For many tasks mountains of data are needed. And the data better be of supreme quality. For language related tasks these mountains of data are often required per language and per domain. Further, it is also hard to decide when the machine has learned enough.
Monolingual ML Good enough?
Some suggest why not process everything in English. ML does also an OK job at Machine Translation, like Google Translate. So why not translate everything into English and then lets run our ML algorithms? This is a very dangerous approach since errors multiply. If the output of an 80% accurate Machine Translation becomes the input to an 80% accurate Sentiment Analysis errors multiply to 64%. At that hit rate you are getting close to flipping a coin.
Human Knowledge to Help
The world is innovating constantly. Every day new products and services are created. To talk about them we continuously craft new words: the bumpon, the ribbon, a plug-in hybrid, TTIP ‒ only with the innovative force of language we can communicate new things.
Struggle with Rare Words
By definition new words are rare. They first appear in one language and then may slowly propagate into other domains or languages. There is no knowledge without these rare words, the terms. Look at a typical product catalog description with the terms highlighted. Now imagine this description without the terms – it would be nothing but a meaningless scaffold of fill-words.
Knowledge Training Required
At university we acquire the specific language, the terminology, of the field we are studying. We become experts in that domain. But even so, later in our professional career when we change jobs we still have to acquire the lingo of the new company: names of products, modules, services, but also job roles and their titles, names for departments, processes, etc. We get familiar with a specific corporate language by attending training, by reading policies, specifications, and functional descriptions. Machines need to be trained in the very same way with that explicit knowledge and language.
Multilingual Knowledge Systems Boost ML with Knowledge
There is a remedy: Terminology databases, enterprise vocabularies, word lists, glossaries – organizations usually already own an inventory of “their” words. This invaluable data can be leveraged to boost ML with human knowledge: by transforming these inventories into a Multilingual Knowledge System (MKS). An MKS captures not only all words in all registers in all languages, but structures them into a knowledge graph (a ‘convertible’ IS-A ‘car’ IS-A ‘vehicle’…, ‘front fork’ IS-PART of ‘frame’ IS-PART of ‘bicycle’).
It is the humanly curated Multilingual Knowledge System that enables ML and Artificial Intelligence solutions to work for specific domains with only small amounts of textual data and also for less resourced languages.