New Advances in Spoken Language Translation

By: Zeeshan Ahmed, Éva Székely, and Dr João Cabral (Centre for Next Generation Localisation)

06 March 2013

Recent advances in spoken language translation research, including the use of facial expression recognition and expressive speech synthesis, are helping to produce more accurate and human-sounding translated speech output. 

Spoken language translation is the process of translating an utterance spoken in a source language into a different target language. The increasingly globalized and multilingual world is continuously intensifying the demand for effective and affective spoken language translation technology.

Spoken language translation is not only applicable to face-to-face communication between persons speaking different languages, but it is also required for the translation of online/offline multimedia content such as academic lectures, news broadcasts, movies, and dramas. This means that, besides the social benefit of facilitating interpersonal communication, there is a strong economic interest in the technology.

Thanks to the combined expertise of machine translation and speech synthesis specialists, instant translation of conversations is now possible - and the accuracy of machine-translated speech output continues to improve.

Think Sounds Rather Than Words
One important challenge in this technology is to reduce the extent to which the speech recognition interface, which generates text from speech, hinders the performance of the text translation between the two languages. An important development in this regard is to use a phonetic representation of the text to be translated instead of words. This enables us to use more sophisticated linguistic models to improve the translation quality, as we explain below.

Besides linguistic information, there is additional information that can be inferred from the speaker (their speech or video) which can be leveraged in a speech-to-speech translation application to better facilitate the communication process. A speech-to-speech translation system developed by CNGL uses information about the affective state (i.e. emotions or feelings) of the speaker, which is recognized by facial expression analysis and reproduced in the translated synthetic speech output. So, if I’m happy in English (as indicated by my smile), I will also sound happy in the German translation.

Increasing Effectiveness of Machine Translation Systems
The recent progress in machine translation (MT) technologies has paved the way for the research and development of effective spoken language translation systems. For example, the annual International Workshop on Spoken Language Translation (IWSLT), which is dedicated to in-depth research into spoken language translation technology and the evaluation of MT systems, has been growing in popularity and has tracked the challenges of this technology. Initially (from 2002 to 2010), the IWSLT evaluation campaign was conducted on fairly easy and restricted domains, such as spoken dialogs from the travel domain. More recently, increasingly complex translation domains have been considered, such as the translation of online video lectures and talks – demonstrating significant interest in, and improvement of, the core translation technology.

How Spoken Language Translation Works
Generally, a spoken language translation system is the integration of three components: Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech Synthesis (TTS). The first component is used to recognize text from speech in the source language, MT is used to translate the recognized sentence to the target language, and TTS transforms the translated sentence into an utterance.

The Cascade Model: Connecting Components Serially
The most popular architecture for spoken language translation is referred to as the cascade model. In this architecture, the automatic speech recognition, machine translation, and text-to-speech synthesis components are connected serially, and there is no information sharing between these components beyond passing each component's output as the next component's input. Almost all of the systems participating in the IWSLT evaluation campaigns for translation in large-vocabulary, unrestricted domains are based on this model (Federico et al. 2011; Federico et al. 2012).

The cascade model is flexible and permits a scalable approach to spoken language translation because the automatic speech recognition, machine translation, and speech synthesis components can be developed independently from each other. However, automatic speech recognition is prone to word recognition errors which propagate to the MT component and affect the translation quality. The improvement of the robustness of the translation system is therefore limited by the word-level output of the automatic speech recognition system.
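The serial data flow of the cascade model can be sketched in a few lines. The function names and toy dictionary lookups below are hypothetical stand-ins for real ASR, MT, and TTS engines; the point is that each stage sees only the previous stage's output, so a recognition error would flow straight through into the synthesized translation.

```python
# Minimal sketch of the cascade architecture: ASR -> MT -> TTS.
# The three components are stubbed with toy lookups; real systems
# would call an ASR engine, an MT decoder, and a TTS synthesizer.

def asr(audio):
    """Source-language speech -> recognized text (stub)."""
    return {"hello_wav": "hello world"}[audio]

def mt(text):
    """Source text -> target text (stub English -> German)."""
    return {"hello world": "hallo Welt"}[text]

def tts(text):
    """Target text -> synthetic speech (stub: returns a waveform label)."""
    return f"synthesized({text})"

def cascade_translate(audio):
    # Each stage only sees the previous stage's output, so any
    # recognition error propagates unchecked into the translation.
    return tts(mt(asr(audio)))

print(cascade_translate("hello_wav"))  # synthesized(hallo Welt)
```

Replacing the ASR stub's output with a misrecognized word would immediately break the MT lookup, which is exactly the error-propagation problem described above.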

The Alternative: Performing Translation and Speech Recognition in Parallel
Spoken language translation can also be performed using a tightly integrated (or “tightly coupled”) model, in which the recognition and translation are performed together. Typically, this model is implemented using a finite state transducer (FST) (e.g., Casacuberta et al. 2004). A finite state transducer is a type of “translating machine” which produces output while reading its input. The tightly integrated model has shown superior performance on restricted-domain, limited-vocabulary tasks, while the cascade model is preferred for more complex tasks, as highlighted by the experimental results of Casacuberta et al. (2004).
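To make the idea concrete, here is a toy finite state transducer in Python. The states and the two-word English-to-Spanish transitions are invented for illustration (they are not drawn from Casacuberta et al. 2004); real speech translation FSTs compose much larger recognition and translation transducers.

```python
# A toy finite state transducer (FST): it reads input symbols and
# emits output symbols while moving between states.

# transitions: (state, input_symbol) -> (next_state, output_symbol)
TRANSITIONS = {
    (0, "good"): (1, "buenos"),
    (1, "morning"): (2, "dias"),
}
FINAL_STATES = {2}

def transduce(symbols, start=0):
    state, output = start, []
    for sym in symbols:
        if (state, sym) not in TRANSITIONS:
            raise ValueError(f"no transition for {sym!r} in state {state}")
        state, out = TRANSITIONS[(state, sym)]
        output.append(out)
    if state not in FINAL_STATES:
        raise ValueError("input ended in a non-final state")
    return output

print(transduce(["good", "morning"]))  # ['buenos', 'dias']
```

Because a single machine both reads the source symbols and writes the target symbols, recognition and translation are inseparable in this model, which is the tight coupling discussed above.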

Combining the “Best of Both Worlds” to Produce Better-Quality Translation
Recently, CNGL researchers have proposed a new approach to spoken language translation (Ahmed et al. 2012; Jiang et al. 2011) which overcomes the limitations of the cascade model caused by the traditional word-level output of automatic speech recognition. This approach is termed “phonetic representation-based speech translation” (PRBST), and it uses a semi-tight coupling of an automatic speech recognition component and an MT component, as shown in Figure 1.

Figure 1: Phonetic Representation-based Speech Translation

Essentially, the main difference between the traditional cascade and the new PRBST model is that the latter uses automatic speech recognition to first transcribe a spoken utterance into a sequence of phones (speech sounds) instead of words, and then uses machine translation to translate directly from this phonetic form into a sentence in the target language.

The great advantage of PRBST is that the machine translation component now has access to fine-grained phonetic information, which permits further improvements in speech recognition with the help of machine translation technology.

Although the automatic speech recognition task is reduced to recognizing the phones of a language, the role of MT is expanded so that all the major linguistic analyses are performed during the translation process. This allows the application of sophisticated linguistic models (phrase-based, syntax-based, etc.) during translation from the phonetic form, which yields better recognition accuracy and translation quality than the word-level output of the cascade model.
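A minimal sketch of the PRBST idea might look as follows. The phone strings and the phone-level phrase table are invented for illustration; an actual system, per the cited papers, runs hierarchical phrase-based MT over segmentations of the phone sequence rather than a single whole-sequence lookup.

```python
# Sketch of PRBST: ASR emits phones, and the MT component maps
# phone phrases directly to target-language words.

def asr_phones(audio):
    """Speech -> phone sequence (stub, pseudo-ARPAbet labels)."""
    return {"thanks_wav": ["TH", "AE", "NG", "K", "S"]}[audio]

# Phrase table keyed on phone n-grams instead of source words.
PHONE_PHRASE_TABLE = {
    ("TH", "AE", "NG", "K", "S"): "danke",
}

def prbst_translate(audio):
    phones = tuple(asr_phones(audio))
    # A real decoder would search over alternative segmentations of
    # a phone lattice; here we do a single whole-sequence lookup.
    return PHONE_PHRASE_TABLE[phones]

print(prbst_translate("thanks_wav"))  # danke
```

Note how word boundaries never appear on the source side: deciding which phone spans form translatable units is left to the MT component, which is where the stronger linguistic models come into play.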

The new PRBST model has produced a relative improvement of as much as 9.38% in translation quality, as measured by the widely respected BLEU metric, compared to a baseline system based on a cascade model (Ahmed et al. 2012; Jiang et al. 2011).
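For readers unfamiliar with the metric, a "relative" improvement is computed against the baseline score, not as an absolute difference in BLEU points. The scores below are invented for illustration; only the 9.38% figure comes from the cited papers.

```python
# Relative improvement in BLEU: (new - baseline) / baseline.
# Example scores are invented; e.g. 32.0 -> 35.0 BLEU points is a
# 3-point absolute gain but a 9.38% relative improvement.
baseline_bleu = 32.0
prbst_bleu = 35.0

rel = (prbst_bleu - baseline_bleu) / baseline_bleu * 100
print(f"{rel:.2f}% relative improvement")  # 9.38% relative improvement
```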

Just like the cascade model, PRBST is applicable to large-vocabulary spoken language translation tasks. It also offers more flexibility than a finite state transducer to improve the speech recognition and translation parts independently, because the two components of PRBST are less tightly integrated. Thus, this novel approach combines the main advantages of the cascade model with those of the tightly integrated model.

Beyond Words: Translating Emotion into Speech Output
Another topic of interest in the field of speech-to-speech translation is transmitting paralinguistic information about the speaker through the translated synthetic speech output. This means considering the vocal (and sometimes non-vocal) signals beyond the basic verbal message.

An innovative method to address this problem makes use of automatic facial expression recognition in video, combined with expressive speech synthesis. In practice, this means that if the speaker smiles while saying a message to the system, the synthetic voice in the target language will come out sounding “cheerful”; if the speaker expresses anger through, for example, furrowed eyebrows, then the resulting translated synthetic speech will carry acoustic characteristics that can be identified by the speakers of the target language as “aggressive”, and so on.

A demonstrator system that uses this approach (FEAST – Facial Expression-based Affective Speech Translation) has been implemented to perform affective speech-to-speech translation from English to German (Székely et al. 2012; Ahmed et al. 2013). A listener evaluation has shown that, when considering four distinct facial expressions (happy, angry, sad, and neutral) mapped onto four voice styles (cheerful, aggressive, depressed, and neutral), the correct voice style was selected in two-thirds of cases. Future work is planned to improve these results through personalization and acoustic analysis of the input speech.
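The four-way mapping used in the FEAST evaluation can be written down directly from the description above. The fallback to the neutral style for unrecognized expressions is an assumption of this sketch, not a documented behavior of the system.

```python
# Mapping from recognized facial expression to synthetic voice style,
# as described for the FEAST demonstrator above.
EXPRESSION_TO_STYLE = {
    "happy": "cheerful",
    "angry": "aggressive",
    "sad": "depressed",
    "neutral": "neutral",
}

def select_voice_style(expression):
    # Assumption of this sketch: fall back to the neutral voice for
    # any expression outside the four evaluated classes.
    return EXPRESSION_TO_STYLE.get(expression, "neutral")

print(select_voice_style("happy"))  # cheerful
```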

Spoken language translation is a fascinating area that still presents a multitude of challenges - not least speaker-dependent variations and lack of adequate speech and text corpora in many of the world's languages. Across the world, multi-disciplinary teams will continue to strive for ever-more accurate and human-sounding machine-translated speech output to facilitate interpersonal communication, cross-cultural exchange, and global business. The integration of advances such as phonetic representation and facial affective information into speech translation systems will help to enhance the quality of communications with your global customers.

Ahmed, Zeeshan, Ingmar Steiner, Éva Székely, and Julie Carson-Berndsen. "A System for Facial Expression-based Affective Speech Translation." Proceedings of International Conference on Intelligent User Interfaces (IUI 2013). California, 2013.

Ahmed, Zeeshan, Jie Jiang, Julie Carson-Berndsen, Peter Cahill, and Andy Way. "Hierarchical Phrase-Based MT for Phonetic Representation-Based Speech Translation." Proceedings of Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA). San Diego, 2012.

Casacuberta, Francisco, et al. "Some approaches to statistical and finite-state speech-to-speech translation." Computer Speech and Language (Elsevier) 18 (2004): 25-47.

Federico, Marcello, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. "Overview of the IWSLT 2012 Evaluation Campaign." Proceedings of IWSLT 2012. Hong Kong, 2012.

Federico, Marcello, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. "Overview of the IWSLT 2011 Evaluation Campaign." Proceedings of IWSLT 2011. San Francisco, 2011.

Jiang, Jie, Zeeshan Ahmed, Julie Carson-Berndsen, Peter Cahill, and Andy Way. "Phonetic Representation-Based Speech Translation." Proceedings of 13th Machine Translation Summit. Xiamen, 2011.

Székely, Éva, Zeeshan Ahmed, Ingmar Steiner, and Julie Carson-Berndsen. "Facial expression as an input annotation modality for affective speech-to-speech translation." Proceedings of the International Workshop on Multimodal Analyses for Human Machine Interaction (MA3). Santa Cruz, 2012.

Zeeshan Ahmed and Éva Székely are PhD candidates and Dr João Cabral is a postdoctoral researcher with the Centre for Next Generation Localisation (CNGL) at University College Dublin, Ireland. CNGL is a collaborative academia-industry research center dedicated to delivering disruptive innovations in digital intelligent content, and to revolutionizing the global content value chain for enterprises, communities and individuals. www.cngl.ie

Zeeshan Ahmed is a PhD candidate with the Centre for Next Generation Localisation (CNGL) at University College Dublin. He received his BS degree in Computer Science from the University of Karachi. He completed his MS degree in Natural Language Processing (NLP) at Charles University in Prague and the University of Nancy, on an EU-funded Erasmus Mundus program. The focus of Zeeshan’s PhD work is the development of large-scale speech translation and understanding systems. Multi-modal human-computer interaction (where, in addition to text, audio and video modalities are also utilized in NLP applications) is another aspect of his PhD work. He has expertise in devising algorithmic solutions to problems in the field of NLP. He also has several years of software development experience at different IT companies.
Éva Székely is a PhD candidate with the Centre for Next Generation Localisation (CNGL) at University College Dublin. Her main research interests lie in developing methods to exploit the naturally occurring expressive space in speech data, applying machine learning techniques to acoustic features of speech, including voice quality. Furthermore, Éva’s research involves optimizing the application of expressive synthetic voices for human interaction, such as assistive technologies and speech-to-speech translation. In particular, she is working on the development of gesture-based intelligent interfaces for automatically choosing the right tone of voice for a given message. Éva is a member of the Special Interest Group on Speech & Language Processing for Assistive Technologies (SIG-SLPAT). She holds an MA degree in Speech and Language Technology from the University of Utrecht.
Dr João Cabral is a Postdoctoral Researcher with the Centre for Next Generation Localisation (CNGL) at University College Dublin. He received a PhD degree in Computer Science and Informatics from the University of Edinburgh in 2010. He has BSc and MSc degrees in Electrical and Computer Engineering from Instituto Superior Técnico (I.S.T.)/Technical University of Lisbon. He has more than ten years of research experience in text-to-speech synthesis and speech signal processing. His research interests also include machine learning, automatic speech recognition, glottal source modeling, and Computer-Assisted Language Learning (CALL).