SPECOM'2006, St. Petersburg, 25-29 June 2006

Building a Slovenian-English Language Pair Speech-to-Speech Translation System

Jerneja Žganec Gros1, France Mihelič2, Mario Žganec1

1 Alpineon Research and Development, Alpineon d.o.o., Ljubljana, Slovenia
2 Laboratory of Artificial Perception, Systems and Cybernetics, Faculty of Electrical Engineering, University of Ljubljana, Slovenia
[email protected], [email protected]

Abstract

In this paper the design phases of the VoiceTRAN Communicator are presented. The system combines speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the contribution was to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slovenian-English language pair. The project represents a joint collaboration between several Slovenian research organizations that are active in human language technologies.

We provide an overview of the task and describe the system architecture and the individual servers. Further, we describe the language resources that were used and developed within the project. We conclude the paper with plans for the evaluation of the VoiceTRAN Communicator.

1. Introduction

Automatic speech-to-speech (STS) translation systems aim to facilitate communication among people who speak different languages [1,2,3]. Their goal is to generate a speech signal in the target language that conveys the linguistic information contained in the speech signal of the source language.

There are, however, major open research issues that challenge the deployment of natural and unconstrained speech-to-speech translation systems, even for very restricted application domains, since state-of-the-art automatic speech recognition and machine translation systems are far from perfect. Additionally, in comparison to written text, conversational spoken messages are often conveyed with imperfect syntax and casual, spontaneous speech. In practice, when building demonstration systems, STS systems are therefore typically implemented by imposing strong constraints on the application domain and on the type and structure of possible utterances, i.e. both on the range and on the scope of the user input allowed at any point of the interaction. Consequently, this compromises the flexibility and naturalness of using the system.

The VoiceTRAN Communicator is being built within a national Slovenian research project involving five partners: Alpineon, the University of Ljubljana (Faculty of Electrical Engineering, Faculty of Arts and Faculty of Social Studies), the Jožef Stefan Institute, and Amebis as a subcontractor. The project is co-funded by the Slovenian Ministry of Defense.

The aim of the project is to build a robust multimodal speech-to-speech translation communicator, similar to Phraselator [4] or Speechalator [5], able to translate simple sentences in the Slovenian-English language pair. The application domain is limited to common application scenarios that occur in peace-keeping operations on foreign missions, when the users of the system have to communicate with the local population. More complex phrases can be entered via the keyboard using a graphical user interface.

First, an overview of the VoiceTRAN system architecture is given. We continue by describing the individual server modules. We conclude the paper by discussing speech-to-speech translation evaluation methods and outlining plans for future work.

2. System architecture

The VoiceTRAN Communicator uses the DARPA Galaxy Communicator architecture [6]. The Galaxy Communicator open-source architecture was chosen to provide inter-module communication support, as its plug-and-play approach allows the interoperability of commercial and research software components. It was specially designed for the development of voice-driven user interfaces on a multimodal platform. The Galaxy Communicator is a distributed, message-based, hub-and-spoke infrastructure optimized for constructing spoken dialogue systems.

Figure 1: VoiceTRAN system architecture.


The VoiceTRAN Communicator is composed of a number of servers that interact with each other through the Hub, as shown in Fig. 1. The Hub is used as a centralized message router through which the servers communicate with one another. Frames containing keys and values are emitted by each server; they are routed by the Hub and received by a secondary server based on rules defined in the Hub script.

The VoiceTRAN Communicator consists of a Hub and five servers: the audio server, the graphical user interface, the speech recognizer, the machine translator and the speech synthesizer. Their roles are as follows:

Audio Server: receives speech signals from the microphone and sends them to the recognizer; sends synthesized speech to the speakers.

Graphical User Interface: receives input text from the keyboard; displays recognized source language sentences and translated target language sentences; provides user controls for handling the application.

Speech Recognizer: takes the signals from the Audio Server and maps the audio samples into text strings; produces an N-best sentence hypothesis list.

Machine Translator: receives N-best postprocessed sentence hypotheses from the Speech Recognition Server and translates them from the source language into the target language; produces a scored, disambiguated sentence hypothesis list.

Speech Synthesizer: receives rich, disambiguated word strings from the Machine Translation Server; converts the input word strings into speech and prepares them for the audio server.
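To make the frame-and-rule mechanism concrete, the following Python sketch mimics a Galaxy-style key-value frame and a Hub rule table. It is a minimal illustration under assumed conventions, not the actual Galaxy Communicator API, and all key and server names are hypothetical.

    # Hypothetical sketch of Galaxy-style key-value frames and Hub routing
    # rules; the real Galaxy Communicator API differs.
    from dataclasses import dataclass, field

    @dataclass
    class Frame:
        """A message emitted by a server: a name plus key-value pairs."""
        name: str
        keys: dict = field(default_factory=dict)

    # Hub script rules: when a frame carries the given key, the Hub routes
    # it to the listed destination server (all names are illustrative).
    HUB_RULES = {
        "audio_samples": "speech_recognizer",      # audio server -> recognizer
        "nbest_hypotheses": "machine_translator",  # recognizer -> translator
        "translated_text": "speech_synthesizer",   # translator -> synthesizer
        "synthesized_audio": "audio_server",       # synthesizer -> audio out
    }

    def route(frame: Frame) -> str:
        """Return the destination server for a frame, as the Hub would."""
        for key, destination in HUB_RULES.items():
            if key in frame.keys:
                return destination
        raise ValueError(f"no Hub rule matches frame {frame.name!r}")

    # Example: a recognizer result frame is routed to the machine translator.
    result = Frame("recognition_result", {"nbest_hypotheses": ["dober dan"]})
    assert route(result) == "machine_translator"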

There are two ways of porting modules into the Galaxy architecture: the first is to alter a module's code so that it can be incorporated into the Galaxy architecture; the second is to create a wrapper, or capsule, for the existing module, which then behaves as a Galaxy server. We opted for the second approach, since we also want to be able to test commercial modules. Minimal changes to the existing modules were required, mainly regarding input/output processing.
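The wrapper approach can be sketched as follows. This is a hypothetical illustration of encapsulating an unmodified module behind a frame-in/frame-out interface, not the real Galaxy server protocol; the class names, key names and the toy translation call are invented for the example.

    # Hypothetical sketch of the wrapper ("capsule") approach: an existing
    # module is adapted to frame-based messaging without changing its code.
    class ExistingTranslator:
        """Stands in for a pre-existing module with its own interface."""
        def translate(self, text: str) -> str:
            return text.upper()  # placeholder for real translation

    class GalaxyServerWrapper:
        """Adapts a module's native call to frame-in/frame-out messaging."""
        def __init__(self, module, in_key: str, out_key: str):
            self.module = module
            self.in_key = in_key
            self.out_key = out_key

        def handle(self, frame: dict) -> dict:
            # Unpack the module input from the incoming frame, call the
            # module, and repack the result so the Hub can route it on.
            result = self.module.translate(frame[self.in_key])
            return {self.out_key: result}

    server = GalaxyServerWrapper(ExistingTranslator(),
                                 in_key="source_text", out_key="translated_text")
    print(server.handle({"source_text": "dober dan"}))
    # -> {'translated_text': 'DOBER DAN'}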

A particular session is initiated by the user either through interaction with the graphical user interface (typed input) or through the microphone. The VoiceTRAN Communicator servers capture spoken or typed input from the user and return the servers' responses with synthetic speech, graphics and text. The server modules are described in more detail in the following subsections.

2.1. Audio server

The audio server connects to the microphone input and speaker output terminals of the host computer and performs the recording of user input and the playing of prompts or synthesized speech. Input speech captured by the audio server is automatically recorded to files for later system training.

In a compact version of the system that does not run in a distributed environment, e.g. for applications on embedded platforms, the two roles of the audio server can be divided between the speech recognizer and the speech synthesizer in order to increase the efficiency of the system by eliminating the audio server.

2.2. Speech recognizer

The speech recognition server receives the input audio stream from the audio server and provides at its output a word graph and a ranked list of candidate sentences, the N-best hypotheses list, which can include part-of-speech information generated by the language model.

The speech recognition server used in VoiceTRAN is based on the Hidden Markov Model recognizer developed by the University of Ljubljana [7]. It will be upgraded to perform large-vocabulary speaker-(in)dependent speech recognition on a wider application domain. A back-off class-based trigram language model was used.

Since the final goal of the project is a stand-alone speech communicator used by a specific user, the speech recognizer can be additionally trained and adapted to the individual user in order to achieve higher recognition accuracy, at least in one language.

A common speech recognizer output typically carries no information on sentence boundaries, punctuation and capitalization. Therefore, additional postprocessing in terms of punctuation and capitalization is performed on the N-best hypotheses list before it is passed to the machine translator.

A prosodic module was included not only to link the source language to the target language, but also to enhance speech recognition proper. Besides syntactic and semantic information, properties such as dialect, sociolect, sex and attitude are signaled by prosody [2]. Prosody information helps to determine proper punctuation and sentence accent information.
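A minimal sketch of such punctuation and capitalization restoration is given below. The two rules shown (capitalize the first word, choose "?" or "." based on a small list of Slovenian question words) are illustrative placeholders, not the rules actually used in VoiceTRAN.

    # Illustrative postprocessing of an N-best list: restore capitalization
    # and basic punctuation before the hypotheses reach the translator.
    # The rules below are toy placeholders, not VoiceTRAN's actual rules.
    QUESTION_WORDS = {"kdo", "kaj", "kje", "kdaj", "zakaj", "kako"}  # Slovenian wh-words

    def postprocess(hypothesis: str) -> str:
        words = hypothesis.split()
        if not words:
            return hypothesis
        words[0] = words[0].capitalize()                      # sentence-initial capital
        mark = "?" if words[0].lower() in QUESTION_WORDS else "."
        return " ".join(words) + mark                         # terminal punctuation

    def postprocess_nbest(nbest: list) -> list:
        # Each ASR hypothesis is processed separately and keeps its rank.
        return [postprocess(h) for h in nbest]

    print(postprocess_nbest(["kje je postaja", "kje je postaja prosim"]))
    # -> ['Kje je postaja?', 'Kje je postaja prosim?']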


2.3. Machine translator

The machine translator converts text strings from the source language into text strings in the target language. Its task is difficult, since the results of the speech recognizer convey spontaneous speech patterns and are often erroneous or ill-formed.

A postprocessing algorithm inserts basic punctuation and capitalization information before passing the target sentence to the speech synthesizer. The output string can also convey lexical stress information in order to reduce disambiguation efforts during text-to-speech synthesis.

A multi-engine approach was used that makes it possible to exploit the strengths and weaknesses of different MT technologies and to choose the most appropriate engine, or combination of engines, for the given task. Four different translation engines were applied in the system: we combined TM (translation memory), SMT (statistical machine translation), EBMT (example-based machine translation) and RBMT (rule-based machine translation) methods.

A bilingual aligned domain-specific corpus was used to build the TM and to train the EBMT and SMT phrase translation models. For SMT, an interlingua approach similar to the one described in [3] will be further investigated, and the promising directions pointed out in [8] are pursued.

The Presis translation system is used as our baseline system [9]. It is a commercial, conventional rule-based translation system that is constantly being optimized and upgraded. It was adapted to the application domain by upgrading its lexicon. Based on stored rules, Presis parses each sentence in the source language into grammatical components, such as subject, verb, object and predicate, and attributes the relevant semantic categories. It then uses built-in rules to convert these basic components into the target language, performs regrouping and generates the output sentence in the target language.

The speech recognition server sends each postprocessed sentence hypothesis from the N-best hypotheses list to the Hub as a separate token. All of the tokens for the various sentence hypotheses of a given utterance can be jointly considered by an MT postprocessing module after frame construction, which takes into consideration the well-formedness of each hypothesis, as evaluated by the machine translation server, along with a special criterion that looks for salient words, such as named entities. An early decision can be made when a hypothesis is perfect; otherwise the final decision is delayed until the last hypothesis has been processed.
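The selection logic just described might be sketched as follows. The well-formedness scores, the salient-word bonus and the early-decision threshold are hypothetical stand-ins for the criteria computed by the machine translation server; the named entities and weights are invented for the example.

    # Hypothetical sketch of selecting among translated N-best hypotheses:
    # each hypothesis carries a well-formedness score from the MT server,
    # plus a bonus for salient words (e.g. named entities). A perfect
    # hypothesis triggers an early decision.
    SALIENT_WORDS = {"Ljubljana", "NATO"}   # illustrative named entities
    PERFECT_SCORE = 1.0                     # assumed early-decision threshold

    def salience_bonus(sentence: str) -> float:
        return 0.1 * sum(word in SALIENT_WORDS for word in sentence.split())

    def select_translation(scored_hypotheses):
        """scored_hypotheses: iterable of (translation, wellformedness) pairs."""
        best, best_score = None, float("-inf")
        for translation, wellformedness in scored_hypotheses:
            if wellformedness >= PERFECT_SCORE:
                return translation           # early decision on a perfect hypothesis
            score = wellformedness + salience_bonus(translation)
            if score > best_score:
                best, best_score = translation, score
        return best                          # final decision after the last hypothesis

    print(select_translation([("the station is in Ljubljana", 0.7),
                              ("station is Ljubljana", 0.6)]))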

2.4. Speech synthesizer

The last step in a speech-to-speech translation task is the conversion of the translated utterance into its spoken equivalent. The input target-language sentence is equipped with lexical stress information at possibly ambiguous words.

The AlpSynth unit-selection text-to-speech system is used for this purpose [10]. It performs grapheme-to-phoneme conversion based on rules and a look-up lexicon, along with rule-based prosody modeling. It was further upgraded within the project towards better naturalness of the resulting synthetic speech. Domain-specific adaptations include new pronunciation lexica and the construction of a speech corpus of frequently used in-domain phrases. Other commercial off-the-shelf products are being tested as well.

We also explored how to pass a richer structure from the machine translator to the speech synthesizer. An input structure containing part-of-speech and lexical stress information resolves many ambiguities and can result in more accurate prosody prediction.

The speech synthesizer produces an audio stream for the utterance. The audio stream is finally sent to the speakers by the audio server. After the synthesized speech has been transmitted to the user, the audio server is freed up in order to continue listening for the next user utterance.

2.5. Graphical user interface

In addition to the speech user interface, the VoiceTRAN Communicator provides a simple, interactive, user-friendly graphical user interface, where input text in the source language can also be entered via a keyboard. A screenshot of the VoiceTRAN GUI is shown in Fig. 2. Recognized sentences in the source language are displayed along with their translated counterparts in the target language.

A push-to-talk button is provided to signal input voice activity, and a replay button serves to start a replay of the synthesized translated utterance. The translation direction can be changed by pressing the translation direction button.

Figure 2: Screenshot of the VoiceTRAN GUI. The source language text provided by the speech recognition module and the translated text in the target language are displayed.

3. Language resources

Some of the multilingual language resources needed to set up STS systems that include the Slovenian language are presented in [11].

For building the speech components of the VoiceTRAN system, existing speech corpora were used [12]. The language model was trained on a domain-specific corpus that has been collected and annotated within the project.

The AlpSynth pronunciation lexicon [10] was used for both speech recognition and text-to-speech synthesis. Speech synthesis is based on the AlpSynth speech corpus, which was expanded with the most frequent in-domain utterances.

For developing the initial machine translation component, the dictionary of military terminology [13] and various existing aligned parallel corpora were used [14,15,16].

3.1. Data collection efforts

The VoiceTRAN team participated in the annotation of a large in-domain Slovenian monolingual text corpus that has been collected at the Faculty of Social Studies, University of Ljubljana. This corpus was used for training the language model of the speech recognizer, as well as for inducing relevant multiword units (collocations, phrases and terms) for the domain.
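Multiword-unit induction of this kind is commonly based on association measures. The sketch below scores corpus bigrams with pointwise mutual information (PMI) as one standard choice; the paper does not specify the measure that was actually used, and the example tokens are invented.

    # Sketch of inducing candidate multiword units from a corpus by scoring
    # bigrams with pointwise mutual information (PMI). PMI is a standard
    # association measure; the project's actual method is not specified.
    import math
    from collections import Counter

    def pmi_bigrams(tokens, min_count=2):
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        scores = {}
        for (w1, w2), c in bigrams.items():
            if c < min_count:
                continue  # skip rare bigrams, which inflate PMI
            # PMI = log p(w1, w2) / (p(w1) * p(w2))
            scores[(w1, w2)] = math.log((c / n) /
                                        ((unigrams[w1] / n) * (unigrams[w2] / n)))
        return sorted(scores.items(), key=lambda kv: -kv[1])

    tokens = "check point alpha is near the check point bravo".split()
    print(pmi_bigrams(tokens)[:3])  # ('check', 'point') scores highest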


Within VoiceTRAN, another aligned bilingual in-domain corpus was collected. It consists of general and scenario-specific in-domain sentences. The compilation of such a corpus involves selecting and obtaining the digital originals of the bi-texts, re-coding them to XML TEI P4, sentence alignment, and word-level syntactic tagging and lemmatisation [17]. Such pre-processed corpora are then used to induce bilingual single-word and phrase lexica for the MT component, or serve as direct inputs for the SMT and EBMT systems. They also served for the additional training of the speech recognizer's language model.

4. Evaluation plans

The evaluation tests of a speech-to-speech translation system serve two purposes:
1. to evaluate whether we have improved the system by improving its individual components;
2. to test end-user acceptance of the system in field tests.

We have performed end-to-end translation quality tests both on manually transcribed input and on automatic speech recognition input. Human graders assessed the end-to-end translation performance, evaluating how much of the user input information was conveyed to the target language and also how well-formed the target sentences were. Back-translation evaluation experiments involving paraphrases were performed as well [18].

In addition, individual component tests were performed in order to select the most appropriate methods for each application server. Speech recognition was evaluated by computing standard word error rates (WER). For the machine translation component, subjective evaluation tests in terms of fluency and adequacy were chosen, as well as objective evaluation tests [19, 20], bearing in mind that objective evaluation methods assess translation quality in terms of the capacity of the system to mimic a reference text.
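WER is the word-level edit distance between the reference transcription and the recognizer output, normalized by the reference length. A minimal implementation of this standard metric is sketched below.

    # Standard word error rate: word-level Levenshtein distance between the
    # reference and the ASR hypothesis, divided by the reference length.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                     # deletions only
        for j in range(len(hyp) + 1):
            d[0][j] = j                     # insertions only
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("where is the station", "where is station"))  # 1 deletion -> 0.25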
and Kačič, Z., "Creating English. Slovenian language resources for development of speech- The concept of the VoiceTRAN Communicator to-speech translation components", In Proceedings of the implementation is discussed in this paper. The chosen system Fourth international conference on language resources architecture makes it possible to test a variety of server and evaluation, Lisbon, Portugal, 2004, pp. 1399-1402. modules. The end-to-end prototype was evaluated in and is [12] Mihelič, F., Žganec Gros, J., Dobrišek, S., Žibert, J. and ready for end user field trials in military exercises. Pavešić, N., "Spoken language resources at LUKS of the University of Ljubljana", International Journal on 6. Acknowledgements Speech Technologies, Vol. 6, No. 3, 2003, pp. 221-232. [13] Korošec, T. (2002). "Opravljeno je bilo pomembno The authors of the paper wish to thank the Slovenian Ministry slovarsko delo o vojaškem jeziku", Slovenska vojska,. of Defense and the Slovenian Research Agency for co- Vol. 10., No. 10., 2002, pp. 12-13, (in Slovenian). funding this work under contract no. M2-0019. [14] Erjavec, T., "The IJS-ELAN Slovene-English parallel corpus", International Journal on Corpus Linguistics, 7. References Vol. 7, No. 1, 2002, pp. 1-20. [15] Erjavec, T., "MULTEXT-East Version 3: Multilingual [1] Lavie, A., Waibel, A., Levin, L., Finke, M., Gates, D., Morphosyntactic Specifications, Lexicons and Corpora", Gavaldà, M., Zeppenfeld, T. and Zhan, P., "Janus-III: In Proceedings of the Fourth International Conference on Speech-to-Speech Translation in Multiple Languages", In Language Resources and Evaluation, LREC'04, Lisbon, Proceedings of the ICASSP, Munich, Germany, 1997, pp. Portugal, 2004, pp. 1535-1538. 99-102. [16] Erjavec, T., Ignat, C., Pouliquen, P. and Steinberger, R., [2] Wahlster, W., Verbmobil: Foundation of Speech-to- "Massive multi-lingual corpus compilation: Acquis Speech translation, Springer Verlag, 2000. Communautaire and totale", In Proceedings of the 2nd [3] Lavie, A., Metze, F., Cattoni, R., Costantin, E. &, Language and Technology Conference, April 21-23, Burger, S., Gates, D., Langley, C., Laskowski, K., Levin, Poznań, Poland, 2005. L., Peterson, K., Schultz, T., Waibel A., Wallace, D.,


[17] Erjavec, T. and Džeroski, S., "Machine Learning of Language Structure: Lemmatising Unknown Slovene Words", Applied Artificial Intelligence, Vol. 18, No. 1, 2004, pp. 17-41.
[18] Rossato, S., Blanchon, H. and Besacier, L., "Speech-to-Speech Translation System Evaluation: Results for French for the NESPOLE! Project First Showcase", In Proceedings of the ICSLP, Denver, CO, 2002.
[19] NIST, "NIST MT Evaluation Kit Version 11a", available at http://www.nist.gov/speech/tests/mt, 2002.
[20] Akiba, Y., Federico, M., Kando, N., Nakaiwa, H., Paul, M. and Tsujii, J., "Overview of the IWSLT04 Evaluation Campaign", In Proceedings of the International Workshop on Spoken Language Translation, Kyoto, Japan, 2004, pp. 1-9.
