Initial Considerations in Building a Speech-to-Speech Translation System for the Slovenian-English Language Pair

J. Žganec Gros1, A. Mihelič1, M. Žganec1, F. Mihelič2, S. Dobrišek2, J. Žibert2, Š. Vintar2, T. Korošec2, T. Erjavec3, M. Romih4

1Alpineon R&D, Ulica Iga Grudna 15, SI-1000 Ljubljana, Slovenia
2University of Ljubljana, SI-1000 Ljubljana, Slovenia
3Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
4Amebis, Bakovnik 3, SI-1241 Kamnik, Slovenia
E-mail: [email protected]

Abstract. The paper presents the design concept of the VoiceTRAN Communicator, which integrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences for the Slovenian-English language pair. The project represents a joint collaboration between several Slovenian research organizations that are active in human language technologies. We provide an overview of the task and describe the system architecture and the individual servers. Further, we describe the language resources that will be used and developed within the project. We conclude the paper with plans for the evaluation of the VoiceTRAN Communicator.

1. Introduction

Automatic speech-to-speech (STS) translation systems aim to facilitate communication among people who speak different languages (Lavie et al., 1997), (Wahlster, 2000), (Lavie et al., 2002). Their goal is to generate a speech signal in the target language that conveys the linguistic information contained in the speech signal in the source language.

There are, however, major open research issues that challenge the deployment of natural and unconstrained speech-to-speech translation systems, even for very restricted application domains, because state-of-the-art automatic speech recognition and machine translation systems are far from perfect. Additionally, in comparison to written text, conversational spoken messages are often conveyed with imperfect syntax and casual spontaneous speech. In practice, demonstration STS systems are therefore typically implemented by imposing strong constraints on the application domain and on the type and structure of possible utterances, i.e. on both the range and the scope of the user input allowed at any point of the interaction. Consequently, this compromises the flexibility and naturalness of using the system.

The VoiceTRAN Communicator is being built within a national Slovenian research project involving five partners: Alpineon, the University of Ljubljana (Faculty of Electrical Engineering, Faculty of Arts and Faculty of Social Studies), the Jožef Stefan Institute, and Amebis as a subcontractor. The project is co-funded by the Slovenian Ministry of Defense.

The aim of the project is to build a robust multimodal speech-to-speech translation communicator, similar to Phraselator (Sarich, 2001) or Speechalator (Waibel, 2003), able to translate simple sentences for the Slovenian-English language pair. The application domain is limited to common scenarios that occur in peace-keeping operations on foreign missions, when the users of the system have to communicate with the local population. More complex phrases can be entered via the keyboard using a graphical user interface.
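Constrained-domain devices such as Phraselator work essentially by phrase lookup. As a rough illustration of that approach (not the VoiceTRAN design; the phrase table, the ASCII-simplified Slovenian keys and the exact-match strategy are invented for this sketch):

```python
from __future__ import annotations

# Minimal sketch of constrained-domain phrase translation in the spirit of
# one-way phrase translators such as Phraselator. The tiny phrase table and
# the exact-match lookup are illustrative assumptions only.

PHRASE_TABLE = {
    "kje je bolnisnica": "where is the hospital",
    "potrebujemo zdravnika": "we need a doctor",
    "prosim pokazite dokumente": "please show your documents",
}

def normalize(utterance: str) -> str:
    """Lowercase and drop punctuation so spoken input matches table keys."""
    kept = "".join(ch for ch in utterance.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def translate_phrase(utterance: str) -> str | None:
    """Return the stored target phrase, or None for out-of-domain input."""
    return PHRASE_TABLE.get(normalize(utterance))
```

Anything outside the stored phrase inventory is simply rejected, which is exactly the loss of flexibility and naturalness described above.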

EAMT 2005 Conference Proceedings

2. System architecture

The VoiceTRAN Communicator uses the DARPA Galaxy Communicator architecture (Seneff et al., 1998). The Galaxy Communicator open-source architecture was chosen to provide inter-module communication support, as its plug-and-play approach allows interoperability of commercial and research software components. It was specially designed for the development of voice-driven user interfaces on a multimodal platform. It is a distributed, message-based, hub-and-spoke infrastructure optimized for constructing spoken dialogue systems.

The VoiceTRAN Communicator is composed of a number of servers that interact with each other through the Hub, as shown in Figure 1. The Hub is used as a centralized message router through which the servers can communicate with one another. Frames containing keys and values are emitted by each server; they are routed by the Hub and received by a secondary server based on rules defined in the Hub script.

Figure 1. VoiceTRAN system architecture

The VoiceTRAN Communicator consists of the Hub and five servers: the audio server, the graphical user interface, the speech recognizer, the machine translator and the speech synthesizer:

Audio Server: Receives speech signals from the microphone and sends them to the recognizer. Sends synthesized speech to the speakers. Receives input text from the keyboard.

Graphical User Interface: Displays recognized source language sentences and translated target language sentences. Provides user controls for handling the application.

Speech Recognizer: Takes the signals from the audio server and maps audio samples into text strings. Produces an N-best sentence hypothesis list.

Machine Translator: Receives the N-best postprocessed sentence hypotheses from the speech recognition server and translates them from the source language into the target language. Produces a scored, disambiguated sentence hypothesis list.

Speech Synthesizer: Receives rich, disambiguated word strings from the machine translation server. Converts the input word strings into speech and prepares them for the audio server.

There are two ways of porting modules into the Galaxy architecture: the first is to alter a module's code so that it can be incorporated into the Galaxy architecture; the second is to create a wrapper or capsule for the existing module, which then behaves as a Galaxy server. We opted for the second option, since we want to be able to test commercial modules as well. Minimal changes to the existing modules were required, mainly regarding input/output processing.

A particular session is initiated by a user either through interaction with the graphical user interface (typed input) or through the microphone. The VoiceTRAN Communicator servers capture spoken or typed input from the user and return the servers' responses as synthetic speech, graphics and text. The server modules are described in more detail in the next subsections.

2.1. Audio Server

The audio server connects to the microphone input and speaker output terminals on the host computer and performs recording of user input and playing of prompts or synthesized speech. Input speech captured by the audio server is automatically recorded to files for later system training.

2.2. Speech Recognizer

The speech recognition server receives the input audio stream from the audio server and provides at its output a word graph and a ranked list of candidate sentences, the N-best hypotheses list, which can include part-of-speech information generated by the language model.
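The ranked recognizer output described above might be represented roughly as follows (a hypothetical data layout; the words, scores and part-of-speech tags are invented for illustration):

```python
# Illustrative layout for recognizer output: a ranked N-best list whose
# hypotheses carry a score and optional part-of-speech tags supplied by the
# language model. Words, scores and tags are invented examples.

from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    words: list
    score: float                          # combined acoustic + LM log score
    pos_tags: list = field(default_factory=list)

def best_hypothesis(nbest):
    """Pick the top-scoring hypothesis (higher log score is better here)."""
    return max(nbest, key=lambda h: h.score)

nbest = [
    Hypothesis(["where", "is", "the", "station"], -12.3,
               ["WRB", "VBZ", "DT", "NN"]),
    Hypothesis(["where", "is", "a", "station"], -14.1),
]
```

In the actual system the whole ranked list, not only the top entry, is forwarded so that later modules can reconsider the ranking.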

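The frame-based routing through the Hub described at the start of this section can be sketched as follows. This is a simplified illustration, not the actual Galaxy Communicator API: frames are modeled as plain key-value dictionaries, the "Hub script" as a list of predicate/destination rules, and the two server stubs are invented stand-ins.

```python
# Simplified sketch of hub-and-spoke frame routing in the style of the
# Galaxy Communicator architecture. The real Galaxy API and Hub-script
# syntax differ; this only illustrates the message flow.

class Hub:
    def __init__(self):
        self.servers = {}  # server name -> handler callable
        self.rules = []    # (predicate over a frame, destination server name)

    def register(self, name, handler):
        self.servers[name] = handler

    def add_rule(self, predicate, destination):
        self.rules.append((predicate, destination))

    def route(self, frame):
        """Deliver the frame to the first server whose rule matches it."""
        for predicate, destination in self.rules:
            if predicate(frame):
                return self.servers[destination](frame)
        raise ValueError(f"no routing rule matches frame: {frame!r}")

# Hypothetical wiring: recognizer and translator stubs standing in for servers.
hub = Hub()
hub.register("recognizer",
             lambda frame: {"type": "nbest", "hyps": ["kje je postaja"]})
hub.register("translator",
             lambda frame: {"type": "target", "text": "where is the station"})
hub.add_rule(lambda frame: frame.get("type") == "audio", "recognizer")
hub.add_rule(lambda frame: frame.get("type") == "nbest", "translator")
```

A frame emitted by the audio server (here `{"type": "audio", ...}`) is routed to the recognizer stub, and the recognizer's reply frame is in turn routed to the translator stub, mirroring the relay role the Hub plays between servers.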

The speech recognition server used in VoiceTRAN is based on the hidden Markov model recognizer developed by the University of Ljubljana (Dobrišek, 2001). It will be upgraded to perform large-vocabulary speaker-(in)dependent speech recognition on a wider application domain. A back-off class-based trigram language model will be used. Given the limited amount of training data, the parameters of the models will be carefully chosen in order to achieve maximum performance. Later in the project we want to test other speech recognition approaches, with an emphasis on robustness, processing time, footprint and memory requirements.

Since the final goal of the project is a stand-alone speech communicator used by a specific user, the speech recognizer can be additionally trained and adapted to the individual user in order to achieve higher recognition accuracy in at least one language.

A common speech recognizer output typically carries no information on sentence boundaries, punctuation or capitalization. Therefore, additional postprocessing in terms of punctuation and capitalization will be performed on the N-best hypotheses list before it is passed to the machine translator.

The inclusion of a prosodic module will be investigated, both in order to link the source language to the target language and to enhance speech recognition proper. Besides syntactic and semantic information, properties such as dialect, sociolect, sex, attitude etc. are signaled by prosody (Wahlster, 2000). The degree of linguistic information conveyed by prosody varies between languages, from languages such as English, with a relatively low degree of prosodic disambiguation, via tone-accent languages such as Swedish, to pure tone languages (Eklund et al., 1995). Prosody information will help to determine proper punctuation and sentence accent information.

2.3. Machine Translator

The machine translator converts text strings in the source language into text strings in the target language. Its task is difficult, since the results of the speech recognizer convey spontaneous speech patterns and are often erroneous or ill-formed.

A postprocessing algorithm inserts basic punctuation and capitalization information before passing the target sentence to the speech synthesizer. The output string can also convey lexical stress information in order to reduce disambiguation efforts during text-to-speech synthesis.

A multi-engine approach will be used in the early phase of the project; it makes it possible to exploit the strengths and weaknesses of different MT technologies and to choose the most appropriate engine, or combination of engines, for the given task. Four different translation engines will be applied in the system: we will combine TM (translation memory), SMT (statistical machine translation), EBMT (example-based machine translation) and RBMT (rule-based machine translation) methods. A simple approach to selecting the best translation from all the outputs will be applied.

A bilingual aligned domain-specific corpus will be used to build the TM and to train the EBMT and SMT phrase translation models. In SMT, an interlingua approach similar to the one described in (Lavie et al., 2002) will be investigated, and the promising directions pointed out in (Ney, 2004) will be pursued.

The Presis translation system will be used as our baseline system (Romih et al., 2002). It is a commercial conventional rule-based translation system that is constantly being optimized and upgraded. It will be adapted to the application domain by upgrading its lexicon. Based on stored rules, Presis parses each sentence in the source language into grammatical components, such as subject, verb, object and predicate, and attributes the relevant semantic categories. It then uses built-in rules to convert these basic components into the target language, performs regrouping and generates the output sentence in the target language.

The speech recognition server sends each postprocessed sentence hypothesis from the N-best hypotheses list to the Hub as a separate token. All the tokens for the various sentence hypotheses of a given utterance can be jointly considered by an MT postprocessing module after frame construction, which takes into consideration the well-formedness of each hypothesis, as evaluated by the machine translation server, along with a special criterion that looks for salient words, such as named

entities. Early decisions can be made when a hypothesis is perfect; otherwise the final decision is delayed until the last hypothesis has been processed.

2.4. Speech Synthesizer

The last part of a speech-to-speech translation task is the conversion of the translated utterance into its spoken equivalent. The input target text sentence is equipped with lexical stress information at possibly ambiguous words.

The AlpSynth unit-selection text-to-speech system is used for this purpose (Žganec Gros et al., 2004). It performs grapheme-to-phoneme conversion based on rules and a look-up dictionary, as well as rule-based prosody modeling. It will be further upgraded within the project towards better naturalness of the resulting synthetic speech. Domain-specific adaptations will include new pronunciation lexica and the construction of a speech corpus of frequently used in-domain phrases. Other commercial off-the-shelf products will be tested as well.

We will also explore how to pass a richer structure from the machine translator to the speech synthesizer. An input structure containing part-of-speech and lexical stress information resolves many ambiguities and can result in more accurate prosody prediction.

The speech synthesizer produces an audio stream for the utterance. The audio stream is finally sent to the speakers by the audio server. After the synthesized speech has been transmitted to the user, the audio server is freed up in order to continue listening for the next user utterance.

2.5. Graphical User Interface

In addition to the speech user interface, the VoiceTRAN Communicator provides a simple, interactive, user-friendly graphical user interface where input text in the source language can also be entered via a keyboard. Recognized sentences in the source language are displayed along with their translated counterparts in the target language. A push-to-talk button is provided to signal input voice activity, and a replay button serves to start a replay of the synthesized translated utterance. The translation direction can be changed by pressing the translation direction button.

3. Language Resources

Some of the multilingual language resources needed to set up STS systems that include the Slovenian language are presented in (Verdonik et al., 2004).

For building the speech components of the VoiceTRAN system, existing speech corpora will be used (Mihelič et al., 2003). We do not have an actual in-domain speech database for the chosen application domain. However, as demonstrated by Lefevre (Lefevre et al., 2001) and others, out-of-domain speech training data do not cause significant degradation of system performance. Since the speech corpora have been collected from different sources, adaptations will be carried out (Tsakalidis et al., 2005). The language model will be trained on a domain-specific corpus that is being collected and annotated within the project.

The AlpSynth pronunciation lexicon (Žganec Gros et al., 2004) will be used for both speech recognition and text-to-speech synthesis. Speech synthesis will be based on the AlpSynth speech corpus, which will be expanded with the most frequent in-domain utterances.

For developing the initial machine translation component, the dictionary of military terminology (Korošec, 2002) and various existing aligned parallel corpora will be used: (Erjavec, 2002) and (Erjavec et al., 2005).

4. Data Collection Efforts

The VoiceTRAN team will participate in the annotation of a large in-domain Slovenian monolingual text corpus that is being collected at the Faculty of Social Studies, University of Ljubljana. This corpus will be used for training the language model in the speech recognizer, as well as for inducing relevant multiword units (collocations, phrases and terms) for the domain.

Within VoiceTRAN, an aligned bilingual in-domain corpus is also being collected. It will consist of general and scenario-specific in-domain sentences. The compilation of such corpora involves selecting and obtaining the digital originals of the bi-texts, re-coding to XML TEI P4, sentence alignment, word-level syntactic tagging and lemmatisation (Erjavec and Džeroski 2004). Such pre-processed corpora are then used to induce bilingual single word and phrase

lexica for the MT component, or as direct inputs for SMT and EBMT systems. They will also serve for additional training of the speech recognizer language model.

5. Planned Evaluation

The intended experiments will be performed under quite clean conditions that are far different from real application environments, where the speech signal is often mixed with noise and distorted by room reverberation or communication channels.

5.1. End-to-End System Evaluation

Evaluation efforts within the VoiceTRAN project will serve two purposes:
- to evaluate whether we have improved the system by introducing improvements of individual components of the system;
- to test the system acceptance by potential users in field tests.

We intend to perform end-to-end translation quality tests both on manually transcribed and automatic speech recognition input. Human graders will assess the end-to-end translation performance, evaluating how much of the user input information has been conveyed to the target language and also how well-formed the target sentences are. Back-translation evaluation experiments involving paraphrases will be considered as well (Rossato et al., 2002).

5.2. Single Component Evaluation

We will also perform individual component tests in order to select the most appropriate methods for each application server. Speech recognition will be evaluated by computing standard word error rates (WER). For the machine translation component, subjective evaluation tests in terms of fluency and adequacy are planned, as well as objective evaluation tests (MT Evaluation Kit, 2002), (Akiba et al., 2004), keeping in mind that objective evaluation methods evaluate the translation quality in terms of the capacity of the system to mimic the reference text.

6. Conclusion

The VoiceTRAN project represents an attempt to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences for the Slovenian-English language pair. The concept of the VoiceTRAN Communicator implementation is discussed in the paper. The chosen system architecture allows for testing a variety of server modules.

7. Acknowledgements

The authors of the paper wish to thank the Slovenian Ministry of Defense and the Slovenian Research Agency for co-funding the project.

8. References

AKIBA, Y., FEDERICO, M., KANDO, N., NAKAIWA, H., PAUL, M., TSUJI, J. (2004). 'Overview of the IWSLT04 Evaluation Campaign'. In Proceedings of the International Workshop on Spoken Language Translation (pp. 1-9), Kyoto, Japan.

DOBRIŠEK, S. (2001). 'Analysis and Recognition of Phrases in Speech Signals'. PhD Thesis, University of Ljubljana, Slovenia.

EKLUND, R., LYBERG, B. (1995). 'Inclusion of a Prosodic Module in Spoken Language Translation Systems'. In Proceedings of the ASA 130th Meeting, St. Louis, MO.

ERJAVEC, T. (2002). 'The IJS-ELAN Slovene-English parallel corpus'. International Journal of Corpus Linguistics (pp. 1-20), Vol. 7, No. 1.

ERJAVEC, T., DŽEROSKI, S. (2004). 'Machine Learning of Language Structure: Lemmatising Unknown Slovene Words'. Applied Artificial Intelligence (pp. 17-41), Vol. 18, No. 1.

ERJAVEC, T. (2004). 'MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora'. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04 (pp. 1535-1538), Lisbon, Portugal.

ERJAVEC, T., IGNAT, C., POULIQUEN, P., STEINBERGER, R. (2005). 'Massive multi-lingual corpus compilation: Acquis Communautaire and totale'. Submitted to the 2nd Language & Technology Conference, April 21-23, Poznań, Poland.

KOROŠEC, T. (2002). 'Opravljeno je bilo pomembno slovarsko delo o vojaškem jeziku' (Important dictionary work on military language has been carried out). Slovenska vojska (pp. 12-13), Vol. 10, No. 10. (in Slovenian)

LAVIE, A., WAIBEL, A., LEVIN, L., FINKE, M., GATES, D., GAVALDÀ, M., ZEPPENFELD, T., ZHAN, P. (1997). 'Janus-III: Speech-to-Speech Translation in Multiple Languages'. In Proceedings of the ICASSP (pp. 99-102), Munich, Germany.

LAVIE, A., METZE, F., CATTONI, R., COSTANTIN, E., BURGER, S., GATES, D., LANGLEY, C., LASKOWSKI, K., LEVIN, L., PETERSON, K., SCHULTZ, T., WAIBEL, A., WALLACE, D., MCDONOUGH, J., SOLTAU, H., LAZZARI, G., MANA, N., PIANESI, F., PIANTA, E., BESACIER, L., BLANCHON, H., VAUFREYDAZ, D., TADDEI, L. (2002). 'A Multi-Perspective Evaluation of the NESPOLE! Speech-to-Speech Translation System'. In Proceedings of the ACL 2002 Workshop on Speech-to-Speech Translation: Algorithms and Systems, Philadelphia, PA.

LEFEVRE, F., GAUVAIN, J.L., LAMEL, L. (2001). 'Improving Genericity for Task-Independent Speech Recognition'. In Proceedings of Eurospeech (pp. 1241-1244), Aalborg, Denmark.

MIHELIČ, F., GROS, J., DOBRIŠEK, S., ŽIBERT, J., PAVEŠIĆ, N. (2003). 'Spoken language resources at LUKS of the University of Ljubljana'. International Journal of Speech Technology (pp. 221-232), Vol. 6, No. 3.

MT Evaluation Kit (2002). 'NIST MT Evaluation Kit Version 11a'. Available at http://www.nist.gov/speech/tests/mt.

NEY, H. (2004). 'The Statistical Approach to Spoken Language Translation'. In Proceedings of the International Workshop on Spoken Language Translation (pp. XV-XVI), Kyoto, Japan.

ROMIH, M., HOLOZAN, P. (2002). 'Slovensko-angleški prevajalni sistem (A Slovene-English Translation System)'. In Proceedings of the 3rd Language Technologies Conference (p. 167), Ljubljana, Slovenia. (in Slovenian)

ROSSATO, S., BLANCHON, H., BESACIER, L. (2002). 'Speech-to-Speech Translation System Evaluation: Results for French for the Nespole! Project First Showcase'. In Proceedings of the ICSLP, Denver, CO.

SARICH, A. (2001). 'Phraselator, one-way speech translation system'. Available at http://www.sarich.com/translator/.

SENEFF, S., HURLEY, E., LAU, R., PAO, C., SCHMID, P., ZUE, V. (1998). 'Galaxy-II: A Reference Architecture for Conversational System Development'. In Proceedings of the ICSLP (pp. 931-934), Sydney, Australia. Available at http://communicator.sourceforge.net/.

TSAKALIDIS, S., BYRNE, W. (2005). 'Acoustic training from heterogeneous data sources: Experiments in Mandarin conversational telephone speech transcription'. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, Philadelphia, PA, to appear.

VERDONIK, D., ROJC, M., KAČIČ, Z. (2004). 'Creating Slovenian language resources for development of speech-to-speech translation components'. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (pp. 1399-1402), Lisbon, Portugal.

VIČIČ, J., ERJAVEC, T. (2002). 'Corpus driven machine translation'. In Proceedings of the 7th TELRI seminar Information in Corpora (p. 20), abstract, TELRI Association, Dubrovnik, Croatia.

WAHLSTER, W. (2000). 'Verbmobil: Foundations of Speech-to-Speech Translation'. Springer Verlag.

WAIBEL, A., BADRAN, A., BLACK, A. W., FREDERKING, R., GATES, D., LAVIE, A., LEVIN, L., LENZO, K., MAYFIELD TOMOKIYO, L., REICHERT, J., SCHULTZ, T., WALLACE, D., WOSZCZYNA, M., ZHANG, J. (2003). 'Speechalator: Two-Way Speech-to-Speech Translation on a Consumer PDA'. In Proceedings of EUROSPEECH (pp. 369-372), Geneva, Switzerland.

ŽGANEC GROS, J., MIHELIČ, A., ŽGANEC, M., PAVEŠIĆ, N., MIHELIČ, F., CVETKO OREŠNIK, V. (2004). 'AlpSynth corpus-driven Slovenian text-to-speech synthesis: designing the speech corpus'. In Proceedings of the joint conferences CTS+CIS, Computers in Technical Systems, Intelligent Systems (pp. 107-110), Rijeka, Croatia.
