DEMO: MILLA – A Multimodal Interactive Agent

Joao Cabral1, Nick Campbell1, Shree Ganesh2, Emer Gilmartin1, Fasih Haider1, Eamon Kenny1, Mina Kheirkhah3, Andy Murphy1, Neasa Ní Chiaráin1, Thomas Pellegrini4, Odei Rey Orosko5

Trinity College Dublin, Ireland1; GCDH-University of Goettingen, Germany2; Institute for Advanced Studies in Basic Sciences, Zanjan, Iran3; Institut de Recherche en Informatique de Toulouse, France4; Universidad del País Vasco, Bilbao, Spain5

Abstract

We describe the motivation behind, design, and implementation of MILLA, a prototype speech-to-speech English language tutoring system. The MILLA system, developed at the 2014 eNTERFACE workshop (‘The 10th International Summer Workshop on Multimodal Interfaces - eNTERFACE’14 ISCA Training School’, 2014), is a multimodal spoken dialogue system which combines custom modules with existing web resources in a balanced curriculum and, by integrating spoken dialogue, models some of the advantages of a human tutor.

1 Background

Learning a new language involves the acquisition and integration of a range of skills. A human tutor aids learners by (i) providing a framework of tasks suited to the learner’s needs, (ii) monitoring learner progress and adapting task content and delivery style accordingly, and (iii) providing a source of speaking practice and motivation. With the advent of audiovisual technology and the communicative paradigm in language pedagogy, focus has shifted from written grammar and translation to activities targeting communicative competence in listening and spoken production. In recent years the Common European Framework of Reference for Language Learning and Teaching (CEFR) officially added a more integrative fifth skill, spoken interaction, to the traditional four skills of reading, listening, writing, and speaking (Little, 2006). While second languages have always been learned interactively, through negotiation of meaning between speakers of different languages sharing living or working environments, such methods did not figure in formal (funded) settings. However, with increased mobility and globalisation, many formal learners now need a language as a practical tool for everyday life and business rather than simply as an academic achievement. Developments in Computer Assisted Language Learning (CALL) have produced a range of free and commercial language learning material for autonomous study. Much of this material transfers long-established paper-based and audiovisual exercises to the computer screen. Pronunciation training exercises have been developed which provide feedback either by letting learners listen back to their own efforts and compare them to a model, or by having the system return a score or other feedback. These resources are very useful for developing discrete skills, but the challenge of providing spoken interaction tuition and practice remains.

2 MILLA System Components

MILLA’s spoken dialogue Tuition Manager consults a two-level curriculum of language learning tasks, a learner record, and a learner state module to greet and enroll learners, direct them to language learning submodules, provide feedback and scoring, and monitor user state with Kinect sensors. All of the Tuition Manager’s interaction with the user can be performed through speech, using a CereProc TTS voice via CereProc’s Python SDK (‘CereVoice Engine Text-to-Speech SDK | CereProc Text-to-Speech’, 2014) for output, and CMU’s Sphinx4 ASR (Walker et al., 2004), accessed through custom Python bindings with W3C-compliant Java Speech Grammar Format (JSGF) grammars, for understanding.
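As a rough illustration of how the Tuition Manager’s speech layer fits together, the sketch below wires a CereVoice wrapper and a grammar-constrained Sphinx4 recogniser behind two helper functions. The module names cerevoice_wrapper and sphinx4_bridge, their function signatures, and the sample JSGF rule are our own assumptions for illustration; they are not the actual MILLA or SDK APIs.

```python
# Minimal sketch of one Tuition Manager speech turn (hypothetical APIs).
# `cerevoice_wrapper` and `sphinx4_bridge` stand in for MILLA's custom
# bindings; their names and signatures are assumptions, not real modules.
import cerevoice_wrapper   # hypothetical binding to the CereVoice SDK
import sphinx4_bridge      # hypothetical Python binding to Sphinx4

# A small JSGF grammar constraining the learner's expected replies.
MENU_GRAMMAR = """
#JSGF V1.0;
grammar menu;
public <choice> = pronunciation | chat | grammar | vocabulary | stop;
"""

def speak(text, voice="heather"):
    """Synthesise and play a tutor prompt with a CereProc voice."""
    cerevoice_wrapper.say(text, voice=voice)

def listen(grammar=MENU_GRAMMAR, timeout_s=5.0):
    """Recognise one utterance, constrained by a JSGF grammar.
    Returns the recognised string, or None on timeout."""
    recogniser = sphinx4_bridge.Recognizer(grammar=grammar)
    return recogniser.listen(timeout=timeout_s)

if __name__ == "__main__":
    speak("Welcome back. Which task would you like: "
          "pronunciation, chat, grammar or vocabulary?")
    choice = listen()
    if choice is None:
        speak("Sorry, I didn't catch that.")
    else:
        speak("Starting the %s module." % choice)
```

Constraining recognition with a task-specific grammar, rather than open vocabulary decoding, is what makes menu-style dialogue turns robust for non-native speech.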
Tasks include spoken dialogue practice with two different chatbots, first language (L1) focused and general pronunciation training, and grammar and vocabulary exercises. Several speech recognition (ASR) engines (HTK, Google Speech) and text-to-speech (TTS) voices (Mac and Windows system voices, Google Speech) are incorporated in the modules to meet the demands of particular tasks and to provide a cast of voice characters offering a variety of speech models to the learner. Microsoft’s Kinect SDK (‘Kinect for Windows SDK’, 2014) is used for gesture recognition and as a platform for affect recognition. The Tuition Manager and all interfaces are written in Python 2.6, with additional C#, Javascript, Java, and Bash code in the Kinect, chat, Sphinx4, and pronunciation elements. For rapid prototyping, many of the dialogue modules were first written in VoiceXML and later ported to Python modules.

2.1 Pronunciation Tuition

MILLA incorporates two pronunciation modules, both based on comparison of learner productions with model productions: (i) a focused pronunciation tutor using HTK ASR with the five-state, 32-Gaussian-mixture monophone acoustic models provided with the Penn Aligner toolkit (Young, n.d.; Yuan & Liberman, 2008), running on the system’s local machine, and (ii) MySpeech, a phrase-level trainer hosted on University College Dublin’s cluster and accessed by the system over the Internet (Cabral et al., 2012).

For the focused pronunciation system, we used the baseline implementation of the Goodness of Pronunciation (GOP) algorithm (Witt & Young, 2000). GOP scoring involves two phases: 1) a free phone loop recognition phase, which determines the most likely phone sequence given the input speech without giving the ASR any information about the target sentence, and 2) a forced alignment phase, which provides the ASR system with the orthographic transcription of the input sentence and force-aligns the speech signal with the expected phone sequence. For each phone realisation aligned to the speech signal, comparing the log-likelihoods of the forced alignment and free recognition phases produces a GOP score, where zero reflects a perfect match and increasingly positive scores correspond to inaccuracies.

Phone-specific threshold scores were set to decide whether a phone was mispronounced (“rejected”) or not (“accepted”) by artificially inserting errors into the pronunciation lexicon and running the algorithm on native recordings, as in (Kanters, Cucchiarini, & Strik, 2009). After preliminary testing, we constrained the free phone loop recogniser for more robust behaviour, using phone confusions common for specific L1s to define constrained phone grammars. A set of common errors in several L1s, with test utterances, was built into the curriculum module.
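To make the two-phase scoring concrete, the sketch below computes a per-phone GOP score as the duration-normalised gap between the forced-alignment and free-loop log-likelihoods, one common formulation of Witt and Young’s measure, and applies a phone-specific rejection threshold. The alignment data layout and the threshold values are illustrative assumptions, not MILLA’s actual tuned values.

```python
# Sketch of per-phone GOP scoring and accept/reject decisions.
# Data layout and threshold values are illustrative assumptions.

# Phone-specific rejection thresholds. In MILLA these were tuned by
# inserting artificial errors into the lexicon and scoring native
# recordings; the numbers below are made up for the example.
THRESHOLDS = {"ih": 2.0, "th": 1.5, "r": 2.5}
DEFAULT_THRESHOLD = 2.0

def gop_score(forced_loglik, free_loglik, n_frames):
    """GOP: log-likelihood gap between the forced alignment and the
    free phone loop, normalised by phone duration in frames.
    0 = perfect match; larger values = less accurate realisation."""
    return abs(forced_loglik - free_loglik) / float(n_frames)

def judge_phones(aligned_phones):
    """aligned_phones: list of (phone, forced_ll, free_ll, n_frames)
    tuples for one utterance; returns (phone, score, verdict) triples."""
    results = []
    for phone, forced_ll, free_ll, n_frames in aligned_phones:
        score = gop_score(forced_ll, free_ll, n_frames)
        threshold = THRESHOLDS.get(phone, DEFAULT_THRESHOLD)
        verdict = "rejected" if score > threshold else "accepted"
        results.append((phone, score, verdict))
    return results

# Example: one aligned utterance where "th" is clearly mismatched.
alignment = [("th", -310.0, -260.0, 12), ("ih", -150.0, -148.0, 8)]
for phone, score, verdict in judge_phones(alignment):
    print("%s\tGOP=%.2f\t%s" % (phone, score, verdict))
```

Here “th” scores 4.17 against a threshold of 1.5 and is rejected, while “ih” scores 0.25 and is accepted; per-phone thresholds allow stricter judgements for phones a given L1 group typically confuses.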
2.2 Spoken Interaction Tuition (Chat)

To provide spoken interaction practice, MILLA sends the user to Michael (Level 1) or Susan (Level 2), two chatbots created with the Pandorabots web-based chatbot hosting service. The bots were first implemented in text-to-text form in AIML (Artificial Intelligence Markup Language); TTS and ASR were then added through the Web Speech API, conforming to W3C standards (W3C, 2014). Based on consultation with language teachers and learners, the system allows users to speak directly to the chatbot or to enter chat responses using text input. A chat log was also built into the interface, allowing the user to read back or replay several of their previous interactions with the chatbot.

2.3 Grammar, Vocabulary and External Resources

MILLA’s curriculum includes a number of graded activities from OUP’s English File and the British Council’s Learn English websites (REF). Wherever possible, the system scrapes any scores returned for exercises and incorporates them into the learner’s record; in other cases, the progression and scoring system requires a set amount of time to be spent on an exercise before the user progresses to the next one. There are also a number of custom morphology and syntax exercises designed for MILLA using Voxeo’s Prophecy platform and VoiceXML, which will be ported to MILLA in the near future.

2.4 User State and Gesture Recognition

MILLA includes a learner state module which will eventually infer boredom or involvement in the learner. As a first pass, gestures indicating various commands were designed and incorporated into the system using Microsoft’s Kinect SDK. The current implementation comprises four gestures (Stop, I Don’t Know, Swipe Left/Right), which were designed by tracking the skeletal movements involved and extracting joint coordinates on the x, y, and z planes to train the recognition process. As MILLA is multiplatform (Unix and Windows), Python’s socket programming modules were used to communicate between the Windows machine running the Kinect and the Mac laptop hosting MILLA, as sketched below.
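Since the Kinect gesture recogniser (C#, on Windows) and the main MILLA process (on the Mac) run on different machines, gesture events cross the network as simple socket messages. The sketch below shows one way such a link could look on the MILLA side, using only Python’s standard socket module; the port number, the newline-delimited message format, and the gesture label set are illustrative assumptions rather than MILLA’s actual protocol.

```python
# Sketch of the cross-machine gesture link (Windows Kinect -> Mac MILLA)
# using Python's standard library sockets. Port number and message
# format are illustrative assumptions, not MILLA's actual protocol.
import socket

HOST, PORT = "", 9999   # listen on all interfaces; port is assumed
GESTURES = {"STOP", "DONT_KNOW", "SWIPE_LEFT", "SWIPE_RIGHT"}

def serve_gestures(handle_gesture):
    """Accept one connection from the Kinect machine and dispatch
    newline-delimited gesture labels to `handle_gesture`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen(1)
    conn, addr = server.accept()
    buf = ""
    try:
        while True:
            data = conn.recv(1024)
            if not data:          # sender closed the connection
                break
            buf += data.decode("ascii")
            while "\n" in buf:    # split off complete messages
                label, buf = buf.split("\n", 1)
                if label in GESTURES:
                    handle_gesture(label)
    finally:
        conn.close()
        server.close()

def on_gesture(label):
    # e.g. map "STOP" to halting the current exercise
    print("gesture received: " + label)

if __name__ == "__main__":
    serve_gestures(on_gesture)
```

Keeping the wire format to plain text labels means the C# Kinect client stays trivially simple and the link is easy to debug with standard tools such as telnet or netcat.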

3 Future work

MILLA is an ongoing project. In particular, work is in progress to add a graphical user interface and avatar to provide a more immersive version of MILLA. User trials are planned for the academic year 2014-15 in several centres providing language training to immigrants in Ireland.

References

Cabral, J. P., Kane, M., Ahmed, Z., Abou-Zleikha, M., Székely, E., Zahra, A., … Schlögl, S. (2012). Rapidly Testing the Interaction Model of a Pronunciation Training System via Wizard-of-Oz. In LREC (pp. 4136–4142).

CereVoice Engine Text-to-Speech SDK | CereProc Text-to-Speech. (2014). Retrieved 7 July 2014, from https://www.cereproc.com/en/products/sdk

Kanters, S., Cucchiarini, C., & Strik, H. (2009). The goodness of pronunciation algorithm: a detailed performance study. In SLaTE (pp. 49–52).

Kinect for Windows SDK. (2014). Retrieved 7 July 2014, from http://msdn.microsoft.com/en-us/library/hh855347.aspx

Little, D. (2006). The Common European Framework of Reference for Languages: Content, purpose, origin, reception and impact. Language Teaching, 39(03), 167–190.

The 10th International Summer Workshop on Multimodal Interfaces - eNTERFACE’14 ISCA Training School. (2014). Retrieved 7 July 2014, from http://aholab.ehu.es/eNTERFACE14/

W3C. (2014). Web Speech API Specification. Retrieved 7 July 2014, from https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., … Woelfel, J. (2004). Sphinx-4: A flexible open source framework for speech recognition. Retrieved from http://dl.acm.org/citation.cfm?id=1698193

Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2), 95–108. Retrieved from http://www.sciencedirect.com/science/article/pii/S0167639399000448

Young, S. (n.d.). HTK Toolkit. Retrieved 7 July 2014, from http://htk.eng.cam.ac.uk/

Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5), 3878.