I Iberian SLTech 2009

Proceedings of the I Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages

Porto Salvo, Portugal, September 3-4, 2009

Edited by: António Teixeira, Miguel Sales Dias & Daniela Braga

Published by: Designeed

ISBN: 978-989-96278-1-9 Portuguese National Library Number: 298538/09

Preface

The ISCA Special Interest Group on Iberian Languages (SIG-IL) board, pursuing its aims of organizing conferences, schools and workshops, promoting industry/university collaboration and offering a forum to discuss opportunities in research and industry applications in the field of Speech and Language Technology, decided to organize a new event, I Iberian SLTech - I Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages. A practical way of making the event possible in 2009 was to join efforts with the Microsoft Language Development Center (MLDC), in Portugal, associating the SIG-IL event with a new edition of their past workshops (I Microsoft Workshop on Speech Technology - Building bridges between industry and academia, Porto Salvo, May 2, 2007, and the Propor 2008 Special Session: Applications of Portuguese Speech and Language Technologies, September 10, 2008, Curia, Portugal). The event also had the support of the Red Temática en Tecnologías del Habla (RTTH).

The main objective of this new event is to create a forum for the exchange of ideas and to promote collaboration in the fields of Speech and Language Technologies among all the institutions that work on Iberian Languages. It is also our goal to continue this event in the following years and to hold it in different places of the SIG-IL geography.

The Organization is honoured to receive Alex Acero as keynote speaker, who will bring us the following topic: "Building accurate and user-friendly speech systems". Alex Acero is the Research Area Manager of the Speech Group in Microsoft Research, in Redmond (USA), and is one of the world's leading researchers in Speech Technology, directing an organization with 70 researchers in audio, speech, multimedia, communication, natural language, and information retrieval. He is also an affiliate Professor of Electrical Engineering at the University of Washington, Seattle. Dr. Acero is the author of the books "Acoustical and Environmental Robustness in Automatic Speech Recognition" (Kluwer, 1993) and "Spoken Language Processing" (Prentice Hall, 2001).

Our Scientific Committee selected the following contributions for presentation: 21 posters (3 from students), 4 recent PhDs, 2 demos and 8 groups and projects. The posters covered six areas: systems and applications; resources and tools; speech recognition; speech synthesis; gender, speaker and language recognition; and language processing. These contributions were edited in online, CD and paper proceedings. We accepted 18 contributions with authors from Portugal, 15 from Spain, 3 from Germany, 1 from the USA, 1 from Cuba and 1 from Brazil.

We would like to send our special thanks to all the Chairs and Scientific Committee members who helped us with the revision of proposals and the organization of the conference on such a tight schedule, to our keynote speaker for bringing us such an interesting topic, and to all authors for submitting the most recent advances of their work on speech and language processing for the Iberian languages. We expect this to be the first edition of a growing annual event that will attract more researchers, from all countries, working on the latest advances in speech and language processing for the Iberian Languages.

The Iberian SLTech 2009 Organizers,
António Teixeira, SIG-IL Chair, IEETA
Daniela Braga, Microsoft

Committees

General Chairs

António Teixeira, SIG-IL Chair, Universidade de Aveiro/IEETA, Portugal
Miguel Dias, Microsoft, Portugal
Daniela Braga, Microsoft, Portugal

Local Organization Committee

Daniela Braga, Microsoft, Portugal
Francisco Pires, Microsoft, Portugal
Bruno Reis Bechtlufft, Microsoft, Portugal

Demos Chair

Rubén San-Segundo, SIG-IL ISCA Liaison, Universidad Politécnica de Madrid, Spain

Program Chair

Aldebaro Klautau, SIG-IL Secretary, Universidade Federal do Pará, Brazil
Juan Arturo Nolazco Flores, SIG-IL Vice-Chair, Tecnológico de Monterrey, Mexico
Carmen García Mateo, University of Vigo, Spain

Presentations and Panels Chair

António Teixeira, SIG-IL Chair, Universidade de Aveiro/IEETA, Portugal
Daniela Braga, Microsoft, Portugal

Scientific Committee

Abel Herrera Camacho, FI-UNAM, México
Alberto Abad, INESC-ID Lisboa, Portugal
Alberto Simões, Universidade do Minho, Portugal
Aldebaro Klautau, Universidade Federal do Pará, Brazil
Alex Acero, Microsoft Research, USA
Alexander Gelbuck, CIC - IPN, México
Alfonso Ortega, Universidad de Zaragoza, Spain
Alvaro Iriarte, Universidade do Minho, Portugal
Amália Andrade, CLUL/Universidade de Lisboa, Portugal
Andreia Rauber, Universidade do Minho, Portugal
Antonia Marti Antonín, Universidad de Barcelona, Spain
António Bonafonte, Universitat Politècnica de Catalunya, Spain
António Branco, FCUL, Portugal
António Serralheiro, L2F INESC-ID and Academia Militar, Portugal
António Teixeira, IEETA/Universidade de Aveiro, Portugal
Ascensión Gallardo, Universidad Carlos III de Madrid, Spain
Belinda Maia, FLUL, Portugal
Carlos Meneses, ISEL, Portugal
Carlos Teixeira, FCUL, Portugal
Céu Viana, FLUL, Portugal
Ciro Martins, IEETA/Universidade de Aveiro, Portugal
Daniela Braga, MLDC/Microsoft, Portugal
Diana Santos, SINTEF, Norway
Doroteo Torres, Universidad Autónoma de Madrid, Spain
Encarna Segarra, Universidad Politécnica de Valencia, Spain
Eva Navas, Universidad del País Vasco, Spain
Fábio Violaro, Universidade Estadual de Campinas - UNICAMP, Brazil
Fernando Gil Resende Jr., Universidade Federal do Rio de Janeiro, Brazil
Fernando Perdigão, Universidade de Coimbra, Portugal
Francisco Campillo, Universidade de Vigo, Spain
Francisco Vaz, IEETA/Universidade de Aveiro, Portugal
Frank Seide, Microsoft Research Asia - Speech Group
Hugo Meinedo, INESC-ID Lisboa, Portugal
Inmaculada Hernaez Rioja, Universidad del País Vasco, Spain
Isabel Trancoso, INESC-ID/IST, Portugal
João Veloso, Faculdade de Letras da Universidade do Porto, Portugal
Jorge Baptista, Universidade do Algarve, Portugal
José Manuel Pardo, Universidad Politécnica de Madrid, Spain
José Ramón Calvo de Lara, CENATAV, Cuba
José Teixeira, Universidade do Minho, Portugal
Juan Manuel Montero, Universidad Politécnica de Madrid, Spain
Juan Nolazco Flores, Tecnológico de Monterrey, Mexico
Luís Caldas Oliveira, INESC-ID/IST, Portugal
Luis Hernandez, Universidad Politécnica de Madrid, Spain
Luis Villaseñor Pineda, INAOE, Mexico
Manuel Montes y Gómez, INAOE, Mexico
Maria Aldina Marques, Universidade do Minho, Portugal
Maria Helena Mira Mateus, ILTEC, Portugal
Mário Silva, FCUL, Portugal
Nestor Yoma, Universidad de Chile, Chile
Nuno Mamede, INESC-ID/IST, Portugal
Paula Carvalho, FCUL, Portugal
Paulo Quaresma, Universidade de Évora, Portugal
Plínio Barbosa, Universidade Estadual de Campinas (UNICAMP), Brazil
Ranniery Maia, NICT, Japan
Ricardo de Córdoba, Universidad Politécnica de Madrid, Spain
Rubén San Segundo, Universidad Politécnica de Madrid, Spain
Sérgio Paulo, INESC-ID Lisboa, Portugal
Thomas Pellegrini, INESC-ID Lisboa, Portugal
Vera Strube de Lima, Pontifícia Universidade Católica do Rio Grande do Sul, Brazil
Violeta Quental, Pontifícia Universidade Católica do Rio de Janeiro, Brazil
Xavier Gómez Guinovart, Universidade de Vigo, Spain
Xosé Ramón Freixeiro Mato, Universidade da Coruña, Spain

Keynote Speaker

Building accurate and user-friendly speech systems 3 Alex Acero Microsoft, USA

Selected Posters

Systems & Applications

A task-independent stochastic dialog manager for the EDECAN project 9 Francisco Torres Goterris Departamento de Sistemas Informáticos y Computación Universidad Politécnica de Valencia, Spain

Terminology extraction from English-Portuguese and English-Galician parallel corpora based on probabilistic translation dictionaries and bilingual syntactic patterns 13 Alberto Simões Department of Computer Science Universidade do Minho, Portugal Xavier Gómez Guinovart Department of Translation and Linguistics Universidade de Vigo, Spain

A Hierarchical Architecture for Audio Segmentation in a Broadcast News Task 17 Mateu Aguilo, Taras Butko, Andrey Temko, Climent Nadeu Department of Signal Theory and Communications, TALP Research Center Universitat Politècnica de Catalunya, Spain

Browsing Multilingual Making-Ofs 21 Carlos Teixeira LASIGE, University of Lisbon, Portugal Ana Respício OR Center, DI, University of Lisbon, Portugal Catarina Ribeiro LASIGE, University of Lisbon, Portugal

Resources & Tools

A Catalan Broadcast Conversational Speech Database 27 Henrik Schulz, José A. R. Fonollosa Department of Signal Theory and Communications Technical University of Catalunya (UPC), Spain

An XML Resource Definition for Spoken Document Retrieval 31 Luis Javier Rodríguez-Fuentes, Germán Bordel, Arantza Casillas, Mikel Penagarikano & Amparo Varona Grupo de Trabajo en Tecnologías Software (GTTS) Universidad del País Vasco, Spain

CORPOR System: Corpora of the Portuguese Language as spoken in São Paulo 35 Zilda Zapparoli Universidade de São Paulo (USP), Brazil

Machine Translation of the Penn Treebank to Spanish 39 Martha Alicia Rocha Departamento de Sistemas y Computación Instituto Tecnológico de León, México Joan Andreu Sánchez Instituto Tecnológico de Informática Universidad Politécnica de Valencia, Spain

Adapting the Unisyn Lexicon to Portuguese: Preliminary issues in the development of LUPo 43 Simone Ashby, José Pedro Ferreira, Sílvia Barbosa Instituto da Linguística Teórica e Computacional (ILTEC), Portugal

Speech Recognition

A Baseline System for the Transcription of Catalan Broadcast Conversation 49 Henrik Schulz, José A. R. Fonollosa Department of Signal Theory and Communications Technical University of Catalunya (UPC), Spain David Rybach Human Language Technology and Pattern Recognition RWTH Aachen University, Germany

A Fast Discriminative Training Algorithm for Minimum Classification Error 53 B. Silva, H. Mendes, C. Lopes, A. Veiga & F. Perdigão Department of Electrical and Computer Engineering, FCTUC Instituto de Telecomunicações, Polo II University of Coimbra, Portugal

Global Discriminative Training of a Hybrid Speech Recognizer 57 Carla Lopes Department of Electrical and Computer Engineering, FCTUC Instituto de Telecomunicações, Polo II University of Coimbra, Portugal Instituto Politécnico de Leiria-ESTG, Portugal Fernando Perdigão Department of Electrical and Computer Engineering, FCTUC Instituto de Telecomunicações, Polo II University of Coimbra, Portugal

Towards Microphone Selection Based on Room Impulse Response Energy-Related Measures 61 Martin Wolf & Climent Nadeu TALP Research Center, Department of Signal Theory and Communications Universitat Politècnica de Catalunya, Spain

Speech Synthesis

Towards an Objective Voice Preference Definition for the Portuguese Language 67 Luis Coelho ESEIG, Instituto Politécnico do Porto, Portugal Horst-Udo Hain & Oliver Jokisch Laboratory of Acoustics and Speech Communication, TU Dresden, Germany Daniela Braga Microsoft Language Development Center, Microsoft, Portugal

A Detailed Analysis and Comparison of Speech Synthesis Paradigms 71 Luis Coelho ESEIG, Instituto Politecnico do Porto, Porto, Portugal Daniela Braga Microsoft Language Development Center, Microsoft, Portugal Carmen Garcia-Mateo Departamento de Teoría de la Señal y Comunicaciones University of Vigo, Spain

Gender, Speaker & Language Recognition

Detection of Children’s Voices 77 Rui Martins & Isabel Trancoso IST - Technical University of Lisbon, Portugal L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, Portugal Alberto Abad & Hugo Meinedo L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, Portugal

Unsupervised SVM based 2-Speaker Clustering 81 Binda Celestino, Hugo Cordeiro, Carlos Meneses Ribeiro Multimedia and Machine Learning Group Department of Electronic Telecommunication and Computer Engineering Instituto Superior de Engenharia de Lisboa (ISEL), Portugal

Speaker Verification with Shifted Delta Cepstral Features: Its Pseudo-Prosodic Behavior 85 Dayana Ribas González, José R. Calvo de Lara Advanced Technologies Application Center, CENATAV, Cuba

Multilevel and channel-compensated language recognition: ATVS-UAM systems at NIST LRE 2009 89 Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Javier Franco-Pedroso, Daniel Ramos, Doroteo T. Toledano & Joaquin Gonzalez-Rodriguez ATVS Biometric Recognition Group Universidad Autonoma de Madrid, Spain

Language Processing

Bilingual Example Segmentation based on Markers Hypothesis 95 Alberto Simões & José João Almeida Departamento de Informática, Universidade do Minho, Portugal

Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages 99 Fernando Batista DCTI, Institute of Science, Technology and Management, Portugal L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, Portugal Isabel Trancoso & Nuno Mamede IST - Technical University of Lisbon, Portugal L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, Portugal

Demos

The TALP on-line Spanish-Catalan machine-translation system 105 Marc Poch, Mireia Farrús, Marta R. Costa-jussà, José B. Mariño Adolfo Hernández, Carlos Henríquez & José A. R. Fonollosa Center for Language and Speech Technologies and Applications (TALP) Technical University of Catalonia (UPC), Spain

CINTIL-Treebank Searcher 107 Patricia Nunes Gonçalves, António Branco NLX-Natural Language and Speech Group, Lisbon University, Portugal

Recent PhDs

Hierarchical language models based on classes of phrases: formulation, learning and decoding 111 Raquel Justo Blanco Department of Electricity and Electronics University of the Basque Country, Spain

Dynamic Language Modeling for European Portuguese 113 Ciro Martins Department of Electronics, Telec. & Informatics/IEETA University of Aveiro, Portugal L2F - Spoken Language Systems Lab - INESC-ID, Portugal

A phonological study of Portuguese language variety spoken in Beira Interior region. Some syntactic and semantic considerations 115 Sara Candeias Instituto de Telecomunicações (Coimbra), Portugal

From grapheme to gesture: Linguistic contributions for an articulatory based text-to-speech system 117 Catarina Oliveira ESSUA/IEETA - University of Aveiro, Portugal

Groups & Projects

Natural Language Science and Technology at the University of Lisbon, Department of Informatics: the NLX Group 121 António Branco NLX - Natural Language and Speech Group Department of Informatics, University of Lisbon, Portugal

LX-Center: A center of online services for education, research and development on language science and technology 123 Antonio Branco, Francisco Costa, Eduardo Ferreira, Pedro Martins Filipe Nunes, Joao Silva & Sara Silveira Department of Informatics, University of Lisbon, Portugal

Pattern recognition & speech technologies 125 M. I. Torres, R. Justo, A. Pérez, V. Guijarrubia, J.M. Olaso G. Sánchez, E. Alonso & J.M. Alcaide Department of Electricity and Electronics University of the Basque Country, Spain

SD-TEAM: Interactive Learning, Self-Evaluation and Multimodal Technologies for Multidomain Spoken Dialog Systems 127 María Inés Torres Pattern Recognition & Speech Technologies Group University of the Basque Country, Spain Eduardo Lleida Communication Technologies Group, University of Zaragoza, Spain Emilio Sanchis Pattern Recognition and Artificial Intelligence Group Polytechnic University of Valencia, Spain Ricardo de Córdoba Speech Technology Group, Polytechnic University of Madrid, Spain Javier Macías-Guarasa Intelligent Spaces & Transport Group University of Alcalá de Henares, Spain

Recent work on the FESTCAT database for speech synthesis 131 Antonio Bonafonte TALP Research Center, Universitat Politècnica de Catalunya, Spain Lourdes Aguilar Departament Filologia Espanyola Universitat Autònoma de Catalunya, Spain Ignasi Esquerra, Sergio Oller & Asunción Moreno TALP Research Center, Universitat Politècnica de Catalunya, Spain

The Project HERON 133 António Teixeira & Augusto Silva Dep. Electronics, Telecom. & Informatics/IEETA University of Aveiro, Portugal Catarina Oliveira & Paula Martins School of Health/IEETA, University of Aveiro, Portugal Inês Domingues IEETA, University of Aveiro, Portugal

SDI Media Booth 135 ErinRose Widner SDI Media Group

Microsoft Language Development Center's activities in 2008/2009 137 Daniela Braga, António Calado, Pedro Silva & Miguel Sales Dias Microsoft Language Development Center (MLDC), Portugal

Keynote Speaker



Building accurate and user-friendly speech systems

Alex Acero, Research Area Manager, Microsoft Research

Abstract: While accurate speech recognition engines are critical to successful speech applications, there are other factors that can impact user experience even more than the accuracy of the engine itself. For example, the grammar the ASR engine uses should predict what the user will say, but it's often hard for an application developer to design a grammar that will result in high system accuracy. I will show how data-driven techniques can be used to build accurate grammars in a straightforward way. I'll also describe a technique that uses a statistical language model and an inverted index, which can be used for applications such as voice search or SMS dictation and results in highly accurate end-to-end systems. Even an accurate speech recognition system is not enough for a good user experience, because such systems will always make errors, and it's critical to provide a graceful error recovery mechanism. Also, users have a choice between speaking, touching a screen, or typing, and may choose not to speak unless this is better than the alternative. I will show designs for several systems that take this into account in voice search, education and the automobile.

Bio:

Alex Acero received a M.S. degree from the Polytechnic University of Madrid, Madrid, Spain, in 1985, a M.S. degree from Rice University, Houston, TX, in 1987, and a Ph.D. degree from Carnegie Mellon University, Pittsburgh, PA, in 1990, all in Electrical Engineering. Dr. Acero worked in Apple Computer's Advanced Technology Group in 1990-1991. In 1992, he joined Telefonica I+D, Madrid, Spain, as Manager of the speech technology group. Since 1994 he has been with Microsoft Research, Redmond, WA, where he is presently a Research Area Manager directing an organization with 70 researchers in audio, speech, multimedia, communication, natural language, and information retrieval. He is also an affiliate Professor of Electrical Engineering at the University of Washington, Seattle.

Dr. Acero is the author of the books "Acoustical and Environmental Robustness in Automatic Speech Recognition" (Kluwer, 1993) and "Spoken Language Processing" (Prentice Hall, 2001), has written invited chapters in 4 edited books and 200 technical papers, and holds 53 US patents.

Dr. Acero is a Fellow of the IEEE. He has served the IEEE Signal Processing Society as Vice President Technical Directions (2007-2009), 2006 Distinguished Lecturer, member of the Board of Governors (2004-2005), Associate Editor for IEEE Signal Processing Letters (2003-2005) and IEEE Transactions on Audio, Speech and Language Processing (2005-2007), and member of the editorial board of the IEEE Journal of Selected Topics in Signal Processing (2006-2008) and the IEEE Signal Processing Magazine (2008-2010). He also served as member (1996-2000) and Chair (2000-2002) of the Speech Technical Committee of the IEEE Signal Processing Society. He was Publications Chair of ICASSP98, Sponsorship Chair of the 1999 IEEE Workshop on Automatic Speech Recognition and Understanding, and General Co-Chair of the 2001 IEEE Workshop on Automatic Speech Recognition and Understanding. Since 2004, Dr. Acero, along with co-authors Drs. Huang and Hon, has been using proceeds from their textbook "Spoken Language Processing" to fund the "IEEE Spoken Language Processing Student Travel Grant" for the best ICASSP student papers in the speech area. Dr. Acero is a member of the editorial board of Computer Speech and Language and he served as a member of the Carnegie Mellon University Dean's Leadership Council for the College of Engineering.



Posters



Systems & Applications



A task-independent stochastic dialog manager for the EDECAN project

Francisco Torres Goterris

Departamento de Sistemas Informáticos y Computación Universidad Politécnica de Valencia, Spain [email protected]

Abstract

The adaptation of a stochastic dialog manager to work in a new domain is presented. A dialog manager, previously developed to attend a specific task (queries about train services), has been modified to be used in a different domain (a sport courts booking system). The new manager deals with both tasks, just loading their corresponding bigram models and configuration files. A user simulator technique has been applied to acquire a corpus, and to automatically learn the models. The dialog manager using the learnt models has been evaluated, achieving satisfactory results for using them in an acquisition with real users.

Index Terms: dialog management, stochastic models, domain independence, user simulation.

1. Introduction

The statistical approach to the design of spoken dialog systems has provided satisfactory results, as in [1], [2], and [3], and it is currently a way open for further improvements. Some drawbacks of this approach, as the high cost of the acquisition of the corpora and the evaluation made by interacting with human users, have been dealt with different strategies as, for instance, the user simulation techniques, as in [4], [5], and [6]. Equally, important efforts have been made to develop dialog systems that can be easily adapted to different domains, i.e., to obtain task-independent dialog systems, as in [7], and [8]. In this paper, the adaptation of a stochastic dialog manager to work in a new domain is explained. This adaptation is one of the objectives in the EDECAN [9] research project.

Up to now, we have developed a dialog system for the BASURDE [10] and DIHANA [11] tasks, which provides access to an information system for train timetables, prices, and services. In this system, the dialog manager [12] uses a stochastic dialog model that is a bigram model (BM) of dialog acts. The information provided in previous turns of the dialog (i.e., data out of the scope of the BM) is stored in a historic register (HR). The dialog manager selects a new state, which will determine its following action, taking into account the last user turn, the probabilities of the available transitions in the BM, and the degree of appropriateness of these transitions given the content of the HR.

In addition, we have developed a user simulator [13] that allows us to acquire synthetic dialogs, learn dialog models, and evaluate the system. The behavior of the user simulator is determined by the same BM, and by some heuristic rules that implement a collaborative dialog strategy (in order to generate consistent dialogs, which will be useful for learning dialog models). These collaborative rules are domain independent.

During an acquisition of simulated dialogs, on the one hand, the dialog manager decides its strategy using its BM and its HR, and can automatically verify the success of the dialogs and modify the BM, readjusting the probabilities of the transitions. On the other hand, the user simulator just provides an appropriate flow of user turns to easily generate consistent dialogs. This user simulation technique has been demonstrated valid to test the dialog manager and to enhance its BM [13].

In the research that is reported here, we have pointed to three aims. First, we have converted the BASURDE/DIHANA dialog manager into a task-independent dialog manager. Second, we have migrated to JAVA, developing a platform-independent prototype. Third, we have applied the user simulation technique to test the dialog manager in the EDECAN task: acquiring a dialog corpus, and learning an initial stochastic dialog model. We have obtained a dialog manager that works suitably in a dialog system for booking sport courts (the EDECAN task).

In Section 2, the EDECAN task and the design of its BM are described. In Section 3, the stochastic dialog manager is revised. In Section 4, the prototype is described, and some results of its evaluation in both tasks are reported. Finally, in Section 5, some conclusions are presented.

2. Task and dialog model description

The EDECAN project is focused on the adaptation of dialog systems to different acoustic environments and to different semantic domains. One of these tasks consists of a sport courts booking system (called the EDECAN task in this paper).

The EDECAN task has been semantically characterized by identifying the concepts and attributes involved in a set of dialogs with real users (recorded by the sport courts booking system of our University). The concepts are the goals of the user queries, and they are the following: AVAILABILITY (queries about availability of courts), BOOKING (bookings of courts, given certain restrictions), BOOKED (queries about currently booked courts) and CANCELLATION (cancellations of the bookings of courts). The attributes are the items that the user must or can provide to specify his/her goals, and they are the following: SPORT, DATE, HOUR, COURT-TYPE, and COURT-ID.

In addition, these dialogs with real users have been studied to design a set of scenarios, which are used in the acquisition of a dialog corpus. Different levels of complexity have been established in the proposed set of 15 scenarios. For instance, the first and the last of them have been coded as follows:

• Scenario-1: SPORT [COURT-TYPE] [DATE] [HOUR].
• Scenario-15: [SPORT] [DATE] [HOUR] [] SPORT [COURT-TYPE] DATE HOUR.

Scenario-1 consists of a query about availability on a certain sport, allowing the user to specify date, hour, and court-type.


[Figure 2 (block diagram): the Dialog System side comprises the Understanding Module, the Generic System Dialog Manager (GSDM, using the BM, SHR and DP) and the System Language Generator (SLG), connected to the Database Manager; the User Simulator side comprises the User Dialog Manager (UDM, using the BM, UHR and TPR) and the User Language Generator (ULG); the two sides exchange text and frames.]

Figure 2: Task-independent dialog system block diagram.

Scenario-15 is a complex scenario, and it can be decomposed into three phases: (1) the user has to obtain the list of his/her booked courts; (2) the user has to cancel some courts of the previous list, and s/he can optionally provide the sport, the date, or the hour, to specify the booked court whose cancellation s/he wants; and (3) the user has to book some courts providing the sport, the date, and the hour, and s/he can supply the court-type, or can make an availability query.

Thus, dialogs of complex scenarios are dialogs composed by sequences of sub-dialogs, and there are sub-dialogs that share information among them. This circumstance occurs between the BOOKED and CANCELLATION sub-dialogs, and also between the AVAILABILITY and BOOKING sub-dialogs.

In the DIHANA project, the acquired dialog corpus was labeled applying the concept of dialog act and a hierarchy of three levels. In this hierarchy, the first level (L1) identifies the generic dialog act; the second level (L2), the semantic of the task; and the third level (L3), the instantiated attributes. Once each dialog turn was labeled, each dialog consists of a sequence of dialog acts. Thus, the structures of the dialog models are represented by sequences of dialog acts.

However, at the moment of carrying out the work reported here, there was not any EDECAN dialog corpus. Thus, after studying the corpus facilitated by our University, we have defined a set of labels for describing the semantic of the task, according to the scheme of a hierarchy of three levels. The L1 labels are the following: OPENING, CLOSING, WAITING, NEW-QUERY, QUESTION, CONFIRMATION, ANSWER, CHOICE, NOT-UNDERSTOOD, ACCEPTANCE, REJECTION. The L2 and L3 labels are the following: AVAILABILITY, BOOKING, CANCELLATION, BOOKED, SPORT, DATE, HOUR, COURT-ID, COURT-TYPE, NIL.

Using this label set, we can define the descriptors of the dialog states that will be the nodes of the BM. For instance, the (U:QUESTION:BOOKING:DATE) descriptor identifies a state in which the user asks for booking a court, specifying the date s/he wants to play. Equally, the (S:ANSWER:BOOKING:COURT-ID,HOUR) (S:CHOICE:BOOKING:NIL) descriptor identifies a state in which the system supplies a list of courts (providing their court-ids and time-slots) that can be booked, and asks the user for choosing one of them. Figure 1 illustrates this approach.

Could I book a tennis-court on next Friday?
(U:QUESTION:BOOKING:DATE,SPORT)
Do you want to play on a lawn court?
(S:CONFIRMATION:COURT-TYPE:COURT-TYPE)
No. I want a clay court.
(U:REJECTION:COURT-TYPE:NIL)
(U:ANSWER:COURT-TYPE:COURT-TYPE)

Figure 1: Labeling a segment of a hypothetic dialog.

In Figure 1, the user asks for something (L1: QUESTION), the question is about bookings (L2: BOOKING), and s/he provides the values of two attributes (L3: DATE, SPORT). In the following turn, the system answers by making a confirmation (L1: CONFIRMATION) of the court-type (L2: COURT-TYPE) and it provides the value of this attribute (L3: COURT-TYPE). Then, the user carries out two dialog acts in the same turn: s/he rejects (L1: REJECTION) the court-type proposed by the system, and s/he provides (L1: ANSWER) other value of this attribute.

Starting from this labeling proposal, we have built an initial BM for the EDECAN task. The states of this BM are defined by one or several identifiers that match the (US-ID:L1-ID:L2-ID:L3-ID) pattern, where US-ID is U or S depending on the turn corresponds to the user or to the system, L1-ID is one of the L1 labels, and L2-ID and L3-ID are one or several of the L2 and L3 labels, respectively. The transitions between states have been established connecting any user (system) state to all the system (user) states. All the transitions have the same probability (i.e., given that there are 228 user states and 266 system states, the probability of the transition to any user state is 1/228, and the probability of the transition to any system state is 1/266). Thus, this initial BM is an equiprobable model. Given a certain current dialog state, the dialog manager will choose any following state without influence of statistical information.

3. Task-independent dialog management

Spoken dialog systems are usually integrated by six modules: speech recognizer, language understanding module, dialog manager, database manager, language generator, and speech synthesizer. However, in this approach of training models through a synthetic acquisition, only the text is used, and neither speech recognizers nor speech synthesizers are part of the prototype. Figure 2 shows its block diagram.

In a synthetic acquisition, the understanding module receives the sentences generated by the user simulator, extracts its meaning, and builds a set of user frames or semantic representations. Up to now, this module is an application restricted to the BASURDE and DIHANA tasks, and it is not used in an EDECAN acquisition. Thus, the frames generated by the user dialog manager (UDM) are directly provided as input to the generic system dialog manager (GSDM), with the possibility of applying some error simulations.

The GSDM receives these user frames, decides the system dialog strategy (taking into account its BM, its system historic register, SHR, and the domain parameters, DP), and builds a set of system frames, which formalizes the chosen behavior. In addition, this manager interacts with the database manager.

The UDM receives the system frames, decides the user dialog strategy (taking into account its BM, its user historic register, UHR, and a set of target planning rules, TPR), and builds the user frames. This manager follows a collaborative strategy, defined by the TPR and a set of task scenarios.

The database manager attends the queries of the GSDM.

The user/system language generators (ULG/SLG) translate the user/system frames into Spanish or English sentences. Both modules work using a set of templates and a set of rules for instantiating the templates.

Figure 3 shows the algorithm of the dialog manager in its usual role of system interlocutor. The algorithm of the UDM (i.e., the dialog manager when it plays as user interlocutor) differs slightly from the GSDM algorithm. Both managers use the same dialog model that is a BM of dialog acts.

Initialization (DP, SHR);
Read (BM); BM.state = OPENING;
BM.mode = Select (STATIC, DYNAMIC);
REPEAT
    Read (U-frames);
    BM.input = Adapt (SHR, U-frames);
    BM.state = Transit (BM.state, BM.input);
    SHR = Update (SHR, U-frames);
    BM.state = Transit (BM.state, SHR);
    SHR = Update (SHR, BM.state, BD.info);
    S-frames = Adapt (BM.state, SHR);
    Write (S-frames);
    IF (BM.mode = DYNAMIC) Update (BM);
UNTIL BM.state = CLOSING;
IF (BM.mode = DYNAMIC)
    Read (UHR);
    success = Compare (UHR, SHR);
    IF success Write (BM);

Figure 3: System Dialog Manager (GSDM) algorithm.

Now, we describe the steps of both algorithms, starting with the GSDM algorithm. At the beginning of each dialog, the GSDM performs the following three actions: (1) it initializes the domain parameters (DP) and the SHR, whose structure is task-dependent; (2) it reads the BM, and selects the initial state, which is the opening of dialog; and (3) it establishes the way of using the BM. In static mode, BM cannot be modified. In dynamic mode, BM can be modified when successful dialogs are carried out.

Then, and for each sequence of turns between the user and the system, the GSDM performs the following actions: (1) reading of the user frames; (2) identification of the user dialog acts corresponding to the user frames, and generation of their semantic generalizations; (3) transition to a user state in the BM, stochastically, using the semantic generalizations; (4) updating of the SHR with data from the user frames; (5) transition to a system state in the BM, taking into account the probabilities of the available transitions in the BM, and a set of heuristic rules (that check the consistence of the transitions against the content of the SHR); (6) updating of the SHR in the case of querying the database; (7) building of the semantic representation of the system turn (system frames); (8) writing of the system frames, providing them to the SLG and UDM modules; and (9), in case of working in BM dynamic mode, increasing of the counters of the chosen transitions.

The dialog ends when the closing state is reached. After this, and if it is working in BM dynamic mode, the following actions are made: (1) reading of the UHR; (2) verification of the success of the dialog by comparing both registers; and (3), in case of successful ending, the modified counters of the chosen transitions are used to recalculate the probabilities of all the transitions in the BM (which is consolidated into file). More details about the dialog manager algorithm can be found in [12], especially in key aspects as the semantic generalization technique, and the determination of the system behavior following a hybrid dialog strategy, which is half stochastic (by using BM) and half heuristic (by using SHR).

On the other hand, the UDM performs the following actions at the beginning of each dialog: (1) it reads the DP, including the data of the scenario, and stores them in its UHR; (2) it reads the BM, and looks for a state to ask for the goals of the scenario; and (3) it generates the corresponding question frames. In each dialog turn, the UDM performs the following actions: (1) reading of the system frames; (2) semantic generalization of the frames; (3) transition to a system state in the BM, stochastically, using the semantic generalizations; (4) updating of the UHR with data from the system frames; (5) transition to a user state in the BM, heuristically, taking into account the TPR; and (6) generation of the user frames. More details about this user simulator algorithm can be found in [13], especially the use of heuristic rules to establish the collaborative dialog strategy.

It must be remarked that both algorithms are task-independent. All the information about the tasks has been encapsulated into the bigram models, the scenarios, and other configuration files. Thus, the data-structures (models, registers) are initialized using the files that correspond to the selected task, and the methods called in both algorithms have been appropriately parameterized.

4. Development and evaluation

We have developed a JAVA platform, available in [14] as an applet, which corresponds to the design of the dialog system described in Section 3. Figure 4 shows a screenshot of this prototype in a turn of a synthetic dialog of the EDECAN task.

By means of this application, it is possible to acquire dialogs of both tasks, selecting several ways of working. In the interactive mode, any human user can provide the input frames through a graphical interface, and s/he can read the system answers, carrying out complete dialogs.

In the simulation mode, the dialog is completely done by the platform. The prototype allows us to simulate dialogs turn by turn, or whole dialogs, or series of any number of dialogs, and to specify which scenarios are simulated. In addition, the user frames can be altered by including errors in the attributes whose values are critical to achieve the success of the dialog. Moreover, there are the test and training sub-modes, which correspond to the static and dynamic modes of using the BM.

The applet area consists of seven areas of text. The three areas on the left, from top to bottom, are the real user graphical interface, the output of the UDM (user frames), and its internal state (BM transitions, and UHR content). The three areas on the right, from top to bottom, are the output of the SLG (system sentences), the output of the GSDM (system frames), and its internal state (BM transitions, and SHR content). In the bottom text area, the whole dialog is collected.

Using this prototype, we have executed several training sets for the EDECAN task, starting from the BM described in Section 2. Different trainings have been done by enabling or disabling the simulation of input errors (each training set contains 4,000 dialogs for each scenario, i.e., a total of 60,000 dialogs). Then, several test sets have been done to evaluate the learnt models (15,000 dialogs per test set). In addition, several test sets for the BASURDE task have been carried out. Table 1 summarizes the more important statistics of these test series.

Table 1. Evaluation of the prototype in both tasks.

Task              |   BASURDE     |    EDECAN
Success rate      |  98.7 | 97.1  |  99.7 | 85.3
Errors per dialog |  0.00 | 1.02  |  0.00 | 1.04
Turns per dialog  |  6.95 | 7.51  |  7.28 | 8.07
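The hybrid selection step described above (items (3) and (5) of the GSDM turn loop) combines the stochastic bigram-model probabilities with heuristic checks against the historic register. The following minimal sketch illustrates that idea; it is not the authors' JAVA prototype, and all names (bm, is_consistent, reestimate, and so on) are illustrative assumptions.

```python
import random

def choose_system_state(bm, current_state, shr, is_consistent):
    """Pick the next state from a bigram model of dialog acts.

    bm maps a state to a dict {next_state: transition probability};
    is_consistent(state, shr) is the heuristic check against the
    (system) historic register.  Illustrative sketch only.
    """
    # Heuristic filter: keep only transitions consistent with the register.
    candidates = {s: p for s, p in bm[current_state].items()
                  if is_consistent(s, shr)}
    if not candidates:                     # fall back to the unfiltered model
        candidates = dict(bm[current_state])
    states = list(candidates)
    weights = [candidates[s] for s in states]
    # Stochastic choice weighted by the BM transition probabilities.
    return random.choices(states, weights=weights, k=1)[0]

def reestimate(bm, counters):
    """Dynamic mode: after a successful dialog, turn the accumulated
    transition counters into new transition probabilities."""
    for state, nexts in counters.items():
        total = sum(nexts.values())
        bm[state] = {s: c / total for s, c in nexts.items()}
    return bm
```

Under the initial equiprobable model every transition out of a given state carries the same probability, so only the heuristic filter shapes the behaviour; a re-estimation step such as the one sketched in reestimate is what turns the counters gathered from successful dialogs into a learnt BM.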


Figure 4: The JAVA application of the dialog system.

These results are enough satisfactory. The prototype works appropriately with the BASURDE task. In previous tests [13], a success rate of 71.8% was achieved introducing 1.12 errors per dialog (and with an average duration of 4.42 system turns). Now, the success rate has risen to 97.1% (with a lower error rate: 1.02 errors per dialog). This enhancement can be explained by the increase of the duration (7.51 system turns), which is due to a greater number of confirmations.

In addition, the prototype works finely with the EDECAN task. The success rate of 85.3% (achieved introducing 1.04 errors per dialog) is lower than the one obtained in the BASURDE test, because the EDECAN task and scenarios are more complex than the BASURDE ones. This fact also explains the higher duration (8.07 system turns).

It must be remarked that the dialog manager applies the hybrid dialog strategy [12]. However, the EDECAN training starts from an initial BM, applying a heuristic strategy. To measure the quality of the learnt model, the initial BM and the learnt BM have been tested disabling the heuristic rules. In such a situation, the initial BM does not work at all (its success rate is 5.3%), whereas using the learnt BM the success rate is 42.8% (with the same error rate). This result is coherent with a similar experiment done for BASURDE in [13].

Nowadays, the prototype can be used in both tasks. Although the success rates would be lower when interacting with real users, the working of the prototype seems acceptable to use it in a real corpus acquisition. Once this EDECAN corpus will be available, the user simulation technique can be applied to enhance the BM extracted from such a corpus.

5. Conclusions

In this paper, the adaptation of a stochastic dialog manager to deal with different tasks has been discussed. A dialog system prototype has been developed, allowing us to carry out real and simulated dialogs, acquire a synthetic corpus, learn dialog models, and evaluate the system using these models. The results are enough satisfactory as to consider using the prototype in a real acquisition with promising expectations. Thus, future work will be oriented to acquire real user dialog corpus for the DIHANA and EDECAN tasks, and to extend the prototype to another semantic domains.

6. Acknowledgements

This work has been partially funded by CICYT under project TIN2008-06856-C05-02/TIN, Spain.

7. References

[1] Levin, E., Pieraccini, R., Eckert, W., "A stochastic model of human-machine interaction for learning dialog strategies", IEEE Trans. on Speech and Audio Processing, 8 (1), 11-23, 2000.
[2] Young, S., "The statistical approach to the design of spoken dialogue systems", Cambridge University, Tech. Rep., 2002.
[3] Potamianos, A., Narayanan, S., Riccardi, G., "Adaptive categorical understanding for spoken dialogue systems", IEEE Trans. on Speech and Audio Processing, 13 (3), 321-329, 2005.
[4] Eckert, W., Levin, E., Pieraccini, R., "User modeling for spoken dialogue system evaluation", Proc. of ASRU - IEEE Workshop, Santa Barbara, USA, 1997.
[5] López-Cózar, R., De la Torre, A., Segura, J.C., Rubio, A.J., "Assessment of dialogue systems by means of a new simulation technique", Speech Communication 40 (2003), 387-407.
[6] Schatzmann, J., Georgila, K., Young, S., "Quantitative evaluation of user simulation techniques for spoken dialog systems", Proc. of SIGdial Workshop, Lisboa, Portugal, 2005.
[7] Lemon, O., Gruenstein, A., Battle, A., Peters, S., "Multi-tasking and collaborative activities in dialogue systems", Proc. of SIGdial Workshop, Philadelphia, USA, 113-124, 2002.
[8] Bohus, D., Rudnicky, A.I., "The RavenClaw dialog management framework: Architecture and systems", Computer Speech & Language (2008), doi:10.1016/j.csl.2008.10.001.
[9] Lleida, E., et al., "EDECÁN: sistema de diálogo multidominio con adaptación a contexto acústico y de aplicación", Jornadas en Tecnología del Habla (JTH), Zaragoza, Spain, 291-296, 2006.
[10] Bonafonte, A., et al., "Desarrollo de un sistema de diálogo oral en dominios restringidos", JTH, Sevilla, Spain, 2000.
[11] Benedí, J.M., et al., "Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA", Proc. of LREC, Genove, Italy, 1636-1639, 2006.
[12] Torres, F., Hurtado, L.F., García, F., Sanchis, E., Segarra, E., "Error handling in a stochastic dialog system through confidence measures", Speech Communication 45 (2005), 211-229.
[13] Torres, F., Sanchis, E., Segarra, E., "User simulation in a stochastic dialog system", Computer Speech & Language, 22 (2008), 230-255.
[14] Torres, F., Prototype of the dialog system, available in http://www.laesteladetanit.es/inicioB.htm.


Terminology extraction from English-Portuguese and English-Galician parallel corpora based on probabilistic translation dictionaries and bilingual syntactic patterns

Alberto Simões, Xavier Gómez Guinovart

Department of Computer Science, Universidade do Minho, [email protected]
Department of Translation and Linguistics, Universidade de Vigo, [email protected]

Abstract

This paper presents a research on parallel corpora-based bilingual terminology extraction based on the occurrence of bilingual morphosyntactic patterns in the probabilistic translation dictionaries generated by NATools. To evaluate this method, we carried out an experiment in which both the level of lexical cohesion of the term candidates and their specificity with respect to a non-terminological corpus of the target language were taken into account. The evaluation results show a high degree of accuracy of the terminology extraction based on probabilistic translation dictionaries complemented by bilingual syntactic patterns.

Index Terms: bilingual terminology extraction, probabilistic translation dictionaries

1. Introduction

This paper presents a research on parallel corpora-based bilingual terminology extraction based on the occurrence of bilingual morphosyntactic patterns in the probabilistic translation dictionaries generated by NATools.¹ NATools is an open source workbench for parallel corpora processing which includes a sentence aligner, a probabilistic translation dictionaries extractor, a word aligner, a terminology extractor, and a set of other tools to study the aligned parallel corpora. To evaluate the method used by NATools, we carried out an experiment in which both the level of lexical cohesion of the term candidates and their specificity with respect to a non-terminological corpus of the target language were taken into account. Testing was conducted for the language pairs English-Galician and English-Portuguese using the corpus of the Unesco Courier and the JRC-Acquis, respectively. The evaluation results show a high degree of accuracy of the terminology extraction based on probabilistic translation dictionaries complemented by bilingual syntactic patterns.

¹ http://natools.sourceforge.net/

2. Terminology Extraction

The extraction algorithm used by NATools is based on translation patterns containing the most commonly found grammatical bilingual combinations for terminological units. As a help to detect the term relevance, we calculate the log-likelihood ratio for each term and the translation probability in the corpus for each candidate pair of bilingual terminological equivalents.

2.1. Extraction Algorithm

The terminology extraction algorithm used in this study is based on NATools probabilistic translation dictionaries [1]. These dictionaries are extracted automatically from sentence aligned parallel corpora. The resulting dictionaries are mappings from words in a language to a set of probable translations in other language. Each of these translations have a probabilistic measure of translatability.

This information enables to create an alignment matrix for any translation unit, either from that same corpora or from a different one. These translation matrixes include in each cell the mutual translation probability for each word combination (from the source/target language). [2] provides a detailed explanation of the matrix construction, and how it can be used to extract simple translation examples.

These same matrixes can be used to extract bilingual terminology using translation patterns. These patterns specify how word order in the source language changes after translation takes place.

              Human   Rights
   Direitos             X
   do
   Homem        X

Figure 1: Example of translation pattern: A "de" B = B A

Figure 1 illustrates an alignment pattern and its visual representation. This pattern can be read as: T(A · "de" · B) = T(B) · T(A). Each X in the table represents an anchor: it corresponds to a high translation probability.

These patterns are searched in the translation matrix, matching on anchor cells, as shown in figure 2. These cells need to have a probability value higher than 20% of the remaining column and row cells to be considered anchor cells.

Translation patterns may include morphological restrictions defining the morphological categories allowed for the words matching the pattern. Each variable on the right side is followed by a morphological restriction in square brackets [...]. NATools relies on external morphological analyzers to validate the morphological restrictions.

There are several morphological analyzer engines and, sometimes, different languages require different morphological analyzers. For instance, for our experiments we needed a morphological analyzer for Portuguese and for Galician.

While jSpell [3] has a dictionary for Portuguese, it lacks a dictionary for Galician. In the same way, FreeLing [4] has a dictionary for Galician, but it does not include a good Portuguese one.

In order to help integrate NATools with external morphological analyzers we need to create an interface tool for each morphological analyzer. This interface tool should be able to receive a set of words (one word per line) and to return an analysis for each of the words (one per line).

For instance, when calling the interface to the JSpell Portuguese dictionary with the word pode (an ambiguous word), the interface returns:

[{CAT=>'v',T=>'p',N=>'s',P=>'3',rad=>'poder'},
 {CAT=>'v',T=>'pc',N=>'s',P=>'1_3',rad=>'podar'},
 {CAT=>'v',T=>'i',N=>'s',P=>'2',rad=>'poder'},
 {CAT=>'v',T=>'i',N=>'s',P=>'3',rad=>'podar'}]

This output should appear on a single line, and its syntax should be correct (it should be valid as a Perl data-structure). The keys are completely irrelevant for NATools as far as they are the same ones used in the translation pattern definition.

For each variable containing a morphological restriction the system will invoke the specific morphological analyzer and ask for the analysis of the word. If any of the restrictions match the analysis the system will continue validating the required words. If the pattern matches (anchor cells exist in the specified position) and the morphological analysis are marked as adequate, that block is used, and the string pair is presented.

[Figure 2: alignment matrix of mutual translation probabilities for an English sentence ("discussion about alternative sources of financing for the european radical alliance") and its Portuguese translation, with the anchor cells of the matched patterns marked.]

Figure 2: Alignment matrix for a Portuguese-English translation unit with marked patterns.

2.2. Terminology metrics

2.2.1. Translation Probability

We calculate a translation probability measure for each candidate pair of bilingual terminological equivalents. This value is based on the translation probabilities for each word pair, discarding translation probabilities for stop-words. Considering the previous pattern example, the translation probability is measured as the average of the mutual translation probability of the words matching the A and B variables.

2.2.2. Log-likelihood

There are different well-known techniques for scoring the candidate terms [5]. Following many other works on term extraction based on [6], we score each candidate using the log-likelihood measure, which is computed using the Text::NSP Perl module.² Considering that the module only supports bigrams and trigrams, for bigger terms this value is computed as the minimum for the partial trigrams [7].

3. Experiments

Our experiments focused on two language pairs: English-Galician and English-Portuguese. This choice can be explained by the proximity of the two target languages. Moreover, the availability of bigger corpora for the English-Portuguese language pair made the evaluation for this language pair more relevant.

3.1. Parallel corpora and exclusion corpora

This section describes the parallel corpora used for the terminology extraction and the monolingual corpora used for bi- and trigram exclusion and extraction evaluation.

3.1.1. Parallel Corpora

For the terminology extraction we used two parallel corpora, English-Galician and English-Portuguese, of very different sizes.

The Unesco Courier³ [8] is a monthly publication which reflects Unesco's concerns and thoughts, with articles from around the world. Created in August 1947, it is published in four languages. Each issue consists of a thematic dossier that treats one of the world scientific and cultural concerns, as endangered languages, world heritage, immigration, bioethics or the spell of sport. The Unesco Corpus is a collection of 30 issues (from the period 1998-2001) of the CLUVI Parallel Corpus⁴ (English, Galician, French and Spanish), and contains a high density of terminological units from the fields of sociology and social sciences.

The JRC-Acquis is the total body of European Union law applicable in the EU Member States. This parallel corpus in 22 languages is maintained by the Language Technology group of the European Commission's Joint Research Centre. This collection of legislative text changes continuously and currently comprises selected texts written between the 1950s and the present time. For the purpose of this work we used JRC-Acquis v3 [9], the latest version available, for the English-Portuguese language pair.

[Table 1: Used parallel corpora - translation units, source/target tokens and source/target forms for the Unesco Courier and the JRC-Acquis.]

² http://ngram.sourceforge.net/
³ http://www.unesco.org/courier/
⁴ http://sli.uvigo.es/CLUVI/
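Section 2.1 above describes how translation patterns are matched on anchor cells of the alignment matrix, i.e. cells whose mutual translation probability clearly dominates the rest of their row and column (the paper states a 20% criterion). The sketch below is only one possible reading of that test, written in Python rather than the Perl used by NATools; the function and variable names, and the interpretation of the 20% margin as a relative threshold, are assumptions.

```python
def is_anchor(matrix, i, j, margin=0.20):
    """One illustrative reading of the anchor-cell test (not NATools code):
    cell (i, j) is an anchor if its mutual translation probability exceeds
    every remaining cell in row i and column j by the given relative margin."""
    value = matrix[i][j]
    row = [v for k, v in enumerate(matrix[i]) if k != j]
    col = [matrix[k][j] for k in range(len(matrix)) if k != i]
    return value > 0 and all(value > (1 + margin) * v for v in row + col)

def find_anchors(matrix):
    """Return the coordinates of all anchor cells in the matrix."""
    return [(i, j)
            for i in range(len(matrix))
            for j in range(len(matrix[i]))
            if is_anchor(matrix, i, j)]

# Toy example in the spirit of Figure 1 ("Human Rights" / "Direitos do Homem"):
# rows = Portuguese words, columns = English words ("Human", "Rights").
matrix = [
    [0.02, 0.81],   # Direitos -> high probability against "Rights"
    [0.01, 0.03],   # do       -> stop-word, no anchor
    [0.76, 0.02],   # Homem    -> high probability against "Human"
]
print(find_anchors(matrix))   # [(0, 1), (2, 0)]
```

With a stricter or looser margin the set of anchors, and therefore the set of pattern matches, grows or shrinks; that threshold is the main knob such an implementation would expose.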
pair for guage target corpora two bigger English– the of availability of proximity pairs: explained the be language can by choice two This English–Portuguese. on and Galician focused experiments Our mini- the [7]. as trigrams partial computed the is for measure value this mum terms bigger for grams, module. Perl 0fito ok nml oas rmteVrulLbayof Library Virtual the from romans) (namely works fiction 30 par- exclusion. process, trigrams evaluation and the bigrams in for used ticularly were corpora literary Two Corpora Exclusion 3.1.2. v3 JRC-Acquis used we lan- English–Portuguese work pair. the guage this for available, of version latest purpose the present [9], the the and For 1950s the time. between written com- texts currently selected and prises continuously changes text legislative collec- of This tion Centre. of Research group Joint Technology Commission’s 22 European Language in the the corpus by parallel maintained This is States. languages Member EU the in applicable sciences. a social and As sociology of fields sport. the from of units nological spell the the or whole, bioethics Unesco’s immigration, of world one heritage, languages, endangered treats as that concerns, cultural Each dossier and thematic scientific world. a the of around consists from issue articles in thoughts and the concerns of part Courier is esco which Corpus Spanish) Parallel and CLUVI French Galician, (English, the of 1998-2001) period 5 4 3 2 http://www.bivir.com/ http://sli.uvigo.es/CLUVI/ http://www.unesco.org/courier/ http://ngram.sourceforge.net/ osdrn httemdl nyspot irm n tri- and bigrams supports only module the that Considering The The The ii Corpus BiVir nsoCorpus Unesco JRC-Acquis oreTokens Source nsoCourier Unesco agtTokens Target oreForms Source agtForms Target 2 rn.Units Trans. samnhypbiainwihrflcsUnesco’s reflects which publication monthly a is Corpus al :Ue aallcorpora Parallel Used 1: Table .Experiments 3. stettlbd fErpa no law Union European of body total the is 5 saGlca ieaycru containing corpus literary Galician a is 4 sacleto f3 sus(rmthe (from issues 30 of collection a is 8.Cetdi uut14,the 1947, August in Created [8]. nsoCourier Unesco 1 886 019 1 556 057 1 Unesco otisahg est ftermi- of density high a contains 0866 50 903 47 6515 66 JRC-Acquis 105535 075 51 596 605 37 1 907 315 1 3 8 061 283 9 923 295 nfu languages four in Un- Proceedings of the I Iberian SLTech 2009

3.1.2. Exclusion Corpora

Two literary corpora were used in the evaluation process, particularly for bigrams and trigrams exclusion.

The BiVir Corpus⁵ is a Galician literary corpus containing 30 fiction works (namely romans) from the Virtual Library of Universal Literature in Galician language and mantained by the Association of Galician Translators.

Compara⁶ [10] is a large human-edited English-Portuguese parallel corpus whose sentence alignment, sentence separation, lemmatization and POS tagging have been revised by human annotators (in fact, lemmatization and tagging have been checked and corrected by hand only for Portuguese so far). Compara contains 75 fiction texts and their translations, corresponding to approximately 1.5 million words in each language.

⁵ http://www.bivir.com/
⁶ http://www.linguateca.pt/COMPARA/

Table 2: Exclusion corpora

Corpus   | BiVir     | Compara
Tokens   | 1 008 125 | 1 714 523
Bigrams  | 361 547   | 544 274
Trigrams | 641 349   | 1 243 356

3.2. Translation Patterns

In order to evaluate the precision of the NATools-based term extraction algorithm, four translation patterns have been extracted, using the morphological analyzer of FreeLing for Galician and Jspell for Portuguese.

Translation patterns for Galician (with FreeLing analyzer) and for Portuguese (with JSpell analyzer) are shown in figure 3. Tables 3 and 4 show the top occurring entries extracted using these rules.

EN-GL patterns using FreeLing tags
[R1] A B = B[CAT<-/^NC/] A[CAT<-/^AQ0/];
[R2] A B = B[CAT<-/^NC/] "de"|"do"|"da"|"dos"|"das" A[CAT<-/^NC/];
[R3] A "of"|"in"|"for" B = A[CAT<-/^NC/] "de"|"do"|"da"|"dos"|"das" B[CAT<-/^NC/];
[R4] A B C = C[CAT<-/^NC/] A[CAT<-/^AQ0/] B[CAT<-/^AQ0/];

EN-PT patterns using JSpell tags
[R1] A B = B[CAT<-/nc/] A[CAT<-/(a_nc|adj)/];
[R2] A B = B[CAT<-/nc/] "de"|"do"|"da"|"dos"|"das" A[CAT<-/(a_nc|nc)/];
[R3] A "of"|"in"|"for" B = A[CAT<-/nc/] "de"|"do"|"da"|"dos"|"das" B[CAT<-/nc/];
[R4] A B C = C[CAT<-/nc/] A[CAT<-/(adj|a_nc)/] B[CAT<-/(adj|a_nc)/];

Figure 3: EN-GL and EN-PT bilingual syntactic patterns

Table 3: EN-GL top-occurring term candidates from the Unesco Corpus

English (and LLR)      | Galician (and LLR)           | Prb  | Oc.
united states 4 701    | estados unidos 9 286         | 53.7 | 265
human rights 4 942     | dereitos humanos 3 904       | 68.3 | 207
united nations 2 462   | nacións unidas 5 130         | 47.4 | 125
world bank 1 490       | banco mundial 1 809          | 60.0 | 114
security council 467   | consello de seguridade 1 023 | 69.2 | 26
street children 342    | nenos da rúa 700             | 60.7 | 22
market economy 268     | economía de mercado 492      | 67.7 | 19
life expectancy 304    | esperanza de vida 852        | 51.6 | 18

Table 4: EN-PT top-occurring term candidates from the JRC-Acquis Corpus

English (and LLR)            | Portuguese (and LLR)          | Prob. | Oc.
european union 231 965       | união europeia 311 030        | 65.24 | 12 465
european parliament 205 297  | parlamento europeu 267 379    | 63.31 | 13 066
european community 136 471   | comunidade europeia 224 132   | 57.48 | 18 251
european communities 265 877 | comunidades europeias 284 409 | 53.51 | 19 545
council decision 43 760      | decisão do conselho 398 348   | 58.80 | 1 665
commission decision 32 322   | decisão da comissão 264 191   | 43.73 | 2 215
basic regulation 61 569      | regulamento de base 103 700   | 63.75 | 3 390
management committee 36 170  | comité de gestão 83 014       | 69.79 | 3 549

4. Filtering and evaluation

Different methods are used for filtering the results of term extraction: identification of unlikely term candidates because of their similarity with a lexical pattern, ranking of candidates by virtue of some score of lexical association, and assessment of term specificity with respect to some kind of non-terminological corpus of the language, among others [11].

With the first filtering method, term candidates beginning or ending with any of the words of a list of stop words are removed from the list. This is the approach used by the Corpógrafo [12]. This method, however, does not apply to the results of NATools complemented with bilingual syntactic patterns, since term candidates generated by NATools match the patterns specified by particular morphosyntactic rules, which means that they never begin or end with a stop word.

Another well-known method for filtering the results of term extraction consists of calculating the lexical association of candidates in the corpus using one of the possible scores to test the strength of this attraction, such as the Mutual Information [13] and the log-likelihood ratio [6]. One of the most widely used scores for terminology extraction is the log-likelihood ratio, which is the score calculated by the term extractor in NATools. However, this score does not carry any significance as a discriminatory factor when assessing the outcome of the terminology extraction by NATools with bilingual syntactic patterns, presumably because the quality of selection based on a probabilistic translation dictionary derived from the parallel corpus and filtered with patterns ensures a fairly high minimum cohesion between the components of the candidate terms.

Thus, we decided to check the accuracy of the term extraction of NATools with bilingual syntactic patterns using a non-terminological corpus of exclusion as a filter. The exclusion corpus will determine the identification (and exclusion) of unlikely term candidates. Literary corpora, unlike corpora of news articles, for instance, usually contain very few terminological units. A literary corpus, as a corpus of exclusion for term extraction, represents a very safe filter. When using a literary corpus as a filter, there are more false candidates identified as such than correct candidates wrongly identified as false ones. We created lists of word n-grams from the exclusion corpora BiVir and Compara (see above), and applied these lists as criteria for filtering and evaluation of NATools-based terminology extraction. The results are discussed in the next section.

4.1. Experiment Results

The evaluation results (table 5) point to a high precision of the NATools-based extraction algorithm. As shown in the first column of the table, the 12,689 translation equivalences (TE) identified in the Unesco Corpus using NATools with the EN-GL bilingual syntactic patterns depicted in figure 3 represent 7,250 candidate bilingual term pairs (term candidates or TC) (57% of TE) after eliminating repeated TE.

pus, we obtain a list of 6,949 Galician terms from TC (corresponding to 96% of TC) which are not present in the exclusion corpus, and a complementary list of 301 Galician term candidates (only 4% of TC) identified as erroneous term candidates due to their presence in the exclusion corpus. Thus, these scores show a precision of 96% in the NATools-based term extraction from the Unesco Corpus.
As for the experiments with the JRC-Acquis Corpus, the 717,293 TE identified with the EN-PT bilingual syntactic patterns shown in Figure 3 represent 72,952 TC (only 10.2% of TE) after eliminating repeated TE. The difference between the TE/TC ratio of the Unesco Corpus and that of the JRC-Acquis Corpus (57% vs. 10.2%) lies in the lexical density (percentage of different words in a text) of the two corpora. When filtering that list of TC with the list of n-grams from the Compara Corpus, we get a list of 63,744 Portuguese terms from TC (corresponding to 87.4% of TC) which are not present in the exclusion corpus, and a complementary list of 9,208 Portuguese term candidates (12.6% of TC) identified as unlikely term candidates because of their presence in the exclusion corpus. The differences in the precision scores of term extraction between the Unesco Corpus and the JRC-Acquis Corpus (96% vs. 87.4%) lie in the different sizes of the corpora (and of the exclusion corpora) and also in their level of lexical density and terminological specificity.

Table 5: Extraction results
Corpora         Unesco          JRC-Acquis
Language        GL              PT
Trans. Equiv.   12 689          717 293
Term Cand.      7 250 (57%)     72 952 (10.2%)
Excluded TC     301 (4%)        9 208 (12.6%)
Not-excl. TC    6 949 (96%)     63 744 (87.4%)

Moreover, the evaluation undergone has shown that the log-likelihood ratio (LLR) may be significant as a score to rank the "terminological quality" of candidates belonging to one language and one corpus. However, the LLR cannot be used for comparing the quality of term candidates extracted from two corpora of different sizes, since the LLR depends on the size of the corpora, as shown in Table 6.

Table 6: Log-likelihood ratio statistics
          Unesco              JRC-Acquis
          EN        GL        EN         PT
min       0         0         0          0
max       4 942     9 286     448 664    613 529
mean      75        153       3 486      9 734
stddev    238       415       10 666     32 238

Finally, the differences in translation probabilities of bilingual term candidates from the two parallel corpora (mean values of 49.60 vs. 53.00, Table 7) point to a highly homogeneous extraction of candidates with respect to translation probability, in spite of the highly heterogeneous characteristics of the corpora.

Table 7: Translation probability statistics
          Unesco EN–GL    JRC-A. EN–PT
min       10.25           10.03
max       85.57           94.53
mean      49.60           53.00
stddev    13.92           13.02

5. Conclusions
Bilingual terminology extraction from parallel corpora based on probabilistic translation dictionaries and complemented with bilingual syntactic patterns shows high rates of accuracy. In the experiments described here this ratio is between 87.4% and 96%, depending on the characteristics of the corpus. Considering that this method of extraction depends on POS-tagger accuracy, erroneous tagging may lead to false candidates. Thus, an improvement in tagging results brings about an improvement in the performance of terminology extraction.

6. Acknowledgments
This work has been funded by the Ministerio de Educación y Ciencia and the Fondo Europeo de Desenvolvemento Rexional (FEDER) within the project "Diseño e implementación de un servidor de recursos integrados para el desarrollo de tecnologías de la lengua gallega (RILG)" (HUM2006-11125-C02-01/FILO), and by the Consellaría de Innovación e Industria da Xunta de Galicia within the project "Desenvolvemento e aplicación de recursos integrados da lingua galega" (ref. INCITE08PXIB302185PR).

7. References
[1] Simões, A. and Almeida, J.J., "NATools: A Statistical Word Aligner Workbench", Procesamiento del Lenguaje Natural, 31, 217-224, 2003.
[2] Simões, A. and Almeida, J.J., "Combinatory Examples Extraction for Machine Translation", in Proc. of the 11th Annual Conference of the European Association for Machine Translation, 27-32, 2006.
[3] Almeida, J.J. and Pinto, U., "Jspell: um módulo para análise léxica genérica de linguagem natural", in Actas do X Encontro da Associação Portuguesa de Linguística, 1-15, 1994.
[4] Atserias, J. et al., "FreeLing 1.3: Syntactic and semantic services in an open-source NLP library", in Proc. of the 5th International Conference on Language Resources and Evaluation, 48-55, 2006.
[5] Daille, B., "Study and Implementation of Combined Techniques for Automatic Extraction of Terminology", in J. Klavans and P. Resnik [Ed] The Balancing Act: Combining Symbolic and Statistical Approaches to Language, 49-66, The MIT Press, 1996.
[6] Dunning, T., "Accurate methods for the statistics of surprise and coincidence", Computational Linguistics, 19, 61-74, 1993.
[7] Patry, A. and Langlais, P., "Corpus-Based Terminology Extraction", in Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, 313-321, 2005.
[8] Gómez Guinovart, X. and Sacau, E., "Parallel corpora for the Galician language: building and processing of the CLUVI (Linguistic Corpus of the University of Vigo)", in Proceedings of the 4th International Conference on Language Resources and Evaluation, 1179-1182, 2004.
[9] Steinberger, R. et al., "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages", in Proc. of the 5th International Conference on Language Resources and Evaluation, 2006.
[10] Frankenberg-Garcia, A. and Santos, D., "Introducing COMPARA, the Portuguese-English parallel translation corpus", in F. Zanettin, S. Bernardini and D. Stewart [Ed] Corpora in Translation Education, St. Jerome Publishing, 71-87, 2003.
[11] Hong, M., Fissaha, S. and Haller, J., "Hybrid filtering for extraction of term candidates from German technical texts", in Proceedings of Terminologie et Intelligence Artificielle, 2001.
[12] Sarmento, L. et al., "Corpógrafo V3: From Terminological Aid to Semi-automatic Knowledge Engine", in Proc. of the 5th International Conference on Language Resources and Evaluation, 1502-1505, 2006.
[13] Church, K. and Hanks, P., "Word association norms, mutual information, and lexicography", Computational Linguistics, 16(1), 22-29, 1990.
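To make the exclusion-corpus filtering of Section 4 concrete, the following sketch reproduces its main steps under stated assumptions: the file names and the tokenizer are illustrative, and the fallback for candidates longer than a trigram only mirrors, rather than reproduces, the minimum-of-partial-trigrams idea mentioned earlier.

import re

def tokens(text):
    # Simple lowercase word tokenizer; the original system works on the
    # corpus' own tokenisation, which we only approximate here.
    return re.findall(r"[\w-]+", text.lower())

def ngrams(seq, n):
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def exclusion_ngrams(corpus_text):
    # Word bi- and trigrams observed in the literary exclusion corpus.
    toks = tokens(corpus_text)
    return ngrams(toks, 2) | ngrams(toks, 3)

def is_excluded(candidate, excl):
    # A target-language bigram/trigram candidate is excluded when it occurs
    # verbatim in the exclusion corpus; for longer candidates we require all
    # of their partial trigrams to occur (an assumption, not the paper's exact rule).
    toks = tokens(candidate)
    if len(toks) in (2, 3):
        return tuple(toks) in excl
    parts = ngrams(toks, 3)
    return bool(parts) and parts <= excl

if __name__ == "__main__":
    excl = exclusion_ngrams(open("bivir.txt", encoding="utf-8").read())
    cands = [l.strip() for l in open("candidates_gl.txt", encoding="utf-8") if l.strip()]
    kept = [c for c in cands if not is_excluded(c, excl)]
    print(f"{len(kept)} of {len(cands)} candidates kept "
          f"({100 * len(kept) / max(len(cands), 1):.1f}%)")

Used in this way, the script yields the "not-excluded" and "excluded" candidate counts that Table 5 reports as precision estimates.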


A Hierarchical Architecture for Audio Segmentation in a Broadcast News Task

Mateu Aguilo, Taras Butko, Andrey Temko, Climent Nadeu

Department of Signal Theory and Communications, TALP Research Center Universitat Politecnica` de Catalunya, Barcelona, Spain {maguilo,butko,temko,climent}@gps.tsc.upc.edu

Abstract
In the broadcast news domain audio segmentation is an important pre-processing step for other speech technologies like speech recognition and speaker diarization. In this work we propose an architecture that allows the integration of the individual detections of various acoustic classes. By implementing a different algorithm adapted to the characteristics of each class, we can obtain much better results than by using a generic detector for all classes. Additionally, new features suited to detecting telephone-channel speech over wideband music, which improve the accuracy, are also introduced.
Index Terms: audio segmentation, acoustic event detection, music detection, telephone speech, software architecture

1. Introduction
The TECNOPARLA project aims to develop speech technologies in the Catalan language focusing on the broadcast news task. It involves language identification, automatic speech recognition (ASR), machine translation, speech synthesis and speaker diarization [1].
Audio segmentation is the task of segmenting a continuous audio stream in terms of acoustically homogenous regions. Several speech technologies can benefit from audio segmentation done at early steps. A previous identification of speech segments facilitates the task of speech recognition or speaker diarization systems. Furthermore, audio segmentation is widely used for online adaptation of ASR models or to generate a set of acoustic cues for speech recognition to improve overall system performance [2]. In [3] audio classes are defined according to human perception, which provides a clue towards detecting a particular event; the audio streams are segmented into five different types including speech, commercials, environmental sound, physical violence and silence. Similarly, in [4] five audio classes are defined: silence, music, background sound, pure speech, and non-pure speech, which includes speech over music and speech over noise. The definition of audio classes depends much on the data and the application domain.
In this work the database consists of 43h and 25m of audio coming from audio-visual recordings of the Àgora debate program of the Catalan TV (TV3). According to this material we define six different audio classes:
• "Speech". This is pure speech recorded in the studio without background such as music.
• "Speech over music". This category includes all studio speech with music in the background.
• "Telephone speech". Some sections of the program have telephonic interventions from the viewers. These interventions are mixed in the program's main audio stream as a wide band stream.
• "Telephone speech over music". The same as the previous class, but additionally there is music in the background.
• "Music". Pure music recorded in the studio without any speech on top of it.
• "Silence".
Since silences are not labeled, the evaluation of the "silence" class is not included in our task. However, it is detected to facilitate the detection of the other classes. Moreover, the "telephone speech" class is poorly represented in the database (see Section 3), so this class is not evaluated either.
We propose a hierarchical architecture for detecting acoustic classes using a set of binary detection systems. For comparison we also show an alternative system with a one-step multiclass detector described in [5].
The rest of this paper is organized as follows: Section 2.1 describes the classical one-step multiclass segmentation approach. The hierarchical structure for segmentation is presented in Section 2.2. Section 3 presents experimental results and Section 4 concludes the work.

2. Audio segmentation

2.1. One-step multiclass detection

2.1.1. Features
The audio signal (16 kHz sampling rate) is framed using a 30 ms Hamming window and a 10 ms shift. For each frame, a set of spectral parameters has been extracted. It consists of the concatenation of two types of parameters: 1) 16 Frequency-Filtered (FF) log filter-bank energies [10], along with the first and second time derivatives; and 2) a set of the following parameters: zero-crossing rate, short time energy, 4 sub-band energies, spectral flux calculated for each of the defined sub-bands, spectral centroid, and spectral bandwidth. In total, a vector of 60 components is built to represent each frame. The mean and the standard deviation of these parameters have been computed over all frames in a 0.5 sec window with a 100 ms shift, thus forming one vector of 120 elements.

2.1.2. Classifier
For each pair of classes an SVM classifier is trained. A dataset reduction algorithm based on PSVMs [6] is applied to cope with the enormous amount of data available for training. Those sets of feature vectors whose PSVM classifier accuracies are in the middle (not the best classifiers nor the worst, in contrast with [6]) are finally used to train the final SVM classifier. Using a DAG architecture, as proposed in [9], each frame is classified in the final stage.

2.2. Hierarchical architecture
The hierarchical architecture (Figure 1) is a group of detectors (called modules), where each module is responsible for the detection of one class of interest. As input it uses the output of the preceding module and it has 2 outputs: the first corresponds to the audio segments detected as the corresponding class of interest, and the other is the rest of the input stream.

Figure 1: Flow diagram of the hierarchical architecture. SP: Pure speech. SM: Speech over background music. TM: Telephone speech over background music. MU: Pure music. SI: Silence.

One of the most important decisions when using this kind of architecture is to put the modules in the best order in terms of information flow, since some modules may benefit greatly from the previous detection of certain classes. For instance, previous detection of the classes that show high confusion with subsequent classes can potentially improve the overall performance. On the other hand, in this type of architecture, it is not necessary to have the same classifier, feature set and/or topology for the different detectors. Tuning of parameters is done in each system independently, and the two-class detection can be done in a fast and easy way.
Given the modules, the detection accuracy can be computed individually and a priori. The modules with the best accuracies are then placed in the early stages to facilitate the subsequent detection of the classes with the worst individual accuracies.

2.2.1. Silence detection
The silence detector before the "music" detector is based on the derivative of the short time energy. This is done to avoid confusion with silences that have "musical" spectra. The algorithm can be described as follows:
• In the first stage the audio signal is low-band filtered at 1.5 kHz. Although this filtering may cause problems with fricatives, which might become misdetected silences, this is dealt with in a post-processing stage by using time constraints.
• The short time energy is convolved with a 31-sample derivative filter, as proposed in [7] with the modifications in [8], to enhance the dynamics of the signal.
• Finally, a threshold is tuned to separate the speech and non-speech frames. A final post-processing stage smooths the decisions and places time constraints (by using a finite state automaton) to meet the evaluation requirements.
This detector has been tuned to give the class "silence" at its output only when the confidence is high.
The second silence detector removes most of the silences to prepare the signal for the subsequent modules. Since there are no references for the silences, it is trained in an unsupervised manner. The algorithm can be described as:
• The short-time energy of the signal is transformed to the logarithmic scale, and a GMM of N Gaussians is trained. The Gaussians with a weight lower than a fixed percentage of the weight of the Gaussian with the highest weight are discarded (if any).
• The Nsil Gaussians with the lowest mean are selected for the silence class (as they represent the frames with low energy). The N − Nsil other Gaussians are left for the non-silence class.
• With the Gaussians selected for silence, the whole show is evaluated frame by frame. The same is done with the non-silence class. Comparing the silence and non-silence likelihoods, plus a penalty for silence, each frame is classified as silence or non-silence.
• Then the decisions are smoothed using a median filter. Finally, silences longer than the specified minimum duration are written to the output file.

2.2.2. Music detection
Music segments usually appear at the beginning and the end of the show or when the topic of discussion changes. Music serves as an introduction to the show and attracts the attention of the audience towards its beginning. It is worth mentioning that the melody in Àgora shows does not vary, and only 2 or 3 different musical instruments can be distinguished: drums, saxophone and piano. To detect music segments, a one-against-all topology of the detection process is selected [10] [11]. As discussed in [10], the advantage of this topology is the possibility of using a specific kind of features for each particular classification task. The differences between the music and non-music classes can be noticed in the spectral domain. The periodograms of 0.5 sec long music and speech segments are displayed in Figure 2 (we selected the "speech" class as a representative of the "non-music" metaclass).
As can be seen from Figure 2, the spectral envelope is flatter for the "music" class, while for the "speech" class the energy is concentrated in the lower bands. Typical ASR features are used in this music detector (the FF coefficients with their first time derivatives; in total the feature vector has 32 components). Finally, mean normalization is applied. We model each of the two classes separately using Hidden Markov Models (HMMs) and apply Viterbi decoding for the final segmentation. The "music" HMM model consists of 2 emitting states with 5 Gaussians per state, while the "non-music" model has 3 emitting states and 9 Gaussians per state, as its observation distribution is more complex. Both models have left-to-right connected state transitions.
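The unsupervised, GMM-based silence detector of Section 2.2.1 is compact enough to sketch. In the sketch below, the number of Gaussians, the silence penalty, the weight-pruning ratio and the smoothing window are illustrative assumptions (the paper does not fix them), and NumPy, SciPy and scikit-learn are used purely for convenience.

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def log_energy(signal, frame=480, shift=160):
    # Log short-time energy: 30 ms frames with a 10 ms shift at 16 kHz.
    windows = np.lib.stride_tricks.sliding_window_view(signal.astype(float), frame)[::shift]
    return np.log(np.sum(windows ** 2, axis=1) + 1e-10)

def silence_mask(signal, n_gauss=8, n_sil=2, penalty=0.5, min_weight_ratio=0.05):
    # signal: 1-D NumPy array of 16 kHz samples.
    e = log_energy(signal)
    gmm = GaussianMixture(n_components=n_gauss, covariance_type="diag",
                          random_state=0).fit(e.reshape(-1, 1))

    # Drop Gaussians whose weight is below a fixed fraction of the largest weight.
    keep = gmm.weights_ >= min_weight_ratio * gmm.weights_.max()
    means = gmm.means_[keep, 0]
    stds = np.sqrt(gmm.covariances_[keep, 0])
    n_sil = max(1, min(n_sil, len(means) - 1))

    # Lowest-mean Gaussians model silence (low energy); the rest model non-silence.
    order = np.argsort(means)
    comp_ll = norm.logpdf(e[:, None], loc=means[None, :], scale=stds[None, :])
    sil_ll = comp_ll[:, order[:n_sil]].max(axis=1) - penalty
    nonsil_ll = comp_ll[:, order[n_sil:]].max(axis=1)
    mask = sil_ll > nonsil_ll

    # Median smoothing of the frame decisions over a 5-frame window.
    padded = np.pad(mask.astype(int), 2, mode="edge")
    return np.array([np.median(padded[i:i + 5]) for i in range(len(mask))]) > 0.5

A downstream step would then keep only silence runs longer than the specified minimum duration, as described in the last bullet of the algorithm.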


Figure 2: Periodograms corresponding to "speech" and "music" classes. Sampling rate 16 kHz.

Figure 3: Sub-band couples for the spectral slope superposed over periodograms corresponding to "speech" and "telephone speech over music" classes. Sampling rate 16 kHz.

Using the proposed detection scheme, the confusion between the speech and music classes is minimal.

2.2.3. Speech over music detection
Often the discussions in Àgora shows start when music is still in the background. In this case we call it a "speech over music" segment. We use the same feature set as well as the same detection scheme as in the previously described music detector. Depending on the ratio between the energies of speech and background music, the spectrum will be more or less similar to the spectrum of the "speech" class and, in the extreme case when the energy of the music in the background is very low, the differences between the corresponding spectra are negligible. In such cases the confusion between classes increases.

2.2.4. Telephone speech over music detection
To detect the "telephone speech over music" class we use the two-class version of the system described in subsection 2.1. In our scenario the "telephone speech over music" class is composed of music that spans the full frequency range 0-8 kHz and telephone speech, which lies in the low frequency range. New features, called spectral slopes, are concatenated to the existing ones to enhance the detection accuracy. To compute a spectral slope, two different couples of sub-bands are defined. These sub-bands have been chosen to discriminate between "telephone speech over music" and the rest of the audio based on the slope of the spectrum in the region around 4000 Hz, the end of the band of telephone speech, beyond which only music frequency components exist. The first couple is made of the sub-bands [1000 − 3000] Hz and [3000 − 7000] Hz, and the second consists of the sub-bands [3000 − 3500] Hz and [3500 − 4000] Hz, which aims to parametrize the energy in the region where the energy drop should appear for the "telephone speech over music" class. A feature vector ss is computed for each couple as:

ss = (S1, S2, S1 / S2)    (1)

where S1 and S2 are the total energies of the first and second sub-band, respectively.
Experimental results have shown that the dynamics of the spectral slope features are helpful for the detection of the "telephone speech over music" class. Thus the deltas and accelerations are added to the final feature vector. Finally, we obtain a set of 18 values for each frame (with two sub-band couples), which is concatenated with all the features listed in subsection 2.1, leading to a feature vector of 78 components.

3. Experiments

3.1. Database description
As mentioned in Section 1, the database used to test the system consists of 43h and 25m of spontaneous speech in the context of a debate TV program. Each program has been cut in two parts to exclude the commercials, and each part has a duration of about 40 minutes. Àgora is a highly moderated program where around 7 different speakers discuss a wide variety of topics. The Àgora program has a fairly fixed structure, although no use of this information has been made in order to keep the system general. As can be observed in Table 1, the dominant class is "speech", appearing 83.21% of the time. The class "telephone speech" (without background music) has been discarded because of its extremely low appearance. The class "OV" (overlapped speech coming from two or more speakers) has been left out for future work.

Table 1: Distribution of the events in the database.
Acoustic class                   Appearance (%)
Speech                           83.21
Speech over music                9.78
Telephone speech                 0.02
Telephone speech over music      2.48
Music                            1.16
Overlapped speech                3.36

The Àgora database has been manually annotated. In order to evaluate the audio segmentation system the database has been divided into three sets: training, development and evaluation. The sets have been designed to have a similar distribution to the whole database (see Table 1). This leaves 8h of audio for development, 8h for evaluation, and 27h for training.

3.2. Results
In order to evaluate the improvements introduced by the hierarchical architecture, two systems are compared:
• One-step (described in subsection 2.1).
• Hierarchical (described in subsection 2.2).
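Before turning to the evaluation metrics, the spectral-slope features introduced in Section 2.2.4 can be illustrated with a short sketch. The FFT-based band-energy computation and the Hamming windowing below are assumptions; the paper only fixes the two sub-band couples and the (S1, S2, S1/S2) form of equation (1).

import numpy as np

# The two sub-band couples proposed for the spectral slope, in Hz.
COUPLES = [((1000, 3000), (3000, 7000)),
           ((3000, 3500), (3500, 4000))]

def band_energy(mag2, freqs, lo, hi):
    # Total spectral energy between lo and hi Hz.
    return float(np.sum(mag2[(freqs >= lo) & (freqs < hi)]))

def spectral_slopes(frame, sr=16000):
    # Per-frame spectral-slope vector: (S1, S2, S1/S2) for each sub-band couple.
    mag2 = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    feats = []
    for (lo1, hi1), (lo2, hi2) in COUPLES:
        s1 = band_energy(mag2, freqs, lo1, hi1)
        s2 = band_energy(mag2, freqs, lo2, hi2)
        feats.extend([s1, s2, s1 / (s2 + 1e-10)])
    return np.array(feats)  # 6 static values; with deltas and accelerations -> 18

Appending deltas and accelerations of these 6 values over time yields the 18 spectral-slope components that are concatenated with the 60 base features described in subsection 2.1.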


We use two metrics to compare both systems: the first metric is the ratio between the time when the hypothesis does not match the reference (error time) and the total time of the audio recordings. The second metric is the average ratio between the error time and the total time of audio per class:

ERROR = t_error / t_total    (2)

MERROR = (1 / N_class) · Σ_{i=1}^{N_class} t_error(class_i) / t_total(class_i)    (3)

Table 2 shows that the use of the hierarchical architecture improves both ERROR and MERROR.

Table 2: Segmentation results
System         ERROR (%)    MERROR (%)
One-step       7.20         46.88
Hierarchical   3.71         3.4

Table 3: Segmentation results per class
Class                          One-step (%)    Hierarchical (%)
Pure speech                    6.5             4.8
Pure music                     32.0            2.4
Speech over music              75.3            4.9
Telephone speech over music    73.7            1.5

As can be seen in Table 3, the large reduction of MERROR can be explained by the poor results the one-step system achieves in the minority classes, while the hierarchical system performs rather well for all classes.
The proposed spectral slope features yield a strong relative improvement in the detection of the class "telephone speech over music", as displayed in Table 4.

Table 4: "Telephone speech over music" detection results
System               ERROR (%)
w/o Spectral Slope   3.5
w/ Spectral Slope    1.5

4. Conclusions
From the results in Table 2, it can be observed that the use of a more flexible architecture allows the development of a system that is better suited to a particular task. A large improvement can be obtained by using a set of detectors which are properly combined and also tuned to the different target classes.
The one-step multiclass detection system tries to detect the most dominant class while doing worse in the other classes; this is reflected in the large value of MERROR. On the other hand, a hierarchical system does not detect the most dominant class ("speech") explicitly; conversely, it detects all other classes and "speech" is what is left.
Future work will be devoted to improving the performance of the "speech over music" detector. For instance, the current system produces a large proportion of errors in speech segments with a very low level of music in the background. The forthcoming annotation of the "silence" class in the Àgora database will make it possible to tune the parameters of the silence detectors and obtain an improvement of the overall accuracy.

5. Acknowledgements
This work has been funded by the Generalitat de Catalunya in the framework of the TECNOPARLA project and also by the Spanish SAPIRE project (TEC2007-65470).

6. References
[1] H. Schulz, M. R. Costa-Jussà, J. R. A. Fonollosa, "TECNOPARLA - Speech technologies for Catalan and its application to Speech-to-speech Translation", Procesamiento del Lenguaje Natural, vol. 41, pp. 319-320, 2008.
[2] H. Meinedo, J. Neto, "Audio Segmentation, Classification And Clustering in a Broadcast News Task", Proc. ICASSP, vol. 2, pp. 5-8, 2003.
[3] T. L. Nwe, H. Li, "Broadcast news segmentation by audio type analysis", ICASSP, vol. 2, pp. 1065-1068, 2005.
[4] L. Lu, S. Z. Li, H.-J. Zhang, "Content-based Audio Segmentation Using Support Vector Machines", IEEE International Conference on Multimedia and Expo, pp. 956-959, 2001.
[5] A. Temko, C. Nadeu, J-I. Biel, "Acoustic Event Detection: SVM-based System and Evaluation Setup in CLEAR'07", in Multimodal Technologies for Perception of Humans, LNCS, vol. 4625, pp. 354-363, Springer, 2008.
[6] A. Temko, D. Macho, C. Nadeu, "Enhanced SVM Training for Robust Speech Activity Detection", IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2007.
[7] Q. Li, J. Zheng, A. Tsai and Q. Zhou, "Robust endpoint detection and energy normalization for real-time speech and speaker recognition", IEEE Transactions on Speech and Audio Processing, vol. 10, 2002.
[8] M. Aguilo, "Detección de actividad oral en un sistema de diarización", Final Degree Project, UPC, 2005.
[9] J. Platt, N. Cristianini, J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification", Proc. Advances in Neural Information Processing Systems, vol. 12, pp. 547-553, 2000.
[10] T. Butko, C. Canton-Ferrer, C. Segura, X. Giró, C. Nadeu, J. Hernando, J.R. Casas, "Improving Detection of Acoustic Events Using Audiovisual Data and Feature Level Fusion", accepted to Interspeech, 2009.
[11] R. Rifkin, A. Klautau, "In defense of One-Vs-All Classification", Journal of Machine Learning Research, vol. 5, pp. 101-141, 2004.


Browsing Multilingual Making-Ofs

Carlos Teixeira1, Ana Respício2 and Catarina Ribeiro1

1LASIGE 2OR Center University of Lisbon, DI 1749 - 016 Lisbon, Portugal {cjct, respicio}@di.fc.ul.pt, [email protected]

with the foreknowledge that studios will probably earn an Abstract increasing percentage of their profits from a growing The present work describes a new film player that enriches catalogue of DVDs” [11]. cinematographic experience and boost film-viewer interaction. The implementation of the proposed prototype encloses two A multilingual subtitle time alignment algorithm provides main problems: the synchronization of the film with the natural browsing links between film and the corresponding related parts in the making-of; and the availability of an making-of. Results are presented for a set of known English interactive film player, able to integrate two video streams, spoken films, also using the corresponding Portuguese and alternatively presented according to viewer decisions. The Spanish subtitles. solution found for the first problem assumes that both film and making-of are subtitled in the same language. This allows the Index Terms: multilingual processing, subtitle alignment, use of text alignment techniques for video synchronization. A making-of alignment, interactive cinematography. similar approach was recently presented for film enrichment [12]. Additional decisions should be made in order to find: Introduction first, if there is a relevant making-of scene to be presented; secondly, find the most relevant making-of scene to be In the past few years, films and related add-ons have become presented. At the present time, our research has pursuit a an important resource for the SLT (Speech and Language multilingual approach, only requiring a few specific language Technology) area. The production and use of subtitles lexica, such as characters and locations names. The quality and provides a clear bridge between the video signal processing, the consistency of the subtitles have revealed to be more including audio and speech, and textual annotation suitable for important than the used language. The second problem also NLP (Natural Language Processing) and related applications: reutilizes some of the strategy used in previous author’s work automatic subtitle translation [1], parallel corpora construction [12]. However the need for interaction and the visioning of a based on subtitles [2][3] and information retrieval [4]. second film raises additional problems, namely while Specialized techniques have been proposed for automatically maintaining the accessibility using the web. extracting film’s images [5] and shots [6]. These can be This paper is organized in five sections. Next section respectively be used in DVDs menus and trailers. The menus describes the approach adopted for the first problem: finding provide an intuitive interactive tool for browsing the film from time anchors for synchronization of the film with the related the “outside” of the film. However, no attempt is found so far parts in the making-of. Section 3 describes the solution found in the opposite direction: using the film itself to browse for the second problem: the interactive film player integrating additional information. To our knowledge this paper presents the making-of visioning. Section 4 presents results obtained in the first prototype for evaluation of this type of interaction. two known English spoken films. Finally, the last section The development of applications for film content analysis presents conclusions and future work. [7][8], browsing [9], and skimming [5] has been a challenge in the past few years. 
These applications can be useful for film Time alignment production, for someone who wants to study the film from different perspectives or even just to provide a faster and The making-of often includes small shots from the film using better understanding of the film for a common user [10]. the same subtitles in both videos. These are considered here as Technological developments, the increasing availability of reliable anchors for the required alignment, since such subtitle movie related data, as well as the deployment of new standards matching can be robustly detected. However, the order of for multimedia navigation have lead to the development of these shots in the making-of often depends on a different time specialized techniques to approach movie content analysis and structure. A making-of usually contains interviews from film skimming [4]. New research lines have been focused on cast, having a specific time structure, sometimes according to innovative approaches to integrate additional multimodal the specific agenda of the interviewer or even the answers media, such as the ingredients (previous inspirational writings, from the cast. Accordingly, our approach considers similarity screenplay) and sub-products from film production (subtitles, between film and making–of subtitles independently of their making-of, interviews, and writings about the film) [10]. This original order in any of those contents. However, film shots paper describes a prototype implementing one of those fill a small percentage of the making-of duration. Some parts possible integrations: the making-of merged into the of the making-of contents may not even be related to any correspondent film. A making-of is a documentary film about specific scene in the film. This is the case when, for instance, the film production. Making-ofs and other similar film add-ons the cast is asked about personal issues about themselves, their are having increasing importance in the DVD market: “DVD family or friends. add-ons have become much more expansive, even reflexive, This section briefly describes the proposed approach to with consumers/fans willing to pay much higher prices for obtain the alignments between film and making-of subtitles deluxe versions of their favorite films. Films are now according to the main blocks in Figure 1. frequently shot specifically to include material for add-ons, Before being coded into a vector token, the subtitles from


both film and making-of are both normalized in a alignment. Stop words receive lower weights. Proper names, preprocessing stage. Every sequence of alphanumeric such as the character names, receive higher weights. The characters (letters or digits) occurring between white spaces or current system does not consider multiword expressions, very punctuation marks is considered a word. All letters are frequent in languages, and whose correspondence between converted to lowercases. Each word has an associated languages is not always obvious, and their inclusion has not multiplying weight according to its importance for the yet been tested.

Figure 1: Aligning system. (Film and making-of subtitle texts with their time codes are preprocessed, passed to the aligning method together with a lexicon, and post-processed into the final subtitle alignments.)
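A minimal sketch of the subtitle preprocessing, token weighting and cosine ranking described in this section is given below. The stop-word list, the proper-name lexicon, the weight values and the 0.80 threshold handling are illustrative assumptions; the paper only states that stop words receive lower weights and character names higher ones.

import re
from collections import Counter
from math import sqrt

STOP_WORDS = {"the", "a", "of", "and", "to"}      # assumed; language-dependent
PROPER_NAMES = {"dorothy", "hannibal", "lecter"}  # assumed film-specific lexicon

def vectorize(subtitle, stop_w=0.3, name_w=3.0):
    # Lowercase alphanumeric tokens, each weighted by its assumed importance.
    vec = Counter()
    for word in re.findall(r"[a-z0-9]+", subtitle.lower()):
        weight = stop_w if word in STOP_WORDS else name_w if word in PROPER_NAMES else 1.0
        vec[word] += weight
    return vec

def cosine(u, v):
    # Cosine of the angle between the two token vectors (the aligner's similarity).
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def align(film_subs, mkof_subs, threshold=0.80):
    # For each film subtitle keep the best-matching making-of subtitle, if any.
    mkof_vecs = [vectorize(s) for s in mkof_subs]
    anchors = {}
    for i, sub in enumerate(film_subs):
        fv = vectorize(sub)
        scores = [cosine(fv, mv) for mv in mkof_vecs]
        if scores:
            j = max(range(len(scores)), key=scores.__getitem__)
            if scores[j] >= threshold:
                anchors[i] = j
    return anchors

The anchors returned here correspond to the subtitle assignments that the player later maps back to time codes for switching between the two video streams.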

The output of the aligning method is represented in the "subtitle alignments" block of Figure 1. The arrows provide the time anchors that will allow switching the viewing between the film and the corresponding making-of. These time anchors are built exclusively between the subtitles of each video stream. In the future, we plan to consider additional information such as content-based features [7][8], and other related contents such as the screenplay.
The subtitle alignment is done, one by one, evaluating the similarity measure of the well-known vector space model. In this model, each subtitle is represented by a token vector representing the word occurrences in the lexical space. Similarities between subtitles are ranked by evaluating the deviation of angles between the corresponding token vectors. This is equivalent to computing the cosine of the angle between the corresponding vectors. Consequently, the similarity between the i-th film subtitle and the j-th making-of subtitle is given by equation (1):

S(i, j) = (film_i · mkof_j) / (|film_i| · |mkof_j|)    (1)

where film_i is the vector representing the i-th subtitle of the film and mkof_j is the vector representing the j-th subtitle from the making-of. The same measure was successfully used in a related study for screenplay alignment [12].
The alignment method will first compute a similarity matrix for all the subtitles. Considering that each row represents a film subtitle and each column a making-of subtitle, a maximum value is found for each row and the corresponding subtitle index is memorized in a list. At this stage every subtitle from the film will have an assignment to a making-of subtitle. However, most of these links will show a very low similarity. It was found empirically that a threshold for similarity above 0.80 will include a significant number of links, with a very low rate of wrong assignments (less than 2 per film).
Each subtitle assignment between the film and the making-of is considered here as an anchor. The subtitle time codes are the links between the results from the alignment algorithm and the video signal. Moreover, once asked by the film viewer, the player should be able to decide if there is no relevant scene in the making-of or, in the opposite case, where to start in the making-of timeline.

Figure 2: Improving the making-of film coverage. (Film and making-of timelines with aligned subtitle anchors; the interval Δt between two aligned film subtitles is split into Δt_after, Δt_nothing and Δt_before.)

Our post-processing module implements a simple strategy depicted in Figure 2. The time duration between two aligned film subtitles (Δt) is divided into two time segments, as in equation (2):

Δt_after = Δt_before = Δt / 2    (2)


Within the first segment (Δt_after), any making-of request Change playing time. Change playing time. will drive the viewer to the making-of instant of the previous aligned subtitle. Within the second segment (Δt_before), the Click “more about this scene” Film pauses. video stream will be switched to the making-off time instant of start Viewing Viewing the next aligned subtitle. However, time between two Film Making-of consecutive aligned subtitles in the film is sometimes very long including several different scenes (more than 20 sec.). Click to close the film window. Click to close the making-of window. This will confuse the viewer that can be switched to some Restart the film. making-of scenes which were only related to film scenes end already forgotten (Δt_after is excessively long) or not yet seen (Δt_before is excessively long). In order to overcome this problem, both Δt_after and Δt_before were limited to a Figure 4: State transition network. maximum of 10 seconds. The remaining film between these two segments (Δt_nothing) will not be allowed to redirect to Both players and the single input button are controlled by the making-of video. the interface manager module. This module reads a pre- This simple strategy is expected to be improved in the near computed table of time assignments, every time the user future, integrating more data into the decisions about the requests to see more about the current scene of the film. A boundaries for the above mentioned segments. snapshot of the transition from the film to the making-of using A system like the one described above can hardly detect the presented prototype is shown in Figure 5. every semantic relation between the film and the making-of. Some of the aimed relations can even only be explained by Film aesthetic reasons which can be very difficult to be automatically detected. In the near future, we will improve our system to include advanced knowledge models. Nevertheless, we expect that manual corrections will still be necessary. Making-Of However, in order to demonstrate the performance of our automatic methods, all the results presented in this paper were directly collected from these methods without any direct human intervention. User Interface This section describes a prototype implemented for the demonstration of our new multimedia object, for testing the new developed methods and for usability tests. Figure 3 describes the components of the user interface. Figure 5: Snapshot of the transition from the film to the making-of. Subtitle Time Code alignments interface manager Film Making-of ... … Results 57 16 ... 17 89 ...... 28 Figures 6 and 7 display the distribution of aligned subtitles, 128 29 ...... 140 233 along normalized time, for two English films, considering 141 234 ... 235 subtitles in three languages: English (EN), Portuguese (PT), 178 ... film making-of 179 272 ... 278 and Spanish (SP). These results were obtained with the 199 ... More about this scene ... 281 similarity threshold set to 0.80. 242 ...... 321 245 ... Red Dragon Aligment 246 353 247 354 248 357 100 249 358 ...... 80 EN Figure 3: User interface: playing the film and making- 60 PT

of video streams. Film SP 40

The user scenario encompasses a film session using our 20 prototype as a common video player. However, this film 0 viewer has an additional button named “more about this 0 20406080100 Making-of scene”. This button, once pressed, will pause the current film and start a new video: the making-of, in a second video player. Figure 6: Aligned subtitles for “Red Dragon”. However, instead of starting from the making-of beginning, the video will start in an instant related to the scene just The alignments for “Red Dragon” in Figure 6 are almost previously seen in the film. The user can later close the coincident for the three languages. These indicates that the making-of player and return to the paused film for continuing main dialogue lines, that are similar in film and making-of, are the previous film session. The state transition network of these well identified and aligned. The exceptions found reflect one events can be found in Figure 4. of two situations. In the first one, English subtitles are, in this case, mainly aimed at the hearing impaired thus displaying all sounds as subtitles (e.g. “[Suspenseful instrumental music]”).


This fact adds around 200 pseudo-subtitles to the English A prototype was developed implementing a new film player version which reflects on the chart as alignment “delays” when for those who want to know more about the stories behind the comparing to the two Iberian languages. In the second scenes. In the short term, we will run formal usability tests situation, other non coincident alignments reflect small with questionnaires through the internet. Other future differences in translation or dialogue lines that do not appear developments consider further linguistic refinements and the in one of the languages. use of knowledge models for establishing additional anchors The alignment of “Red Dragon” reveals the references in the between the film and the making-of. making-of interviews to scenes of high tension and action, mainly concentrated in the following times of the film: 13- 15% – Jack Crawford talks with Will Graham about Hannibal Acknowledgments Lecter, 30-35% and 70-76% – Will visits Lecter at prison, 60- This work was partially supported by FCT through the 68% – Francis Dolarhyde kills Freddy Lounds, and 95-100% – Large-Scale Informatics Laboratory and Operations Research climax scene where Francis tries to kill Will’s family. Center – POCTI/ISFL/152 – from the University of Lisbon. The authors acknowledge the comments of three anonymous reviewers about a previous version of the paper.

Twister Aligment 100 References 80 [1] Armstrong, S.; Way, A.; Caffrey, C.; Flanagan, M.; Kenny, D. 60 EN and O’Hagan, M.; “Improving the quality of automated DVD PT subtitles via example-based machine translation”, Proc. of 40Film SP Translating and the Computer 28, London, 2006.

20 [2] Tiedemann. J., “Building a multilingual parallel subtitle corpus”, Proceedings of 17th CLIN, Leuven, Belgium, 2006 0 [3] Tiedemann, J., “Synchronizing Translated Movie Subtitles”, 0 20406080100 Making-of Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco, May 28-30, 2008. Figure 7: Aligned subtitles for “Twister”. [4] Lee, Jia-Hong; Cheng, Jia-Shin and Wu, Mei-Yi, “A Multimedia Tool for English Learning Using Text Retrieval Techniques on For the film “Twister”, Figure 7, there are less alignment DVD Subtitles”, Intelligent Tutoring Systems 2006: 701-703, points than for “Red Dragon”, since the making-of of 2006. “Twister” is mainly based on interviews and backstage [5] Li, Y., Lee, S.H., Yeh, C.-H., Kuo, C.-C.J.: Techniques for Movie Content Analysis and Skimming: Tutorial and Overview images, containing few references to film scenes. on Video Abstraction Techniques. IEEE Signal Processing Nevertheless, similarly to the “Red Dragon” results, the Magazine. 23, 79-89, 2006. alignments are almost coincident for the different languages. [6] Smeaton A., Lehane, B., O’Conor N., Brady, C., Craig, The exceptions are mainly explained by: small translation G.,:Automatically Selecting Shots for Action Movie Trailers, discrepancies between the film and the making-of; differences MIR’06, 231-238, 2006. in the subtitles due to different amounts of text - number of [7] Guo, G., Li, S.Z.: Content-Based Audio Classification and words and letters which has to be divided across a different Retrieval by Support Vector Machines, In IEEE Transactions on number of subtitles; and textual signs in the film that were also Neural Networks, 14/1, 209-215, 2003. [8] Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: translated into the Iberian subtitles, thus increasing the number Content-based Image Retrieval: the End of the Early Days, In of non-native language subtitles. The references to film IEEE Transactions Pattern Analysis and Machine Intelligence, scenes, in the of making-of time, are mainly concerned with 1349-1380, 2000. setting Dorothy, the machine for twister analysis – 18%, in [9] Hammoud, R., Mohr, R.: Interactive Tools for Constructing and Helen Hunt interview; the divorce of the main characters – 20- Browsing Structures for Movie Films, In: MULTIMEDIA '00 - 22%, in Bill Paxton interview; approaching the storm – 58- eighth ACM International Conference on Multimedia, 497-498, 62%, in comments about Jan De Bont; and comments about ACM, New York, 2000. Helen Hunt and Bill Paxton entering the storm – around 78%. [10] Teixeira, C., “Multimedia Enriched Digital Books”, ACM 17th Conference on Information and Knowledge Management The discussed results show clearly that the proposed system proceedings, ACM, BooksOnline'08 Workshop on "Setting the is almost language independent and can be used for films Roadmap for Research Advances in Large Digital Book subtitled in different languages. Repositories", Napa Valley, CA, USA, 2008. [11] Brereton, P., Editorial: The consumption and use of DVDs and their add-ons. Convergence: The International Journal of Conclusions Research into New Media Technologies, 13 (2). ISSN 1748- 7382, 2007. We presented results for two known English spoken films. [12] Teixeira, C., Respício, A.: See, Hear or Read the Film. In Ma L., Additional films have been tested. Preliminary results reveal Rauterberg M. and Nakatsu R. (eds.) 
Entertainment Computing – several interesting findings: how film shots are added into the ICEC2007, LNCS, vol. 4740, 271-281. Springer, Berlin/Heidelberg, 2007. making-of; an informal test, with a set of five users, found the prototype accurate and useful. Our research has pursuit a multilingual approach, only requiring a few specific language lexica. We tested English, Portuguese and Spanish subtitles versions for the same films. The quality and the consistency of the subtitles have revealed to be more important than the used language. For instance, using subtitles from different writers, having different regional origins, can cause severe performance degradation. These findings reveal that the proposed system is almost language independent.


Resources & Tools



A Catalan Broadcast Conversational Speech Database

Henrik Schulz, Jose´ A. R. Fonollosa

Department of Signal Theory and Communications Technical University of Catalunya (UPC), Barcelona, Spain [email protected], [email protected]

Abstract usually interrupted once by a commercial, the final recordings Data driven methods in speech and linguistic research, and do not contain them. system develoment require appropriate speech databases. A The debate generally comprises spontaneous speech, new Catalan speech database has been developed with a partic- whereas the introduction of topic and participants features ular emphasis on broadcast conversational speech. The article planned speech. Both differ significantly acoustically and lin- describes origin and nature of the broadcasts and its acoustic guistically. environment. Annotation and transcription provide statistics on Catalan - mainly spoken in Catalonia - exhibits substantial specific phenomena of exhibited speech, speaker characteristics dialectual differences, dividing the language into an eastern and and acoustic events. It concludes with perspective uses and lim- western group. The eastern dialect includes Northern Catalan itations. (French Catalonia), Central Catalan (the eastern part of Catalo- Index Terms: Catalan, audio video speech database, broadcast nia) and Balearic. The western dialect includes North-western conversation, literal transcription, spontaneous speech Catalan and Valencian (south-west Catalonia)[1]. The origin of the recorded Catalan participants suggests a predominant Central Catalan dialect. Furthermore, the rare se- 1. Introduction lection of alternative pronunciations other than those equal to State-of-the-art empirical and statistical data driven methods in Central Catalan during the alignment of acoustic data while speech processing depend to a large extend on sufficient and building ASR acoustic models supports this impression. appropriate sample data, often covering a particular domain, Although TV3 is primarily a Catalan television channel, the acoustic environment or recording channel. The tremendous recorded broadcasts contain a surprisingly high proportion of variability of phenomena in spoken language and those across Spanish speaking participants. The assessment of the database speakers and speaking styles often require large amounts of data annotation will therefore emphasis on both languages. for robust system development. Generally, an appropriate cov- erage is often achieved across a large variety of speakers, di- 2. Data and Annotation alects, gender and age. But, for example, in terms of speech recognition, most system models provide best results for a spe- The 32 Agora` television broadcasts are originally available as cific task or domain that limit the acoustic and linguistic vari- video files, those audio channels were extracted and stored as ability, leading to task specific collections. Likewise for provid- 16 bit PCM, mono with its original 32 kHz sampling rate. In ing evidence for a particular acoustic or linguistic phenomenon order to facilitate transcription and further processing the audio in question, the data not only need to be sufficient, but also need data have been downsampled to 16 kHz. Each broadcast was to comply to secondary conditions. partitioned into two files at the time of the original commer- Speech databases for the purpose of speech and linguistic cial. The annotation procedure followed a 3-pass approach: a research, and for system development have been collected for markup of initial segmentation, a full literal transcription and Catalan as well as for other languages. 
Nevertheless, and in speaker annotation, and finally a verification and refinement of particular to study and develop technologies for regional lan- boundaries by another transcriber. The transcription was carried guages, the available data seem still insufficient and demand out and formated according to [2] and leads to an XML-like file further efforts. format. The article describes a novel Catalan broadcast conversa- Several conditions were qualified regarding speaking mode: tional speech database. Its original video recordings were sup- planned or spontaneous; background : music, noise and plied directly by the Catalan TV3 broadcast station and contain speech; and channel: studio, telephone, outside (typically pub- 32 live television broadcasts of the Agora` (Greek: a place of po- lic places). litical and juridical assembly) programme, that are debates on The original broadcast conversations bear a frequent selected topics from politics, economy or society. speaker change, contain segments of music and speaker overlap. Each broadcast follows a repeating format: the anchorman Segments of speaking rate acceleration can be observed. Lin- is initially presenting the current topic, followed by an intro- guistically, the conversations possess frequent repetition and re- duction of invited participants featuring background music. The pairs. Furthermore a measurable amount of mispronunciations, main part features the debate between the invited participants, incompleteness and filled pauses. usually public figures. During the debate, public opinions are Every speaker turn is decomposed into segments by placing added, either as e-mails or faxes read by the anchorman, or tele- breakpoints at grammatical sentence boundaries. Spontaneous phone recordings played back again featuring background mu- utterances or incomplete phrases are separated at audible natural sic. Between these general elements of the broadcast format boundaries, i.e. at pauses, breaths, etc. and at interruptions of short terms of music. Although the original TV broadcasts are the sentence flow. The orthographic literal transcription is case

sensitive and contains applicable punctuation marks.

3. Acoustic Environment

Primarily, the broadcasts are studio recordings. As the format contains public opinions provided via telephone, several segments have been identified accordingly. Furthermore, these segments feature background music. The speech component of these segments is therefore bandlimited to frequencies from 300 Hz to 3.4 kHz, while the background music employs the full bandwidth. A few segments originate from recordings of public places, again gathering public opinions that feature background music. Tables 1 and 2 display the durations of the given recording and background conditions and follow the distinction of planned and spontaneous speech. The categories 'Music' and 'Speech' denote the respective background condition, whereas 'None' refers to clean speech.

            Planned [h]         Spontaneous [h]
            None     Music      None     Speech    Music
Studio      0:30     1:50       20:37    2:51      0:59
Telephone   -        -          -        -         0:52
Outside     0:03     0:02       0:30     0:06      0:28

Table 1: Duration breakdown regarding recording environment and background conditions for Catalan segments

Since planned speech originates rather from the anchorman than from invited speakers, there are no segments of planned speech among the Spanish segments in Table 2. The share of background speech, i.e. segments possessing events of non-transcribed overlapping speakers, is rather large. It denotes solely background speech of speakers that could not be identified for their term. The total number of such events amounts to 2843. A segment featuring such an event usually extends to a longer duration than the actual event itself. The phenomenon also indicates a high spontaneity and naturalness of the discussion. There are a few segments of pure music and pure silence, as well as of a combination of music and background speech, of minor duration, which have not been tabulated.

            Spontaneous [h]
            None     Speech     Music
Studio      7:02     0:41       -
Telephone   -        -          0:05
Outside     0:01     -          -

Table 2: Duration breakdown regarding recording environment and background conditions for Spanish segments

4. Speakers

The set of speakers for each broadcast is formed by the invited participants and the anchorman, and is extended where applicable by people offering a short opinion on the topic via telephone. Each transcription document encloses a set of speaker descriptions, each with a unique identifier, full name (where identifiable), gender, and primary language. In case a speaker converses in multiple languages, the record is duplicated with a new identifier. As speaker overlap is a typical phenomenon during the debates, a combination of overlapping speakers receives an additional record solely with its identifier. The speaker distribution for Catalan and Spanish segments is shown in Tables 3 and 4, respectively. The unknown gender category refers to segments of identifiable overlapping speakers but does not consider background speech (see further Section 6). Likewise, contributions of speakers received via telephone carry gender and language in their speaker record.

Gender      # Speakers   Duration [h]   # Segments
male        441          24:33          25335
female      113          3:51           3848
unknown     317          0:40           623

Table 3: Catalan Speaker Distribution

Gender      # Speakers   Duration [h]   # Segments
male        83           6:19           6869
female      29           1:20           1327
unknown     45           0:13           174

Table 4: Spanish Speaker Distribution

The speaker distributions indicate a clear gender imbalance in the data. The segments of unknown speaker are short and contribute only a small amount to the total duration.

5. Speech Events

Speech events comprise ordinary spoken words, incomplete words and hesitations. Table 5 shows the amount of speech events for Catalan and Spanish segments, whereas the total number of elements denotes the sum of all speech and non-speech events. The ratios in parentheses refer to the total number of elements. Hesitations, i.e. single-syllable speech disfluencies not consistent with the grammatical structure of a sentence, and incomplete words emphasize the spontaneity of the language.

# elements        Catalan           Spanish
total             401701            95575
running words     373886 (93%)      88614 (92%)
hesitations       6007 (1.5%)       1703 (1.7%)
incompletes       2290 (0.6%)       530 (0.6%)
words             21908             8854

Table 5: Speech Events

Utterances primarily spoken in Catalan also contain 2600 Spanish words and, vice versa, primarily Spanish utterances contain 142 Catalan words. For the sake of completeness, the occurrences of words of other languages across all segments are: English 350, Arabic 132, French 62, German 35, Mandarin 23, Portuguese 21 and Italian 10. Words with 'unknown' spellings add up to 1963, those featuring poor intelligibility amount to 11, and those pronounced incorrectly to 1392.
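The percentages in Table 5 can be recomputed directly from the raw counts. The short sketch below (plain Python, with the counts copied from the table) reproduces the ratios of running words, hesitations and incomplete words to the total number of elements; small deviations from the published figures come down to rounding.

    # Recompute the ratios reported in Table 5 from the raw counts.
    counts = {
        "Catalan": {"total": 401701, "running words": 373886,
                    "hesitations": 6007, "incompletes": 2290},
        "Spanish": {"total": 95575, "running words": 88614,
                    "hesitations": 1703, "incompletes": 530},
    }

    for language, c in counts.items():
        total = c["total"]
        for event in ("running words", "hesitations", "incompletes"):
            ratio = 100.0 * c[event] / total
            print(f"{language:8s} {event:14s} {c[event]:7d} ({ratio:.1f}%)")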


6. Non-Speech Events The speaker annotation and frequent speaker change facili- tate studies with emphasis on identification and tracking across The database annotation distinguishes several non-speech a particular broadcast or a collection. Moreover, gender infor- events as follows: mation provided for each speaker may facilitate features and • throat - coughing, clear one’s throat models for the distinction of speakers aside of a pure gender • breath - audible breath noise detection task. From the automatic speech recognition point of view, gender specific acoustic models can provide an increase • voice - untranscribed overlapped or background speech of allophone discriminance. • laugh - laughing As the database provides overlapped speech to a larger ex- • artic - non verbal articulatory noise of the speaker, e.g. tend it may also become subject to overlapped speaker detection smack, swallowing, etc. and as a consequence to speech recognition. The language annotation provides sufficient references for • > 1 pause - silence, i.e. long speaker pause ( second) language identification and tracking in a bilingual environment, • sound - non-articulatory harmonic noises, e.g. short mu- e.g. in the autonomous regions in Spain, where aside of pre- sic parts, beeps, other sound effects, etc. ferred use of the regional language, Spanish is well spoken and • rustle - rustling such as with paper or microphone rustle understood. Although the database provides a rich annotation, it needs • noise - any other noise not particularly identified above, to be noted, that a rather precise segmentation of acoustic inharmonic noise, non-articulatory events like, e.g. events, i.e. the provision of synchronization marker, is not avail- knocking, babble of voices, machines, etc. able although desirable for some of the above mentioned re- Table 6 tabulates the number of occurances of above listed search topics. events (needless to say, although the table refers to Catalan and The gender misbalance as shown in table 3 and 4 should be Spanish segments, the events are not anticipated to be different considered when deriving conclusions for particular phenomena in nature). or methods.
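For many uses of the corpus, for instance language-model training on the literal transcriptions, the non-speech events listed above need to be stripped or mapped. A minimal sketch of such a filter is given below; the tag strings are illustrative placeholders based on the event inventory of this section, not the literal markup used in the transcription files.

    # Illustrative inventory of non-speech event tags (names are placeholders
    # derived from the inventory in Section 6, not the actual file markup).
    NON_SPEECH_EVENTS = {
        "throat",  # coughing, clearing one's throat
        "breath",  # audible breath noise
        "voice",   # untranscribed overlapped or background speech
        "laugh",   # laughing
        "artic",   # non-verbal articulatory noise (smack, swallowing, ...)
        "pause",   # silence, i.e. long speaker pause (> 1 second)
        "sound",   # non-articulatory harmonic noises (short music, beeps, ...)
        "rustle",  # paper or microphone rustle
        "noise",   # any other inharmonic, non-articulatory noise
    }

    def strip_non_speech(tokens):
        """Remove non-speech event tokens such as '[breath]' from a token list."""
        return [t for t in tokens
                if not (t.startswith("[") and t.strip("[]") in NON_SPEECH_EVENTS)]

    print(strip_non_speech(["bon", "dia", "[breath]", "a", "tothom", "[laugh]"]))
    # -> ['bon', 'dia', 'a', 'tothom']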

8. Conclusions # events Catalan Spanish breath 16777 (4.1%) 3992 (4.1%) In order to facilitate speech research and development activities throat 357 (.08%) 97 (.10%) in Catalan, a broadcast conversational speech database has been ` rustle 337 (.08%) 81 (.08%) derived from recordings of the Catalan TV3 Agora programme. voice 1352 (.33%) 270 (.28%) The recordings were segmented according to its speakers, dis- tinct environment and channel conditions, and received literal laugh 163 (.04%) 38 (.04%) transcriptions as well as detailed annotations with respect to pause 152 (.03%) 76 (.08%) acoustic events, speaking mode and speakers. The article de- sound 311 (.07%) 36 (.03%) scribed major aspects of the database material and its annota- artic 1224 (.30%) 319 (.33%) tion. Furthermore it provides qualitative measures to essential noise 1135 (.28%) 349 (.36%) phenomena and elements of annotation. It finally identified and discussed major subjects of its application. However, it needs Table 6: Non-Speech Events to be noted, that in particular for speech recognition system de- velopment, the amount of available Catalan speech data in the target domain is not competitive with other languages so far. As further collection efforts are recommended, a new collection of 7. Discussion general broadcast news data is in progress. The above described new Catalan (-Spanish) speech data con- tain a wide range of phenomena of natural language. With the 9. Acknowledgements spontaneous nature of the broadcast conversations, the data fea- The database development was funded by the Generalitat de ture characteristic phenomena particularly occuring in sponta- Catalunya in the framework of the TECNOPARLA project [3]. neous speech, e.g. non-verbal events, disfluencies, incomplete- The authors would like to thank Mateu Aguil´oBosch for the co- ness, repetition and repair, and therefore encourage to be subject ordination efforts and Verbio Technologies S.L for performing of a profound linguistic and acoustic analysis. The distinction the transcription. into planned and spontaneous speech may motivate further in- vestigation. Aside from the fundamental research, the database supports the development of speech processing systems, in par- 10. References ticular speech recognition, acoustic event detection, as well as [1] Max W. Wheeler, The Phonology of Catalan. Oxford,UK: language and speaker identification and tracking. Oxford University Press, 2005. Speaker and music overlap opens challenges to automatic literal transcription, and speaker and language identification [2] C. Barras, E. Geoffrois, Z. Wu, and M. Liberman, “Tran- tasks. A frequent speaker change and speaking rate acceleration scriber: a free tool for segmenting, labeling and transcrib- induce grammatically incomplete sentences and orthographi- ing speech,” in First International Conference on Language cally incomplete words. The speaking rate acceleration in parts Resources and Evaluation (LREC), 1998, pp. 1373–1376. of the debate also accounts for a higher dynamic in allophone [3] H. Schulz, M. R. Costa-Juss`a, and J. A. R. Fonollosa, durations. Both may be subject to benchmarks of existing ap- “TECNOPARLA - Speech Technologies for Catalan and proaches in language and acoustic modelling as well as oppor- its Application to Speech-to-Speech Translation,” in Proce- tunity for refinement. samiento del Lenguaje Natural, 2008.



An XML Resource Definition for Spoken Document Retrieval

Contributors in Alphabetical Order
Germán Bordel, Arantza Casillas, Mikel Penagarikano, Luis J. Rodríguez-Fuentes, Amparo Varona
Grupo de Trabajo en Tecnologías Software (GTTS), Universidad del País Vasco
{german.bordel, arantza.casillas, mikel.penagarikano, luisjavier.rodriguez, amparo.varona}@ehu.es

Abstract the transcription and annotation of speech signals which sup- ports its own format, not specifically suited for spoken docu- In this paper, an XML resource definition is presented fitting ment retrieval applications. The MATE project [8] aimed to in with the architecture of a multilingual (Spanish, English, facilitate re-using language resources by addressing the prob- Basque) spoken document retrieval system. The XML resource lems of creating, acquiring, and maintaining language cor- not only stores all the information extracted from the audio pora. MATE designed a stand-off XML architecture to repre- signal, but also adds the structure required to create an index sent the information of spoken dialogue corpora at multiple lev- database and retrieve information according to various crite- els: prosody, morpho-syntax, co-reference, dialogue acts, com- ria. The XML resource is based on the concept of segment and municative difficulties and inter-level interaction. VoiceXML provides generic but powerful mechanisms to characterize seg- [9] was designed for creating audio dialogs that feature synthe- ments and group segments into sections. Audio and video files sized speech, digitized audio, recognition of spoken and DTMF described through this XML resource can be easily exploited in key input, recording of spoken input, telephony, and mixed ini- other tasks, such as topic tracking, speaker diarization, etc. tiative conversations. Its major goal was to bring the advantages 1. Introduction of web-based development and content delivery to interactive voice response applications. Nowadays, finding multimedia (audio and video) resources is becoming as important as finding text resources. However, The work presented in this paper involves designing an search engines are usually limited to adjacent texts (hand sup- XML resource to store information from various knowledge plied transcripts or close captions) to index and classify multi- sources, all of them relevant to the task of spoken document media documents. These texts are just short descriptions, shal- retrieval: URL, segmentation, audio type, language, speaker, low categorizations or partial transcriptions of the contents, so speech transcription with spontaneous speech events, morpho- the resulting index is very coarse and the search cannot focus syntactic analysis, etc. A new XML resource is defined because on specific items. none of the existing ones (Transcriber, MATE, VoiceXML, etc.) A key advantage can be taken from using Automatic Speech covers completely the requirements of our SDR system. With Recognition (ASR) and Natural Language Processing (NLP) the aim to produce test data, we have developed a tool which technologies, since they allow to transcribe and enrich spoken translates Transcriber annotations into XML resource descrip- documents, thus leading to more accurate indexes and more fo- tors which can be processed by our SDR system. cused search results [1]. Some systems have been already devel- The rest of the paper is organized as follows: section 2 oped in this way, most of them dealing with spoken documents briefly outlines the main features of our SDR system; section in English, such as SpeechBot [2] (an experimental web-based 3 describes the elements and attributes of the XML resource; multimedia search tool from HP Labs which was withdrawn in finally, conclusions are given in section 4. November 2005) and SpeechFind [3] (a spoken document re- trieval system developed at the University of Texas, currently 2. 
The system architecture used to transcribe USA historic recordings from the last 110 The SDR system fetches audio and video resources from inter- years). In the last years, some systems have been developed net or from local repositories, processing the audio signals and which deal with other languages, such as the NTT system for creating a collection of XML resource descriptors with infor- Japanese [4] and the ASEKS system for Chinese [5]. mation at various levels of knowledge. The SDR system also We have designed a multilingual (Spanish, English, processes user queries and searches an index database to re- Basque) Spoken Document Retrieval (SDR) system that works trieve those resources matching the queries. The index database with both audio and video resources [6]. In the case of video is periodically updated from the collection of XML resource de- resources, only the audio signal is processed. The system con- scriptors. A web interface allows users to formulate queries and sists of a sequence of processing agents, from the crawler that process the answers of the SDR system. The SDR system ar- fetches the audio/video resource to the morpho-syntactic ana- chitecture consists of four key elements (see Figure 1): (1) the lyzer that adds information about word function and structure. crawler/downloader; (2) the audio processing module; (3) the The input to each agent is a resource descriptor containing all information retrieval module; and (4) the user interface. the information added by previous agents. The output is the The crawler/downloader fetches multimedia resources same resource enriched with information specific to the current and creates the corresponding XML resource descriptors. Two agent. different kinds of resources are considered: multimedia files ob- XML seems to be the best option to represent resource de- tained from internet and multimedia files stored at local reposi- scriptors. XML document definitions related to speech have tories. Upon fetching, resources are uniquely identified by their been previously proposed in several projects such as Tran- SHA1 hash, which allows to discard copies of the same resource scriber, MATE and VoiceXML. Transcriber [7] is a tool for at different locations and avoid redundant processing. Audio
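The deduplication step mentioned above can be illustrated with a few lines of code: the sketch below fingerprints a fetched file with its SHA1 hash and skips resources whose digest has already been seen. The function and the surrounding bookkeeping are hypothetical, not taken from the actual crawler/downloader.

    import hashlib

    def sha1_of_file(path, chunk_size=1 << 20):
        """Return the SHA1 hex digest of a file, read in chunks."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    seen = {}  # SHA1 digest -> first location where the resource was found

    def register(path, url):
        """Register a fetched resource; return False if it is a duplicate."""
        digest = sha1_of_file(path)
        if digest in seen:
            print(f"duplicate of {seen[digest]}, skipping {url}")
            return False
        seen[digest] = url
        return True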


items in the index database, and ranks them according to some predefined metric, which takes into account various measures. The user interface is a web application based on Java Server Pages (JSP) accepting queries, sending them to the search engine, and receiving the list of matching items. This way, users can interact with the SDR system by using a standard web browser. The web application composes and presents suc- cessive HTML pages showing the list of matching items, with information about the resource name, location and size, seg- ment boundaries (time stamps), links (thumbnails) to cached copies of the original multimedia resources and transcription excerpts which link to the full recognized transcriptions. 3. The XML document structure The XML document structure was designed by means of XML Schema [10], taking into account the kind of data to be in- dexed and retrieved and the modules operating on them. The Figure 1: The spoken document retrieval system is organized XML definition [11] is based on the concept of segment and around a collection of XML resource descriptors and consists of provides generic but powerful mechanisms to: (a) characterize four key elements: a crawler/downloader, a sequence of audio segments, and (b) group segments into sections. Each segment processing modules, an information retrieval module (including is characterized by a set of features and consists of a sequence an index database) and a web interface. of words, multi-words and acoustic events. Words and multi- words may also include phonetic, lexical and morpho-syntactic signals in PCM format are extracted from multimedia files. For information. The root element, named , consists presentation purposes, cached copies of the original multimedia of a sequence of four consecutive elements: , resources are saved in Flash video format. , and , which are de- Audio processing is performed in several steps. At each scribed in detail in the following paragraphs. step, the XML resource descriptor is enriched with informa- < > < > tion specific to a knowledge level. The identification of speech 3.1. The processors and source elements and non-speech regions in spoken documents is a key step: if These two mandatory elements include metadata describing non-speech segments are excluded from recognition, not only where the audio and video resources were taken from and how computation time is saved, but also better transcriptions are ob- they were processed, and key features that allow to play their tained. So, before applying the ASR engine, the audio input is contents (see Example 1). The element consists divided into acoustically homogeneous regions called segments, of a sequence of elements, each containing infor- which are further classified as speech and non-speech segments. mation about one of the agents that enriched the XML resource. Additionally, language identification is performed for speech The element includes attributes specifying resource segments. Identifying the target language is critical for the ASR location and format-related features: URL, URI, cached file, engine to use adequate acoustic and syntactic models, and for cached audio, creation/download date and time, file size, au- the NLP module to apply the linguistic knowledge specific to dio length (in seconds), audio format, sampling rate, number of that language. In particular, our SDR system deals with three channels, coding (alaw, mulaw, linear) and resolution (bytes per languages: Spanish, English and Basque. 
Optionally, speaker sample: 8, 16, 32, 48). identification may be performed. Speaker identification has al- 3.2. The element ways a positive impact on the accuracy of ASR, since it allows to apply speaker adaptation techniques. Once the speech seg- Audio segmentation can be done keeping in mind different ments, the language and (optionally) the speaker are identified, objectives: discriminating speech from non-speech segments, the ASR engine is applied to get the most likely word transcrip- identifying speaker turns, etc. So, our XML structure allows for tion according to previously estimated acoustic and language the existence of one or more segmentations, each uniquely iden- models. Finally, lemmatization and morpho-syntactic analy- tified within the same XML resource. A segmentation consists sis of word sequences are applied, which allows to know the of a sequence of one or more segments. lemma, number, gender and case of each word. This informa- 3.2.1. The element. tion helps to index and search inflected forms and also to dis- This element is characterized by: (a) three mandatory attributes: ambiguate between homonyms. id (identifier), offset (start point, in seconds) and length (dura- The information retrieval module takes the collection of tion, in seconds); (b) a sequence of zero or more instances of enriched XML resource descriptors as input to create a hier- the element; and (c) a sequence of at least one of the archized structure of word references which makes the search elements , and , in any order process easier and faster. The index structure, which contains (see Example 2). location information for each word in the automatically gen- erated transcriptions, is dynamically updated each time a new 3.2.2. The element. XML resource is added to the system. The information re- This element provides a generic and flexible way to characterize trieval process begins when the user formulates a query. NLP segments. It has three attributes, all of them mandatory: name tools are then applied to preprocess the query, yielding a list (feature name), value (feature value), likelihood (the likelihood of query items (relevant words, topics or even speakers) which of the feature value, a real value in the range [0,1]) and pro- are searched within the index structure. The system retrieves cessorName (name or description of the agent extracting that those segments (or sequences of segments) matching the query feature from the audio signal). In this way, an arbitrary number


Example 1: The element includes information about the modules that have processed the multimedia resource. The element stores metadata (location, format, size, etc.) useful for playing the audio and video contents. http://gtts.ehu.es/Iberian_SLTech_2009_Example.avi ...... of features can be extracted and assigned to audio segments: gory, which can take ten values: six noises (breath, puff, cough, speaker, language, topic, etc. (see Fig. 2). Note that segments laugh, click and saturation), three filled pauses (e long, a long could be also characterized by defining the corresponding at- and m long) and the default value other. Three optional at- tributes (e.g. speaker id, language id, topic, etc.), but such an tributes allow to specify the start time (offset), the duration approach has two disadvantages: (1) the number and type of at- (length) and the acoustic confidence (confidence; default value: tributes is fixed, and (2) it does not provide any mechanism to 1.0) of the event. specify how each feature was extracted. 3.3. The element < > 3.2.3. The word element. Finally, the element provides a generic and flex- The element is designed to represent all the informa- ible way to group segments into sections, according to any tion that can be extracted from a recognized word, through eight given criterion, specified by the attribute criterion. Since var- attributes: transcription (ortographic transcription), pronuncia- ious segmentations might be available, the attribute segmenta- tion (phonetic transcription), confidence (acoustic confidence, tion id tells what segmentation the segments are taken from. a real value in the range [0,1]), offset (start time, in seconds), Besides these attributes, contains a sequence of length (duration, in seconds), realization (taking three possible one or more

elements. There can be various sec- values: regular, mispronounced and cutoff), lemma (canonical tioning elements in an XML resource, corresponding to differ- form of the lexeme) and case (grammatical function). Only the ent clustering criteria: topic, language, speaker, etc. (see Fig. ortographic transcription is mandatory. However, three other at- 2). Moreover, since there can also be n > 1 segmentations, tributes: pronunciation, offset and length can be easily obtained the same criterion can be applied n times to different segmen- as a by-product from the recognizer. The recognizer should tations, yielding n different elements. This is an also be able to provide a confidence value for each hypothe- efficient way to access the same information from various points sized word, since confidence could play a fundamental role in of view. ranking the search results. The default confidence is 1.0. De- 3.3.1. The
element. tecting cutoff or mispronounced words is not so easy. In fact, the attribute realization is there just for the case we were able to Each
element is uniquely identified by the attribute detect non-regular word realizations reliably. Its default value id (mandatory). The attributes seg start and seg end (also is ”regular”. Finally, lemma and case, extracted by the NLP mandatory) point to the first and last segments in the section, module (which takes into account both word transcriptions and respectively. Optionally, the start time and the duration can be context), play an important role in normalizing documents and specified by the attributes offset and length. For any given cri- queries before searching for matching segments. terion, a set of classes can be defined, segments tagged with the most likely class (by means of elements), and sec- 3.2.4. The element. tions computed and characterized through class likelihoods. So, The element allows us to handle expressions besides the above mentioned attributes, the
element made up of several words, whose meaning cannot be derived contains a sequence of elements. from the meanings of the member words. Each multiword con- 3.3.2. The element. sists of a sequence of two or more elements, plus ad- ditional information stored in five optional attributes. Three of The element has two attributes: descriptor (class de- them: confidence, offset and length can be derived from mem- scriptor) and likelihood (class likelihood). It tells how well any ber words. The other two: lemma and case, are extracted by given section matches a predefined class. NLP tools. 4. Conclusion < > 3.2.5. The event element. In this paper, an XML resource has been presented fitting in The element may represent two kinds of phenomena: with the architecture of a multilingual spoken document re- (1) those related to spontaneous speech and (2) those coming trieval system. The XML resource not only stores informa- from external non-linguistic sources (environmental/channel tion extracted from the audio signal (segmentation, transcrip- noises). Two mandatory attributes are defined to specify the tion, etc.) but also adds structure which helps to create an in- type of event: category, which can take four possible values: dex database and to retrieve information according to various noise, filled pause, silent pause and OOV word; and subcate- criteria (keywords, topic, speaker, language, etc.). Around a


Example 2: The element consists of a sequence of features followed by a sequence of words, multiwords and/or events.
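As a concrete illustration of the structure summarized in Example 2, the sketch below assembles a small segment descriptor with Python's ElementTree. The element and attribute names follow the textual description in Section 3.2 (segment, feature, word, event and their attributes); they are illustrative reconstructions, not a verbatim excerpt of the ERD schema [11].

    import xml.etree.ElementTree as ET

    # Build a small segment descriptor with the elements and attributes
    # described in Section 3.2 (names reconstructed from the text).
    seg = ET.Element("segment", id="seg_0001", offset="12.37", length="4.85")

    ET.SubElement(seg, "feature", name="language", value="es",
                  likelihood="0.93", processorName="lang-id")
    ET.SubElement(seg, "feature", name="speaker", value="spk_07",
                  likelihood="0.88", processorName="spk-id")

    ET.SubElement(seg, "word", transcription="buenos",
                  pronunciation="b w e n o s", confidence="0.97",
                  offset="12.40", length="0.31")
    ET.SubElement(seg, "word", transcription="días",
                  pronunciation="d i a s", confidence="0.95",
                  offset="12.71", length="0.28")
    ET.SubElement(seg, "event", category="noise", subcategory="breath",
                  offset="12.99", length="0.20")

    print(ET.tostring(seg, encoding="unicode"))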

Figure 2: The element consists of a sequence of consecutive segments defined according to their acoustic contents (non-speech segments are shaded darker). Each segment can be characterized by an arbitrary number of features. For instance, the first segment in the example above is assigned the language L2, the speaker S1 and the topic T1, with likelihoods 0.7, 0.9 and 0.6, respectively. Segments can be grouped into sections according to various criteria. In the example above, three grouping criteria are showed: language, speaker and topic. Each section is characterized by a set of class likelihoods. For instance, in the grouping according to topics, the first section includes topics T1 and T2, with likelihoods 0.5 and 0.1, respectively. core element called , two generic elements are de- [3] J. H. Hansen, R. Huang, B. Zhou, M. Seadle, J. R. Deller, A. R. fined: , to characterize segments, and , Gurijala, M. Kurimo, and P. Angkititrakul, “SpeechFind: Ad- to group segments. The XML resource allows to handle an ar- vances in Spoken Document Retrieval for a National Gallery of bitrary number of segmentations and sectionings. At a lower the Spoken Word,” IEEE Transactions on Speech and Audio Pro- cessing, vol. 13, no. 5, pp. 712–730, 2005. level, each segment consists of an arbitrary number of words, multiwords and events, in any order. These latter elements ac- [4] K. Ohtsuki, K. Bessho, Y. Matsuo, S. Matsunaga, and Y. Hayashi, count for phenomena related to spontaneous speech and ex- “Automatic Multimedia Indexing,” IEEE Signal Processing Mag- azine, vol. 23, no. 2, pp. 69–78, March 2006. ternal non-linguistic sources. Audio and video files described through this XML resource can be easily exploited in other [5] R. Ye, Y. Yang, Z. Shan, Y. Liu, and S. Zhou, “ASEKS: A P2P Audio Search Engine Based on Keyword Spotting,” in ISM’06: tasks, such as topic tracking, speaker diarization, etc. Proceedings of the Eighth IEEE International Symposium on Mul- 5. Acknowledgements timedia, San Diego, CA, USA, 2006, pp. 615–620. [6] Hearch: http://gtts.ehu.es/Hearch/. This work has been jointly funded by the University of the Basque Country, under project INFO09/29, and the Spanish [7] C. Barras, E. Geoffrois, Z. Wu, and M. Liberman, “Transcriber: Development and use of a tool for assisting speech corpora pro- MICINN, under project TIN2009-07446. duction,” Speech Communication, vol. 33, no. 1-2, pp. 5–22, Jan- 6. References uary 2001. [8] The MATE Project: http://mate.nis.sdu.dk. [1] J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz, and A. Srivastava, “Speech and Language Technologies for Audio [9] Voice Extensible Markup Language (VoiceXML) Version 2.0, Indexing and Retrieval,” Proceedings of the IEEE, vol. 88, no. 8, http://www.w3.org/TR/voicexml20. pp. 1338–1353, 2000. [10] W3C XML Schema: http://www.w3.org/XML/Schema. [2] J. V. Thong, P. Moreno, B. Logan, B. Fidler, K. Maffey, and [11] The EHIZTARI Resource Definition (ERD): M. Moores, “SpeechBot: An Experimental Speech-Based Search http://gtts.ehu.es/Ehiztari/erd.xsd. Engine for Multimedia Content in the Web,” IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 88–96, 2002.


CORPOR SYSTEM: CORPORA OF THE PORTUGUESE LANGUAGE AS SPOKEN IN SÃO PAULO

Zilda Maria Zapparoli

Universidade de São Paulo, CNPq, FAPESP, Brasil [email protected]

Abstract Speech samples produced by informants were collected between 1972 and 1973, totaling 54 hours of recordings that This work briefly discusses the construction of the register dialogical interactions between documenter and 216 Orthographic and Phonetic Information Databases of the informants. Informants come from three cities in the state of Portuguese Language Spoken in the State of São Paulo (São São Paulo (São Paulo, Campinas and Itu), and are of both Paulo City, Campinas, Itu) in a Relational Database System. sexes, different ages and education level, and diverse Informatics resources were used to store, process and analyze socioeconomic backgrounds. In all, 432 dialogs were authentic oral language, and the Bases include orthographic recorded, since there were two kinds of dialogic interaction and phonetic information about the Portuguese language as with each informant: interviews and conversations. spoken in those areas of the state of São Paulo, organized, The Informants Distribution Diagram presents the listed and stored taking into account linguistic and distribution of the informants in the categories (variables and extralinguistic annotations. The results obtained can serve as a their sub-levels), offering various possibilities for contrastive valuable aid, for example, in studies requiring automatic studies. processing of the Portuguese language. Index Terms: Linguistic Informatics, data processing 2.2. Constitution of the corpus: speech transcription technologies in Linguistic studies, CorPor project, relational for computational treatment database system, databanks of phonetic and orthographic information about the Portuguese language as spoken in São This is an annotated electronic corpus with the necessary Paulo, electronic corpora of the Portuguese language as information to identify linguistic variables (such as words, spoken in São Paulo their position in the utterance as well as the position of the utterance in the discourse, orthographic and phonetic transcriptions, the kind of phonic juncture with the preceding 1. Introduction and the subsequent words) and extralinguistic variables (such as region of origin, sex, education, age, socioeconomic This study is interdisciplinary par excellence, as it combines background and the conditions in which the dialog was Linguistics and Informatics resources in the study of language produced). There is an exclusive code for each lexical item, in use, to store, process and analyze authentic oral language and about 180,000 occurrences. data. The work briefly discusses the construction of The way in which information is codified and structured Orthographic and Phonetic Information Databases (or endows the Bases with the functionality that will permit the DataBanks), Corpora and Lexicons of the Portuguese extraction of different corpora and lexicons. Language Spoken in the State of São Paulo (São Paulo City, Campinas and Itu). The data were originally collected for a 2.3 Databank management system doctorate thesis (1980) and the bases generated at the time for mainframe computers, as in [1], have been made compatible The Information Bases are stored in a database system − with current operating systems. 
Firebird − and the data structure follows the relational data The Bases are stored in the relational database format, model, so that the Bases contain linguistic and extralinguistic which offers researchers the possibility of easy, reliable, rapid, information about the various relations between the stored and fully automatic access, for consultation, recovery and data, in this case a collection of orthographic and phonetic exploration of extensive and varied data, in the study of data of the Portuguese language as spoken in the State of São various aspects of language − phonetic, phonological, lexical, Paulo. morphological, syntactic, textual and discursive. The environment used for programming was Delphi, This study, therefore, belongs in the field of Linguistic produced by Borland Software Corporation, which uses Informatics, drawing support from the various areas that share Pascal Language with object-oriented extensions ( Pascal the belief in the positive results of the interaction between Object), associated with resources of Structured Query Linguistics and Informatics − it makes use of Informatics Language (SQL) [2]. resources in Linguistics studies in order to build Information Besides research resources for access to the information on Bases that, in turn, can offer a contribution to the areas that the Bases, the System includes resources for text production use Linguistics in Computer Sciences, such as the automatic and for the edition of research results. For user access and processing of the Portuguese language. research by means of SQL language commands, the Orthographic-Phonetic Information Databases, as well as 2. Methodological procedures Corpora and Lexicons (Dictionaries) generated from them, integrate the CorPor System, with each one of them constituting a module with its own records and fields. 2.1. Structure of the oral language corpus

35 Proceedings of the I Iberian SLTech 2009 3. The CorPor system: main components annotation (column 2) and cumulative frequency of orthographic unit (column 1), as in the sample presented in Table 2. 3.1. Orthographic-Phonetic information databases of the Portuguese language as spoken in São Paulo 3.4. Inter-word coarticulation and phonetic liaison lexicon of the Portuguese language as spoken in São The Orthographic-Phonetic Information Databases of the Paulo Portuguese Language as Spoken in São Paulo bring information about each one of the 216 informants, organized The Inter-word Coarticulation and Phonetic Liaison Lexicon, according to the recording order and the annotation and also extracted from the Orthographic-Phonetic Information structuring procedures adopted, i.e. the Bases bring lexical Databases of the Portuguese Language as Spoken in São information organized according to the relations between Paulo , includes the phonetic liaison category (column 1), the linguistic and extralinguistic data. Table 1 brings an extract accentual combination in inter-word coarticulation (column 2), from the Databases. the lexical-syllabic phonetic transcription of phonetic liaison

occurrences, that is, phonic liaisons taking place between two 3.2. Electronic corpora of the Portuguese language as or more words (columns 3, 4, 5 and 6), with the corresponding spoken in São Paulo (textual databases) orthographic transcription (columns 7, 8, 9 and 10), as shown in the sample in Table 3. Electronic Corpora of the Portuguese Language as Spoken in São Paulo (Textual DataBases) can be extracted from the 4. Conclusions Orthographic-Phonetic Information Databases , with various possibilities of exploration by linguistic analysis programs, as in [3], for use in different areas of language studies and related In tune with the latest tendencies in language studies and fields. It is possible to generate as many corpora as there are cutting-edge technologies, this research can offer valuable linguistic and extralinguistic variables annotated, with contributions; (1) by meeting the demand, in Brazil, for different combinatory possibilities. Below is an extract from electronic speech transcription corpora with phonetic the corpus of educated speakers of Portuguese from São Paulo transcriptions; (2) by permitting scientific and technological (informants are from the city of São Paulo – Paulistanos − and interchange and enriching the interaction between the exact have university degrees), with speech transcription. On sciences and language sciences; (3) within the scope of Textual DataBases the punctuation codes were replaced by the Linguistics, for research based on corpora and the utilization corresponding marks. of computer technologies in studies of language in use; (4) at the interface between Linguistics and Informatics, by offering Lexical Code: 1011111 – Informant from São Paulo (1), linguistic information knowledge for the development, testing female (0), university degree (1), 25 to 29 years (11), upper and evaluation of speech processing systems for the Brazilian class (1), stimulated response, dialogical interaction (1) variety of the Portuguese language − recognition and synthesis −, one of the most complex areas in Natural Language De profissional ou... Processing.

Nossa mãe! depende do dia —isso que é o problema, 5. Acknowledgements entende?— Eu optei um curso de complementação pedagógica e, agora, tem uns trabalhos, para apre/ I would like to thank Manoel Vidal Castro Melo for his apresentar, então, eu estou fazendo esses trabalhos: tem o support and orientation in the analysis and programming for de sociologia —para entregar— e um sobre o INCRA; tem the development of the mainframe system and Edenis Gois uma tese que eu estou corrigindo a parte de português, Cavalcanti, for the creation of the system for use in PCs. toda parte de ortografia e construção —é de minha prima que tra/ trabalha no Butantã, sabe?; ela está fazendo uma tese sobre educação e saúde; também estou 6. References dando uma olhada na tese dela de manhã—. Tsu que mais que eu faço de manhã?... tempo de aulas, corrige-se [1] Z. M. Zapparoli Castro Melo, “Análise do comportamento provas; agora vai mudar —engano— vou mudar também; fonológico da juntura intervocabular no português do Brasil agora, de manhã, vou dar aula no Mackenzie; à tarde, (variante paulista). Uma pesquisa linguística com tratamento venho para cá —varia—. computacional”, Ph.D. dissertation, Universidade de São Paulo, São Paulo, SP, Brasil, 1980. [2] C. Szyperski, Component Software: Beyond Object-Oriented 3.3. Orthographic-Phonetic frequency lexicon of the Programming . Boston: Addison-Wesley, 1998. Portuguese language as spoken in São Paulo [3] Z. M. Zapparoli, A. Camlong, Do Léxico ao Discurso pela Informática . São Paulo: EDUSP/FAPESP, 2002, 256 p. + CD- The Frequency Lexicon was extracted from the complete ROM. [4] International Phonetic Association, Handbook of the International version of the corpus; for each word, it presents the Phonetic Association . Cambridge: Cambridge University Press, orthographic transcription (column 3), the corresponding 1999. phonetic transcriptions, with and without syllabic separation (columns 5 and 4 respectively), frequency of phonetic unit

Figure and tables

Figure 1. Informants distribution diagram

Table 1. Orthographic-Phonetic information databases of the Portuguese language as spoken in São Paulo

Key 1 Lexical Code 2 Obs. 3 Orthographic Transcription 4 Punct. 5 IS L 6 Phonetic Transcription 7 ES L/P 8 1 10111100101001 já 'JA 101 2 10111100101002 viajei 101 VI A 'J&Y 38 3 10111100101003 um 38 )2 101 4 10111100101004 bocadinho 1 101 BO KA 'D5 ^U 1 5 10111100201001 eu '&W 101 6 10111100201002 fui 101 'FUY 101 7 10111100201003 pela 101 P& L 5 8 10111100201004 6 Associação 5 A SO SYA 'S@% 101 9 10111100201005 dos 101 DUS 100 10 10111100201006 Professores 100 P>O F& 'SO >IZ 101 11 10111100201007 de 101 DI 101 12 10111100201008 6 Francês 4 101 F>@ 'S&Y 32 13 10111100201009 sabe 7 32 'SA BI 1 14 10111100301001 olha 4 '0 ?A 37 15 10111100301002 o 37 U 101 16 10111100301003 curso 101 'KU> S% 15 17 10111100301004 em 15 1 101 18 10111100301005 si 101 'SI 1 19 10111100301006 não 3 'N2 1 20 10111100301007 não 'N2 101 1 Order. 2 Lexical item identification code − informant, type of dialogue, discourse, utterance and word 3 Code for morpho-syntactic deviations, acronyms, proper names, foreign words 4 Orthographic transcription 5 Punctuation code 6 Initial syllable liaison code 7 Phonetic transcription [4] 8 End syllable liaison code / real pause
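The informant portion of the lexical codes shown in Table 1 encodes the extralinguistic variables described earlier. Assuming, as in the worked example of code 1011111 given in Section 3.2 (city, sex, education, two-digit age group, socioeconomic class, interaction type), that these variables occupy the first seven digits, a decoder might look as follows; only values attested in this paper are labelled, so the mappings are illustrative rather than exhaustive.

    # Decode the informant portion of a CorPor lexical code.
    # Field layout assumed from the example code "1011111" in the text:
    # city(1) sex(1) education(1) age group(2) class(1) interaction(1).
    CITY = {"1": "São Paulo"}            # Campinas and Itu codes not given here
    SEX = {"0": "female", "1": "male"}   # "male" assumed as the complementary value
    EDUCATION = {"1": "university degree"}
    AGE = {"11": "25 to 29 years"}
    CLASS = {"1": "upper class"}
    INTERACTION = {"1": "dialogical interaction (stimulated response)"}

    def decode_informant(code):
        """Decode the first seven digits of a lexical code such as '10111100101001'."""
        c = code[:7]
        return {
            "city": CITY.get(c[0], c[0]),
            "sex": SEX.get(c[1], c[1]),
            "education": EDUCATION.get(c[2], c[2]),
            "age group": AGE.get(c[3:5], c[3:5]),
            "class": CLASS.get(c[5], c[5]),
            "interaction": INTERACTION.get(c[6], c[6]),
        }

    print(decode_informant("10111100101001"))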


Table 2. Orthographic-Phonetic frequency lexicon of the Portuguese language as spoken in São Paulo

Accu. Freq.of Phone. Trans. Orthographic Phonetic Phonetic Transcrption / Ortho. Trans. 1 Freq. 2 Transcrption 3 Transcrption 4 Syllable 5 2 2 abacate ABA'KACI A BA 'KA CI 1 1 abacaxi ABAKA'$I A BA KA '$I 1 1 abacaxis YABAKA'$IZ YA BA KA '$IZ 1 1 abaixo A'BA$U A 'BA $U 2 1 abaixo A'BAY$U A 'BAY $U 1 1 abalado ABA'LADU A BA 'LA DU 1 1 abandonar AB@DO'NA A B@ DO 'NA 2 2 abandonei AB@DO'N&Y A B@ DO 'N&Y 1 1 abandonou AB@DO'NO A B@ DO 'NO 1 1 abatida ABA'CIDA A BA 'CI DA 3 3 aberta A'BE>TA A 'BE> TA 4 1 aberta A'BERTA A 'BER TA 1 1 abertas A'BE>TAS A 'BE> TAS 1 1 aberto A'BETU A 'BE> TU 6 2 aberto A'BERTU A 'BER TU 7 1 aberto YA'BE>T YA 'BE> T 8 1 aberto YA'BE>TU YA 'BE> TU 1 Orthographic transcription accumulated frequency 2 Phonetic transcription frequency 3 Lexical item orthographic transcription 4 Lexical item phonetic transcription without syllabic division [4] 5 Lexical-syllabic phonetic transcription [4]

Table 3. Inter-word liaison lexicon of the Portuguese language as spoken in São Paulo

Liaison 1 Stress 2 Phon 1 3 Phon2 4 Phon3 5 Phon4 6 Ortho.1 7 Ortho. 2 8 Ortho.3 9 Ortho.4 10 101 TA 'JA VI A 'J&Y já viajei 101 AA )2 BO KA 'D5 ^U um bocadinho 101 TT '&W 'FUY eu fui 5 AA P& L A SO SYA 'S@% pela Associação 100 AA DUS P>O F& 'SO >IZ dos Professores 101 AA DI F>@ 'S&Y de Francês 37 AA '0 ?A U olha o 15 AA 'KU> S% 1 curso em 101 TT 'N2 'S&Y não sei 33 ATA SY 'E W se é o 15 AA 'KU> S% 1 curso em 101 TA 'SI SI si se 2 AA 'VA LY A vale a 17 AA 'P7 N 1 'T3) /I pena entende 27 AT MAY Z '&W mas eu 101 AA 'WA $U KI acho que 1 Inter-word coarticulatory category 2 Inter-word syllable stress – combinatorial stress in inter-word context (T = stressed syllable; A = unstressed syllable) 3 Phonetic transcription of word 1 in word sequency [4] 4 Phonetic transcription of word 2 in word sequency [4] 5 Phonetic transcription of word 3 in word sequency [4] 6 Phonetic transcription of word 4 in word sequency [4] 7 Orthographic transcription of word 1 in word sequency 8 Orthographic transcription of word 2 in word sequency 9 Orthographic transcription of word 3 in word sequency 10 Orthographic transcription of word 4 in word sequency
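Since the Bases live in a relational engine queried through SQL, retrieving material such as the liaison records in Table 3 reduces to ordinary joins and filters. The snippet below sketches the idea with sqlite3 and an invented two-table layout (informants, tokens) whose column names merely echo the fields described above; the actual CorPor/Firebird schema is certainly richer.

    import sqlite3

    # Invented miniature schema echoing the documented fields; the real
    # CorPor System runs on Firebird with a richer relational layout.
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE informants (code TEXT PRIMARY KEY, city TEXT, sex TEXT, education TEXT);
    CREATE TABLE tokens (lexical_code TEXT, informant_code TEXT,
                         orthographic TEXT, phonetic TEXT, end_liaison INTEGER);
    """)
    con.execute("INSERT INTO informants VALUES (?, ?, ?, ?)",
                ("1011111", "São Paulo", "female", "university"))
    con.execute("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
                ("10111100101002", "1011111", "viajei", "VI A 'J&Y", 38))

    # All phonetic realisations produced by female, university-educated
    # speakers from São Paulo, e.g. for studying inter-word liaison.
    rows = con.execute("""
        SELECT t.orthographic, t.phonetic, t.end_liaison
        FROM tokens t JOIN informants i ON i.code = t.informant_code
        WHERE i.city = 'São Paulo' AND i.sex = 'female' AND i.education = 'university'
    """).fetchall()
    print(rows)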


Machine Translation of the Penn Treebank to Spanish

Martha Alicia Rocha¹, Joan Andreu Sánchez²

¹Departamento de Sistemas y Computación, Instituto Tecnológico de León, México
²Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Spain
[email protected], [email protected]

Abstract

In this work we explored the problem of translating the Penn Treebank corpus to Spanish. For this problem, we considered phrase-based Machine Translation techniques. Given that no parallel training data exists for this corpus, we used a large out-of-domain training data set and a small "high-quality" in-domain training data set. We studied simple and effective Domain Adaptation techniques that have been used for other applications. We report experiments on a small test set of sentences manually translated from the Penn Treebank corpus.

Index Terms: Penn Treebank, Machine Translation, Domain Adaptation.

1. Introduction

The Penn Treebank corpus [1] is one of the most frequently referenced data sets and has been extensively used for many sorts of Natural Language Processing problems, including but not limited to Language Modeling [2], Word Sense Disambiguation [3], PoS Tagging [4, 5], Statistical Parsing [6], and Maximum Entropy techniques [5, 7], among others. Recently, it has also been used successfully for Language Modeling in Machine Translation (MT) [8].

In recent years, promising Syntax-based MT systems have been introduced [9]. This sort of system may benefit from the availability of parallel annotated corpora that convey syntactic information in order to learn syntactic models. The very rich linguistic information of the Penn Treebank corpus, which includes syntactic information, semantic information and PoS tags, makes it very interesting for Syntax-based MT. It seems clear that an adequately translated version of this corpus would be of major interest for MT.

The more perfect the translation, the more useful the translated corpus would be. Currently there exist powerful techniques for MT [10], and the phrase-based MT approach is among the most popular [11, 12]. This approach uses automatic methods in order to learn the translation models from large parallel corpora. This technique has been shown to obtain moderate results for tasks of high complexity [13].

Although interesting MT systems have been proposed in the last years, perfect translations can only be guaranteed after human supervision. However, reviewing the full translation of the Penn Treebank corpus would be very expensive work. If the goal is to obtain perfect translations, other techniques like Computer-Assisted Translation (CAT) should be explored [14]. In this approach the user translates a data set interactively and the CAT system adapts both the translation models and the search process on-line. After some time, "high-quality" translated data becomes available that can be used to change the models. This seems a quite natural scenario for translating "perfectly" the Penn Treebank corpus. But Adaptation Techniques should also be considered.

As a first approximation to this scenario, in this work we explored the feasibility of translating the Penn Treebank corpus to Spanish by means of phrase-based MT techniques. An important problem with this approach arises when "in-domain" parallel data is not available. In such a case, several approaches have been explored, like Domain Adaptation techniques [15]. We show how simple adaptation techniques can be very effective when applied to the translation of the Penn Treebank corpus.

This paper is organized as follows: the next section reviews basic adaptation techniques in MT. Then, we describe how we intend to tackle the translation of the Penn Treebank to Spanish. Experiments are reported in Section 3, and some concluding remarks are given in Section 4.

2. Adaptation in MT

There exist different MT techniques, like word-based models [10], those based on finite-state models [16], syntax-based models [9], or phrase-based models [11, 12]. In this work we focus on phrase-based MT.

In phrase-based MT, the source sentence is split into phrases and then a large phrase translation table that contains paired source-target phrases is used to translate the source sentence into a target sentence. Target sentences are filtered according to a language model. Each paired phrase entry (e, f) in the phrase table has several associated scores h_i: phrase translation probabilities, reordering models, lexical translation probabilities, etc. In the decoding process, hypotheses are recombined in a log-linear model and the best-scoring translation is searched according to the expression:

    score(e, f) = exp ( Σ_i λ_i h_i(e, f) )        (1)

The weights λ_i associated to each component h_i are usually adjusted with a discriminative method on development data [17] in order to optimize a standard metric, like for example the BLEU metric [18]. The most important components in expression (1) are the translation models and the language model.
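Read literally, expression (1) is straightforward to state in code. The toy function below combines a handful of feature scores h_i with their weights λ_i; the feature names and values are invented for illustration and do not correspond to an actual phrase table or decoder configuration.

    import math

    def loglinear_score(h, lam):
        """Expression (1): score(e, f) = exp(sum_i lambda_i * h_i(e, f))."""
        return math.exp(sum(lam[name] * value for name, value in h.items()))

    # Made-up feature values h_i(e, f) for one candidate translation:
    h = {"phrase_trans": -2.1, "reordering": -0.4, "lexical_trans": -1.7, "lm": -3.2}
    # Weights lambda_i, normally tuned with a discriminative method on development data:
    lam = {"phrase_trans": 0.2, "reordering": 0.1, "lexical_trans": 0.2, "lm": 0.5}

    print(loglinear_score(h, lam))  # higher is better; the decoder keeps the best-scoring e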
The language model component in expression (1) is usually estimated from target monolingual corpora. The parameters of the translation models in expression (1) are usually trained from parallel corpora by using word alignment techniques [10]. Once both models are estimated, a test set is translated by using a decoder system. Usually, both the training data set and the test data set belong to the same domain.

When parallel data from the application domain is not available, Domain Adaptation (DA) techniques may be considered to obtain good results. The basic idea in DA is to adapt the models

39 Proceedings of the I Iberian SLTech 2009 trained with parallel data of one domain to a different domain. ment test, and final test. For this experiments we used only the We now summarize some DA techniques. sentences of training set to length 40 words. We called this set Different DA scenarios can be described depending on the EU corpus. The main characteristics of this training set can be availability of in-domain training data. In cross-domain adapta- seen in Table 1. tion, a small sample of parallel in-domain text is available. This small parallel in-domain data is used for adapting the models. Another possible scenario is when no data is available ahead of Table 1: Characteristics of the Europarl (EU) corpus. time and is generated dynamically. In such case, dynamic adap- Sentence Pair 730,740 tation techniques can be used. This second scenario is defined Running words Spa. 15,702,800 for Computer-Assisted Translation [14], in which the transla- Running words Eng. 15,242,854 tion is a carried out on-line by a human expert. The system Vocabulary Spa. 102,821 adapts dynamically to the user corrections. Note that in this sit- Vocabulary neg. 64,076 uation the corrections introduced by the human have an added value since has been validated and can be considered as “hight quality” translations. As we have described in previous section, we prepared a Other DA techniques have been defined for taking profit small data set to be used as in-domain data set. For this work, of latent information that could be present in the training data. we translated manually the first 300 sentences from section 23 Thus, Mixture Modeling was studied in [19], and Mixture- of the Penn Treebank. We called this set Small Parallel Penn Model Adaptation was studied in[20]. The main advantage of Treebank (SPPT) set. This is usually the section used for test- those techniques is their capability to learn specific probability ing. Note that although this is a very small data set, they were distributions that better fit subsets of training data set. manually translated, and could be considered as a “hight qual- Simple and effective DA techniques were studied in [15] for ity” translation set. Note also that this small corpus tried to a phrase-based MT system. The basic idea was to combine both simulate the CAT translation scenario that was previously de- the out-of-domain language model and the in-domain language scribed. Two Spanish native speakers independently translated model as separate components in expression (1) or to merge all the set of sentences. Then, each of them reviewed the transla- data an to obtain an unique component. Something similar was tions of the other person, and finally they reached an agreement carried out for the translations models. when different translations were proposed for each sentence. The main characteristics of this data set can be seen in Table 2. 2.1. Translation of the Penn Treebank It is important to note that the relation between the number of running words in Spanish and the vocabulary size was 3.7. The It should be noted that for the translation of the Penn Treebank same relation for the EU corpus was 152.8. This reveals that a corpus to Spanish, we intended to obtain “hight-quality” trans- great number of words of SPPT corpus could be singletons. lations of the corpus, and therefore, a final supervision of an From SPPT corpus, 50 sentences were used for develop- expert human would be appropriate. 
ment, 100 sentences were used for test, and 150 sentences were In this scenario, no parallel text was available, and there- used for training. The sentences to be included in each of these fore the appropriate technique would be the one previously de- three sets considerable affected the results, as we describe be- scribed in which the system adapts dynamically the models low. according to the corrections introduced by the human expert. However, the human cost of this translation could be very large. Therefore, it makes sense to try first other approaches less ex- Table 2: Characteristics of the Small Parallel Penn Treebank pensive. (SPPT) corpus. The approach that we considered in this work was similar to Sentence Pair 300 [15]. In that work, an in-domain corpus was used for DA. Both Running words Spa. 6,109 the out-of-domain and the in-domain data sets were combined Running words Eng. 5,689 in different ways in order to improve translation results. The Vocabulary Spa. 1,664 in-domain data set was not so large as the out-of-domain, but Vocabulary Eng. 1,498 enough training data was available. However, for the Penn Treebank there was not any sort of training data and just a small portion was manually translated. Standard free software tools were used for the experiments. This small data was used to improve the results obtained from a For training of the language models, we used SRILM toolkit1. standard baseline system. In this way the approach followed In all the experiments, 5-gram models trained with default op- in this work can be considered as a combination of two ap- tions were used as language models. For training translation proaches: DA adaptation from in-domain data together with models, we used GIZA++2 [22] with default options. Default “hight-quality” translations. translation models (hi components in expr. (1)) were used. Fi- nally, MOSES3 [23] was used for decoding. The parameters 3. Experiments of the model were tuned with MERT technique by improving BLEU metric. The out-of-domain corpus used in the experiments was the Eu- In a similar way to [15], several systems were trained, each roparl corpus [13]. This is a set of parallel texts that is free with a different way of combining the information of the two available for several languages including English and Spanish. corpora. The different combinations were the following: This corpus was built from the proceedings of European Parlia- ment, which are published on the web. For our experimentation, 1http://www.speech.sri.com/projects/srilm/ we used the second version of this corpus [21]. This corpus is 2http://www.fjoch.com/GIZA++.html divided into four separate sets: training, development, develop- 3http://www.statmt.org/moses/


• B: baseline system. This systems was trained only with the EU corpus. It corresponds to situation in which there Table 4: BLEU, Word Error Rate (WER) and Translation Error is not adaptation data available. Rate (TER) obtained for the translation systems. System BLEU WER TER • B+M50: in this system, the parameters of the baseline B 25.5 56.4 53.4 system were adjusted with MERT on a development set B+M50 22.1 61.4 58.3 composed of 50 sentences of SPPT corpus. B+M50+TW 17.6 64.1 62.8 • B+M50+TW: a second translation table that was trained B+M50+LW+TW 22.8 60.5 56.8 with 150 sentences from SPPT was added to previous B+M50+TWMerg 23.3 57.8 55.0 system. B+M50+TWMerg +LWMerg 23.6 57.7 54.7 • B+M50+TW+LW: a second language model that was trained with 150 sentences from SPPT was added to pre- vious system. • B+M50+TWMerg: both EU set and 150 training sen- tences from SPPT were merged and then only a trans- Table 5: BLEU results for the experiment with SPPT corpus, lation table was obtained. 50 tuning sentences of SPPT when we translated SPPTts and SPPTtr (closed vocabulary in were used for MERT. the last case). SPPTtr SPPTts • B+M50+TWMerg+LWMerg: both EU set and 150 Without MERT 71.1 16.8 training sentences from SPPT were merged, and then SPPTdev 60.8 16.9 With MERT only a translation table and only a language model were SPPTtr 76.0 13.2 obtained. 50 sentences of SPPT were used for MERT. Table 3 summarizes this information.

Table 3: Data sets and number of sentences used in the transla- This experiment was interesting, because prevented us to use tion systems. cross-validation in the experiment with SPPT corpus. System Tr. set Dev. set Test set Therefore, we repeated the experiments but choosing each B EU - SPPT 100 sentence in SPPTdev, SPPTtr, and SPPTts randomly. We tried B+M50 EU SPPT 50 SPPT 100 several seeds and we chose the partition that provided the best B+M50+TW results. Table 6 shows the obtained results. In this partition, B+M50+LW+TW EU+ the out-of-vocabulary words of the Spanish side of SPPTts with B+M50+TWMerg SPPT SPPT 50 SPPT 100 regard to the Spanish side of the EU set was 91 words, and this B+M50+TWMerg 150 value decreased to 53 when we merged both the EU set and the +LWMerg Spanish side of SPPTtr.
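The out-of-vocabulary figures quoted above (91 word types, dropping to 53 once SPPTtr is merged into the training data) are the kind of statistic produced by a check like the following; the file names are placeholders, since the corpora themselves are not distributed with this paper.

    def vocabulary(path):
        """Set of word types in a whitespace-tokenised text file (one sentence per line)."""
        vocab = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                vocab.update(line.split())
        return vocab

    # Placeholder file names for the Spanish side of the training and test data.
    train_vocab = vocabulary("eu.train.es")                   # EU corpus alone
    merged_vocab = train_vocab | vocabulary("sppt.train.es")  # EU + SPPTtr
    test_vocab = vocabulary("sppt.test.es")

    print("OOV types vs EU:        ", len(test_vocab - train_vocab))
    print("OOV types vs EU+SPPTtr: ", len(test_vocab - merged_vocab))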

In order to simulate CAT approach mentioned in Section 3, Table 6: BLEU, Word Error Rate (WER) and Translation Error as a first attempt we chose the development, training, and test Rate (TER) obtained for the translation systems with SPPTdev, sets of SPPT as follows. Consecutive sentences for each set SPPTtr, and SPPTts randomly generated. were chosen. Thus, sentences from 1 to 50 were used for devel- opment (SPPTdev set), sentences from 51 to 150 were used for System BLEU WER TER test (SPPTts set), and sentences from 151 to 300 were used for B 20.8 61.9 58.2 training (SPPTtr set). Table 4 shows the obtained results for the B+M50 22.2 61.5 58.1 translation systems that have previously described. B+M50+TW 18.3 68.8 65.1 In this first experiment, we observed that the best translation B+M50+LW+TW 18.6 65.0 61.6 results were obtained by the baseline system, and adjusting the B+M50+TWMerg 23.4 58.5 54.2 translation with MERT did not achieve to improve this result. B+M50+TWMerg We thought that this could be due to the fact that the chosen test +LWMerg 23.1 59.7 55.8 set included a lot of out-of-vocabulary words that prevented the system to improve the baseline result. We tested this hypothesis in the following way. First, we trained a system with SPPTtr set, and then we translated in- In this experiment, we can see that some of the proposed dependently both SPPTts and SPPTtr (close vocabulary in the systems improved the baseline results. We can also see that last case), without MERT, and with MERT with SPPTdev and for this task and with this amount of data is better to merge SPPTtr, both again independently. Table 5 shows the BLEU the training data than combine them in two separate table and obtained results. two separate language models as opposite to the results reported The large difference between results in column SPPTtr and in [15]. This could be due to the fact that with the small amount column SPPTts revealed that the vocabulary of SPPTtr and of training data available for SPPT, the weights of the log-lineal SPPTts was quite different. Note also that when we used SPPT- model could not be better tuned. dev for MERT, the BLEU for SPPTtr decreased more than 10 Table 7 shows some translation results. Section 23 of the points. Note in addition that when we used SPPTtr for MERT Penn Treebank includes a lot of sentences closely related to the (over-learning the SPPTtr set), the BLEU decreased for SPPTts. exchange stock market and real state companies.
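The out-of-vocabulary figures quoted above (91 unknown Spanish test words with respect to the EU training vocabulary, dropping to 53 after merging in the SPPT training data) can be reproduced with a few lines of code. The sketch below is illustrative only; the file names and whitespace tokenization are assumptions, not the authors' scripts.

    def vocab(path):
        """Whitespace-tokenized vocabulary of a text file (one sentence per line)."""
        words = set()
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                words.update(line.split())
        return words

    # Hypothetical file names for the Spanish sides of the corpora.
    eu_vocab = vocab("europarl.train.es")
    sppt_tr_vocab = vocab("sppt.train.es")
    sppt_ts_words = vocab("sppt.test.es")

    oov_vs_eu = sppt_ts_words - eu_vocab
    oov_vs_merged = sppt_ts_words - (eu_vocab | sppt_tr_vocab)

    # For the randomly generated partition this would print 91 and 53.
    print(len(oov_vs_eu), len(oov_vs_merged))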

41 Proceedings of the I Iberian SLTech 2009

Table 7: Translation example

    Source:        kaufman & broad , a home building company , declined to identify the institutional investors .
    System output: kaufman & broad , un hogar edificio compañía , desciende identificar a los inversores institucionales .
    Reference:     kaufman & broad , una compañía de construcción de casas , declinó identificar los inversores institucionales .

4. Conclusions

We have presented work on the translation of the Penn Treebank into Spanish. A phrase-based model has been used for this task. Out-of-domain training data was used, together with a very small “high-quality” in-domain training set, simulating a CAT system. The obtained results clearly improved when “high-quality” in-domain training data was included in the system. For future work we intend to use CAT systems for the same problem.

5. Acknowledgements

This work has been partially supported by the EC (FEDER) and the Spanish MEC under grant TIN2006-15694-CO2-01, and by the Spanish research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018). The first author is supported by “División de Estudios de Posgrado e Investigación” and by the “Metrología y Sistemas Inteligentes” research group of Instituto Tecnológico de León.

6. References

[1] M. Marcus, B. Santorini, and M. Marcinkiewicz, “Building a large annotated corpus of English: the Penn Treebank,” Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[2] B. Roark, “Probabilistic top-down parsing and language modeling,” Computational Linguistics, vol. 27, no. 2, pp. 249–276, 2001.
[3] D. Bikel, “A statistical model for parsing and word-sense disambiguation,” in Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 2000, pp. 155–163.
[4] E. Brill, “Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging,” Computational Linguistics, vol. 21, no. 4, pp. 543–565, 1995.
[5] A. Ratnaparkhi, “A maximum entropy model for part-of-speech tagging,” in Proc. Empirical Methods in Natural Language Processing, University of Pennsylvania, May 1996, pp. 133–142.
[6] M. Collins, “Head-driven statistical models for natural language parsing,” Computational Linguistics, vol. 29, no. 4, pp. 589–637, 2003.
[7] E. Charniak, “A maximum-entropy-inspired parser,” in Proc. of NAACL-2000, 2000, pp. 132–139.
[8] E. Charniak, K. Knight, and K. Yamada, “Syntax-based language models for statistical machine translation,” in Proc. of MT Summit IX, New Orleans, USA, September 2003.
[9] D. Chiang, “Hierarchical phrase-based translation,” Computational Linguistics, vol. 33, no. 2, pp. 201–228, 2007.
[10] P. Brown, S. D. Pietra, V. D. Pietra, and R. Mercer, “The mathematics of statistical machine translation: parameter estimation,” Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
[11] R. Zens, F. Och, and H. Ney, “Phrase-based statistical machine translation,” in Proc. of the 25th Annual German Conference on Artificial Intelligence, LNAI 2479, 2002, pp. 18–32.
[12] P. Koehn, “Pharaoh: a beam search decoder for phrase-based statistical machine translation models,” in Proc. of AMTA, 2004.
[13] P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” in Proc. of MT Summit, 2005.
[14] S. Barrachina, O. Bender, F. Casacuberta, J. Civera, E. Cubel, S. Khadivi, A. L. Lagarda, H. Ney, J. Tomás, and E. Vidal, “Statistical approaches to computer-assisted translation,” Computational Linguistics, vol. 35, no. 1, pp. 2–28, 2009.
[15] P. Koehn and J. Schroeder, “Experiments in domain adaptation for statistical machine translation,” in Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 224–227.
[16] F. Casacuberta and E. Vidal, “Machine translation with inferred stochastic finite-state transducers,” Computational Linguistics, vol. 30, no. 2, pp. 205–225, 2004.
[17] F. Och, “Minimum error rate training in statistical machine translation,” in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan: Association for Computational Linguistics, July 2003, pp. 160–167.
[18] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proc. 40th Annual Meeting of the ACL, 2002, pp. 311–318.
[19] J. Civera and A. Juan, “Domain adaptation in statistical machine translation with mixture modelling,” in Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 177–180.
[20] G. Foster and R. Kuhn, “Mixture-model adaptation for SMT,” in Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 128–135.
[21] P. Koehn and C. Monz, Eds., Proceedings of the Workshop on Statistical Machine Translation, New York City: Association for Computational Linguistics, June 2006.
[22] F. Och and H. Ney, “A systematic comparison of various statistical alignment models,” Computational Linguistics, vol. 29, no. 1, pp. 19–52, 2003.
[23] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open source toolkit for statistical machine translation,” in Proc. of the ACL Companion Volume, Demo and Poster Sessions, June 2007, pp. 177–180.

42

Proceedings of the I Iberian SLTech 2009

Adapting the Unisyn Lexicon to Portuguese: Preliminary issues in the development of LUPo

Simone Ashby, José Pedro Ferreira, Sílvia Barbosa

Instituto da Linguística Teórica e Computacional (ILTEC), Lisbon, Portugal {simone,zpferreira,silvia}@iltec.pt

Abstract integrated and well informed system. LUPo will capitalize on having direct access to a morphological parser, part of This paper presents some preliminary issues and proposed speech information, syllable boundaries, mappings of solutions in the development of an accent-independent European and Brazilian Portuguese spelling variants, and pronunciation lexicon for Portuguese, known as the etymological relationships. Portuguese Unisyn Lexicon (LUPo). LUPo's objectives are Our approach to this work is explicitly knowledge- presented within the context of the Portal da Língua driven, as motivated by: (1) the need for greater linguistic Portuguesa knowledge base. Key considerations are input in statistically derived speech processing algorithms; addressed for encoding morphological boundaries, treating (2) the success of the English Unisyn model in creating a orthographic forms, and handling loan words. Here, it is highly scalable, extendible, and customizable lexicon for argued that the knowledge-driven paradigm exemplified in accommodating a large number of regional variants; and the original English Unisyn Lexicon, along with the Portal da (3) complementary objectives for the establishment of a Língua Portuguesa's relational structure and rich Portuguese cross-dialectal database and the first freely lexicographic content present a good foundation for available online resource of its kind to provide phonetic establishing a tightly integrated and well informed system. transcriptions. Index Terms: lexicon, pronunciation, Portuguese accents, In subsequent sections of this paper, the design morphology, orthography, loan words, dictionary, relational objectives and principal architectural components of LUPo database, speech synthesis are presented, followed by a brief description of the Portal. Three preliminary issues in the development of LUPo are 1. Introduction then discussed concerning the encoding of morphological Adapting speech technologies to accommodate a wider boundaries, treatment of orthographic forms, and the number of speakers, and represent regions and countries for handling of foreign loan words. whom such development concerns have largely been overlooked, carries significant economic and political weight 2. LUPo in narrowing the global digital divide. Semi-automatic The LUPo project will produce an accent-independent approaches for exploiting regularities between graphemes pronunciation lexicon for Portuguese, along with tools for and phones have yielded good results. However, such generating accent-specific output for lexical entries and systems rarely extend to multiple accents, and make limited multi-word texts. Users will have the option of accessing the use of morphology and other types of lexicographic open-source lexicon and tools as a standalone application or information. Moreover, these projects typically occur in via the Portal (http://www.portaldalinguaportuguesa.org). isolation, and are governed by private sector interests that The Portal module will be accessible as part of the page view prohibit the sharing of data and tools. for each lexical entry, wherein the user can select a desired Fitt's Unisyn Lexicon [1] presents a paradigm for accent to view the corresponding transcription for a given minimizing the costs of representing multiple pronunciation word. 
Online and offline users will also have access to a tool variants by using knowledge-driven approaches to specify for entering a fixed amount of text, selecting a desired correspondences between a master lexicon and different accent, and generating multi-word transcribed output for that accent-specific targets. Implicit in the notion of a master accent, while also having the option to show the rules where lexicon is the expression of phonological variation in the they apply. form of key symbols, a kind of metaphoneme based on Our methodology will incorporate many of the strategies Wells' keywords concept [2]. Key symbols, which can used by Fitt to create the English Unisyn Lexicon. Standard additionally be used to encode stress, syllables, and Brazilian Portuguese (BP) and European Portuguese (EP) morphology, make up the lexical entries and set them apart lexicons will be merged to form the basis of the master as accent-independent. Instead of creating hundreds of lexicon. Lexical items will be represented as underspecified thousands of phonetic transcriptions for each new accent, forms using key symbols. Regional hierarchical relationships such data are generated automatically through the application will be encoded within the system to enable the inheritance of accent-specific post-lexical rules. By framing this of accent related features. Regional accents will be modeled information within the context of a regional accent hierarchy, one by one, based on the representativeness of the accent and a single rule can be used to describe a number of accents. the availability of data or informants. And the tools for This paper presents some of the preliminary issues and generating surface output will be developed from Perl proposed solutions in the development of an accent- scripts. independent lexicon and rule system for generating accent- Unlike Fitt's Unisyn Lexicon, LUPo will be stored in the specific pronunciations in Portuguese, otherwise known as multi-dimensional and lexicographically rich Portal the Portuguese Unisyn Lexicon (LUPo). Our methodologies database, thereby enabling the cross-referencing of will be a reformulation of those originally employed by Fitt semantically informed morphological parses, part of speech to adapt this successful paradigm to Portuguese, and take information, spelling variants, and foreign loan word advantage of the relational structure and rich lexicographic attributes. 'Blackout', for example, which is stored in the content of the Portal da Língua Portuguesa [3] knowledge Portal's dictionary of estrangerismos (loan words), will base (hereafter referred to as the 'Portal') to create a more automatically be excluded from the application of post- 43 Proceedings of the I Iberian SLTech 2009 lexical rules, thereby eliminating the need to hard-code it of a system of files organized by COUNTRY, REGION, (and other loan words) in an exception dictionary. Instead, TOWN, and PERSON. Each node will inherit the features of the appropriate morpho-phonological rules will be re-routed the previous node, provided the inheritance is not broken by to the aportuguesamento (Portuguese spelling adaptation) to the introduction of new features at a lower level. As the which this word is mapped. lowest level, attributions to PERSON will override all other The Portal's morphological parser will enable LUPo to level specifications. 
Similarly, features attributed to local provide better lexical coverage while eliminating the need to areas will override the settings of wider geographical areas. encode redundant information. Thus, LUPo can be used to Applying an example from Brazil, a general rule may be generate transcriptions for the words 'actividade' and attributed to the COUNTRY node for expressing the coronal 'practicamente' without the need to store these and other plosives /t/ and /d/ as [d] and [t] (Figure 1). However, given inflected nominal forms in the lexicon. what we already know about accents specific to Salvador, The redundancy of treating spelling variants as separate Belo Horizonte, and Rio de Janeiro, it will be necessary to entries in the master lexicon will be avoided by making use introduce a separate rule at the TOWN level for transforming of the Portal's existing system of cross referencing BP and these sounds into the affricates /ʧ/ and /ʤ/, thereby EP forms. For example, the master pronunciation for the overriding the previous rule for these variants. Portuguese superlative meaning 'great' will have a single entry in the master lexicon that maps to the respective BP and EP spellings ótimo and óptimo, along with corresponding forms from previous recent orthographic accords. These and other topics concerning utilization of the Portal's relational structure and lexicographic content are discussed in greater detail in section 4.
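The inheritance-with-override behaviour of the COUNTRY/REGION/TOWN/PERSON hierarchy, and the Brazilian /t/, /d/ example above, can be pictured with a small sketch. Everything below (node names, rule notation, the choice of towns) is illustrative only and is not LUPo's actual rule formalism.

    # Post-lexical rules attached to nodes of a hypothetical accent hierarchy.
    # A lower node inherits the rules of its ancestors unless it redefines them.
    HIERARCHY = {
        "brazil": None,                # COUNTRY node
        "rio-de-janeiro": "brazil",    # TOWN nodes under the COUNTRY node
        "salvador": "brazil",
    }

    RULES = {
        # COUNTRY-level rule: default realisation of the coronal plosives.
        "brazil":         {"t": "t", "d": "d"},
        # TOWN-level rules override the inherited realisation with affricates
        # (illustrative context), as described for Salvador, Belo Horizonte
        # and Rio de Janeiro.
        "rio-de-janeiro": {"t": "tS", "d": "dZ"},
        "salvador":       {"t": "tS", "d": "dZ"},
    }

    def effective_rules(node):
        """Collect rules from the root of the hierarchy down to `node`;
        later (more specific) nodes override earlier ones."""
        chain = []
        while node is not None:
            chain.append(node)
            node = HIERARCHY[node]
        rules = {}
        for n in reversed(chain):      # root first, leaf last
            rules.update(RULES.get(n, {}))
        return rules

    print(effective_rules("brazil"))           # {'t': 't', 'd': 'd'}
    print(effective_rules("rio-de-janeiro"))   # {'t': 'tS', 'd': 'dZ'}

In the same way, a PERSON node placed below a TOWN node would override both, which is the behaviour described for the lowest level of the hierarchy.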

2.1. Objectives In keeping with LUPo's development initiatives, our objectives for the project are as follows: • Create an accent-independent master lexicon for Portuguese using an extended set of X-SAMPA-based typographical symbols that account for morphological boundaries and other phenomena for encoding lexical entries and processing conversions. Figure 1: Sample regional accent hierarchy and post- • Use a knowledge-driven approach to create a system of lexical rule application for Brazil. post-lexical morpho-phonological rules for processing conversions from the master lexicon to accent-specific targets. 2.3. Master lexicon and key symbols • Develop tools for automatically generating accent- specific output for individual lexical entries and multi- Accent-specific transformations will be derived from a word texts. master lexicon represented through the use of key symbols, • Establish a regional accent hierarchy for specifying based on an extended interpretation of the X-SAMPA which rules apply to one or more accents, along with alphabet. In addition to representing consonants and vowels, default inheritances for all the sub-nodes of a large typographical markers will be used to encode stress, geographic area, and a system for overriding these syllables, morphological constituents, and other phenomena, values. e.g. deletion of word-final rhotics for some BP varieties. • Create pronunciation models for: standard BP and EP; The specification of consonants and vowels requires an the Lisbon accent and at least one additional EP analysis of Portuguese phonology and the underspecification accent; the two major BP accents, as they are actually of segments. This work will be guided by the research team's spoken in Rio de Janeiro and São Paulo, plus one or phonologists, who will be involved in distinguishing types of more other accents specific to Brazil; and three or systematic variation (for defining global symbols) from more accents from the continents of Africa and Asia. cases of allophonic variation (for defining post-rule • Enhance the Portal by introducing richly detailed and symbols). Based on the work involved in developing the varied phonetic content, and open it up to a wider original Unisyn Lexicon [4], we predict that the more audience. intensive aspect of this task will be the specification of • Provide the research community and general public morphological boundaries, as these will be instrumental in with a freely available electronic data standard for: either triggering or blocking post-lexical rules. testing the results of different speech processing The development of an accent-independent master systems, conducting empirical analyses across multiple lexicon will proceed as follows: Portuguese accents, and facilitating L2 studies of • Preprocess the 1961 edition of the Dicionário da Portuguese. Língua Portuguesa da Academia Brasileira de Letras • Facilitate the entry of lesser or undocumented regional (the only known large-scale lexical resource to contain variants into the digital domain. phonetic transcriptions for BP) to adapt transcriptions • Establish the basis for a subsequent project aimed at to the criteria followed for the phonetic information in developing a TTS module for inclusion in the Portal the Portal. and as a freely available, open-source standalone • Extract from the Portal the 2,217 high-frequency application. lemmas that correspond to [5]. Establish links to the relevant databases for getting spelling variants, 2.2. 
Regional accent hierarchy syllable boundaries, part of speech information, LUPo's architecture will be framed by a regional accent foreign loan word attributes, frequency data, and hierarchy that feeds accent-specific transformations by morphological constituents. specifying whether and how the rules apply to geographically defined entities. As in Fitt's Unisyn Lexicon, it will consist 44 Proceedings of the I Iberian SLTech 2009

• Merge the lexical entries in the Portal with those kind to provide richly detailed and varied phonetic output for contained in the Dicionário da Língua Portuguesa da a large number of Portuguese accents. Indeed, it will be the Academia Brasileira de Letras. first free online resource to provide any manner of phonetic • Extract differences between the two resources and transcription data for Portuguese. ensure that key symbols account for every confirmed instance of variation. 4. Preliminary development issues • Introduce frequency data and a system for weighting high-frequency words. In this section, three key issues in the initial development of • Fully integrate LUPo into the Portal. LUPo are discussed in relation to the structural and The end results of these actions will be the project's first lexicographic attributes of the Portal. instantiation of an accent-independent lexicon, which will provide the basis for applying post-lexical rules and 4.1. Morphology generating accent-specific target output. The subsequent task As previously indicated, one of the advantages LUPo will of creating and evaluating morpho-phonological post-lexical have over the original Unisyn Lexicon is that it will reside rules will feed from this data, and result in an iterative set of within a set of relational databases containing detailed improvements to the master lexicon through the modeling of lexicographic information, e.g. syllable boundary, stress, and new accents and application to a wider number of lemmas. morphological encodings. Here, we intend to show how 2.4. Morpho-phonological rule sets having these data will enhance the reliability of LUPo, while greatly reducing the amount of manual labor required to The task of creating post-lexical morpho-phonological rule develop such a lexicon. sets can be broken down into four successive types of Vowel height presents one of the most challenging activities. These include: (1) using semi-automatic methods aspects of devising a suitable grapheme-to-phoneme system to model the rules for generating standard BP and EP output for Portuguese. Mid and low vowels are usually raised in an (effectively recreating the transcriptions in the Portal and the unstressed position, especially in standard EP. This height processed version of the Dicionário da Língua Portuguesa variation is predictable to an acceptable extent by knowing da Academia Brasileira de Letras); (2) expanding the base the underlying phonologic segment, syllable boundaries and lexicon and post-lexical rules for standard BP and EP to the stress position. For instance, /o/ will typically be pronounced 55K transcribed lemmas in the Portal, along with as [o] in a stressed position and as [u] in an unstressed one corresponding inflected forms; (3) and extending the rule (sopa, ['so.pɐ], 'soup'; ensopado, [ẽ.su.'pa.ðu] 'stew'), while / sets to include as many actual spoken accents of Portuguese ɔ/ will always be pronounced [ɔ] in a stressed position and from Africa, Asia, Europe, and South America as the [u] or [ɔ] in an unstressed one (e.g. roda, ['ʀɔ.ðɐ]; 'wheel'; project's resource and time constraints allow. rodinha, [ʀɔ.'ði.ɲɐ]; 'small wheel' rodagem, [ʀu.'ða.ʒɐ̃̃j], For steps 1-3 above, we will also be performing a 'running in [e.g. an engine]'). As these examples show, the number of subroutines. For example, the modeling of each rules can be extended to words sharing the same root. 
new accent will first be done for the 2,217 high-frequency The problem is that orthography, stress position, and lemmas in [5] before undergoing thorough evaluation by the syllable structure are not enough to determine whether the 'o' project's dialect consultants and informants. Each time the in roda and sopa corresponds to a [-low] or [+low] segment. data and rules are evaluated, it will be necessary to refine the This also applies to the [-high] front vowels /e, ɛ/, both key symbols, make changes to master lexicon, and add represented by the grapheme 'e'. Without a workaround for mappings. Once the initial set of pronunciations and rules this problem, most of the entries containing a non-final 'o' or have been thoroughly checked and modified, we will extend 'e', along with their morphologically related words, would the rule set for that accent to the entire list of lemmas and need to be checked manually. This problem is ubiquitous in inflected forms before, again, subjecting the resulting any phonetically transcribed lexicon of Portuguese. pronunciations and rules to a system of spot-checking and The Portal will soon feature a database containing revision (Figure 2). morphological information. The morphological database Throughout all of these processes, it will be necessary to includes morphological boundary encodings, along with the define new entries in the exceptions dictionary, and create identification of roots and affixes so that any given root or and adjust mappings to LUPo's regional accent hierarchy. affix will bear a unique record ID. The completion of this database will enable us to: (1) improve the overall Portal 3. Portal da Língua Portuguesa architecture by tying words to their morphological constituents; and (2) predict the phonologic behavior of any The Portal is an online knowledge base dedicated to word that contains a previously analyzed constituent (the providing the general public with a set of free Portuguese latter of which should be particularly relevant in the language resources, as well as serving as an open-source development of LUPo). Through pursuit of this repository of lexicographic information for the research methodology, we expect to greatly reduce the need for community. Its modular architecture enables it to extend far performing manual checks. beyond the bounds of a traditional dictionary, while its LUPo will not only benefit from the lexicographic relational structure offers the advantage of dealing with content contained in the Portal, but it will itself have an homographs, spelling variants, inflected forms, loan words, impact on the lemma list in the Portal's central database. and etc. Consider the following examples: The Portal was originally conceived as a lexical database 1a) molho, masc. noun, , 'bundle' focused on representing Portuguese inflectional morphology ['mɔ.ʎ+u] 1b) molho, masc. noun, , 'sauce' (i.e. MorDebe), and began in 2004. It continues to be ['mo.ʎ+u] 2a) molhada, fem. noun, , 'large bundle' maintained by the Instituto da Linguística Teórica e [mɔ.'ʎ+að+ɐ] 2b) molhada, fem. noun, past [ ], 'wet'1 Computacional (ILTEC) in Lisbon, and contains more than mu.'ʎ+aðɐ 150,000 lemmas and close to 1.5 million word forms. Currently, the Portal is a meaningless dictionary, and The Portal currently receives 4000-4500 hits by unique homographs of the same class are treated as a single entry, users per day and is increasingly regarded as a standard 1 resource for inquiries about the Portuguese language. 
The transcriptions provided in section 4.1 are for standard Inclusion of LUPo in the Portal will greatly enhance the EP. The plus symbol '+' is used to mark morphological Portal as a pan Lusophone resource and the only one of its boundaries. The assumptions about the phonological system of Portuguese are based on the analyses of [6]. 45 F
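The vowel-height regularity discussed in section 4.1 above - the surface value of a mid vowel being largely predictable from the underlying segment and the stress position - lends itself to a simple rule sketch. The rule table and lexicon entries below are illustrative simplifications for standard EP (nasal vowels are left untouched); they are not LUPo's actual rules or entries.

    # Surface value of an underlying vowel as a function of stress (X-SAMPA-ish).
    SURFACE = {
        ("o", "stressed"): "o",  ("o", "unstressed"): "u",   # /o/: sopa vs. ensopado
        ("O", "stressed"): "O",  ("O", "unstressed"): "u",   # /O/: roda vs. rodagem
        ("e", "stressed"): "e",  ("e", "unstressed"): "1",   # '1' ~ X-SAMPA for the EP reduced vowel
        ("E", "stressed"): "E",  ("E", "unstressed"): "1",
    }

    # Hypothetical lexicon entries: per-syllable underlying vowel plus stress mark.
    LEXICON = {
        "sopa":     [("o", "stressed"), ("a", "unstressed")],
        "ensopado": [("e~", "unstressed"), ("o", "unstressed"),
                     ("a", "stressed"), ("o", "unstressed")],
        "roda":     [("O", "stressed"), ("a", "unstressed")],
        "rodagem":  [("O", "unstressed"), ("a", "stressed"), ("6~", "unstressed")],
    }

    def surface_vowels(word):
        # Vowels without a rule (e.g. /a/ and nasal vowels) are kept as-is.
        return [SURFACE.get(syll, syll[0]) for syll in LEXICON[word]]

    for w in ("sopa", "ensopado", "roda", "rodagem"):
        print(w, surface_vowels(w))

As the section notes, orthography alone cannot tell whether an 'o' or 'e' is the [-low] or [+low] underlying segment; in this sketch that distinction is simply assumed to be stored in the lexicon entry, which is exactly the information the Portal's morphological and phonological databases are meant to supply.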

Proceedings of the I Iberian SLTech 2009

provided there is no formal feature (such as different identification of loan words and the rerouting of LUPo inflectional paradigms) telling them apart. Once fully queries to the Portuguese spelling adaptation, if a reliable integrated in the Portal, the data contained in LUPo will one exists. enable us to identify homographs such as those in examples The issue lies in the inconsistent handling of loan words (2a) and (2b) above, and split them into different entries in Portuguese. For, while some have been fully adapted, such through the application of a set of formal features that, until as holígane for 'hooligan' and surfe for 'surf', others function recently, have been lacking in the Portal, i.e. phonetics. more as graphical-phonetic hybrids. E.g. the representation of 'iceberg' as icebergue, which retains the [aɪ] sound from 4.2. Orthography 'ice', thus barring the otherwise standard grapheme-to-phone mapping of 'i' > [i]. To further complicate matters, some An example of a formal feature that currently sets aportuguesamentos lack authority in the spoken world. The homographs apart is an entry's inflectional paradigm. For Portal currently maps 'bluff' to the graphical blefe, which example, til takes two plurals, tis and tiles, the former appears in dictionaries and is pronounced in standard BP as corresponding to a botanic species endemic of the ['blƐ.fɨ], but is dispreferred for ['blɐf] in standard EP. Indeed, Portuguese archipelagos and the latter to the graphical 'tilde' this last example helps to illustrate yet another problem in symbol. These words currently have two separate entries in the treatment of loan words: the use of different spelling (and the Portal. pronunciation) adaptations for regional variants of There is an explicit link between entries that corresponds Portuguese. to alternative ways of spelling the same words, as in the case Given the above problems and their implications for of luzecu and luze-cu, 'firefly'. When LUPo is fully generating accurate pronunciations in LUPo, we will adopt a integrated into the Portal, we will be able to profit from phased approach to dealing with loan words that makes use having this information by checking whether the of the Portal's current ability to identify these forms, while transcription for each orthographic form is the same, thus steadily improving the manner in which aportuguesamentos guaranteeing correctness. are encoded in the estrangerismos dictionary, and One problem associated with this setup is the fact that formulating morpho-phonological rules that apply to the some alternative orthographic forms are only accepted in language of origin. specific countries. This stems in part from the poor lexicographic tradition of many of the Portuguese speaking countries, the fact that only Brazil has a state-mandated 5. Conclusions language normalizing organization, and the lack of The Unisyn Lexicon presented in [1] offers a set of strategies coordination between countries where Portuguese is and methodologies suitable for extending this model to recognized as an official language. Portuguese. By incorporating many of the same design Fortunately, the Portal contains a database that explicitly constituents as Fitt's model and utilizing the existing links these entries, in each case identifying the country architecture of the Portal, LUPo will function as a free, open- where such forms are acceptable. 
By using the information in source accent-independent pronunciation lexicon for this database, we will avoid generating pronunciations for Portuguese. Issues concerning the representation of spellings that are incorrect for a given regional variant. In morphology, orthography, and loan words will no doubt other words, we will have an automatically generated present us with a variety of challenges. However, by exclusion list. Such is the case for the graphical pair constructing LUPo as a tightly integrated module that has lambugem (EP) and lambujem (BP). While in this particular full access to the lexicographically rich Portal data, we example, the pronunciation will be unaffected (since expect to overcome many of these problems, while lambugem and lambujem are pure orthographic variants), enhancing the manner in which lexical entries are other country-specific word pairs exist that have distinct represented in the Portal. pronunciations. Albeit the fact that Portuguese orthography is overtly 6. Acknowledgements phonological, and that a unified orthography has been in place since 2008, phonetic differences in each of the The authors gratefully acknowledge the support of the countries where Portuguese is spoken still surface at the Fundação para a Ciências e Tecnologia, and the cooperation orthographic level. An example is the generalized difference of Susan Fitt, whose development of the original English in the pronunciation of [+round, +front] vowels before a Unisyn Lexicon is the inspiration for this work. nasal consonant in proparoxytones (words with stress on the antepenultimate syllable). These are almost always produced 7. References as [-low] in Brazil and [+low] everywhere else. Given that diacritics are mandatory in Portuguese proparoxytones, this [1] Fitt, S., "Documentation and user guide to UNISYN creates an orthographic divide between the politically lexicon and post-lexical rules," Technical Report, defined regions where Portuguese is spoken, e.g. anônimo Centre for Speech Technology Research, University of (BP) vs. anónimo (EP). By using the information already Edinburgh, 2000. Online: http://www.cstr.ed.ac.uk/ contained in the Portal, we will avoid generating double projects/unisyn/, accessed on 10 October 2008. pronunciations for orthographic forms that, in fact, only exist [2] Wells, J. C., Accents of English. Cambridge: Cambridge in one system. We will also save a potentially huge amount University Press, 1982. of manual effort required to track them down, as this [3] “http://www.portaldalinguaportuguesa.org.” problem, alone, accounts for close to 1.2% of the lexicon. [4] Fitt, S. “Morphological approaches for an English pronunciation lexicon,” in Proc. of Eurospeech, 4.3. Loan words Aalborg, Denmark, 2001. [5] INIC and CLUL, Português Fundamental: Vocabulário It has already been suggested that the Portal's dictionary of e Gramática. Lisboa: Instituto Nacional de Investigação estrangerismos will be useful in excluding foreign loan Científica, Centro de Linguística da Universidade de words from the application of post-lexical rules. By mapping Lisboa, 1984. word borrowings in LUPo's master lexicon to their [6] Mateus, M. H. and d'Andrade, E. The Phonology of corresponding aportuguesamento in the estrangerismos Portuguese. Oxford: Oxford University Press, 2000. dictionary, we avoid the need to treat these forms as exceptions. Thus, the Portal will be instrumental in both the 46 Proceedings of the I Iberian SLTech 2009

Speech Recognition

47

Proceedings of the I Iberian SLTech 2009

A Baseline System for the Transcription of Catalan Broadcast Conversation

Henrik Schulz1, Jose´ A. R. Fonollosa1, and David Rybach2

1Department of Signal Theory and Communications Technical University of Catalunya (UPC), Barcelona, Spain [email protected], [email protected]

2Human Language Technology and Pattern Recognition RWTH Aachen University, Aachen, Germany [email protected]

Abstract

The paper describes aspects, methods and results of the development of an automatic transcription system for Catalan broadcast conversation by means of speech recognition. Emphasis is given to the Catalan language, acoustic and language modelling methods, and recognition. Results are discussed in the context of phenomena and challenges in spontaneous speech, in particular regarding phoneme duration and feature space reduction.

1. Introduction

The transcription of spontaneous speech still poses a challenge to state-of-the-art methods in automatic speech recognition. Spontaneous speech exhibits a significant increase in intra-speaker variation, in speaking style and in speaking rate during its course. It involves phenomena such as repetition, repair, hesitation, incompleteness and disfluencies. The increase in spontaneity compared to planned or read speech furthermore leads to a reduction in spectral or feature space, and in duration. The paper focuses on aspects of the development of a transcription system for Catalan broadcast conversations by means of automatic speech recognition, carried out in the framework of the TECNOPARLA project [1].

The subsequent sections address major aspects of the Catalan language, characteristics of the underlying broadcast conversational speech, as well as a description of the methods applied for feature extraction, acoustic and language modelling, and recognition. Results are discussed and put into context by examining phenomena of spontaneous speech, assessing feature distribution, duration and disfluencies of speech in broadcast conversation.

The ASR acoustic model (AM) training and decoding subsystem have been developed in the RWTH Open Source ASR framework [2].

2. Catalan Language

Catalan, mainly spoken in Catalonia - a north-eastern region of Spain - and Andorra, is a Romance language. As its geographic proximity suggests, Catalan shares several acoustic-phonetic features and lexical properties with its neighbouring Romance languages such as French, Italian, Occitan and Spanish. Nevertheless, there are fundamental differences to all of them. Substantial dialectal differences divide the language into an eastern and a western group on the basis of phonology as well as verb morphology. The eastern dialect includes Northern Catalan (French Catalonia), Central Catalan (the eastern part of Catalonia), Balearic, and Alguerès, limited to Alghero (Sardinia). The western dialect includes North-western Catalan and Valencian (south-west Catalonia). Catalan shares many common lexical properties with Occitan, French, and Italian which are not shared with Spanish or Portuguese. In comparison with Spanish, which has faint vowel reduction in unstressed positions, Catalan exhibits vowel reduction in various varieties - in particular with the presence or absence of the neutral vowel "schwa" /@/. More specifically, the appearance of a neutral vowel in reduced position in eastern Catalan is regarded as a fundamental distinction from western Catalan. Among the eastern dialects, Balearic allows the neutral vowel in stressed position, unlike Central Catalan and the western dialects [3]. The voiced labiodental fricative /v/ is confined to Balearic and northern Valencian, while in the remaining dialects the sound converges to the bilabial /B/ [4]. In Eastern Catalan, the nasals /m/ (bilabial), /n/ (alveolar), /J/ (palatal), and /N/ (velar) appear in final position. /m/, /n/, and /J/ also appear intervocalically. /N/ is only found word-internally preceding /k/ [5]. The voiced alveolar liquid /rr/ in word-final position appears to be pronounced only in Valencian. Furthermore, a word-final voiceless dental stop /t/ is omitted in the Eastern and Northern dialectal regions.

3. Broadcast Conversational Speech

The broadcast conversational speech data used during these studies originate from 29 hours of transcribed Catalan television debates (known as Àgora), 16% interfered with background music, 4% with overlapping speech and 3% originating from replayed telephony speech. The debates exhibit sporadic applause, rustle, laughing, or harrumphs from the participants. Segments containing background music, speaker overlap, and telephony speech have been excluded at this stage, and are the subject of separate studies. Short-term events of the same kind remained in the data, since a removal of affected words may fragment the recordings. Speakers also intermittently tend to use Spanish words in conversations due to their virtual bilinguality. Also, Spanish proper names remain as such. The gender distribution is 1/3 female and 2/3 male. The speaking style features 95% spontaneous speech, the remainder planned speech. Most speakers are not considered professional.

49 Proceedings of the I Iberian SLTech 2009

4. Acoustic Model Transcribed Data [h] 20 # Segments 21420 An initial Catalan acoustic model (AM) was derived from a # Speakers 275 Spanish AM that was developed during the project TC-STAR # Running Words 272k [6]. While carrying out the first alignment iteration, Catalan allophones that extend the original set of Spanish allophones borrow the appropriate models from the original AM instead of Table 1: Statistics on acoustic model training data following the approach of using monophone context indepen- AGORA` dent models to bootstrap context dependent models. The original feature space comprises 16 Mel frequency cep- Transcribed Data [h] 31 stral coefficiants (MFCC) extended by a voicedness feature, whereas the cepstral coefficiants are subject to mean and vari- # Segments 11190 ance normalisation. Vocal tract length normalization (VTLN) # Speakers 140 is applied to the filterbank. The temporal context is preserved # Running Words 280k by concatenating the features of 9 consecutive frames. Subse- quently a linear transformation reduces the dimensionality. Table 2: Statistics on acoustic model training data A training phase is carried out by several steps: Prior to SPEECON-S the AM estimation, a linear discriminative analysis estimates a feature space projection matrix (LDA). Furthermore, a new phonetic classification and regression tree (CART) is grown fol- 4 dialectual regions Eastern, Valencian, Balearic and North- lowed by Gaussian mixture estimation, that iteratively splits and Western Catalan in training and recognition. refines the Gaussian mixture models. The AM provides context dependent semi-tied continuous density HMM using a 6-state topology for each tri-phone. Their 5. Language Model and Vocabulary emission probabilities are modelled with Gaussian mixtures Language model and vocabulary for recognition are derived sharing one common diagonal covariance matrix. A CART ties from a textual corpus, composed of articles of the online edi- the HMM states to generalized triphone states. tion of ’El Periodico’, a weekly journal published in Catalan Based on the broadcast conversational training data, the and Spanish. It encompasses 10 subsets, each focused on a baseline AM has been estimated passing a number of iterations separate topic with a total size of 43.7 million words, 1.8 mil- of re-alignment and intermediate model estimation, whereas lion sentences respectively. The 4-gram backing-off language LDA and CART are re-estimated twice per iteration. model comprises about 10.1 M multi-grams and achieves min- VTLN Gaussian mixture classifier estimation during train- imal perplexity (PPL) with a linear discounting and modified ing employs solely normalised MFCC. Kneser-Ney smoothing methodology. The estimation of lan- The iterative training procedure has been enhanced by us- guage models is carried out with the SRI LM toolkit [9]. The ing Maximimum Likelihood Linear Regression (MLLR) [7] lexicon contains the 50k most frequent words of the ’El Period- adapted AM during the first Viterbi alignment of acoustic train- ico’ corpus. As for AM training, each word received multiple ing data within an iteration. phonetic transcriptions. In addition to the speaker independent AM, Speaker Adap- tive Training (SAT) [8] has been employed, aiming to model 6. Recognition and Results less speaker specific variation in the (SAT) AM. 
It compensates the loss of speaker specificity of the SAT AM through speaker The recognition follows a multi-pass approach, depicted in Fig- specific feature space transforms using CMLLR [7]. The trans- ure 1, i.e. a first pass using the speaker independent AM, fol- forms are estimated using a compact AM, i.e. a single Gaussian lowed by segmentation and clustering of segments, a second AM, with minimal speaker discriminance. The SAT formalism and third pass, both applying the SAT based AM. Whereas the relies on the concept of acoustic adaptation and is as such ap- corresponding feature space transforms for a speaker cluster are plied estimating the feature transforms of corresponding speak- again estimated using CMLLR. The third pass receives a model ers in recognition. parameter adaptation by means of MLLR [10]. Both last passes In summary, AM estimation has been carried out for 2 derive their adaptation transform estimates from unsupervised types: a speaker independent AM and a SAT-AM. transcriptions of their previous recognition pass. Besides the training data of broadcast conversation (AGORA)´ - statistics outlined in Table 1, 2 additional rich context speech corpora were evaluated selectively for training: Dev-Set Test-Set a read speech corpus (FREESPEECH) and spontaneous utter- Duration [h] 0:45 1:15 ances of the SpeeCon corpus (SPEECON-S), see Table 2. The # Speakers 10 17 FREESPEECH corpus in its entirety displayed a degradation of # Running Words 8120 14916 accuracy, and therefore is not further described. µ [s] speaker duration 227 265 Comparing the ratio of number of running words and to- σ [s] speaker duration 95 142 tal duration in Table 1 and 2 indicate significant differences in OOV [%] 4.2 3.5 speed, although the speaking style for both is considered spon- PPL 223.7 199.6 taneous. The phoneme set contains 39 phonemes + 6 auxiliary units for silence, stationary noise, filled pauses and hesitations, as Table 3: Statistics on development and test set for recognition well as speaker and intermittant noise. Pronunciations were modelled with the UPC rule based phonetizer considering the The overall recognition results in Table 4 and 5 - µ denotes

50 Proceedings of the I Iberian SLTech 2009

Figure 1: Multi-pass system architecture for recognition.
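The control flow summarised in Figure 1 can be sketched as follows. This is an illustration only: all function names below are placeholder stubs, not the RWTH toolkit API; the point is that each pass derives its adaptation transforms from the unsupervised output of the previous pass.

    # Control-flow sketch; the functions below are trivial placeholders.

    def decode(segment, am, lm, feature_transform=None):
        return f"hyp({segment}|{am},{feature_transform},{lm})"   # placeholder hypothesis

    def cluster_by_speaker(segments):
        return [segments]                                         # single dummy cluster

    def estimate_cmllr(cluster, hyps, am):
        return "cmllr-transform"                                  # placeholder transform

    def adapt_mllr(am, cluster, hyps):
        return am + "+mllr"                                       # placeholder adapted AM

    def multi_pass_recognition(segments, si_am, sat_am, lm):
        # Pass 1: speaker-independent acoustic model.
        hyp1 = {s: decode(s, si_am, lm) for s in segments}
        # Segmentation/clustering groups segments per (assumed) speaker.
        clusters = cluster_by_speaker(segments)
        hyp2, hyp3 = {}, {}
        for cluster in clusters:
            # Pass 2: SAT AM + CMLLR feature transforms estimated from the
            # unsupervised pass-1 transcriptions of the cluster.
            t2 = estimate_cmllr(cluster, {s: hyp1[s] for s in cluster}, sat_am)
            for s in cluster:
                hyp2[s] = decode(s, sat_am, lm, feature_transform=t2)
        for cluster in clusters:
            # Pass 3: re-estimate CMLLR and additionally adapt the model
            # parameters with MLLR, both from the pass-2 transcriptions.
            t3 = estimate_cmllr(cluster, {s: hyp2[s] for s in cluster}, sat_am)
            am3 = adapt_mllr(sat_am, cluster, {s: hyp2[s] for s in cluster})
            for s in cluster:
                hyp3[s] = decode(s, am3, lm, feature_transform=t3)
        return hyp3

    print(multi_pass_recognition(["seg1", "seg2"], "SI-AM", "SAT-AM", "4-gram LM"))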

racy. Moreover, considering the observed mean and standard Dev-Set Test-Set deviation for speaker durations in Table 3, the estimated trans- WER % µ µs σs µ µs σs formations for speaker adaptation may be less reliable and lead 1. Pass 38.1 37.6 9.8 34.2 33.1 7.6 to non-favourable speaker adaptation. 2. Pass 35.9 35.2 9.7 30.8 29.3 7.4 3. Pass 35.1 34.9 9.5 30.2 28.9 7.3 7. Discussion In broadcast conversation, speech exhibits various speaking Table 4: Recognition results in multi-pass system architec- styles with a continuous and frequent change. These can be ` ture using AGORA Corpus qualified as planned, extemporaneous or highly spontaneous. Putting the results into context, three major phenomena were Dev-Set Test-Set assessed: duration reduction, feature distribution reduction, as WER % µ µs σs µ µs σs well as ratios of filled pauses, mispronunciations and word frag- 1. Pass 34.2 32.2 9.4 28.2 27.5 6.6 ments. 2. Pass 33.9 32.0 9.4 26.1 25.5 6.3 In order to qualify the exposed speaking style for the con- 3. Pass 33.4 31.5 9.1 25.8 25.2 6.2 versational broadcast transcription task, duration and feature space were examined, and compared to those of read speech. The latter was retained from the Catalan FREESPEECH Table 5: Recognition results in multi-pass system architec- database comprising read-aloud sentences. As auxialiary exper- ture using AGORA` and SPEECON-S Corpus iments indicated, the accuracy obtained for this task was above 95% WER. Duration reduction for both vowels and consonants is a known phenomenon in spontaneous speech [11]. Phoneme the word-error-rate (WER) across the two sets, µs, σs the mean durations have been obtained from pruned forced alignments. and standard deviation of WER across speakers - are fairly high Figure 2 depicts the duration of phonemes regarding read at first glance, but need to be reviewed considering three ma- speech (FREESPEECH) and spontaneous broadcast conversa- jor aspects: the phenomena of broadcast conversational speech, tion (AGORA).´ Speech in conversational broadcast exhibits a the amount of available adequate acoustic and language model significantly lower mean duration for all phonemes and an in- training data, and the composition of training and testing data. creased standard deviation compared to read speech. The in- The development set, although biased due to parameter creased standard deviation suggests a significant higher vari- optimization, poses a larger challenge than the test set. Fur- ability of the exposed speech in broadcast conversation but also thermore, the higher standard deviation across the individual an alteration of its style. speaker error rates in the development set suggests speakers of particular challenge. A larger perplexity (PPL) and out-of- vocabulary rate (OOV), as indicated in Table 3 may additionally 200 account for the differences. Although Table 3 exhibts a gener- AGORA ally high PPL, the distribution of segment PPL (not displayed) 180 FREESPEECH shows a positive skewness indicating a few high perplexity out- 160 liers. A breakdown of these exceptions particularly highlights words at segment boundaries and repetitions as contributors. It 140 emphasises the limitation of the current language model with 120 respect to phenomena of spontaneous speech as it is estimated solely on news paper articles. Moreover, words with unknown 100 Duration [ms] context account for exceptional high PPL. A reduction of OOV 80 by using more textual data will diminish this effect. 
60 Although the SPEECON-S data differ in the level of spon- taneity (data collection environment) from those of AGORA,` 40 i j l t f r k v s y z J a e o u b d g p n L Z rr E B S w D N O G m @ the extension of the acoustic training data provides an improve- uw Nasals Liquids Vowels Plosives ment of relative 17%. Fricatives Comparing the results of the speaker independent recogni- Semi-Vowels tion of the 1. pass with those using the SAT AM in 2nd and Figure 2: Mean phoneme durations of broadcast conversational 3rd pass in Table 5, there are larger improvements. As both, and read speech. the 2nd and 3rd pass use speaker adaptation based on previ- ously obtained unsupervised transcriptions, potential improve- The standard deviations indicate a blurred transition be- ments tend to be lower due to the overall lower level of accu- tween the two. This fact and the noticeable high variation in

51 Proceedings of the I Iberian SLTech 2009 phoneme duration of broadcast conversation suggests a method- data are desireable to estimate models and transforms more reli- ological change in modelling durations. The HMM topology ably. As the language model corpus is derived from textual writ- as mentioned above, also referred to as One-Skip HMM, re- ten language, the phenomena addressed above have not been ceives a global set of transition probabilities. Noticing variation modelled. OOV and PPL still exhibit a lack of appropriate in in speaking style, these parameters should be instantaneously domain data for both LM and vocabulary. adaptable, specific to phoneme or allophone respectively. The results are considered as baseline and encourage for A feature distribution analysis compares feature distribu- further efforts towards approaches to tackle the problem of tions of each phoneme given spontaneous broadcast conver- acoustic and linguistic data sparseness, discriminativeness of sational and read speech. The phoneme specific feature dis- features particular of spontaneous speech. tributions have been estimated based on labeled feature vec- tors containing 16 Mel-frequency cepstral coefficiants (MFCC), 9. References whereas the labels originate from the pruned forced alignments. The ratio of phoneme feature distributions has been defined ac- [1] H. Schulz, M. R. Costa-Juss`a, and J. A. R. Fonol- losa, “TECNOPARLA - Speech Technologies for Catalan cording to [12] as ||µp(C)−µ(C)||/||µp(R)−µ(R)||, whereas and its Application to Speech-to-Speech Translation,” in µp denotes the center of distribution of phoneme p given broad- cast conversational speech (C), and read speech (R) respec- Procesamiento del Lenguaje Natural, 2008. tively. µ(.) is the average of the phoneme specific means. The [2] N.N., “The RWTH Aachen University Speech phoneme feature distribution ratios shown in Figure 3 indicate Recognition System,” http://www-i6.informatik.rwth- significant differences of MFCC feature distributions for all aachen.de/rwth-asr, Nov. 2008. [Online]. Available: phonemes in broadcast conversation compared to read speech, http://www-i6.informatik.rwth-aachen.de/rwth-asr/ in most cases depicting a large reduction. As suggested in [12], [3] D. Herrick, “An acoustic analysis of phonological vowel the reduction in feature distribution ratio correlates with a loss reduction in six varieties of catalan,” Ph.D. dissertation, in accuracy. University of California, Santa Cruz, Sep. 2003. [4] Max W. Wheeler, The Phonology of Catalan. Oxford, UK: Oxford University Press, 2005. 1.4 1.3 [5] Recasens D. , “Place cues for nasal consonants with spe- 1.2 cial reference to Catalan,” Journal of the Acoustic Society of America, vol. 73, pp. 1346–1353, 1983. 1.1 1 [6] J. L¨o¨of, C. Gollan, S. Hahn, G. Heigold, B. Hoffmeis- 0.9 ter, C. Plahl, D. Rybach, R. Schl¨uter, and H. Ney, “The RWTH 2007 TC-STAR Evaluation System for European 0.8 English and Spanish,” in Interspeech, Antwerp, Belgium, 0.7 Reduction Ratio Aug. 2007, pp. 2145–2148. 0.6 [7] M. Gales, “Maximum likelihood linear transformations 0.5 for HMM-based speech recognition,” Computer Speech 0.4 i j l t f r

k v s y z J and Language, vol. 12, no. 2, pp. 75–98, 1998. a e o u b d g p n L Z rr E B S w D N O G m @ uw Nasals Liquids Vowels [8] Anastasakos, T. and McDonough, J. and Schwartz, R. and Plosives Fricatives Makhoul, J. , “A Compact Model for Speaker-Adaptive Semi-Vowels Figure 3: Phoneme feature distribution ratios between broadcast Training,” Proc. ICSLP, pp. 1137–1140, 1996. conversational and read speech. [9] A. Stolcke, “SRILM-an Extensible Language Modeling Toolkit,” in Seventh International Conference on Spoken At last, the fraction of filled pauses, word fragments and Language Processing. ISCA, 2002. mispronunciations for broadcast conversational speech and read [10] C. Leggetter and P. Woodland, “Maximum likelihood lin- speech was determined from their corresponding transcriptions. ear regression for speaker adaptation of HMMs,” Com- Linguistically, the broadcast conversations possess frequent puter Speech and Language, vol. 9, pp. 171–186, 1995. repetition and repairs. Mispronunciations and incompleteness [11] R. J. J. H. van Son and J. P. H. van Santen, “Strong Inter- encompass 3.6% of the transcribed spoken events, filled pauses action Between Factors Influencing Consonant Duration,” 6.5% - both emphasising the spontaneity of the language. On in EUROSPEECH 1997, 1997, pp. 319–322. the other hand, read speech exhibits linguistically neither repe- tition nor repair. The proportion of mispronunciations and in- [12] S. Furui, M. Nakamura, T. Ichiba, and K. Iwano, “Why is completeness is below 0.3%, the one of filled pauses 0.8%. Dif- the recognition of spontaneous speech so hard?” in Text, ferences in these ratios emphasises the assessment above. Speech and Dialogue, ser. Lecture Notes in Artificial In- telligence. Springer, 2005, pp. 9–22. 8. Conclusion Catalan, as a regional language poses the issue of availability of large amounts of appropriate data. Recent evaluations in broadcast conversational respectively spontaneous speech op- erate with an amount of AM training data with a factor 4 to 20. Given the high variability in feature space of spontaneous broadcast conversations, larger amounts of acoustic training

52 Proceedings of the I Iberian SLTech 2009

A Fast Discriminative Training Algorithm for Minimum Classification Error

B. Silva, H. Mendes, C. Lopes, A. Veiga and F. Perdigão

1 Department of Electrical and Computer Engineering, FCTUC, University of Coimbra Instituto de Telecomunicações, Polo II, University of Coimbra, Portugal {markexilva, maizena.mendes}@gmail.com, {calopes, aveiga, fp}@co.it.pt

Abstract

In this paper a new algorithm is proposed for fast discriminative training of hidden Markov models (HMMs) based on minimum classification error (MCE). The algorithm is able to train acoustic models in a few iterations, thus overcoming the slow training speed typical of discriminative training methods based on gradient descent. The algorithm tries to cancel the gradient of the objective function in every iteration. Re-estimation expressions for the HMM parameters are derived. Experiments with triphone and word models show that the proposed algorithm always achieves much better results in a single iteration than MCE, MMI or MPE do over several iterations.

Index Terms: speech recognition, discriminative training, hidden Markov models.

1. Introduction

The conventional HMM training method is based on Maximum Likelihood (ML). However, it is well known that discriminative training methods outperform ML. Different discriminative training criteria have been successfully tested, namely maximum mutual information (MMI) [1], minimum phone error (MPE) [2], and minimum classification error (MCE) [3,4]. The MCE criterion is especially attractive because it minimizes a function that is directly related to the performance of the recognizer, the classification error rate, using N-best hypotheses. Recently a method using the extended Baum-Welch (EBW) algorithm was proposed [5], but it works only for the 1-best hypothesis. Discriminative training approaches use iterative optimization algorithms to estimate the model parameters, so convergence speed plays an important role in training. The conventional optimization method used in MCE is based on a gradient descent (GD) technique called Generalized Probabilistic Descent [4]. This method is easy to implement and gives effective results, but training is slow and learning rates are difficult to set. These limitations have led to a need for training algorithms that are faster than GD-based algorithms.

Another problem with the conventional objective function used in MCE is that the sigmoidal loss function saturates when gross errors occur, so gradient methods cannot subsequently improve on these errors. In this paper we present a new discriminative training objective function which solves this problem.

A new fast discriminative training algorithm for HMMs based on MCE is also introduced. This algorithm is an extension to continuous speech recognition with HMMs of the method proposed in [6] for multiple-category classification problems. We have called it fast minimum error training (FMET). In order to compare the performance of the proposed training algorithm we have implemented MCE using the new objective function and one of the fastest GD-based algorithms, resilient backpropagation (Rprop) [7]. We also compare the performance with the maximum mutual information (MMI) and minimum phone error (MPE) algorithms.

The rest of the paper is organized as follows: in Section 2 the new objective function is defined and FMET is derived. Experimental results comparing FMET, Rprop, MMI and MPE are presented in Section 3. Finally, Section 4 presents some conclusions and guidelines for future work.

2. Fast Discriminative Training for HMMs

In this section a new objective function is introduced for discriminative training with MCE, and the HMM parameter re-estimation formulas are derived for the proposed training algorithm, FMET.

2.1. The MCE objective function

The objective function used in MCE is defined as

J = \sum_{n=1}^{N_u} l(d^{(n)})    (1)

where N_u is the number of training utterances and l(d^{(n)}) is a smooth loss function that emulates the zero-one recognition error count. Typically a sigmoid is used:

l(d^{(n)}) = \frac{1}{1 + e^{-\lambda d^{(n)}}}    (2)

In these expressions d^{(n)} is the misclassification measure between the score of the labelled HMM sequence W_{lab}^{(n)} = \{w_{n,1}^{lab}, \ldots, w_{n,T}^{lab}\} and a generalized mean (softmax) over the scores of the N_{Best} competing HMM sequences W_k^{(n)} for the n-th utterance:

d^{(n)} = L \cdot \frac{1}{\alpha} \log\left[ \frac{1}{N_{Best}} \sum_{k=1,\, k \neq lab}^{N_{Best}} e^{\alpha g(W_k^{(n)})} \right] - g(W_{lab}^{(n)})    (3)

g(W) is a discriminant function computed as the log likelihood of the sequence of acoustic observations

X^{(n)} = X_{1:T}^{(n)} = \{x_1^{(n)}, \ldots, x_T^{(n)}\}    (4)

given the best state alignment of the HMM sequence W, which is computed with the Viterbi algorithm.

In (3) we introduce a scalar L, multiplying the softmax term, which controls the relative importance of the true sequence W_{lab}^{(n)} and the competing sequences W_k^{(n)}, k = 1..N_{Best}. This scalar is the key point of the algorithm: setting L = 0 implies ML training and L = 1 corresponds to classical MCE. It can even be greater than 1, as long as the stochastic restrictions of the HMM parameters remain verified, as explained below.
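As a rough numerical illustration of Eqs. (1)-(3), the sketch below computes the misclassification measure and the sigmoid loss from precomputed Viterbi log-likelihood scores. It assumes the labelled and N-best competitor scores are already available from a decoder; the function names and the use of NumPy/SciPy are choices made for this example, not part of the paper.

```python
# Minimal sketch of the MCE quantities in Eqs. (1)-(3), assuming the Viterbi
# log-likelihood scores g(W) of the labelled and N-best competing sequences
# are already available for each utterance.  Names are illustrative only.
import numpy as np
from scipy.special import logsumexp

def misclassification_measure(g_lab, g_competitors, alpha=0.001, L=1.0):
    """Eq. (3): softmax (generalized mean) of competitor scores minus the
    labelled-sequence score, weighted by the scalar L."""
    g_competitors = np.asarray(g_competitors, dtype=float)
    n_best = len(g_competitors)
    softmax_score = (logsumexp(alpha * g_competitors) - np.log(n_best)) / alpha
    return L * softmax_score - g_lab

def sigmoid_loss(d, lam=0.05):
    """Eq. (2): smooth zero-one error count for one utterance."""
    return 1.0 / (1.0 + np.exp(-lam * d))

def mce_objective(scores, alpha=0.001, lam=0.05, L=1.0):
    """Eq. (1): sum of per-utterance losses.  `scores` is a list of
    (g_lab, [g_1, ..., g_NBest]) pairs, one per training utterance."""
    return sum(sigmoid_loss(misclassification_measure(g_lab, g_comp, alpha, L), lam)
               for g_lab, g_comp in scores)
```

With L = 0 the competitor term vanishes and the measure depends only on the labelled score, while larger L gives more weight to the competing hypotheses, mirroring the role of the scalar described in the text.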

2.2. The method

The proposed method aims to cancel the gradient of the objective function J in every iteration. This is a necessary condition in order to minimize J. Setting the gradient \nabla J = 0 leads to a set of re-estimation expressions for the HMM parameters. However, there are restrictions that these parameters must obey: the transition probabilities and mixture weights must be positive and add up to 1, and the covariance matrices must be non-negative definite. These conditions can easily be enforced by an appropriate choice of the scalar L. It turns out that this scalar plays the same role as the learning rate in GD methods. GD methods benefit from a different learning rate per parameter, and almost the same applies here: instead of a single global L, we found it advantageous to have an L-scalar for each HMM state and for each transition from that state. This results in a slight modification of the decoding score of the competing sequences W_k^{(n)} given by the Viterbi algorithm, which becomes

g(W_k^{(n)}) = \sum_{t=1}^{T} \left[ L_{q_t}^{(w_{n,t}^k)} \log b_{q_t}^{(w_{n,t}^k)}(x_t^{(n)}) + L_{q_{t-1}}^{(w_{n,t}^k)} \log a_{q_{t-1},q_t}^{(w_{n,t}^k)} \right]    (5)

In this equation, Q = \{q_1, q_2, \ldots, q_T\} is the best state sequence given by Viterbi alignment over all models in utterance n; a_{ij}^{(h)} is a transition probability and b_j^{(h)}(x) is the pdf of state j belonging to HMM h = w_{n,t}, found at each frame t for each utterance n and competing sequence k. The two types of L-scalars are L_j^{(h)} for each state j and L_i^{(h)} for the transitions a_{ij}^{(h)} from each state i. These weightings will help to ensure the statistical constraints of the HMM parameters. As indicated in [6], we cannot guarantee convergence at each iteration; however, it has been shown experimentally that this method produces much better results than the GD algorithm.

Although the sigmoid loss function is suitable for error counting, its gradient approaches zero for an utterance with a large value of d^{(n)}, meaning that the utterance is misclassified. This is not suitable for approaches which optimize (1) through differential methods, because such utterances make an insignificant contribution to the gradient. In order to overcome this limitation we propose the following loss function:

l(d^{(n)}) = \log\left(1 + e^{\lambda d^{(n)}}\right)    (6)

This function approaches zero when d^{(n)} is negative (utterance n is correctly recognized) and approaches \lambda d^{(n)} when d^{(n)} is positive (utterance n is incorrectly recognized). This overcomes the sigmoid limitation, especially in the first steps of the algorithm, where a large d^{(n)} does not necessarily mean an outlier due, for example, to a mislabelled utterance.

2.3. Estimation formulas

In this section the re-estimation equations are derived for the HMM parameters, as well as the limits for the L-scalars. It is assumed that the state pdf is a Gaussian mixture,

b_j^{(h)}(x) = \sum_{m=1}^{M_j} c_{jm} \, b_{jm}(x)    (7)

where M_j is the number of mixture components of state j within HMM h, c_{jm} is the weight of the m-th mixture component and b_{jm}(x) is a Gaussian pdf

b_{jm}(x) = |2\pi\Sigma_{jm}|^{-1/2} \exp\left( -\tfrac{1}{2}(x-\mu_{jm})^T \Sigma_{jm}^{-1}(x-\mu_{jm}) \right)    (8)

where \mu_{jm} and \Sigma_{jm} are respectively the mean vector and the covariance matrix of the m-th mixture component of state j. It is also assumed that the covariance matrices are diagonal. The gradient of the objective function is

\nabla J = \sum_{n=1}^{N_u} \psi^{(n)} \cdot \nabla d^{(n)}    (9)

where

\psi^{(n)} = \frac{\lambda e^{\lambda d^{(n)}}}{1 + e^{\lambda d^{(n)}}}    (10)

Differentiating d^{(n)} with respect to a parameter \theta of an HMM, and assuming without loss of generality that k = 0 corresponds to the labelled (lab) utterance, results in

\frac{\partial d^{(n)}}{\partial \theta} = \sum_{k=0}^{N_{Best}} \xi_k^{(n)} \cdot \frac{\partial g(W_k^{(n)})}{\partial \theta}    (11)

where

\xi_k^{(n)} = \begin{cases} -1, & k = 0 \\ \zeta_k^{(n)}, & k = 1, \ldots, N_{Best} \end{cases}    (12)

and

\zeta_k^{(n)} = \frac{e^{\alpha g(W_k^{(n)})}}{\sum_{r=1}^{N_{Best}} e^{\alpha g(W_r^{(n)})}}    (13)

These last parameters weight the importance of the competing sequence k in the solution and add up to 1. In order to simplify the analysis, the following quantities are introduced:

\Omega_{htjm}^{(n)} = \psi^{(n)} \sum_{k=1}^{N_{Best}} \delta(w_{n,t}^{(k)}, h)\, \delta(q_t, j)\, \xi_k^{(n)}\, \beta_{jm}(x_t^{(n)})    (14)

\bar{\Omega}_{htjm}^{(n)} = \psi^{(n)}\, \delta(w_{n,t}^{(lab)}, h)\, \delta(q_t, j)\, \beta_{jm}(x_t^{(n)})    (15)

\Theta_{htij}^{(n)} = \psi^{(n)} \sum_{k=1}^{N_{Best}} \delta(w_{n,t}^{(k)}, h)\, \delta(q_{t-1}, i)\, \delta(q_t, j)\, \xi_k^{(n)}    (16)

\bar{\Theta}_{htij}^{(n)} = \psi^{(n)}\, \delta(w_{n,t}^{(lab)}, h)\, \delta(q_{t-1}, i)\, \delta(q_t, j)    (17)

where

\beta_{jm}(x) = \frac{c_{jm} b_{jm}(x)}{b_j(x)}    (18)

\delta(m, n) is the Kronecker delta function. Expression (18) can be interpreted as the weight of the m-th component in the overall mixture.
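The weights \psi^{(n)} of Eq. (10) and \zeta_k^{(n)} of Eq. (13) can be evaluated in a numerically stable way as in the sketch below; the names are illustrative and the routine assumes the decoder scores are given.

```python
# Sketch of the per-utterance loss weight of Eq. (10) and the competitor
# weights of Eq. (13).  Names are illustrative, not from the paper.
import numpy as np
from scipy.special import logsumexp, expit

def loss_weight(d, lam=0.05):
    """Eq. (10): derivative of the loss in Eq. (6) with respect to d,
    psi = lam * e^(lam*d) / (1 + e^(lam*d))."""
    return lam * expit(lam * d)

def competitor_weights(g_competitors, alpha=0.001):
    """Eq. (13): softmax over the competing-sequence scores; sums to one."""
    g = alpha * np.asarray(g_competitors, dtype=float)
    return np.exp(g - logsumexp(g))
```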


Resuming the differentiation that began in (9), in order to obtain all the HMM parameters that make the gradient vanish, we obtain the following estimation expressions for each vector component l:

\mu_{jml}^{(h)} = \frac{\sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \bar{\Omega}_{htjm}^{(n)} x_{t,l}^{(n)} \;-\; L_j^{(h)} \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \Omega_{htjm}^{(n)} x_{t,l}^{(n)}}{\sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \bar{\Omega}_{htjm}^{(n)} \;-\; L_j^{(h)} \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \Omega_{htjm}^{(n)}}    (19)

\left(\sigma_{jml}^{(h)}\right)^2 = \frac{\sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \bar{\Omega}_{htjm}^{(n)} \left(x_{t,l}^{(n)} - \mu_{jml}^{(h)}\right)^2 \;-\; L_j^{(h)} \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \Omega_{htjm}^{(n)} \left(x_{t,l}^{(n)} - \mu_{jml}^{(h)}\right)^2}{\sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \bar{\Omega}_{htjm}^{(n)} \;-\; L_j^{(h)} \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \Omega_{htjm}^{(n)}}    (20)

To obtain estimation expressions for the transition probabilities or the Gaussian mixture weights, we need to ensure that the following stochastic restrictions are verified:

\sum_{j=1}^{N_s^{(h)}} a_{ij}^{(h)} = 1 ; \qquad \sum_{m=1}^{M_j^{(h)}} c_{jm}^{(h)} = 1    (21)

where N_s^{(h)} is the number of states of HMM h. Using Lagrange multipliers, the solutions are

a_{ij}^{(h)} = \frac{p_{ij}^{(h)}}{\sum_{j=1}^{N_s^{(h)}} p_{ij}^{(h)}} ; \qquad c_{jm}^{(h)} = \frac{d_{jm}^{(h)}}{\sum_{m=1}^{M_j^{(h)}} d_{jm}^{(h)}}    (22)

where

p_{ij}^{(h)} = \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \bar{\Theta}_{htij}^{(n)} - L_i^{(h)} \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \Theta_{htij}^{(n)}    (23)

and

d_{jm}^{(h)} = \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \bar{\Omega}_{htjm}^{(n)} - L_j^{(h)} \sum_{n=1}^{N_u}\sum_{t=1}^{T^{(n)}} \Omega_{htjm}^{(n)}    (24)

It should be noted that some other constraints also need to be verified:

\left(\sigma_{jml}^{(h)}\right)^2 > 0 ; \quad a_{ij}^{(h)} \geq 0 ; \quad c_{jm}^{(h)} \geq 0, \qquad \forall\, h, j, i, m, l    (25)

Therefore L_j^{(h)} and L_i^{(h)} need to fulfil the following restrictions:

L_i^{(h)} = \eta \cdot \min_j \left\{ \frac{\sum_{n}\sum_{t} \bar{\Theta}_{htij}^{(n)}}{\sum_{n}\sum_{t} \Theta_{htij}^{(n)}} \right\}    (26)

and

L_j^{(h)} = \eta \cdot \min_{m,l} \left\{ \frac{\sum_{n}\sum_{t} \bar{\Omega}_{htjm}^{(n)} \left(x_{t,l}^{(n)} - \mu_{jml}^{(h)}\right)^2}{\sum_{n}\sum_{t} \Omega_{htjm}^{(n)} \left(x_{t,l}^{(n)} - \mu_{jml}^{(h)}\right)^2}, \; \frac{\bar{\Omega}_{hjm}}{\Omega_{hjm}} \right\}    (27)

with the constraint that 0 \leq \eta < 1, and where \Omega_{hjm} denotes the corresponding accumulator summed over n and t. Note that when \eta = 0, FMET reduces to ML training.

2.4. Training Procedure

It should be noted that this algorithm is intended for batch-mode operation. The main steps of the FMET implementation are the following:

1. Initialize all HMMs with ML (or take \eta = 0).
2. Accumulate (14), (15), (16) and (17) over all training utterances.
3. Determine L_i^{(h)} and L_j^{(h)} for all HMMs and states, according to (26) and (27), respectively.
4. Update all HMM parameters by computing (19), (20) and (22), using (23) and (24).
5. Save the new HMMs and evaluate the performance using the updated HMMs.
6. Return to step 2 until the required number of iterations is reached.

If the performance does not improve in one iteration, it will normally improve in the subsequent ones. Also, in all experiments the first FMET result was better than the best GD result over several iterations.
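A minimal skeleton of the batch procedure above might look as follows. The accumulation, L-scalar and update routines are passed in as callables because the paper specifies them only through Eqs. (14)-(27); everything named here is an assumption of this sketch, and step 1 (ML initialization) is assumed to have been done before the function is called.

```python
# High-level skeleton of batch FMET (steps 1-6 above).  The heavy lifting is
# delegated to the callables passed in, which stand for Eqs. (14)-(27); the
# names are placeholders for this sketch, not functions defined in the paper.
def train_fmet(hmms, train_set, dev_set, decode_nbest, accumulate_statistics,
               compute_l_scalars, update_parameters, evaluate_accuracy,
               num_iterations=4, eta=0.7):
    history = []
    for _ in range(num_iterations):
        stats = None
        for utt in train_set:                        # Step 2: accumulate over all utterances
            nbest = decode_nbest(hmms, utt)          # competing sequences W_k
            stats = accumulate_statistics(stats, hmms, utt, nbest)   # Eqs. (14)-(17)
        l_scalars = compute_l_scalars(stats, eta)    # Step 3: Eqs. (26)-(27)
        update_parameters(hmms, stats, l_scalars)    # Step 4: Eqs. (19)-(24)
        history.append(evaluate_accuracy(hmms, dev_set))  # Step 5
    return hmms, history                             # Step 6: loop ends
```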

3. Experiments and Results

The experiments were carried out using a Portuguese speech command database [8]. The training set consisted of 103001 utterances and the test set of 27382 different utterances of 254 commands.

Acoustic models were built for monophones, triphones and words using HTK 3.4 [9]. The input features were 12 MFCCs plus energy, and their first and second order time derivatives, computed at a rate of 10 ms within a window of 25 ms. Evaluation was done by means of the accuracy rate.

To describe the Portuguese language, 38 monophones were used plus a silence model. Each class was modelled by a three-state left-to-right HMM, except the silence model, where transitions to previous states were allowed. Each state was modelled with a mixture of 16 Gaussians. The set of triphones is composed of 955 HMMs (found in the 254 commands), also with 16 Gaussians. The 255 whole-word models correspond to the 254 commands plus silence; the number of emitting states ranged from 3 to 39, modelled with 10 Gaussians. The test was carried out using a task grammar with all 254 word-commands in parallel, preceded by and ending with the silence model.

One of the initial conditions of the method is the choice of the \lambda parameter in (10) and \alpha in (13). Using typical Viterbi decoding scores we found \lambda = 0.05 and \alpha = 0.001 to be a good trade-off between these two parameters. In FMET a value of \eta between 0.6 and 0.8 was used. In MCE with Rprop the update value of each HMM parameter was initialized to 0.01 times the parameter value, and the increasing and decreasing factors were \eta^{+} = 1.2 and \eta^{-} = 0.5, respectively. For MMI and MPE the i-smoothing and learning rate factors were set as suggested in [9].

Table 1 presents the results obtained after training the HMMs with only one iteration of FMET. The results with Rprop were obtained with 10 iterations for monophones and 2 iterations for triphones and words. The results with MMI and MPE are the best obtained after 4 iterations. In fact, the results with MPE for triphones decreased at the 2nd iteration and then increased again at further iterations.

As can be seen, a single iteration of the FMET method outperforms Rprop, MMI and MPE with triphone and whole-word models.

Table 1 – Comparison of FMET, Rprop, MMI and MPE performances.

Method          Monophones   Triphones   Word
ML (before)     90.87%       97.48%      96.82%
MCE/Rprop       91.53%       97.55%      96.92%
MMI             91.95%       97.53%      96.92%
MPE             91.97%       97.55%      -
FMET (1st it.)  91.86%       97.66%      97.28%

With monophone models, the MMI and MPE methods outperform FMET's first-iteration result, but only at the 4th iteration. This is shown in Figure 1, where the evaluation performances with monophone models over the first 4 iterations are presented.

[Figure 1: Performance comparison with monophone models — accuracy (%) of FMET, Rprop, MMI and MPE over iterations 0 to 4.]

4. Conclusions

In this paper a fast training algorithm based on MCE was introduced. This algorithm attempts to minimize the objective function in a single step. A new objective function was also proposed. Although the convergence of this method cannot be guaranteed, it has been shown experimentally that it produces much better results than the Rprop, MMI or MPE approaches. Moreover, it not only achieves better results faster, but also achieves results that the other approaches cannot reach with several iterations. The presented results, although preliminary, allow us to extend the conclusions derived in [6] to MCE-based HMM parameter estimation. As future work we intend to apply the method to other well-known speech databases and larger tasks.

5. References

[1] Y. Normandin et al., "High performance connected digit recognition using maximum mutual information estimation," IEEE Trans. Speech Audio Processing, vol. 2, pp. 229–311, April 1994.
[2] D. Povey and P. C. Woodland, "Minimum phone error and i-smoothing for improved discriminative training," Proc. ICASSP-02, Orlando, FL, May 2002, pp. 105–108.
[3] B.-H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 257–265, May 1997.
[4] W. Chou, "Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition," Proc. IEEE, vol. 88, no. 8, pp. 1201–1222, Aug. 2000.
[5] H. Xiaodong, L. Deng, W. Chou, "A Novel Learning Method for Hidden Markov Models in Speech and Audio Processing," IEEE 8th Workshop on Multimedia Signal Processing, Oct. 2006.
[6] Q. Li and B.-H. Juang, "Study of a Fast Discriminative Training Algorithm for Pattern Recognition," IEEE Trans. Neural Networks, vol. 17, no. 5, pp. 1212–1221, Sep. 2006.
[7] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," Proc. ICNN, San Francisco, CA, 1993, pp. 586–591.
[8] J. Lopes, C. Neves, A. Veiga, A. Maciel, C. Lopes, F. Perdigão, L. Sá, "Development of a Speech Recognizer with the Tecnovoz Database," Propor 2008, International Conference on Computational Processing of Portuguese, Sept. 2008.
[9] Young, S. et al., "The HTK Book. Revised for HTK version 3.4," Cambridge University Engineering Department, Cambridge, December 2006.


Global Discriminative Training of a Hybrid Speech Recognizer

Carla Lopes 1,2,3 , Fernando Perdigão 1,2

1 Instituto de Telecomunicações, 2 Department of Electrical and Computer Engineering, University of Coimbra, Portugal, 3 Instituto Politécnico de Leiria-ESTG, Portugal {calopes, fp}@co.it.pt

Abstract

Hybrid speech recognizers usually involve a frame-based classification followed by a segment alignment system, trained separately. The simplicity of such systems is counterbalanced by the lack of a global optimisation scheme for the whole system. In this paper we propose a discriminative training method for MLP/HMM hybrids based on the optimization of a global cost function at the phone recognition level. The MLP weights, usually updated according to the target values, are now updated according to the misclassifications present in the output of the system. Results are presented for the TIMIT phone recognition task and show that this method compares favourably with recently published results on this task. The global discriminative training method was also applied to a Portuguese speech database, leading to promising results.

Index Terms: discriminative training, hybrid speech recognizers, phone recognition.

1. Introduction

Hybrid speech recognizers have been used with considerable success in several applications [1-7]. The hybrid framework, in which discriminative classifiers (such as Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and Conditional Random Fields (CRFs)) are combined with generative models (Hidden Markov Models, HMMs), has allowed significant gains in performance with respect to standard HMMs in several situations. Discriminative training approaches, which are also applied to HMM systems, aim to minimize the training error. Different training criteria have been successfully tested: Maximum Mutual Information (MMI) [8],[9], Minimum Classification Error (MCE) [10], Minimum Phone/Word Error (MPE/MWE) [11], and methods based on the Principle of Large Margin (PLM) [12].

However, training such a hybrid system is not straightforward, which is why classification and alignment usually undergo separate training steps. Most hybrid systems are prone to inferior performance due to the lack of a global optimisation scheme for the whole system. One of the persisting challenges is, therefore, to design an integrated discriminative training method to train hybrid systems as a whole. Bengio et al. [2] have already focused on this goal, proposing a hybrid system where eight ANN outputs (classifying plosives) were used as inputs for an HMM system whose states were modelled by Gaussian Mixture Models (GMMs). Droppo and Acero [8] proposed a general discriminative training method applied to both the front-end feature extractor and the back-end acoustic model of an automatic speech recognition system. Wu and Huo [6] propose an MCE training approach for the joint design of a feature compensation module (SVM) and the HMM parameters of a speech recognizer. In [13], Riis proposes a hybrid ANN/HMM, called Hidden Neural Networks, in which all parameters are estimated simultaneously according to the discriminative conditional maximum likelihood (CML) criterion. The approach proposed in this paper is somewhat related to the state-corrective CML (SCCML) method described by Johansen [7], used in the computation of a free-grammar gradient, here extended to the training of a hybrid recognizer.

In this paper we propose a discriminative training method applied to a hybrid ANN/HMM phone recognizer. The ANN consists of a Multi-Layer Perceptron (MLP) network whose frame-based outputs represent a posteriori probabilities of phone occurrences and are used as state occupancy probabilities in HMMs. A global backpropagation learning scheme is defined, considering a strict integration between the HMM and the ANN. The error minimization is based on the gradient descent algorithm and results in a maximization of the phone accuracy rate rather than likelihood maximization, as usual. Phone models are trained to maximize their accuracy rate whilst also maximizing the distance between the correct phone and its rivals. The main goal is to improve phone accuracy in the aligned output string, instead of in the Multi-Layer Perceptron output, as is usually done. The method uses the difference between the reference and the best acoustic likelihood of the observation sequences to update the MLP weights.

2. Global Discriminative Training Method

A global discriminative training method (GDTM) for training the parameters of a hybrid MLP/HMM as a whole is proposed. The MLP is a natural structure for discriminative training; however, the network weights are usually updated according to the target values presented at the output layer rather than according to the best sequence of HMM states. To overcome this problem, we propose a training method based on a cost function that minimizes the classification error of the global hybrid system, operating at the recognition level. The free parameters of the system are updated according to the misclassifications between the labelled sequence and the reference. Figure 1 illustrates the proposed method.

[Figure 1: Schematic diagram of the proposed global discriminative training method — feature vectors enter a single-layer MLP whose outputs feed an HMM decoder; the decoded (labelled) phone sequence is compared with the reference, and the error evaluation drives the updates ∂E/∂y_j and ∂E/∂a^{(m)} of the MLP and HMM parameters.]


The goal is to compute the gradient of the cost function with respect to the MLP outputs and to backpropagate this gradient through the entire structure, all the way back to the first MLP layer. The output alignment (Viterbi trellis) plays a major role in the training process, since the gradient of the cost function with respect to the MLP outputs is computed based on this alignment. In order to have the best alignment sequence to compare with the reference alignment, a Viterbi decoder was fully incorporated into our training scheme.

2.1. Cost Function

The formulation and estimation of a correctly specified cost function is central to the discriminative training procedure. The cost function should focus on multiple decoding alternatives, for instance using an N-best list, and consider all kinds of errors: substitutions, insertions and deletions. In a first approach, we used only the contribution of the best hypothesis provided by the Viterbi decoder. In this case the Levenshtein distance aligns two label sequences: one is the correct sequence, W_{lab}, and the other is the best decoding hypothesis given by the recognizer, W_{rec}. Using the Viterbi algorithm, we define an error function as

d(W_{rec}, W_{lab}) = g(W_{rec}) - g(W_{lab})    (1)

where g(W_{lab}) and g(W_{rec}) represent the reference and the best acoustic likelihood of the observation sequence. This difference is never negative, and is only zero if the two transcriptions are exactly the same (labels and time alignments coinciding).

If N_{BD} is the total number of training utterances, the global cost is then given by

E = \sum_{n=1}^{N_{BD}} d(W_{rec}^{(n)}, W_{lab}^{(n)}) = \sum_{n=1}^{N_{BD}} e^{(n)}    (2)

If W = w_1^{N_W} = \{w_1, w_2, \ldots, w_{N_W}\} is the sequence of phones in an utterance, the total log-likelihood (assuming a bigram model) is given by

g(W) = \sum_{k=1}^{N_W} \left[ \log P\!\left(X_{t_{k-1}}^{t_k} \mid w_k\right) + \log P(w_k \mid w_{k-1}) \right]    (3)

where

\log P\!\left(X_{t_{k-1}}^{t_k} \mid w_k\right) = \sum_{t=t_{k-1}}^{t_k} \left[ \log a_{s_{t-1}, s_t} + \log b_{s_t}(x_t) \right] + \log a_{s_{t_k}, s_{t_k}+1}

is the cost of traversing the HMM of phone w_k with observations from t = t_{k-1} to t = t_k. The function b_s(x) is the likelihood of observing x in HMM state s. The last term in the previous equation corresponds to the exit probability of the w_k HMM.

In the hybrid system the MLP output predictions are interpreted as the a posteriori probability of the j-th phone/state, P(s_j \mid x), given the feature observation vector x. The likelihood ratio P(x \mid s_j)/P(x), used in the HMM framework, is replaced by the posterior probabilities using Bayes' rule:

\frac{P(x \mid s_j)}{P(x)} = \frac{P(s_j \mid x)}{P(s_j)}    (4)

The a priori phone probabilities P(s_j) are estimated off-line from the training data.
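As a small illustration of Eq. (4), the sketch below converts MLP posteriors into scaled log-likelihoods by dividing by phone priors estimated from the training alignments; the names and the flooring constant are assumptions of this example.

```python
# Minimal sketch of Eq. (4): scaled likelihoods from MLP posteriors and
# phone priors estimated on the training alignments.  Names are illustrative.
import numpy as np

def phone_priors(alignment_frames, num_phones):
    """Relative frequency of each phone/state in the training alignments."""
    counts = np.bincount(alignment_frames, minlength=num_phones).astype(float)
    return counts / counts.sum()

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """log P(x|s_j) - log P(x) = log P(s_j|x) - log P(s_j), per frame."""
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))
```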

2.2. Gradient with respect to the outputs of the MLP

We use the gradient descent method to update the network weights. In this case the error gradient for an MLP output y_j is

\frac{\partial E}{\partial y_j} = \sum_{n=1}^{N_{BD}} \frac{\partial e^{(n)}}{\partial y_j}, \qquad
\frac{\partial e^{(n)}}{\partial y_j} = \sum_{t=1}^{T_n} \delta\!\left(j, s_t^{(rec)}\right) \frac{1}{y_j} - \sum_{t=1}^{T_n} \delta\!\left(j, s_t^{(lab)}\right) \frac{1}{y_j}    (5)

where s_t^{(rec)} and s_t^{(lab)} are the state/phone observed at frame t in the Viterbi and reference alignments, respectively, with T_n observations, and \delta(i, j) is the Kronecker delta. It is interesting to note that whenever there is a misalignment (different s_t^{(rec)} and s_t^{(lab)}), two outputs will always contribute to the gradient, in opposite directions. The output which agrees with the reference contributes a negative value and the wrong output a positive value, telling the network to increase and decrease the corresponding values, respectively, according to the gradient descent algorithm. Figure 2 illustrates the procedure considering the recognition of only four phones. If an error occurs (misclassification or misalignment), an indication is given to two outputs of the MLP.

[Figure 2: Example of the gradient for each MLP output, in the presence of misclassifications or misalignments, for a labelled (LAB) and a recognized (REC) phone sequence over four phone classes.]

Another interesting point is that the cost function based on cross-entropy, in the usual MLP training with targets, also has gradients inversely proportional to the outputs. However, there a gradient term is computed for every frame, using the difference between the outputs and the targets, which contrasts with the present global cost function, which "blames" the MLP only when a misalignment occurs between the Viterbi and reference phone strings.

After computing the gradient of the cost function with respect to the MLP outputs we simply back-propagate the gradients through the entire structure, all the way back to the first layer of the MLP. We used the resilient backpropagation algorithm to accelerate the convergence to a solution.
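A minimal sketch of the gradient in Eq. (5) for one utterance is given below, assuming the Viterbi and reference alignments are available as per-frame phone indices; only misaligned frames contribute, with opposite signs for the recognized and reference outputs. Names are illustrative.

```python
# Sketch of Eq. (5): for each frame where the Viterbi alignment disagrees
# with the reference, the recognized output receives +1/y and the reference
# output receives -1/y.  Variable names are illustrative.
import numpy as np

def cost_gradient_wrt_outputs(posteriors, rec_states, lab_states, floor=1e-10):
    """posteriors: (T, num_phones) MLP outputs for one utterance;
    rec_states / lab_states: per-frame phone indices of the two alignments."""
    grad = np.zeros_like(posteriors)
    for t, (rec, lab) in enumerate(zip(rec_states, lab_states)):
        if rec != lab:  # aligned frames cancel out and contribute nothing
            grad[t, rec] += 1.0 / max(posteriors[t, rec], floor)
            grad[t, lab] -= 1.0 / max(posteriors[t, lab], floor)
    return grad
```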


2.3. Gradient with respect to the HMM parameters

The proposed hybrid MLP/HMM phone recognizer uses a hidden Markov model to temporally align the speech signal, but instead of using a priori state-dependent observation probabilities defined by a Gaussian mixture, it uses the a posteriori probabilities estimated by the MLP, keeping the overall HMM topology unchanged. In the hybrid system the output predictions of the MLP are interpreted as the a posteriori phone probabilities P(ph_i \mid x), with ph_i representing the i-th phone/state and x the feature observation vector. In this way the only updatable HMM parameters are the state transitions a_{ij}, which can be updated according to the following equation:

\frac{\partial E}{\partial a_{ij}^{(m)}} = \sum_{n=1}^{N_{BD}} \frac{\partial e^{(n)}}{\partial a_{ij}^{(m)}}    (6)

\frac{\partial e^{(n)}}{\partial a_{ij}^{(m)}} = \sum_{k=1}^{N_W^{(rec)}} \delta(w_k, m) \sum_{t=1}^{T_n} \delta\!\left(s_{t-1}^{(rec)}, i\right) \delta\!\left(s_t^{(rec)}, j\right) \frac{1}{a_{ij}^{(m)}}
\;-\; \sum_{k=1}^{N_W^{(lab)}} \delta(w_k, m) \sum_{t=1}^{T_n} \delta\!\left(s_{t-1}^{(lab)}, i\right) \delta\!\left(s_t^{(lab)}, j\right) \frac{1}{a_{ij}^{(m)}}
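The corresponding accumulation for Eq. (6) can be sketched as follows for a single phone model; the per-frame state paths are assumed to be restricted to that model, and the names are illustrative.

```python
# Sketch of Eq. (6): the gradient of a transition probability is the
# difference between its usage counts in the recognized and reference state
# paths, divided by the transition value.  Names are illustrative.
import numpy as np

def transition_gradient(num_states, rec_path, lab_path, transitions, floor=1e-10):
    """rec_path / lab_path: per-frame state indices within one phone model;
    transitions: (num_states, num_states) matrix for that model."""
    counts = np.zeros((num_states, num_states))
    for prev, cur in zip(rec_path[:-1], rec_path[1:]):
        counts[prev, cur] += 1.0          # recognized path: positive contribution
    for prev, cur in zip(lab_path[:-1], lab_path[1:]):
        counts[prev, cur] -= 1.0          # reference path: negative contribution
    return counts / np.maximum(transitions, floor)
```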
3. Experimental Results

Phone recognition experiments were carried out using two different sets of speech material: English speech data from the TIMIT database [16] and European Portuguese speech data from the TECNOVOZ database [19].

Speech is analyzed every 10 ms with a 25 ms Hamming window. Thirty-nine parameters were used as standard input features of the MLPs, representing 12 Mel Frequency Cepstral Coefficients (MFCCs) plus energy, and their first and second derivatives. The context window used is 170 ms, but only 9 frame features were used, one every other frame; the unused frame features are used in the next window analysis. The current frame is in the centre of the context window, so temporal information about both past and future is included [20].

The softmax function was used as the activation function of the output layer, so that the output values can be interpreted as posterior probabilities. The hidden layer uses a sigmoid activation function. All the network weights and biases are adjusted using batch training with the resilient backpropagation (RP) algorithm [13] so as to minimize the cross-entropy error between the network output and the target values.

The hidden Markov models used in the hybrid system were built for each phone (English and Portuguese separately) using HTK 3.4 [15], in order to estimate the transition probabilities between states. Each phone was modelled by a three-state left-to-right HMM and each state was modelled by a single Gaussian. In the hybrid MLP/HMM system the a priori state likelihoods are replaced by the posterior probabilities given by the output predictions of the MLP; each of the three states shares the same MLP output. We used HTK with some changes in order to replace the usual Gaussian mixture models with the outputs of the MLP. The performance was evaluated by means of Correctness (Corr) and Accuracy (Acc) using the HTK evaluation tool HResults.

3.1. TIMIT phone recognition task

When using the TIMIT database, two single-hidden-layer MLP networks with 1000 nodes were trained for phone frame classification. In one, the last layer performs a 1-to-39 classification over the set of phones, while in the other the last layer performs a 1-to-61 classification. Both training and testing were carried out using the TIMIT database [16]. In Baseline61 the original 61-phone set was used, while in Baseline39 training was done with the 39 phones proposed by Lee and Hon [17]. The training set consisted of all si and sx sentences of the original training set (3698 utterances) and the test set consisted of all si and sx sentences from the complete 168-speaker test set (1344 utterances). The targets derive from the phone boundaries provided by the TIMIT database. For evaluation purposes we collapsed the 61 TIMIT labels into the standard 39 phones [17].

Table 1 shows the baseline results. Both systems achieved similar results: Baseline39 reached a Correctness rate of 72.79% and an Accuracy rate of 69.52%, while for Baseline61 the corresponding rates are 72.46% and 69.60%. In order to evaluate the training capabilities of the proposed discriminative training method, and also to achieve rapid convergence to a solution, the discriminative training method is implemented starting from prior, separately trained MLP and HMM systems, in a similar way to that reported in [2]. We will refer to the hybrid systems trained with the global discriminative training method as GDTM-MLP39/HMM and GDTM-MLP61/HMM. Results are presented in Table 1.

The results for GDTM indicate improvements both in Correctness and Accuracy. When using 39 phones, Correctness rises to 73.94%, while when using 61 phones the improvement was 1.37% (1.9% relative). With regard to Accuracy the improvements are less expressive (about 1% relative) in both situations.

Table 1. TIMIT phone recognition results.

System            Corr (%)  Acc (%)  Rel. improv. Corr  Rel. improv. Acc
Baseline39        72.79     69.52    -                  -
Baseline61        72.46     69.60    -                  -
GDTM-MLP39/HMM    73.94     70.30    1.6                1.1
GDTM-MLP61/HMM    73.83     70.27    1.9                1.0

3.1.1. Comparison with other works

The results are not comparable with those published in [18] and [5], because the authors of those works evaluated their systems by means of phone classification and not phone recognition, as we have done. But the results compare favourably with the findings presented by the ASAT (Automatic Speech Attribute Transcription) group in [3] and by Morris and Fosler-Lussier in [4]. These works have in common with the present work only the fact that they present results under the same conditions (same speech material and same recognition rates). The ASAT group [3] uses confidence scores of phonetic attribute classes, coming from an MLP, an HMM and an SVM, in a CRF for phone recognition. They report a Corr rate of 73.39% and an Acc rate of 69.52%. This value is similar to our baseline results and below our GDTM-MLP/HMM results. Morris and Fosler-Lussier [4] use phonological features provided by an ANN, together with 61 class posteriors provided by another ANN, also as input of a CRF. We have not yet reached their 71.49% Acc rate.


3.2. TECNOVOZ phone recognition task

TECNOVOZ is a European Portuguese speech database [19] collected in 2007. The collected speech includes commands and phonetically rich read sentences. The sentence utterances were divided into 20364 for the training set and 2262 for the testing set. To describe the Portuguese language, 37 phones were used, including a silence model and a short pause.

In the hybrid MLP/HMM system a single-hidden-layer MLP network with 1000 nodes was trained for frame-based phone classification. The last layer performs a 1-to-37 classification over the set of phones. The targets were obtained by forced alignment using the triphone model set described in [19].

Table 2 presents the baseline results: Correctness reached 49.78% and Accuracy 45.59%. These results should be considered preliminary, because the number of training iterations of the MLP was reduced and the targets used were not entirely verified; in fact, the triphone set used for forced alignment does not include all the triphones needed for this task/corpus. Thus, the targets can be refined for the 37-phone recognition task so as to achieve better results.

Starting from this network and applying the proposed GDTM, Correctness rises to 55.43% (5.65% above) and Accuracy to 49.33% (3.74% above), representing 11.4% and 8.2% relative improvement, respectively.

Besides the preliminary nature of the results, the same improvement trend observed on the TIMIT task was also verified on the TECNOVOZ task, which indicates that the global training is useful.

Table 2. TECNOVOZ phone recognition results.

System            Corr (%)  Acc (%)  Rel. improv. Corr  Rel. improv. Acc
Baseline37        49.78     45.59    -                  -
GDTM-MLP37/HMM    55.43     49.33    11.4               8.2

4. Conclusions

This paper describes a global discriminative training method (GDTM) applied to a hybrid MLP/HMM phone recognizer. The proposed method optimizes the network parameters as a function of the whole system. The MLP weights, which are usually updated according to the target values presented at the output layer, are now updated according to the misclassifications present in the output of the hybrid system. These misclassifications are computed by comparing the output of the best Viterbi alignment with the reference alignment provided in the database. The gradients of the alignment errors are back-propagated through the entire structure, all the way back to the first MLP layer. This results in a minimization of the classification error of the global hybrid system, and also maximizes the phone accuracy.

GDTM was tested using two databases: English TIMIT and Portuguese TECNOVOZ. In both tasks relative improvements in correctness and accuracy were achieved vis-à-vis the corresponding baselines.

5. Acknowledgements

Carla Lopes would like to thank the Portuguese foundation Fundação para a Ciência e a Tecnologia for the PhD Grant (SFRH/BD/27966/2006).

6. References

[1] E. Trentin, M. Gori, "A survey of hybrid ANN/HMM models for automatic speech recognition," Neurocomputing, vol. 37, pp. 91-126, March 2001.
[2] Bengio, Y., Mori, R., Flammia, G. and Kompe, R., "Global Optimization of a Neural Network-Hidden Markov Model Hybrid," IEEE Transactions on Neural Networks, vol. 3, no. 2, March 1992, pp. 252-259.
[3] Bromberg, I., et al., "Detection-based ASR in the automatic speech attribute transcription project," in Proc. of Interspeech 2007, pp. 1829-1832, August 2007.
[4] Morris, J. and Fosler-Lussier, E., "Conditional Random Fields for Integrating Local Discriminative Classifiers," IEEE Transactions on Acoustics, Speech, and Language Processing, 16:3, pp. 617-628, March 2008.
[5] Scanlon, P., Ellis, D. and Reilly, R., "Using Broad Phonetic Group Experts for Improved Speech Recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 15(3), pp. 803-812, March 2007.
[6] Wu, J., Huo, Q., "An Environment-Compensated Minimum Classification Error Training Approach Based on Stochastic Vector Mapping," IEEE Transactions on Audio, Speech & Language Processing, 14(6): 2147-2155, 2006.
[7] Johansen, F.T., "Global discriminative modelling for automatic speech recognition," Ph.D. thesis, The Norwegian University of Science and Technology, Trondheim, Norway, 1996.
[8] Droppo, J., Acero, A., "Joint Discriminative Front End and Back End Training for Improved Speech Recognition Accuracy," in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing, Toulouse, May 2006.
[9] Woodland, P.C. and Povey, D., "Large scale discriminative training of hidden Markov models for speech recognition," Computer Speech and Language, 16:25-47, 2002.
[10] Chou, W., Lee, C.-H., Juang, B.-H. and Soong, F.-K., "A minimum speech error rate pattern recognition approach to speech recognition," Int. J. Pattern Recognition and Artificial Intelligence, Special Issue on Speech Recognition for Different Languages, vol. 8, no. 1, pp. 5-31, 1994.
[11] Povey, D. and Woodland, P., "Minimum phone error and i-smoothing for improved discriminative training," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Orlando, FL, May 2002, pp. 105-108.
[12] Yu, D. and Deng, L., "Large-Margin Discriminative Training of Hidden Markov Models for Speech Recognition," in Proc. of the International Conference on Semantic Computing, 17-19 Sept. 2007, pp. 429-438.
[13] Riis, S., Krogh, A., "Hidden Neural Networks: A Framework for HMM/NN Hybrids," in Proc. of ICASSP 97, April 1997.
[14] Riedmiller, M. and Braun, H., "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proc. ICNN, San Francisco, CA, 1993, pp. 586-591.
[15] Young, S. et al., The HTK Book. Revised for HTK version 3.4, Cambridge University Engineering Department, Cambridge, December 2006.
[16] Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., and Dahlgren, N., DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM, NIST, 1990.
[17] Lee, K. and Hon, H., "Speaker-independent phone recognition using hidden Markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37(11), November 1989, pp. 1642-1648.
[18] Gunawardana, A., Mahajan, M., Acero, A., and Platt, J., "Hidden conditional random fields for phone classification," in Proc. Interspeech, 2005, pp. 1117-1120.
[19] Lopes, J., Neves, C., Veiga, A., Maciel, A., Lopes, C., Perdigão, F., Sá, L., "Development of a Speech Recognizer with the Tecnovoz Database," Propor 2008, International Conference on Computational Processing of Portuguese, Aveiro, Portugal.
[20] Lopes, C., Perdigão, F., "A Hierarchical Broad-Class Classification to Enhance Phone Recognition," 17th European Signal Processing Conference (EUSIPCO-2009), Glasgow, Scotland, August 2009.


Towards Microphone Selection Based on Room Impulse Response Energy-Related Measures

Martin Wolf, Climent Nadeu

TALP Research Center, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona, Spain
{mwolf, climent}@gps.tsc.upc.edu

Abstract

In a room where several distant microphones are capturing signals in parallel, the quality of the recorded speech signals strongly depends on the characteristics of the room impulse responses that describe the wave propagation between each source and each microphone. In this paper we present an initial attempt to investigate the possibility of selecting the microphone that offers the best quality of speech. As we want to apply it to an automatic speech recognition system, we aim to select the microphone according to some optimization criterion that has been inferred from the recognition rate in a prior learning process. Several energy-related measures that carry relevant information about the room impulse response are considered. They should be estimated directly from the speech signal, possibly in real time, but avoiding the need to estimate the whole room impulse response. In this paper, we present the rationale behind the intended investigation, and offer preliminary experiments for a large vocabulary continuous speech recognition task which show how microphone selection using an ideal relative energy measure can largely improve the recognition rate.

Index Terms: microphone selection, reverberation, room impulse response, ASR

1. Introduction

Speech recognition in a room using distant microphones is a challenging task, mainly due to background noise and reverberation. The acoustic signal is reflected from the walls and objects and arrives at the microphone attenuated and with different delays. In reverberant environments, degraded copies of the original signal sum up in the receiver, introducing interference even after the original sound disappears. This is usually modelled by convolution of the room impulse response (RIR) with the original speech signal,

y(t) = x(t) * h(t)    (1)

where y(t), x(t) and h(t) are the recorded signal, the original speech signal and the room impulse response, respectively.

Many techniques have been developed and successfully tested to cope with noises with short temporal effects (telephone channel effect, additive noises), but few have so far reduced the long-lasting effect of room reverberation. In conventional Automatic Speech Recognition (ASR) systems, short-time spectra are used to derive the features for recognition, and the effect of reverberation may be observed as temporal smearing of the short-term spectra. The size of the analysis window is very short (tens of milliseconds) in comparison to the usual length of an RIR, so techniques developed for the reduction of linear distortion in the short-term spectra usually fail.

In this work, we focus on room scenarios where multiple microphones capture the signal in parallel. The quality of the recorded speech differs between them. We cope with the disturbing effects of the room by choosing the best signal available (in terms of speech recognition rate) at a given time and under given conditions. The decision should be made before feature extraction takes place, without any need for feedback from the ASR system. The method is suitable for scenarios where microphone array processing is not possible or desired, since no assumptions about the positions of the microphones are made. In order to increase spatial diversity, the microphones should be distributed around the room rather than concentrated in one place.

In [1], a space-diversity speech recognition technique using distributed multiple microphones in a room was investigated as a new approach to speech recognition. The authors propose a microphone selection method based on maximum likelihood. Several distant speech models are trained for the room. In the first pass, speech is independently recognized for each microphone, and the model giving maximum likelihood is selected. In the second pass, the microphone with maximum likelihood is chosen. In this way the most reliable model and acoustic channel are selected. The disadvantage of this approach is that several acoustic models need to be trained under different conditions and evaluated in parallel. In our approach we assume only one speech model, because microphone selection is made before recognition and is based solely on measures extracted from the speech signal.

In a previous study [2], the relation between different parts of the RIR and the Word Error Rate (WER) was investigated. The results show that there are certain components of the RIR that harm recognition more than others. If we identify and measure these components, it should be possible to say how the signal in each microphone is affected by the conditions in the room. The least harmed signal will presumably lead to a lower WER.

Estimation of the RIR is a costly and difficult process, especially if the position of the speaker or the conditions in the room change over time. Therefore, for the purposes of ASR it is desirable to avoid methods that require exact RIR measurement.

In the remaining part of the paper we describe the methodology, show preliminary results and outline future directions. The work is preliminary.

2. Experimental setup

There are two basic questions: what are the parameters of the RIR that should be taken into account when making the decision, and how can they be measured or estimated? To answer this we define a set of experiments where close-talk microphone recordings, without the influence of room reverberation, are artificially convolved with RIRs measured in the UPC smart room [3].
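For illustration, the convolution in Eq. (1) can be applied to a close-talk recording with a measured RIR as in the sketch below; this is a rough example under our own assumptions, not the authors' simulation code.

```python
# Minimal sketch of Eq. (1): simulating a distant-microphone recording by
# convolving a close-talk signal with a measured room impulse response.
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir):
    """y(t) = x(t) * h(t); the result is trimmed to the original length."""
    y = fftconvolve(clean, rir, mode="full")[: len(clean)]
    # keep the original signal scale to avoid clipping when saving to file
    return y * (np.max(np.abs(clean)) / (np.max(np.abs(y)) + 1e-12))
```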


The RIR measurements were made using a sweep excitation signal with logarithmically increasing frequency. The signal was reproduced from a loudspeaker held on the chest of a person. Seven different positions in the room and four directions of reproduction (orientations of the speaker) were defined, emulating the scenario where a person is giving a talk and moves around the room. The setup may be seen in Figure 1. In the experiment we used 6 microphones placed on the walls 2.4 m above the ground. Seven positions, 6 microphones and 4 directions give a total of 168 RIRs that were used in the experiment.

[Figure 1: UPC smart room – experimental arrangement, showing the 6 wall microphones, the table, and the 7 speaker positions with their 4 orientations.]

2.1. ASR system and databases

The experiments were made with the RWTH Aachen University speech recognition system [4], using the Catalan Speecon and FreeSpeech databases. The Speecon database is made of real-world speech signals recorded in room and outside environments using four microphones (one close-talk and three distant microphones). The Catalan FreeSpeech database was built for an automatic dictation system and consists of close-talk recordings of large vocabulary continuous speech.

For training, approximately 121 hours of recordings from both databases were selected. In the testing phase only a subset (approximately 1.5 hours) of the FreeSpeech database was used. Note that the acoustic models are trained in a multi-conditional way and were not trained specifically for the UPC smart room.

The speech signal was framed applying a 25 ms Hamming window with a frame shift of 10 ms. The basic speech feature vector consists of 16 Mel frequency cepstral coefficients (MFCC) extended by a voicedness feature [5]. Mean and variance normalization was applied to the cepstral coefficients, and fast vocal tract length normalization (VTLN) to the filter bank. Temporal context is preserved by concatenating the features of 9 consecutive frames. Prior to acoustic model training, linear discriminant analysis (LDA) was applied in order to reduce the dimensionality and increase the class separability. Acoustic modelling used hidden Markov models, and the emission probabilities were modelled with continuous Gaussian mixtures sharing one common diagonal covariance matrix.

3. Microphone selection based on energy-related measures of the room impulse response

As the speech recognition accuracy varies strongly across the various microphones in the room, our objective is to design a way to select the microphone that offers the highest average accuracy. For that purpose, we want to base the decision on measures or parameters associated with the RIR that indicate the degree of harm caused by the reverberation to the signal and, consequently, to the recognition performance. If we were able to compute those measures from the speech signals associated with the various microphones, we would be able to choose the best microphone before entering the recognition system.

To find candidate RIR measures that are useful for that purpose, we designed the process outlined in the block diagram of Figure 2. First we trained the acoustic models of a speech recognition system using general databases (the Catalan Speecon and FreeSpeech). Then, we used the trained system to recognize speech signals from FreeSpeech that were convolved with a set of RIRs measured in our UPC smart room.

Let us denote by WER_i the WER obtained for the i-th RIR. Note that the exact h(n) is known for each microphone. Now we can choose a particular measure M_j, compute its values M_{ji} from every RIR_i, and compare (correlate) those values M_{ji} with the corresponding values of WER_i. In this way, we can see the relation between each of the defined RIR measures M_j and the speech recognition rate, and choose the most relevant one(s). Then, such measure(s) can be used for selecting the best microphone before entering the recognition system.

Once relevant measures are identified, the question is how to estimate them in the real scenario where the RIR is not known in advance. This problem is still open.

3.1. Energy-based features

The RIR can be split into 3 parts: direct sound and early reflections, late reflections, and very late reflections. In [2], it was experimentally shown that early reflections do not harm speech recognition. On the other hand, the middle part (late reflections between approximately 70 ms and 2/3 of the reverberation time T60) is the harmful one.

We investigated relations between WER and different measures M_j based on RIR energy and experimentally identified several candidate features:

1. Energy of the whole RIR
2. Energy of the direct wave and early reflections (approx. 0-70 ms) normalized by the energy of the whole RIR
3. Energy of late reflections normalized by the energy of the whole RIR (M3)
4. Ratio between the energies of early and late reflections

Among them, the measure M3, calculated as the energy of the late reflections (between 50 ms and 190 ms) normalized by the energy of the whole RIR,

M_{3i} = \frac{\sum_{t=50\,\mathrm{ms}}^{190\,\mathrm{ms}} h_i^2(t)}{\sum_{t=0}^{T_i} h_i^2(t)}    (2)

showed the highest correlation index with the WER (equal to 0.78632). The exact interval of late reflections was identified empirically by a grid search over different combinations of starting and ending times with a step of 10 ms. The index i in Eq. (2) denotes the measure taken from RIR_i, and T_i is the duration of that RIR.
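A simple sketch of M3 (Eq. (2)) and of its correlation with the measured WERs is given below; the sampling rate, interval limits and function names are illustrative assumptions.

```python
# Sketch of the measure M3 in Eq. (2) (late-reflection energy between 50 ms
# and 190 ms, normalized by the total RIR energy) and of its correlation with
# the per-RIR word error rates.  Names and the sampling rate are illustrative.
import numpy as np

def late_energy_ratio(rir, fs, start_ms=50, end_ms=190):
    h2 = np.asarray(rir, dtype=float) ** 2
    start = int(fs * start_ms / 1000.0)
    end = int(fs * end_ms / 1000.0)
    return h2[start:end].sum() / h2.sum()

def correlate_with_wer(rirs, wers, fs):
    """Pearson correlation between M3 values and WERs over a set of RIRs."""
    m3 = np.array([late_energy_ratio(h, fs) for h in rirs])
    return np.corrcoef(m3, np.asarray(wers, dtype=float))[0, 1]
```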


[Figure 2: Block diagram of the evaluation of different RIR features. A multi-condition database is used to train the acoustic, language and feature-extraction models; the clean-speech testing set convolved with each RIR_i is recognized by the classifier (HMM/GMM) to obtain WER_i, which is compared with the measure M_{ji} extracted from RIR_i.]

This observation may be interpreted as follows: the lower the energy of the late reflections normalized by the global energy, the lower the WER. It means that the microphone where this quotient of energies is lowest will be chosen as the most suitable for recognition.
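Under this interpretation, microphone selection reduces to picking the channel with the smallest M3, as in the one-line sketch below (it reuses the late_energy_ratio helper from the previous sketch; the dictionary layout is an assumption of this example).

```python
# Sketch of the selection rule implied above: among the microphones available
# for a given position, pick the one whose RIR yields the smallest M3.
def select_microphone(rirs_per_mic, fs):
    """rirs_per_mic: dict mapping microphone id -> impulse response array."""
    return min(rirs_per_mic, key=lambda mic: late_energy_ratio(rirs_per_mic[mic], fs))
```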

4. Preliminary results and discussion

As a proof of concept, we made an experiment where we compared the recognition results when the microphone was selected prior to recognition, using only the measure described above (energy of late reflections normalized by global energy), with the case where the best result was selected from all microphones after recognition (reference).

The results are shown in Table 1. The first column denotes p – the position – and d – the direction – of the speaker in the room (Figure 1). All 6 microphones were included in this experiment, and for each position and orientation the most suitable one was chosen. The identifiers of the selected microphones are in columns 2 and 3: the "Reference" column contains the microphone that gave the lowest WER after recognition for the given position and direction, and the E_late column shows what our choice would be if we measured the energy of late reverberation normalized by global energy and made the microphone selection before any recognition takes place. The last two columns show the WERs corresponding to each selection.

The average WER over all microphones in the experiment was 21.37%; this corresponds to the case where microphones are selected randomly. Next, it may be observed that in 20 of the 28 cases (7 positions and 4 orientations), that is, in more than 71% of the cases, the same microphone was chosen. The average word error rate when the best microphone is selected after recognition was 14.7% (ideal case), while with prior selection the average result was only 1.1% worse. This indicates that even if the most appropriate microphone is not chosen, the chosen one is only slightly worse. We further see an improvement of 5.6% using our method compared to random selection (21.37%).

5. Conclusion

In this work we investigate the possibility of using energy-based measures of the RIR to make a microphone selection for improving the robustness of ASR. We defined a methodology and prepared a setup to search for relevant properties of the RIR that may be extracted from the speech signal in each microphone prior to recognition and that indicate the input that would presumably lead to an increased recognition rate.

So far we have identified and verified one measure, the energy of late reflections normalized by the energy of the whole RIR, and showed that, based only on this single criterion, it is possible to achieve results that are on average only 1.1% worse than the case where the microphone is selected by evaluating each input against the speech model separately.

There are several remaining problems to solve. Complementary measures are needed to further improve the selection process. Once more measures are available, they will probably need to be integrated in an efficient way by means of a cost function. Nevertheless, the most important remaining task is to find a method to extract those parameters online from the speech signal.


Table 1: Results – selection based on late reflections normalized by global energy.

Position   Selected mic (Reference)  Selected mic (E_late)  WER % (Reference)  WER % (E_late)
p1_d1      1                         1                      11                 11
p1_d2      1                         1                      13.1               13.1
p1_d3      2                         3                      16.6               16.9
p1_d4      6                         5                      14.4               19.3
p2_d1      1                         1                      12.5               12.5
p2_d2      5                         5                      15.3               15.3
p2_d3      3                         3                      13.5               13.5
p2_d4      5                         5                      12.6               12.6
p3_d1      2                         1                      15.6               16.8
p3_d2      5                         5                      12.4               12.4
p3_d3      3                         3                      12.2               12.2
p3_d4      4                         5                      16.3               22.1
p4_d1      6                         6                      18.8               18.8
p4_d2      3                         3                      10                 10
p4_d3      3                         3                      19.3               19.3
p4_d4      4                         5                      15.9               18.7
p5_d1      6                         6                      14.4               14.4
p5_d2      2                         2                      17.1               17.1
p5_d3      4                         4                      18.4               18.4
p5_d4      2                         5                      14.7               21.4
p6_d1      6                         5                      15.6               19.9
p6_d2      1                         1                      12.9               12.9
p6_d3      3                         4                      17.8               22.8
p6_d4      6                         6                      9.7                9.7
p7_d1      1                         1                      14.2               14.2
p7_d2      2                         2                      15.6               15.6
p7_d3      3                         3                      13.1               13.1
p7_d4      5                         5                      17.7               17.7
Average WER                                                 14.7               15.8

6. Acknowledgements

This work has been supported by the project SAPIRE (TEC2007-65470), funded by the Government of Spain, and by the project TECNOPARLA, funded by the Government of Catalonia. The authors would also like to thank Henrik Schulz for providing help with the acoustic models and the ASR system setup.

7. References

[1] Shimizu Y., Kajita S., Takeda K. and Itakura F., "Speech recognition based on space diversity using distributed multi-microphone," Proc. of ICASSP, 2000, pp. 1747-1750.
[2] Petrick R., Lohde K., Wolff M. and Hoffmann R., "The Harming Part of Room Acoustics in Automatic Speech Recognition," Proc. of INTERSPEECH, 2007, pp. 1094-1097.
[3] Neumann J., Casas J.R., Macho D. and Hidalgo J.R., "Integration of audiovisual sensors and technologies in a smart room," Personal Ubiquitous Comput., vol. 13, 2007, pp. 15-23.
[4] RWTH ASR – The RWTH Aachen University Speech Recognition System, Online: http://www-i6.informatik.rwth-aachen.de/rwth-asr/
[5] Zolnay A., Schlüter R. and Ney H., "Robust Speech Recognition Using a Voiced-Unvoiced Feature," International Conference on Spoken Language Processing, vol. 2, 2002, pp. 1065-1068.


Speech Synthesis



Towards an Objective Voice Preference Definition for the Portuguese Language

Luis Coelho 1, Horst-Udo Hain2, Oliver Jokisch2 and Daniela Braga3

1ESEIG, Instituto Politecnico do Porto, Porto, Portugal 2Laboratory of Acoustics and Speech Communication, TU Dresden, Germany 3Microsoft Language Development Center, Microsoft, Portugal [email protected], [horst-udo.hain, oliver.jokisch]@tu-dresden.de, [email protected]

Abstract

In this paper, it is our aim to define a set of objective acoustic criteria, based on subjective listeners' assessment of talent voices, which can help to automatically rate voice font quality, bearing in mind an objective definition of voice preference for the Portuguese language. For this purpose a multilingual and multi-speaker database was recorded and a set of subjective and objective information was obtained. The analysis of the data provided new results that can be successfully used to define the quality of a given voice. The results achieved for Portuguese were compared with those obtained for other languages with the objective of identifying common properties, which was statistically confirmed within a 90% confidence interval.

Index Terms: voice pleasantness, speech synthesis

1. Introduction

The possibility of creating an artificial voice that can imitate a human speaker is slowly becoming a reality. The developments of the last few years have made it possible for Text-to-Speech (TTS) systems to generate highly intelligible speech with an almost natural prosody. The industry started to take advantage of this ready-to-use technology, and several systems started to emerge from the laboratories into personal computers, cars and, more recently, several applications on mobile devices. However, with the fulfilment of the basic requirement, which is intelligibility, additional demands arise. For daily usage of such technology the system must be robust, to transmit confidence, and the quality of the voice font used must be sufficient to provide a pleasant experience during interaction. Companies could provide, for each language, several voice fonts from which the user could choose according to his preference, but usually this is not the case.

There are few known studies concerning voice quality assessment according to voice preference. This concept is often associated in the specialized literature with impairments or disorders and is mostly covered in medical publications. On this subject there is an extensive bibliography, but voice quality is there understood, in a positive way, as a voice with no associated pathologies. The evaluation of such voices is performed using specific metrics that measure the deviation with respect to a range of pre-defined values that define a healthy condition. Standard scales such as GRBAS (grade, roughness, breathiness, asteny and strain) or RASAT (roughness, asperity, breathiness, asteny and tension) [1] are used to help experienced professionals subjectively evaluate and classify the severity of a voice dysfunction. Characteristics such as hoarseness, raspiness, effort to talk, vocal fry, uncomfortable or abnormal pitch and other abnormal vocal symptoms are also commonly evaluated [2, 3, 4]. These parameters are not automatically extracted; a human subjective evaluation is required, whose judgement can be controversial. As reported by other authors, the subjective judgments of distinct professionals do not always present an expressive correlation [5, 6], and there are no guidelines or references for performing the evaluation.

In this paper we explore the voice quality concept along the dimension of voice preference, specifically for the Portuguese language in its European (EP) and Brazilian (BP) varieties. For this purpose we collected an extensive voice database with professional voices, essentially from the media industry. Among several rules, the voice selection was performed so as to eliminate factors other than the acoustics itself. It is known that accent and even dialect, new words, maybe even phraseology changes, such as between the UK and the United States, can lead to undesired bias in voice evaluation. The recordings were evaluated by groups of listeners according to voice preference. To find objective voice preference clues we also extracted acoustic parameters and correlated the obtained values with the subjective voice rankings. Additionally, in a cross-lingual study, we further extended the initial recordings to other languages and performed new evaluation surveys. Unlike the voice talents, the human evaluators were selected in order to create a heterogeneous group according to external parameters such as age and gender. These variations provided indirect analyses that enriched the results. The findings for EP and BP were compared with the ones obtained for the other languages, and statistical significance tests were performed.

The rest of the paper is organized as follows. In the next section we briefly describe the speaker selection and database recording processes. The criteria used for voice talent selection and the specific related issues are presented along with the recording structure used. In Section 3 we show how we proceeded to evaluate the voices, describing the process as well as the subjective and objective parameters used for this purpose. In Section 4 we present the main outcomes of the independent analysis and of the comparison between EP+BP and the other languages. Finally, in Section 5 the main conclusions are presented and envisioned work is outlined.

2. Speech Resources

Our initial studies are based on two voice talent selection processes for European and Brazilian Portuguese, with the aim of building a high quality voice font [7, 8] for a new TTS system. During the voice assessment process, which will be explained in the next section, the candidates were asked to record a small text in a professional recording studio, to guarantee identical acoustic quality, while following a common script containing a set of phonetically and prosodically rich sentences, with emotion indications.


During the voice assessment process, which will be explained in the next section, the candidates were asked to record a small text in a professional recording studio, to guarantee identical acoustical quality, while following a common script containing a set of phonetically and prosodically rich sentences, with emotion indications. The voice assessment process followed a well-defined pipeline with strict rules, organized in three stages. The first stage was a national call for voice talents, who had to fulfil a few profile requirements. Each candidate had to be female, have Portuguese as her mother tongue, have studied up to university level, speak with an accent matching the national standard, and have some radio or theatre vocal experience. Out of several hundred candidates, a small set was invited to send samples of their voices with the maximum quality they could produce. A subjective test was then conducted, using a 5-point MOS rating scale, with listeners who were familiar with speech processing technology. The best scored candidates were then invited to record a small text as described in section 2. The final recordings were evaluated again by a survey in which the listeners elected the best voice for each attribute. The final ranking was obtained by counting the number of votes each voice received during the survey (further details can be found in [7]). To extend the study base, similar procedures were conducted for Catalan (ES-CAT), Danish (DAN) and Finnish (FI), which allowed us to establish comparisons and improve the confidence in the results.

A set of recordings performed within a cooperation between Siemens AG, Munich, and TU Dresden for the creation of new voices for an embedded version of the multi-lingual TTS system "Papageno" was also used [9]. Amongst others, voices for German (GE), UK and US English (ENG-UK and ENG-US), French (FR) and Spanish (ES) have been recorded at the TU Dresden laboratories. As before, all the speakers were selected ensuring that a set of requirements was fulfilled. In general, the voice has to be intelligible, natural and pleasant. A special demand is that it must be suitable for all the processing steps that are involved in speech synthesis. The voice quality (F0, jitter) must be sufficient and allow for good results even after compression or codecs (e.g. adaptive multi-rate, AMR) are applied. The speaker needs to have phonetic and also prosodic abilities (preferably a professional or semi-professional speaker) and should have experience in speaking for a long time (about 4 hours per session) without any degradation of the voice quality (e.g. a teacher, actor or newsreader).

All the acoustic data was recorded in professional studios with a sampling rate of 44.1 kHz or higher (mono channel) and with 16-bit resolution. From all the recordings a set of randomly chosen sentences was selected in order to obtain around 5 minutes of speech per speaker. For each language at least 5 speakers were considered, and for EP and BP the sample consisted of 10 speakers each. The described recordings and related scripts, despite their distinct origin, were made using identical criteria. The data organization enabled us to create a homogeneous basis for our analysis.

3. Evaluation

The evaluation of each voice was performed according to subjective and objective parameters, which were correlated afterwards.

3.1. Subjective Parameters

The subjective evaluation raised several issues related to potential biasing factors. Sex, age, expertise or native/non-native speaker status, factors that go beyond the simple selection of parameters, can dramatically bias the analysis of the listeners' judgments. One major concern was the level of expertise in speech processing, since the listeners had distinct backgrounds. This problem is addressed in [10] and more recently in [6]. In the former, ratings from speech and language therapists specialized in voice, with at least 2 years of experience, are compared with those of final-year speech and language therapy students. In total, 14 parameters like breathiness, roughness and monotony, as well as pitch or loudness, were investigated. An important basic condition is that only the perceptual labels that are reliably judged by both listener groups should be used for comparison. The author concludes "that perceptual strategies between more and less experienced listeners are not different, but rather that these listeners adopt different baselines during perceptual tasks". To reduce the group variance, the listeners were asked to rate the voices more emotionally rather than drawing on any previous experience they had on the subject.

An example of the variance among native and non-native listeners is presented in figure 1. In this case the depicted results are for the selection of a UK English speaker out of six candidates [9]. In most cases, the non-native listeners also preferred the candidate which received the highest rank from the native listeners. In some cases, the opinions between younger and older natives differed more than between natives and non-natives. A similar behavior was observed for the other languages.

Figure 1: Ranking for UK English speaker selection. Horizontal axis shows the speaker's identification number and the vertical axis indicates the relative preference according to mother tongue (non-native, native and all listeners).

In figure 2 we can observe how the listeners' gender can influence the voice judgment. Some of the candidates are equally preferred by both genders, but others are clearly discriminated against by women. Nevertheless, the preferred voices show a more balanced score for both genders.

Bearing the described issues in mind, the target voices were evaluated according to the following subjective parameters: pleasantness (PLS), intelligibility (INT), sensuality (FEM), emotiveness (EMO), character (CHR) and speaking rate (SPD). Three more questions were asked addressing the listeners' judgment of the suitability of those voices for typical TTS applications, namely e-mail, news and instructions reading. A 5-point rating scale was used, which means that all voices were classified with marks from 1 (bad) to 5 (excellent) in every subjective attribute.
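To make the scoring procedure concrete, the following minimal Python sketch (hypothetical data layout and variable names, not the authors' actual scripts) averages the 1-to-5 marks over listeners and derives a simple preference ranking from the pleasantness attribute:

```python
import numpy as np

# ratings[listener, voice, attribute]: integer marks from 1 (bad) to 5 (excellent)
# for the six subjective attributes PLS, INT, FEM, EMO, CHR and SPD.
ATTRIBUTES = ["PLS", "INT", "FEM", "EMO", "CHR", "SPD"]

def score_voices(ratings: np.ndarray) -> dict:
    """Average the marks over listeners and rank the voices by pleasantness (PLS)."""
    mean_scores = ratings.mean(axis=0)              # shape: (n_voices, n_attributes)
    pls = mean_scores[:, ATTRIBUTES.index("PLS")]   # pleasantness column
    ranking = np.argsort(-pls)                      # best-scored voice first
    return {"mean_scores": mean_scores, "ranking": ranking}

# Illustrative run with 12 listeners, 10 candidate voices and 6 attributes.
rng = np.random.default_rng(0)
demo = rng.integers(1, 6, size=(12, 10, 6))
print(score_voices(demo)["ranking"])
```

The actual selection surveys also counted per-attribute votes for the best voice, so this averaging is only one simple way of turning the marks into a ranking.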

Figure 2: Ranking for EP speaker selection. Horizontal axis shows the speaker's identification number and the vertical axis indicates the relative score according to the listeners' gender (% men vs. % women). The shown scores were normalized to remove any bias resulting from the difference in the number of votes per gender.

Figure 3: Average relative scores per language according to fundamental frequency related parameters (EP+BP vs. the other languages). Horizontal axis shows linear frequency (Hz).

3.2. Objective Parameters

To objectively evaluate each of the recorded voices the following acoustic parameters were considered: F0 (mean, maximum, minimum, range and standard deviation), energy (mean and standard deviation), speaking rate (SPR, in words per minute excluding pauses) and pausing rate (PAR, the ratio between the duration of pauses and the total phonation duration without pauses). The features were extracted using Praat [11] and Mathworks Matlab.
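As an illustration of how the F0 and energy statistics above can be obtained programmatically, here is a minimal sketch using the parselmouth Python interface to Praat. This is only an assumption for demonstration purposes (the authors used Praat and Matlab directly), intensity in dB is used as a proxy for the energy statistics, and the speaking and pausing rates (SPR, PAR) would additionally require a word/pause segmentation that is omitted here.

```python
import numpy as np
import parselmouth  # Python interface to Praat; an assumption, not the authors' toolchain

def objective_parameters(wav_path: str) -> dict:
    """Extract F0 and energy statistics from one recording (SPR and PAR omitted)."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                               # keep voiced frames only
    intensity = snd.to_intensity().values.flatten()  # frame-wise intensity in dB
    return {
        "f0_mean": float(np.mean(f0)),
        "f0_max": float(np.max(f0)),
        "f0_min": float(np.min(f0)),
        "f0_range": float(np.max(f0) - np.min(f0)),
        "f0_std": float(np.std(f0)),
        "energy_mean": float(np.mean(intensity)),
        "energy_std": float(np.std(intensity)),
    }
```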

Each parameter was independently correlated with the subjective evaluation results in order to find acoustical clues of voice preference. The correlation values were calculated according to the equation:

Correl(X, Y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^{2} \sum (y - \bar{y})^{2}}}    (1)

where X represents a set of x values and x̄ their average; the same applies to Y. The output values are in the range -1 to 1, with 1 indicating a strong linear relationship between the sets, -1 the inverse, and 0 meaning that there is no linear dependence between the sets.

4. Results and Discussion

The subjective evaluation results were used to build an ordered candidate ranking for each language but, alone, they are useless for establishing direct comparisons of speakers or for any cross-language analysis. The same happens with the objective results. After gathering the results of the subjective and objective assessments, a joint analysis was performed.

The speaking rate analysis results are presented in figure 3. Again, EP+BP and the other languages are presented separately. It can be observed that a high speaking rate is a desired characteristic for all the languages. The flat lines around 2.7/2.8 words per second seem to indicate that this is an interesting value for this characteristic and that values higher than this can decrease the voice score. The multi-lingual analysis of this parameter can be misleading because some languages have much longer words than others (for example German, which has agglutinative processes in word composition).

In figure 4 we show the mean fundamental frequency (F0) ordered by candidate ranking (1 is the best scored voice and 5 is the worst scored voice for this speaker sample). The results are presented separately for EP plus BP (darker line) and for all the other languages (lighter line). We can see that the analyzed F0 averages fit within a 20 Hz range (around 26 Hz on the Mel perceptual scale) but still present a good diversity. The interesting observation is that, although the EP plus BP voices present lower F0 values and the other voices present higher F0 values, they all converge to the same common frequency band around 193 Hz. This indicates not only a gold value for the fundamental frequency of a female voice but also the cross-lingual uniformity of this finding.

Figure 4: F0 mean for the first five best scored voices. Horizontal axis shows the candidate ranking and vertical axis shows linear frequency (Hz).

Still concerning the fundamental frequency, we can see in figure 5 a dot cloud on a fundamental frequency versus relative score plane. Each point represents the F0 for the best ranked voice for a given language and has an associated score. We can observe three clusters, for minimum, average and maximum F0 values, with increasing spatial variance. For the maximum F0 there is a trend towards increased rankings at higher frequencies. This indicates that, despite the preference for low F0, it is also desirable to have a good vocal dynamic. The minimum F0 frequencies show a very small variance and the preferred values are close to the cluster F0 values. The average F0 cluster has a triangular shape, with the best scores given to the lowest frequency values.

In another analysis we tried to understand which perceptual strategy is, mostly unconsciously, used by the listeners to evaluate the subjective parameters. In table 1 we show, for a joint analysis of EP plus BP, the correlation values between the voices' scores for each subjective parameter and the related objective parameters. Fundamental frequency seems to be an important parameter in the evaluation of voice quality and it is also useful for judging character.
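A minimal Python sketch of Eq. (1) is given below, only to make the computation behind Table 1 concrete; the F0 values and pleasantness scores are purely illustrative and are not taken from the paper's data.

```python
import numpy as np

def correl(x, y):
    """Eq. (1): correlation between an objective parameter and a subjective score."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Illustrative values only: mean F0 (Hz) of five voices and their pleasantness (PLS) scores.
f0_mean = [185.0, 190.0, 193.0, 199.0, 208.0]
pls_score = [3.1, 3.4, 3.9, 4.2, 4.6]
print(round(correl(f0_mean, pls_score), 2))  # a value close to +1 indicates strong linear dependence
```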

Figure 5: Relative scores according to F0 values. Average, minimum and maximum values are presented for each language (CAT, FI, DK and PT series). Horizontal axis shows linear frequency (Hz) and vertical axis shows relative score.

We can also observe that both jitter and shimmer have a negative correlation. This could be expected, but it may indicate that high jitter especially reduces the perception of an emotional voice and that high shimmer contributes to a decrease in intelligibility. The harmonics-to-noise ratio is also inversely related to speaking rate. This may mean that, when a high dynamic is imposed on the phonatory system, the capacity to produce harmonic sounds is reduced because the settling time of the tissues has a longer relative duration for each sound. Pitch has the highest correlation values, which emphasizes its importance in the judgment of a voice.

Table 1: Correlation between objective and subjective scores for EP plus BP.

           PLS     INT     FEM     EMO     CHR     SPD
  SPR     -0.13   -0.30    0.26    0.30    0.24   -0.01
  Jitter  -0.62   -0.50   -0.76   -0.89   -0.79   -0.06
  Shimmer -0.58   -0.90   -0.37   -0.07   -0.30   -0.83
  HNR     -0.77   -0.71   -0.70   -0.41   -0.65   -0.95
  Pitch    0.97    0.50    0.29    0.78    0.84    0.57

An identical correlation table was produced for the remaining languages, and the absolute differences with respect to table 1 are presented in table 2. We can observe that all the values are below 0.20 and that 30% of the values are below or equal to 0.10. This may indicate that the obtained results for the EP and BP voices and judgments are coherent with the values for the other European languages. A statistical analysis using a z-test confirmed that this is a valid assumption for a 90% confidence interval.

Table 2: Absolute difference between the two sets of correlation values, for EP plus BP and for the other analyzed languages.

           PLS     INT     FEM     EMO     CHR     SPD
  SPR      0.02    0.09    0.00    0.14    0.04    0.12
  Jitter   0.14    0.04    0.13    0.20    0.19    0.11
  Shimmer  0.11    0.16    0.06    0.03    0.16    0.17
  HNR      0.11    0.00    0.15    0.20    0.12    0.14
  Pitch    0.17    0.11    0.15    0.16    0.04    0.02
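The paper does not specify the exact form of the z-test. A common choice for checking whether two correlation coefficients could come from populations with the same underlying correlation is Fisher's z-transform; the sketch below is written under that assumption and uses illustrative numbers only.

```python
import math

def fisher_z(r1, n1, r2, n2):
    """z statistic for comparing two independent correlations via Fisher's transform.
    |z| below 1.645 means the difference is not significant at the two-sided 90% level."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (math.atanh(r1) - math.atanh(r2)) / se

# Illustrative only: Pitch-PLS correlation for EP+BP (10 voices) vs. an assumed 25 other voices.
print(abs(fisher_z(0.97, 10, 0.93, 25)) < 1.645)  # True -> the two correlations are compatible
```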
5. Conclusions

In this paper we described the construction of a multi-lingual, multi-speaker voice database for voice quality analysis. The collected voices were evaluated by human listeners according to a set of subjective parameters, which allowed creating a voice preference ranking. Additionally, a set of objective parameters was extracted and correlated with each individual rank. This joint analysis led to several new and interesting conclusions. Mainly, we showed that the fundamental frequency values that gather more preferences are around 193 Hz and that a speaking rate of 2.7/2.8 words per second also brings additional votes. These results were also analyzed in two groups: one with the Portuguese language (European and Brazilian varieties) and another with a set of 7 European languages. We showed that the preference for an EP or BP female voice is perceptually identical to the preferences found for other European voices in distinct languages.

The correlation results between objective and subjective parameters are still preliminary, but it was shown that there is a correlation between the voice quality ranking obtained by subjective listening tests and acoustic parameters. These parameters can therefore be used for an automatic preselection of promising speakers from a larger number of candidates. Further investigations will focus on the results obtained by different groups of listeners, such as young/old, native/non-native, and expert/non-expert with regard to speech processing technology. A more comprehensive analysis involving all the languages is ongoing and will be published in future work.

6. References

[1] S. Pinho and P. Pontes, "Escala de avaliação perceptiva da fonte glótica: RASAT," Vox Brasilis, vol. 8, no. 3, pp. 8–13, 2002.
[2] J. Kreiman, B. Gerratt, G. Kempster, and A. Erman, "Perceptual evaluation of voice quality: review, tutorial and a framework for future research," Journal of Speech and Hearing Research, vol. 36, pp. 21–40, 1993.
[3] L. Eskenazi, D. G. Childers, and D. M. Hicks, "Acoustic correlates of vocal quality," Journal of Speech and Hearing Research, vol. 33, pp. 298–306, 1990.
[4] J. Kreiman and B. R. Gerratt, "Sources of listener disagreement in voice quality assessment," Journal of the Acoustical Society of America, vol. 108, no. 4, pp. 1867–1876, 2000.
[5] S. Blaustein and B. Asher, "Reliability of perceptual voice assessment," Journal of Communication Disorders, vol. 16, pp. 157–161, 1983.
[6] C. de Bruijn and S. Whiteside, "Effect of experience levels on voice quality ratings," in Proc. of the Phonetics Teaching and Learning Conference, London, August 2007.
[7] D. Braga, L. Coelho, F. G. V. R. Junior, and M. S. Dias, "Subjective and objective assessment of TTS voice font quality," in Proc. of the International Conference on Speech and Computers (SPECOM 2007), Moscow, October 2007, pp. 306–311.
[8] D. Braga, L. Coelho, F. G. R. Junior, and M. S. Dias, "Subjective and objective evaluation of Brazilian Portuguese TTS voice font quality," in Proc. of the 14th International Workshop on Advances in Speech Technology, Maribor, Slovenia, July 2007, pp. 306–311.
[9] O. Jokisch, G. Strecha, and H. Ding, "Multilingual speaker selection for creating a speech synthesis database," in Proc. of the Workshop on Advances in Speech Technology (AST), Maribor, Slovenia, 2004.
[10] J. Kreiman, B. Gerratt, and K. Precoda, "Listener experience and perception of voice quality," Journal of Speech and Hearing Research, vol. 33, pp. 103–115, 1990.
[11] P. Boersma and D. Weenink, "Praat: doing phonetics by computer (version 5.1.05)," http://www.praat.org/, 2009.


A Detailed Analysis and Comparison of Speech Synthesis Paradigms

Luis Coelho 1, Daniela Braga2, Carmen Garcia-Mateo3

1ESEIG, Instituto Politécnico do Porto, Porto, Portugal 2Microsoft Language Development Center, Microsoft, Portugal 3Departamento de Teoría de la Señal y Comunicaciones, University of Vigo, Spain [email protected], [email protected], [email protected]

Abstract To understand how good a typical HMM based synthesizer performs we analyzed in detail several synthesized speech units Hidden Markov Model based synthesis has gained a special re- and compared the obtained results, side by side, with a different levance in international speech conference and in contests like technique and with the original. In the next section we explain the Blizzard Challenge. This developing technology has proved how we proceeded to build the evaluation scheme and how the to be quite promising but naturalness is still an achievement to systems were developed. In section 3 we present the compari- conquer. In this paper we compare two TTS technologies with son results along with a very detailed analysis of large and small a human speaker and provide some of the major observed simi- speech units in time and frequency domains. The results of an larities and differences. Our analysis covers time and frequency objective analysis are also shown. The main conclusions are domains for several acoustical units in order to demonstrate, in presented in section 4. this short space, the capabilities of each technology. Index Terms: speech synthesis, comparison, HMM synthesis 2. Methodology 1. Introduction To evaluate the quality of HMM based synthesis against the tra- ditional concatenative approach and understand how close they A few years ago a new technique for speech synthesis was in- both can be to the original utterance several objective and sub- troduced [1] with very promising possibilities. This new frame- jective comparisons were made. In order to develop a fair as- work for corpus-based speech synthesis systems uses hidden sessment base two synthesis system were built using the same Markov models (HMM) for parameter modelling and can si- database, one based on unit selection diphone synthesis, using multaneously describe spectrum, pitch and duration in a unified Festival tools [12], and another one using HMMs, using HTS manner using dynamic features [2, 3, 4]. Beyond other char- tools [5]. Both toolkits are popular options for the development acteristics the stochastic base provides a highly flexible para- of such kind of systems but the distribution versions are far from metric modelling that allows the creation of a voice-font with the state-of-the art. Again we remember that it is not our pur- a small database and the conversion of that voice-font to a new pose to compare high-end systems (publicly unavailable) but to one with even few new data. Additionally and not less impor- analyze the capabilities of each tecnology. tant are the smaller system footprint and the database recording requirements which, in this case, are much less demanding. 2.1. Database The original concepts, implemented by HTS tools [5], have A 1 hour database was recorded at 44100 Hz in a professional been continuously developed as the consecutive results of the recording studio. A 30 year old female speaker, with Euro- Blizzard Challenge indicate. On 2005 [6] duration modelling pean Portuguese (EP) as mother tongue, read a set of previously was improved with the introduction of hidden-semi-Markov selected sentences from daily newspapers written in EP. Sen- models (HSMM). STRAIGTH [7] with mixed excitation, a high tence selection followed a phonetically balanced criterion. 
The quality vocoding technique, along with a new spectral and ape- data was then downsampled to 8 KHz since we had a mobile riodic analysis method helped to reduce the original buzzyness phone/PDA based application in mind. The database was au- in the generated speech brought additional quality. The acous- tomatically labeled at sentence, word and phone levels and the tical smoothing was reduced with the consideration of global result was revised by a phoneticist with experience on the task. variance (GV) on parameter generation for the synthesis filter. For labelling, 38 symbols were used for representation of the On 2006 [8] the previous system was enhanced with the addi- Portuguese phonemes and 4 extra symbols for marking silence, tion of full-covariance models with a semi-tied covariance ma- inspiration, tonic syllable and stops in plosives. Words and sen- trix in HSMM. GV pdf description also changed from diagonal tences have only marks identifying the beginning and end. For to a full-covariance matrix. On 2007 [9] new speaker adapta- evaluation purposes we selected from the database a set of 25 tion techniques were presented and the system was developed sentences (around 5 minutes), never used during training, and around this concept. An average voice model is used consider- generated similar utterances using the two developed systems. ing a mixed gender acoustic data. HSMMs have adaptive train- ing and adaptation and CSMAPLR transforms are used [10]. 2.2. System Description These results are quite remarkable and undoubtedly prove the success of the technique however some issues are reported. We started by developing a common front-end for both systems The main difficulties reside on the selection of a good vocoding performing only the required adaptations for providing the cor- technique and implicit excitation that allow the generation of rect information for each synthesis engine. We have used a high quality speech. set of EP specific modules [13]. The concatenative synthesizer


[Figures 1 and 2: waveform amplitude and pitch contour (Hz) versus time (s) panels for the utterance produced by the human speaker, CS and HS.]

Figure 1: Time domain comparison of the utterance ”Os Figure 2: Comparison of pitch contours for the sentences pre- seguintes participantes recusaram.” (”The following partici- sented in figure 1 presented in the same order. pants refused.”) produced by human speaker, CS and HS, from top to bottom. Time axes are scaled for better evaluation of intra-phrasal segments. 3.2. Words On the word level we essentially evaluated the duration of the full units as well as the duration of smaller intra-word units. (CS) was built using a diphone inventory using all the possi- On the frequency domain it was possible to observe voiced and ble combinations for EP. The HMM based synthesizer (HS) ap- unvoiced sounds and the evolution of formant frequencies in- proximately followed the standard configuration for HTS, 20ms side and between phone units. In figure 3 we show the word Hamming windows with a 5ms frame rate, 1 Gaussian with di- ”seguintes” (/s@gi˜t@S/). In time domain we can observe again agonal covariance, feature vector with energy, log f0, 20th or- a softer evolution in HS but with very well defined occlusions der Mel-Cepstral analysis and their first and second order dis- (for example in /t/). The intra-word durations are in consonance crete derivatives. Using the same phoneme inventory used for with the CS though HS shows a longer overall duration. On fre- labelling we trained left/right context dependent HMMs, with quency domain we can see that the fricative sounds, at the be- 5 states, left-to-right topology with no jumps. Including front- ginning and end of the word, have their energy well distributed end, the CS had a footprint of 300 Mb while the HS used 8 Mb. over the whole spectrum as expected. The voiced sounds, al- most all the word, are well defined with clear and well posi- 3. Results and Discussion tioned formants. Between and inside phonemes there are no visible discontinuities. In the case of the HS the first formant 3.1. Phrase Structures frequently has a higher density which works well for auditory On the first comparison we wanted to assess the behaviour of perception. We can also observe on the central phoneme se- the systems in the situation of complete phrasal structure gen- quence that in the HS case the fast acoustical phenomenon are eration. The analyzed structures have a long duration and the less defined, the spectral bars are diluted by the context. On information they can provide is limited. We observed the time the other hand occlusions like /t/ are better defined because the domain and we mainly considered the following characteristics: artificial system is not constrained by continuity restrictions as amplitude envelope, segmental duration and pitch contour. In human articulators are. In these cases the HS system can pro- figure 1 we a have a time domain representation which, at this duce better than original articulations. scale, shows a high similarity between signals. Word length and pause rate are similar and any deviations in the total length of 3.3. Phonemes the sentence can be easily corrected by adjusting the speaking This was our lowest analysis level and we will only cover here rate (especially in the HS). The amplitude envelope has a high a part of the phonetic inventory, one sound for each articulation correlation with the original and with the CS. For the HS we can mode. 
We were concerned with the quality of glottal pulses, observe a smoother evolution of amplitudes though plosives are fonation regularity, jitter and shimmer all in time domain and more abruptly marked. formant localization, definition and evolution in the frequency In figure 2 we show the pitch contour for the same sentences domain. presented in figure 1 (extracted using Praat [14]). The used TTS front-end did not include a special prosody manipulation mod- 3.3.1. Vowels ule so the obtained curves are simplified. Both synthesizers pro- duced identical results when compared with the original curve Vowels generation is paramount in EP because they are very but the HS produced a more smoothed time evolution. Since our frequent and have a long duration when compared with other human speaker was a professional actress that always imposed a phones, usually they can represent more than 50% of the utter- very dynamic rhythm we believe that the differences can be am- ance duration. In figure 4 we can observe in time and frequency plified. For a more neutral voice, in an informal environment, an example of the vowel /a/. In the case of the HS the dura- the differences would be smaller. In any case the HS produced tion is approximately 50% higher than the original system. The an overall more monotonous speech (confirmed by std. dev.). harmonic content is very clear in the signal’s periodicity and


Figure 3: Time domain (left) and frequency domain (right) comparison of the word "seguintes" produced by human speaker, CS and HS, presented from top to bottom.

Figure 5: Time and frequency domain comparison of the fricative /s/ produced by human speaker, CS and HS, presented from top to bottom.

Table 1: Normalized Euclidean spectral distance of the artificially generated phonemes to the original (lower is better).

          /a/    /s/    /k/    /n/    /r/    Tot.
   CS    0.85   0.92   0.87   0.85   0.74   4.23
   HS    0.88   0.89   0.70   0.86   0.68   4.01

0 Frequency (Hz) –0.7891 0 0 0.0579077 0 0.0579077 Time (s) Time (s) the higher frequencies for the HS. In figure 6 an example of /k/, an unvoiced plosive, can be observed with her occlusion and 0.5547 4000 explosion moments cleanly defined except for the HS. During oclosion the HS generated spectrum is more saturated and the 0 typical burst should be more powerful. Also a noticeable silence Frequency (Hz) –0.2734 0 0 0.0778754 0 0.0778754 appears, the vertical white bar on spectrum. The duration of the Time (s) Time (s) moments was not correctly generated which makes the signal drag and sound very artificial. Our phoneme /k/, selected from a Figure 4: Time and frequency domain comparison of the vowel word beggining context, showed a very distinct behaviour from /a/ produced by human speaker, CS and HS. /t/ in the middle of the spectrum, figure 3 (not labeled but easily identified buy the higher power in top frequencies). An exam- ple of the nasal /n/ in the beggining of a word is shown in figure no significant jitter can be observed. Shimmer has a reduced 7. The duration of both artificially generated sounds is signifi- expression on the first two cases but the HS generates slowly cantly smaller but in the spectral representation no meaningful decaying glottal pulses. This is a consequence of the transition differences are observed. The frequency below 500Hz are satu- preparation to the next phoneme and we found no negative con- rated and the HS shows a higher spectral contrast. The obtained sequences on auditory perception. In frequency domain we can result for the HS is quite important since nasals are typically easily observe fundamental frequency and the first three for- not easy to generate. The MLSA [11] speech coding technique mants, all stable and clearly defined. The signal produced by correctly handled the sound. Some vocalization is noticed due the HS shows a higher spectral contrast which, for a vowel, in- to the effect of the following vowel (not shown). Finnaly we dicates clearlyness and good voice quality [15]. present the trill /r/, on a word beginning context, in figure 8. Again the resemblance between sounds is very high. A spectral 3.3.2. Consonants strip around 2000 Hz appears in all the sounds but with a greater homogeneity in the HS. For this system we can also observe a For consonants we will present one sound for each articulation blurred spectrum above and bellow 2000 Hz due to the paramet- mode with the exception of liquids that will not be presented ric model used for representation. Unlike previous sounds the (mainly because we could not collect a statistically meaning- spectral contrast in the HS is now smaller. ful set of sounds in our database for training the related HMM which would lead to an unfair comparison). An example of 3.4. Objective Evaluation a fricate /s/ can be found in figure 5 and, as expected for this sound, we observe an almost random signal in time domain, es- In addition we estimated the spectral distances of the sound gen- sentially noise. In the spectral representation the power is well erated by CS and HS to the original. We used an Euclidean distributed along the frequencies with a slight enforcement of distance and the normalized results are presented on table 1.
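The paper does not detail the spectral representation behind the distances of Table 1. Purely as a sketch, assuming frame-wise log-magnitude spectra averaged per phoneme and a per-phoneme normalization by the largest distance (both hypothetical choices), the computation could look like this:

```python
import numpy as np

def mean_log_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Average log-magnitude spectrum over the windowed frames of one phoneme realization."""
    window = np.hanning(frames.shape[1])
    spectra = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
    return np.log(spectra + 1e-10).mean(axis=0)

def spectral_distance(original_frames: np.ndarray, generated_frames: np.ndarray) -> float:
    """Euclidean distance between the averaged spectra of the original and generated phoneme."""
    return float(np.linalg.norm(mean_log_spectrum(original_frames)
                                - mean_log_spectrum(generated_frames)))

# A per-phoneme normalization such as d_cs / max(d_cs, d_hs) would then map the distances
# to the 0-1 range of Table 1; the exact normalization used by the authors is not stated.
```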


4. Conclusions

Many more figures and comparisons would be necessary to make a deeper analysis of these technologies. The developed systems are far from being state-of-the-art; using state-of-the-art systems would certainly enhance the comparison. We analyzed the raw technologies and presented their main advantages, drawbacks and development potential. The main conclusions are: the CS system can produce very high quality utterances when the speaking style is identical to the one used during database recording; the basic pitch contour generated by the HS is smoother but can be easily manipulated to generate a richer prosody; smoothing can also be observed in the time evolution of formant frequencies; this effect can be a benefit, since the vocoding technique used in the HS can correctly describe the most representative frequencies, shown by the higher-contrast spectra, which leads to high levels of intelligibility. Some of these conclusions confirm other reports and others bring new insights about HMM-based synthesis. The obtained results show that this technology can compete with the old paradigm while still requiring much smaller footprints. We expect that the presented analysis can provide research paths and information for additional developments and improvements.

Figure 6: Time and frequency domain comparison of the plosive /k/ produced by human speaker, CS and HS.

0.3359 4000 5. References

0 [1] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kita- mura, T., ”Speech parameter generation algorithms for hmm- Frequency (Hz) –0.3828 0 based speech synthesis,” in Proc. IEEE Conf. on Acoustics, 0 0.13057 0 0.13057 Time (s) Time (s) Speech, and Signal Processing, 2000.

0.2969 4000 [2] Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. Kitamura, T., ”Simultaneous modelling of spectrum, pitch and duration in

0 HMM-based speech synthesis,” in Proc. of Eurospeech, 1999. [3] Tokuda, K., Kobayashi, T. and Imai, S., ”Speech parameter gen- Frequency (Hz) –0.3125 0 0 0.0328614 0 0.0328614 eration from HMM using dynamic features,” in Proc. of ICASSP, Time (s) Time (s) 660-663, 1995.

0.4922 4000 [4] Heiga Zen and Tomoki Toda, ”An Overview of Nitech HMM- based Speech Synthesis System for Blizzard Challenge 2005”, in

0 Proc. of InterSpeech 2005, 93-96, 2005.

Frequency (Hz) [5] HTS, April 2009, at http://hts.sp.nitech.ac.jp/. –0.4297 0 0 0.0456287 0 0.0456287 Time (s) Time (s) [6] Zen, H., Toda, T., and Tokuda, K., ”Details of the nitech hmm- based speech synthesis system for the blizzard challenge 2005,” IEICE Trans. Inf. Syst., vol. E90-D, no. 1, pp. 325-333, 2007. Figure 7: Time and frequency domain comparison of the nasal [7] Kawahara, H., Masuda, I. and Cheveigne, A. , ”Restructur- /n/ produced by human speaker, CS and HS. ing speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds”, in Speech Com- munication, 27:187-207, 1999. 0.07812 4000 [8] Zen, H., Toda, T., and Tokuda, K., ”The Nitech-NAIST HMM- Based Speech Synthesis System for the Blizzard Challenge 2006,” 0 IEICE Trans. Inf. Syst., vol. E91-D, 6:1764-1773, 2008. Frequency (Hz) –0.07031 0 [9] Yamagishi, J., Nose, T., Zen, H., Toda, T. and Tokuda, K., 0 0.144689 0 0.144689 Time (s) Time (s) ”Speaker-independent hmm-based speech synthesis system - HTS-2007 system for the blizzard challenge 2007”, 2007. 0.1406 4000 [10] Nakano, Y., Tachibana, M., Yamagishi, J. and Kobayashi, T.,

0 ”Constrained structural maximum a posteriori linear regression for average-voice-based speech synthesis,” in Proc. ICSLP, 2006. Frequency (Hz) –0.125 0 0 0.152752 0 0.152752 [11] Fukada, T.; Tokuda, K.; Kobayashi, T.; Imai S., ”An adaptive Time (s) Time (s) algorithm for mel-cepstral analysis of speech”, Proc. ICASSP-92,

0.04688 4000 1, 137-140, 1992. [12] Festival, 2009, at http://www.cstr.ed.ac.uk/projects/festival/. 0 [13] Braga, D., ”Natural Language Processing Algorithms for TTS

Frequency (Hz) Systems”, PhD thesis, Universidad da Coruna, 2008 –0.05469 0 0 0.0643784 0 0.0643784 Time (s) Time (s) [14] Boersma, P., Weenink, D., ”Praat: doing phonetics by computer”, April 2009, at http://www.praat.org/. Figure 8: Time and frequency domain comparison of the trill /r/ [15] Braga, D., Coelho, L., Resende, G., and Dias, M., ”Subjective and objective evaluation of brazilian portuguese tts voice quality”, In produced by human speaker, CS and HS, from top to bottom. Proc. Advances in Speech Technology, Ljubliana, 2007.


Gender, Speaker & Language Recognition



Detection of Children’s Voices

Rui Martins12, Isabel Trancoso12, Alberto Abad2, Hugo Meinedo2

1Instituto Superior Técnico, Lisboa, Portugal 2INESC-ID Lisboa, Portugal [email protected]

Abstract most currently adopted for gender detection. Our own prelimi- This paper reports our recent work on extending our previous nary work is described in 4, before the concluding remarks. gender detector, targeted only at distinguishing between adult male and female voices, to encompass children’s voices as well. 2. Characteristics of children’s voices The classifiers were based on multilayer perceptrons and Gaus- sian mixture models and used Perceptual Linear Prediction co- There are several differences that can distinguish children’s efficients, plus deltas, and pitch as features. Despite the small voices from adult voices. The differences may be attributed to amount of training data for children’s voices, fairly good results anatomical and morphological differences in the vocal-tract ge- were obtained in a test corpus of similar recording conditions ometry, less precise control of the articulators and a less refined (minimum classification error rate of 2.6%). Tests on real life ability to control suprasegmental aspects as prosody. These as- corpora revealed the expected degradation with noisy environ- pects induce major differences in children speech, higher fun- ments and distant microphones. Tests with transformed female damental and formant frequencies, greater spectral variability, voices intended as cartoon child characters showed that they slower average speaking rate, higher variability in speaking rate were mostly classified as children’s voices. and higher degree of spontaneity [5]. Index Terms: gender detection, age effects, children voices. It is a well known fact that the fundamental frequency of children’s voices is much higher than for adults, where aver- 1. Introduction age values of 130 Hz for adult males, and 220Hz for adult fe- males can be found. No statistically significant gender differ- Gender detection (GD) is a very useful task for a wide range of ence exists for children below twelve. Children’s voices are also applications. In the Spoken Language Systems lab of INESC- known to have much higher formant frequencies (specially for ID, the GD module is one of the basic components of our audio the second and third formants), attaing values above 4 kHz. The segmentation system [1], where it is used prior to speaker clus- boundary values of the phonetic vowel space decrease with age, tering, in order to avoid mixing speakers from different genders becoming more compact, and leading to a decrease in dynamic in the same cluster. Gender information may also be used for range of the formants values and to a decrease of the variability building gender-dependent acoustic modules for speech recog- of spectral values. A 5-year old child presents values of for- nition [2]. In our fully automatic Broadcast News subtitling mants 50% higher than an adult male. Whereas in adults there system, deployed at the national TV channel since March 2008 are typically 3-4 formants in the 0.3-3.2kHz range, for children [3], gender information is also used to change the color of the one can only find 2-3 formants in this range. subtitles, thus helping people with hearing difficulties to detect which speaker the subtitle refers to, a useful hint that partially 2.1. Growing up compensates the small latency of the subtitling system. GD is also a prominent part of our participation in the The differences become less marked in the process of growing VIDIVIDEO European project, aiming at the semantic search up. 
During puberty, the male glottis changes so that the pitch of audio-visual documents [4]. In this application, the audio frequency is lowered about one octave. This change sometimes concept “male-voice” may be much easier to detect than the occurs over just a couple of weeks. The pitch drop usually oc- corresponding video-concept “male-speaker”. curs from age eleven to age thirteen and there is no significant Most gender classification systems are trained for distin- pitch change after fifteen. No abrupt changes are observed for guishing between male and female adult voices alone. In fact, girls, where the pitch drop from age seven to age twelve is sig- in some applications like Broadcast News (BN) transcription, nificant, indicating that the laryngeal growth ends around that children’s voices are relatively rare, hence justifying their non- age. In another study [6] it is shown that for male speakers, inclusion. The difficulties in collecting large corpora of chil- pitch drops 78% between the ages 12 to 15, and after that there dren’s voices may also be one of the reasons why most detec- are no significant changes. For female speakers, pitch drops be- tors do not attempt a 3-class distinction. In some applications tween ages 7 and 12, and stops after. The changes in female such as the automatic detection of child abuse (CA), however, speech are more gradual than in male speech, and the main dif- the detection of children’s voices may be specially important. ferences become more significant after age 12. This paper describes our first efforts at moving from our The size of the vocal tract develops somewhat similarly for original 2-class gender detection module to a 3-class module in- boys and girls in this age range [7]. [8] reports an almost linear cluding children’s voices. The paper starts with a brief overview scaling of formant frequencies with age. The scale presents a of the main differences of children’s voices relative to adult ones significant divergence in male / female after puberty, showing and how they become less pronounced as they grow up. Section the differences in physical changes between male and female 3 reviews the state of the art in terms of features and methods speakers. Another thing that changes with age is the internal


control loops of the articulatory system [9]. 4. Gender classification experiments 4.1. Corpora 3. Gender/age detection The original Male/Female gender classifier was an MLP trained The features most typically found in gender/age classifica- (and tested) on a corpus of Broadcast News (BN), with approx- tion methods are pitch, formants, Mel-Frequency Cepstral Co- imately 51h, (46h for training and 5h for cross-validation). The efficients (MFCC), Perceptual Linear Prediction Coefficients first training of the 3-class detector was done using the CMU (PLP), autocorrelation coefficients, linear prediction coeffi- Kids corpus [15]. The need to get a balanced amount of train- cients (or equivalent), etc. The slower average speaking rate ing data for all the 3 classes made us use a very restricted sub- of children relative to adults is also a motivation for includ- set of the BN male/female data (230 min. per class), with cor- ing delta, RASTA-PLP, or any other temporal modeling coef- responding limitations in the classifier results. More recently, ficients in the feature set. This large number of features also however, we had access to the child corpus collected at KTH motivates the adoption of dimensionality reduction approaches within the framework of the European project PF-STAR [16]. such as Independent Component Analysis and Principal Com- This allowed us to use an extended corpus around 515 min. per ponent Analysis [10]. class, which were subdivided into training (345 min.), develop- ment (65 min.) and test (105 min.). Table 1 shows the gender / Gender classifiers using Gaussian mixture models (GMM), age distribution of the combined corpora. Hidden Markov models (HMM), or multi-layer perceptrons (MLP) were proposed and tested with results about 95% of Gender/Age 4 5 6 7 8 Total accuracy. Most often, these results concern only male/female Male 16 36 35 27 18 132 (M/F) distinction. The comparison of the results reported in the Female 20 27 46 32 18 143 literature is hindered by the fact that they have all been obtained with different corpora. Although very frequently adopted for vi- Table 1: Gender/Age Distribution sual gender detection, Support Vector machines (SVM) are not so popular for audio gender detection. The two children corpora were recorded in very controlled GMMs are the most frequently adopted learning method conditions, and so is most of the BN adult corpus. This was for this task. In [11], a two-stage GMM based classifier shows the motivation for building a pilot set of recordings in condi- results in the order of 98% accuracy, for clean speech, and tions closer to real-life applications for gender detection. This Male/Female/Child (M/F/C) distinction, using a feature vec- evaluation corpus includes one broadcast news show (BN - chil- tor with pitch, formants, and RASTA-PLPs. The first stage at- dren’s day - 63 min.), two TV children’s show (CS - 45 min.), tempts to distinguish adult voices from children’s voices. The two family videos (FV - 30 min.), and 99 CA recordings (489 second stage attempts to distinguish between male and female min.). This CA recordings were divided in 2 sets. The CA adult voices. Speech which represents word recognizable speech and CA Another GMM-based approach to this problem was pro- Voice which represents the presence of a human voice (in gen- posed in [12], combining the information derived from the pitch eral with poor acoustic conditions). 
All have been manually with a GMM classifier trained with MFCCs, to enhance the per- labelled in terms of gender. Results will be presented for each formance of gender classification. The two scores are combined type of show separately, as the conditions widely differ. The BN using a weighted summation. This method showed results of show is the most similar to the recording conditions of the train- 96.7% and 99.7% for sentences and digits, respectively, in an ing corpora - almost no noise, and no speaker overlap. The CS M/F classification task. shows are also similar in terms of noise conditions, but multiple speaker overlap is frequent (manually marked as overlapping, Gender male/female detection is also applied in an audio with no gender labels). The FV files are characterized by loud segmentation task for broadcast news in [2]. The authors use background noise and multiple speaker overlap. The CA files an HMM-based phone recognizer with 45 context independent often have loud background music. phone models per gender, plus a silence/noise model. The out- Very frequently, the voices of child characters in cartoons put is a sequence of relatively short segments having male, fe- and games correspond to adult professional speakers. This was male or silence tags, which is then heuristically smoothed. The the case of the voices chosen for the Ecircus European project gender segmentation results can be improved by using a clus- [17], where the first recordings of a set of 100 English sentences tering procedure in which all segments are clustered using a by a 9-year old girl and a 10-year old boy attested the fact that top-down covariance-based technique. Error rates below 2.4% children have much greater difficulties than adults in recording have been obtained for wideband gender classification. large quantities of data for corpus-based concatenative synthe- [14] compares 5 different classifiers for gender and age: sis. In fact, they require shorter recording sessions and at slower multi-layer perceptron, k-nearest-neighbor model, Gaussian pace. It is also more difficult to assure the same speaking style mixture model, naive Bayes, and a simple decision tree. Empir- among recording sessions, since it often depends on the child ical features were adopted: pitch and its microvariations (shim- mood in that specific day. This was the motivation for build- mer and jitter), harmonics-to-noise ratio, articulation rate, num- ing synthetic voices from adult recordings for two female En- ber and duration of speech pauses. The age classification was glish speakers. The number of prompt files for each speaker was made according to a fine grid: child, teenager, adult, senior. 675. The total duration of the recordings was 24min. for each Hence the overall number of classes for the combined gen- speaker. Each adult voice was transformed to a child-like voice der/age classification problem was 8. The multi-layer percep- using PSOLA [18] and spectral scaling techniques. Synthetic tron performed best: 93.1% for adult M/F classification, 63.5% voices were then built both from non-transformed and trans- for the overall accuracy of the 8-class classification problem. formed inventories. The transformed voices were considered The greatest confusion of the 8-class problem was achieved, as believable children’s voices when played together with the car- expected in the child M/F distinction. toon characters. These 2 pseudo-children’s voices will be used


for a last set of tests. Results with generative classifiers such as GMMs us- ing both features simultaneously were worse than using only 4.2. Classification with GMM and MLP methods PLP+delta features. It is possible that the fact that pitch values are relatively close for female and children’s voices may have This section reports the experiments with different features and a negative influence in the results of generative methods. Dis- different machine learning methods. The evaluation metrics are criminative classifiers such as MLPs were not so sensitive to the classification error rate (CER), defined as the percentage of this close proximity. incorrectly classified frames, and the F-measure, defined as the The final set of GMM experiments combined the PLP- weighted harmonic mean of precision and recall. based classifier with the pitch-based classifier. The best com- bination of weights for the linear classifier was trained using a 4.2.1. 2-class baseline classifier logistic regression, with the Focal-Multiclass toolkit [21]. The The 2-class baseline classifier [19] is one of the components results are slightly worse compared with the previous experi- of the audio segmentation module that is used as a pre- ment (CER=5.0%). processing stage for our BN fully automatic subtitling system. The use of 12th order PLP coefficients can be questioned as As other classifiers in this module, it is based on feed-forward higher order cepstral coefficients are frequent in speaker iden- fully connected multi-layer perceptrons trained with the back- tification research. However, our gender classification experi- propagation algorithm. The MLP has 9 input context frames ments using 18 PLP coefficients (plus deltas) only showed im- of 26 coefficients (12th order PLP coefficients with energy plus provements for adult speakers, at the cost of degrading the re- deltas), two hidden layers with 250 sigmoidal units each, and sults for children voices, making the overall results worse. two softmax output units (one for each class) which can be viewed as giving a probabilistic estimate of the input frame be- 5. Results on real-life corpora longing to that class. Figures 1 and 2 present the results of the MLP and GMM (joint This classifier was trained and tested with different subsets PLP+pitch) classifiers for our different real-life corpora. The of the original BN corpus, achieving a CER of 2.30%, with F- corresponding CER results are shown in Table 2, respectively. measure=0.98. As expected, the best results were obtained for the BN show, which is the one with the recording conditions closest to the 4.2.2. 3-class MLP classifier training set. Our first 3-class experiments used an equal MLP architecture, with the same PLP+delta features. As expected, worse results were obtained (CER=4.70%), which we attributed to the dras- tic reduction in training material, and the addition of a third class. The worse results were obtained for female and chil- dren’s voices, where the F-measure was 0.95, versus 0.96 for male voices. Given the importance of pitch as a discriminative feature for this task, we next trained MLPs using PLP+delta+pitch simul- taneously, which resulted in an input vector of dimension 27 per frame. Pitch frequency was extracted using the SNACK toolkit [20]. A significant improvement was observed (CER=3.40%). The best results were obtained for male voices, where the F- measure was 0.98, versus 0.96 for female and children’s voices.
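For illustration only, the 3-class configuration described above (9 context frames of 27 PLP+delta+pitch coefficients, two hidden layers of 250 sigmoidal units, softmax-style output over male/female/child) could be approximated as sketched below with scikit-learn. This is a hedged sketch, not the authors' actual feed-forward implementation, and the context stacking shown is one plausible way of forming the 9-frame input window.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

N_CONTEXT, N_FEATS = 9, 27  # 9 frames of 12th-order PLP + energy + deltas + pitch = 27 coefficients

def stack_context(frames: np.ndarray, context: int = N_CONTEXT) -> np.ndarray:
    """Concatenate each frame with its neighbours (context window centred on the frame)."""
    half = context // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(context)])

# Two hidden layers of 250 logistic (sigmoidal) units; classes: 0=male, 1=female, 2=child.
clf = MLPClassifier(hidden_layer_sizes=(250, 250), activation="logistic", max_iter=200)

# With frame-level features X_train/X_test and per-frame labels y_train/y_test:
# clf.fit(stack_context(X_train), y_train)
# cer = 1.0 - clf.score(stack_context(X_test), y_test)  # frame classification error rate (CER)
```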

4.2.3. 3-class GMM classifier The next set of experiments was done using Gaussian mixture Figure 1: F-measure results on real-life corpora obtained with models and the same PLP features plus deltas (26 coefficients). the MLP classifier (PLP+deltas+pitch). Unlike the described MLP approach, the GMM classifier does not make use of context windows. The number of mixtures was varied from 32 to 512. As expected, best results were achieved CER % BN CS FV CA speech CA Voice for the largest number of mixtures (CER=2.6%). In terms of MLP 10.67 16.64 68.77 44.80 65.27 F-measure, the best results were obtained for children’s voices, GMM 18.54 31.96 43.52 47.41 62.55 where the score was 0.99, versus 0.97 for male and 0.96 for female voices. Table 2: CER results on real-life corpora obtained with the two The next experiment was done using GMMs trained only classifiers (PLP+deltas+pitch). with pitch information, and varying the number of mixtures from 2 to 32. The results were obviously much worse (min- imum CER=30.75%, for 8 mixtures). The highest F-measure was obtained for male voices, where the score was 0.80, versus 0.65 for children’s and 0.62 for female voices. 6. Results on pseudo-children’s voices Experiments using both types of features simultaneously Experiments with the Ecircus voices have shown us that the yielded CER=4.4% for 512 mixtures. In terms of F-measure, original voices were either classified as female or children’s the best results were obtained for male voices, where the score voices, which justifies the choice of these particular voices for was 0.98, versus 0.96 for children’s and 0.94 for female voices. cartoon child characters. The transformed voices were mostly


Figure 2: F-measure results on real-life corpora obtained with the GMM classifier (PLP+deltas+pitch).

6. Results on pseudo-children's voices

Experiments with the Ecircus voices have shown us that the original voices were either classified as female or as children's voices, which justifies the choice of these particular voices for cartoon child characters. The transformed voices were mostly classified as children's, especially by the GMM classifier, as shown in Table 3.

Table 3: Results with the Ecircus voices in terms of percentage of frames classified in each gender/age class.
%                    Male    Female   Child
MLP   Original       5.38    67.68    26.94
      Transformed    5.55    37.56    56.89
GMM   Original       0.23    20.63    79.14
      Transformed    0.11    0.08     99.81

7. Conclusions

Most of the literature on the detection of children's voices reports results obtained under controlled conditions. It is a well known fact that results generally show a high sensitivity to the presence of noise and to the distance to the microphone. Our preliminary experiments with real-life recordings confirm this expected degradation. Nevertheless the results may still be quite useful for a wide range of applications.
The present results show the higher sensitivity of discriminative classifiers when dealing with noisy environments, revealing an over-adaptation to the clean training environment. Generative classifiers, on the other hand, are not as accurate in controlled conditions as discriminative ones, but in real conditions they tend to perform significantly better. The fusion of both types of classifier is one of our next tasks.
The reduced amount of training data for children's voices is one of the problems we face. We are currently investigating the possibility of unsupervised training approaches, by adding to our children's voices training set all the segments that have been classified as children with a high confidence measure. We are also considering multi-stage classifiers, instead of 3-class ones.

8. Acknowledgements

The authors would like to thank Mats Blomberg and Daniel Elenius for letting us use the KTH PF-STAR children corpus, and our colleague Luís Oliveira for his help with the Ecircus voices. This work is part of the MSc thesis of Rui Martins, and is partly funded by the European projects I-DASH and VIDIVIDEO.

9. References

[1] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. Neto, "A prototype system for selective dissemination of broadcast news in European Portuguese," EURASIP Journal on Advances in Signal Processing, Hindawi Publishing Corporation, no. 37507, May 2007.
[2] P. Woodland, T. Hain, S. Johnson, T. Niesler, A. Tuerk, and S. Young, "Experiments in broadcast news transcription," in Proc. ICASSP 1998, Seattle, USA, May 1998.
[3] H. Meinedo, M. Viveiros, and J. Neto, "Evaluation of a live broadcast news subtitling system for Portuguese," in Proc. Interspeech 2008, Brisbane, Australia, Sep. 2008.
[4] I. Trancoso, T. Pellegrini, J. Portêlo, H. Meinedo, M. Bugalho, A. Abad, and J. Neto, "Audio contributions to semantic video search," in Proc. ICME 2009 - IEEE International Conf. on Multimedia & Expo, Cancun, Mexico, 2009.
[5] A. Potamianos and S. Narayanan, "A review of the acoustic and linguistic properties of children's speech," in Proc. International Workshop on Multimedia Signal Processing - MMSP 2007, Chania, Greece, Oct. 2007.
[6] J. Ajmera, "Effect of age and gender on LP smoothed spectral envelope," in Proc. Speaker and Language Recognition Workshop, IEEE Odyssey 2006, San Juan, Puerto Rico, Jun. 2006.
[7] J. Wilpon and C. Jacobsen, "A study of speech recognition for children and the elderly," in Proc. ICASSP 1996, Atlanta, Georgia, USA, May 1996.
[8] A. Potamianos and S. Narayanan, "Robust recognition of children's speech," IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 603–616, Nov. 2003.
[9] J. Sundberg, The Science of the Singing Voice. Northern Illinois University Press, DeKalb, Illinois, 1987.
[10] C. Huang, T. Chen, S. Li, E. Chang, and J. Zhou, "Analysis of speaker variability," in Proc. Eurospeech 2001, Aalborg, Denmark, Sep. 2001.
[11] Y. Zeng and Y. Zhang, "Robust children and adults speech classification," in Fourth Int. Conf. on Fuzzy Systems and Knowledge Discovery - FSKD 2007, Haikou, China, Aug. 2007, pp. 721–725.
[12] H. Ting, Y. Yingchun, and W. Zhaohui, "Combining MFCC and pitch to enhance the performance of gender recognition," in Proc. ICSP 2006, Guilin, China, Nov. 2006.
[13] K. Hye-Jin, B. Kyungsuk, and Y. Ho-Sub, "Age and gender classification for a home-robot service," in Proc. 16th IEEE International Conference on Robot and Human Interactive Communication, Jeju, Korea, May 2007.
[14] C. Müller, "Automatic recognition of speakers' age and gender on the basis of empirical studies," in Proc. Interspeech 2006, Pittsburgh, USA, Sep. 2006.
[15] M. Eskenazi, J. Mostow, and D. Graff, "The CMU Kids corpus," Linguistic Data Consortium, Philadelphia, USA, 1997.
[16] A. Batliner, M. Blomberg, S. D'Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, S. Steidl, and M. Wong, "The PF-STAR children's speech corpus," in Proc. Interspeech 2005, Lisbon, Portugal, Sep. 2005.
[17] C. Weiss, L. Oliveira, S. Paulo, C. Mendes, L. Figueira, M. Vala, P. Sequeira, A. Paiva, T. Vogt, and E. André, "Ecircus: Building voices for autonomous speaking agents," in Proc. 6th ISCA Speech Synthesis Workshop, Bonn, Germany, Aug. 2007.
[18] F. Charpentier and E. Moulines, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," in Proc. Eurospeech 1989, Paris, France, Sep. 1989.
[19] H. Meinedo, "Audio pre-processing and speech recognition for broadcast news," Ph.D. dissertation, Instituto Superior Técnico, Lisbon, Portugal, 2008.
[20] K. Sjölander and J. Beskow, "WaveSurfer - an open source speech tool," in Proc. ICSLP 2000, Beijing, China, 2000.
[21] D. van Leeuwen and N. Brümmer, "Channel-dependent GMM and multi-class logistic regression," in Proc. Odyssey 2006, 2006.


Unsupervised SVM based 2-Speaker Clustering

Binda Celestino, Hugo Cordeiro, Carlos Meneses Ribeiro

Multimedia and Machine Learning Group, Department of Electronic Telecommunication and Computer Engineering, Instituto Superior de Engenharia de Lisboa (ISEL), Portugal
[email protected], {hcordeiro,cmeneses}@deetc.isel.ipl.pt

Abstract

This paper proposes two algorithms for the task of 2-speaker unsupervised clustering. The first one creates two SVM models, one for each speaker. The second creates only one SVM model, with each speaker assigned to one class of the same model. These clustering algorithms are based on traditional two-class SVMs and use MLSF coefficients as acoustic features to represent the speakers.
Tests were conducted on the audio stream of two interview videos in Portuguese, each one with two male speakers. Results must be considered preliminary, but whenever the speech segmentation was well conceived no errors were found.
Index Terms: speaker clustering, speech segmentation, speaker segmentation, support vector machine, mel line spectrum frequencies.

1. Introduction

There are several applications that demand the identification of speakers during a conversation. Typically, speaker segmentation and clustering is used as a pre-processing stage of another task, for example automatic speaker recognition. The aim of speaker clustering and segmentation applied to a multi-speaker conversation is to group all segments belonging to the same speaker. Normally, there is no a priori information about the number of speakers in the conversation, or about who they are. In a telephone conversation, however, only two speakers are involved and this can be known in advance.
The most common approaches to speaker segmentation use the Bayesian Information Criterion (BIC) [1] or the Generalised Likelihood Ratio (GLR) [2]. These are supported by Gaussian mixture models (GMM). Alternative methods can use Support Vector Machines (SVM) [3] and multi-class SVMs [4]. The most commonly used acoustic features are the well-known MFCC coefficients. Mel Line Spectrum Frequencies (MLSF) were also proposed in [5] as an alternative feature, shown to have performance similar to MFCC coefficients.
This paper proposes two algorithms for the task of 2-speaker unsupervised clustering. The first one creates two SVM models, one for each speaker. The second one creates only one SVM model, with each speaker assigned to one class of the same model. These clustering algorithms are based on traditional two-class SVMs and use MLSF coefficients as acoustic features to represent the speakers.
To have a complete speaker segmentation system, speech segmentation must be performed before speaker clustering. These speech segments must contain speech from just one speaker. Tests were conducted on the audio stream of two interview videos in Portuguese, each one with two male speakers. Speaker segmentation is evaluated by computing the false alarm rate (FAR) and the missed detection rate (MDR). Results must be considered preliminary, but whenever the speech segmentation was well conceived no errors were found.
The rest of the paper is organized as follows: the next section describes the proposed system to segment two speakers, which includes the proposed clustering algorithms; section 3 presents the experimental tests and results; and the last section concludes and outlines some future work.

2. Proposed System

The proposed system for speaker segmentation is divided into two steps. The first step segments the input speech into segments that are likely to contain speech from just one speaker. The second step clusters these speech segments and produces the final speaker segmentation. The clustering algorithm relies on SVM models trained with MLSF coefficients as acoustic features.

2.1 Speech segmentation

The first step of the speaker segmentation is to segment the input speech signal. This is achieved through an energy-based voice activity detector (VAD) adapted from [6]. The main drawback of using an algorithm as simple as this emerges when speakers interrupt each other without any pause, creating a segment with two speakers. In consequence, a missed detection failure is introduced, as the unsupervised clustering algorithm in the second step cannot split these segments.

2.2 MLSF coefficients

MLSF was proposed in [5] as a feature that carries speaker information. This feature is a modification of the well-known LSF coefficients to include perceptual information. This is achieved by computing the autocorrelation coefficients as the inverse Fourier transform of the mel-spectrum energies, originating mel-autocorrelation coefficients and therefore mel line spectrum frequencies.
The encoding properties of the LSF are widely known and typically used in speech coders. MLSF coefficients have the same representation characteristics as the LSF, arranged on the unit circle, and can benefit from the advantages offered by quantization to recognize speakers remotely without transmission of the speech signal itself. The differences between the MLSF poles of the same frame also contain speaker information, as they are related to the formant bandwidths.
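One plausible reading of this procedure is sketched below: mel filterbank energies are inverted to mel-autocorrelation coefficients, LPC coefficients are obtained from them, and the line spectrum frequencies are the root angles of the usual sum and difference polynomials. This is a hedged approximation, not the authors' implementation; the sampling rate, FFT size, number of mel bands and analysis order are assumptions.

```python
import numpy as np
import librosa
from scipy.linalg import solve_toeplitz

def mlsf_like(frame, sr=8000, n_fft=512, n_mels=24, order=16):
    """Sketch of a mel-warped LSF analysis of one speech frame."""
    # Power spectrum of the windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    # Mel-spectrum energies.
    mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels) @ spec
    # Mel-autocorrelation: inverse Fourier transform of the mel-spectrum energies.
    r = np.fft.irfft(mel)
    # LPC coefficients from the mel-autocorrelation (autocorrelation method).
    a = solve_toeplitz(r[:order], r[1:order + 1])
    lpc = np.concatenate(([1.0], -a))
    # LSFs: angles of the roots of the sum and difference polynomials of A(z).
    p = np.concatenate((lpc, [0.0])) + np.concatenate(([0.0], lpc[::-1]))
    q = np.concatenate((lpc, [0.0])) - np.concatenate(([0.0], lpc[::-1]))
    angles = np.angle(np.concatenate((np.roots(p), np.roots(q))))
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])  # ideally `order` values in (0, pi)
```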


2.3 SVM speaker model and frame score

MLSF coefficients are computed frame by frame and, for each speaker, a codebook of MLSF vectors trained with the LBG algorithm acts as the input feature. A speaker is represented by an SVM model with a Gaussian kernel. This model is trained with the respective MLSF codebook against a world MLSF codebook (trained with several different speakers). All MLSF codebooks have the same number of codewords, independently of the length of the speech material. This accounts for the different amounts of speech available for different speakers.
The evaluation of a segment X in a model Sj trained for speaker j is frame based. Each frame is scored by the corresponding SVM model and the decision is made based on the frame score over the entire segment, defined as the rate of frames classified into the model:

K(X | Sj) = (number of frames classified in model j) / (total number of frames)    (1)

In the context of speaker identification, the highest frame score corresponds to the identified speaker. In the context of speaker verification, the frame score is compared to a given threshold; a speaker is accepted as the true speaker if the frame score is higher than this threshold.
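A minimal sketch of this per-speaker modelling and of the frame score of Eq. (1), assuming scikit-learn; the codebook arrays are hypothetical placeholders, and this is not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_speaker_svm(speaker_codebook, world_codebook):
    """One SVM per speaker: speaker codewords (label 1) vs. world codewords (label 0)."""
    X = np.vstack([speaker_codebook, world_codebook])
    y = np.concatenate([np.ones(len(speaker_codebook)), np.zeros(len(world_codebook))])
    return SVC(kernel="rbf", gamma="scale").fit(X, y)   # Gaussian (RBF) kernel

def frame_score(segment_frames, model):
    """Eq. (1): fraction of the segment's frames classified into the speaker model."""
    return float(np.mean(model.predict(segment_frames) == 1))
```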
2.4 Unsupervised clustering algorithms

The two-speaker clustering algorithm is based on [7], but with some major adaptations. Instead of being based on log-likelihood ratios over a UBM with MFCC coefficients as features, this algorithm is based on SVM models and uses the frame score to measure the fit between segments and the speaker model, with MLSF as acoustic features. Some simplifications are also made to obtain the first approximation of the two speaker models. The algorithm is described as follows:

Stage 1: Initial models
1. Segment the test speech file.
2. Generate a speaker model S1 based on the biggest segment, trained with a codebook of 128 codewords, thereby fixing one of the two speakers.
3. For all the segments longer than 2 seconds, compute the frame score. Generate a speaker model S2 based on the segment with the lowest frame score, the one most likely to belong to the second speaker.

Stage 2: Initial clustering
4. All the remaining segments longer than 1 second are scored against the speaker models S1 and S2. The difference between the respective frame scores is computed as:

ΔS = K(X | S1) − K(X | S2)    (2)

5. The segment with the largest positive ΔS (more likely to be from speaker 1) is used to retrain model S1.
6. The segment with the lowest negative ΔS (more likely to be from speaker 2) is used to retrain model S2.
7. Repeat steps 4 to 6 until no more segments longer than 1 second are left.

Stage 3: Final models
8. Use the speaker models S1 and S2 to compute the difference between frame scores ΔS for all segments longer than 1 second.
9. For the segments assigned to model S1, create a new speaker model S1 from the segments within the upper half of ΔS. This new model is trained with codebooks of 1024 codewords.
10. For the segments assigned to model S2, create a new speaker model S2 from the segments within the bottom half of ΔS. This new model is also trained with codebooks of 1024 codewords.

Stage 4: Final clustering
11. Use the speaker models S1 and S2 to compute ΔS for all the segments. If ΔS is positive the respective segment is assigned to speaker 1; if ΔS is negative the respective segment is assigned to speaker 2.
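The sketch below illustrates Stages 2 and 4 of this first algorithm. It reuses the frame_score helper from the previous sketch; `retrain` and `duration` are hypothetical helpers (retraining an SVM with an enlarged codebook, and measuring segment length in seconds), so this is only an outline of the loop, not the authors' code.

```python
import numpy as np

def delta_s(segment_frames, model_s1, model_s2):
    """Eq. (2): difference of the frame scores of a segment under the two speaker models."""
    return frame_score(segment_frames, model_s1) - frame_score(segment_frames, model_s2)

def initial_clustering(segments, model_s1, model_s2, retrain, duration):
    """Stage 2: alternately grow each speaker model from its best-matching remaining segment."""
    pool = [s for s in segments if duration(s) > 1.0]
    while pool:
        scores = [delta_s(s, model_s1, model_s2) for s in pool]
        model_s1 = retrain(model_s1, pool.pop(int(np.argmax(scores))))   # most speaker-1-like
        if not pool:
            break
        scores = [delta_s(s, model_s1, model_s2) for s in pool]
        model_s2 = retrain(model_s2, pool.pop(int(np.argmin(scores))))   # most speaker-2-like
    return model_s1, model_s2

def final_clustering(segments, model_s1, model_s2):
    """Stage 4: assign every segment by the sign of its score difference."""
    return [1 if delta_s(s, model_s1, model_s2) > 0 else 2 for s in segments]
```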

The clustering algorithm above is based on the difference of frame scores between the two speaker models. This suggests a second algorithm based on one single model, where the two speakers are trained against each other. This new algorithm is described as follows:

Stage 1: Initial model
1. Segment the test speech file.
2. Generate a speaker model S1 based on the biggest segment, trained with a codebook of 128 codewords, thereby fixing one of the two speakers.
3. For all the segments longer than 2 seconds, compute the frame score. Generate a two-speaker model S12 based on two segments: the biggest segment and the segment with the lowest frame score. The former is more likely to be from one speaker and the latter more likely to be from the second speaker.

Stage 2: Initial clustering
4. All the remaining segments longer than 1 second are scored against the two-speaker model S12.
5. The segment with the biggest frame score (more likely to be from speaker 1) and the segment with the lowest frame score (more likely to be from speaker 2) are used to retrain the two-speaker model S12, provided they obtain more than 50% and less than 50%, respectively.
6. Repeat steps 4 and 5 until no more segments longer than 1 second are left.

Stage 3: Final model
7. Use the two-speaker model S12 to compute the frame score for all segments longer than 1 second.
8. Create a new two-speaker model S12 based on the top 25% frame scores (more likely to be from speaker 1) and the lowest 25% frame scores (more likely to be from speaker 2). This new model is trained with codebooks of 1024 codewords.
9. If necessary for future identification or verification, create separate speaker models S1 and S2, based on the same assumption as in step 8.

Stage 4: Final clustering
10. Use the two-speaker model S12 to compute the frame score for all the segments. If the frame score is greater than 50% the respective segment is assigned to speaker 1; if the frame score is less than 50% the respective segment is assigned to speaker 2.

If speaker identification of the two clustered speakers must be performed against a set of previously trained speaker models, the cluster segments, taken as a whole, must be scored against this set of models, and the highest frame score corresponds to the identified speaker.

3. Tests and Results

Tests were conducted on the audio stream of two interview videos in Portuguese, each one with two male speakers. For demonstration purposes, the identified speaker was subtitled in the original video.
As acoustic features, MLSF coefficients are computed frame by frame. Each frame has 20 ms, with 50% overlap between frames. The feature vector order is 32: 16 MLSF coefficients plus 16 corresponding temporal deltas. As in cepstral mean subtraction, all features are normalized to have zero mean and unit variance.
For each speaker, a codebook of MLSF coefficients trained with the LBG algorithm acts as the input feature to the SVM classifier. For the world model, 10 male speakers from the "2002 NIST Speaker Recognition Evaluation Corpus" [8] are used.
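The LBG codebook training mentioned here is essentially binary-split k-means. A hedged sketch follows, using scikit-learn's KMeans for the refinement step and assuming a power-of-two codebook size; it is illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def lbg_codebook(frames, size=128, eps=1e-3):
    """LBG-style codebook: start from the global mean, split every codeword, refine with k-means."""
    codebook = frames.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])  # perturbed split
        km = KMeans(n_clusters=len(codebook), init=codebook, n_init=1).fit(frames)
        codebook = km.cluster_centers_
    return codebook

# e.g. speaker_codebook = lbg_codebook(speaker_mlsf_frames, size=128)
```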
The speaker SVM models are trained with codebooks of 128 codewords in the initial stages 1 and 2 of the algorithms, and with 1024 codewords in the final stages 3 and 4.
One of the two test files is well segmented, as the speakers do not interrupt each other. For this test file both unsupervised clustering algorithms are able to segment the two speakers without any errors.
For the other test file the speech segmentation does not produce segments with only one speaker in all cases, as the speakers tend to interrupt each other. As the clustering algorithms are not able to split these segments, detection failures are introduced. However, the segments containing only one speaker are well grouped.
In order to check whether the clustering algorithms perform well when the segmentation is well done, the segments with two speakers were hand-segmented before running the clustering algorithms. Unsurprisingly, no errors are found in the final speaker segmentation.

4. Conclusions and future work

This work presents an unsupervised clustering algorithm for two unknown speakers. The algorithm is based on SVMs and uses the frame score to measure the fit between segments and the speaker model.
The adopted feature is a codebook of MLSF coefficients. MLSF coefficients have the same characteristics as LSF and benefit from the advantages offered by quantization to recognize speakers remotely without transmission of the speech signal itself.
Before clustering, the input speech must be segmented into segments containing speech from just one speaker. The proposed speech segmentation, based on energy, cannot discriminate speakers interrupting each other, which compromises the clustering. However, when the speech segmentation is well conceived, preliminary tests show the feasibility of the proposed unsupervised clustering algorithms: although tested with only two files, no errors were found.
A set of improvements can be made to this system. Future work will focus on several directions, namely:
(1) Improve the speech segmentation process, particularly when the speakers interrupt each other. This implies discarding the simple energy-based VAD segmentation and adopting a more complex speaker segmentation method, which finds speaker changes based on maxima of some distance measure between two adjacent windows shifted along the speech signal.
(2) Expand the clustering algorithm to detect multiple speakers. This can be obtained with a multi-class SVM model.
(3) Identify the clustered speakers if they belong to a set of speakers with known models. This is performed by evaluating all the models; the model with the highest frame score corresponds to the identified speaker.
Finally, evaluation on a more extensive and consistent corpus must be performed, and results must be compared with a reference system.

5. References

[1] Rissanen, J., "Stochastic Complexity in Statistical Inquiry", Series in Computer Science, Vol. 15, World Scientific, Singapore, Chapter 3, 1989.
[2] Gish, H., Siu, M.-H., Rohlicek, R., "Segregation of speakers for speech recognition and speaker identification", IEEE International Conference on Acoustics, Speech and Signal Processing, 873-876, 1991.
[3] Fergani, B., Davy, M., Houacine, A., "Speaker diarization using one-class support vector machines", Speech Communication 50, 355-365, 2008.
[4] Nazari, M., Faez, K., "Speaker Detection and Clustering with SVM Technique in Persian Conversational Speech", SETIT 2007, Sciences of Electronic, Technologies of Information and Telecommunications, 2007.
[5] Cordeiro, H., Meneses Ribeiro, C., "Speaker characterization with MLSFs", IEEE Odyssey 2006, the Speaker and Language Recognition Workshop, Puerto Rico, 2006.
[6] Lamel, L., Rabiner, L., Rosenberg, A., Wilpon, J., "An Improved Endpoint Detector for Isolated Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, no. 4, 777-785, 1981.
[7] Deng, J., Zheng, T. F., and Wu, W., "UBM Based Speaker Segmentation and Clustering for 2-Speaker Detection", ICSLP 2006, 116-125, 2006.
[8] "2002 NIST Speaker Recognition Evaluation Corpus", Online: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04



Speaker Verification with Shifted Delta Cepstral Features: Its Pseudo-Prosodic Behavior

Dayana Ribas González, José R. Calvo de Lara

Advanced Technologies Application Center, CENATAV, Cuba {dribas, jcalvo}@cenatav.co.cu

Abstract

This paper examines the linear relation between Shifted Delta Cepstral (SDC) features and the dynamics of prosodic features. SDC features have been reported to produce superior performance to ∆ features in language identification and speaker recognition systems. A selection of the most correlated SDC features is used in speaker verification to evaluate its robustness to channel/handset mismatch. The experiment shows superior performance of the selected SDC features with respect to ∆ features in speaker verification using speech samples from the NIST 2001 Ahumada database.
Index Terms: speaker verification, shifted delta cepstral, prosodic features, channel mismatch.

1. Introduction

Different studies have been done on the use of the dynamic information contained in speech. The most popular approach consists in extracting first and second order time derivatives of instantaneous cepstral features: delta (∆) and delta-delta (∆∆) features. Furui [1] used cepstral coefficients and their regression coefficients for speaker recognition, and established the effectiveness of combining temporal and dynamic features.
∆ and ∆∆ features reflect short-term speech spectral dynamics and do not capture longer term variation in speech, which is reflected in other 'high level' speaker dependent features, such as prosodic, phonetic and linguistic ones [2]. But these last approaches require a lot of speech samples and are also time consuming and computationally complex.
Recently the use of a longer term temporal feature called Shifted Delta Cepstral (SDC), in language recognition [3] and speaker recognition [4, 5], has improved the performance of the recognizer in the face of channel and handset mismatch. As a longer term temporal feature, SDC reflects the dynamics of the spectral features and could have a pseudo-prosodic behavior.
This paper explores this possibility, evaluating the linear relation between SDC features and the dynamics of two prosodic features -pitch and energy- in two different contexts -read text and free expression- and selecting a reduced set of the most correlated SDC features for a speaker recognition experiment under channel and handset mismatch conditions.
The rest of the paper is organized as follows. Section 2 describes the features. Section 3 evaluates the linear relation between SDC features and the dynamics of pitch and energy. Section 4 describes the experiment and results. Section 5 concludes this work and gives future research directions.

2. Shifted Delta Cepstral and Prosodic Features

For an efficient representation of the cepstral dynamic trajectory over a short segment of speech, Furui [1] suggested the use of an orthogonal polynomial fit of each cepstral coefficient trajectory c(t) over a finite-length time window. The first order ∆ coefficient, or generalized spectral slope in time, is denoted as:

∆c(t) = [ Σ_{d=−D}^{D} d · h_d · c(t + d) ] / [ Σ_{d=−D}^{D} h_d · d² ]    (1)

A rectangular window (h_d = 1) of reasonable length has to be used to ensure a smooth fit to the data points from one frame to the next. ∆ and ∆∆ features have usually been calculated using Eq. (1) with D between 2 and 4, depending on the frame time length.
Originally proposed by Bielefeld [6], SDC features are specified by a set of 4 parameters, (N, d, P, k), where:
• N: number of coefficients in each cepstral vector.
• d: time advance and delay for the delta computation.
• P: time shift between consecutive blocks.
• k: number of blocks whose delta coefficients are concatenated to form the SDC vector.
First, an N-dimensional cepstral feature vector is computed for each speech frame t; then each coefficient is differenced over a window of ±D frames to obtain the ∆ features; finally, k different ∆ features, spaced P frames apart, are stacked to form an SDC feature vector for each frame. The SDC vector at frame time t is given by the concatenation, for i = 0 to k−1 blocks, of all the ∆c(t + iP), where:

∆c(t + iP) = [ Σ_{d=−D}^{D} d · c(t + iP + d) ] / [ Σ_{d=−D}^{D} d² ]    (2)

Eq. (2) is a generalization of Eq. (1) with h_d = 1, including the iP time shift.
The calculation of SDC features does not require extra computational cost with respect to ∆ features, and recent experiments have shown an improvement of speaker recognition performance [4, 5] without an increase in dimensionality.
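To make the (N, d, P, k) parametrization concrete, the following numpy sketch computes Eq. (2) with a rectangular window. The defaults mirror the SDC(12,2,2,2) configuration selected later in the paper (N is implied by the number of columns); the edge-padding policy is an assumption, not the authors' implementation.

```python
import numpy as np

def sdc(cepstra, D=2, P=2, k=2):
    """Shifted delta cepstra per Eq. (2): k regression deltas, P frames apart, concatenated."""
    T, N = cepstra.shape
    pad = D + (k - 1) * P
    c = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")      # reuse edge frames at the borders
    den = sum(d * d for d in range(-D, D + 1))
    out = np.empty((T, N * k))
    for t in range(T):
        blocks = []
        for i in range(k):
            centre = t + pad + i * P
            num = sum(d * c[centre + d] for d in range(-D, D + 1))
            blocks.append(num / den)                            # Delta c(t + iP), h_d = 1
        out[t] = np.concatenate(blocks)
    return out
```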

Prosodic features are considered longer-term characteristics because they provide a description of the habitual attributes of the speaker. Pitch and energy have a robust performance in speaker recognition, especially when dealing with noisy and mismatched channels. Besides, they carry speaker-specific information, due to the physical differences between speakers' vocal folds. The impractical aspect of prosodic features is the high amount of data needed for a successful recognition; the procedure required to obtain them is also complicated and computationally expensive [7].
Prosodic information can be used by taking global statistics of the features, like the mean and standard deviation of the pitch and energy. But that approach does not capture the temporal dynamic information of the prosodic feature. Another approach is to obtain a representation of the temporal trajectory of the pitch and energy contours, but that is not efficient enough. Previous work has proven the utility of the derivative functions of pitch and energy in the description of their dynamics [8].

2.1. A pseudo-prosodic behavior of SDC features

Dynamic ∆ and ∆∆ features, evaluated over extended speech time intervals, have been used in speaker recognition as a characteristic which contains useful additional information about speaker identity. Furui [1] recommends a time interval of 90 ms to preserve the transitional information associated with changes from one phoneme to another; Soong and Rosenberg [9] recommend a time interval from 100 to 160 ms to obtain good estimates of the trend of spectral transitions between syllables.
Alternatively, SDC, as a longer-term temporal feature, describes the spectral dynamics of speech. Cepstral features contain information about the speech formant structure, and their dynamics can reflect the movement and position of the vocal and nasal articulators, if the time interval is long enough. In each frame, SDC features reflect the temporal dynamics of the articulators in the next frames, as a pseudo-prosodic feature vector, computed without having to model the prosodic structure of the speech.
Three combinations of SDC features are proposed to obtain a good estimate of the dynamics of spectral transitions and to compare the behavior of SDC and cepstral + ∆ features. The value k was fixed at 2 to ensure similar dimensionality between features. The value of D was chosen considering the necessary and sufficient time interval. Table 1 shows the combinations of SDC features used:

Table 1: SDC feature combinations.
D   P   k   frames   time interval
2   2   2   7        147 ms
2   3   2   8        168 ms
3   2   2   9        189 ms

This work evaluates the pseudo-prosodic behavior of SDC features through the linear relation between SDC and the dynamics of pitch and energy. Then, the most correlated SDC feature vectors will be selected to evaluate their robustness in a telephone speaker recognition experiment.

3. Temporal relation of SDC features with prosodic features

To evaluate the linear relation that could exist between SDC and prosodic features, this work uses the temporal correlation between a time sequence of SDC features and the dynamics of pitch and energy. The cross-correlation between two N-length sequences x and y provides a statistical comparison of both as a function of the time shift m and indicates the strength and direction of a linear relationship between them:

Φ_xy[m] = (1 / (2N + 1)) Σ_{t=1}^{N−m} x[t] · y[t + m]    (3)

If x and y are standardized, the limits of the cross-correlation are −1 ≤ Φ_xy[m] ≤ 1, the bounds ±1 indicating maximum correlation and 0 indicating no correlation. A high negative correlation indicates a strong inverse linear relation.

3.1. Cross-correlation between SDC components

The proposed combinations of SDC features (Table 1) are constituted by two blocks, ∆c(t) and ∆c(t+2), obtained with Eq. (2) evaluated at i = 0, 1, with P = 2, 3 and D = 2, 3. Both blocks are highly correlated, due to the strong linear dependence between them. The cross-correlation between two consecutive blocks of any SDC vector is +1 at a distance P from the lag m = 0, and presents maximum negative correlation at two lags symmetrical with respect to P, at (D+2). Figure 1 shows, for the combination SDC(N, 2, 2, 2), the correlation of ∆c(t), the cross-correlation between both blocks of SDC, and the correlation of the mean of both blocks; all of them have the same behavior.

Figure 1: Cross-correlation between two consecutive SDC blocks.

This property of high correlation between any two consecutive blocks is used to simplify the computation of the cross-correlation between SDC and prosodic features in this work, representing the SDC feature as the mean of the blocks ∆c(t) and ∆c(t+2).

3.2. Cross-correlation between SDC and ∆pitch/∆energy

To evaluate the correlation between the mean SDC features and the dynamics of the prosodic features, two expressions -read text and spontaneous speech- of 30 speakers of the NIST 2001 Ahumada database [10] were used, representing about 90 minutes of telephone speech. 12 MFCC+∆ vectors and their corresponding SDC vectors, together with pitch and energy values, were synchronously obtained in each frame, to form the time sequences. ∆pitch and ∆energy were calculated using Eq. (1) with D=2. Mean and variance normalization was applied as a feature standardization method.
The cross-correlation of the three proposed combinations of SDC features with ∆pitch and ∆energy presents very similar behavior with respect to the upper and lower peak values and their lag positions. So, the combination SDC(12,2,2,2) was selected for the experiment, as the least computationally expensive SDC feature of Table 1.
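A small sketch of this correlation analysis is given below. It standardizes the two sequences and uses the simple biased 1/N estimate rather than the exact normalization of Eq. (3); the example arrays in the comment are hypothetical.

```python
import numpy as np

def xcorr(x, y, max_lag):
    """Normalized cross-correlation of two equal-length sequences over lags -max_lag..max_lag."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    N = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    phi = np.empty(len(lags))
    for k, m in enumerate(lags):
        if m >= 0:
            phi[k] = np.sum(x[: N - m] * y[m:]) / N
        else:
            phi[k] = np.sum(x[-m:] * y[: N + m]) / N
    return lags, phi

# e.g. correlate one SDC component with the delta-pitch track (hypothetical arrays):
# lags, phi = xcorr(sdc_track[:, 3], delta_pitch, max_lag=20)
```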


The results of the cross-correlation evaluation between each one of the 12 SDC features and ∆energy and ∆pitch are shown in Table 2, with the SDC features organized in decreasing order of correlation. The highest correlations of SDC were obtained with respect to ∆energy. In general the correlation peaks are negative, reflecting an inverse linear relation: an increase in one time sequence implies a decrease in the other.
Although the values of the cross-correlation peaks are not very impressive, some SDC features are more correlated than others. The most correlated values are between −0.65 and −0.35, and the rest are between −0.2 and 0.3.

Table 2: SDC features organized in decreasing order of cross-correlation.
order   1       2       3       4
∆E      sdc4    sdc5    sdc3    sdc2
xcorr   −0.65   −0.56   −0.55   −0.45
∆P      sdc4    sdc6    sdc5    sdc3
xcorr   −0.67   −0.50   −0.48   −0.37
order   5       6       7       8
∆E      sdc6    sdc9    sdc11   sdc8
xcorr   −0.37   −0.35   −0.27   −0.25
∆P      sdc9    sdc7    sdc1    sdc2
xcorr   −0.35   −0.35   0.35    −0.30
order   9       10      11      12
∆E      sdc10   sdc12   sdc7    sdc1
xcorr   −0.25   −0.25   −0.22   0.12
∆P      sdc11   sdc12   sdc10   sdc8
xcorr   −0.3    −0.27   −0.2    −0.18

Then, two vectors of six SDC features were used in the speaker verification experiment, appended to the MFCC vector: a first vector, more correlated with ∆energy, composed of sdc2, sdc3, sdc4, sdc5, sdc6 and sdc9, and a second vector, more correlated with ∆pitch, composed of sdc3, sdc4, sdc5, sdc6, sdc7 and sdc9. Both resulting vectors have the same dimensionality as the MFCC+∆ vector.

3.3. Experiments and Results

NIST 2001 Ahumada [10] is a speech database of 103 Spanish male speakers, acquired under controlled conditions for speaker characterization and identification. A speaker verification experiment is performed using ten phonologically and syllabically balanced phrases from telephone sessions.
The training sample set is obtained under good handset/channel characteristics, concatenating the ten balanced phrases (about 40 s of speech) of each one of 50 client speakers. The testing sample sets are obtained with each one of the phrases of the same speakers in another session (about 5 s of speech each), made using 9 randomly selected standard handsets, each speaker using one of them.
For each handset, three characteristics were reported: (a) microphone sensitivity, (b) microphone band-pass frequency response, and (c) signal-to-noise ratio in its associated channel. The test was performed with samples of those 50 clients who speak under the worst mismatch condition, in order to evaluate the robustness against channel mismatch due to:
• low microphone handset sensitivity (< 1 mV/Pa)
• low microphone band-pass frequency response (< 20 dB)
• low signal-to-noise ratio in the channel (< 30 dB)
Each frame of speech is represented by a 12-dimensional MFCC feature vector. Cepstral mean and variance normalization is applied to the MFCC features. The ∆ cepstral vector is obtained from each cepstral feature using Eq. (1) with D=2. The SDC(12,2,2,2) vector is obtained by concatenating one additional ∆ cepstral vector, separated by P=2 frames, to the original ∆ cepstral vector. This work evaluates the behavior of the two selected SDC feature vectors (Section 3.2), appended to the MFCC vector, with respect to the MFCC + ∆ vector. Thus, three different sets of features with the same dimensionality are used in the experiment:
1. 12 MFCC + 12 ∆, dimension 24 (baseline): M-D
2. 12 MFCC + 6 SDC more correlated with ∆energy: M-SDC-E
3. 12 MFCC + 6 SDC more correlated with ∆pitch: M-SDC-P
The experiment performance is evaluated using a 64-mixture GMM/UBM classifier [11], trained and tested with the ten balanced phrases of 50 client speakers of the database. The ten balanced phrases of another subset of 50 non-client speakers are used to train the 256-mixture UBM.
Experiment results are presented as detection error tradeoff (DET) plots:

Figure 2: Speaker verification under low microphone handset sensitivity (< 1 mV/Pa).

Figure 3: Speaker verification under low microphone band-pass frequency response (< 20 dB).
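As a rough illustration of the GMM/UBM back-end used above, the sketch below performs mean-only MAP relevance adaptation in the style of the adapted-GMM approach of [11] and scores a test utterance by the average frame log-likelihood ratio. It assumes a diagonal-covariance UBM trained with scikit-learn and a relevance factor of 16; it is not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, frames, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one speaker's frames."""
    post = ubm.predict_proba(frames)                     # (T, M) responsibilities
    n = post.sum(axis=0)                                 # soft counts per mixture
    ex = post.T @ frames / np.maximum(n, 1e-10)[:, None]  # per-mixture data means
    alpha = (n / (n + r))[:, None]
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    spk.weights_, spk.covariances_ = ubm.weights_, ubm.covariances_
    spk.precisions_cholesky_ = ubm.precisions_cholesky_
    spk.means_ = alpha * ex + (1 - alpha) * ubm.means_
    return spk

def llr_score(test_frames, spk, ubm):
    """Average frame log-likelihood ratio used for the verification decision."""
    return spk.score(test_frames) - ubm.score(test_frames)
```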


Figure 4: Speaker verification under low signal-to-noise ratio in the channel (< 30 dB).

The EER and DCF values of the experiments are shown in Table 3. Table 4 then reports the relative reduction of the EER, in percent, for both sets of SDC features.

Table 3: EER and DCF results of the baseline and of the two sets of selected SDC features.
Set of features        Low handset sensitivity   Low handset freq. response   Low s/n ratio
M-D        EER         14.7                      14.2                         15.3
           DCF         0.06                      0.065                        0.068
M-SDC-E    EER         13.7                      13.9                         12.6
           DCF         0.061                     0.066                        0.069
M-SDC-P    EER         13.2                      13.4                         13.4
           DCF         0.068                     0.062                        0.058

Table 4: Reduction in percent of the EER for both sets of selected SDC features with respect to the baseline.
Mismatch condition            SDC correlated with ∆energy   SDC correlated with ∆pitch
low handset sensitivity       6.8                           8.8
low handset freq. response    2.1                           5.6
low s/n ratio in channel      17.6                          13.7

4. Conclusions and Future Work

This work reports the results obtained in the evaluation of a prosodic-related vector of SDC features in speaker verification, using speech samples from mismatched telephone sessions of the NIST 2001 Ahumada database. The DET plots of the speaker verification experiments show:
• a superior performance, with respect to MFCC + ∆ features, of both prosodic-related SDC feature sets (see Table 3).
• a better performance, with respect to MFCC + ∆ features, of the SDC features more correlated with ∆energy (see Figures 2, 3 and 4). This result is consistent with the highest correlation of SDC features being with ∆energy (Section 3.2).
• a superior robustness of both prosodic-related SDC feature sets, mainly under low s/n in the channel, consistent with the robustness of prosodic features (see Table 4).
As SDC features reflect correlation with prosodic features, without additional cost with respect to ∆ features, they must be considered as an alternative to ∆ features, in order to reduce the effects of channel/handset mismatch on speaker verification performance.
Future work will be in the direction of evaluating other relations between SDC features and the dynamics and statistics of prosodic features.

5. References

[1] S. Furui, "Cepstral analysis for automatic speaker verification," IEEE Transactions on ASSP, 29(2):254-272, 1981.
[2] D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, and B. Xiang, "The SuperSID Project: Exploiting High-level Information for High-accuracy Speaker Recognition," Proceedings of the IEEE ICASSP 2003, vol. 4:784-787.
[3] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds, and J. R. Deller Jr., "Approaches to language identification using Gaussian Mixture Models and shifted delta cepstral features," Proceedings of ICSLP 2002, pp. 89-92.
[4] T. Kinnunen, C. W. E. Koh, L. Wang, H. Li, and E. S. Chang, "Temporal discrete cosine transform: Towards longer term temporal features for speaker verification," Proceedings of ICSLP 2006.
[5] J. Calvo, R. Fernández, and G. Hernández, "Channel/Handset Mismatch Evaluation in a Biometric Speaker Verification using Shifted Delta Cepstral Features," Proceedings of CIARP 2007, LNCS 4756, pp. 96-105.
[6] B. Bielefeld, "Language identification using shifted delta cepstrum," Proceedings of the Fourteenth Annual Speech Research Symposium, 1994.
[7] L. Mary and B. Yegnanarayana, "Prosodic features for Speaker Verification," Proceedings of Interspeech 2006.
[8] A. Adami, R. Mihaescu, D. Reynolds, and J. Godfrey, "Modeling prosodic dynamics for Speaker Recognition," Proceedings of ICASSP 2003.
[9] F. Soong and A. Rosenberg, "On the use of instantaneous and transitional spectral information in speaker recognition," IEEE Trans. on Audio, Speech and Signal Proc., 36(6):871-879, 1988.
[10] J. Ortega-Garcia, J. Gonzalez-Rodriguez, and V. Marrero-Aguiar, "AHUMADA: A large speech corpus in Spanish for speaker characterization and identification," Speech Communication, 31:255-264, 2000.
[11] D. Reynolds, T. Quatieri, and R. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, 10:19-41, 2000.

Multilevel and channel-compensated language recognition: ATVS-UAM systems at NIST LRE 2009

Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Javier Franco-Pedroso, Daniel Ramos, Doroteo T. Toledano, and Joaquin Gonzalez-Rodriguez

ATVS Biometric Recognition Group, Universidad Autonoma de Madrid, Spain {javier.gonzalez, ignacio.lopez, javier.franco, daniel.ramos, doroteo.torre, joaquin.gonzalez} @uam.es

Abstract describe the system presented by ATVS to NIST LRE in its This paper presents the systems submitted by ATVS – 2009 edition, consisting of four different combinations of Biometric Recognition Group at 2009 language recognition acoustic and phonotactic subsystems. evaluation, organized by the National Institute of Standards The two ATVS spectral (also known as acoustic) and Technology of United States (NIST LRE’09). Apart subsystems were based in session variability compensated from the huge size of the databases involved, two main first-order sufficient statistics. The first system was built factors turn the evaluation into a very difficult task. First, the according to the FA-GMM linear scoring framework [2] and number of languages to be recognized was the biggest of all the second one is a SVM whose inputs are model NIST LRE campaigns (23 different target languages). supervectors adapted from the first-order compensated Second, the database conditions were strongly variable, with sufficient statistics [3]. The phonotactic components are telephone speech coming from both broadcast news, PhoneSVM composed of seven ATVS tokenizers and three extracted from Voice Of America (VOA) broadcast system, tokenizers made available by Brno University of Technology and conversational telephone speech (CTS). ATVS (BUT). The systems work in a front-end-back-end participation consisted of state-of-the-art acoustic and high- configuration: first, dual models are obtained in the front end level systems incorporating session variability compensation for VOA (22 models, indian-english was not trained in the via Factor Analysis. Moreover, a novel back-end based on front-end because of data scarcity) and CTS (14 models) data. anchor models was used in order to fuse individual systems Second, an anchor model back-end (23 VOA+CTS models, prior to one-vs.-all calibration via logistic regression. Results indian-english learned from other 22 model scores) was used both in development and evaluation corpora show the for fusion. Front-end scores were channel-dependent (22 robustness and excellent performance for most of the VOA/14 CTS) t-normalized while back-end scores are languages (among them, Iberian languages such as Spanish channel-independent (23 VOA+CTS) t-normalized. A and Portuguese) calibration stage was finally used for transforming output scores into log-likelihood ratios (logLR) in order to allow the 1. Introduction use of Bayes thresholds for decision making. The same logLR sets were submitted to the closed- and open-set conditions of Language recognition has been an increasing research area in the evaluation. the last years, mainly due to its interest in applications such as The development process prior to the blind submission was audio segmentation and indexing or information retrieval. carried out by the construction of a corpus, which we have This interest is also motivated by the availability of the called ATVS-Dev09, using Callfriend, LRE'07 and VOA technology to yield acceptable performance, which has databases, and including the 23 languages of LRE’09. fostered the deployment of real-world applications. Among The paper is organized as follows. First, ATVS individual the driving factors of this rapid performance improvement of spectral and high-level systems at the front-end are described state-of-the-art technologies, the efforts of the American in Section 2. Section 3 presents the fusion and calibration National Institute of Standards and Technologies (NIST) back-end. 
Finally, section 4 presents the results for ATVS- deserve special mention [1]. Due to the organization and Dev09 set-up and also for the blind NIST LRE 09 evaluation funding of the Language Recognition Evaluations (LRE), the dataset. ability of the technology to successfully face challenging problems has achieved a remarkable increase. Moreover, 2. ATVS Systems NIST LRE have settled the foundations for the establishment of common protocols for experimental evaluation, from valuable and rich publicly available databases to well-defined 2.1. Spectral systems evaluation methodologies. Therefore, it has become a highly valuable forum for scientific researchers and technology 2.1.1. DS-CS: FA-GMM linear scoring system developers who aim at adapting their systems to real-world challenges. ATVS DS-CS (DotScoring with Compensated Statistics) Following such objectives, ATVS – Biometric Recognition GMM-FA linear scoring system is based on the work Group of the Universidad Autonoma de Madrid (hereafter, presented in [2]. In this work a complete acoustic system ATVS) has been participating in NIST LRE since 2005, based on generative modelling GMM-FA framework is submitting systems at the spectral and higher levels for blind introduced, adding a new scoring approach based on a linear and public competition. The aim of this work is to approximation to log-likelihood ratios. System shows a great

89 Proceedings of the I Iberian SLTech 2009 performance in both computational burden and recognition speech for training. All these decoders are based on Hidden performance. Markov Models (HMMs) trained using HTK and used for Feature extraction is shared among acoustic systems, decoding with SPHINX. The phonetic HMMs are three-state consisting in 7 MFCCC with CMN-Rasta-Warping left-to-right models with no skips, being the output pdf of concatenated to 7-1-3-7 SDC-MFCCs. Given a UBM, zero each state modeled as a weighted mixture of Gaussians. and first order sufficient statistics are extracted for every The acoustic processing is based on 13 Mel Frequency utterance (train and test); then, first order statistics are Cepstral Coefficients (MFCCs) (including C0) and velocities session-variability compensated using FA, and models are and accelerations for a total of 39 components, computing a generated from the compensated statistics. Finally, scores are feature vector each 10ms and performing Cepstral Mean obtained via dot product between test first compensated Normalization (CMN). statistics and model supervector. For each test utterance, the systems make n-grams with The model for session variability compensation is as the 1-best solution produced by the phonetic decoders. follows: Support Vector Machines (SVMs) take the n-grams as input mmUx' =+ vectors [3]. where the low rank U matrix defines the session variability subspace (U), and x represents the channel factors estimated from the training data used to build the model m. U was trained via EM algorithm after a PCA initialization based on [4][5]. Only top-50 eigenchannels were taken into account. Two different GMM-FA linear scoring systems were developed according to the two different types of data presented in the evaluation. In that sense two UBMs and U matrices were trained from CTS and VOA data respectively. We found this approach to outperform the approach where mixed data (CTS, VOA) is processed to train a unique session Fig. 1: Hierarchical combination of phonotactic systems. variability subspace. T stands for t-norm, performed in a channel dependent way (VOA/CTS) in front-end systems. 2.1.2. SV-CS: SVM chanel compensated supervector Additionally, three speech recognizers (Hungarian, Czech ATVS supervector approach is also based on the statistics and Russian) from BUT (Speech@FIT, Speech Processing computed in 2.1.1 which are adapted from the UBM model Group at Faculty of Information Technology, Brno University (trained with the same data as 2.1.1 but having only 512 of Technology - FIT BUT, Czech Republic) have been used mixtures). Therefore, we obtain a single adapted statistic per as additional high-quality tokenizers (Ph8-Ph10). The utterance that summarises its information. Difference between PhoneSVM systems are built then in the same way as with the standard supervector, and statistics-based supervector is ATVS tokenizers. PhoneSVMs are combined in different that in the latter we replace the vector of means of the adapted ways to obtain different Front-end systems, as shown in GMM by the utterance-adapted statistics. figure 1. Each PhX system consists of 22 VOA and 14 CTS models trained separately. Channel dependent t-norm is the last stage of those phonotactic front-ends. 2.2. High level systems

2.2.1. PhX: Phone-SVMs 3. Fusion and calibration Each of the seven different ATVS Phone-SVM subsystems Our back-end/fusion strategy was based on the use of anchor (Ph1-Ph7) is based on the following steps. First a voice models [7], where high-dimensionality input vectors are activity detector segments the test utterance into speech and classified in a single SVM per target model (23) both for non-speech segments. The speech segments are recognized VOA and CTS data. Recently, the anchor models approach with one open-loop phonetic decoder. The best decoding is has been successfully used for speaker verification and used to estimate count-based 1-grams, 2-grams and 3-grams, language identification too [8, 9, 7]. By using anchor models, pruned with a probability threshold, resulting in about 40.000 each utterance is mapped into a model space where the n-grams per recognizer. These are rearranged as a feature relative behaviour of the speech utterance with respect to vector, which is taken as the input of an SVM that classifies other models can be learned. The mapping function consists the test segment as corresponding (or not) to one language of testing every single utterance over a cohort of reference [3]. models, known as anchor models. The feature vector is the The process described above is repeated for the seven concatenation of all the scores. A channel independent T- different open-loop phonetic recognizers used. In particular Norm (models from VOA and CTS) stage was applied for these subsystems use six phonetic decoders trained on scoring normalization. SpeechDat-like corpora, each of which contain over 10 hours In order to take the actual decision we followed a one-vs.- of training material covering hundreds of different speakers. all detection approach to calibrate the output log-likelihood- The languages of these phonetic decoders and the ratios (logLR). Each score for each of the 23 target languages corresponding corpora used are English (with the corpus with in the evaluation was mapped to a logLR assuming a target- ELDA catalogue number S0011), German (S0051), French language-vs.-rest configuration. Thus, a different score-to- (S0185), Arabic (S0183 + S0184), Basque (S0152) and logLR mapping was performed per target language. Linear Russian (S0099) (www.elda.org). We have also included a logistic regression [10] was trained on the complete 7th phonetic decoder in Spanish trained on Albayzin [6] development set of scores for each language and for each downsampled to 8 kHz, which contains about 4 hours of given duration (3s, 10s and 30s) separately. The FoCal toolkit 90 Proceedings of the I Iberian SLTech 2009 has been used in order to train logistic regression The test material (ATVSDevTest) was obtained from (http://niko.brummer.googlepages.com/focal). After LRE07Test (for target languages in both LRE07 and LRE09), calibrating logLR values, the logarithm of the Bayes and from manually labeled data from VOA2 and VOA3. A threshold has been used in order to take decisions. total number of about 15000 segments (30s, 10s and 3s) were used. The evaluation included about 15000 segments per duration (~45000 segments) and therefore about 1 million trials are defined, because every utterance is faced against every language model (23 languages). Details about the protocol can be found in the NIST LRE’09 evaluation plan [11]. In order to assess performance, two different metrics are used in this paper, both evaluating the capabilities of one-vs.- all language detection. 
On the one hand, DET curves measure the discrimination capabilities of the system. On the other hand, Cavg is a measure of the cost of taking bad decisions, and therefore it considers not only discrimination, but the ability of setting optimal thresholds (i. e., calibration). In this paper we also show the Cavg value of our calibrated systems for the Bayes threshold. Details about NIST performance measures can be found in [11].

Fig. 2: ATVS Fusion Scheme. T stands for t-norm, and C 4.2. Development Results for calibration.

Different combinations of systems presented in section 2 were Development results in ATVS-Dev09 for all durations submitted leading to a total of four different systems built (30s, 10s, 3s) and all submitted systems are presented in under different criteria: figure 3 while figure 4 shows the Cavg after calibration of the ATVS primary system. • ATVS4 was a fusion of the 10 PhoneSVM systems used (7 from ATVS, 3 using BUT freely-available recognizers) and it evaluates the performance of our high-level technology. • ATVS3 only included the acoustic DS-CS system, which was designed to optimize the computational burden but with a high level of recognition performance. • ATVS2 consisted of a fusion of all our systems, as shown in figure 2. This system illustrates the performance reached by fusing ATVS systems. • ATVS1 (primary) consisted of a fusion of ATVS2 and the primary system of another participant in NIST LRE. This shows how our systems can take advantage of other different sources of language recognition information.

4. Development and evaluation Results Fig. 3: Pooled DETs (EERs in %) of submitted systems 4.1. Databases, protocol and performace metric on ATVS-DevTest09. A closed-set development dataset, known as ATVS-Dev09, composed of portions or all of LRE'05, Callfriend, LRE'07 Results show the performance achieved for every and VOA data (different portions and/or selection criteria for submitted system. It is worth pointing out that acoustic train and test and for each language) was used to test the systems outperform phonotactic ones, but fusion of both submitted systems in the 23 languages of LRE’09. We refer kind of systems improve results, which encourages the closed-set as the task where only target languages are use of multilevel approaches for language recognition. included in the test stage, opposite to open-set where other Performance degradation due to duration of test segments non-target languages can be included. Detailed information is also showed. can be found in the NIST evaluation plan [11]. The training material (ATVS-DevTrain09) for the CTS The effect of using a session variability compensation language models consisted of the Callfriend database, the scheme based on factor analysis is presented in figure 5. full-conversations of NIST LRE 2005 and development data Here, the DS-CS system is evaluated on the ATVS-dev09 of NIST LRE 2007. For Russian data we used also RuSTeN with and without session variability compensation A (LDC 2006S34 ISBN 1-58563-388-7, www.ldc.upenn.edu). relative improvement on the EER of about of 56% is VOA models are obtained from speech segments (minimum obtained when compensation is applied. length 30s.) extracted from VOA2 and VOA3 long files (except manually labeled files, used for testing) using telephone labels provided by NIST.

91 Proceedings of the I Iberian SLTech 2009

ATvs

Fig. 6: Official results on closed-set 30s task

Fig. 4: Cavg of ATVS1 on ATVS-DevTest09 set for 30s test segments.

ATvs

Fig. 7: Official results on open-set 30s task

5. References [1] NIST LRE Website on http://www.nist.gov/speech/tests/lang/ (Accessed 04 June 2008). [2] N.Brümmer A. Strasheim, et al. ”Discriminative Acoustic Fig. 5: Pooled DETs (EERs in %) with acoustic dot- Language Recognition via Channel Compensated GMM Statistics”. scoring system with and without FA channel Interspeech 2009, Brighton, U.K. Accepted. compensation on ATVS-Dev09 prior to t-nomalization. [3] W. M. Campbell, J. P. Campbell et al. “Support vector machines 30s test segments. for speaker and language recognition,” Computer Speech and Language, vol. 20, no. 2-3, pp. 210–229, 2006. [4] Kenny, P. and Boulianne, G. et al, “Eigenvoice Modeling With 4.3. LRE09 Evaluation Results Sparse Training Data”, IEEE Trans.~on Speech and Audio Processing, vol. 13, no. , pp 345-354. Although a degradation of the performance of systems was [5] R. Vogt and S. Sridharan, “Explicit modelling of session observed in the evaluation with respect to development test, variability for speaker verification,” Computer Speech & Language, the behaviour of the systems in both experimental scenarios is vol. 22, no. 1, pp. 17–38, 2008. consistent. This degradation performance, common to all [6] A. Moreno, D. Poch et al, “ALBAYZÍN Speech Database: Design of the Phonetic Corpus,” in poceedings of EUROSPEECH. Berlin, participants, is due to the database mismatch among the Germany, 21-23 September 1993. Vol. 1. pp. 175-178. development and testing databases, and is a common effect in [7] I. Lopez-Moreno, D. Ramos et al. "Anchor-model fusion for NIST LRE. Moreover, the evaluation database exhibited a language recognition", in Proceedings of Interspeech 2008, Brisbane, higher variability in terms of number of speakers. Australia, September 2008. Figures 6 and 7 show the ATVS primary system evaluation [8] M. Collet, Y. Mami et al. "Probabilistic Anchor Models Approach results for the closed and open set tasks respectively. Results for Speaker Verification", in INTERSPEECH 2005. for the core condition (closed-set, 30s) are comparable to the [9] E. Noor1, H. Aronowitz "Efficient Language Identification using best systems in the evaluation. Moreover, it is worth Anchor Models and Support Vector Machines", in Odyssey 2006 ISBN: 1-4244-0472-X pp 1-6. highlighting the excellent performance of the ATVS primary [10] N.Brümmer, L. Burget et al. ”Fusion of Hetereogeneous speaker system in the open-set condition, where a second rank recognition systems in the STBU submission for the NIST speaker position was obtained. Results in that task prove the recognition evaluation 2006} IEEE Transactions on Audio, Speech robustness of anchor models working under ‘unseen’ and Signal Processing, 2007. Vol 15. pp 2072-2084 languages. [11] The 2009 NIST language recognition evaluation plan ”www.itl.nist.gov/iad/mig/tests/lre/2009/LRE09_EvalPlan_v6.pdf.”


Language Processing



Bilingual Example Segmentation based on Markers Hypothesis

Alberto Simões, José João Almeida

Departamento de Informática, Universidade do Minho, Campus de Gualtar, 4710-057 Braga, {ambs,jj}@di.uminho.pt

Abstract

The Marker Hypothesis was first defined by Thomas Green in 1979. It is a psycho-linguistic hypothesis defining that there is a set of words in every language that marks boundaries of phrases in a sentence. While it remains a hypothesis because nobody has proved it, tests have shown that results are comparable to basic shallow parsers with higher efficiency.

The chunking algorithm based on the Marker Hypothesis is simple, fast and almost language independent. It depends on a list of closed-class words, which are already available for most languages. This makes it suitable for bilingual chunking (there is no requirement for separate language shallow parsers).

This paper discusses the use of the Marker Hypothesis combined with Probabilistic Translation Dictionaries for example-based machine translation resources extraction from parallel corpora.

Index Terms: Marker Hypothesis, Probabilistic Translation Dictionaries, Translation Examples, Machine Translation

1. Introduction

Machine Translation (MT) and Computer Assisted Translation (CAT) use previously translated documents, for example parallel corpora aligned at the sentence level or the usual CAT translation memories. Unfortunately not all systems are able to adapt big bilingual sentence pairs to new sentences that require translation. This lack of re-usability is the motivation for Example-Based Machine Translation, an MT approach that segments bilingual sentence pairs into smaller segments with higher re-usability. These segments we call translation examples.

There are different articles on translation examples extraction and generalization [1]. Sentence segmentation is generally undertaken with language parsers or directly with generalization approaches [2, 3].

There is other work [4] using the Markers Hypothesis [5] for this segmentation, but it does not deal with the examples alignment or with Iberian languages.

The presented document uses Probabilistic Translation Dictionaries (PTD) [6] together with the Marker Hypothesis to segment translation units into smaller aligned chunks (translation examples).

2. Probabilistic Translation Dictionaries

One of the most important resources for MT is translation dictionaries. They are indispensable, as they establish relationships between the language atoms: words. Unfortunately, freely available translation dictionaries have small coverage and, for minority languages, are quite rare. It is crucial to have an automated method for the extraction of word relationships.

Simões and Almeida [6] explain how a probabilistic word alignment algorithm can be used for the automatic extraction of probabilistic translation dictionaries. This process relies on sentence-aligned parallel corpora.

The algorithm is language independent and therefore can be applied to any language pair. Experiments were executed using diverse languages, which included Portuguese, English, French, German, Greek, Hebrew and Latin [7]. The algorithm is based on word co-occurrences and their analysis with statistical methods. The result is a probabilistic dictionary which associates words in two languages.

These dictionaries map words from a source language to a set of associated words (probable translations) in the target language. Given that the alignment matrix is not symmetric, the process extracts two dictionaries: from source to target language and vice-versa.

The formal specification for one probabilistic translation dictionary (PTD) can be defined as:

    w_A \mapsto \big( occs(w_A) \times (w_B \mapsto P(T(w_A) = w_B)) \big)

Figure 1 shows two entries from the English:Portuguese dictionary extracted from the EuroParl [8] corpus. Note that these dictionaries include the number of occurrences of the word on the source corpus, and a probability measure for each possible translation.

    europe ⇀ 42583 × { europa: 94.7%, europeus: 3.4%, europeu: 0.8%, europeia: 0.1% }
    stupid ⇀ 180 × { estúpido: 47.6%, estúpida: 11.0%, estúpidos: 7.4%, avisada: 5.6%, direita: 5.6% }

Figure 1: Probabilistic Translation Dictionary examples.

Regarding these dictionaries it should be noted that, although we use the term translation dictionaries, not all word relationships on the dictionary are real translations. This is mainly explained by the translation freedom, multi-word terms and a variety of linguistic phenomena.

Notwithstanding the probabilistic nature of these dictionaries, there is work on bootstrapping conventional translation dictionaries using probabilistic translation dictionaries [9] and on the connection between dictionaries quality and corpora genre and languages [10].
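A minimal way to picture the PTD specification above is a nested mapping from each source word to its corpus frequency and a distribution over candidate translations. The sketch below is only illustrative (its entries mirror Figure 1, with the published percentages rewritten as probabilities); it is not the actual NATools data format.

```python
# Illustrative probabilistic translation dictionary (PTD): each source word maps to
# (number of occurrences in the source corpus, {target word: translation probability}).
# Entries adapted from Figure 1; this is a sketch, not the NATools on-disk format.

ptd_en_pt = {
    "europe": (42583, {"europa": 0.947, "europeus": 0.034, "europeu": 0.008, "europeia": 0.001}),
    "stupid": (180, {"estúpido": 0.476, "estúpida": 0.110, "estúpidos": 0.074,
                     "avisada": 0.056, "direita": 0.056}),
}

def translation_prob(ptd, source_word, target_word):
    """P(T(w_A) = w_B) according to the dictionary, or 0.0 if the pair is unseen."""
    _occurrences, translations = ptd.get(source_word, (0, {}))
    return translations.get(target_word, 0.0)

print(translation_prob(ptd_en_pt, "europe", "europa"))   # 0.947
print(translation_prob(ptd_en_pt, "stupid", "direita"))  # 0.056
```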


3. The Marker Hypothesis

The Marker Hypothesis was first defined by Thomas Green [5]. It is a psycho-linguistic hypothesis stating that there is a set of words in every language that marks boundaries of phrases in a sentence.

The algorithm uses a set of marker words (these are closed-class words, like articles, conjunctions, pronouns, prepositions, numerals and some adverbs) and searches for them in the sentence to find phrase boundaries.

To illustrate the algorithm consider the following simple sentence:

    John spent all day playing with his friends.

The markers present on this sentence are the words "all", "with" and "his". These words are marked in the sentence:

    John spent all day playing with his friends.

The extracted segments start with one or more marker words (or at the beginning of the sentence) and end right before the next set of markers (or at the end of the sentence). This sentence would therefore be split into three segments:

    John spent / all day playing / with his friends

For our experiments we obtained an English list of marker words from the MaTrEx [4] project, where the Marker Hypothesis is also being used.

The Portuguese list was created based on the English version and enriched after the analysis of some experiment results. Table 1 shows an extract of these lists.

    English      Portuguese
    on           em; sobre; em cima de; de; relativa
    once         desde que; uma vez que; se
    only         todavia; mas; contudo
    onto         para; para cima de; em direcção a
    other        outro; outra; outras; outros
    our          nosso; nossa; nossos; nossas
    ours         o nosso; a nossa; os nossos; as nossas
    owing to     devido a; por consequência de; por causa de
    own          próprio; ser proprietário
    past         por; para além disso; fora de
    pending      durante; até
    per          por; através de; por meio de; devido a acção de
    plus         mais; a acrescentar a; a adicionar a
    round        em torno de; à volta de
    sort of      espécie de; género de; tipo de; de certo modo
    since        desde; desde que; depois que
    some         algum; alguns; alguma; algumas
    subject to   sujeito a
    such         este; esse; aquele; isto; aquilo
    supposing    supondo; se; no caso de; dada a hipótese de
    than         de; que; do que; que não
    that         aquele; aquela; aquilo; esse; essa; isso; ...
    the          o; a; os; as

Table 1: Markers list excerpt.

To help the reader to evaluate the kind of segment extracted using this algorithm, tables 2 and 3 show the most common segments in EuroParl [11] version 2 for the Portuguese and English languages. Note that these results were obtained processing both sides of the parallel corpora in an independent way.

Some other tests were performed to analyze the more productive markers, as can be seen in table 4. This information is useful to tune the segment alignment algorithm.

    Occur.   Marker   Remaining segment
    34 137   da       comissão
    17 277   do       conselho
    16 891   da       união europeia
    11 379   em       matéria
     9 880   de       trabalho
     9 850   da       união
     9 479   no       sentido
     8 465   da       europa
     8 454   da       UE
     8 004   do       parlamento

Table 2: Most occurring segments in the Portuguese language (from a total of 3 070 398 segments).

    Occur.   Marker   Remaining segment
    13 566   and      gentlemen
    11 466   the      commission
    11 079   in       order
     9 182   to       make
     8 712   to       be
     8 356   to       do
     7 992   of the   european union
     7 941   of the   committee
     7 814   to       say
     7 574   with     regard

Table 3: Most occurring segments in the English language (from a total of 3 103 797 segments).

4. Marker Hypothesis on Translation Units

If we consider a translation unit (for instance, the example above and its translation), and perform segmentation based on the Marker Hypothesis, the obtained result is:

    John spent / all day playing / with his friends

    O João passou / todo o dia / a jogar / com os seus amigos

As can be seen, the number of segments is not the same in different languages. This means that an alignment methodology is needed. A basic approach would be the use of the well known sentence alignment algorithm [12], but this method uses just sentence (or segment) length information. As these segments have similar lengths this algorithm is not the best approach.

Given the availability of Probabilistic Translation Dictionaries that include relationship information between words in the two languages, it is possible to perform a better alignment task.

For the segment alignment a matrix is created where each column represents a segment in the source language and each row represents a segment in the target language. Each cell is filled with the probability that the smaller segment (be it in the source or target language) has its translation in the bigger segment (algorithm presented in figure 2).
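As a concrete illustration of the marker-based chunking described in Section 3, the sketch below splits a sentence at closed-class marker words, keeping runs of consecutive markers in the same segment. The tiny marker set is only for the running example and stands in for the full MaTrEx-derived list.

```python
# Minimal sketch of Marker Hypothesis chunking: a new segment opens at a marker word
# (or at the beginning of the sentence); consecutive markers stay in the same segment.
# The marker list below is a toy subset used only for the running example.

ENGLISH_MARKERS = {"all", "with", "his", "the", "on", "of", "to"}

def marker_segments(sentence, markers):
    segments, current, prev_was_marker = [], [], False
    for word in sentence.split():
        is_marker = word.lower() in markers
        # Open a new segment at a marker, unless we are extending a run of markers.
        if is_marker and current and not prev_was_marker:
            segments.append(" ".join(current))
            current = []
        current.append(word)
        prev_was_marker = is_marker
    if current:
        segments.append(" ".join(current))
    return segments

print(" / ".join(marker_segments("John spent all day playing with his friends", ENGLISH_MARKERS)))
# John spent / all day playing / with his friends
```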


Cells with higher values are selected as good alignment points and the translation examples are extracted. This is shown in Table 5. Note that this example is not typical, but shown here for explanation purposes only.

    Portuguese                      English
    815815   de                     541197   to
    557697   ,                      471332   the
    468409   a                      440903   of
    352064   da                     400417   ,
    297634   do                     370161   and
    232629   e                      252298   of the
    197922   que                    214191   in
    196801   o                      152164   a
    178537   em                     131225   in the
    156299   dos                    112446   for
    [...]                           105992   that
     35394   para a                  92180   on
     33079   que o                   91033   to the
     32213   de um                   78264   we
     31539   nos                     70578   on the
     31492   muito                   67805   this
     30805   às                      65092   that the
    > 234 000 diff. markers         > 198 000 diff. markers

Table 4: More productive markers.

    Data: Consider sA and sB two segments in languages A and B, with length(sA) < length(sB),
          and dic a probabilistic translation dictionary.

    function transProb(dic, sA, sB)
        sMarkers ← markers(sA)
        tMarkers ← markers(sB)
        markProb ← quality(dic, sMarkers, tMarkers)
        sText    ← text(sA)
        tText    ← text(sB)
        textProb ← quality(dic, sText, tText)
        return 0.1 × markProb + 0.9 × textProb
    end

    function quality(dic, Set1, Set2)
        sum ← 0
        for wA ∈ Set1 do
            for wB ∈ dom(Tdic(wA)) do
                if wB ∈ Set2 then
                    sum ← sum + P(wB ∈ Tdic(wA))
        return sum / size(Set1)
    end

Figure 2: Translation probability computation algorithm.
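A direct transcription of the algorithm in Figure 2 into Python is given below. The PTD access and the marker/text splitting helpers are simplified stand-ins, so this is a sketch of the computation rather than the authors' implementation.

```python
# Sketch of the translation-probability computation from Figure 2.
# A PTD entry is (occurrences, {target_word: probability}); 'markers' is the
# closed-class word set used for segmentation. Both are simplified stand-ins.

def quality(dic, set1, set2):
    """Probability mass of set1 words whose PTD translation candidates appear in set2,
    normalised by |set1| (mirrors the quality function in Figure 2)."""
    total = 0.0
    for w_a in set1:
        _, translations = dic.get(w_a, (0, {}))
        for w_b, prob in translations.items():
            if w_b in set2:
                total += prob
    return total / len(set1) if set1 else 0.0

def trans_prob(dic, seg_a, seg_b, markers):
    """Score how likely seg_a (the smaller segment) translates into seg_b."""
    words_a, words_b = seg_a.lower().split(), seg_b.lower().split()
    markers_a = [w for w in words_a if w in markers]
    markers_b = [w for w in words_b if w in markers]
    marker_prob = quality(dic, markers_a, markers_b)
    text_prob = quality(dic, words_a, words_b)
    # Weights taken from Figure 2: markers contribute 10%, the full text 90%.
    return 0.1 * marker_prob + 0.9 * text_prob
```

Filling every cell of the alignment matrix with `trans_prob` and picking the highest-scoring cells reproduces the alignment-point selection described in the text.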

                                        this decision shall take effect    on 16 september 1999
    a presente decisão produz efeitos                  23.18                        5.86
    em 16                                               0.00                       76.41
    de setembro                                         0.00                       85.60
    de 1999                                             0.00                       84.10

Table 5: Alignment Matrix.

As usual on statistical methods, the extracted examples are then sorted and counted. This number of occurrences is a statistical indicator of the alignment quality. Other translation measures can be used to rank the extracted segments.

5. Results analysis

From a total of 1 507 225 different translation examples extracted (an occurrence average of 1.6654) with alignment of one to one segment, table 6 presents the 15 most occurring ones. From these 15 examples just two are not really correct. The first one occurs because the closing parenthesis should be considered a special marker, because it is related with the segment that appears before (unlike the other markers). The second bad example results from the fact that "is" is considered a marker in the English list, while its translation is not in the Portuguese list (all forms of the verb "haver"), and that the original English list does not include "there" as a marker (although it should be).

Tables 7 and 8 show examples of one to two and two to one alignments. The stars mark the segment pairs that we evaluate as problematic. Most of these pairs are quite near translations with just one or two extra words.

As the difference in the number of segments rises, the alignment quality lowers. This fact is not directly related to the used method but to the translation style.

6. Conclusions

The use of the Marker Hypothesis as a tool to segment natural text is easier than the use of complex shallow parser systems because it is easier to configure (easy to define what are or are not markers) and it works almost "out of the box" with little adjustments. Also, it requires little knowledge about the specific language where it is being applied. This makes it versatile to be used on languages which have few resources.

The use of Probabilistic Translation Dictionaries (PTD) to perform segment alignment is quite efficient. Given that the PTD extraction is completely automatic, it is not a bottleneck for the full process.

The translation examples extracted are interesting (although they need an evaluation on a Machine Translation system). The alignment algorithm can be improved, which means that translation examples quality can raise.

For close languages like Portuguese and Spanish we expect to have better quality results. Unfortunately at the time of writing we did not have a list of markers for Spanish nor a fluent Spanish speaker.

Unfortunately these examples can not be used alone in an example-based machine translation system as the boundary friction problem [13] is not solved. After translated examples concatenation a concordancer should be used to uniform the sentence.

7. Acknowledgments

Part of this work was done in the scope of the Linguateca project, contract no. 339/1.3/C/NAC, jointly funded by the Portuguese government and the European Union.


    Occur.    Portuguese              English
    36 886    senhor presidente       mr president
     8 633    senhora presidente      madam president
     3 152    espero                  I hope
     2 930    gostaria                I would like
     2 572    o debate                the debate
     2 511    penso                   I think
     2 356    está encerrado          is closed
     1 939    penso                   I believe
     1 932    muito obrigado          thank
     1 854    em segundo lugar        secondly
     1 809    gostaria                I should like
    ⋆ 1 638   ) senhor presidente     mr president
    ⋆ 1 524   há                      there
     1 423    infelizmente            unfortunately
     1 345    creio                   I believe

Table 6: Top 1 to 1 segment alignments.

    Occur.    Portuguese              English
      253     caros colegas           ladies and gentlemen
      147     senhores deputados      ladies and gentlemen
      143     devo dizer              I have to say
      142     lamento                 I am sorry
      105     congratulo-me           I am pleased
       95     estou convencido        I am convinced
       90     vamos agora proceder    we shall now proceed
    ⋆  90     e senhores deputados    ladies and gentlemen
       90     agradeço                I am grateful
       79     e outros , em nome      and others , on behalf
       76     refiro-me               I am referring
    ⋆  72     muito obrigado          thank you very
       71     congratulo-me           I am glad
       70     passamos agora          we shall now proceed
       66     não há dúvida           there is no doubt

Table 7: Top occurring 1–2 segment alignments (from 360 065 different segments).

    Occur.    Portuguese / English
      986     segue-se na ordem / the next item
      222     ( a sessão é suspensa / ( the sitting was closed
      169     senhor presidente em exercício / mr president-in-office
      148     da sessão de ontem / of yesterday 's sitting
      142     ( o parlamento aprova a acta / ( the minutes were approved
    ⋆ 138     dos assuntos económicos e monetários / and monetary affairs
      113     a proposta da comissão / the commission 's proposal
      110     a proposta da comissão / the commission proposal
      106     período de perguntas / question time
    ⋆ 101     , em nome , sobre a proposta / , on behalf
      100     dos direitos do homem / of human rights
       84     dos direitos da mulher / on women 's rights
    ⋆  72     da direita do hemiciclo / from the right
       67     por interrompida do parlamento europeu / of the european parliament adjourned
       67     é muito importante / it is very important

Table 8: Top occurring 2–1 segment alignments (from 542 671 different segments).

8. References

[1] A. Way, "Translating with examples," in Workshop on Example-Based Machine Translation, M. Carl and A. Way, Eds., September 2001, pp. 66–80.
[2] R. D. Brown, "Adding linguistic knowledge to a lexical example-based translation system," in Eighth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99), Chester, England, August 1999, pp. 22–32. [Online]. Available: http://www.cs.cmu.edu/~ralf/papers.html
[3] ——, "Automated generalization of translation examples," in Eighteenth International Conference on Computational Linguistics (COLING-2000), 2000, pp. 125–131. [Online]. Available: http://www.cs.cmu.edu/~ralf/papers.html
[4] S. Armstrong, M. Flanagan, Y. Graham, D. Groves, B. Mellebeek, S. Morrissey, N. Stroppa, and A. Way, "MaTrEx: machine translation using examples," in TC-STAR OpenLab Workshop on Speech Translation, Trento, Italy, 2006.
[5] T. R. G. Green, "The necessity of syntax markers. Two experiments with artificial languages," Journal of Verbal Learning and Behaviour, vol. 18, pp. 481–496, 1979.
[6] A. M. Simões and J. J. Almeida, "NATools – a statistical word aligner workbench," Procesamiento del Lenguaje Natural, vol. 31, pp. 217–224, September 2003. [Online]. Available: http://alfarrabio.di.uminho.pt/~albie/publications/sepln2003.pdf
[7] A. M. B. Simões, "Extracção de recursos de tradução com base em dicionários probabilísticos de tradução," Ph.D. dissertation, Escola de Engenharia, Universidade do Minho, Braga, May 2008.
[8] P. Koehn, "EuroParl: a multilingual corpus for evaluation of machine translation," 2002, draft.
[9] X. G. Guinovart and E. S. Fontenla, "Técnicas para o desenvolvemento de dicionarios de tradución a partir de córpora aplicadas na xeración do Dicionario CLUVI Inglés-Galego," Viceversa: Revista Galega de Traducción, vol. 11, pp. 159–171, 2005.
[10] D. Santos and A. Simões, "Portuguese-English word alignment: some experiments," in LREC 2008 — The 6th edition of the Language Resources and Evaluation Conference, Marrakech: European Language Resources Association (ELRA), 28–30 May 2008.
[11] P. Koehn, "EuroParl: A parallel corpus for statistical machine translation," in Proceedings of MT-Summit, 2005, pp. 79–86.
[12] W. A. Gale and K. W. Church, "A program for aligning sentences in bilingual corpora," in Meeting of the Association for Computational Linguistics, 1991, pp. 177–184.
[13] R. D. Brown, R. Hutchinson, P. N. Bennett, J. G. Carbonell, and P. Jansen, "Reducing boundary friction using translation-fragment overlap," in MT Summit IX, New Orleans, 2003.


Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages

Fernando Batista1,2, Isabel Trancoso1,3, Nuno Mamede1,3

1L2F - Spoken Language Systems Laboratory - INESC ID Lisboa R. Alves Redol, 9, 1000-029 Lisboa, Portugal http://www.l2f.inesc-id.pt/ 2DCTI – ISCTE - Institute of Science, Technology and Management, Portugal 3IST – Technical University of Lisbon, Portugal {fmmb,imt,njm}@l2f.inesc-id.pt

Abstract

This paper shows experimental results concerning automatic enrichment of the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach. The approach is language independent, as reinforced by experiments performed on Portuguese and Spanish Broadcast News corpora. The discriminative models are trained for a language using spoken and written corpora from that language. This paper provides the first results on Spanish Broadcast News data and the first comparative study between Portuguese and Spanish on this subject.

Index Terms: Rich Transcription, Capitalization, Punctuation marks, Speech processing

1. Introduction

The text produced by a standard speech recognition system consists of raw single-case words, without punctuation marks, with numbers written as text, and with many different types of disfluencies. The missing information makes this representation format hard to read and understand [1], and poses problems to further automatic processing. Capitalization is important for improving human readability, parsing, and NER (Named Entity Recognition). Punctuation marks, or at least sentence boundaries, are important for parsing, information extraction, machine translation, extractive summarization and NER.

These tasks are important modules of the Broadcast News (BN) processing system developed at our lab, which integrates several other core technologies, in a pipeline architecture: jingle detection, audio segmentation, automatic speech recognition (ASR), topic segmentation and indexation, and summarization. The first modules of this system, including punctuation and capitalization, were optimized for on-line performance, given their deployment in the fully automatic subtitling system that is running on the main news shows of the public TV channel in Portugal, since 2008 [2]. This BN processing chain was originally developed for European Portuguese, but was already ported to other varieties of Portuguese (Brazilian and African). The goal of the current work was to port the punctuation and capitalization modules to Spanish, a language for which we recently developed our ASR system, thereby supporting the language independence of our approaches. This paper provides the first results on Spanish BN data and the first comparative study between Portuguese and Spanish, concerning this subject.

This paper is organized as follows: Sections 2 and 3 describe the related work and the adopted approach. Section 4 presents experimental results concerning automatic punctuation and capitalization of Portuguese and Spanish. Section 5 presents the final remarks and the future work.

2. Related work

Whereas speech-to-text core technologies have been developed for more than 30 years, the metadata extraction/annotation technologies are receiving significant importance only in the recent years. For example, [3] contains an entire section dedicated to this subject, while this topic is only briefly mentioned in the first version of this book, published in 2000. Producing rich transcripts usually involves the process of recovering structural information and the creation of metadata from that information. Recovering punctuation marks and capitalization are two relevant MDA (Metadata Annotation) tasks, which contribute to enriching the final recognition output.

The first joint initiatives concerning automatic rich transcription of speech started around 2002. The five-year DARPA-sponsored EARS program supported the goal of advancing the state-of-the-art in automatic rich transcription of speech. The NIST RT evaluation series (http://www.nist.gov/speech/tests/rt/) is another important initiative that supports some of the goals of the EARS program, providing means to investigate and evaluate STT (speech-to-text) and MDE (Metadata Extraction) technologies, and promote their integration. Nevertheless, despite the emerging rich transcription efforts, only a few of the most important MDE tasks are covered by these evaluation plans.

Two different rich transcription methods are proposed and evaluated by [4]. The first method consists of adapting the ASR system for dealing with both punctuation and capitalization. This is done by duplicating each vocabulary entry with the possible capitalized forms, modeling the full-stop with silence, and training with capitalized and punctuated text. The second method consists of using a rule-based NE tagger and punctuation generation. The paper shows that the first method produces worse results, due to the distorted and sparser language model (LM), suggesting a separation between the recognition process and the enriching tasks. The rest of this section describes in more detail the previous work related to each one of the tasks.


2.1. Punctuation

Different punctuation marks can be used in spoken texts, including: comma; period or full stop; exclamation mark; question mark; colon; semicolon; and quotation marks. However, most of these marks rarely occur and are quite difficult to automatically insert or evaluate. Hence, most studies focus either on the full stop or on the comma, which have much higher corpus frequencies. Comma is the most frequent punctuation mark, but it is also the most problematic because it serves many different purposes. It can be used to: introduce a word, phrase or construction; separate long independent constructions; separate words within a sentence; separate elements in a series; separate thousands in a number; and also to prevent misreading. [5] describes a method for inserting commas into text, and presents a qualitative evaluation based on user satisfaction, concluding that the system performance is qualitatively higher than the sentence accuracy rate would indicate.

The work conducted by [6] and [7] uses a general HMM framework that allows the combination of lexical and prosodic clues for recovering punctuation marks. A similar approach was also used for detecting sentence boundaries by [8, 9, 10]. A study, using purely text-based n-gram language models, can be found in [11], showing that using larger training data sets leads to improvements in performance. [12] describes a maximum entropy (ME) based method for inserting punctuation marks into spontaneous conversational speech, which covers comma, full stop, and question mark. Bigram-based features, combining lexical and prosodic features, achieve the best results on the ASR output.

2.2. Capitalization

The capitalization task, also known as truecasing [13], consists of assigning the proper case information to each input word, which may depend on the context. Proper capitalization can be found in many information sources, such as newspaper articles, books, and most of the web pages. Besides improving the readability of texts, capitalization provides important semantic clues for further text processing tasks. The capitalization is not usually considered as a topic by itself. A typical approach, when dealing with processes where capitalization is expected, consists of modifying the process that usually relies on case information in order to suppress the need of that information [14]. An alternate approach is to previously recover the capitalization information, which can also benefit other processes that use case information.

A common approach for capitalization relies on n-gram LMs estimated from a corpus with case information [13, 4]. Another approach consists of using a rule-based tagger, as described in [15], which was shown to be robust to speech recognition errors, while producing better results than case sensitive language modeling approaches. [16] describes an approach to the disambiguation of capitalized words where capitalization is expected, such as the first word of the sentence or after a period, which consists of a cascade of different simple positional heuristics. Other approaches include Maximum Entropy Markov Models (MEMM) [17] and Conditional Random Fields (CRF). A study comparing generative and discriminative approaches can be found in [18]. A recent study on the impact of using huge amounts of data can be found in [11].

3. Approach description

The same approach is used for the punctuation and capitalization tasks, which can be treated as two classification tasks. Our experiments use a discriminative approach, based on maximum entropy (ME) models, which provide a clean way of expressing and combining different aspects of the information. This is specially useful for the punctuation task, given the broad set of lexical, acoustic and prosodic features that can be used. This approach requires all information to be expressed in terms of features, causing the resultant data file to become several times larger than the original one. On the other hand, the memory required for training with this approach increases with the size of the corpus (number of observations). This constitutes a problem, making it difficult to use large corpora for training. However, the classification is straightforward, making it interesting for on-the-fly usage.

Capitalization models are usually trained using large written corpora, which contain the required capitalization information. The consequent memory problem is solved by splitting the corpus into several subsets, and then iteratively retraining with each one separately. The first subset is used for training the first ME model, which is then used to provide initial weights for the next iteration over the next subset. This process goes on until all subsets are used. Although the final ME model contains information from all corpora subsets, events occurring in the latest training sets gain more importance in the final model. As the training is performed with the new data, the old models are iteratively adjusted to the new data. This approach provides a clean framework for language dynamics adaptation: (1) new events are automatically considered in the new models; and (2) with time, unused events slowly decrease in weight [19, 20].

Figure 1 illustrates the classification approach for both tasks. An updated capitalization lexicon containing the capitalization of new words and mixed-case words can be used as a complement for capitalization.

Figure 1: Rich transcription tasks block diagram.

The experiments described in this paper use the MegaM tool [21] for training the ME models, which is open source and efficiently implements limited memory BFGS for multi-class problems (usually outperforming Iterative Scaling methods).
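The subset-by-subset retraining described above can be pictured as the loop below. This is only a schematic sketch: `train_maxent` is a hypothetical helper standing in for the actual ME trainer (the paper uses MegaM, whose invocation is not shown here); the essential point is that each round is initialised with the weights of the previous one, so later subsets carry more weight in the final model.

```python
# Schematic of the iterative retraining over corpus subsets (Section 3).
# 'train_maxent' is a hypothetical stand-in for the real ME trainer; it must accept
# initial feature weights so that each round warm-starts from the previous model.

def split_into_subsets(events, subset_size):
    """Split the (feature vector, label) events into memory-sized chunks."""
    return [events[i:i + subset_size] for i in range(0, len(events), subset_size)]

def retrain_over_subsets(all_events, subset_size, train_maxent):
    weights = None                         # no model exists before the first subset
    for subset in split_into_subsets(all_events, subset_size):
        # Warm-start from the previous weights; recent data gains more importance.
        weights = train_maxent(subset, initial_weights=weights)
    return weights                         # final model, biased towards recent subsets
```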


4. Experimental results

This section describes some experiments recovering punctuation marks and capitalization for the Portuguese and Spanish languages. The evaluation is performed using the performance metrics: Precision, Recall and SER (Slot Error Rate) [22]. Only capitalized words and punctuation marks are considered as slots and used by these metrics. Hence, for example, the SER for the capitalization task is computed by dividing the number of capitalization errors by the number of capitalized words in the reference data.

Tables 1 and 2 show details of the Portuguese and Spanish BN corpora subsets, respectively, which were used for training and evaluation. The Portuguese corpus is a subset of the BN European Portuguese Corpus, collected during 2000 and 2001. The Spanish BN corpus is a recent corpus, collected during 2008 and 2009. The manual orthographic transcription of these corpora provides the reference data, and includes punctuation marks and capitalization information. For each corpus we had access not only to the manual transcription, but also to the automatic transcription. Whereas the manual transcriptions already contain reference punctuation and capitalization, this is not the case of the automatic transcriptions. The required reference was produced by means of word alignments between the manual and automatic transcription. The higher WER (Word Error Rate) of the Portuguese corpus may be attributed to the larger proportion of spontaneous speech, as well as the higher complexity of the Portuguese phonological system.

            #Words   Duration   Planned   Spont.   WER
    Train   477k     52h        54.6%     32.1%    11.3%
    Devel   66k      7h         51.2%     37.6%    20.8%
    Eval    135k     15h        55.6%     35.5%    20.3%

Table 1: Portuguese BN corpus properties.

            #Words   Duration   Planned   Spont.   WER
    Train   152k     15h        71.6%     10.6%    11.0%
    Devel   25k      3h         72.5%     11.2%    17.2%
    Eval    16k      2h         67.9%     14.7%    18.9%

Table 2: Spanish BN corpus properties.

4.1. Punctuation

The punctuation experiments use only the BN data, collected from broadcast TV shows. Tables 3 and 4 show the results achieved for the Portuguese and Spanish data, respectively. The overall results are affected by the comma detection performance, which mostly achieves above 100% SER. The Portuguese evaluation data was annotated by different people, using possibly different criteria, which explains the higher SER values for the comma. Results are strongly affected by the presence of recognition errors, as shown in the performance difference between manual and automatic transcripts. The Spanish corpus contains only a small portion of spontaneous speech, which causes less impact on the overall results.

The following features are used for a given word w in the position i of the corpus: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, pi, pi+1, 2pi−2, 2pi−1, 2pi, 2pi+1, 3pi−2, 3pi−1 (lexical), GenderChgsi, SpeakerChgsi, and TimeGapi (acoustic), where: wi is the current word, wi+1 is the word that follows and nwi±x is the n-gram of words that starts x positions after or before the position i; pi is the part-of-speech of the current word, and npi±x is the part-of-speech n-gram for words starting x positions after or before the position i. GenderChgsi and SpeakerChgsi correspond to changes in speaker gender and speaker clusters; TimeGapi corresponds to the time period between the current and following word. For the moment, only lexical and acoustic features are being used in this task. Nevertheless, prosodic features, which already proved useful for this task, will be included in future experiments. All the existing punctuation marks were replaced by a full stop or a comma: ".", ";", "!", "?", "..." => full stop; ",", "-" => comma.

4.2. Capitalization

The capitalization experiments assume that the capitalization of the first word of each sentence is performed in a separate processing stage (e.g. after punctuation), since its correct graphical form depends on its position in the sentence. Our experiments consider four ways of writing a word: lower-case, first-capitalized, all-upper, and mixed-case (e.g. "McGyver"). The following features were used for a given word w in the position i of the corpus: wi, 2wi−1, 2wi, 3wi−2, 3wi−1.

The Portuguese capitalization model was trained with a newspaper corpus collected from 1999 to 2004 and containing about 148M words. The Spanish capitalization model was trained with the content of online text, daily collected since 2003, and containing about 79M words. The original texts were normalized and all the punctuation marks removed, making them close to speech transcriptions. Only data previous to the evaluation data period was used for training.

The retraining approach described in Section 3 was followed, and the most recent capitalization model was used for processing each evaluation subset. Table 5 shows the corresponding results for both languages. While the performance is similar for written corpora in both languages, there is a significant difference for the speech data, where the performance is better for Spanish. One important explanation is related with the small portion of Spanish spontaneous speech, which causes little impact on the overall results. Nevertheless, the worse performance for the Portuguese data is also due to the unusual topic covered in the news by that time (War on Terrorism). Many foreign names, which can be rarely found in the news, were used by that time.

                 Manual Transc.         ASR output            Newspapers
                 Prec   Rec   SER       Prec   Rec   SER      Prec   Rec   SER
    Portuguese   84.4   86.7  29.1      73.2   77.7  50.4     93.8   86.5  19.0
    Spanish      94.7   85.6  19.0      77.6   74.1  47.1     95.1   83.2  20.8

Table 5: Capitalization results for the BN corpora.
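To make the word-feature templates of Section 4.2 concrete, the sketch below produces, for a position i, the features wi, 2wi−1, 2wi, 3wi−2 and 3wi−1 as plain strings suitable for a maximum entropy toolkit. The feature-name encoding and the padding token are ours, not the representation actually fed to MegaM.

```python
# Sketch: word and word n-gram features used for capitalization (Section 4.2),
# emitted as strings for an ME toolkit. Feature-name formatting is illustrative.

def capitalization_features(words, i):
    w = lambda j: words[j] if 0 <= j < len(words) else "<s>"   # padding for sentence edges
    return [
        "w[i]=%s" % w(i),                                  # wi
        "2w[i-1]=%s_%s" % (w(i - 1), w(i)),                # bigram starting at i-1
        "2w[i]=%s_%s" % (w(i), w(i + 1)),                  # bigram starting at i
        "3w[i-2]=%s_%s_%s" % (w(i - 2), w(i - 1), w(i)),   # trigram starting at i-2
        "3w[i-1]=%s_%s_%s" % (w(i - 1), w(i), w(i + 1)),   # trigram starting at i-1
    ]

sentence = "o presidente da república visitou lisboa".split()
print(capitalization_features(sentence, 4))   # features for the word "visitou"
```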


                   Manual Transcripts                                   Automatic Transcripts
                   Full stop         Comma             ALL              Full stop         Comma             ALL
    Focus          Prec Rec  SER     Prec Rec  SER     Prec Rec  SER    Prec Rec  SER     Prec Rec  SER     Prec Rec  SER
    All            81.1 65.9 49.5    41.6 29.5 111.9   60.3 45.5 69.2   68.7 60.5 67.1    29.8 21.4 128.9   48.9 38.5 88.7
    Planned        85.7 68.7 42.8    34.8 25.7 122.5   64.1 49.6 60.0   75.6 66.4 55.1    26.5 23.4 141.4   53.7 47.2 78.6
    Spontaneous    71.5 59.7 64.1    46.7 32.6 104.6   55.2 40.9 79.8   53.2 47.3 94.3    33.0 20.2 120.7   40.8 28.3 101.6

Table 3: Punctuation results for the Portuguese BN corpus.

                   Manual Transcripts                                   Automatic Transcripts
                   Full stop         Comma             ALL              Full stop         Comma             ALL
    Focus          Prec Rec  SER     Prec Rec  SER     Prec Rec  SER    Prec Rec  SER     Prec Rec  SER     Prec Rec  SER
    All            87.0 67.4 42.7    51.2 31.3 98.5    71.4 49.5 61.7   76.9 58.9 58.9    43.5 23.5 107.0   63.1 41.2 73.9
    Planned        87.9 67.3 41.9    52.8 31.4 96.6    72.8 49.5 59.5   82.3 58.1 54.3    48.0 23.5 102.0   68.3 40.9 69.0
    Spontaneous    85.3 71.8 40.5    49.4 28.9 100.7   69.9 49.5 66.2   68.0 62.2 67.1    32.7 20.2 121.3   53.0 40.3 86.9

Table 4: Punctuation results for the Spanish BN corpus.

5. Conclusions

This paper presents a language independent approach for recovering punctuation marks and capitalization over speech data. Experiments were conducted over Portuguese and Spanish BN corpora. The described approach is now implemented by two punctuation and capitalization modules, which have been integrated in a speech recognition system, currently being used to daily process BN shows on-the-fly, for automatic subtitling.

We plan to port the punctuation and capitalization modules to other languages, for which we recently developed our ASR system, such as English and Brazilian Portuguese. We are currently trying to further improve the performance of the punctuation module by introducing prosodic features, besides the current lexical and acoustic features.

6. Acknowledgements

The authors would like to thank Hugo Meinedo and Helena Moniz for their useful hints and Raquel Martinez for her support with the Spanish corpus. This work was funded by FCT projects PTDC/PLP/72404/2006 and CMU-PT/HuMach/0039/2008. INESC-ID Lisboa had support from the POSI Program of the "Quadro Comunitário de Apoio III".

7. References

[1] D. Jones, F. Wolf, E. Gibson, E. Williams, E. Fedorenko, D. Reynolds, and M. Zissman, "Measuring the readability of automatic speech-to-text transcripts," in Proc. of Eurospeech, pp. 1585–1588, 2003.
[2] J. Neto, H. Meinedo, M. Viveiros, R. Cassaca, C. Martins, and D. Caseiro, "Broadcast news subtitling system in portuguese," in Proc. of ICASSP 2008, pp. 1561–1564, 2008.
[3] D. Jurafsky and J. H. Martin, Speech and Language Processing, ch. 10 - Speech Recognition: Advanced Topics. Prentice Hall, 2008.
[4] J.-H. Kim and P. Woodland, "Automatic capitalisation generation for speech input," Computer Speech & Language, vol. 18, no. 1, pp. 67–90, 2004.
[5] D. Beeferman, A. Berger, and J. Lafferty, "Cyberpunc: a lightweight punctuation annotation system for speech," Proc. of the ICASSP-98, pp. 689–692, 1998.
[6] H. Christensen, Y. Gotoh, and S. Renals, "Punctuation annotation using statistical prosody models," in Proc. of the ISCA Workshop on Prosody in Speech Recognition and Understanding, pp. 35–40, 2001.
[7] J. Kim and P. C. Woodland, "The use of prosody in a combined system for punctuation generation and speech recognition," in Proc. Eurospeech, pp. 2757–2760, 2001.
[8] Y. Gotoh and S. Renals, "Sentence boundary detection in broadcast speech transcripts," in Proc. of the ISCA Workshop: ASR-2000, pp. 228–235, 2000.
[9] E. Shriberg, A. Stolcke, D. Hakkani-Tür, and G. Tür, "Prosody-based automatic segmentation of speech into sentences and topics," Speech Communications, vol. 32, no. 1-2, pp. 127–154, 2000.
[10] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1526–1540, 2006.
[11] A. Gravano, M. Jansche, and M. Bacchiani, "Restoring punctuation and capitalization in transcribed speech," in ICASSP 2009, Taipei, Taiwan, 2009.
[12] J. Huang and G. Zweig, "Maximum entropy model for punctuation annotation from speech," in Proc. of the ICSLP, pp. 917–920, 2002.
[13] L. V. Lita, A. Ittycheriah, S. Roukos, and N. Kambhatla, "tRuEcasIng," in Proc. of ACL-03, pp. 152–159, 2003.
[14] E. Brown and A. Coden, "Capitalization recovery for text," Information Retrieval Techniques for Speech Applications, pp. 11–22, 2002.
[15] E. Brill, "Some advances in transformation-based part of speech tagging," in AAAI '94: Proc. of the 12th national conference on Artificial intelligence, vol. 1, pp. 722–727, 1994.
[16] A. Mikheev, "A knowledge-free method for capitalized word disambiguation," in Proc. of the ACL-99, pp. 159–166, 1999.
[17] C. Chelba and A. Acero, "Adaptation of maximum entropy capitalizer: Little data can help a lot," Proc. of the EMNLP '04, 2004.
[18] F. Batista, D. Caseiro, N. Mamede, and I. Trancoso, "Recovering capitalization and punctuation marks for automatic speech recognition: Case study for portuguese broadcast news," Speech Communication, vol. 50, no. 10, pp. 847–862, 2008.
[19] F. Batista, N. Mamede, and I. Trancoso, "Language dynamics and capitalization using maximum entropy," in Proc. of ACL-08: HLT - Short Papers, pp. 1–4, 2008.
[20] F. Batista, N. Mamede, and I. Trancoso, "The impact of language dynamics on the capitalization of broadcast news," in Proc. of Interspeech 2008, Sep. 2008.
[21] H. Daumé III, "Notes on CG and LM-BFGS optimization of logistic regression," http://hal3.name/megam/, 2004.
[22] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel, "Performance measures for information extraction," in Proc. of the DARPA BN Workshop, 1999.


Demos


The TALP on-line Spanish-Catalan machine-translation system

Marc Poch, Mireia Farrús, Marta R. Costa-jussà, José B. Mariño, Adolfo Hernández, Carlos Henríquez, José A. R. Fonollosa

Center for Language and Speech Technologies and Applications (TALP), Technical University of Catalonia (UPC), Barcelona, Spain {mpoch, mfarrus, mruiz, canton, adolfohh, carloshq, adrian}@gps.tsc.upc.edu

Abstract

In this paper the statistical machine translator (SMT) between Catalan and Spanish developed at the TALP research center (UPC) and its web demonstration are described.

1. Introduction

Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation. During the last three years we have developed new SMT architectures of general interest as well as specific adaptations and modules for the Spanish-Catalan pair.

2. System description

The TALP translation system is based on an N-gram translation model integrated in an optimized log-linear combination of additional features improved by specific techniques based on the use of grammatical categories, lexical categorisation and text processing, for the enhancement of the final translation [1].

The translator is an N-gram-based SMT system. Such an approach is faced using a general maximum entropy approach in which a log-linear combination of multiple feature functions is implemented. This approach leads to maximising a linear combination of feature functions:

    \tilde{t} = \arg\max_{t} \left\{ \sum_{m=1}^{M} \lambda_m h_m(t, s) \right\}    (1)

where the argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language, h_m(t, s) are the feature functions and \lambda_m are their corresponding weights. The main feature function is the N-gram-based translation model which is trained on bilingual n-grams. This model constitutes a language model of a particular bi-language composed of bilingual units (translation units) which are referred to as tuples. In this way, the translation model probabilities at the sentence level are approximated by using n-grams of tuples such as described by the following equation:

    \tilde{t}_1^I = \arg\max_{t_1^I} \left\{ p(s_1^J, t_1^I) \right\}
                  = \arg\max_{t_1^I} \left\{ \prod_{n=1}^{N} p\big( (s,t)_n \mid (s,t)_{n-x+1}, \ldots, (s,t)_{n-1} \big) \right\}    (2)

where the n-th tuple of a sentence pair is referred to as (s,t)_n. The system is trained with the aligned Spanish-Catalan parallel corpus taken from the El Periódico newspaper, which contains 1.7 million sentences. To improve it, several techniques based on the use of grammatical categories, lexical categorisation and text processing are used. Most of these techniques are based on preprocessing the text that is used as input data for the baseline system and postprocessing the translation. Table 1 shows the improvement achieved in terms of BLEU after using the mentioned techniques on 2000 sentences from El Periódico.

           es2ca   ca2es
    N-II   83.91   83.23

Table 1: BLEU results in both directions of translation.

3. Demonstration

The demonstration of the system consists of a website that allows the user to execute translations between the Catalan and Spanish pair of languages. As shown in Fig. 1 the user can type text directly or send a text file to be translated in both directions. The web has an online spell checker for both Spanish and Catalan to avoid typical mistakes that would end in a bad translation.

There is an option called Log that allows the user to see data being preprocessed and postprocessed before and after being translated by the machine translator. This is very useful to see the changes that improvement techniques have on the data. The demonstration can be found online at "http://www.n-ii.org".

Figure 1: Capture image of the demonstration.
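Equation (1) amounts to choosing, among candidate translations, the one with the highest weighted sum of feature-function scores. The toy sketch below only illustrates that decision rule; the feature functions, weights and candidates are invented for the example and are not the TALP system's.

```python
# Toy illustration of the log-linear decision rule in Eq. (1):
# pick the candidate t maximising sum_m lambda_m * h_m(t, s).
# Feature functions, weights and candidates below are invented for the example.

def choose_translation(source, candidates, feature_functions, weights):
    def score(candidate):
        return sum(lam * h(candidate, source) for lam, h in zip(weights, feature_functions))
    return max(candidates, key=score)

# Hypothetical feature functions: a "translation model" score and a length penalty.
h_tm = lambda t, s: {"bon dia": -1.2, "bona nit": -3.5}.get(t, -10.0)
h_len = lambda t, s: -abs(len(t.split()) - len(s.split()))

best = choose_translation("buenos días", ["bon dia", "bona nit"], [h_tm, h_len], [1.0, 0.5])
print(best)   # "bon dia"
```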
4. References

[1] Mireia Farrús, Marta R. Costa-jussà, Marc Poch, Adolfo Hernández, and José B. Mariño, "Improving a Catalan-Spanish Statistical Translation System using Morphosyntactic Knowledge", EAMT, Barcelona, 2009.



CINTIL-Treebank Searcher

Patricia Nunes Gonçalves, António Branco

NLX-Natural Language and Speech Group, Lisbon University, Portugal {patricia.nunes, antonio.branco}@di.fc.ul.pt

Abstract

This is a short note to support the demonstration of the CINTIL-Treebank Searcher.

Index Terms: treebank, syntactic analysis, search, Portuguese

1. The CINTIL-Treebank Searcher

The CINTIL-Treebank Searcher is a freely available online service that permits to search the CINTIL-Treebank and to visualize the syntactic analysis of the selected sentences.

This service is made available aiming at supporting research and development in the realm of natural language science and technology. It is well suited to be used as a research tool on the syntactic structure of Portuguese both by students and advanced researchers working on Linguistics, Natural Language Processing, or any other area involving the grammatical study of the Portuguese language.

This online service for treebank searching and visualization was developed and is being maintained and extended by the NLX-Natural Language and Speech Group (http://nlx.di.fc.ul.pt), of the University of Lisbon, Department of Informatics.

This service receives a description of a syntactic structure pattern as input, entered by the user, and returns the list of sentences whose syntactic representation conforms to that pattern. Subsequently, by clicking on one of the listed sentences, the user obtains the syntactic tree of that sentence.

To describe the syntactic pattern he wants to search for, the user resorts to a description language based on regular expressions extended to easily capture basic relationships among the inner components and tags of a syntactic tree. The search system builds on the Tregex library [3], available from Stanford University, as the underlying search engine for tree query.

In order to briefly illustrate the syntax of the language used for describing syntactic patterns to be searched for in the treebank, consider the input expression S < VP << NP-DO. This query matches any tree containing a top-to-bottom node path where the sentence node (S) immediately dominates a verb phrase (VP), which in turn dominates (possibly non immediately) a noun phrase (NP) bearing a direct object grammatical function (NP-DO). The figure below shows a tree found by the search based on the query:

A fully-fledged description of the query language, together with key examples, is provided in one of the web pages made available in the CINTIL-Treebank Searcher site.

2. The CINTIL Treebank

The CINTIL-Treebank is a corpus of sentences annotated with their syntactic trees, that encode the constituency relations among their elements. The treebank is composed of sentences from the CINTIL-International Corpus of Portuguese [1] and it is developed at the NLX Group.

The annotation of the CINTIL Treebank is performed by experts in Linguistics according to the mainstream method of annotation that is deemed to ensure a more reliable outcome: multiple annotation by independent annotators, followed by adjudication.

The annotation work is supported and its quality and consistency is ensured by resorting to a computational grammar. Each sentence is automatically analyzed by LXGram [2], an advanced grammar for the deep linguistic processing of Portuguese. Once a parse forest is obtained for a given sentence, independent annotators choose the analysis they each consider to be correct. In case of divergence between annotators, an adjudicator makes a final decision.
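The sketch below is not the Tregex engine used by the service; it only makes the two dominance operators in the example query concrete on a toy constituency tree: `<` requires an immediate child, `<<` any descendant.

```python
# Toy illustration of the query S < VP << NP-DO: an S must have a VP child (immediate
# dominance) and that VP must dominate, at any depth, an NP-DO node. This mimics the
# operators' meaning only; the actual service relies on the Stanford Tregex library.

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def dominates(node, label):
    """True if some descendant of `node` (at any depth) carries `label`."""
    return any(c.label == label or dominates(c, label) for c in node.children)

def matches_s_vp_npdo(root):
    if root.label == "S":
        for child in root.children:                                    # S < VP
            if child.label == "VP" and dominates(child, "NP-DO"):      # VP << NP-DO
                return True
    return any(matches_s_vp_npdo(c) for c in root.children)

# Toy tree for "o João leu o livro": S -> NP-SJ, VP; VP -> V, NP-DO
tree = Node("S", [Node("NP-SJ", [Node("o"), Node("João")]),
                  Node("VP", [Node("V", [Node("leu")]),
                              Node("NP-DO", [Node("o"), Node("livro")])])])
print(matches_s_vp_npdo(tree))   # True
```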
3. References

[1] Barreto, F.; Branco, A.; Ferreira, E.; Mendes, A.; Nascimento, M.; Nunes, F. and Silva, J., 2006, "Open Resources and Tools for the Shallow Processing of Portuguese", Proceedings of the 5th LREC, 2006, Genova, Italy.
[2] Branco, A. and Costa, F., "A Computational Grammar for Deep Linguistic Processing of Portuguese: LXGram, version A.4.1", University of Lisbon, 2008.
[3] Levy, R. and Andrew, G., "Tregex and Tsurgeon: tools for querying and manipulating tree data structures", In Proceedings LREC, 2006.



Recent PhDs



PhD thesis: “Hierarchical language models based on classes of phrases: formulation, learning and decoding”. (Original: “Modelos de lenguaje jerárquicos basados en clases de frases: formulación, aprendizaje y decodificación.”)

Author: Raquel Justo. Supervisor: M. Inés Torres

Department of Electricity and Electronics. University of the Basque Country. Spain [email protected], [email protected]

1. Committee

• Renato de Mori. Professor of University of Avignon (France).
• José Miguel Benedí Ruiz. Professor, Technical University of Valencia.
• Eduardo Lleida Solano. Professor, University of Zaragoza.
• Emilio Sanchís Arnal. Technical University of Valencia.
• Javier Ferreiros López. Technical University of Madrid.

Qualification: with honors (cum laude)

2. Abstract

This thesis focuses on the area of stochastic language modeling. A stochastic language model captures the way in which the combination of words is carried out in a specific language. It does so by making use of probability distributions of linguistic events, such as the frequency of appearance of words in sentences. Large amounts of training data, not always available, are required to get a robust estimation of the parameters defining such models.

In this work, a two-level hierarchical language model, based on classes of phrases, is proposed to deal with data sparseness. Each level in the model is associated to a different knowledge source. In the upper level the relations among classes are taken into account, i.e. relations among abstract entities employed to generalize. In the second level the relations among words are considered. The cooperation between different levels allows to build an improved language model. Within this framework different approaches and ways of combining models are defined and formulated.

Throughout this work language modeling has been explored in the framework of Automatic Speech Recognition (ASR). Thus, a methodology to integrate the proposed models into the decoding stage of the ASR system has been developed.

In order to validate the presented approaches an experimental stage has been carried out using different databases. Three different languages and tasks of different complexity, spontaneous speech and read speech, etc. have been employed.

On the other hand, the use of the proposed hierarchical language models within a dialogue system prototype has been explored. In this case the main goal is to maximize the performance of the system in real working conditions.

Finally, a translation model based on the same hierarchical nature has been defined and formulated. This model has been integrated into a speech translation system. The methodology employed to integrate the language model in the ASR system can be directly applied to this case.

3. Curriculum Vitae

1. PERSONAL DETAILS

Raquel Justo Blanco
Department of Electricity and Electronics
University of the Basque Country. 48940 Leioa. Spain.
+34 946015364
[email protected]

2. EDUCATION:

• Electronics Engineer Degree from the University of the Basque Country in 2001
• Bachelor's degree in Physics from the University of Cantabria in 2004.
• PhD degree in Language and Computation Systems from the University of the Basque Country in 2009.

3. RESEARCH:

• Researcher at IKERLAN Research Center (MCC S. Coop.) 2001-2003
• Member of Pattern Recognition & Speech Technology group in the University of the Basque Country since 2003

4. TEACHING:

• University of the Basque Country since 2007

5. OTHERS

• Research stay in the Department of Information Systems and Computation, Technical University of Valencia. 03/2006–07/2006
• Member of "International Speech Communication Association" (ISCA) and "Speech Technologies Thematic Network" (RTTH).


6. AWARDS AND REVIEWS
• Special distinction given by Microsoft to the work "Different approaches to class-based language models using word segments", presented at IAPR "CORES 07".
• Panelist in "AMBI-SYS 08".
• Review of an article in the "IEEE TASLP" journal.

7. PUBLICATIONS
• R. Justo, M. I. Torres. Phrase classes in two-level language models for ASR. Pattern Analysis and Applications. (in press)
• R. Justo, M. I. Torres. An approach to estimate perplexity values for language models based on phrase classes. Proceedings of the 4th Iberian Conference on Pattern Recognition and Image Analysis. Volume 5534 of LNCS, June 10-12 2009, Póvoa de Varzim (Portugal), pp 409-416.
• V. Guijarrubia, M. I. Torres, R. Justo. Morpheme-based Automatic Speech Recognition of Basque. Proceedings of the 4th Iberian Conference on Pattern Recognition and Image Analysis. Volume 5534 of LNCS, June 10-12 2009, Póvoa de Varzim (Portugal), pp 386-393.
• M. I. Torres, V. Guijarrubia, R. Justo, A. Pérez, F. Casacuberta. Statistical methods for speech technologies in Basque language. Actas de las V Jornadas en Tecnología del Habla. Bilbao, 12-14 Noviembre 2008.
• R. Justo, O. Saz, V. Guijarrubia, A. Miguel, M. I. Torres, E. Lleida. Improving dialogue systems in a home automation environment. Proceedings of the International Conference on Ambient Media and Systems (AMBI-SYS 08). Quebec City, Canada. February 11-14, 2008.
• R. Justo, M. I. Torres. Segment-based classes for language modeling within the field of CSR. Proceedings of the 12th Iberoamerican Congress on Pattern Recognition (CIARP). Valparaíso, Chile, November 13-16, 2007. Published in Volume 4756 of LNCS, pp 714-723.
• A. Pérez, V. Guijarrubia, R. Justo, M. I. Torres, F. Casacuberta. A comparison of linguistically and statistically enhanced models for speech-to-speech machine translation. Proceedings of the International Workshop on Spoken Language Translation (IWSLT 07). Trento, Italy. October 15-16, 2007.
• R. Justo, M. I. Torres. Different approaches to class-based language models using word segments. Proceedings of the IAPR International Conference on Computer Recognition Systems (CORES 07). Wroclaw, Poland. October 22-25, 2007. Published in "Advances in Soft Computing", Volume 45, pp 421-428.
• R. Justo, M. I. Torres. Two approaches to class-based language models for ASR. Proceedings of the IEEE Machine Learning for Signal Processing Workshop. Thessaloniki, Greece. August 27-29, 2007.
• R. Justo, M. I. Torres. Phrases in category-based language models for Spanish and Basque ASR. Proceedings of Interspeech 2007. Antwerp, Belgium, August 27-31, 2007.
• R. Justo, M. I. Torres. Word segments in category-based language models for automatic speech recognition. Proceedings of the 3rd Iberian Conference on Pattern Recognition and Image Analysis. Volume 4477 of LNCS, June 6-8 2007, Girona (Spain), pp 249-256.
• R. Justo, M. I. Torres, Lluís Hurtado. Modelos de lenguaje basados en categorías semánticas en un sistema de diálogo de habla espontánea en castellano. Actas IV Jornadas en Tecnología del Habla (IVJTH). Zaragoza. Noviembre 2006. ISBN: 84-96214-82-6.
• R. Justo, M. I. Torres, J. M. Benedí. Category-based language models in Spanish spoken dialogue systems. Procesamiento de Lenguaje Natural, vol. 37, pp 19-24 (SEPLN). Zaragoza, 13-15 Septiembre 2006.
• J. M. Benedí, E. Lleida, A. Varona, M. J. Castro, I. Galiano, R. Justo, I. López and A. Miguel. "Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA". Proceedings of LREC'06. Genova, Italy. 24-26 May 2006.
• R. Justo, M. I. Torres. Statistical and linguistic clustering for language modeling in ASR. Progress in Pattern Recognition, Image Analysis and Applications. LNCS, vol 3773, pp 556-565. La Habana, Cuba. November 2005.
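Read informally, the two-level model described in the abstract above (Section 2) factors a sentence into phrase classes and then into the phrases that realize them. The following sketch illustrates only that informal reading, under assumed choices (a fixed segmentation, toy class labels and probabilities, crude floor smoothing); it is not the model implementation developed in the thesis.

```python
import math

# Illustrative two-level "classes of phrases" language model (toy version, not
# the thesis implementation). Upper level: bigram over phrase classes; lower
# level: probability of the phrase given its class.
class TwoLevelPhraseClassLM:
    def __init__(self, class_bigrams, phrase_given_class):
        self.class_bigrams = class_bigrams            # P(c_i | c_{i-1})
        self.phrase_given_class = phrase_given_class  # P(phrase | c_i)

    def log_prob(self, segmented_sentence):
        """segmented_sentence: list of (phrase, class) pairs."""
        logp, prev = 0.0, "<s>"
        for phrase, cls in segmented_sentence:
            p_class = self.class_bigrams.get((prev, cls), 1e-6)         # crude floor smoothing
            p_phrase = self.phrase_given_class.get((phrase, cls), 1e-6)
            logp += math.log(p_class) + math.log(p_phrase)
            prev = cls
        return logp

# Toy example with made-up classes, phrases and probabilities:
lm = TwoLevelPhraseClassLM(
    class_bigrams={("<s>", "REQUEST"): 0.5, ("REQUEST", "DESTINATION"): 0.6},
    phrase_given_class={("quiero un billete", "REQUEST"): 0.2,
                        ("a bilbao", "DESTINATION"): 0.1},
)
print(lm.log_prob([("quiero un billete", "REQUEST"), ("a bilbao", "DESTINATION")]))
```

In an ASR decoder the same factorization would be applied to hypothesized word sequences rather than to a fixed, pre-segmented sentence as in this toy usage.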


Dynamic Language Modeling for European Portuguese

Author: Ciro Martins*+ Supervisors: António Teixeira*, João Neto+

* Department of Electronics, Telecommunications & Informatics/IEETA – Aveiro University, Portugal
+ L2F – Spoken Language Systems Lab – INESC-ID/IST, Lisbon, Portugal
[email protected], [email protected], [email protected]

1. Committee
• Manuel Augusto Marques da Silva - Professor of Aveiro University (Portugal).
• Tanja Schultz - Professor of Karlsruhe University (Germany) and Assistant Research Professor of the Language Technologies Institute (LTI), School of Computer Science, CMU.
• Isabel Maria Martins Trancoso - Professor of the Departamento de Engenharia Electrotécnica e de Computadores do Instituto Superior Técnico da Universidade Técnica de Lisboa (Portugal).
• Francisco António Cardoso Vaz - Professor of Aveiro University (Portugal).
• João Paulo da Silva Neto - Professor of the Departamento de Engenharia Electrotécnica e de Computadores do Instituto Superior Técnico da Universidade Técnica de Lisboa (Portugal).
• António Joaquim da Silva Teixeira - Professor of Aveiro University (Portugal).

Qualification: Approved by unanimity

2. Abstract
Most of today's methods for the transcription and indexation of broadcast audio data are manual. Broadcasters process thousands of hours of audio and video data on a daily basis in order to transcribe that data, to extract semantic information, and to interpret and summarize the content of those documents. The development of automatic and efficient support for these manual tasks has been a great challenge, and over the last decade there has been a growing interest in the use of automatic speech recognition as a tool to provide automatic transcription and indexation of broadcast news and random, relevant access to large broadcast news databases. However, due to the topic changes over time that characterize this kind of task, the appearance of new events leads to high out-of-vocabulary (OOV) word rates and consequently to degradation of recognition performance. This is especially true for highly inflected languages like European Portuguese.
Several innovative techniques can be exploited to reduce those errors. The use of news-show-specific information, such as topic-based lexicons, pivot working scripts, and other sources such as the online written news made available daily on the Internet, can be added to the information sources employed by the automatic speech recognizer. In this thesis we explore the use of additional sources of information for vocabulary optimization and language model adaptation of a European Portuguese broadcast news transcription system.
Hence, this thesis made three main contributions: a novel approach for vocabulary selection using Part-Of-Speech (POS) tags to compensate for word usage differences across the various training corpora; language model adaptation frameworks performed on a daily basis for single-stage and multistage recognition approaches; and a new method for the inclusion of new words in the system vocabulary without the need for additional data or language model retraining.

3. Curriculum Vitae
3.1. Personal Details
Ciro Alexandre Domingues Martins
Rua dos Cabecinhos – Fráguas
3850-707 Ribeira de Fráguas
Phone: +351 965365208
E-mail: [email protected]

3.2. Education
• Ph.D. in Informatics Engineering, Aveiro University, 2008.
• M.Sc. in Electrotechnical and Computer Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon, 1998.
• Graduated in Mathematics and Informatics, Universidade da Beira Interior, 1993.

3.3. Teaching
• Instituto de Entre Douro e Vouga - Portugal, Assistant Professor, since 2006.
• Universidade Católica Portuguesa – Portugal, Assistant Professor, 1999-2007.

3.4. Publications
Martins, C., Teixeira, A. and Neto, J. (2008). "Automatic Estimation of Language Model parameters for unseen Words using Morpho-syntactic Contextual Information", in Proceedings of INTERSPEECH 2008, Brisbane, Australia, 2008.
Martins, C., Teixeira, A. and Neto, J. (2008). "Dynamic Language Modeling for the European Portuguese", in Proceedings of PROPOR 2008, Curia, Portugal, 2008.
Martins, C., Teixeira, A. and Neto, J. (2007). "Dynamic Vocabulary Adaptation for a daily Broadcast News Transcription System", in Proceedings of the 2007 IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, 2007.
Martins, C., Teixeira, A. and Neto, J. (2007). "Vocabulary Selection for a Broadcast News Transcription System using a Morpho-syntactic Approach", in Proceedings of INTERSPEECH 2007, Antwerp, Belgium, 2007.
Martins, C., Teixeira, A. and Neto, J. (2006). "Dynamic Vocabulary Adaptation for a daily and real-time Broadcast News


Transcription System", in Proceedings of the IEEE/ACL 2006 Workshop on Spoken Language Technology, Aruba, 2006.
Martins, C., Teixeira, A. and Neto, J. (2005). "Language Models in Automatic Speech Recognition", Revista do Departamento de Electrónica e Telecomunicações da Universidade de Aveiro, 2005.
Martins, C., Neto, J. and Almeida, L. (1999). "Using Partial Morphological Analysis in Language Modeling Estimation for Large Vocabulary Portuguese Speech Recognition", in Proceedings of EUROSPEECH 99, Budapest, Hungary, 1999.
Martins, C. (1998). "Modelos de Linguagem no Reconhecimento de Fala Contínua", Master Thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisbon, Portugal, 1998.
Martins, C., Mascarenhas, M., Meinedo, H., Neto, J., Oliveira, L., Ribeiro, C., Trancoso, I. and Viana, M. (1998). "Spoken Language Corpora for Speech Recognition and Synthesis in European Portuguese", in Proceedings of RECPAD 98, Associação Portuguesa de Reconhecimento de Padrões, Lisbon, Portugal, 1998.
Neto, J., Martins, C. and Almeida, L. (1997). "The Development of a Speaker Independent Continuous Speech Recognizer for Portuguese", in Proceedings of EUROSPEECH 97, Rhodes, Greece, 1997.
Martins, C., Rodrigues, F. and Rodrigues, R. (1997). "An Isolated Letter Recognizer for Proper Name Identification Over the Telephone", in Proceedings of RECPAD 97, Associação Portuguesa de Reconhecimento de Padrões, Coimbra, Portugal, 1997.
Neto, J., Almeida, L., Hochberg, M., Martins, C., Nunes, L., Renals, S. and Robinson, T. (1995). "Speaker-Adaptation for Hybrid HMM-ANN Continuous Speech Recognition System", in Proceedings of EUROSPEECH 95, Madrid, Spain, 1995.
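As a rough, hedged illustration of the daily adaptation idea described in the abstract of this thesis: the day's vocabulary can be re-selected from freshly collected news text and a background language model interpolated with a model estimated from that text. The unigram models, the fixed interpolation weight and the function name below are assumptions made only for illustration; this is not the system described above.

```python
from collections import Counter

# Rough sketch (not the thesis system): pick the day's vocabulary from fresh
# news text and interpolate a background unigram LM with a daily one.
def daily_unigram_adaptation(background_counts, daily_news_tokens,
                             vocab_size=5000, lam=0.7):
    daily_counts = Counter(daily_news_tokens)
    # Vocabulary selection: most frequent words across background + daily data.
    combined = background_counts + daily_counts
    vocab = {w for w, _ in combined.most_common(vocab_size)}

    bg_total = sum(background_counts[w] for w in vocab) or 1
    day_total = sum(daily_counts[w] for w in vocab) or 1
    # Linear interpolation of the two unigram distributions over the vocabulary.
    return {
        w: lam * background_counts[w] / bg_total + (1 - lam) * daily_counts[w] / day_total
        for w in vocab
    }

# Toy usage with made-up data:
background = Counter({"o": 50, "governo": 10, "futebol": 8})
todays_news = ["governo", "aprova", "orcamento", "governo"]
adapted = daily_unigram_adaptation(background, todays_news, vocab_size=10, lam=0.8)
```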


A phonological study of Portuguese language variety spoken in Beira Interior region: Some syntactic and semantic considerations.

Sara Candeias

Instituto de Telecomunicações, Polo de Coimbra
[email protected]

1. Supervisors
- Jorge Manuel de Morais Gomes Barbosa – Full Professor Emeritus at University of Coimbra
- Telmo dos Santos Verdelho – Full Professor Emeritus at University of Aveiro

2. Committee
- Maria Helena Nazaré – Full Professor, Principal of the University of Aveiro
- Jorge Manuel de Morais Gomes Barbosa – Full Professor Emeritus at University of Coimbra
- Telmo dos Santos Verdelho – Full Professor Emeritus at University of Aveiro
- Maria Helena Dias Rebelo – Assistant Professor at University of Madeira
- António José Ribeiro Miranda – Assistant Professor at University of Aveiro
- Rosa Lídia Torres do Couto Coimbra e Silva – Assistant Professor at University of Aveiro
- Urbana Maria Santos Pereira Bendiha – Assistant Professor at University of Aveiro

Qualification: Approved by unanimity
University: University of Aveiro, Portugal

3. Abstract
This study proposes a model for a phonological description of the speech patterns attested in the Portuguese language variety spoken in the Beira Interior region (in Fundão, particularly). Our goal was to present the main phone prototypes which could be considered in the description of the Portuguese language, taking minority speech particularly into account. Based on analytic work, a spontaneous speech database was collected in order to establish the pertinent feature set of the variety referred to. In accordance with the so-called Functionalist Theory, sounds are considered in terms of the distinctive features with which speech is perceived (and produced), features which are correlated with an optimal-center-of-gravity region. Therefore, our approach explored the view that Categorical Perception, as well as Quantal Theory and Optimality Theory, could support the phonological system and the (allo)phone inventories.
The collected database comprises 142,040 (allo)phone occurrences. The (allo)phone realizations are described and analysed according to the syllabic context as a finer-grained category. For that reason, the phonological system description considered all structures of vocalic and consonantal phones, including allophone dispersion. Phones in contact within the speech chain were also taken into account, specifically in a V-to-V context.
The proposed phonological system description is supported by a statistical analysis of frequency in relative and absolute values. It is suggested that these maps of phonological phenomena may also have a correlation in the Verb and Personal Pronoun syntactic-semantic categories.
Our belief is that this phonological and phonetic-based description of the Portuguese variety spoken in Beira Interior ultimately contributes to extending (or redefining) the linguistic knowledge of the Portuguese language.

4. Curriculum Vitae
Sara Candeias received a degree in Classical and Portuguese Language and Literature - scientific area - in 1994 from the Faculdade de Letras of the University of Coimbra. The next year she completed the pedagogical area at the same school. In 1999 she received her MSc in Classical and Portuguese Language, also from the University of Coimbra. In 2007 she completed her PhD in Linguistics at the University of Aveiro.
Her domain of specialization is the spoken Portuguese language: the phonetics-phonology interface in spontaneous speech forms; pertinent feature sets; and research based on analytic work within functionalist theory and phonetic features. She was Assistant Lecturer at the Dept. of Arts and Letters of the University of Beira Interior and Invited Assistant Lecturer at the Faculdade de Letras, University of Coimbra. Her current research interests include the processing of the Portuguese language, namely syllable structure, a pronunciation dictionary of Portuguese phones, and coarticulation model systems of spoken Portuguese. She currently holds a post-doctoral grant in Computational Linguistics at the Instituto de Telecomunicações, Pole of Coimbra, Dept. of Electrical & Computer Engineering, University of Coimbra.


PhD thesis: “From grapheme to gesture. Linguistic contributions for an articulatory based text-to-speech system” (Original: Do grafema ao gesto. Contributos linguísticos para um sistema de síntese de base articulatória)

Catarina Oliveira 1

1 School of Health, University of Aveiro/ IEETA, Portugal [email protected]

1. Committee
• Ana Maria Vieira da Silva Viana Cavaleiro, University of Aveiro
• João Manuel Nunes Torrão, University of Aveiro (supervisor)
• Plínio de Almeida Barbosa, Department of Linguistics, Instituto de Estudos da Linguagem/Unicamp
• João Manuel Pires da Silva e Almeida Veloso, Faculty of Arts, University of Porto
• Maria Aldina de Bessa Ferreira Rodrigues Marques, Instituto de Letras e Ciências Humanas, University of Minho
• António Joaquim da Silva Teixeira, University of Aveiro (supervisor)

Qualification: Approved by unanimity
University: University of Aveiro

2. Abstract
Motivated by the central purpose of contributing to the construction, in the long term, of a complete text-to-speech system based on articulatory synthesis, we developed a linguistic model for European Portuguese (EP), based on the TADA system (TAsk Dynamic Application), aimed at the automatic attainment of the articulators' trajectories from the input text.
The specification of this purpose determined the development of a set of tasks, namely: 1) the implementation and evaluation of two automatic syllabification systems and two grapheme-to-phoneme (G2P) conversion systems, in view of the transformation of the input into a format appropriate for TADA; 2) the creation of a gestural database for the EP sounds, so that each phone obtained at the output of the G2P system could be put in correspondence with a set of articulatory gestures adapted for EP; 3) the dynamic analysis of nasality, on the basis of an articulatory and perceptive study.
The two automatic syllabification algorithms implemented and tested appeal to phonological knowledge about the structure of the syllable, the first one being based on finite state transducers and the second one being a faithful implementation of the Mateus & d'Andrade (2000) proposals. The performance of these algorithms - especially the second - was similar to that of other systems with the same capabilities.
Regarding grapheme-to-phone conversion, we followed a methodology based on manual rules combined with an automatic learning technique. The evaluation results of this system motivated the exploration of other automatic approaches, aiming also to evaluate the impact of integrating syllabic information into the systems.
The gestural description of the European Portuguese sounds, anchored in the theoretical and methodological tenets of Articulatory Phonology, was based essentially on the analysis of magnetic resonance imaging (MRI) data, from which all the measurements were carried out, aiming to obtain quantitative articulatory parameters. The several gestural configurations proposed were validated through a small perceptual test, which allowed identifying the main problems underlying the gestural proposal. This work provided, for the first time for EP, the development of a first articulatory based text-to-speech system.
The dynamic description of nasal vowels relied both on the magnetic resonance data, for the characterization of the oral gestures, and on data obtained through electromagnetic articulography (EMA), for the study of velum dynamics and of their relation with the remaining articulators. Besides that, a perceptive test was performed, using TADA and SAPWindows, to evaluate the sensitivity of Portuguese listeners to variations in the height of the velum and to alterations in the intergestural coordination. This study supported an abstract interpretation (in gestural terms) of the EP nasal vowels and also allowed clarifying crucial aspects related to their production and perception.

3. Curriculum Vitae
a. PERSONAL DETAILS
Catarina Alexandra Monteiro Oliveira
School of Health, University of Aveiro/ IEETA
Campus Universitário de Santiago
3810-193 Aveiro, Portugal
(+351) 234 370 200
[email protected]

b. EDUCATION
• 2009 - PhD in Linguistics, University of Aveiro, Portugal.
• 2002/2003 – Post-graduation in Portuguese Studies, Department of Languages and Cultures, University of Aveiro, Portugal.
• 1997/2002 – Degree (licenciatura) in "Línguas e Literaturas Clássicas e Portuguesa", Faculty of Arts, University of Coimbra, Portugal.


c. RESEARCH INTERESTS
Articulatory Phonology; Experimental Phonetics; Nasal Vowels; Articulatory Speech Synthesis

d. TEACHING
• Articulatory and Acoustic Phonetics II, School of Health, University of Aveiro: 2007/2009 - lectures and laboratory classes; 2005-2007 - laboratory classes.
• PLE (Portuguese as a Foreign Language) courses, Languages and Literatures Department, University of Aveiro: 2004 - level I (Beginners Course) and level II (Intermediate Course); 2003 - level I (Beginners Course), level II (Intermediate Course) and international summer course (intermediate level).
• Portuguese language at High School: November/December 2003 - Escola Secundária de Estarreja; September/November 2002 - Escola E/B 2,3 Dr. Acácio de Azevedo de Oliveira do Bairro; 2001 - Escola Secundária de Cantanhede.
• Latin language at High School: 2001 - Escola Secundária de Cantanhede.

e. PARTICIPATION IN RESEARCH PROJECTS
Portugal R&D Project "HERON – A Framework for Portuguese Articulatory Synthesis Research" (POSI/PLP/57680/2004), 2004-2007.

f. SELECTED PUBLICATIONS
• Oliveira, C., Teixeira, A., Martins, P., "Speech Rate Effects on European Portuguese Nasal Vowels", Interspeech'2009 - 10th Annual Conference of the International Speech Communication Association, Brighton, UK. (Accepted).
• Teixeira, A., Martins, P., Carbone, I., Oliveira, C., Silva, A., "An MRI Study of European Portuguese Lingual Coarticulation", book chapter to be published by an international publisher (to appear).
• Teixeira, A., Oliveira, C., Barbosa, P., "European Portuguese Articulatory Based Text-to-Speech: First Results", in António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira and Paulo Quaresma (Eds), Computational Processing of the Portuguese Language, Springer, 2008.
• Oliveira, C., Teixeira, A., "On Gestures Timing in European Portuguese Nasals", Trouvain, J. and Barry, W.J. (Coord.), Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, Germany, 2007, 405-408.
• Teixeira, A., Oliveira, C., Moutinho, L., "On the Use of Machine Learning and Syllable Information in European Portuguese Grapheme-Phone Conversion", in R. Vieira, P. Quaresma, Maria G. V. Nunes, N. Mamede, C. Oliveira, M. C. Dias (Eds), Computational Processing of the Portuguese Language, Springer, 2006, 212-215.
• Oliveira, C., Moutinho, L., Teixeira, A., "On European Portuguese Automatic Syllabification", Proceedings of Interspeech'2005 – Eurospeech - 9th European Conference on Speech Communication and Technology, Lisboa, 2005.


Groups & Projects


Natural Language Science and Technology at the University of Lisbon, Department of Informatics: the NLX Group

António Branco

University of Lisbon, Department of Informatics NLX - Natural Language and Speech Group [email protected]

Abstract
This is a brief presentation of the NLX-Natural Language and Speech Group and its past and ongoing activities and results as of July 2009.
Index Terms: natural language processing, human language technology, computational linguistics, laboratory, Portuguese, University of Lisbon, NLX Group

1. Introduction
The NLX Group is the Natural Language and Speech Group of the Department of Informatics of the University of Lisbon, Faculty of Sciences. This is a presentation of this group and its major past and ongoing results and activities as of July 2009. For further and updated information, check its website at http://nlx.di.fc.ul.pt.

2. Research and Development
2.1. Mission
At the NLX Group, we aim at pursuing research and development (R&D) activities in the field of artificial intelligence and cognitive science, with a special focus on speech and natural language interaction.

2.2. Sponsors
Our activities are undertaken with the support of a number of sponsors, under individual fellowships granted to its members and under contracts for R&D projects granted in open competitive calls assigned by independent experts.
The sponsors contributing the largest funding volume have been FCT-Fundação para a Ciência e Tecnologia of the Portuguese MCTES-Ministério da Ciência, Tecnologia e Ensino Superior and the 6th and 7th Framework Programmes of the European Commission.
The list of sponsors also includes the Luso-American Foundation, the British Council Portugal, and the former GRICES-Gabinete de Relações Internacionais da Ciência e do Ensino Superior from MCTES.

2.3. Team
At present, our team comprises 13 elements. Besides 2 faculty members, it includes 5 PhD students, 1 MA student and 4 other research assistants. In the past, it counted on the collaboration of 16 former members who have contributed to the group activities. It was founded and is directed by the author of this paper.

2.4. Projects
We have been participating in and coordinating a number of successfully completed national and international R&D projects:
• LT4eL
Language Technology for E-learning
• QueXting (coord.)
Question Answering in the Portuguese Web
• GramaXing (coord.)
Computational Grammar for Deep Linguistic Processing of Portuguese
• PALPORT
Fine-grained Psycholinguistic Assessment of Aphasia and Other Language Impairments
• TagShare (coord.)
Tagging and Shallow Processing Tools and Resources
• LTRC
Language Typology Resource Center
• NeXing (coord.)
Natural Negation Modeling and Processing

At present, a major project being conducted is
• SemanticShare (coord.)
Resources and Tools for Semantic Processing

The major goal of this project is the construction of an annotated corpus of Portuguese, part of it aligned with similar corpora for other languages, and of the associated processing tools. The texts included in this corpus are annotated with manually certified grammatical representations by human experts. These deep linguistic representations are informed by advanced linguistic analysis. They encompass different layers of linguistic information and are accessed under different views. These views include corpora of the last and next generations: PropBank, with phrases labeled with semantic functions and roles, and LogicalFormBank, with sentence-level semantic representations.
Results from past projects will be presented in the next sections by way of the presentation of some of the most prominent online services, tools and resources developed in their scope.

2.5. Networking and cooperation
All our projects were deployed by consortia of several groups working in partnership. We have thus entertained a range of cooperative ties with a large number of other prominent national and international institutions, including colleagues from Brazil.


2.6. European research infrastructure
We are participating in the preparatory project for a European research infrastructure for human languages:
• CLARIN
Common Language Resources and Technology Infrastructure

2.7. Applications and online services
Our R&D projects were or are supported by public funding. In order to showcase the results we obtained in a way that is easy to grasp by laymen, and to pay back to the community for its support, we have been ensuring the following online services:
• XisQuê: Question Answering
This is a real-time open-domain factoid question answering online service (beta version) based on the Portuguese web [3, 4].
http://xisque.di.fc.ul.pt
• LX-Center: Linguistic Processing
This is a web center of online linguistic services aimed at both demonstrating a range of language technology tools and at fostering education, research and development in natural language science and technology [2].
http://lxcenter.di.fc.ul.pt

2.8. Language processing tools
In the course of our R&D activities, as instrumental assets for the execution of our projects, we developed or are developing a range of language technology tools and resources.
In terms of language technology, we developed a complete pipeline of shallow processing tools that handle from the basic task of sentence splitting to the named entity recognition task. This pipeline includes state of the art tools for [6, 7]:
• Sentence splitting
• Tokenization
• Nominal lemmatization
• Nominal morphological analysis
• Nominal inflection
• Verbal lemmatization
• Verbal morphological analysis
• Verbal conjugation
• POS-tagging
• Named entity recognition
On a par with these tools for language processing, we also developed some auxiliary tools that are instrumental to explore the language resources developed (described in the next sections):
• Annotated corpus concordancing
• Treebank browsing and concordancing
• Aligned wordnet browsing
As for deep processing, we are authoring:
• LXGram: Computational grammar
This is a large-scale, multi-purpose precision grammar for deep linguistic processing of Portuguese [5]. This grammar is being developed in the international consortium Delph-in (http://www.delph-in.net).
For detailed information, including distribution: http://nlxgroup.di.fc.ul.pt/lxgram

2.9. Language resources
In terms of resources, we have been responsible for key resources for the Portuguese language, from which we highlight here, among others, the pioneering effort devoted to:
• MWNPT-Portuguese MultiWordnet
This is a lexical ontology with over 17,200 manually validated concepts/synsets, linked under the semantic relations of hyponymy and hypernymy. These concepts are made of over 21,000 word senses/word forms and 16,000 lemmas from both the European and the American variants of Portuguese. They are aligned with the translationally equivalent concepts of the English Princeton WordNet and, transitively, of the MultiWordNets of Italian, Spanish, Hebrew, Romanian and Latin.
For detailed information, including distribution: http://mwnpt.di.fc.ul.pt
• CINTIL-International Corpus of Portuguese
This is a high quality, linguistically interpreted, 1 million token corpus accurately hand tagged with respect to POS, lemmata, inflection, multi-word proper names and adverbial and closed classes. This annotated corpus was developed in close cooperation with CLUL-Centro de Linguística da Universidade de Lisboa [1].
For detailed information, including distribution: http://cintil.ul.pt

2.10. Events
We have organized several national and international scientific meetings. Among these, it is worth pointing out
• DAARC - Discourse Anaphora and Anaphor Resolution Colloquia
We have been responsible for the organization of successive editions of these conferences, which are the international reference forum on anaphora.

3. Cooperation and Innovation
The previous sections briefly described the expertise we have been building and the portfolio of key technological assets we developed for the computational processing of Portuguese. We are working to further expand and exploit these assets. We are looking for new and renewed partnerships aiming at establishing both further successful R&D projects and innovative and profitable entrepreneurial initiatives.

4. References
[1] F. Barreto, A. Branco, E. Ferreira, A. Mendes, M. F. Nascimento, F. Nunes and J. Silva. 2006. Open Resources and Tools for the Shallow Processing of Portuguese. LREC2006.
[2] A. Branco, F. Costa, E. Ferreira, P. Martins, F. Nunes, J. Silva and S. Silveira. 2009. LX-Center: A center for linguistic services. ACL-IJCNLP2009.
[3] A. Branco, L. Rodrigues, J. Silva and S. Silveira. 2008. Real-Time Open-Domain QA on the Portuguese Web. LNAI 5290.
[4] A. Branco, L. Rodrigues, J. Silva and S. Silveira. 2008. XisQuê: An Online QA Service for Portuguese. LNAI 5190.
[5] A. Branco and F. Costa. 2008. A Computational Grammar for Deep Linguistic Processing of Portuguese: LXGram, version A.4.1. Technical Report, University of Lisbon.
[6] A. Branco, F. Costa, P. Martins, F. Nunes, J. Silva and S. Silveira. 2008. "LXService: Web Services of Language Technology for Portuguese". LREC2008.
[7] A. Branco and J. Silva. 2006. "LX-Suite: Shallow Processing Tools for Portuguese". EACL2006.

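Relating to the shallow-processing pipeline listed in Section 2.8 of the NLX Group presentation above: conceptually, each stage consumes the output of the previous one, from raw text down to tagged tokens. The sketch below only illustrates that chaining pattern; the stage functions are naive stand-ins and are not the LX-Suite or LXService interfaces.

```python
# Illustrative chaining of shallow-processing stages (stand-in functions only;
# this is not the LX-Suite API, just the pipeline pattern described in Section 2.8).
def sentence_split(text):
    return [s.strip() for s in text.split(".") if s.strip()]  # naive splitter

def tokenize(sentence):
    return sentence.split()  # naive whitespace tokenizer

def pos_tag(tokens):
    # Placeholder tagger: marks every token as unknown ("UNK").
    return [(tok, "UNK") for tok in tokens]

def pipeline(text):
    return [pos_tag(tokenize(sent)) for sent in sentence_split(text)]

print(pipeline("O grupo desenvolve ferramentas. Elas processam texto."))
```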

LX-Center: A center of online services for education, research and development on natural language science and technology

António Branco, Francisco Costa, Eduardo Ferreira, Pedro Martins, Filipe Nunes, João Silva and Sara Silveira∗

University of Lisbon, Department of Informatics {antonio.branco, fcosta, eferreira, pedro.martins, fnunes, jsilva, sara.silveira} @di.fc.ul.pt

Abstract
This is a short note that supports the demonstration of the LX-Center at the 1st Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages.
Index Terms: natural language processing, computational linguistics, online services, Portuguese

1. LX-Center: online services
LX-Center is a web-based center of freely available online linguistic services. These services are targeted at human users and aimed at both demonstrating a range of language technology tools and at fostering education, research and development in natural language science and technology.
The LX-Center is located at [1]. It encompasses linguistic services that are being developed, in all or part, and maintained at the University of Lisbon, Department of Informatics, by the NLX-Natural Language and Speech Group.
At present, this center makes available the following functionalities:
• Sentence splitting
• Tokenization
• Nominal lemmatization
• Nominal morphological analysis
• Nominal inflection
• Verbal lemmatization
• Verbal morphological analysis
• Verbal conjugation
• POS-tagging
• Named entity recognition
• Annotated corpus concordancing
• Aligned wordnet browsing
These functionalities are provided by one or more of the services integrating the LX-Center. For instance, the LX-Suite service accepts raw text and returns it sentence split, tokenized, POS tagged, lemmatized and morphologically analyzed. Some other services, in turn, may support only one of the functionalities above. For instance, the LX-NER service ensures only named entity recognition.
As of July 2009, these are the online services offered:
• LX-Conjugator
• LX-Lemmatizer
• LX-Inflector
• LX-Suite
• LX-NER
• CINTIL Concordancer
• MWN.PT Browser
Access to each one of these services is obtained by clicking on the corresponding button on the left menu of the LX-Center web page. Fully fledged descriptions of each service are available on the corresponding web pages and in the white papers referred to there.

2. Web services
Most of the online services that integrate the LX-Center have a counterpart in terms of a web service for software agents. These web services are available under LXService, presented elsewhere [2].

3. References
[1] http://lxcenter.di.fc.ul.pt.
[2] A. Branco, F. Costa, P. Martins, F. Nunes, J. Silva and S. Silveira. 2008. "LXService: Web Services of Language Technology for Portuguese". Proceedings of LREC2008. ELRA, Paris.

∗ Authors are listed in alphabetical order.



Pattern recognition & speech technologies (http://grah.ehu.es)

M. I. Torres, R. Justo, A. Pérez, V. Guijarrubia, J. M. Olaso, G. Sánchez, E. Alonso, J. M. Alcaide

Department of Electricity and Electronics. University of the Basque Country. Spain [email protected]

Abstract
The Pattern Recognition & Speech Technology group (PR&ST) is devoted to research and technological development in areas related to pattern recognition and speech and language technologies. We also aim to advise PhD students, to train technical experts and to transfer technology to companies.

1. Introduction
The PR&ST group of the University of the Basque Country was constituted in 1990. The work carried out in the group was mainly focused on technologies based on machine learning for Automatic Speech Recognition (ASR) systems. However, other fields such as speech understanding and dialogue systems have also been considered. In recent years we have also worked on adapting recognition systems to real working conditions, especially when dealing with spontaneous speech. Alternative research lines have also been tackled in the framework of machine translation and language identification. These activities have led to numerous publications in international conferences and journals.
We have experience in collaborating on coordinated R&D projects and working with companies (Telefónica I+D, EITB, Adur Software Productions, Rosetta Testu Zerbitzuak, Fagor Electrodomésticos, Telvent, Softec, Scansoft). We have also collaborated with technological centers (Ametzagaia, Ikerlan, Cidemco) and public agencies (UZEI, Euskoiker Foundation, Linguistic Policy Ministry of the Basque Government).
We have developed several demos of automatic speech recognition systems in Spanish and Basque for different tasks. We have also acquired different text and speech corpora and language resources in Spanish, English and Basque. Finally, it is worth mentioning the development of dialogue system prototypes in the framework of R&D projects.

2. Scientific Objectives in the coming years
• Acoustic modelling techniques aimed at dealing with speech variability.
• Development of hierarchical, cooperative language models. Application to recognition, understanding, dialogue and machine translation.
• Development of specific technologies for lexical processing of the Basque language.
• Speech-to-speech translation in limited domains: model inference based on finite state transducers. Use of linguistic knowledge. Integration of ASR and translation components for systems operating in limited domains for Spanish/Basque and Spanish/English language pairs.
• Inclusion of the user in the formulation of dialogue systems. Management of multimodality in models of understanding and decoding in speech-based interactive systems.

3. Technical Objectives and Technology Transfer
• Compile linguistic resources to develop the proposed technologies.
• Build working spoken dialogue systems that push the envelope of naturalness and sophistication.
• Integrate multilingual and multimodal input/output in speech-based interactive systems.
• Develop advanced multimodal interactive systems in collaboration with the industrial sector, to explore the creation of technology-based spin-off companies.
• One of the main goals of the group: the proposal and development of doctoral theses for PhD training, as well as the training of specialists in speech technologies.

4. Current projects
• SD-TEAM: "Interactive Learning, Self-Evaluation and Multimodal Technologies for Multidomain Spoken Dialogue Systems" (MICINN)
• MIPRCV: Multimodal Interaction in Pattern Recognition and Computer Vision (CONSOLIDER-INGENIO 2010)
• VOACDIS: "Mejora de la accesibilidad a vivienda de personas discapacitadas mediante sistemas de reconocimiento" (MICINN - Programme of applied research with companies)

5. Funds
• Regional: UPV/EHU research group, EJIE/UPV agreement, Intek, Gaitek, ...
• National: MICINN, CONSOLIDER-INGENIO 2010, thematic networks, ...
• European: Seventh Framework Programme (FP7): ICT - IP Call 2008-3
• Projects to develop technology and research contracts.

Stable collaborations with first-level national and international universities and centers.
Organizers of numerous scientific meetings and events, such as the AERFAI Summer School 2008, "New trends in Pattern Recognition for Speech Technologies": 7 first-level professors (6 countries), 50 European students, Science Faculty, June 2008.



SD-TEAM: Interactive Learning, Self-Evaluation and Multimodal Technologies for Multidomain Spoken Dialog Systems

María Inés Torres1, Eduardo Lleida2, Emilio Sanchís3, Ricardo de Córdoba4, Javier Macías-Guarasa5

1 Pattern Recognition & Speech Technologies Group, University of the Basque Country, Spain
2 Communication Technologies Group, University of Zaragoza, Spain
3 Pattern Recognition and Artificial Intelligence Group, Polytechnic University of Valencia, Spain
4 Speech Technology Group, Polytechnic University of Madrid, Spain
5 Intelligent Spaces and Transport Group, University of Alcalá de Henares, Spain
http://www.sd-team.es

Abstract
Speech technology currently supports the development of dialogue systems that function in the limited domains for which they were trained and in the conditions for which they were designed, that is, specific acoustic conditions, speakers, etc. The international scientific community has made significant efforts in exploring methods for adaptation to different acoustic contexts, tasks and types of user. However, further work is needed to produce multimodal spoken dialogue systems capable of exploiting interactivity to learn online in order to improve their performance.
The goal is to produce flexible and dynamic multimodal, interactive systems based on spoken communication, capable of detecting automatically their operating conditions and especially of learning from user interactions and experience through evaluating their own performance. Such "living" systems will evolve continuously and without supervision until user satisfaction is achieved. Special attention will be paid to those groups of users for which adaptation and personalisation is essential: amongst others, people with disabilities which lead to communication difficulties (hearing loss, dysfluent speech, ...), mobility problems and non-native users.
In this context, the SD-TEAM Project aims to advance the development of technologies for interactive learning and evaluation. In addition, it will develop flexible distributed architectures that allow synergistic interaction between processing modules from a variety of dialogue systems designed for distinct tasks, user groups, acoustic conditions, etc. These technologies will be demonstrated via multimodal dialogue systems to access services from home and to access unstructured information, based on the multi-domain systems developed in the previous project TIN2005-08660-C04.

1. Introduction
Technological advances have resulted in many devices with which we interact and that have transformed our everyday life. Within speech technology, the topic of the current proposal, of special relevance is the study of robust, intuitive and easy to use spoken dialogue systems as a means of human-computer interaction. In recent decades, significant advances in automatic speech recognition have been achieved, which has in turn led to an increased demand for voice-based interactive systems which are able to handle more complex tasks. Such is the case for spoken dialogue systems, which are currently in a very early stage of development and still not integrated into commercial products in widespread use within society. Similar remarks can be applied to other spoken language processing systems such as automatic translation, summarisation and information extraction.
These advances can and should allow for the entry of large swathes of the population into the Information Society, amongst which one can mention the disabled or those who, due to their age, have missed out on contact with this type of technology. However, in spite of efforts to the contrary, rapid technological growth has led to the marketing of products which sometimes do not meet the expectations of people. Consequently, more robust and versatile systems are needed that go much further than current artefacts, which can be characterised as fixed as a result of their initial training. What is needed are systems that learn throughout their lifetime through interactions with users, with error-detection mechanisms and adaptation to novel situations and environments, which would allow increases in robustness, usability and maintainability. Such systems will have advanced sensory capabilities to permit the acquisition of relevant information from their environment, and which allow them to plan their behaviour in response to user demands [1]. Another questionable notion is the idea of the computer being the centre of the human-machine interaction. Rather, the human should be the centre of attention. Under this new view, the role of the computer is limited to that of an observer of interpersonal interactions, contributing helpful information such as speech transcription, automatic translation, topic and speaker changes and the like [2]. On the other hand, it is useful to think of the user as another system module, a view which introduces new scientific challenges [3]. Some of these involve the use of user feedback to reduce system error, and adaptation and evolution to different users, environmental conditions and tasks. At the same time, the recognition that human interactions are inherently multimodal should permit improvements in the general usability of systems and hence their acceptance amongst end users.
Significant scientific advances will be required to handle the technological challenges implied by these new ideas. Tackling these problems is the purpose of the SD-TEAM project described here. SD-TEAM is a natural successor of the EDECAN project [4] developed by the proposed group of scientists. The EDECAN project attempted to go beyond classical dialogue systems to allow them to perform robustly in the face of changes in acoustic conditions due both to the environment and the user, to minimise the effort required in system redesign for


new applications, and to customise the system for individual users. The SD-TEAM project attempts to go further, converting closed dialogue systems into dynamic and flexible interactive systems with a clear capacity for self-evolution via learning, evaluation and cooperation during normal usage. These new living systems will be based on information and experience gained in earlier projects, but equally will be capable of adaptation, interaction with users and other systems, and, most important of all, will be able to learn online in order to reconfigure themselves. For this, a flexible distributed architecture (being developed from that proposed in the EDECAN project) permits cooperation amongst different subsystems for understanding, recognition, handling of dialogue gestures, etc., developed for different environments, as well as natural multimodal user interaction, both of which are based on autonomous operation that will allow for constant, unsupervised evolution (see Fig. 1). Evolution is based on objective functions aimed both at increasing system performance and user satisfaction. A good summary of the overall goal of the SD-TEAM project can be found in objective 2 (Smarter machines, better services) of the Information and Communication Technologies theme of the 7th Framework Programme of the EU: research must lead to systems more capable of detecting what is going on in their environment and learning, reasoning and interacting with people in a more natural way [...] Instead of forcing users to understand how machines function, it should be the systems which learn to work and interact better with us. This aspiration is even more relevant for disabled users. Paradoxically, it is precisely this section of the population who would benefit most from such technological advances, but for whom the use of such systems is out of reach if their communication abilities are reduced by disabilities such as dysfluencies or hearing impairment. At the very least, these systems should evolve to a level of performance useful for this sector of society, as is proposed in the SD-TEAM project.

Figure 1: Illustration of the goals of the SD-TEAM Project: interactive, voice-based systems capable of unsupervised evolution via the development of technologies which support cooperation between different modules, operating environments, evaluation techniques and interactive learning.

2. Objectives
Over the years, dialogue systems have been developed which know what they know and don't know what they don't know. Such systems are static and force the user to adapt to their needs. On the other hand, systems exist which are increasingly organic, capable of learning, growing, reconfiguring and self-repairing, leading to greater robustness in the face of a wide range of operating conditions. Consequently, it is essential to study in depth the technological and methodological elements that will allow dialogue systems to approach the organic model, in order to produce artefacts which automatically evolve to a level where they guarantee user satisfaction. Amongst potential users are those groups for which a static system simply would not function. These include users whose disability leads to communication problems (dysfluent speech, hearing loss, ...), those with limited mobility, and non-native users. Project SD-TEAM will improve the capabilities of voice-based interactive multimodal systems using advances in the development of self-evaluation and interactive learning technologies.

2.1. Scientific objectives
1. Concerning interactive learning:
• To explore efficient mechanisms for interactive learning: methods of incremental adaptive learning capable of using the information contained in each interaction with the user to allow automatic and dynamic evolution of the system itself.
• To explore efficient techniques for environment detection: speaker localization, identification of the user, acoustic conditions and semantic domain, etc. To develop robust methods for adaptation to variable environments, application changes and user customisation (multilingualism, non-native speech, dysfluent speech, emotional speech, etc.).
2. Concerning interaction and cooperation between modules, systems and the user:
• To develop approaches which consider user interaction as an additional module or element of the system that co-operates with the rest of the system. To explore methods that permit direct use of information from the interaction in order to improve the performance of highly complex systems with imperfect behaviour (e.g. at the level of audio classification, transcription, etc.) by validating or updating hypotheses, asking new questions, etc.
• To develop, likewise, approaches that can handle the interaction between systems of differing complexity, communicating and cooperating to achieve given objectives.
• To develop distributed architectures that integrate user interaction, inter-module interaction and adaptive learning, as well as multimodal information processing.
3. Concerning learning through self-evaluation:
• To explore methods for the automatic extraction of information relevant to the interaction process: signal segmentation, event identification/classification, information extraction, etc.
• To define measures of system self-evaluation, confidence and system performance at all levels, and the incorporation of these into the learning process.
4. Horizontal objectives related to multimodality and e-inclusion:


• To explore efficient methods for integrated multimodal input processing: voice, touch screen, video, etc., for understanding and/or learning. To develop methods and techniques for multimodal information generation: voice, multilingual text, sign language, handwriting, etc.
• To use techniques developed in SD-TEAM to improve access for disabled people.

2.2. Technical objectives
• To improve significantly the technological capabilities of the consortium (large-vocabulary recognisers, dialogue systems with wider application domains, processing of out-of-vocabulary items, coupled and decoupled architectures, integration of user interaction, etc.). Development of a flexible platform based on distributed architectures for the integration of the modules involved in a voice-based multimodal interactive system capable of learning and dynamic evolution using information obtained during interaction with the user.
• To construct a prototype demonstrator to illustrate the scientific results and technology of the project. Dynamic dialogue systems applied to multiple domains, starting from prototypes from the EDECAN project, will be integrated with interactive systems to access unstructured speech (audio from TV programmes). Accessibility for disabled people is foreseen.

3. Partners and local objectives
The groups in this proposal have a long history of working together in the development of automatic speech recognition and spoken dialogue systems. The varied origins and specialisms of each group make collaboration especially attractive. The Communication Technology group from the University of Zaragoza (UZ) comes from the field of signal and communication theory, fundamental for the development of robustness in recognition systems, and has recently introduced speech technology to aid people with disabilities and speech production difficulties. The Pattern Recognition and Artificial Intelligence group from the Polytechnic University of Valencia (UPV) and the Pattern Recognition and Speech Technology group from the University of the Basque Country (EHU) possess essential knowledge concerning model learning, allowing the consortium to deepen the development of methods for learning from examples, language modelling, understanding and dialogue, as well as, in the case of EHU, the development of limited-domain translation systems. The group from UPV also works in natural language processing, in particular in morphosyntactic labelling, word-sense disambiguation, named-entity recognition and their application in information extraction and question answering. The Speech Technology group from the Polytechnic University of Madrid (UPM) has extensive experience in the design and evaluation of person-machine dialogue systems based on speech technology, with powerful systems for speech recognition and understanding and high-quality text-to-voice conversion. Finally, the Intelligent Spaces and Transport group at the University of Alcalá (UAH) has wide experience in the positioning of intelligent mobile agents with multiple cameras, as well as in multi-microphone speech processing.
The objective of the SD-TEAM project is to improve the capabilities of advanced voice-based multimodal interactive systems through the development of interactive learning and self-evaluation technologies. The project will also develop flexible distributed architectures that allow subsystems created for different environments to work together. SD-TEAM will lead to significant advances in the technological capacity of the consortium, which will be demonstrated through prototypes for multiple-domain dialogue systems and in systems for audio information extraction, with special attention to accessibility for disabled users with problems of communication or mobility. To fulfil its objectives, SD-TEAM relies on collaborative work within the network of project partners. Each team has specialised expertise (as detailed in the lab descriptions) which complements that of the other partners and will be needed to develop the distinct modules that make up complex voice-based interactive systems (and in which each group brings experience and solutions from different perspectives). This diversity in viewpoints will enrich the proposed solutions to the scientific problems which arise during the project. All the subprojects share the general scientific and technological objectives of the project, alongside those involved in the design and implementation of the distributed architecture and in the construction and evaluation of demonstrators. In addition, each subproject has its own objectives:

3.0.1. SD-TEAM-EHU: objectives
• Incorporation of multilingual input/output (Spanish-Basque-English) into voice-based interactive systems.
• Development of robust language identification systems based on multilingual acoustic models.
• Translation model inference using finite-state transducers. Integration of systems for ASR and automatic translation from voice in limited domains, mainly in a Spanish/Basque context.
• Research into the development of hierarchical, cooperative language models and their application in recognition, understanding, dialogue management and automatic translation.
• Incorporation of the user in the design of dialogue systems; management of multimodality in search and in models of understanding.
• Significant growth in the technical capacity of the group in recognition, dialogue and voice translation.

3.0.2. SD-TEAM-UZ: objectives
• Robustness in adverse acoustic environments.
• Description of the acoustic scenario: acoustic segmentation, identification of acoustic events and acoustic environment, speaker identification, i.e. audio indexing.
• Lexical robustness: detection and learning of lexical features from the speaker, which is fundamental for impaired or non-native speaking users.
• Development of algorithms for obtaining robust acoustic confidence measures for self-evaluation and assessment purposes.
• Cooperation of multimodal information sources (audio-visual, tactile, etc.) to increase robustness in spoken dialogue systems.
• Help for handicapped users: developing support systems for oral communication interfaces based on voice inputs/outputs.


Architectures for distributed dialogue systems: making Audio-visual sensor fusion strategies for speaker identi- • advances to manage errors, multilingual interaction, self • fication and emotional state classification tasks. and inter-pair evaluation and assessment of every system module and the continuous and autonomous adaptation 4. Acknowledgements 3.0.3. SD-TEAM-UPV: objectives The authors would like to thank the Spanish Innovation and Sci- ence Minister for funding this project under grant TIN-2008- Codification of the multimodal input information and its 068-C05. • integration in the system models. Detection and classification of different semantic con- • 5. References texts. [1] IEEE Transactions on Systems, Man and Cybernetics, Vol. 35, Interaction with the user: personalization. No. 1,Jan, 2005. Special Issue on Ambient Intelligence. • Dynamic learning in dialog systems: adaptation of the [2] CHIL project: ?Computers in the Human Interaction • lexicon, the semantics and the dialog manager. Loop?. Proyecto integrado VI programa marco de la UE: http://chil.server.de/servlet/is/101 Learning with samples that contain errors, rejecting or • [3] Multimodal Interaction in Pattern Recognition and Computer Vi- incorporating them to the models. sion: Project funded by the Spanish Science Minister under spe- Dynamic learning of the dialog manager by means of its cial program Consolider-Ingenio 2007. http://miprcv.iti.es/ • use. Definition of success parameters to be used in the [4] EDECAN: Sistema de dialogo´ multidominio con adaptacion´ al self-evaluation and self-learning process. contexto acustico´ y de aplicacion.´ Project funded by the Spanish Science Minister: http://www.edecan.es Development of methodologies for accessing unstruc- • tured information (voice or text) using speech (Informa- tion retrieval and Question Answering). Detection of Named Entities and keywords. Development of inter- action methods with the user. Study of cooperation techniques between different sys- • tem modules: homogeneous (e.g., between different speech recognizers) or heterogeneous (e.g., between a recognizer and a dialog manager).

3.0.4. SD-TEAM-UPM:objectives Inclusion of multimodal inputs (speech and tactile • screen). Inclusion of multimodal outputs: speech (in- cluding emotional speech), screen, and sign language translation. Technologies for environment detection in adaptive • learning: speaker / user identification and language iden- tification. Dialog management with dynamic information: dy- • namic generation of LM, vocabularies, etc. Automatic learning of dialog management (BNs archi- • tecture from labeled dialogs) Technologies for the collection of relevant information • for the interaction: audio indexing, topic recognition, emotion recognition, especially the detection of anger. Personalization of dialog management: the system learns • in an unsupervised way the user preferences, proposing solutions to the user wishes with the smallest possible number of interactions.

3.0.5. SD-TEAM-UAH: objectives
• Robust systems for multimodal detection, localization, tracking and pose estimation of multiple users in intelligent environments: using microphone arrays and multiple cameras, and applying audio-visual sensor fusion strategies.
• Speech enhancement techniques: based on binaural and microphone-array speech processing, and adaptation strategies based on simulation of reverberant environments.
• Audio-visual sensor fusion strategies for speaker identification and emotional state classification tasks.

4. Acknowledgements
The authors would like to thank the Spanish Ministry of Science and Innovation for funding this project under grant TIN-2008-068-C05.

5. References
[1] IEEE Transactions on Systems, Man and Cybernetics, Vol. 35, No. 1, Jan. 2005. Special Issue on Ambient Intelligence.
[2] CHIL project: "Computers in the Human Interaction Loop". Integrated Project, EU 6th Framework Programme: http://chil.server.de/servlet/is/101
[3] Multimodal Interaction in Pattern Recognition and Computer Vision: project funded by the Spanish Ministry of Science under the special programme Consolider-Ingenio 2007. http://miprcv.iti.es/
[4] EDECAN: Sistema de diálogo multidominio con adaptación al contexto acústico y de aplicación. Project funded by the Spanish Ministry of Science: http://www.edecan.es


Recent work on the FESTCAT database for speech synthesis

Antonio Bonafonte1, Lourdes Aguilar2, Ignasi Esquerra1, Sergio Oller1, Asunción Moreno1

1TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain 2Departament de Filologia Espanyola, Universitat Autònoma de Barcelona, Bellaterra, Spain

Abstract
This paper presents our work around the FESTCAT project, whose main goal was the development of voices for the Festival suite in Catalan. In the first year, we produced the corpus and the speech data needed to build 10 voices using the Clunits (unit selection) and the HTS (Markov models) methods. The resulting voices are freely available on the web page of the project and are included in Linkat, a Catalan distribution of Linux. More recently, we have updated the voices using new versions of HTS and other technology (Multisyn), and we have produced a child voice. Furthermore, we have performed a prosodic labeling and analysis of the database using the break index labels proposed in the ToBI system, aimed at improving the intonation of the synthetic speech.
Index Terms: speech synthesis, databases, Festival voices, prosody analysis.

1. Introduction
Some years ago, the Catalan Government promoted the production of Linkat, a Linux distribution aimed at schools. Speech synthesis is a key component in many accessibility tools, such as Orca, but Catalan voices were not available at that time in the open-source domain. Therefore, we set up a project to produce a speech synthesis corpus and to build voices for the Festival engine [1]. Not only the Festival voices, but also the corpus will be released to allow their use in other synthesis engines.
In 2008, the first version of the Festival voices was released. During this year, new versions of the Festival voices are being produced. Furthermore, the labeling of the corpus is being improved to analyze the prosody so that better intonation models can be derived.

2. The FestCat Corpus
The primary goal of the FestCat corpus was to produce two synthetic voices (one male and one female) for Festival [1] with similar quality to the best available English voices included in Festival. Furthermore, the speech corpus would be public and should allow producing the best quality when used on state-of-the-art engines. The design and production process is based on the specifications introduced in the EU TC-STAR project and is described in [2]. Table 1 summarizes the FestCat corpus.

3. Festival Voices
As stated above, the original goal of the project was to produce 2 high-quality voices (one male and one female). However, the speaker selection process produced, as a by-product, speech corpora for 10 speakers of reasonable size (1 hour) and phonetic coverage. In the 2008 release we only used the 10 speakers × 1 hour corpora to produce voices. Two versions of the voices were produced, using two technologies included in Festival:
• Clunits: concatenative speech synthesis using (specific) unit selection
• HTS: HMM-based speech synthesis
The Clunits voices sound more natural than the HTS ones. The vocoder speech model included in HTS and the flat generated intonation resulted in voices that were clearly synthetic and monotonous. However, while the HTS voices were very smooth and stable, the Clunits voices produced relatively frequent concatenation errors. The reason was either segmentation errors or just spectral, phase or pitch discontinuities. On the other hand, the footprint of the HTS voices is much smaller. For these reasons, the HTS voices were included in the default Linkat installation, while the other voices can be downloaded as additional packages. The different voices can be tried on the web page of the FestCat project [3].
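The concatenation errors mentioned above come from mismatches at unit boundaries. Purely as an illustration of that idea (and not the cost function actually used by Festival's Clunits), the sketch below scores a candidate join from the spectral and pitch mismatch of the two adjoining frames; the feature layout and weights are assumptions.

import math

def join_cost(left_unit_last_frame, right_unit_first_frame, w_spectral=1.0, w_f0=0.5):
    """Weighted mismatch between the last frame of the left unit and the
    first frame of the right unit. Each frame is (mfcc_vector, f0_in_hz)."""
    left_mfcc, left_f0 = left_unit_last_frame
    right_mfcc, right_f0 = right_unit_first_frame
    spectral = math.dist(left_mfcc, right_mfcc)                      # Euclidean MFCC distance
    f0 = abs(left_f0 - right_f0) if left_f0 and right_f0 else 0.0    # only if both frames are voiced
    return w_spectral * spectral + w_f0 * f0

# A large join cost signals an audible spectral or pitch discontinuity, so a
# unit-selection search prefers sequences whose boundaries match more closely.
print(join_cost(([12.1, -3.4, 0.8], 118.0), ([11.7, -2.9, 1.1], 131.0)))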
During the last year we have produced new voices based on the FestCat corpus:
• HTS voices have been trained using the latest HTS release. The new version includes HSMMs (Hidden Semi-Markov Models) to improve duration modeling and, more importantly, Global Variance (GV). This new feature produces richer prosody and much more natural voices.
• For the big voices (1 male + 1 female), Clunits and HTS voices have been produced using the whole database. The quality of the new voices is significantly higher than the quality achieved with the one-hour corpus.
• Voices using the Festival technology Multisyn have been built. Multisyn is a general (classic) unit selection technology and produces better results than Clunits. However, the reasons to prefer HTS over Clunits voices in the default Linkat installation are still valid for selecting HTS over Multisyn.
• A child voice has been produced using a new 1-hour corpus.

3.1. Prosody boundaries labeling
As a continuing project, we have been labeling a subset of the FestCat database with prosody information. This will allow a study of Catalan prosody both from a linguistic and from a technology point of view. The goal with respect to speech synthesis is to assess the generation of synthetic prosody using a symbolic representation.
As a first step, we are labeling information on prosodic boundaries using the break tier proposed in the Cat ToBI proposal [4].


Corpus Size: The corpus size is around 90,000 words (aimed at 10 hours of speech).
Corpus Design: 80% of the corpus is designed to achieve high phonetic and prosodic variability. Subcorpora from different domains (novels, news, teaching books, etc.) have been produced by applying a greedy algorithm to a big raw corpus (see the sketch after the table). Each utterance is a sentence or a short paragraph; for instance, the mean length of the news subcorpus is 25 words. The remaining 20% is designed to improve coverage in domains relevant in many TTS applications, such as numbers, cities (from Catalonia, Spain and the world), commands found in screen readers, etc.
Language and Phoneset: The design goal is the Central Catalan dialect, but Spanish, Galician, Euskera and English words also need to be pronounced. The Catalan phoneset has been extended to include the missing Spanish phonemes (such as SAMPA [x] and [T] and some stressed vowels). The corpus includes a small Spanish subcorpus (20 min.) and some English and Euskera words.
Recording conditions: Recording studio; 96 kHz, 24 bits; 3 synchronous channels (membrane microphone, close-talk microphone and laryngograph).
Labeling: Orthography and phonetic supervision. Automatic phone segmentation using an HMM-based toolkit.
Speaker selection: 10 professional speakers (5 male + 5 female) each record a 1-hour corpus and 10 TTS voices are built. One male and one female speaker are then selected taking into account articulation and phonetic errors, voice stability over long sessions, pleasantness of the voice, quality of the 1-hour TTS voices, and distortion under TD-PSOLA manipulation.

Table 1: Summary of the FestCat corpus
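The greedy, coverage-driven selection mentioned in the Corpus Design row of Table 1 can be pictured with the minimal sketch below. The unit definition (word pairs instead of phonetic or prosodic units) and all names are illustrative assumptions, not the actual FestCat design scripts.

def extract_units(sentence):
    """Stand-in for the units to cover (e.g. diphones in context)."""
    tokens = sentence.lower().split()
    return set(zip(tokens, tokens[1:]))

def greedy_select(candidates, target_units, max_sentences):
    """Repeatedly pick the sentence that adds the most still-missing units."""
    selected, covered = [], set()
    pool = list(candidates)
    while pool and len(selected) < max_sentences and covered < target_units:
        best = max(pool, key=lambda s: len((extract_units(s) & target_units) - covered))
        gain = (extract_units(best) & target_units) - covered
        if not gain:                     # no remaining sentence improves coverage
            break
        selected.append(best)
        covered |= gain
        pool.remove(best)
    return selected, covered

raw = ["el gat dorm al sol", "el gos corre pel parc", "la nena llegeix un llibre"]
target = set().union(*(extract_units(s) for s in raw))
chosen, got = greedy_select(raw, target, max_sentences=2)
print(chosen, len(got), "of", len(target))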

As in other ToBI systems, the procedure is perceptually based, although the labeler has visual information of the signal. We have used all the levels in Table 2 to better capture the relationship among prosodic constituents. In order to mark the absolute end of the elocution, at the end of the file, the level 5 proposed in [5] has been added. This decision has two advantages: first, in declaratives, it serves as the minimum F0 value of the declination baseline; second, it prevents the processing of the silences in this position, which have no linguistic content.

Break: Description
0: Any clear example of cohesion between orthographic forms, such as vowel contacts
1: Any inter-word juncture (provided as default at every word boundary)
2: End of groups with some sense of disjuncture with respect to the following speech chunk
3: End of minor prosodic group
4: End of major prosodic group

Table 2: Break descriptions

For each main speaker (1 male + 1 female), approximately 5 hours have been labeled manually by a graduate in linguistics with no prior training in prosodic labeling. The transcriber was looking at a computer screen with a display of the signal (F0 curve and waveform) together with the phonetic marks corresponding to words, syllables and silences. Nevertheless, she was encouraged to give preference to perception. To ensure the consistency of the data, only one transcriber was recruited and one of the authors (L.A.) reviewed the corpus. The annotation was not considered definitive until the transcriber and the reviewer arrived at a consensus on the labels. Roughly, 10% of the words are followed by a minor break (BI3) and 10% of the words are followed by a major break (BI4 and BI5).
In [6] we present a first study of the correlation between acoustic features and break indexes, and a classifier based on these features. The preliminary results show that using only some acoustic measures (presence and duration of pause, value of ending F0, duration of the pre-break syllable, etc.) we can predict the presence of a break in 90% of cases, and we can differentiate between major and minor breaks in 80% of the cases.
Further work will include linguistic features (for automatic labeling of databases) and will study prediction based only on linguistic features (for symbolic prosody prediction in speech synthesis).
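As a hedged illustration of the kind of acoustic-feature classifier discussed above (not the classifier reported in [6]), the sketch below fits a small decision tree on the measures listed; the feature set, the toy training data and the collapsed label set are assumptions made only for the example.

from sklearn.tree import DecisionTreeClassifier

FEATURES = ["pause_present", "pause_dur_s", "ending_f0_hz", "prebreak_syll_dur_s"]

# Each row describes one word boundary; labels collapse the break tier of
# Table 2 into: no break, minor break (BI3), major break (BI4/BI5).
X_train = [
    [0, 0.00, 180.0, 0.12],   # fluent juncture                 -> "none"
    [0, 0.00, 150.0, 0.20],   # lengthened syllable, no pause   -> "minor"
    [1, 0.15, 140.0, 0.22],   # short pause                     -> "minor"
    [1, 0.60, 110.0, 0.30],   # long pause, low ending F0       -> "major"
]
y_train = ["none", "minor", "minor", "major"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify a new boundary described by the same four measures.
print(clf.predict([[1, 0.45, 115.0, 0.28]]))   # e.g. -> ["major"]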
4. Acknowledgments
FestCat has been partially funded by the Generalitat de Catalunya under the LINKAT and TECNOPARLA projects. The work done by L. Aguilar was possible thanks to a visiting position at the Universitat Politècnica de Catalunya during the academic year 2008-09. The authors want to thank all present and past members of the TALP speech synthesis group who were involved in one way or another in the FestCat project.

5. References
[1] A. W. Black, P. Taylor, and R. Caley, "The Festival speech synthesis system," 1996-2009. [Online]. Available: http://www.cstr.ed.ac.uk/projects/festival.html
[2] A. Bonafonte, J. Adell, I. Esquerra, S. Gallego, A. Moreno, and J. Pérez, "Corpus and voices for Catalan speech synthesis," in Proc. of LREC Conf., Marrakech, Morocco, May 2008, pp. 3325-3329.
[3] "FestCat: Catalan corpus and voices for speech synthesis," 2007. [Online]. Available: http://www.talp.upc.edu/festcat
[4] P. Prieto, L. Aguilar, I. Mascaró, F. Torres-Tamarit, and M. Vanrell, "L'etiquetatge prosòdic Cat ToBI," Estudios de Fonética Experimental, no. XVIII, pp. 287-309, 2008.
[5] P. Price, M. Ostendorf, S. Shattuck-Hufnagel, and C. C. Fong, "The use of prosody in syntactic disambiguation," Journal of the Acoustical Society of America, vol. 90, pp. 2956-2970, 1991.
[6] L. Aguilar, A. Bonafonte, F. Campillo, and D. Escudero, "Determining intonational boundaries from the acoustic signal," in Proc. of INTERSPEECH, Brighton, U.K., Sep. 2009.


The Project HERON

António Teixeira1, Catarina Oliveira2, Paula Martins2, Inês Domingues3, Augusto Silva1

1 Department of Electronics, Telecommunications and Informatics, University of Aveiro / IEETA 2 School of Health, University of Aveiro / IEETA 3 University of Aveiro / IEETA [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
In the project HERON, we developed a framework for articulatory speech synthesis for European Portuguese. The system combines several modules, developed or adapted in the project. The Linguistic Processing module uses new syllabification and grapheme-to-phone modules developed in the project. The construction of the gestural scores is performed using an adapted version of TADA (TAsk Dynamic Application), made available for research by Haskins Laboratories. Another important result of HERON was the (first) comprehensive MRI database for European Portuguese and the analyses conducted on this data. Two new Electromagnetic Midsagittal Articulography (EMMA) corpora were also acquired, at GIPSA-Lab in Grenoble, one for EP nasals and the other, of smaller dimension, relative to EP laterals.

1. Introduction
Around 20 BC, HERON, an engineer, decided to give voice to the statue of Memnon, so that it could answer to the caresses of his mother. Since then, or even before, many efforts have been dedicated to giving voice to artifacts created by man. Articulatory synthesis produces voice signals using models of the physical, anatomical and physiological characteristics of the human voice production system. Even those who rely on the currently best available methods believe that in the long run such techniques are not the answer: in the long term, articulatory synthesis has more potential for high-quality speech synthesis. Since 1995 we have been working on articulatory synthesis of Portuguese, with encouraging results.
The project's main lines of research were: to collect new data to support the development of new articulatory models for Portuguese; the evolution of the University of Aveiro articulatory synthesizer (SAPWindows); and the application of new linguistic theories to the development of the necessary processing modules so that the articulatory synthesizer may be the base of a complete text-to-speech system. The line of research on data collection has been divided into two tasks, due to the different data and methodology used. Therefore, the project had the following 4 tasks: 1) data collection to support the synthesizer development; 2) development of the synthesizer; 3) development of linguistic modules to enable articulatory-based synthesis from text; 4) collection of data and its analysis to support the development of the linguistic models in task 3.

2. MRI Database
Magnetic Resonance Imaging (MRI) was used in order to obtain new anatomic data needed for the development and validation of the articulatory models. This technique has some potential advantages: it provides a good contrast between soft tissues, allows 3D modelling and covers the vocal tract in all of its extension, and is non-invasive and considered safe. Its disadvantages are related to the absence of the teeth in the images, due to their lack of hydrogen protons; the acquisition technique, in which the speaker must be lying down during speech production; the relatively low temporal resolution achieved; the noisy acquisition environment; and the reduced acoustic feedback, due to the use of headphones.

2.1. Main Results
• A comprehensive MRI database for European Portuguese [1, 4]. Three different types of acquisitions were performed: 2D static, 3D static, and real-time;
• Image processing tools for MRI [3];
• MRI-based studies on European Portuguese (EP) production: all EP sounds [1, 4], nasals [5] and coarticulatory effects [2].

3. Improvements to the synthesizer
This task included the capability of synthesizing new types of sounds in SAPWindows, starting by adding fricative modeling.

3.1. Main Results
• Addition of fricative synthesis to the SAPWindows articulatory synthesizer [6];
• Adaptation of the synthesizer to process TADA (TAsk Dynamic Application) output. TADA is a software implementation of the articulatory phonology approach, developed at Haskins Laboratories. Specifically, TADA is a software implementation of the Task Dynamic model of inter-articulator speech coordination, incorporating also a coupled-oscillator model of inter-gestural planning and a gestural-coupling model.

4. Linguistic Models
A text-to-speech system based on articulatory parameters must include linguistic processing and conversion from the linguistic discrete variables to the time-varying articulatory parameters.

4.1. Main Results
• A first articulatory-based text-to-speech system for EP. The system results from the combination of 3 major parts: 1) Linguistic Processing (automatic syllabification and grapheme-to-phone conversion); 2) the TADA system adapted for EP; 3) synthesizers (the incomplete Matlab CASY implementation, HLsyn and our articulatory synthesizer) [14];
• Automatic syllabification [8]. Two different automatic syllabification methods were developed (a schematic sketch is given after this list). One uses a finite state transducer (FST) approach and is essentially based on the general description of the syllable constituents. The second consists in the implementation of an adapted version of the Mateus and d'Andrade syllabification algorithm;
• Grapheme-to-phone (G2P) conversion [7]: the best results were obtained by a system using a combination of a rule-based system (implemented as a finite state transducer) with two machine learning systems (TBL, Transformation Based Learning, and MBL, Memory Based Learning);
• Preliminary version of a Portuguese TADA [12]. The work included mainly the definition of a gestural dictionary, but also some adjustments to coupling graphs and the use of language-specific dictionary files.
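The following is a minimal, self-contained sketch of rule-based syllabification by onset maximisation, given only to make the idea concrete; it is neither the FST implementation nor the Mateus and d'Andrade algorithm developed in HERON, and the phone inventories below are simplified assumptions.

VOWELS = set("aeiou@")                 # toy vowel inventory ('@' stands for schwa)
LEGAL_ONSETS = {"", "p", "t", "k", "b", "d", "g", "f", "v", "s", "z",
                "S", "Z", "m", "n", "l", "r", "R",
                "pr", "br", "tr", "dr", "kr", "gr", "fr", "vr",
                "pl", "bl", "kl", "gl", "fl"}

def syllabify(phones):
    """Split a list of phone symbols into syllables, maximising legal onsets."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [phones]
    syllables, start = [], 0
    for prev, cur in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:cur]          # consonants between two nuclei
        split = 0                               # give the onset as much as is legal
        for k in range(len(cluster) + 1):
            if "".join(cluster[k:]) in LEGAL_ONSETS:
                split = k
                break
        boundary = prev + 1 + split
        syllables.append(phones[start:boundary])
        start = boundary
    syllables.append(phones[start:])
    return syllables

print(syllabify(list("prato")))    # -> [['p', 'r', 'a'], ['t', 'o']]
print(syllabify(list("palavra")))  # -> [['p', 'a'], ['l', 'a'], ['v', 'r', 'a']]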

5. Temporal Organization
New corpora were collected and analyses were performed by the phonetics researchers on the temporal organization of Portuguese articulatory gestures.

5.1. Main Results
• Acquisition of a new EMMA (electromagnetic midsagittal articulography) database for EP nasals and laterals [10]. EMMA provides valuable kinematic data on different articulators (lips, tongue, jaw, and velum) with good temporal resolution. However, in the majority of available systems, the acquired data are two-dimensional and limited to the trajectories of some articulator fleshpoints. The process is invasive and articulation may be affected by the sensors. Recording took place at GIPSA-Lab (Grenoble, France) in October 2007. For each corpus, two subjects were recorded, a male and a female;
• Automatic annotation of gestures [11]. Velum, lip and tongue tip gestures were automatically annotated. Velocities for three receivers were automatically calculated in order to determine temporal articulatory landmarks: movement onset, target achievement, target release and release offset (a schematic sketch follows this list). Gestural duration and inter-gestural timing were obtained based on these landmarks;
• Studies on nasal gestures [9, 12, 13]. Analyses concentrated on the characterization of gesture duration, velum height and stiffness, and inter-gestural timing. The results confirm the dynamic nature of Portuguese nasal vowels. The results also show clear effects of speech rate on the temporal characteristics of EP nasal vowels: speech rate reduces the duration of velum gestures and increases the stiffness and inter-gestural overlap.
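A minimal sketch of velocity-threshold landmark detection, of the kind described in the gesture annotation item above. The 20% threshold, the single approach-hold-release pattern and all names are illustrative assumptions, not the exact procedure applied to the EMMA data.

import numpy as np

def gesture_landmarks(position, fs, threshold=0.2):
    """Return (onset, target_achievement, target_release, release_offset) as
    sample indices for a 1-D articulator trajectory sampled at fs Hz.
    Assumes one approach-hold-release movement above the noise floor."""
    velocity = np.abs(np.gradient(position) * fs)        # speed in units/s
    peak = velocity.max()
    above = velocity >= threshold * peak                  # frames of fast movement
    fast = np.flatnonzero(above)
    onset = fast[0]                                        # movement onset
    after_onset = np.flatnonzero(~above[onset:])           # speed drops: target achieved
    target_achieved = onset + after_onset[0]
    after_target = np.flatnonzero(above[target_achieved:]) # speed rises again: release
    target_release = target_achieved + after_target[0]
    release_offset = fast[-1]                              # end of fast movement
    return onset, target_achieved, target_release, release_offset

# Toy trajectory: approach a target, hold it, then release.
pos = np.concatenate([np.linspace(0, 1, 80), np.full(40, 1.0), np.linspace(1, 0, 80)])
print(gesture_landmarks(pos, fs=200))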

6. Acknowledgements
The project HERON (POSI/PLP/57680/2004) was funded by the Portuguese Research Agency (FCT). Many thanks are due to GIPSA-Lab, to Solange Rossato for help with the articulatory design, and to Christophe Savariaux for help with the EMMA recordings. We also thank the Radiology Department, Coimbra University Hospital (HUC).
7. References
[1] Martins, P., I. Carbone, et al. (2008). "European Portuguese MRI Based Speech Production Studies." Speech Communication, Volume 50, Issue 11-12.
[2] Martins, P., I. Domingues, et al. (2008). "Coarticulatory Effects on European Portuguese: A First MRI Study". In M. Embarki and C. Dodane (Eds.), La Coarticulation : Indices, Direction et Représentation, L'Harmattan: Paris.
[3] Carbone, I. (2008). Segmentação do tracto vocal em dados de ressonância magnética. DETI, Universidade de Aveiro. Mestrado em Engenharia Electrónica e Telecomunicações.
[4] Martins, P. (2007). Ressonância Magnética no Estudo da Produção do Português Europeu. DETI/DLC/SACS, Universidade de Aveiro. Mestrado em Ciências da Fala e da Audição.
[5] Martins, P., I. Carbone, et al. (2007). "An MRI study of European Portuguese nasals". Interspeech 2007, Antwerp.
[6] António Teixeira, Roberto Martinez, Luís Silva, Luís Jesus, José Carlos Príncipe, and Francisco Vaz (2005). "Simulation of human speech production applied to the study and synthesis of European Portuguese". EURASIP Journal of Applied Signal Processing, Special Issue on Anthropomorphic Processing of Audio and Speech.
[7] Teixeira, A., C. Oliveira, et al. (2006). "On the Use of Machine Learning and Syllable Information in European Portuguese Grapheme-Phone Conversion". PROPOR 2006, VII Encontro para o Processamento Computacional da Língua Portuguesa. R. Vieira, P. Quaresma, M. das Graças Volpe Nunes et al., Itatiaia, RJ, Brasil, Springer.
[8] Oliveira, C., L. de Castro Moutinho, et al. (2005). "On European Portuguese Automatic Syllabification". InterSpeech 2005, Lisbon.
[9] Oliveira, C. and A. Teixeira (2007). "On Gestures Timing in European Portuguese Nasals". ICPhS 2007, Saarbrücken.
[10] Oliveira, C. and A. Teixeira (2007). Nova Base de Dados EMMA relativa às Nasais e Laterais do Português Europeu. Dep. Electrónica e Telecomunicações / IEETA, Universidade de Aveiro.
[11] Oliveira, C. and A. Teixeira (2006). Base de Dados EMMA com Anotação Automática de Gestos. Dep. Electrónica e Telecomunicações / IEETA, Universidade de Aveiro.
[12] Oliveira, C. (2009). Do grafema ao gesto. Contributos linguísticos para um sistema de síntese de base articulatória. Universidade de Aveiro. Dissertação de Doutoramento.
[13] Oliveira, C., Teixeira, A., Martins, P. (2009). "Speech Rate Effects on European Portuguese Nasal Vowels". Interspeech 2009, Brighton, UK.
[14] Teixeira, A., Oliveira, C., Barbosa, P. (2008). "European Portuguese Articulatory Based Text-to-Speech: First Results". In António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira and Paulo Quaresma (Eds.), Computational Processing of the Portuguese Language, Springer.


SDI Media is a language company with 30 offices worldwide, 700 full-time employees and 7,000 freelance language experts. Originally starting as a localization company, SDI Media focused on subtitling and dubbing for entertainment clients. As language markets and service demands expanded, SDI Media grew with the industry, expanding services using its already existing global infrastructure.

More recent language services include voice prompt recordings (for commercial use and text-to-speech development), data collection recruitment and transcription, synthesized voice quality analysis, website localization, tutorial localization (including screen captures), UGC moderation and myriad other language services. Due to our vast network of owned and operated studios combined with our vetted partner studios, SDI Media has created a large global footprint that allows scalability and flexibility for new and expanding language services.

By attending the Iberian SLTech 2009 conference in Lisbon, SDI Media looks forward to meeting other people and groups who share our passion for language.

Contact Information: ErinRose Widner | SDI Media Group | Vice President, Multimedia | Office: +1 310.388.8821 | Mobile: +1 310.945.7127 | [email protected] | skype: erinrosew



Microsoft Language Development Center’s activities in 2008/2009

Daniela Braga, António Calado, Pedro Silva, Miguel Sales Dias

Microsoft Language Development Center, Porto Salvo, Portugal {i-dbraga, i-antonc, i-pedros, Miguel.Dias}@microsoft.com

1. Introduction
Microsoft Language Development Center was launched in 2005, integrated in the Portuguese Microsoft (MS) subsidiary. MLDC (http://www.microsoft.com/portugal/mldc) is the first MS R&D center with the mission to bring key language component product development to Europe and neighboring regions. MLDC acts as an expansion branch of the Redmond-based product group responsible for speech R&D in Microsoft and benefits from its experience, technological background and support. The Global Speech Development Group in MS has four locations: Redmond (USA), Mountain View (Silicon Valley, USA), Porto Salvo (Portugal) and Beijing (China). MLDC works closely with the Mountain View, Redmond and Beijing groups in a multicultural and multidisciplinary team dealing with all aspects of ASR, TTS and speech applications for a large number of languages. MLDC staff included 17 researchers (including 3 PhDs), software engineers and computational linguists in Fiscal Year 2009 and currently includes 11 people at the beginning of Fiscal Year 2010.

2. MLDC's action lines and activities
MLDC has based its activity on the following action lines: 1) Performance of R&D in Speech Technology (speech recognition - ASR, speech synthesis - TTS, speech applications), fully integrated in the Microsoft roadmap and applied to a wide range of MS products and platforms (client, server, live, entertainment, automotive, mobility); 2) Establishment of cooperative links with the most innovative universities, institutes, research laboratories and companies in Portugal and Europe which are active in the speech and HCI areas (Health, Accessibility, Inclusion, Ageing well, Digital Libraries, Robotics), to pursue joint R&D in natural and multimodal human-computer communication, including speech and natural language; 3) Collection of key multi-language components and resources, such as speech corpora, text corpora and lexica.
MLDC has been collaborating in the entire product R&D life cycle of ASR technology in all major European languages, Brazilian Portuguese and Indian English, from data collection to usability testing and system evaluation: 1) Multilingual telephony and desktop speech data collection in all Western European languages and American variants and some Eastern languages for multilingual ASR systems (2006-2007); 2) Building of several front-end components for Western European languages and American variants, such as phonetic lexicons and phone sets (2006-2007), TN (Text Normalization) and ITN (Inverse Text Normalization) (2008-2009), polyphony and morphology modules, POS taggers, word breakers, sentence separators, etc.; 3) Multilingual acoustic model training and grammar building for web search controlled by speech (2008-2009); 4) System evaluation and bug fixing in real scenarios (2009); 5) Usability studies of the ASR system integrated in several Microsoft products (2009); 6) Definition of specifications of features, groups of features, software and systems design.

3. Finished projects
In the last 2 years, MLDC was mainly dedicated to developing TTS and ASR technology and components for Microsoft Exchange 2010 in several languages. MLDC developed TTS for 4 languages (European Portuguese, Brazilian Portuguese, Catalan and Danish) and produced key language components for 10 European ASR languages (the same 4 languages produced for TTS plus Finnish, Swedish, Italian, Korean, Dutch and Norwegian). MLDC was also dedicated to the Speech International Project (SIP), whose goal was to collect and transcribe large amounts of telephony speech data in several European languages to be used in ASR acoustic model training. This data was also integrated in the Microsoft Exchange 2010 product. Smaller finished projects include: Virtual Hélia (a talking head in European Portuguese), a prototype to create personalized TTS voices, User Verification using Compact Signatures of Multiple Face Profiles: Speaker ID + Face ID (a joint project with MS Research India), and Info Service: traffic, news and weather information in MS applications using speech technology.

4. Ongoing internal projects
We currently have the following ongoing projects:
1) TN for EFIGS (British English, French, Italian, German, Spanish) languages: production of Text Normalization components for TTS and ASR for several Microsoft products (a small illustration of what text normalization does is sketched after this list);
2) TN and language models for Voice Search for Mobile in 6 languages (British & American English, French, Italian, Spanish, German);
3) Speech Data Collection for Desktop (speech acquisition at 16 kHz and transcription for Russian and Italian).
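As a small illustration of what a text normalization (TN) rule does, referenced in item 1 above, the sketch below expands short digit strings into their spoken form; the toy English rules are assumptions made only for the example and are unrelated to MLDC's actual components.

import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}
TEENS = {10: "ten", 11: "eleven", 12: "twelve", 13: "thirteen", 14: "fourteen",
         15: "fifteen", 16: "sixteen", 17: "seventeen", 18: "eighteen", 19: "nineteen"}

def number_to_words(n):
    """Spell out 0-99, enough for the toy example below."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def normalize(text):
    """Replace every 1- or 2-digit number in the text by its spoken form."""
    return re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)

print(normalize("The workshop runs on 3 and 4 September, room 42."))
# -> "The workshop runs on three and four September, room forty two."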
5. Ongoing external projects
MLDC has external collaborative research projects with academia and other industrial partners. The projects in this scope are called "Citizenship" projects. Currently we have one running project, TTS for Galician, a joint project with the University of Vigo with the aim of developing a new TTS in Galician using Microsoft technology and the University of Vigo's language resources. MLDC has also submitted several European R&D projects and national (FCT and QREN) R&D initiatives with academic and industrial consortia. The most recently approved project, in the scope of the 2nd call of the European Ambient Assisted Living Joint Programme, was PAE-LIFE (Personal Assistant to Enhance the Social Life of the Seniors), a Personal Life Assistant with the goal of fighting isolation and exclusion of the elderly and allowing them to have a more social and fulfilling life through technology controlled by multimodal interfaces. More projects were submitted and are awaiting approval. MLDC belonged to the LC-STAR II consortium and is part of the European Center of Excellence in Speech Synthesis network.



Contributors

A
A. Pérez, University of the Basque Country, Spain: 125
A. Veiga, University of Coimbra/IT, Portugal: 53
Adolfo Hernández, Technical University of Catalonia, Spain: 105
Alberto Abad, Inesc ID Lisboa, Portugal: 77
Alberto Simões, Universidade do Minho, Portugal: 13, 95
Alex Acero, Microsoft, USA: 3
Amparo Varona, Universidad del País Vasco, Spain: 31
Ana Respício, University of Lisbon, Portugal: 21
Andrey Temko, Universitat Politécnica de Catalunya, Spain: 17
Antonio Bonafonte, Universidad Politécnica de Catalunya, Spain: 131
António Branco, University of Lisbon, Portugal: 107, 121, 123
António Calado, Microsoft Language Development Center, Portugal: 137
António Teixeira, University of Aveiro, Portugal: 133
Arantza Casillas, Universidad del País Vasco, Spain: 31
Asunción Moreno, Universidad Politécnica de Catalunya, Spain: 131
Augusto Silva, University of Aveiro, Portugal: 133
B
B. Silva, University of Coimbra/IT, Portugal: 53
Binda Celestino, Instituto Superior de Engenharia de Lisboa, Portugal: 81
C
Carla Lopes, University of Coimbra/IT, Portugal: 53, 57
Carlos Henríquez, Technical University of Catalonia, Spain: 105
Carlos Meneses Ribeiro, Instituto Superior de Engenharia de Lisboa, Portugal: 81
Carlos Teixeira, Lasige, University of Lisbon, Portugal: 21
Carmen Garcia-Mateo, University of Vigo, Spain: 71
Catarina Oliveira, University of Aveiro, Portugal: 117, 133
Catarina Ribeiro, Lasige, University of Lisbon, Portugal: 21
Ciro Martins, University of Aveiro, Portugal: 113
Climent Nadeu, Universitat Politécnica de Catalunya, Spain: 17, 61
D
Daniel Ramos, Universidad Autonoma de Madrid, Spain: 89
Daniela Braga, Microsoft Language Development Center: 67, 71, 137
David Rybach, RWTH Aachen University, Germany: 49
Dayana Ribas González, Advanced Technologies Application Center, Cuba: 85
Doroteo Toledano, Universidad Autonoma de Madrid, Spain: 89
E
E. Alonso, University of the Basque Country, Spain: 125
Eduardo Ferreira, University of Lisbon, Portugal: 123
Eduardo Lleida, University of Zaragoza, Spain: 127
Emilio Sanchis, Polytechnic University of Valencia, Spain: 127
Erin Rose Widner, SDI Media Group: 135
F
Fernando Batista, INESC ID Lisboa, Portugal: 99
Fernando Perdigão, University of Coimbra/IT, Portugal: 53, 57
Filipe Nunes, University of Lisbon, Portugal: 123
Francisco Costa, University of Lisbon, Portugal: 123
G
G. Sánchez, University of the Basque Country, Spain: 125
Germán Bordel, Universidad del País Vasco, Spain: 31
H
H. Mendes, University of Coimbra/IT, Portugal: 53
Henrik Schulz, Technical University of Catalunya (UPC), Spain: 27, 49
Horst-Udo Hain, TU Dresden, Germany: 67
Hugo Cordeiro, Instituto Superior de Engenharia de Lisboa, Portugal: 81
Hugo Meinedo, Inesc ID Lisboa, Portugal: 77
I
Ignacio Moreno, Universidad Autonoma de Madrid, Spain: 89
Ignasi Esquerra, Universidad Politécnica de Catalunya, Spain: 131
Inês Domingues, University of Aveiro, Portugal: 133
Isabel Trancoso, University of Lisbon, Portugal: 77, 99

Contributors (cont.)
J
Javier Franco-Pedro, Universidad Autonoma de Madrid, Spain: 89
Javier Gonzales Domingez, Universidad Autonoma de Madrid, Spain: 89
Javier Macías Guarasa, University of Alcalá de Henares, Spain: 127
J. M. Alacaide, University of the Basque Country, Spain: 125
J. M. Olaso, University of the Basque Country, Spain: 125
Joan Andreu Sánchez, Universidad Politécnica de Valencia, Spain: 39
João Silvia, University of Lisbon, Portugal: 123
Joaquin Rodriguez, Universidad Autonoma de Madrid, Spain: 89
José A. R. Fonollosa, Technical University of Catalunya, Spain: 27, 49, 105
José Calvo Lara, Advanced Technologies Application Center, Cuba: 85
José João Almeida, Universidade de Minho, Portugal: 95
José Mariño, Technical University of Catalonia, Spain: 105
José Pedro Ferreira, Instituto da Linguística Teórica e Computacional, Portugal: 43
L
Lourdes Aguilar, Universidad Autonoma de Catalunya, Spain: 131
Luis Coelho, Instituto Politécnico do Porto, Portugal: 67, 71
Luis Javier Rodríguez-Fuentes, Universidad del País Vasco, Spain: 31
M
Marc Poch, Technical University of Catalonia, Spain: 105
Maria Ines Torres, University of the Basque Country, Spain: 125, 127
Marta R. Costa, Technical University of Catalonia, Spain: 105
Martha Alicia Rocha, Instituto Tecnológico de León, México: 39
Martin Wolf, Universitat Politécnica de Catalunya, Spain: 61
Mateu Aguilo, Universitat Politécnica de Catalunya, Spain: 17
Miguel Sales Dias, Microsoft Language Development Center, Portugal: 137
MiKel Penagarikano, Universidad del País Vasco, Spain: 31
Mireia Farrús, Technical University of Catalonia, Spain: 105
N
Nuno Mamede, INESC ID Lisboa, Portugal: 99
O
Olivier Jokisch, TU Dresden, Germany: 67
P
Patrícia Gonçalves, University of Lisbon, Portugal: 107
Paula Martins, University of Aveiro, Portugal: 133
Pedro Martins, University of Lisbon, Portugal: 123
Pedro Silva, Microsoft Language Development Center, Portugal: 137
R
Raquel Justo Blanco, University of Basque Country, Spain: 111, 125
Ricardo de Córdoba, Polytechnic University of Madrid, Spain: 127
Rui Martin, University of Lisbon, Portugal: 77
S
Sara Candeias, University of Coimbra, Portugal: 115
Sara Silveira, University of Lisbon, Portugal: 123
Sergio Oller, Universidad Politécnica de Catalunya, Spain: 131
Sílvia Barbosa, Instituto da Linguística Teórica e Computacional, Portugal: 43
Simone Ashby, Instituto da Linguística Teórica e Computacional, Portugal: 43
T
Taras Butko, Universitat Politécnica de Catalunya, Spain: 17
V
V. Guijarrubia, University of the Basque Country, Spain: 125
X
Xavier Gómez Guionovart, Universidade de Vigo, Spain: 13
Z
Zilda Zapparoli, Universidade de São Paulo, Brasil: 35