Modelling Phone-Level Pronunciation in Discourse Context
Per-Anders Jande

Doctoral Thesis
Stockholm, Sweden, 2006

GSLT
Graduate School of Language Technology
Faculty of Arts, Göteborg University
405 30 Göteborg, Sweden

KTH Computer Science and Communication
Department of Speech, Music and Hearing
100 44 Stockholm, Sweden

TRITA-CSC-A 2006:25
ISSN 1653-5723
ISRN KTH/CSC/A–06/25–SE
ISBN 91-7178-490-X

Academic dissertation which, with the permission of Kungliga Tekniska Högskolan, is submitted for public examination for the degree of Doctor of Philosophy on Monday 11 December at 13.00 in hall F3, Kungliga Tekniska Högskolan, Lindstedtsvägen 26, Stockholm.

© Per-Anders Jande, December 2006
Typeset by the author using LaTeX
Printed by Universitetsservice US AB

Abstract

Analytic knowledge about the systematic variation in a language has an important place in the description of the language. Such knowledge is of interest, for example, in language teaching, as a background for various types of linguistic studies, and in the development of more dynamic speech technology applications. In previous studies, the effects of single variables or relatively small groups of related variables on the pronunciation of words have been studied separately. The work described in this thesis takes a holistic perspective on pronunciation variation and focuses on a method for creating general descriptions of phone-level pronunciation in discourse context. The discourse context is defined by a large set of linguistic attributes, ranging from high-level variables such as speaking style down to the articulatory feature level. Models of phone-level pronunciation in the context of a discourse have been created for the central standard Swedish language variety. The models are represented in the form of decision trees, which are readable by both machines and humans.
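The idea of a decision tree that both a program and a person can read may be easier to picture with a toy example. The following is only a minimal sketch: the attribute names (word_stressed, speaking_style), their values, the rule itself and the helper names predict and render are all hypothetical and are not taken from the models described in the thesis.

```python
# Toy illustration only: attribute names, values and rules are hypothetical,
# not taken from the thesis models.
from typing import Dict, Union

# A leaf is the realised phone (a string; "" means the phoneme is deleted).
# An inner node tests one context attribute and branches on its value.
Tree = Union[str, dict]

TREE_FOR_D: Tree = {
    "attribute": "word_stressed",
    "branches": {
        "yes": "d",
        "no": {
            "attribute": "speaking_style",
            "branches": {"read": "d", "spontaneous": ""},
            "default": "d",
        },
    },
    "default": "d",
}


def predict(tree: Tree, context: Dict[str, str]) -> str:
    """Walk the tree until a leaf (a phone string) is reached."""
    while isinstance(tree, dict):
        value = context.get(tree["attribute"])
        # Unknown or missing attribute values fall back to the node default.
        tree = tree["branches"].get(value, tree["default"])
    return tree


def render(tree: Tree, indent: int = 0) -> str:
    """Return an indented, human-readable listing of the tree."""
    pad = "  " * indent
    if isinstance(tree, str):
        return f"{pad}-> {tree or '(deleted)'}\n"
    out = ""
    for value, subtree in tree["branches"].items():
        out += f"{pad}{tree['attribute']} = {value}\n"
        out += render(subtree, indent + 1)
    return out
```

In this toy tree, predict(TREE_FOR_D, {"word_stressed": "no", "speaking_style": "spontaneous"}) returns the empty string, standing for deletion, while render(TREE_FOR_D) gives an indented listing that a person can read directly. The same nested structure serves both purposes, which is the property the abstract attributes to decision tree models.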
A data-driven approach was taken for the pronunciation modelling task, and the work involved the annotation of recorded speech with linguistic and related information. The decision tree models were induced from the annotation. An important part of the work on pronunciation modelling was also the development of a pronunciation lexicon for Swedish. In a cross-validation experiment, several sets of pronunciation models were created with access to different parts of the attributes in the annotation. The prediction accuracy of pronunciation models could be improved by 42.2% by making information from layers above the phoneme level accessible during model training. Optimal models were obtained when attributes from all layers of annotation were used.

The goal for the models was to produce pronunciation representations representative of the language variety and not necessarily of the individual speakers on whose speech the models were trained. In the cross-validation experiment, model-produced phone strings were compared to key phonetic transcripts of actual speech, and the phone error rate was defined as the share of discrepancies between the respective phone strings. Thus, the phone error rate is the sum of actual errors and discrepancies resulting from desired adaptations from a speaker-specific pronunciation to a pronunciation reflecting general traits of the language variety. The optimal models gave an average phone error rate of 8.2%.

Keywords

Pronunciation modelling · Pronunciation variation · Discourse-context · Phone-level variation · Central standard Swedish · Spoken language annotation · Data-driven methods · Machine learning · Decision trees · Pronunciation lexicon development · Machine-readable lexicon · Phonology · Discourse · Lexicon

Sammanfattning (Summary)

Analytic knowledge about the systematic variation in a language has an important place in the description of the language. Such information is of interest, for example,
in the language teaching domain, as background to various types of linguistic studies, and in the development of more dynamic speech technology applications. In earlier studies, the effects of single variables or relatively small groups of related variables on the pronunciation of words have been investigated separately. The work described in this thesis takes a holistic perspective on pronunciation variation and focuses on a method for creating general descriptions of phone-level pronunciation in discourse context. The discourse context is defined by a large set of linguistic attributes, ranging from high-level variables such as speaking style down to the articulatory feature level. Models of phone-level pronunciation in the context of a discourse have been created for the language variety central standard Swedish. The models are represented in the form of decision trees, which are readable by both machines and humans. A data-driven approach was adopted for the pronunciation modelling task, and the work included annotating recorded speech with linguistic and related information. The decision tree models were induced from the annotation. An important part of the pronunciation modelling work was also the development of a pronunciation lexicon for Swedish. In a cross-validation experiment, several pronunciation models were created with access to different parts of the attributes in the annotation. The accuracy of the pronunciation models' predictions could be improved by 42.2% by making information from layers above the phoneme level available during model training. Optimal models were obtained when attributes from all annotation layers were used. The goal for the models was to produce pronunciation representations that are representative of the language variety and not necessarily of the individual speakers on whose speech the models were trained. In the cross-validation experiment, model-generated phone strings were compared with key transcripts of actual speech, and the phone error rate was defined as the proportion of discrepancies between the respective phone strings.
The phone error rate is thus the sum of actual errors and discrepancies arising from desired adaptations from a speaker-specific pronunciation to a pronunciation that reflects general traits of the language variety. The optimal models gave an average phone error rate of 8.2%.

Acknowledgements

I would like to thank my supervisor Rolf Carlson for his ideas and valuable comments and suggestions during my thesis work. I would also like to thank Beáta Megyesi for introducing me to the Department of Speech, Music and Hearing at KTH.

Thanks to Beáta Megyesi, Sara Rydin, Håkan Melin and Botond Pakucs for helping out with Linux-oriented questions. Thanks to Jens Edlund for Perl advice and help with databases and SQL. Thanks to Jonas Beskow for advice on Tcl/Tk and help with the KTH text-to-speech system. Thanks to Rolf Carlson for help with Rulsys. Thanks to Botond Pakucs for PHP advice. Thanks to Giampiero Salvi for advice and discussions on LaTeX. Thanks to Leif Grönquist for tools, information and tips related to the Göteborg Spoken Language Corpus.

Thanks to Rolf Carlson, Sheri Hunnicutt, Kjell Gustafson and David House for proofreading and commenting on the thesis and related papers. Thanks to Kjell Gustafson for discussions on CentLex matters and many other things and for good co-operation in the CentLex project.

Thanks also to everyone who has made the tools and resources used for the work described in this thesis available. Special thanks to Robert Bannert and Peter Czigler for their VaKoS database, to the Grog project participants for the Radio speech data and annotation, to the Department of Linguistics at Göteborg University for access to the Göteborg Spoken Language Corpus, to Kåre Sjölander for his excellent free software and for access to his aligner and phoneme models, and to Beáta Megyesi for part-of-speech tagging and parsing software.
Thanks to Grötgänget, Sheri Hunnicutt, Karl-Erik Spens, Inger Karlsson, Kjell Elenius, Mats Blomberg, Preben Wik and others for many culinary experiences and interesting discussions over lunch.

The research reported in this thesis was carried out at the Centre for Speech Technology (CTT), a competence centre at KTH, supported by Vinnova (the Swedish Agency for Innovation Systems), KTH and participating Swedish companies and organisations. The research was supported by the Swedish National Graduate School of Language Technology (GSLT).

Per-Anders Jande's Publications on Pronunciation Modelling

• Per-Anders Jande (2003). Evaluating Rules for Phonological Reduction in Swedish. In Proceedings of Fonetik, pp. 149–152, Lövånger, Sweden, June 2–4 2003.
• Per-Anders Jande (2003). Phonological Reduction in Swedish. In Proceedings of the International Congress of Phonetic Sciences (ICPhS), pp. 2557–2560, Barcelona, Catalonia, August 3–9 2003.
• Per-Anders Jande (2004). Pronunciation variation modelling using decision tree induction from multiple linguistic parameters. In Proceedings of Fonetik, pp. 12–15, Stockholm, Sweden, May 26–28 2004.
• Per-Anders Jande (2005). Annotating Speech Data for Pronunciation Variation Modelling. In Proceedings of Fonetik, pp. 25–28, Göteborg, Sweden, May 25–27 2005.
• Per-Anders Jande (2005). Inducing Decision Tree Pronunciation Variation Models from Annotated Speech Data. In Proceedings of Interspeech, pp. 1945–1948, Lisbon, Portugal, September 4–8 2005.
• Per-Anders Jande (2006). Integrating Linguistic Information from Multiple Sources in Lexicon Development and Spoken Language Annotation. In Proceedings of the LREC workshop on merging and layering linguistic information, pp. 1–8, Genoa, Italy, May 23 2006.
• Per-Anders Jande (2006). Modelling Pronunciation in Discourse Context. In Proceedings of Fonetik, pp. 69–72, Lund, Sweden, June 7–9 2006.
• Per-Anders Jande (Submitted).
Spoken Language Annotation and Data-Driven Modelling