Department of Linguistics



The XXIIth Swedish Phonetics Conference

June 10-12, 2009

Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Previous Swedish Phonetics Conferences (from 1986)

I 1986 Uppsala University
II 1988 Lund University
III 1989 KTH Stockholm
IV 1990 Umeå University (Lövånger)
V 1991
VI 1992 Chalmers and Göteborg University
VII 1993 Uppsala University
VIII 1994 Lund University (Höör)
- 1995 (XIIIth ICPhS in Stockholm)
IX 1996 KTH Stockholm (Nässlingen)
X 1997 Umeå University
XI 1998 Stockholm University
XII 1999 Göteborg University
XIII 2000 Skövde University College
XIV 2001 Lund University
XV 2002 KTH Stockholm
XVI 2003 Umeå University (Lövånger)
XVII 2004 Stockholm University
XVIII 2005 Göteborg University
XIX 2006 Lund University
XX 2007 KTH Stockholm
XXI 2008 Göteborg University

Proceedings FONETIK 2009
The XXIIth Swedish Phonetics Conference, held at Stockholm University, June 10-12, 2009
Edited by Peter Branderud and Hartmut Traunmüller
Department of Linguistics, Stockholm University, SE-106 91 Stockholm

ISBN 978-91-633-4892-1 (printed version)
ISBN 978-91-633-4893-8 (web version)
2009-05-28
http://www.ling.su.se/fon/fonetik_2009/proceedings_fonetik2009.pdf

The new symbol for the Phonetics group at the Department of Linguistics, which is shown on the front page, was created by Peter Branderud and Mikael Parkvall.

© The Authors and the Department of Linguistics, Stockholm University

Printed by Universitetsservice US-AB 2009



This volume contains the contributions to FONETIK 2009, the Twenty-second Swedish Phonetics Conference, organized by the Phonetics group of Stockholm University on the Frescati campus, June 10-12, 2009. The papers appear in the order in which they were given at the Conference. Only a limited number of copies of this publication were printed for distribution among the authors and those attending the meeting. For access to web versions of the contributions, please look under www.ling.su.se/fon/fonetik_2009/. We would like to thank all contributors to the Proceedings. We are also indebted to Fonetikstiftelsen for financial support.

Stockholm in May 2009

On behalf of the Phonetics group

Peter Branderud Francisco Lacerda Hartmut Traunmüller



Phonology and Speech Production

F0 lowering, creaky voice, and glottal stop: Jan Gauffin’s account of how the larynx works in speech 8
Björn Lindblom

Eskilstuna as the tonal key to Danish 12
Tomas Riad

Formant transitions in normal and disordered speech: An acoustic measure of articulatory dynamics 18
Björn Lindblom, Diana Krull, Lena Hartelius and Ellika Schalling

Effects of vocal loading on the phonation and collision threshold pressures 24
Laura Enflo, Johan Sundberg and Friedemann Pabst

Posters P1

Experiments with synthesis of … 28
Jonas Beskow and Joakim Gustafson

Real vs. rule-generated tongue movements as an audio-visual speech perception support 30
Olov Engwall and Preben Wik

Adapting the Filibuster text-to-speech system for Norwegian bokmål 36
Kåre Sjölander and Christina Tånnander

Acoustic characteristics of onomatopoetic expressions in child-directed speech 40
Ulla Sundberg and Eeva Klintfors

Swedish Dialects

Phrase initial accent I in South Swedish 42
Susanne Schötz and Gösta Bruce

Modelling compound intonation in Dala and Gotland Swedish 48
Susanne Schötz, Gösta Bruce and Björn Granström

The acoustics of … long close vowels as compared to Central Swedish and … Swedish 54
Eva Liina Asu, Susanne Schötz and Frank Kügler

Fenno-Swedish VOT: Influence from Finnish? 60
Catherine Ringen and Kari Suomi


Prosody

Grammaticalization of prosody in the brain 66
Mikael Roll and Merle Horne

Focal lengthening in assertions and confirmations 72
Gilbert Ambrazaitis

On utterance-final intonation in tonal and non-tonal dialects of Kammu 78
David House, Anastasia Karlsson, Jan-Olof Svantesson and Damrong Tayanin

Reduplication with fixed tone pattern in Kammu 82
Jan-Olof Svantesson, David House, Anastasia Mukhanova Karlsson and Damrong Tayanin

Posters P2

Exploring data driven parametric synthesis 86
Rolf Carlson, Kjell Gustafson

Uhm… What’s going on? An EEG study on perception of filled pauses in spontaneous Swedish speech 92
Sebastian Mårback, Gustav Sjöberg, Iris-Corinna Schwarz and Robert Eklund

HöraTal – a test and training program for children who have difficulties in perceiving and producing speech 96
Anne-Marie Öster

Second Language

Transient visual feedback on pitch variation for Chinese speakers of English 102
Rebecca Hincks and Jens Edlund

Phonetic correlates of unintelligibility in Vietnamese-accented English 108
Una Cunningham

Perception of Japanese quantity by Swedish speaking learners: A preliminary analysis 112
Miyoko Inoue

Automatic classification of segmental second language speech quality using prosodic features 116
Eero Väyrynen, Heikki Keränen, Juhani Toivanen and Tapio Seppänen


Speech Development

Children’s vocal behaviour in a pre-school environment and resulting vocal function 120
Mechtild Tronnier and Anita McAllister

Major parts-of-speech in child language – division in open and close class words 126
Eeva Klintfors, Francisco Lacerda and Ulla Sundberg

Language-specific speech perception as mismatch negativity in 10-month-olds’ ERP data 130
Iris-Corinna Schwarz, Malin Forsén, Linnea Johansson, Catarina Lång, Anna Narel, Tanya Valdés, and Francisco Lacerda

Development of self-voice recognition in children 136
Sofia Strömbergsson

Posters P3

Studies on using the SynFace talking head for the hearing impaired 140
Samer Al Moubayed, Jonas Beskow, Ann-Marie Öster, Giampiero Salvi, Björn Granström, Nic van Son, Ellen Ormel and Tobias Herzke

On extending VTLN to phoneme-specific warping in automatic speech recognition 144
Daniel Elenius and Mats Blomberg

Visual discrimination between Swedish and Finnish among L2-learners of Swedish 150
Niklas Öhrström, Frida Bulukin Wilén, Anna Eklöf and Joakim Gustafsson

Speech Perception

Estimating speaker characteristics for speech recognition 154
Mats Blomberg and Daniel Elenius

Auditory white noise enhances cognitive performance under certain conditions: Examples from visuo-spatial working memory and dichotic listening tasks 160
Göran G. B. W. Söderlund, Ellen Marklund, and Francisco Lacerda

Factors affecting visual influence on heard vowel roundedness: Web experiments with … and Turks 166
Hartmut Traunmüller


Voice and Forensic Phonetics

Breathiness differences in male and female speech. Is H1-H2 an appropriate measure? 172
Adrian P. Simpson

Emotions in speech: an interactional framework for clinical applications 176
Ani Toivanen and Juhani Toivanen

Earwitnesses: The effect of voice differences in identification accuracy and the realism in confidence judgments 180
Elisabeth Zetterholm, Farhan Sarwar and Carl Martin Allwood

Perception of voice similarity and the results of a voice line-up 186
Jonas Lindh

Posters P4

Project presentation: Spontal – multimodal database of spontaneous speech dialog 190
Jonas Beskow, Jens Edlund, Kjell Elenius, Kahl Hellmer, David House and Sofia Strömbergsson

A first step towards a text-independent speaker verification Praat plug-in using Mistral/Alize tools 194
Jonas Lindh

Modified re-synthesis of initial voiceless plosives by concatenation of speech from different speakers 198
Sofia Strömbergsson

Special Topics

Cross-modal clustering in the acoustic – articulatory space 202
G. Ananthakrishnan and Daniel M. Neiberg

Swedish phonetics 1939-1969 208
Paul Touati

How do Swedish encyclopedia users want pronunciation to be presented? 214
Michaël Stenberg

LVA-technology – The illusion of “lie detection” 220
F. Lacerda

Author Index 226


F0 lowering, creaky voice, and glottal stop: Jan Gauffin’s account of how the larynx works in speech
Björn Lindblom
Department of Linguistics, Stockholm University

Abstract

F0 lowering, creaky voice, Danish stød and glottal stops may at first seem like a group of only vaguely related phenomena. However, a theory proposed by Jan Gauffin (JG) almost forty years ago puts them on a continuum of supralaryngeal constriction. The purpose of the present remarks is to briefly review JG:s work and to summarize evidence from current research that tends to reinforce many of his observations and lend strong support to his view of how the larynx is used in speech. In a companion paper at this conference, Tomas Riad presents a historical and dialectal account of relationships among low tones, creak and stød in Swedish and Danish that suggests that the development of these phenomena may derive from a common phonetic mechanism. JG:s supralaryngeal constriction dimension with F0 lowering ⇔ creak ⇔ glottal stop appears like a plausible candidate for such a mechanism.

How is F0 lowered?

In his handbook chapter on “Investigating the physiology of laryngeal structures”, Hirose (1997:134) states: “Although the mechanism of pitch elevation seems quite clear, the mechanism of pitch lowering is not so straightforward. The contribution of the extrinsic laryngeal muscles such as sternohyoid is assumed to be significant, but their activity often appears to be a response to, rather than the cause of, a change in conditions. The activity does not occur prior to the physical effects of pitch change.”

Honda (1995) presents a detailed review of the mechanisms of F0 control, mentioning several studies of the role of the extrinsic laryngeal muscles motivated by the fact that F0 lowering is often accompanied by larynx lowering. However, his conclusion comes close to that of Hirose.

At the end of the sixties Jan Gauffin began his experimental work on laryngeal mechanisms. As we return to his work today we will see that, not only did he acknowledge the incompleteness of our understanding of F0 lowering, he also tried to do something about it.

Jan Gauffin’s account

JG collaborated with Osamu Fujimura at RILP at the University of Tokyo. There he had an opportunity to make films of the vocal folds using fiber optics. His data came mostly from Swedish subjects. He examined laryngeal behavior during glottal stops and with particular attention to the control of voice quality. Swedish word accents provided an opportunity to investigate the laryngeal correlates of F0 changes (Lindqvist-Gauffin 1969, 1972).

Analyzing the laryngoscopic images, JG became convinced that laryngeal behavior in speech involves anatomical structures not only at the glottal level but also above it. He became particularly interested in the mechanism known as the ‘aryepiglottic sphincter’. [Strictly speaking, the ‘ary-epiglottic sphincter’ is not a circular muscle system. It involves several muscular components whose joint action can functionally be said to be ‘sphincter-like’.] The evidence strongly suggested that this supraglottal structure plays a significant role in speech, both in articulation and in phonation.

In the literature on comparative anatomy JG discovered the use of the larynx in protecting the lungs and the lower airways and its key roles in respiration and phonation (Negus 1949). The throat forms a three-tiered structure with valves at three levels (Pressman 1954): the aryepiglottic folds, the ventricular folds and the true vocal folds. JG found that protective closure is brought about by invoking the “aryepiglottic muscles, oblique arytenoid muscles, and the thyroepiglottic muscles. The closure occurs above the glottis and is made between the tubercle of the epiglottis, the cuneiform cartilages, and the arytenoid cartilages”.

An overall picture started to emerge, both from established facts and from data that he gathered himself. He concluded that the traditional view of the function of the larynx in speech needed modification. The information conveyed by the fiberoptic data told him that in speech the larynx appears to be constricted in two ways: at the vocal folds and at the aryepiglottic folds. He hypothesized that the two levels “are independent at a motor command level and that different combinations of them may be used as phonatory types of laryngeal articulations in different languages”. Figure 1 presents JG’s 2-dimensional model applied to selected phonation types. In the sixties the standard description of phonation types was the one proposed by Ladefoged (1967), which placed nine distinct phonation types along a single dimension. In JG’s account a third dimension was also envisioned, with the vocalis muscles operating for pitch control in a manner independent of glottal abduction and laryngealization.

Figure 1. 2-D account of selected phonation types (Lindqvist-Gauffin 1972). Activity of the vocalis muscles adds a third dimension for pitch control which is independent of adduction/abduction and laryngealization.

Figure 2. Sequence of images of laryngeal movements from deep inspiration to the beginning of phonation. Time runs in a zig-zag manner from top to bottom of the figure. Phonation begins at the lower right of the matrix. It is preceded by a glottal stop which is seen to involve a supraglottal constriction.

Not only does the aryepiglottic sphincter mechanism reduce the inlet of the larynx. It also participates in decreasing the distance between the arytenoids and the tubercle of the epiglottis, thus shortening and thickening the vocal folds. When combined with adducted vocal folds, this action results in lower and irregular glottal vibrations, in other words, in lower F0 and in creaky voice.

JG’s proposal was novel in several respects:

(i) There is more going on than mere adjustments of the vocal folds along a single adduction-abduction continuum: the supralaryngeal (aryepiglottic sphincter) structures are involved in both phonatory and articulatory speech gestures;
(ii) These supralaryngeal movements create a dimension of ‘laryngeal constriction’. They play a key role in the production of the phonation types of the languages of the world;
(iii) Fiberoptic observations show that laryngealization is used to lower the fundamental frequency;
(iv) The glottal stop, creaky voice and F0 lowering differ in terms of degree of laryngeal constriction.

Figure 3. Laryngeal states during the production of high and low fundamental frequencies and with the vocal folds adducted and abducted. It is evident that the low pitch is associated with greater constriction at the aryepiglottic level in both cases.


Evaluating the theory

The account summarized above was developed in various reports from the late sixties and early seventies. In the tutorial chapter by Hirose (1997) cited in the introduction, supraglottal constrictions are but briefly mentioned in connection with whispering, glottal stop and the production of the Danish stød. In Honda (1995) it is not mentioned at all.

In 2001 Ladefoged contributed to an update on the world’s phonation types (Gordon & Ladefoged 2001) without considering the facts and interpretations presented by JG. In fact the authors’ conclusion is compatible with Ladefoged’s earlier one-dimensional proposal from 1967: “Phonation differences can be classified along a continuum ranging from voiceless, through breathy voiced, to regular, modal voicing, and then on through creaky voice to glottal closure……”.

JG did not continue to pursue his research on laryngeal mechanisms. He got involved in other projects without ever publishing enough in refereed journals to make his theory more widely known in the speech community. There is clearly an important moral here for both senior and junior members of our field.

The question also arises: Was JG simply wrong? No, recent findings indicate that his work is still relevant and in no way obsolete. One of the predictions of the theory is that the occurrence of creaky voice ought to be associated with a low F0. Monsen and Engebretson (1977) asked five male and five female adults to produce an elongated schwa vowel using normal, soft, loud, falsetto and creaky voice. As predicted, every subject showed a consistently lower F0 for the creaky voice (75 Hz for male, 100 Hz for female subjects).

Another expectation is that the Danish stød should induce a rapid lowering of the F0 contour. Figure 4, taken from Fischer-Jørgensen’s (1989) article, illustrates a minimal pair that conforms to that prediction.

Figure 4. Effect of stød on F0 contour. Minimal pair of Danish words. Adapted from Fischer-Jørgensen (1989). Speaker JR.

The best way of assessing the merit of JG’s work is to compare it with the phonetic research done during the last decade by John Esling with colleagues and students at the University of Victoria in Canada. Their experimental observations will undoubtedly change and expand our understanding of the role played by the pharynx and the larynx in speech. Evidently the physiological systems for protective closure, swallowing and respiration are re-used in articulation and phonation to an extent that is not yet acknowledged in current standard phonetic frameworks (Esling 1996, 2005; Esling & Harris 2005; Moisik 2008; Moisik & Esling 2007; Edmondson & Esling 2006). For further references, see http://www.uvic.ca/ling/research/phonetics .

In a recent thesis by Moisik (2008), an analysis was performed of anatomical landmarks in laryngoscopic images. To obtain a measure of the activity of the aryepiglottic sphincter mechanism, Moisik used an area bounded by the aryepiglottic folds and the epiglottic tubercle (red region (solid outline), top of Figure 5). His question was: How does it vary across various phonatory conditions? The two diagrams in the lower half of the figure provide the answer. Along the ordinate scales: the size of the observed area (in percent relative to the maximum value). The phonation types and articulations along the x-axes have been grouped into two sets. Left: conditions producing large areas, thus indicating little or no activity in the aryepiglottic sphincter; right: a set with small area values indicating strong degrees of aryepiglottic constriction. JG’s observations appear to match these results closely.

Figure 5. Top: Anatomical landmarks in a laryngoscopic image. Note the area bounded by the aryepiglottic folds and the epiglottic tubercle (red region, solid outline). Bottom: Scales along the y-axes: size of the observed area (in percent relative to the maximum value). Left: conditions with large areas indicating little activity in the aryepiglottic sphincter; right: small area values indicating stronger degrees of aryepiglottic constriction. Data source: Moisik (2008).

Conclusions

JG hypothesized that “laryngealization in combination with low vocalis activity is used as a mechanism for producing a low pitch voice” and that the proposed relationships between “low tone, laryngealization and glottal stop may give a better understanding of dialectal variations and historical changes in languages using low tone”.

Current evidence lends strong support to his view of how the larynx works in speech. His observations and analyses still appear worthy of being further explored and tested, in particular with regard to F0 control. JG would have enjoyed Riad (2009).

Acknowledgements

I am greatly indebted to John Esling and Scott Moisik of the University of Victoria for permission to use their work.

References

Edmondson J A & Esling J H (2006): “The valves of the throat and their functioning in tone, vocal register and stress: laryngoscopic case studies”, Phonology 23, 157–191.
Esling J H (1996): “Pharyngeal consonants and the aryepiglottic sphincter”, Journal of the International Phonetic Association 26, 65–88.
Esling J H (2005): “There are no back vowels: the laryngeal articulator model”, Canadian Journal of Linguistics/Revue canadienne de linguistique 50(1/2/3/4), 13–44.
Esling J H & Harris J G (2005): “States of the glottis: An articulatory phonetic model based on laryngoscopic observations”, in Hardcastle W J & Mackenzie Beck J (eds): A Figure of Speech: A Festschrift for John Laver, 345–383, LEA: New Jersey.
Fischer-Jørgensen E (1989): “Phonetic analysis of the stød in Standard Danish”, Phonetica 46, 1–59.
Gordon M & Ladefoged P (2001): “Phonation types: a cross-linguistic overview”, J Phonetics 29, 383–406.
Ladefoged P (1967): Preliminaries to linguistic phonetics, University of Chicago Press: Chicago.
Lindqvist-Gauffin J (1969): “Laryngeal mechanisms in speech”, STL-QPSR 2-3, 26–31.
Lindqvist-Gauffin J (1972): “A descriptive model of laryngeal articulation in speech”, STL-QPSR 13(2-3), 1–9.
Moisik S R (2008): A Three-Dimensional Model of the Larynx and the Laryngeal Constrictor Mechanism, M.A. thesis, University of Victoria, Canada.
Moisik S R & Esling J H (2007): “3-D auditory-articulatory modeling of the laryngeal constrictor mechanism”, in Trouvain J & Barry W J (eds): Proceedings of the 16th International Congress of Phonetic Sciences, vol. 1, 373–376, Saarbrücken: Universität des Saarlandes.
Monsen R B & Engebretson A M (1977): “Study of variations in the male and female glottal wave”, J Acoust Soc Am 62(4), 981–993.
Negus V E (1949): The Comparative Anatomy and Physiology of the Larynx, Hafner: New York.
Negus V E (1957): “The mechanism of the larynx”, Laryngoscope 67(10), 961–986.
Pressman J J (1954): “Sphincters of the larynx”, AMA Arch Otolaryngol 59(2), 221–236.
Riad T (2009): “Eskilstuna as the tonal key to Danish”, Proceedings FONETIK 2009, Dept. of Linguistics, Stockholm University.


Eskilstuna as the tonal key to Danish Tomas Riad Department of Scandinavian languages, Stockholm University

Abstract is the typological tendency for stød to occur in This study considers the distribution of the direct vicinity of tonal systems (e.g. Baltic, creak/stød in relation to the tonal profile in the SE Asian, North Germanic). Also, the phonetic stød basis of Central Swedish (CSw) spoken in conditioning of stød (Da. ), that is, Eskilstuna. It is shown that creak/stød corre- sonority and stress, resembles that of some to- lates with the characteristic HL fall at the end nal systems, e.g. Central Franconian (Gussen- of the intonation phrase and that this fall has hoven and van der Vliet, 1999; Peters, 2007). earlier timing in Eskilstuna, than in the stan- Furthermore, there is the curious markedness non dard variety of CSw. Also, a tonal shift at the reversal as the lexically -correlating stød and accent 2 are usually considered the marked left edge in focused words is seen to instantiate 1 the beginnings of the dialect transition to the members of their respective oppositions. This Dalabergslag (DB) variety. These features fit indicates that the relation between the systems into the general hypothesis regarding the ori- is not symmetrical. Finally, there is phonetic gin of Danish stød and its relation to surround- work that suggests a close relationship between ing tonal dialects (Riad, 1998a). A laryngeal F0 lowering, creak and stød (Gauffin, 1972ab), mechanism, proposed by Jan Gauffin, which as discussed in Lindblom (this volume). relates low F0, creak and stød is discussed by The general structure of the hypothesis as Björn Lindblom in a companion paper (this well as several arguments are laid out in some volume). 
detail in Riad (1998a; 2000ab), where it is claimed that all the elements needed to recon- struct the origin of Danish stød can be found in Background the dialects of the Mälardal region in : According to an earlier proposal (Riad, 1998a; facultative stød, loss of distinctive accent 2, and 2000ab), the stød that is so characteristic of a tonal shift from double-peaked to single- Standard Danish has developed from a previous peaked accent 2 in the neighbouring dialects. tonal system, which has central properties in The suggestion, then, is that the Danish system common with present-day Central Swedish, as would have originated from a tonal dialect type spoken in the Mälardal region. This diachronic similar to the one spoken today in Eastern order has long been the standard view (Kro- Mälardalen. The development in Danish is due man, 1947; Ringgaard, 1983; Fischer- to a slightly different mix of the crucial fea- Jørgensen, 1989; for a different view, cf. tures. In particular, the loss of distinctive accent Libermann, 1982), but serious discussion re- 2 combined with the grammaticalization of stød garding the phonological relation between the in stressed syllables. tonal systems of Swedish and Norwegian on The dialect-geographic argument supports the one hand, and the Danish stød system on parallel developments. The dialects of Dala- the other, is surprisingly hard to find. Presuma- bergslagen and Gotland are both systematically bly, this is due to both the general lack of pan- related to the dialect of Central Swedish. While Scandinavian perspective in earlier Norwegian the tonal grammar is the same, the tonal make- and Swedish work on the tonal dialectology up is different and this difference can be under- (e.g. Fintoft et al., 1978; Bruce and Gårding, stood as due to a leftward tonal shift (Riad, 1978), and the reification of stød as a non-tonal 1998b). 
A parallel relation would hold between phonological object in the Danish research tra- the original, but now lost, tonal dialect of Sjæl- dition (e.g. Basbøll 1985; 2005). land in and the surrounding dialects, All signs, however, indicate that stød should which remain tonal to this day: South Swedish, be understood in terms of tones, and this goes South Norwegian and West Norwegian. These for phonological representation, as well as for are all structurally similar tonal types. It is un- origin and diachronic development. There are contested, historically and linguistically, that the striking lexical correlations between the South Swedish and South Norwegian have re- systems, where stød tends to correlate with ac- ceived many of their distinctive characteristics cent 1 and absence of stød with accent 2. There from Danish, and the prosodic system is no ex-

12 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University ception to that development. Furthermore, the marked realizational profile of the fall, but tonal system of South Swedish, at least, is suf- there are also distributional factors that likely ficiently different from its northern neighbours, add to the salience, one of which is the very the Göta dialects, to make a direct prosodic fact that the most common place for curl is in connection unlikely (Riad, 1998b; 2005). This phrase final position, in the fall from the focal excludes the putative alternative hypothesis. H tone to the boundary L% tone. In this contribution, I take a closer look at Below are a few illustrations of typical in- some of the details regarding the relationship stances of fall/curl, creak and stød. Informants between creak/stød and the constellation of are denoted with ‘E’ for ‘Eskilstuna’ and a tones. The natural place to look is the dialect of number, as in Pettersson and Forsberg (1970, Eskilstuna, located to the west of Stockholm, Table 4), with the addition of ‘w’ or ‘m’ for which is key to the understanding of the pho- ‘woman’ and ‘man’, respectively. netic development of stød, the tonal shift in the 500 dialect transition from CSw to DB, and the 400 generalization of accent 2. I have used part of 300 the large corpus of interviews collected by 200 Bengt Nordberg and his co-workers in the 60’s, 100 and by Eva Sundgren in the 90’s, originally for 0 ba- ge- ˈri- et the purpose of large-scale sociolinguistic inves- (Hz) Pitch ’the bakery’ tigation (see e.g. Nordberg, 1969; Sundgren, 2002). All examples in this article are taken L H L , , , , from Nordberg’s recordings (cf. Pettersson and 0 0.7724 Forsberg, 1970). Analysis has been carried out Time (s) in Praat (Boersma and Weenink, 2009). Figure 1. HL% fall/curl followed by creak (marked ‘, , ,’ on the tone tier). E149w: bage1ˈriet ‘the bak- Creak/stød as a correlate of HL ery’. 
Fischer-Jørgensen’s F0 graphs of minimal 500 stød/no-stød pairs show that stød cooccurs with 400 a sharp fall (1989, appendix IV). We take HL 300 to be the most likely tonal configuration for the 200 occurrence of stød, the actual correlate being a 100 L target tone. When the HL configuration oc- 0 curs in a short space of time, i.e. under com- Pitch (Hz) jo de tyc- ker ja ä ˈkul pression, and with a truly low target for the L ’yes, I think that’s fun’ tone, creak and/or stød may result. A hypothe- sis for the phonetic connection between these H , , L , , phenomena has been worked out by Jan Gauf- 0 1.125 fin (1972ab), cf. Lindblom (2009a; this vol- Time (s) ume). Figure 2. HL% fall interrupted by creak. E106w: The compressed HL contour, the extra low 1ˈkul ‘fun’. L and the presence of creak/stød are all proper- ties that are frequent in speakers of the 500 Eskilstuna variety of Central Swedish. Bleckert 400 (1987, 116ff.) provides F0 graphs of the sharp 300 tonal fall, which is known as ‘Eskilstuna curl’ 200 (Sw. eskilstunaknorr) in the folk terminology. 100 Another folk term, ‘Eskilstuna creak’ (Sw. 0 eskilstunaknarr), picks up on the characteristic (Hz) Pitch å hadd en ˈbä- lg creak. These terms are both connected with the ’and had (a) bellows’ HL fall which is extra salient in Eskilstuna as H L well as several other varieties within the so- called ‘whine belt’ (Sw. gnällbältet), compared 0 1.439 with the eastward, more standard Central Time (s) Swedish varieties around Stockholm. Clearly, Figure 3. HL% fall interrupted by stød (marked by part of the salience comes directly from the ‘o’ on the tone tier). E147w: 1ˈbälg ‘bellows’.

13 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

As in Danish, there is often a tonal 'rebound' after the creak/stød, visible as a resumed F0, but not sounding like rising intonation. A striking case is given in Figure 4, where the F0 is registered as rising to equally high frequency as the preceding H, though the auditory impression and phonological interpretation is L%.

[Figure 4: pitch track (Hz) of "till exempel keˈmi" 'for example chemistry'.]
Figure 4. HL% fall with rebound after creak. E106w: ke1ˈmi 'chemistry'.

Creaky voice is very common in the speech of several informants, but both creak and stød are facultative properties in the dialect. Unlike Danish, then, there is no phonologization of stød in Eskilstuna. Also, while the most typical context for creak/stød is the HL% fall from focal to boundary tone, there are instances where it occurs in other HL transitions. Figure 5 illustrates a case where there are two instances of creak/stød in one and the same word.

[Figure 5: pitch track of "dä e nog skillnad kanske om man får sy på (...) ˈhela" 'there's a difference perhaps if you get to sew on (...) the whole'.]
Figure 5. Two HL falls interrupted by creak and stød. E118w: 2ˈhela 'the whole'. Stød in an unstressed syllable.

It is not always easy to make a categorical distinction between creak and stød in the vowel. Often, stød is followed by creaky voice, and sometimes creaky voice surrounds a glottal closure. This is as it should be, if we, following Gauffin (1972ab), treat stød and creak as adjacent on a supralaryngeal constriction continuum. Note in this connection that the phenomenon of Danish stød may be realized both as a creak or with a complete closure (Fischer-Jørgensen 1989, 8). In Gauffin's proposal, the supralaryngeal constriction, originally a property used for vegetative purposes, could be used also to bring about quick F0 lowering, cf. Lindblom (2009a; this volume). For our purposes of connecting a tonal system with a stød system, it is important to keep in mind that there exists a natural connection between L tone, creaky voice and stød.

The distribution of HL%

The HL% fall in Eskilstuna exhibits some distributional differences compared with standard Central Swedish. In the standard variety of Central Swedish (e.g. the one described in Bruce, 1977; Gussenhoven, 2004), the tonal structure of accent 1 is LHL%, where the first L is associated in the stressed syllable. The same tonal structure holds in the latter part of compounds, where the corresponding L is associated in the last stressed syllable. This is schematically illustrated in Figure 6.

1ˈmålet 'the goal'
2ˈmellanˌmålet 'the snack'
Figure 6. The LHL% contour in standard CSw accent 1 simplex and accent 2 compounds.

In both cases the last or only stress begins L, after which there is a HL% fall. In the Eskilstuna variety, the timing of the final fall tends to be earlier than in the more standard CSw varieties. Often, it is not the first L of LHL% which is associated, but rather the H tone. This holds true of both monosyllabic simplex forms and compounds.

[Figure 7: pitch track of "då va ju nästan hela ˈstan eh" 'then almost the entire town was... eh'.]
Figure 7. Earlier timing of final HL% fall in simplex accent 1. E8w: 1ˈstan 'the town'.


Simplex accent 2 exhibits the same property, cf. Figure 10.

[Figure 8: pitch track of "såna där som inte hade nå ˈhusˌrum" 'such people who did not have a place to stay'.]
Figure 8. Earlier timing of final HL% fall in compound accent 2. E8w: 2ˈhusˌrum 'place to stay'.

Another indication of the early timing of HL% occurs in accent 2 trisyllabic simplex forms, where the second peak occurs with great regularity in the second syllable.

[Figure 9: pitch track of "den var så ˈgripande (...) hela ˈhandlingen å så där" 'it was so moving (...) the entire plot and so on'.]
Figure 9. Early timing of HL% in trisyllabic accent 2 forms. E106w: 2ˈgripande 'moving', 2ˈhandlingen 'the plot'.

In standard CSw the second peak is variably realized in either the second or third syllable (according to factors not fully worked out), a feature that points to a southward relationship with the Göta dialects, where the later realization is the rule.

The compression and leftward shift at the end of the focused word has consequences also for the initial part of the accent 2 contour. The lexical or postlexical accent 2 tone in CSw is H. In simplex forms, this H tone is associated to the only stressed syllable (e.g. Figure 5, 2ˈhela 'the whole'), and in compounds the H tone is associated to the first stressed syllable (Figure 6). In some of the informants' speech, there has been a shift of tones at this end of the focus domain, too. We can see this in the compound 2ˈhusˌrum 'place to stay' in Figure 8. The first stress of the compound is associated to a L tone rather than the expected H tone of standard CSw. In fact, the H tone is missing altogether.

[Figure 10: pitch track of "där åkte vi förr nn å ˈbada" 'we went there back then to swim'.]
Figure 10. Lexical L tone in the main stress syllable of simplex accent 2. Earlier timing of final HL% fall. E8w: 2ˈbada 'swim'.

Listening to speaker E8w (Figures 7, 8, 10, 11), one clearly hears some features that are characteristic of the Dalabergslag dialect (DB), spoken northwestward of Eskilstuna. In this dialect, the lexical/post-lexical tone of accent 2 is L, and the latter part of the contour is HL%. However, it would not be right to simply classify this informant and others sounding much like her as DB speakers, as the intonation in compounds is different from that of DB proper. In DB proper there is a sharp LH rise on the primary stress of compounds, followed by a plateau (cf. Figure 12). This is not the case in this Eskilstuna variety, where the rise does not occur until the final stress. The pattern is the same in longer compounds, too, as illustrated in Figure 11.

[Figure 11: pitch track of "i kö flera timmar för att få en ˈpaltˌbrödsˌkaka" 'in a queue for several hours to get a palt bread loaf'.]
Figure 11. Postlexical L tone in the main stress syllable of compound accent 2. E8w: 2ˈpaltˌbrödsˌkaka 'palt bread loaf'.

Due to the extra space afforded by a final unstressed syllable in Figure 11, the final fall is later timed than in Figure 8, but equally abrupt.


Variation in Eskilstuna and the reconstruction of Danish

The variation among Eskilstuna speakers with regard to whether they sound more like the CSw or DB dialect types can be diagnosed in a simple way by looking at the lexical/post-lexical tone of accent 2. In CSw it is H (cf. Figure 5), in DB it is L (cf. Figure 10). Interestingly, this tonal variation appears to co-vary with the realization of creak/stød, at least for the speakers I have looked at so far. The generalization appears to be that the HL% fall is more noticeable with the Eskilstuna speakers that sound more Central Swedish, that is, E106w, E47w, E147w, E67w and E118w. The speakers E8w, E8m and E149w sound more like DB, and they exhibit less pronounced falls and less creak/stød. This patterning can be understood in terms of compression.

According to the general hypothesis, the DB variety as spoken further to the northwest of Eskilstuna is a response to the compression instantiated by curl, hence the DB variety has developed from an earlier Eskilstuna-like system (Riad 2000ab). By shifting the other tones of the focus contour to the left, the compression is relieved. As a consequence, creak/stød should also be expected to occur less regularly. The relationship between the dialects is schematically depicted for accent 2 simplex and compounds in Figure 12. Arrows indicate where things have happened relative to the preceding variety.

[Figure 12: schematic simplex and compound contours for Standard CSw, Eskilstuna CSw, Eskilstuna DB and DB proper.]
Figure 12. Schematic picture of the tonal shift in accent 2 simplex and compounds.

The tonal variation within Eskilstuna thus allows us to tentatively propose an order of diachronic events, where the DB variety should be seen as a development from a double-peak system like the one in CSw, i.e. going from top to bottom in Figure 12. Analogously, we would assume a similar relationship between the former tonal dialect in Sjælland and the surrounding tonal dialects of South Swedish, South Norwegian and West Norwegian.

The further development within Sjælland Danish involves the phonologization of stød and the loss of the tonal distinction. The reconstruction of these events finds support in the phenomenon of generalized accent 2, also found in Eastern Mälardalen. Geographically, the area which has this pattern is to the east of Eskilstuna. The border between curl and generalized accent 2 is crisp, and the tonal structure is clearly CSw in character. The loss of distinctive accent 2 by generalization of the pattern to all relevant disyllables can thus also be connected to a system like that found in Eskilstuna, in particular the variety with compression and relatively frequent creak/stød (Eskilstuna CSw in Figure 12). For further aspects of the hypothesis and arguments in relation to Danish, cf. Riad (1998a, 2000ab).

Conclusion

The tonal dialects within Scandinavia are quite tightly connected, both as regards tonal representation and tonal grammar, a fact that rather limits the number of possible developments (Riad 1998b). This makes it possible to reconstruct a historical development from a now lost tonal system in Denmark to the present-day stød system. We rely primarily on the rich tonal variation within the Eastern Mälardal region, where Eskilstuna and the surrounding varieties provide several phonetic, distributional, dialectological, geographical and representational pieces of the puzzle that prosodic reconstruction involves.

Acknowledgements

I am indebted to Bengt Nordberg for providing me with CDs of his 1967 recordings in Eskilstuna. Professor Nordberg has been of invaluable help in selecting representative informants for the various properties that I was looking for in this dialect.

Notes

1. For a different view of the markedness issue, cf. Lahiri, Wetterlin, and Jönsson-Steiner (2005).
2. There are other differences (e.g. in the realization of accent 1), which are left out of this presentation.


References

Basbøll H. (1985) Stød in Modern Danish. Folia Linguistica XIX.1–2, 1–50.
Basbøll H. (2005) The Phonology of Danish (The Phonology of the World's Languages). Oxford: Oxford University Press.
Bleckert L. (1987) Centralsvensk diftongering som satsfonetiskt problem. (Skrifter utgivna av institutionen för nordiska språk vid Uppsala universitet 21) Uppsala.
Boersma P. and Weenink D. (2009) Praat: doing phonetics by computer (Version 5.1.04) [Computer program]. Retrieved in April 2009 from http://www.praat.org/.
Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G., and Bannert R. (eds) Nordic prosody. Papers from a symposium (Travaux de l'Institut de Linguistique de Lund 13) Lund University, 219–228.
Fintoft K., Mjaavatn P.E., Møllergård E., and Ulseth B. (1978) Toneme patterns in Norwegian dialects. In Gårding E., Bruce G., and Bannert R. (eds) Nordic prosody. Papers from a symposium (Travaux de l'Institut de Linguistique de Lund 13) Lund University, 197–206.
Fischer-Jørgensen E. (1989) A phonetic study of the stød in Standard Danish. University of . (revised version of ARIPUC 21, 56–265).
Gauffin [Lindqvist] J. (1972a) A descriptive model of laryngeal articulation in speech. Speech Transmission Laboratory Quarterly Progress and Status Report (STL-QPSR) (Dept. of Speech Transmission, Royal Institute of Technology, Stockholm) 2–3/1972, 1–9.
Gauffin [Lindqvist] J. (1972b) Laryngeal articulation studied on Swedish subjects. STL-QPSR 2–3, 10–27.
Gussenhoven C. (2004) The Phonology of Tone and Intonation. Cambridge: Cambridge University Press.
Gussenhoven C. and van der Vliet P. (1999) The phonology of tone and intonation in the Dutch dialect of Venlo. Journal of Linguistics 35, 99–135.
Kroman E. (1947) Musikalsk akcent i dansk. København: Einar Munksgaard.
Lahiri A., Wetterlin A., and Jönsson-Steiner E. (2005) Lexical specification of tone in North Germanic. Nordic Journal of Linguistics 28, 1, 61–96.
Liberman A. (1982) Germanic Accentology. Vol. I: The Scandinavian languages. Minneapolis: University of Minnesota Press.
Lindblom B. (to appear) Laryngeal mechanisms in speech: The contributions of Jan Gauffin. Logopedics Phoniatrics Vocology. [accepted for publication]
Lindblom B. (this volume) F0 lowering, creaky voice, and glottal stop: Jan Gauffin's account of how the larynx is used in speech.
Nordberg B. (1969) The urban dialect of Eskilstuna, methods and problems. FUMS Rapport 4, Uppsala University.
Peters J. (2007) Bitonal lexical pitch accents in the Limburgian dialect of Borgloon. In Riad T. and Gussenhoven C. (eds) Tones and Tunes, vol 1. Typological Studies in Word and Sentence Prosody, 167–198. (Phonology and Phonetics). Berlin: Mouton de Gruyter.
Pettersson P. and Forsberg K. (1970) Beskrivning och register över Eskilstunainspelningar. FUMS Rapport 10, Uppsala University.
Riad T. (1998a) Curl, stød and generalized accent 2. Proceedings of Fonetik 1998 (Dept. of Linguistics, Stockholm University) 8–11.
Riad T. (1998b) Towards a Scandinavian accent typology. In Kehrein W. and Wiese R. (eds) Phonology and Morphology of the , 77–109 (Linguistische Arbeiten 386) Tübingen: Niemeyer.
Riad T. (2000a) The origin of Danish stød. In Lahiri A. (ed) Analogy, Levelling and Markedness. Principles of change in phonology and morphology. Berlin/New York: Mouton de Gruyter, 261–300.
Riad T. (2000b) Stöten som aldrig blev av – generaliserad accent 2 i Östra Mälardalen. Folkmålsstudier 39, 319–344.
Riad T. (2005) Historien om tonaccenten. In Falk C. and Delsing L.-O. (eds) Studier i svensk språkhistoria 8, Lund: Studentlitteratur, 1–27.
Ringgaard K. (1983) Review of Liberman (1982). Phonetica 40, 342–344.
Sundgren E. (2002) Återbesök i Eskilstuna. En undersökning av morfologisk variation och förändring i nutida talspråk. (Skrifter utgivna av Institutionen för nordiska språk vid Uppsala universitet 56) Uppsala.


Formant transitions in normal and disordered speech: An acoustic measure of articulatory dynamics

Björn Lindblom1, Diana Krull1, Lena Hartelius2 & Ellika Schalling3
1Department of Linguistics, Stockholm University
2Institute of Neuroscience and Physiology, University of
3Department of Logopedics and Phoniatrics, CLINTEC, Karolinska Institute, Karolinska University Hospital, Huddinge

Abstract

This paper presents a method for numerically specifying the shape and speed of formant trajectories. Our aim is to apply it to groups of normal and dysarthric speakers and to use it to make comparative inferences about the temporal organization of articulatory processes. To illustrate some of the issues it raises, we here present a detailed analysis of speech samples from a single normal talker. The procedure consists in fitting damped exponentials to transitions traced from spectrograms and determining their time constants. Our first results indicate a limited range for F2 and F3 time constants. Numbers for F1 are more variable and indicate rapid changes near the VC and CV boundaries. For the type of speech materials considered, time constants were found to be independent of speaking rate. Two factors are highlighted as possible determinants of the patterning of the data: the non-linear mapping from articulation to acoustics and the biomechanical response characteristics of individual articulators. When applied to V-stop-V citation forms the method gives an accurate description of the acoustic facts and offers a feasible way of supplementing and refining measurements of extent, duration and average rate of formant frequency change.

Background issues

Speaking rate

One of the issues motivating the present study is the problem of how to define the notion of 'speaking rate'. Conventional measures of speaking rate are based on counting the number of segments, syllables or words per unit time. However, attempts to characterize speech rate in terms of 'articulatory movement speed' appear to be few, if any. The question arises: Are rate variations in the number of phonemes per second mirrored by parallel changes in 'rate of articulatory movement'? At present it does not seem advisable to take a parallelism between movement speed and number of phonetic units per second for granted.

Temporal organization: Motor control in normal and dysarthric speech

Motor speech disorders (dysarthrias) exhibit a wide range of articulatory difficulties: there are different types of dysarthria depending on the specific nature of the neurological disorder. Many dysarthric speakers share the tendency to produce distorted vowels and consonants, to nasalize excessively, to prolong segments and thereby disrupt stress patterns, and to speak in a slow and labored way (Duffy 2005). For instance, in multiple sclerosis and ataxic dysarthria, syllable durations tend to be longer and equal in duration ('scanning speech'). Furthermore, inter-stress intervals become longer and more variable (Hartelius et al 2000, Schalling 2007).

Deviant speech timing has been reported to correlate strongly with the low intelligibility in dysarthric speakers. Trying to identify the acoustic bases of reduced intelligibility, investigators have paid special attention to the behavior of F2, examining its extent, duration and rate of change (Kent et al 1989, Weismer et al 1992, Hartelius et al 1995, Rosen et al 2008). Dysarthric speakers show reduced transition extents, prolonged transitions and hence lower average rates of formant frequency change (flatter transition slopes).

In theoretical and clinical phonetic work it would be useful to be able to measure speaking rate defined both as movement speed and in terms of number of units per second. The present project attempts to address this objective, building on previous acoustic analyses of dysarthric speech and using formant pattern rate of change as an indirect window on articulatory movement.


Method

The method is developed from observing that formant frequency transitions tend to follow smooth curves roughly exponential in shape (Figure 1). Other approaches have been used in the past (Broad & Fertig 1970). Stevens et al (1966) fitted parabolic curves to vowel formant tracks. Ours is similar to the exponential curve fitting procedure of Talley (1992) and Park (2007).

[Figure 1: spectrogram of the syllable [ga].]
Figure 1. Spectrogram of syllable [ga]. White circles represent measurements of the F2 and F3 transitions. The two contours can be described numerically by means of exponential curves (Eqs 1 and 2).

Mathematically the F2 pattern of Figure 1 can be approximated by:

F2(t) = (F2L - F2T)*e^(-αt) + F2T    (1)

where F2(t) is the observed course of the transition, and F2L and F2T represent the starting point ('F2 locus') and the endpoint ('F2 target') respectively. The term e^(-αt) starts out from a value of unity at t=0 and approaches zero as t gets larger. The α term is the 'time constant' in that it controls the speed with which e^(-αt) approaches zero.

At t=0 the value of Eq (1) is (F2L - F2T) + F2T = F2L. When e^(-αt) is near zero, F2(t) is taken to be equal to F2T.

To capture patterns like the one for F3 in Figure 1 a minor modification of Eq (1) is required because F3 frequency increases rather than decays. This is done by replacing e^(-αt) by its complement (1 - e^(-αt)). We then obtain the following expression:

F3(t) = (F3L - F3T)*(1 - e^(-αt)) + F3T    (2)

Speech materials

At the time of submitting this report recordings and analyses are ongoing. Our intention is to apply the proposed measure to both normal and dysarthric speakers. Here we present some preliminary normal data on consonant and vowel sequences occurring in V:CV and VC:V frames with V=[i ɪ e ɛ a ɑ ɔ o u] and C=[b d g]. As an initial goal we set ourselves the task of describing how the time constants for F1, F2 and F3 vary as a function of vowel features, consonant place (articulator) and formant number.

The first results come from a normal male speaker of Swedish reading lists with randomized V:CV and VC:V words, each repeated five times. No carrier phrase was used.

Since one of the issues in the project concerns the relationship between 'movement speed' (as derived from formant frequency rate of change) and 'speech rate' (number of phonemes per second), we also had subjects produce repetitions of a second set of test words: dag, dagen, Dagobert [ˈdɑ:gɔbæʈ], dagobertmacka. This approach was considered preferable to asking subjects to "vary their speaking rate". Although this instruction has been used frequently in experimental phonetic work, it has the disadvantage of leaving the speaker's use of 'over-' and 'underarticulation' (the 'hyper-hypo' dimension) uncontrolled (Lindblom 1990). By contrast, the present alternative is attractive in that the selected words all have the same degree of main stress ('huvudtryck') on the first syllable [dɑ:(g)-]. Secondly, speaking rate is implicitly varied by means of the 'word length effect' which has been observed in many languages (Lindblom et al 1981). In the present test words it is manifested as a progressive shortening of the segments of [dɑ:(g)-] when more and more syllables are appended.

Determining time constants

To measure transition time constants the following protocol was followed.

The speech samples were digitized and examined with the aid of wide-band spectrographic displays in Swell [FFT points 55/1024, Bandwidth 400 Hz, Hanning window 4 ms].


For each sample the time courses of F1, F2 and F3 were traced by clicking the mouse along the formant tracks. Swell automatically produced a two-column table with the sample's time and frequency values.

The value of α was determined after rearranging and generalizing Eq (1) as follows:

(Fn(t) - FnT)/(FnL - FnT) = e^(-αt)    (3)

and taking the natural logarithm of both sides, which produces:

ln[(Fn(t) - FnT)/(FnL - FnT)] = -αt (4)

Eq (4) suggests that, by plotting the logarithm of the Fn(t) data (normalized to vary between 1 and zero) against time, a linear cluster of data points would be obtained (provided that the transition is exponential). A straight line fitted to the points so that it runs through the origin would have a slope of α. This procedure is illustrated in Figure 2.

Figure 2. Normalized formant transition. Top: linear scale running between 1.0 and zero; bottom: same data on logarithmic scale. The straight-line pattern of the data points allows us to compute the slope of the line. This slope determines the value of the time constant.

Results

High r squared scores were observed (r2 > 0.90), indicating that exponential curves were good approximations to the formant transitions.

Figure 3 gives a representative example of how well the exponential model fits the data. It shows the formant transitions in [da]. Measurements from 5 repetitions of this syllable were pooled for F1, F2 and F3. Time constants were determined and plugged into the formant equations to generate the predicted formant tracks (shown in red).

Figure 3. Measured data for 5 repetitions of [da] (black dots) produced by male speaker. In red: exponential curves derived from the average formant-specific values of locus and target frequencies and time constants.

The overall patterning of the time constants is illustrated in Figure 4. The diagram plots time constant values against frequency in all V:CV and VC:V words. Each data point is the value derived from five repetitions by a single male talker. Note that, since decaying exponentials are used, time constants come out as negative numbers and all data points end up below the zero line.

Figure 4. Formant time constants in V:CV and VC:V words plotted as a function of formant frequency (kHz). F1 (open triangles), F2 (squares) and F3 (circles). Each data point is the value derived from five repetitions.
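The fitting procedure of Eqs (1), (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration on synthetic data; the locus, target and α values are invented for the example, not measurements from the paper:

```python
import math

def formant_transition(t, locus, target, alpha):
    """Eq (1): damped exponential running from the locus toward the target."""
    return (locus - target) * math.exp(-alpha * t) + target

def estimate_alpha(times, freqs, locus, target):
    """Eqs (3)-(4): normalize the transition, take the natural log, and fit
    a least-squares line through the origin; the slope of that line is -alpha."""
    ys = [math.log((f - target) / (locus - target)) for f in freqs]
    slope = sum(t * y for t, y in zip(times, ys)) / sum(t * t for t in times)
    return -slope

# Synthetic F2 transition sampled every 5 ms with a known time constant
alpha_true = 25.0
times = [0.005 * k for k in range(1, 11)]          # 5-50 ms
freqs = [formant_transition(t, 1800.0, 1200.0, alpha_true) for t in times]

alpha_est = estimate_alpha(times, freqs, 1800.0, 1200.0)
```

On noise-free synthetic data the regression recovers the time constant essentially exactly; with points traced from spectrograms, the r² of the fitted line indicates how well the exponential model holds, as in the Results above.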


F1 shows the highest negative values and the largest range of variation. F2 and F3 are seen to occupy a limited range, forming a horizontal pattern independent of frequency.

A detailed analysis of the F1 transition suggests preliminarily that VC transitions tend to be somewhat faster than CV transitions; VC: data show larger values than VC measurements.

Figure 5. Vowel duration (left y-axis) and F2 time constants (right y-axis) plotted as a function of number of syllables per word.

Figure 5 shows how the duration of the vowel [ɑ:] in [dɑ:(g)-] varies with word length. Using the y-axis on the left we see that the duration of the stressed vowel decreases as a function of the number of syllables that follow. This compression effect implies that the speaking rate increases with word length.

The time constant for F2 is plotted along the right ordinate. The quasi-horizontal pattern of the open square symbols indicates that time constant values are not influenced by the rate increase.

Discussion

Non-linear acoustic mapping

It is important to point out that the proposed measure can only give us an indirect estimate of articulatory activity. One reason is the non-linear relationship between articulation and acoustics, which for identical articulatory movement speeds could give rise to different time constant values.

The non-linear mapping is evident in the high negative numbers observed for F1. Do we conclude that the articulators controlling F1 (primarily jaw opening and closing) move faster than those tuning F2 (the tongue front-back motions)? The answer is no.

Studies of the relation between articulation and acoustics (Fant 1960) tell us that rapid F1 changes are to be expected when the vocal tract geometry changes from a complete stop closure to a more open vowel-like configuration. Such abrupt frequency shifts exemplify the non-linear nature of the relation between articulation and acoustics. Quantal jumps of this kind lie at the heart of the Quantal Theory of Speech (Stevens 1989). Drastic non-linear increases can also occur in other formants but do not necessarily indicate faster movements.

Such observations may at first appear to make the present method less attractive. On the other hand, we should bear in mind that the transformation from articulation to acoustics is a physical process that constrains both normal and disordered speech production. Accordingly, if identical speech samples are compared, it should nonetheless be possible to draw valid conclusions about differences in articulation.

Figure 6. Same data as in Figure 4. Abscissa: extent of F1, F2 or F3 transition ('locus'-'target' distance). Ordinate: average formant frequency rate of change during the first 15 msec of the transition. Formant frequency rates of change are predictable from transition extents.

As evident from the equations, the determination of time constants involves a normalization that makes them independent of the extent of the transition. The time constant does not say anything about the raw formant frequency rate of change in kHz/seconds. However, the data on formant onsets and targets and time constants allow us to derive estimates of that dimension by inserting the measured values into Eqs (1) and (2) and calculating ∆Fn/∆t at transition onsets for a time window of ∆t = 15 milliseconds.
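The ∆Fn/∆t estimate just described can be sketched as follows. The locus, target and α values here are hypothetical illustration values, not measurements from the study:

```python
import math

def onset_rate(locus, target, alpha, window=0.015):
    """Average rate of formant frequency change (Hz/s) over the first
    `window` seconds of an Eq (1) transition: (Fn(window) - Fn(0)) / window."""
    f_start = locus  # Eq (1) equals the locus at t = 0
    f_end = (locus - target) * math.exp(-alpha * window) + target
    return (f_end - f_start) / window

# Falling transition with a 600 Hz extent and time constant 25:
# the 15 ms window yields a negative rate on the order of -12.5 kHz/s
rate = onset_rate(locus=1800.0, target=1200.0, alpha=25.0)
```

Since the rate is proportional to the extent (locus - target) when the time constant is fixed, stable F2 and F3 time constants directly produce the linear extent-rate relation reported for Figure 6.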


The result is presented in Figure 6, with ∆Fn/∆t plotted against the extent of the transition (locus-target distance). All the data from three formants have been included. It is clear that formant frequency rates of change form a fairly tight linear cluster of data points, indicating that rates for F2 and F3 can be predicted with good accuracy from transition extents. Some of the data points for F1 show deviations from this trend.

Those observations help us put the pattern of Figure 3 in perspective. It shows that, when interpreted in terms of formant frequency rate of change (in kHz/seconds), the observed time constant patterning does not disrupt a basically lawful relationship between locus-target distances and rates of frequency change. A major factor behind this result is the stability of F2 and F3 time constants.

Figure 6 is interesting in the context of the 'gestural' hypothesis, which has recently been given a great deal of prominence in phonetics. It suggests that information on phonetic categories may be coded in terms of formant transition dynamics (e.g., Strange 1989). From the vantage point of a gestural perspective one might expect the data of the present project to show distinct groupings of formant transition time constants in clear correspondence with phonetic categories (e.g., consonant place, vowel features). As the findings now stand, that expectation is not borne out. Formant time constants appear to provide few if any cues beyond those presented by the formant patterns sampled at transition onsets and endpoints.

Clues from biomechanics

To illustrate the meaning of the numbers in Figure 3 we make the following simplified comparison. Assume that, on the average, syllables last for about a quarter of a second. Further assume that a CV transition, or VC transition, each occupies half of that time. So formant trajectories would take about 0.125 seconds to complete. Mathematically, a decaying exponential that covers 95% of its amplitude in 0.125 seconds has a time constant of about -25. This figure falls right in the middle of the range of values observed for F2 and F3 in Figure 3.

The magnitude of that range of numbers should be linked to the biomechanics of the speech production system. Different articulators have different response times and the speech wave reflects the interaction of many articulatory components. So far we know little about the response times of individual articulators.

In normal subjects both speech and non-speech movements exhibit certain constant characteristics.
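The back-of-the-envelope figure (a decaying exponential covering 95% of its amplitude in 0.125 s) can be checked directly, since e^(-αt) = 0.05 at the 95% point. A quick numerical check, not code from the paper:

```python
import math

# e^(-alpha * t) has covered 95% of its amplitude once it has decayed to 0.05.
# Solving 0.05 = e^(-alpha * t95) for alpha at t95 = 0.125 s:
t95 = 0.125
alpha = -math.log(0.05) / t95   # = ln(20) / 0.125, roughly 24
```

The exact value is about 24, consistent with the rough "about -25" cited above (the negative sign follows the convention of reporting the fitted slope).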

Articulatory processes in dysarthria

What would the corresponding measurements look like for disordered speech? Previous acoustic phonetic work has highlighted a slower average rate of F2 change in dysarthric speakers. For instance, Weismer et al (1992) investigated groups of subjects with amyotrophic lateral sclerosis and found that they showed lower average F2 slopes than normal: the more severe the disorder, the lower the rate.

The present type of analyses could supplement such reports by determining either how time constants co-vary with changes in transition extent and duration, or by establishing that normal time constants are maintained in dysarthric speech. Whatever the answers provided by such research, we would expect them to present significant new insights into both normal and disordered speech motor processes.

In the large experimental literature on voluntary movement there is an extensively investigated phenomenon known as "velocity profiles" (Figure 7). For point-to-point movements (including hand motions (Flash & Hogan 1985) and articulatory gestures (Munhall et al 1985)) these profiles tend to be smooth and bell-shaped. Apparently velocity profiles retain their geometric shape under a number of conditions: "…the form of the velocity curve is invariant under transformations of movement amplitude, path, rate, and inertial load" (Ostry et al 1987:37).

Figure 7. Diagram illustrating the normalized 'velocity profile' associated with three point-to-point movements of different extents.
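The bell-shaped, amplitude-invariant profiles can be illustrated with the minimum-jerk model of Flash & Hogan (1985). This is our illustrative model choice, not one used in the paper, and the amplitudes are arbitrary:

```python
def minimum_jerk_velocity(tau):
    """Velocity of a unit-amplitude, unit-duration minimum-jerk movement
    at normalized time tau in [0, 1]; bell-shaped, peaking at tau = 0.5."""
    return 30 * tau**2 - 60 * tau**3 + 30 * tau**4

def normalized_profile(amplitude, n=21):
    """Velocity profile of a movement of the given amplitude, divided by
    its own peak velocity; the resulting shape is amplitude-invariant."""
    velocities = [amplitude * minimum_jerk_velocity(k / (n - 1)) for k in range(n)]
    peak = max(velocities)
    return [v / peak for v in velocities]
```

Normalized profiles for movements of different amplitudes coincide point for point, which is the invariance the Ostry et al. quotation describes.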


Figure 7 illustrates an archetypical velocity profile for three hypothetical but realistic movements. The displacement curves have the same shape but differ in amplitude. Hence, when normalized with respect to displacement, their velocity variations form a single "velocity profile" which serves as a biomechanical "signature" of a given moving limb or articulator.

What the notion of velocity profiles tells us is that speech and non-speech systems are strongly damped and therefore tend to produce movements that are s-shaped. Also significant is the fact that the characteristics of velocity profiles stay invariant despite changes in experimental conditions. Such observations indicate that biomechanical constancies are likely to play a major role in constraining the variation of formant transition time constants, both in normal and disordered speech.

However, our understanding of the biomechanical constraints on speech is still incomplete. We do not yet fully know the extent to which they remain fixed, or can be tuned and adapted to different speaking conditions, or are modified in speech disorders (cf Forrest et al 1989). It is likely that further work on comparing formant dynamics in normal and dysarthric speech will throw more light on these issues.

References

Broad D J & Fertig R (1970): "Formant-frequency trajectories in selected CVC syllable nuclei", J Acoust Soc Am 47, 1572-1582.
Duffy J R (1995): Motor speech disorders: Substrates, differential diagnosis, and management, Mosby: St. Louis, USA.
Fant G (1960): Acoustic theory of speech production, Mouton: The Hague.
Flash T & Hogan N (1985): "The coordination of arm movements: An experimentally confirmed mathematical model", J Neuroscience Vol 5(7), 1688-1703.
Forrest K, Weismer G & Turner G S (1989): "Kinematic, acoustic, and perceptual analyses of connected speech produced by Parkinsonian and normal geriatric adults", J Acoust Soc Am 85(6), 2608-2622.
Hartelius L, Nord L & Buder E H (1995): "Acoustic analysis of dysarthria associated with multiple sclerosis", Clinical Linguistics & Phonetics Vol 9(2), 95-120.
Lindblom B, Lyberg B & Holmgren K (1981): Durational patterns of : Do they reflect short-term memory processes?, Indiana University Linguistics Club, Bloomington, Indiana.
Lindblom B (1990): "Explaining phonetic variation: A sketch of the H&H theory", in Hardcastle W & Marchal A (eds): Speech Production and Speech Modeling, 403-439, Dordrecht: Kluwer.
Munhall K G, Ostry D J & Parush A (1985): "Characteristics of velocity profiles of speech movements", J Exp Psychology: Human Perception and Performance Vol 11(4), 457-474.
Ostry D J, Cooke J D & Munhall K G (1987): "Velocity curves of human arm and speech movements", Exp Brain Res 68, 37-46.
Park S-H (2007): Quantifying perceptual contrast: The dimension of place of articulation, Ph D dissertation, University of Texas at Austin.
Rosen K M, Kent R D, Delaney A L & Duffy J R (2006): "Parametric quantitative acoustic analysis of conversation produced by speakers with dysarthria and healthy speakers", JSLHR 49, 395-411.
Schalling E (2007): Speech, voice, language and cognition in individuals with spinocerebellar ataxia (SCA), Studies in Logopedics and Phoniatrics No 12, Karolinska Institutet, Stockholm, Sweden.
Stevens K N, House A S & Paul A P (1966): "Acoustical description of syllabic nuclei: an interpretation in terms of a dynamic model of articulation", J Acoust Soc Am 40(1), 123-132.
Stevens K N (1989): "On the quantal nature of speech", J Phonetics 17, 3-46.
Strange W (1989): "Dynamic specification of coarticulated vowels spoken in sentence context", J Acoust Soc Am 85(5), 2135-2153.
Talley J (1992): "Quantitative characterization of vowel formant transitions", J Acoust Soc Am 92(4), 2413-2413.
Weismer G, Martin R, Kent R D & Kent J F (1992): "Formant trajectory characteristics of males with amyotrophic lateral sclerosis", J Acoust Soc Am 91(2), 1085-1098.
modified in speech disorders (cf Forrest et al Schalling E (2007): Speech, voice, language 1989). It is likely that further work on compar- and cognition in individuals with spinoce- ing formant dynamics in normal and dysarthric rebellar ataxia (SCA), Studies in Logoped- speech will throw more light on these issues. ics and Phoniatrics No 12, Karolinska Insti- tutet, Stockholm, Sweden Stevens K N, House A S & Paul A P (1966): References “Acoustical description of syllabic nuclei: Broad D J & Fertig R (1970): "Formant- an interpretation in terms of a dynamic frequency trajectories in selected CVC syl- model of articulation”, J Acoust Soc Am lable nuclei", J Acoust Soc Am 47, 1572- 40(1), 123-132. 1582. Stevens K N (1989): “On the quantal nature of Duffy J R (1995): Motor speech disorders: speech,” J Phonetics 17:3-46. Substrates, differential diagnosis, and man- Strange W (1989): “Dynamic specification of agement, Mosby: St. Louis, USA. coarticulated vowels spoken in sentence Fant G (1960): Acoustic theory of speech pro- context”, J Acoust Soc Am 85(5):2135-2153. duction, Mouton:The Hague. Talley J (1992): "Quantitative characterization Forrest K, Weismer G & Turner G S (1989): of vowel formant transitions", J Acoust Soc "Kinematic, acoustic, and perceptual ana- Am 92(4), 2413-2413. lyses of connected speech produced by Par- Weismer G, Martin R, Kent R D & Kent J F kinsonian and normal geriatric adults", J (1992): “Formant trajectory characteristics Acoust Soc Am 85(6), 2608-2622. of males with amyotrophic lateral sclero- Hartelius L, Nord L & Buder E H (1995): sis”, J Acoust Sec Am 91(2):1085-1098. “Acoustic analysis of dysarthria associated with multiple sclerosis”, Clinical Linguis- tics & Phonetics, Vol 9(2):95-120 Flash T & Hogan N (1985): “The coordination of arm movements: An experimentally con- firmed mathematical model’”, J Neuros- cience Vol 5(7). 1688-1703. Lindblom B, Lyberg B & Holmgren K (1981): Durational patterns of : Do they reflect short-term memory
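The amplitude normalization behind the notion of a velocity profile can be made concrete with a small numerical sketch. It uses the minimum-jerk trajectory of Flash & Hogan (1985), cited above, as a stand-in for the s-shaped displacement curves; the movement time and amplitudes are illustrative assumptions, not values from the text.

```python
# Sketch: amplitude-normalized velocity profiles of s-shaped movements
# coincide. Minimum-jerk form is a stand-in for the recorded displacement
# curves; amplitudes and duration are illustrative assumptions.

def min_jerk_velocity(amplitude, T, n=101):
    """Velocity samples of a minimum-jerk movement with given amplitude and duration T."""
    vs = []
    for i in range(n):
        tau = i / (n - 1)  # normalized time, 0..1
        # time derivative of x(t) = amplitude * (10 tau^3 - 15 tau^4 + 6 tau^5)
        vs.append(amplitude / T * (30 * tau**2 - 60 * tau**3 + 30 * tau**4))
    return vs

T = 0.2  # movement time in seconds (illustrative)
velocities = {A: min_jerk_velocity(A, T) for A in (1.0, 2.0, 3.0)}

# Normalizing each velocity curve by its displacement amplitude collapses
# the three curves onto a single "velocity profile":
normalized = {A: [v / A for v in vs] for A, vs in velocities.items()}
ref = normalized[1.0]
for A in (2.0, 3.0):
    assert all(abs(v - r) < 1e-9 for v, r in zip(normalized[A], ref))
```

The assertions confirm the point made in the text: the three movements share one normalized velocity profile, differing only in scale.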


Effects of vocal loading on the phonation and collision threshold pressures
Laura Enflo¹, Johan Sundberg¹ and Friedemann Pabst²
¹Department of Speech, Music & Hearing, Royal Institute of Technology (KTH), Stockholm, Sweden
²Hospital Dresden Friedrichstadt, Dresden, Germany

Abstract
Phonation threshold pressures (PTP) have been commonly used for obtaining a quantitative measure of vocal fold motility. However, as these pressures are quite low, it is typically difficult to obtain reliable data. As the amplitude of an electroglottograph (EGG) signal decreases substantially at the loss of vocal fold contact, it is mostly easy to determine the collision threshold pressure (CTP) from an EGG signal. In an earlier investigation (Enflo & Sundberg, forthcoming) we measured CTP and compared it with PTP in singer subjects. Results showed that in these subjects CTP was on average about 4 cm H2O higher than PTP. PTP has been found to increase during vocal fatigue. In the present study we compare PTP and CTP before and after vocal loading in singer and non-singer voices, applying a loading procedure previously used by co-author FP. Seven subjects repeated the vowel sequence /a,e,i,o,u/ at an SPL of at least 80 dB @ 0.3 m for 20 min. Before and after the loading, the subjects' voices were recorded while they produced a diminuendo repeating the syllable /pa/. Oral pressure during the /p/ occlusion was used as a measure of subglottal pressure. Both CTP and PTP increased significantly after the vocal loading.

Introduction
Subglottal pressure, henceforth Psub, is one of the basic parameters for control of phonation. It typically varies with the fundamental frequency of phonation, F0 (Ladefoged & McKinney, 1963; Cleveland & Sundberg, 1985). Titze (1992) derived an equation describing how the minimal Psub required for producing vocal fold oscillation, the phonation threshold pressure (PTP), varies with F0. He approximated this variation as:

PTP = a + b·(F0/MF0)²   (1)

where PTP is measured in cm H2O and MF0 is the mean F0 for conversational speech (190 Hz for females and 120 Hz for males). The constant a = 0.14 and the factor b = 0.06.

Titze's equation has been used in several studies. These studies have confirmed that vocal fold stiffness is a factor of relevance to PTP. Hence, it is not surprising that PTP tends to rise during vocal fatigue (Solomon & DiMattia, 2000; Milbrath & Solomon, 2003; Chang & Karnell, 2004). A lowered PTP should reflect greater vocal fold motility, which is a clinically relevant property; high motility must be associated with a need for less phonatory effort for a given degree of vocal loudness.

Determining PTP is often complicated. One reason is the difficulty of accurately measuring low values. Another complication is that several individuals find it difficult to produce their very softest possible sound. As a consequence, the analysis is mostly time-consuming and the data are often quite scattered (Verdolini-Marston et al., 1990).

At very low subglottal pressures, i.e. in very soft phonation, the vocal folds vibrate, but with an amplitude so small that the folds never collide. If subglottal pressure is increased, however, vocal fold collision normally occurs. Like PTP, the minimal pressure required to initiate vocal fold collision, henceforth the collision threshold pressure (CTP), can be assumed to reflect vocal fold motility.

CTP should be easy to identify by means of an electroglottograph (EGG). During vocal fold contact, the electrical current can pass across the glottis, resulting in a high EGG amplitude. Conversely, the amplitude is low when the vocal folds fail to make contact. In a previous study we measured PTP and CTP in a group of singers before and after vocal warm-up. The results showed that both PTP and CTP tended to drop after the warm-up, particularly for the male voices (Enflo & Sundberg, forthcoming). The purpose of the present study was to explore the potential of the CTP measure in female and male subjects before and after vocal loading.
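Equation (1) is straightforward to evaluate. The sketch below uses the constants quoted in the Introduction (a = 0.14, b = 0.06, MF0 = 190 Hz for females and 120 Hz for males); it illustrates the formula itself, not any measured data.

```python
# Titze's (1992) approximation of the phonation threshold pressure, Eq. (1):
# PTP = a + b * (F0 / MF0)^2, with PTP in cm H2O.

A = 0.14                                 # cm H2O
B = 0.06                                 # cm H2O
MF0 = {"female": 190.0, "male": 120.0}   # mean conversational F0 (Hz)

def ptp(f0_hz, sex):
    """Predicted phonation threshold pressure (cm H2O) at fundamental frequency f0_hz."""
    return A + B * (f0_hz / MF0[sex]) ** 2

# At the mean conversational F0 the prediction is a + b = 0.20 cm H2O;
# one octave above it the prediction is a + 4b = 0.38 cm H2O.
assert round(ptp(120, "male"), 2) == 0.20
assert round(ptp(240, "male"), 2) == 0.38
assert round(ptp(190, "female"), 2) == 0.20
```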


Experiment
Seven subjects, two female (F) and five male (M), were recruited. One female and one male were amateur singers, one of the males had some vocal training, while the remaining subjects all lacked vocal training. Their task was to repeat the syllable [pa:] with gradually decreasing vocal loudness, continuing until voicing had ceased and avoiding emphasis of the consonant /p/. The oral pressure during the occlusion for the consonant /p/ was accepted as an approximation of Psub. The subjects repeated this task three to six times on all pitches of an F major triad that fitted into their pitch range. The subjects were recorded in sitting position in a sound-treated booth.

Two recording sessions were made, one before and one after vocal loading. This loading consisted of phonating the vowel sequence /a,e,i,o,u/ at an SPL of at least 80 dB @ 0.3 m during 20 min. All subjects except the two singers reported clear symptoms of vocal fatigue after the vocal loading.

Audio, oral pressure and EGG signals were recorded, see Figure 1. The audio was picked up at 30 cm distance by a condenser microphone (B&K 4003) with a power supply (B&K 2812), set to 0 dB and amplified by a mixer, a DSP Audio Interface Box (Nyvalla DSP). Oral pressure was recorded by means of a pressure transducer (Gaeltec Ltd, 7b) which the subject held in the corner of the mouth. The EGG was recorded with a two-channel electroglottograph (Glottal Enterprises EG 2), using the vocal fold contact area output and a low frequency limit of 40 Hz. This signal was monitored on an oscilloscope. Contact gel was applied to improve the skin contact. Each of these three signals was recorded on a separate track of a computer by means of the Soundswell Signal Workstation software (Core 4.0, Hitech Development AB, Sweden).

Figure 1: Experimental setup used in the recordings.

The audio signal was calibrated by recording a synthesized vowel sound, the sound pressure level (SPL) of which was determined by means of a sound level recorder (OnoSokki) held next to the recording microphone. The pressure signal was calibrated by recording it while the transducer was (1) held in free air and (2) immersed at a carefully measured depth in a glass cylinder filled with water.

Analysis
The analysis was performed using the Soundswell Signal Workstation. As the oral pressure transducer picked up some of the oral sound, this signal was LP filtered at 50 Hz. After a 90 Hz HP filtering, the EGG signal was full-wave rectified, thus facilitating amplitude comparisons. Figure 2 shows an example of the signals obtained.

Figure 2: Example of the recordings analyzed, showing the audio, the HP filtered and rectified EGG, and the oral pressure signals (top, middle and bottom curves). The loss of vocal fold contact, reflected as a sudden drop in the EGG signal amplitude, is marked by the frame in the EGG and pressure signals.
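The signal conditioning described in the Analysis section can be sketched with standard DSP tools. The text specifies only the cutoff frequencies (50 Hz low-pass for the oral pressure channel; 90 Hz high-pass followed by full-wave rectification for the EGG channel); the sample rate, filter type and filter order below are our own assumptions, chosen for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 16000  # sample rate in Hz (assumed; not stated in the text)

def condition_signals(pressure, egg, fs=FS):
    """LP filter the oral pressure at 50 Hz; HP filter the EGG at 90 Hz and rectify."""
    b_lp, a_lp = butter(4, 50, btype="low", fs=fs)    # 4th order is an assumption
    b_hp, a_hp = butter(4, 90, btype="high", fs=fs)
    pressure_lp = filtfilt(b_lp, a_lp, pressure)       # zero-phase filtering
    egg_rect = np.abs(filtfilt(b_hp, a_hp, egg))       # full-wave rectification
    return pressure_lp, egg_rect

# Toy input: a slow pressure ramp with superimposed hum, and a 150 Hz "EGG" tone.
t = np.arange(0, 1.0, 1 / FS)
pressure = 5 * t + 0.5 * np.sin(2 * np.pi * 100 * t)
egg = np.sin(2 * np.pi * 150 * t)
p_lp, egg_rect = condition_signals(pressure, egg)
assert np.all(egg_rect >= 0)        # rectified signal is non-negative
assert p_lp.shape == pressure.shape
```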


As absence of vocal fold contact produces a great reduction of the EGG signal amplitude, such amplitude reductions were easy to identify in the recording. The subglottal pressures appearing immediately before and after a sudden amplitude drop were assumed to lie just above and just below the CTP, respectively, so the average of these two pressures was accepted as the CTP. For each subject, CTP was determined in at least three sequences for each pitch and the average of these estimates was calculated. The same method was applied for determining the PTP.

Results
Both thresholds tended to increase with F0, as expected, and both were mostly higher after the loading. Figure 3 shows PTP and CTP before and after vocal loading for one of the untrained male subjects. The variation with F0 was less evident and less systematic for some subjects. Table 1 lists the mean and SD across F0 of the after-to-before ratio for the subjects. The F0 range produced by the subjects was slightly narrower than one and a half octave for the male subjects and two octaves for the trained female, but only 8 semitones for the untrained female. The after-to-before ratio for CTP varied between 1.32 and 1.06 for the male subjects. The corresponding variation for PTP was 1.74 and 0.98. The means across subjects were similar for CTP and PTP. Vocal loading caused a statistically significant increase of both CTP and PTP (paired samples t-test, p<0.001). Interestingly, the two trained subjects, who reported minor effects of the loading, showed small ratios for both CTP and PTP.

Figure 3: PTP (circles) and CTP (triangles) values in cm H2O for one of the untrained male subjects. The graph shows threshold pressures before (dashed lines) and after (solid lines) vocal loading. Semitones relative to C2 are used on the x-axis.

Table 1. F0 range in semitones (Range), and mean after-to-before ratio and SD across F0 for the CTP and PTP for the male and female subjects. Letters U and T refer to untrained and trained voices, respectively.

Subject   Range (st)   CTP Mean   CTP SD   PTP Mean   PTP SD
Males
MAG U     17           1.32       0.10     1.13       0.32
MES U     17           1.06       0.04     1.02       0.11
MDE U     12           1.20       0.17     1.74       0.09
MSG U     16           1.24       0.15     1.13       0.05
MJS T     15           1.07       0.13     0.98       0.03
Mean                   1.18       0.12     1.20       0.12
Females
FAH U     8            1.49       0.14     1.62       0.07
FLE T     24           1.08       0.13     1.06       0.15
Mean                   1.29       0.13     1.34       0.11
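The threshold criterion described above — taking the mean of the subglottal pressures immediately before and after the sudden EGG amplitude drop — can be written out in a few lines. The drop-detection rule used here (amplitude falling below a fixed fraction of the previous cycle's amplitude) is an illustrative assumption; in the study the drop was identified from the recordings.

```python
def estimate_ctp(egg_amplitudes, pressures, drop_ratio=0.3):
    """Estimate the collision threshold pressure (CTP).

    egg_amplitudes: per-cycle EGG amplitudes during a diminuendo.
    pressures: subglottal pressures (cm H2O) for the same cycles.
    The CTP is taken as the mean of the pressures immediately before and
    after the first sudden amplitude drop (here: amplitude falling below
    drop_ratio times the previous cycle's amplitude).
    """
    for i in range(1, len(egg_amplitudes)):
        if egg_amplitudes[i] < drop_ratio * egg_amplitudes[i - 1]:
            return 0.5 * (pressures[i - 1] + pressures[i])
    return None  # vocal folds never lost contact in this sequence

# Toy diminuendo: EGG amplitude collapses between the 4th and 5th cycle.
amps = [1.0, 0.9, 0.8, 0.7, 0.1, 0.05]
psub = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
print(estimate_ctp(amps, psub))  # 5.5
```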

Discussion
To our knowledge this is the first attempt to analyze the CTP in untrained voices. Hence, it is relevant to compare this threshold with the as yet commonly used PTP.

First, CTP appears to be a more reliable measure than PTP. In our previous investigation of the thresholds in singer subjects, repeated measurements showed that the ratio between the SD and the average tended to be smaller for the CTP than for the PTP (Enflo & Sundberg, forthcoming). Thus, in this respect, the CTP is a more reliable measure.

Second, the CTP seems easier to measure than the PTP. Most of our subjects found it difficult to continue reducing vocal loudness until phonation had ceased. For determining CTP, it is enough that the vocal loudness is reduced to extremely soft phonation. This may be particularly advantageous when dealing with untrained and pathological voices.

A relevant aspect is to what extent the CTP provides the same information as PTP. Both should reflect the motility of the vocal folds, i.e., an important mechanical characteristic, as mentioned before. In our previous study with singer subjects we found that manipulating a and b in Titze's PTP equation (Eq. 1) yielded rather good approximations of the average CTP before and after warm-up. However, the untrained subjects in the present experiment showed an irregular variation with F0, so approximating their CTP curves with modified versions of Titze's equation seemed pointless.

A limitation of the CTP is that, obviously, it cannot be measured when the vocal folds fail to collide. This often happens in some dysphonic voices in the upper part of the female voice range, and in male falsetto phonation.

The main finding of the present investigation was that CTP increased significantly after vocal loading. For the two trained subjects, the effect was minimal, and these subjects did not experience any vocal fatigue after the vocal loading. On average, the increase was similar for CTP and PTP. This supports the assumption that CTP reflects similar vocal fold characteristics as the PTP.

Our results suggest that the CTP may be used as a valuable alternative or complement to the PTP, particularly in cases where it is difficult to determine the PTP accurately.

Conclusions
The CTP seems a promising alternative or complement to the PTP. The task of phonating at the phonation threshold pressure seems more difficult for subjects than the task of phonating at the collision threshold. The information represented by the CTP would correspond to that represented by the PTP. In the future, it would be worthwhile to test CTP in other applications, e.g., in a clinical setting with patients before and after therapy.

Acknowledgements
The kind cooperation of the subjects is gratefully acknowledged. This is an abbreviated version of a paper which has been submitted to the Interspeech conference in Brighton, September 2009.

References
Chang A. and Karnell M.P. (2004) Perceived Phonatory Effort and Phonation Threshold Pressure Across a Prolonged Voice Loading Task: A Study of Vocal Fatigue. J Voice 18, 454-66.
Cleveland T. and Sundberg J. (1985) Acoustic analyses of three male voices of different quality. In A Askenfelt, S Felicetti, E Jansson, J Sundberg, editors. SMAC 83. Proceedings of the Stockholm Internat Music Acoustics Conf, Vol. 1. Stockholm: Roy Sw Acad Music, Publ. No. 46:1, 143-56.
Enflo L. and Sundberg J. (forthcoming) Vocal Fold Collision Threshold Pressure: An Alternative to Phonation Threshold Pressure?
Ladefoged P. and McKinney N.P. (1963) Loudness, sound pressure, and subglottal pressure in speech. J Acoust Soc Am 35, 454-60.
Milbrath R.L. and Solomon N.P. (2003) Do Vocal Warm-Up Exercises Alleviate Vocal Fatigue? J Speech Hear Res 46, 422-36.
Solomon N.P. and DiMattia M.S. (2000) Effects of a Vocally Fatiguing Task and Systematic Hydration on Phonation Threshold Pressure. J Voice 14, 341-62.
Titze I. (1992) Phonation threshold pressure: A missing link in glottal aerodynamics. J Acoust Soc Am 91, 2926-35.
Verdolini-Marston K., Titze I. and Druker D.G. (1990) Changes in phonation threshold pressure with induced conditions of hydration. J Voice 4, 142-51.


Experiments with Synthesis of Swedish Dialects

Beskow, J. and Gustafson, J.
Department of Speech, Music & Hearing, School of Computer Science & Communication, KTH

Abstract
We describe ongoing work on synthesizing Swedish dialects with an HMM synthesizer. A prototype synthesizer has been trained on a large database of read speech by a professional male voice talent. We have selected a few untrained speakers from each of the following dialectal regions: Norrland, Dala, Göta, Gotland and South of Sweden. The plan is to train a multi-dialect average voice, and then use 20-30 minutes of dialectal speech from one speaker to adapt either the standard Swedish voice or the average voice to the dialect of that speaker.

Introduction
In the last decade, most speech synthesizers have been based on prerecorded pieces of speech, resulting in improved quality, but with a lack of control in modifying prosodic patterns (Taylor, 2009). The research focus has been directed towards how to optimally search for and combine speech units of different lengths.

In recent years HMM based synthesis has gained interest (Tokuda et al., 2000). In this solution the generation of the speech is based on a parametric representation, while the grapheme-to-phoneme conversion still relies on a large pronunciation dictionary. HMM synthesis has been successfully applied to a large number of languages, including Swedish (Lundgren, 2005).

Dialect Synthesis
In the SIMULEKT project (Bruce et al., 2007) one goal is to use speech synthesis to gain insight into prosodic variation in major regional varieties of Swedish. The aim of the present study is to attempt to model these Swedish varieties using HMM synthesis.

HMM synthesis is an entirely data-driven approach to speech synthesis and as such it gains all its knowledge about segmental, intonational and durational variation in speech from training on an annotated speech corpus. Given that the appropriate features are annotated and made available to the training process, it is possible to synthesize speech with high quality, at both segmental and prosodic levels. Another important feature of HMM synthesis, which makes it an interesting choice for studying dialectal variation, is that it is possible to adapt a voice trained on a large data set (2-10 hours of speech) to a new speaker with only 15-30 minutes of transcribed speech (Watts et al., 2008). In this study we will use 20-30 minutes of dialectal speech for experiments on speaker adaptation of the initially trained HMM synthesis voice.

Data description
The data we use in this study are from the Norwegian Språkbanken. The large speech synthesis database from a professional speaker of standard Swedish was recorded as part of the NST (Nordisk Språkteknologi) synthesis development. It was recorded in stereo, with the voice signal in one channel and the signal from a laryngograph in the second channel.

The corpus contains about 5000 read sentences, which add up to about 11 hours of speech. The recording manuscript was based on NST's corpus, and the selection was done to make the sentences phonetically balanced and to ensure diphone coverage. The manuscripts are not prosodically balanced, but there are different types of sentences that ensure prosodic variation, e.g. statements, wh-questions, yes/no questions and enumerations.

The 11 hour speech database has been aligned on the phonetic and word levels using our Nalign software (Sjölander & Heldner, 2004) with the NST dictionary as pronunciation dictionary. This dictionary has more than 900,000 items that are phonetically transcribed with syllable boundaries marked. The text has been part-of-speech tagged using a TNT tagger trained on the SUC corpus (Megyesi, 2002).

From the NST database for training of speech recognition we selected a small number of unprofessional speakers from the following dialectal areas: Norrland, Dala, Göta, Gotland and South of Sweden. The data samples are considerably smaller than the speech synthesis database: they range from 22 to 60 minutes, compared to the 11 hours from the professional speaker.

HMM Contextual Features
The typical HMM synthesis model (Tokuda et al., 2000) can be decomposed into a number of distinct layers:
• At the acoustic level, a parametric source-filter model (MLSA vocoder) is responsible for signal generation.
• Context dependent HMMs, containing probability distributions for the parameters and their 1st and 2nd order derivatives, are used for generation of control parameter trajectories.
• In order to select context dependent HMMs, a decision tree is used that takes input from a large feature set to cluster the HMM models.
In this work, we are using the standard model for acoustic and HMM level processing, and focus on adapting the feature set for the decision tree to the task of modeling dialectal variation.

The feature set typically used in HMM synthesis includes features on segment, syllable, word, phrase and utterance level. Segment level features include immediate context and position in syllable; syllable features include stress and position in word and phrase; word features include part-of-speech tag (content or function word), number of syllables, position in phrase etc.; phrase features include phrase length in terms of syllables and words; the utterance level includes length in syllables, words and phrases.

For our present experiments, we have also added a speaker level to the feature set, since we train a voice on multiple speakers. The only feature in this category at present is dialect group, which is one of Norrland, Dala, Svea, Göta, Gotland and South of Sweden. In addition to this, we have chosen to add to the word level a morphological feature stating whether or not the word is a compound, since the compound stress pattern often is a significant dialectal feature in Swedish (Bruce et al., 2007). At the syllable level we have added explicit information about lexical accent type (accent I, accent II or compound accent).

Training of HMM voices with these feature sets is currently in progress and results will be presented at the conference.

Acknowledgements
The work within the SIMULEKT project is funded by the Swedish Research Council 2007-2009. The data used in this study comes from Norsk Språkbank (http://sprakbanken.uib.no).

References
Bruce, G., Schötz, S., & Granström, B. (2007). SIMULEKT – modelling Swedish regional intonation. Proceedings of Fonetik, TMH-QPSR, 50(1), 121-124.
Lundgren, A. (2005). HMM-baserad talsyntes. Master's thesis, KTH, TMH, CTT.
Megyesi, B. (2002). Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Doctoral dissertation, KTH, Department of Speech, Music and Hearing, Stockholm.
Sjölander, K., & Heldner, M. (2004). Word level precision of the NALIGN automatic segmentation algorithm. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 116-119). Stockholm University.
Taylor, P. (2009). Text-To-Speech Synthesis. Cambridge University Press.
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of ICASSP 2000 (pp. 1315-1318).
Watts, O., Yamagishi, J., Berkling, K., & King, S. (2008). HMM-Based Synthesis of Child Speech. Proceedings of The 1st Workshop on Child, Computer and Interaction.
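As a concrete picture of the dialect-extended feature set described under HMM Contextual Features, one can think of each phone as carrying a context record that the decision tree clusters on. The field names below are our own illustrative naming, not the actual label format of the synthesizer; the accent, compound and dialect fields correspond to the additions described in the text.

```python
# Sketch of a per-phone context feature record for decision-tree clustering
# in HMM synthesis. Field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ContextFeatures:
    # segment level
    phone: str
    left_phone: str
    right_phone: str
    position_in_syllable: int
    # syllable level
    stressed: bool
    accent_type: str          # added: "accent I", "accent II" or "compound accent"
    # word level
    pos_tag: str              # content or function word
    is_compound: bool         # added: compound stress marks dialect in Swedish
    n_syllables_in_word: int
    # phrase / utterance level
    phrase_len_syllables: int
    utterance_len_words: int
    # speaker level (added, since one voice is trained on multiple speakers)
    dialect_group: str        # Norrland, Dala, Svea, Göta, Gotland or South of Sweden

example = ContextFeatures(
    phone="a", left_phone="t", right_phone="l", position_in_syllable=1,
    stressed=True, accent_type="accent II",
    pos_tag="content", is_compound=True, n_syllables_in_word=3,
    phrase_len_syllables=8, utterance_len_words=5,
    dialect_group="Gotland",
)
print(example.dialect_group)  # Gotland
```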


Real vs. rule-generated tongue movements as an audio-visual speech perception support
Olov Engwall and Preben Wik
Centre for Speech Technology, CSC, KTH
[email protected], [email protected]

Abstract
We have conducted two studies in which animations created from real tongue movements and from rule-based synthesis are compared. We first studied if the two types of animations were different in terms of how much support they give in a perception task. Subjects achieved a significantly higher word recognition rate in sentences when animations were shown compared to the audio-only condition, and a significantly higher score with real movements than with synthesized ones. We then performed a classification test, in which subjects should indicate if the animations were created from measurements or from rules. The results show that the subjects as a group are unable to tell if the tongue movements are real or not. The stronger support from real movements hence appears to be due to subconscious factors.

Introduction
Speech reading, i.e. the use of visual cues in the speaker's face, in particular regarding the shape of the lips (and hence the often used alternative term lip reading), can be a very important source of information if the acoustic signal is insufficient, due to noise (Sumby & Pollack, 1954; Benoît & LeGoff, 1998) or a hearing impairment (e.g., Agelfors et al., 1998; Siciliano, 2003). This is true even if the face is computer animated. Speech reading is much more than lip reading, since information is also given by e.g., the position of the jaw, the cheeks and the eye-brows. For some phonemes, the tip of the tongue is visible through the mouth opening and this may also give some support. However, for most phonemes, the relevant parts of the tongue are hidden, and "tongue reading" is therefore impossible in human-human communication. On the other hand, with a computer-animated talking face it is possible to make tongue movements visible, by removing parts of the model that hide the tongue in a normal view, thus creating an augmented reality (AR) display, as exemplified in Fig. 1.

Figure 1. Augmented reality view of the face.

Since the AR view of the tongue is unfamiliar, it is far from certain that listeners are able to make use of the additional information in a similar manner as for animations of the lips. Badin et al. (2008) indeed concluded that tongue reading abilities are weak and that subjects get more support from a normal view of the face, where the skin of the cheek is shown instead of the tongue, even though less information is given. Wik & Engwall (2008) similarly found that subjects in general found little additional support when an AR side-view like the one in Fig. 1 was added to a normal front view.

There is nevertheless evidence that tongue reading is possible and can be learned explicitly or implicitly. When the signal-to-noise ratio was very low or the audio muted in the study by Badin et al. (2008), subjects did start to make use of information given by the tongue movements – if they had previously learned how to do it. The subjects were presented VCV words in noise, with either decreasing or increasing signal-to-noise ratio (SNR). The group with decreasing SNR was better in low-SNR conditions when tongue movements were displayed, since they had been implicitly trained on the audiovisual relationship for stimuli with higher SNR. The subjects in Wik & Engwall (2008) started the word recognition test in sentences with acoustically degraded audio with a familiarization phase, where they could listen to, and look at, training stimuli with both normal and degraded audio. Even though the total results were no better with the AR view than with a normal face, the score for some sentences was higher when the tongue was visible. Grauwinkel et al. (2007) also showed that subjects who had received explicit training, in the form of a video that explained the intra-oral articulator movements for different consonants, performed better in the VCV recognition task in noise than the group who had not received the training and the one who saw a normal face.

An additional factor that may add to the unfamiliarity of the tongue movements is that they were generated with a rule-based visual speech synthesizer in Wik & Engwall (2008) and Grauwinkel et al. (2007). Badin et al. (2008) on the other hand created the animations based on real movements, measured with Electromagnetic Articulography (EMA). In this study, we investigate if the use of real movements instead of rule-generated ones has any effect on speech perception results.

It could be the case that rule-generated movements give a better support for speech perception, since they are more exaggerated and display less variability. It could however also be the case that real movements give a better support, because they may be closer to the listeners' conscious or subconscious notion of what the tongue looks like for different phonemes. Such an effect could e.g., be explained by the direct realist theory of speech perception (Fowler, 2008), which states that articulatory gestures are the units of speech perception, which means that perception may benefit from seeing the gestures. The theory is different from, but closely related to, and often confused with, the speech motor theory (Liberman et al., 1967; Liberman & Mattingly, 1985), which stipulates that speech is perceived in terms of gestures that translate to phonemes by a decoder linked to the listener's own speech production. It has often been criticized (e.g., Traunmüller, 2007) because of its inability to fully explain acoustic speech perception. For visual speech perception, there is on the other hand evidence (Skipper et al., 2007) that motor planning is indeed activated when seeing visual speech gestures. Speech motor areas in the listener's brain are activated when seeing visemes, and the activity corresponds to the areas activated in the speaker when producing the same phonemes. We here investigate audiovisual processing of the more unfamiliar visual gestures of the tongue, using a speech perception test and a classification test. The perception test analyzes the support given by audiovisual displays of the tongue, when they are generated based on real measurements (AVR) or synthesized by rules (AVS). The classification test investigates if the subjects are aware of the differences between the two types of animations and if there is any relation between scores in the perception test and the classification test.

Experiments
Both the perception test (PT) and the classification test (CT) were carried out on a computer with a graphical user interface consisting of one frame showing the animations of the speech gestures and one response frame in which the subjects gave their answers. The acoustic signal was presented over headphones.

The Augmented Reality display
Both tests used the augmented reality side-view of a talking head shown in Fig. 1. Movements of the three-dimensional tongue and jaw have been made visible by making the skin at the cheek transparent and representing the palate by the midsagittal outline and the upper incisor. Speech movements are created in the talking head model using articulatory parameters, such as jaw opening, shift and thrust; lip rounding; upper lip raise and retraction; lower lip depression and retraction; and tongue dorsum raise, body raise, tip raise, tip advance and width. The tongue model is based on a component analysis of data from Magnetic Resonance Imaging (MRI) of a Swedish subject producing static vowels and consonants (Engwall, 2003).

Creating tongue movements
The animations based on real tongue movements (AVR) were created directly from simultaneous and spatially aligned measurements of the face and the tongue for a female speaker of Swedish (Beskow et al., 2003). The Movetrack EMA system (Branderud, 1985) was employed to measure the intraoral movements, using three coils placed on the tongue, one on the jaw and one on the upper incisor. The movements of the face were measured with the Qualisys motion capture system, using 28 reflectors attached to the lower part of the speaker's face. The animations were created by adjusting the parameter values of the talking head to optimally fit the Qualisys-Movetrack data (Beskow et al., 2003).
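The fitting step — adjusting the talking head's parameter values to optimally fit the Qualisys-Movetrack data — can be illustrated with a toy least-squares version. The linear parameter-to-position mapping assumed below is a simplification for illustration only; the actual model and fitting procedure (Beskow et al., 2003) are more elaborate.

```python
import numpy as np

# Sketch: per-frame least-squares fit of articulatory parameter values to
# measured point positions, assuming (purely for illustration) a known linear
# mapping M from the head model's parameters to coil/marker coordinates.

rng = np.random.default_rng(0)
n_points, n_params = 31, 10                  # e.g. 28 face markers + 3 tongue coils; 10 parameters
M = rng.normal(size=(n_points, n_params))    # assumed linear model (illustrative)

true_params = rng.normal(size=n_params)      # "ground truth" articulation for one frame
measured = M @ true_params                   # synthetic measurement data for that frame

fitted, *_ = np.linalg.lstsq(M, measured, rcond=None)
assert np.allclose(fitted, true_params)      # exact recovery in the noise-free case
```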


The animations with synthetic tongue movements (AVS) were created using a rule-based visual speech synthesizer developed for the face (Beskow, 1995). For each viseme, target values may be given for each parameter (i.e., articulatory feature). If a certain feature is unimportant for a certain phoneme, the target is left undecided, to allow for coarticulation. Movements are then created based on the specified targets, using linear interpolation and smoothing. This signifies that a parameter that has not been given a target for a phoneme will move from and towards the targets in the adjacent phonemes. This simple coarticulation model has been shown to be adequate for facial movements, since the synthesized face gestures support speech perception (e.g., Agelfors et al., 1998; Siciliano et al., 2003). However, it is not certain that the coarticulation model is sufficient to create realistic movements for the tongue, since these are more rapid and more directly affected by coarticulation processes.

Stimuli

The stimuli consisted of short (3-6 words long) simple Swedish sentences with an everyday content, e.g., "Flickan hjälpte till i köket" (The girl helped in the kitchen). The acoustic signal had been recorded together with the Qualisys-Movetrack measurements, and was presented time-synchronized with the animations.

In the perception test, 50 sentences were presented to the subjects: 10 in the acoustic only (AO) condition (Set S1), and 20 each in the AVR and AVS conditions (Sets S2 and S3). All sentences were acoustically degraded using a noise-excited three-channel vocoder (Siciliano, 2003) that reduces the spectral details and creates a speech signal that is amplitude modulated and bandpass filtered. The signal consists of multiple contiguous channels of white noise over a specified frequency range.

In the classification test, 72 sentences were used, distributed evenly over the four conditions AVR or AVS with normal audio (AVRn, AVSn) and AVR or AVS with vocoded audio (AVRv, AVSv), i.e. 18 sentences per condition.

Subjects

The perception test was run with 30 subjects, divided into three groups I, II and III. The only difference between groups I and II was that they saw the audiovisual stimuli in opposite conditions (i.e., group I saw S2 in AVS and S3 in AVR; group II saw S2 in AVR and S3 in AVS). Group III was a control group that was presented all sets in AO.

The classification test was run with 22 subjects, 11 of whom had previously participated in the perception test. The subjects were divided into two groups I and II, again with the only difference being that they saw each sentence in the opposite condition (AVR or AVS).

All subjects were normal-hearing, native Swedes, aged 17 to 67 years in the perception test (PT) and 12 to 81 years in the classification test (CT). 18 male and 12 female subjects participated in the perception test and 11 of each sex in the classification test.

Experimental set-up

Before the perception test, the subjects were presented a short familiarization session, consisting of five VCV words and five sentences presented four times, once each with AVRv, AVRn, AVSv and AVSn. The subjects in the perception test were unaware of the fact that there were two different types of animations.

The stimuli order was semi-random (PT) or random (CT), but the same for all groups, which means that the relative AVR-AVS condition order was reversed between groups I and II. The order was semi-random (i.e., the three different conditions were evenly distributed) in the perception test to avoid that learning effects affected the results.

Each stimulus was presented three times in the perception test and once in the classification test. For the latter, the subjects could repeat the animation once. The subjects gave their answer by typing in the perceived words in the perception test, and by pressing either of two buttons ("Real" or "Synthetic") in the classification test. After the classification test, but before they were given their classification score, the subjects typed in a short explanation of how they had decided if an animation was from real movements or not.

The perception test lasted 30 minutes and the classification test 10 minutes.

Data analysis

The word accuracy rate was counted manually in the perception test, disregarding spelling and word alignment errors. To assure that the different groups were matched, and to remove differences that were due to subjects rather than conditions, the results of the control group on sets S2 and S3 were weighted, using a scale factor determined on set S1 by adjusting the average of group III so that the recognition score on this set was the same as for groups I+II.

For the classification test, two measures μ and ∆ were calculated for all subjects. The classification score μ is the average proportion of correctly classified animations c out of N presentations. The discrimination score ∆ instead measures the proportion of correctly separated animations, disregarding if the label was correct or not. The measures are in the range 0≤μ≤1 and 0.5≤∆≤1, with {μ, ∆}=0.5 signifying answers at chance level. The discrimination score was calculated since we want to investigate not only if the subjects can tell which movements are real, but also if they can see differences between the two animation types. For example, if subjects A and B had 60 and 12 correct answers, μ=(60+12)/72=50% but ∆=(36+24)/72=67%, indicating that, considered as a group, subjects A and B could see the difference between the two types of animations, but not tell which was which.

The μ and ∆ scores were also analyzed to find potential differences due to the accompanying acoustic signal, and correlations between classification and word recognition score for the subjects who participated in both tests.

Results

The main results of the perception test are summarized in Fig. 2. Both types of tongue animations resulted in significantly better word recognition rates compared to the audio only condition (at p<0.005 using a two-tailed paired t-test). Moreover, recognition was significantly better (p<0.005) with real movements than with synthetic. For 28 of the 40 sentences, the recognition score was higher in AVR, and for six sentences the difference was over 20%.

Figure 2. Percentage of words correctly recognized when presented in the different conditions Audio Only (AO), Audiovisual with Real (AVR) or Synthetic movements (AVS). The level of significance for differences is indicated by * (p<0.005), ** (p<0.00005).

Figure 3. Difference of words correctly recognized when presented with real movements compared to synthetic movements for each of the 20 subjects.

Fig. 3 however shows that there were large differences between subjects in how important the AVR-AVS difference was. Since individual subjects did not see the same sentences in the two conditions, the per-subject results may be unbalanced by sentence content. A weighted difference was therefore calculated that removes the influence of sentence content by scaling the results so that the average for the two sets of sentences was equal when calculated over all three conditions (AO, AVR, and AVS) and all three subject groups. The calculated weighted difference, displayed in Fig. 5 for the 11 subjects participating in both tests, indicates that while some subjects were relatively better with AVR animations, others were in fact better with AVS.

The main results of the classification test are shown in Fig. 4.

Figure 4. Mean classification score (μ) for all subjects, for all stimuli, and animations with real (AVRn, AVRv) or synthetic movements (AVSn, AVSv), accompanied by normal (n) or vocoded (v) audio. ∆ is the discrimination, i.e. the mean absolute deviation from chance.
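The rule-based coarticulation scheme described in the method section — per-phoneme targets, with undefined targets interpolated from the adjacent phonemes — can be sketched as follows (the parameter values are illustrative):

```python
# Sketch of the rule-based coarticulation scheme: each phoneme may
# specify a target for an articulatory parameter, or leave it
# undefined (None); undefined stretches are linearly interpolated
# between the nearest specified targets, so the parameter simply
# moves between the targets of adjacent phonemes. Values are
# illustrative.

def interpolate_track(targets):
    """targets: one value-or-None per phoneme; returns a full track."""
    known = [i for i, t in enumerate(targets) if t is not None]
    track = list(targets)
    for i, t in enumerate(targets):
        if t is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:          # before the first target: hold it
            track[i] = targets[right]
        elif right is None:       # after the last target: hold it
            track[i] = targets[left]
        else:                     # linear interpolation in between
            w = (i - left) / (right - left)
            track[i] = targets[left] * (1 - w) + targets[right] * w
    return track

# Lip rounding: specified for the vowels, undefined for the consonant.
print(interpolate_track([0.0, None, 1.0]))  # → [0.0, 0.5, 1.0]
```

A real synthesizer would additionally smooth the resulting track over time rather than use the piecewise-linear path directly.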


As Fig. 4 shows, the subjects as a group were unable to classify the two types of animations correctly, with μ=48%, at chance level. The picture is to some extent altered when considering the discrimination score, ∆=0.66 (standard deviation 0.12) for the group. Fig. 5 shows that the variation between subjects is large. Whereas about half of them were close to chance level, the other half were more proficient in the discrimination task.

Figure 5. Classification score δμ relative to chance level (δμ=μ-0.5). The x-axis crosses at chance level and the bars indicate scores above or below chance. For subjects 1–11, who participated in the perception test, the weighted difference in word recognition rate between the AVR and AVS conditions is also given.

The classification score was slightly influenced by the audio signal, since the subjects classified synthetic movements accompanied by vocoded audio 6.5% more correctly than if they were accompanied by normal audio. There was no difference for the real movements. It should be noted that it was not the case that the subjects consciously linked normal audio to the movements they believed were real, i.e., subjects with low μ (and high ∆) did not differ from subjects with high or chance-level μ.

Fig. 5 also illustrates that the relation between the classification score for individual subjects and their difference in AVR-AVS word recognition in the perception test is weak. Subject 1 was indeed more aware than the average subject of what real tongue movements look like, and subjects 8, 10 and 11 (subjects 15, 16 and 18 in Fig. 3), who had a negative weighted AVR-AVS difference in word recognition, were the least aware of the differences between the two conditions. On the other hand, for the remaining subjects there is very little correlation between recognition and classification scores. For example, subjects 4, 5 and 9 were much more proficient than subject 1 at discriminating between the two animation types, and subject 2 was no better than subject 7, even though there were large differences in perception results between them.

Discussion

The perception test results showed that animations of the intraoral articulation may be valid as a speech perception support, since the word recognition score was significantly higher with animations than without. We have in this test not investigated if it is specifically the display of tongue movements that is beneficial. The results from Wik & Engwall (2008) and Badin et al. (2008) suggest that a normal view of the face without any tongue movements visible would be as good or better as a speech perception support. The results of the current study however indicate that animations based on real movements gave significantly higher recognition scores, and we are therefore currently working on a new coarticulation model for the tongue, based on EMA data, in order to be able to create sufficiently realistic synthetic movements, with the aim of providing the same level of support as animations from real measurements.

The classification test results suggest that subjects are mostly unaware of what real tongue movements look like, with a classification score at chance level. They could to a larger extent discriminate between the two types of animations, but still at a modest level (2/3 of the animations correctly separated).

In the explanations of what they had looked at to judge the realism of the tongue movements, two of the most successful subjects stated that they had used the tongue tip contact with the palate to determine if the animation was real or not. However, subjects who had low μ but high ∆, or were close to chance level, also stated that they had used this criterion, and it was hence not a truly successful method.

An observation that did seem to be useful to discern the two types of movements (correctly or incorrectly labeled) was the range of articulation, since the synthetic movements were larger and, as one subject stated, "reached the places of articulation better". The subject with the highest classification rate and the two with the lowest all used this criterion.

A criterion that was not useful, ventured by several subjects who were close to chance, was the smoothness of the movement and the assumption that rapid jerks occurred only in the synthetic animations. This misconception is rather common, due to the rapidity and unfamiliarity of tongue movements: viewers are very often surprised by how fast and rapidly changing tongue movements are.
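The two classification-test measures can be sketched as below. The folding used for the discrimination score is one plausible reading of the description in the Data analysis section, not necessarily the paper's exact normalisation, and the counts are invented.

```python
# Sketch of the two classification-test measures for a group of
# subjects who each labelled n animations as "real" or "synthetic".
# The folded discrimination score is an assumed formalisation, not
# necessarily the paper's exact normalisation; counts are invented.

def classification_score(correct_counts, n):
    """Mean proportion of correctly labelled animations (chance = 0.5)."""
    return sum(c / n for c in correct_counts) / len(correct_counts)

def discrimination_score(correct_counts, n):
    """Proportion of correctly *separated* animations: a subject who
    labels consistently but with the labels swapped still separates
    the two types, so each score is folded about chance level."""
    return sum(max(c, n - c) / n for c in correct_counts) / len(correct_counts)

# Two invented subjects, 54 and 18 correct answers out of 72 each:
mu = classification_score([54, 18], 72)      # 0.5  -> group at chance
delta = discrimination_score([54, 18], 72)   # 0.75 -> but both consistent
```

This illustrates why a group can sit at chance on μ while still scoring above chance on ∆: consistent-but-swapped labelling cancels in μ but not in ∆.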


Conclusions

The word recognition test of sentences with degraded audio showed that animations based on real movements resulted in significantly better speech perception than rule-based ones. The classification test then showed that subjects were unable to tell if the displayed animated movements were real or synthetic, and could to a modest extent discriminate between the two.

This study is small and has several factors of uncertainty (e.g., variation between subjects in both tests, the influence of the face movements, differences in articulatory range of the real and rule-based movements) and it is hence not possible to draw any general conclusions on audiovisual speech perception with augmented reality. It nevertheless points out a very interesting path of future research: the fact that subjects were unable to tell if animations were created from real speech movements or not, but received more support from this type of animations than from realistic synthetic movements, gives an indication of a subconscious influence of visual gestures on speech perception. This study cannot prove that there is a direct mapping between audiovisual speech perception and speech motor planning, but it does hint at the possibility that audiovisual speech is perceived in the listener's brain in terms of vocal tract configurations (Fowler, 2008). Additional investigations of this type could help determine the plausibility of different speech perception theories linked to the listener's articulations.

Acknowledgements

This work is supported by the Swedish Research Council project 80449001 Computer-Animated LAnguage TEAchers (CALATEA). The estimation of parameter values from motion capture and articulography data was performed by Jonas Beskow.

References

Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K.-E. and Öhman, T. (1998). Synthetic faces as a lipreading support. Proceedings of ICSLP, 3047–3050.
Badin, P., Tarabalka, Y., Elisei, F. and Bailly, G. (2008). Can you "read tongue movements"? Proceedings of Interspeech, 2635–2638.
Benoît, C. and LeGoff, B. (1998). Audio-visual speech synthesis from French text: Eight years of models, design and evaluation at the ICP. Speech Communication 26, 117–129.
Beskow, J. (1995). Rule-based visual speech synthesis. Proceedings of Eurospeech, 299–302.
Beskow, J., Engwall, O. and Granström, B. (2003). Resynthesis of facial and intraoral motion from simultaneous measurements. Proceedings of ICPhS, 431–434.
Branderud, P. (1985). Movetrack – a movement tracking system. Proceedings of the French-Swedish Symposium on Speech, 113–122.
Engwall, O. (2003). Combining MRI, EMA & EPG in a three-dimensional tongue model. Speech Communication 41/2-3, 303–329.
Fowler, C. (2008). The FLMP STMPed. Psychonomic Bulletin & Review 15, 458–462.
Grauwinkel, K., Dewitt, B. and Fagel, S. (2007). Visual information and redundancy conveyed by internal articulator dynamics in synthetic audiovisual speech. Proceedings of Interspeech, 706–709.
Liberman, A., Cooper, F., Shankweiler, D. and Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review 74, 431–461.
Liberman, A. and Mattingly, I. (1985). The motor theory of speech perception revised. Cognition 21, 1–36.
Siciliano, C., Williams, G., Beskow, J. and Faulkner, A. (2003). Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. Proceedings of ICPhS, 131–134.
Skipper, J., van Wassenhove, V., Nusbaum, H. and Small, S. (2007). Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex 17, 2387–2399.
Sumby, W. and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 26, 212–215.
Traunmüller, H. (2007). Demodulation, mirror neurons and audiovisual perception nullify the motor theory. Proceedings of Fonetik 2007, KTH-TMH-QPSR 50: 17–20.
Wik, P. and Engwall, O. (2008). Can visualization of internal articulators support speech perception? Proceedings of Interspeech 2008, 2627–2630.


Adapting the Filibuster text-to-speech system for Norwegian bokmål

Kåre Sjölander and Christina Tånnander
The Swedish Library of Talking Books and Braille (TPB)

Abstract

The Filibuster text-to-speech system is specifically designed and developed for the production of digital talking textbooks at university level for students with print impairments. Currently, the system has one Swedish voice, 'Folke', which has been used in production at the Swedish Library of Talking Books and Braille (TPB) since 2007. In August 2008 the development of a Norwegian voice (bokmål) started, financed by the Norwegian Library of Talking Books and Braille (NLB). This paper describes the requirements of a text-to-speech system used for the production of talking textbooks, as well as the development process of the Norwegian voice, 'Brage'.

Introduction

The Swedish Library of Talking Books and Braille (TPB) is a governmental body that provides people with print impairments with Braille and talking books. Since 2007, the in-house text-to-speech (TTS) system Filibuster, with its Swedish voice 'Folke', has been used in the production of digital talking books at TPB (Sjölander et al., 2008). About 50% of the Swedish university level textbooks are currently produced with synthetic speech, which is a faster and cheaper production method compared to the production of books with recorded human speech. An additional advantage is that the student gets access to the electronic text, which is synchronized with the audio. All tools and nearly all components in the TTS system are developed at TPB.

In August 2008, the development of a Norwegian voice (bokmål) started, financed by the Norwegian Library of Talking Books and Braille (NLB). The Norwegian voice 'Brage' will primarily be used for the production of university level textbooks, but also for news text and the universities' own production of shorter study materials. The books will be produced as DAISY books, the international standard for digital talking books, via the open-source DAISY Pipeline production system (DAISY Pipeline, 2009).

The Filibuster system is a unit selection TTS, where the utterances are automatically generated through selection and concatenation of segments from a large corpus of recorded sentences (Black and Taylor, 1997).

An important feature of the Filibuster TTS system is that the production team has total control of the system components. An unlimited number of new pronunciations can be added, as well as modifications and extensions of the text processing system and rebuilding of the speech database. To achieve this, the system must be open and transparent and free from black boxes. The language specific components, such as the pronunciation dictionary, the speech database and the text processing system, are NLB's property, while the language independent components are licensed as open source.

Requirements for a narrative textbook text-to-speech system

The development of a TTS system for the production of university level textbooks calls for considerations that are not always required for a conventional TTS system.

The text corpus should preferably consist of text from the same area as the intended production purpose. Consequently, the corpus should contain a lot of non-fiction literature to cover various topics such as religion, medicine, biology, and law. From this corpus, high frequency terms and names are collected and added to the pronunciation dictionary.

The text corpus doubles as a base for construction of recording manuscripts, which in addition to general text should contain representative non-fiction text passages such as bibliographic and biblical references, formulas and URLs. A larger recording manuscript than what is conventionally used is required in order to cover phone sequences in foreign names, terms, passages in English and so on. In addition, the above-mentioned type of textbook specific passages necessitates complex and well-developed text processing.
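The kind of coverage-driven manuscript construction discussed above can be sketched as a greedy selection over candidate sentences. Here, crude letter "diphones" stand in for real phone units (the actual system searches diphones, triphones and syllables derived from transcriptions), and the sentences are illustrative.

```python
# Greedy coverage-driven manuscript construction: repeatedly pick the
# candidate sentence that adds the most not-yet-covered units. Letter
# "diphones" stand in for real phone units; sentences are illustrative.

def diphones(sentence):
    s = "#" + sentence.replace(" ", "") + "#"   # '#' marks utterance edges
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_manuscript(candidates):
    covered, script = set(), []
    while candidates:
        best = max(candidates, key=lambda c: len(diphones(c) - covered))
        gain = diphones(best) - covered
        if not gain:               # nothing left to gain: stop
            break
        script.append(best)
        covered |= gain
        candidates = [c for c in candidates if c != best]
    return script

print(greedy_manuscript(["en and", "en and en", "to and"]))
# → ['en and en', 'to and']  ("en and" adds no new diphones)
```

The same loop extends naturally to weighted units, so that high-frequency triphones or foreign-name phone sequences count more toward a sentence's gain.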


The number of out-of-vocabulary (OOV) words is likely to be high, as new terms and names frequently appear in the textbooks, requiring sophisticated tools for automatic generation of pronunciations. The Filibuster system distinguishes between four word types: proper names, compounds and simplex words in the target language, and English words.

In order to reach the goal of making the textbooks available for studies, all text — plain Norwegian text and English text passages, OOV words and proper names — needs to be intelligible, raising the demands for a distinct and pragmatic voice.

The development of the Norwegian voice

The development of the Norwegian voice can be divided into four stages: (1) adjustments and completion of the pronunciation dictionary and the text corpus, and the development of the recording manuscripts, (2) recordings of the Norwegian speaker, (3) segmentation and building of the speech database, and (4) quality assurance.

Pronunciation dictionaries

The Norwegian HLT Resource Collection has been made available for research and commercial use by the Language Council for Norwegian (http://www.sprakbanken.uib.no/). The resources include a pronunciation dictionary for Norwegian bokmål with about 780,000 entries, which was used in the Filibuster Norwegian TTS. The pronunciations are transcribed in a somewhat revised SAMPA, and mainly follow the transcription conventions in Øverland (2000). Some changes to the pronunciations were made, mainly consistent adaptations to the Norwegian speaker's pronunciation and removal of inconsistencies, but a number of true errors were also corrected, and a few changes were made due to revisions of the transcription conventions.

To cover the need for English pronunciations, the English dictionary used by the Swedish voice, consisting of about 16,000 entries, was used. The pronunciations in this dictionary are 'Swedish-style' English. Accordingly, they were adapted into 'Norwegian-style' English pronunciations. 24 xenophones were implemented in the phoneme set, of which about 15 have a sufficient number of representations in the speech database and will be used by the TTS system. The remaining xenophones will be mapped onto phonemes that are more frequent in the speech database.

In addition, some proper names from the Swedish pronunciation dictionary were adapted to Norwegian pronunciations, resulting in a proper name dictionary of about 50,000 entries.

Text corpus

The text corpus used for manuscript construction and word frequency statistics consists of about 10.8 million words from news and magazine text, university level textbooks on different topics, and Official Norwegian Reports (http://www.regjeringen.no/nb/dok/NOUer.html?id=1767). The text corpus has been cleaned and sentence chunked.

Recording manuscripts

The construction of the Norwegian recording manuscript was achieved by searching for phonetically rich utterances iteratively. While diphones were used as the main search unit, searches also included high-frequency triphones and syllables.

As mentioned above, university level textbooks include a vast range of different domains and text types, and demand larger recording manuscripts than most TTS systems in order to cover the search units for different text types and languages. Bibliographic references, for example, can have a very complex construction, with authors of different nationalities, name initials of different formats, titles in other languages, page intervals and so on. To maintain a high performance of the TTS system for more complex text structures, the recording manuscript must contain a lot of these kinds of utterances.

To cover the need for English phone sequences, a separate English manuscript was recorded. The CMU ARCTIC database for speech synthesis, with nearly 1,150 English utterances (Kominek and Black, 2003), was used for this purpose. In addition, the Norwegian manuscript contained many utterances with mixed Norwegian and English, as well as email addresses, acronyms, spelling, numerals, lists, and announcements of DAISY specific structures such as page numbers, tables, parallel text and so on.

Recordings

The speech was recorded in NLB's recording studio. An experienced male textbook speaker was recorded by a native supervisor.

The recordings were carried out at 44.1 kHz with 24-bit resolution. In total, 15,604 utterances were recorded.

Table 1. A comparison of the length of the recorded speech databases for different categories, Norwegian and Swedish.

                      Norwegian   Swedish
Total time            26:03:15    28:27:24
Total time (speech)   18:24:39    16:15:09
Segments              568 606     781 769
Phones                519 065     660 349
Words                 118 104     132 806
Sentences             15 604      14 788

A comparison of the figures above shows that the Swedish speaker is about 45% faster than the Norwegian speaker (11.37 vs. 7.83 phones per second). This will result in very large file sets for the Norwegian textbooks, which often consist of more than 400 pages, and a very slow speech rate of the synthetic speech. However, the speech rate can be adjusted in the student's DAISY player or by the TTS system itself. On the other hand, a slow speech rate comes with the benefit that well articulated and clear speech can be attained in a more natural way than by slowing down a voice with an inherently fast speech rate.

Segmentation

Unlike the Swedish voice, for which all recordings were automatically and manually segmented (Ericsson et al., 2007), all the Norwegian utterances were control listened, and the phonetic transcriptions were corrected before the automatic segmentation was done. In that way, only the pronunciation variants that actually occurred in the audio had to be taken into account by the speech recognition tool (Sjölander, 2003). Another difference from the Swedish voice is that plosives are treated as one continuous segment, instead of being split into obstruction and release.

Misplaced phone boundaries and incorrect phone assignments will possibly be corrected in the quality assurance project.

Unit selection and concatenation

The unit selection method used in the Filibuster system is based mainly on phone decision trees, which find candidates with desired properties, and strives to find as long phone sequences as possible to minimise the number of concatenation points. The optimal phone sequence is chosen using an optimisation technique which looks at the phone's joining capability, as well as its spectral distance from the mean of all candidates. The best concatenation point between two sound clips is found by correlating their waveforms.

Text processing

The Swedish text processing system was used as a base for the Norwegian system. Although the two languages are similar in many ways, many modifications were needed.

The tokenisation (at sentence, multi-word and word level) is largely the same for Swedish and Norwegian. One of the exceptions is the sentence division at ordinals, where the standard Norwegian annotation is to mark the digit with a period, as in '17. mai', an annotation that is not used in Swedish.

The Swedish part-of-speech tagging is done by a hybrid tagger: a statistical tagger that uses the POS trigrams of the Swedish SUC2.0 corpus (Källgren et al., 2006), and a rule-based complement which handles critical part-of-speech disambiguation. It should be mentioned that the aim of the inclusion of a part-of-speech tagger is not to achieve perfectly tagged sentences; its main purpose is to disambiguate homographs. Although Swedish and Norwegian morphology and syntax differ, the Swedish tagger and the SUC trigram statistics should be used also for the Norwegian system, even though homographs in Norwegian bokmål seem to need more attention than in Swedish. An example is the relatively frequent Norwegian homographs where one form is a noun and the other a verb or a past participle, for example 'laget', in which the supine verb form (or past participle) is pronounced with the 't' and with accent II ["lɑ:.gət], while the noun form is pronounced without the 't' and with accent I ['lɑ:.gə]. As it stands, it seems that the system can handle these cases to satisfaction. OOV words are assigned their part-of-speech according to language specific statistics on suffixes of different lengths, and from contextual rules. No phrase parsing is done for Norwegian, but there is a future option to predict phrase boundaries from the part-of-speech tagged sentence.
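The waveform-correlation search for a concatenation point mentioned in the unit selection section can be sketched as follows. Real systems correlate windowed audio samples around the candidate join; the short arrays here are toy signals.

```python
# Choosing a concatenation point by correlating waveforms: slide a
# window from the start of clip B over the tail of clip A and keep
# the lag where the inner product over the overlap is highest.
# Toy signals; real systems operate on windowed audio samples.

def best_join_lag(tail_a, head_b, max_lag):
    """Return the lag in [0, max_lag] into tail_a that best matches head_b."""
    win = len(head_b)
    best_score, best_lag = float("-inf"), 0
    for lag in range(max_lag + 1):
        seg = tail_a[lag:lag + win]
        if len(seg) < win:         # ran out of samples in clip A
            break
        score = sum(a * b for a, b in zip(seg, head_b))
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag

# The pulse in clip A lines up with clip B two samples in:
print(best_join_lag([0.0, 0.0, 1.0, 2.0, 1.0, 0.0], [1.0, 2.0, 1.0], 3))  # → 2
```

Joining at the maximally correlated lag keeps the two waveforms in phase across the splice, which reduces audible discontinuities.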

38 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University abbreviations and formats for ordinals, date ex- to-speech system. Proceedings of Fonetik, pressions and suchlike. Similar modifications TMH-QPSR 50(1), 33-36, Stockholm. were carried out for text expansions of the Kominek J. and Black A. (2003). CMU above-mentioned classifications. ARCTIC database for speech synthesis. A language detector that distinguishes the Language Technologies Institute. Carnegie target language from English was also included Mellon University, Pittsburgh PA . Techni- in the Norwegian system. This module looks up cal Report CMU-LTI-03-177. the words in all dictionaries and suggests lan- http://festvox.org/cmu_arctic/cmu_arctic_re guage tag (Norwegian or English) for each port.pdf word depending on unambiguous language Källgren G., Gustafson-Capkova S. and Hart- types of surrounding words. man B. (2006). Stockholm Umeå Corpus OOV words are automatically predicted to 2.0 (SUC2.0). Department of Linguistics, be proper names, simplex or compound Norwe- Stockholm University, Stockholm. gian words or English words. Some of the pro- Sjölander, K. (2003). An HMM-based system nunciations of these words are generated by for automatic segmentation and alignment of rules, but the main part of the pronunciations is speech. Proceedings of Fonetik 2003, 93-96, generated with CART trees, one for each word Stockholm. type. Sjölander K., Sönnebo L. and Tånnander C. The output from the text processor is sent to (2008). Recent advancements in the Filibus- the TTS engine in SSML format. ter text-to-speech system. SLTC 2008. Øverland H. (2000). Transcription Conventions Quality assurance for Norwegian. Technical Report. Nordisk The quality assurance phase consists of two Språkteknologi AS parts, the developers’ own testing to catch gen- eral errors, and a listening test period where na- tive speakers report errors in segmentation, pronunciation and text analysis to the develop- ing team. 
They are also able to correct minor errors by adding or changing transcriptions or editing simpler text processing rules. Some ten textbooks will be produced for this purpose, as well as test documents with utterances of high complexity.
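The optimisation step mentioned under unit selection above is not spelled out in the paper. As a generic illustration of how such a search is commonly done — a Viterbi-style dynamic programme over candidate units, with hypothetical costs and data structures, not the actual Filibuster implementation — consider:

```python
# Generic Viterbi-style unit selection sketch. The cost functions and unit
# representation are illustrative assumptions, not Filibuster's internals.

def target_cost(unit, spec):
    # Count feature mismatches between a candidate unit and the desired spec.
    return sum(1 for key, value in spec.items() if unit.get(key) != value)

def join_cost(left, right):
    # Contiguous units in the recorded corpus concatenate for free, so the
    # search naturally prefers long unbroken phone sequences.
    if left["source"] == right["source"] and left["index"] + 1 == right["index"]:
        return 0.0
    return 1.0

def select_units(specs, candidates):
    """Return the cheapest unit sequence: specs is one feature dict per
    target phone; candidates holds the unit lists found per position."""
    best = [[(target_cost(u, specs[0]), None) for u in candidates[0]]]
    for i in range(1, len(specs)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][j][0] + join_cost(prev, unit), j)
                for j, prev in enumerate(candidates[i - 1]))
            row.append((cost + target_cost(unit, specs[i]), back))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(specs) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Because contiguous candidates join for free, the minimum-cost path reuses the longest available stretches of recorded speech, which matches the behaviour described above.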

Current status

The second phase of quality assurance, with native speakers, will start in May 2009. The system is scheduled to be put into production of Norwegian textbooks by the autumn term of 2009. Currently, the results are promising: the voice appears clear and highly intelligible, also in the generation of more complex utterances such as code switching between Norwegian and English.

References

Black A. and Taylor P. (1997). Automatically clustering similar units for unit selection in speech synthesis. Proceedings of Eurospeech 97, Rhodes, Greece.
DAISY Pipeline (2009). http://www.daisy.org/projekcts/pipeline.
Ericsson C., Klein J., Sjölander K. and Sönnebo L. (2007). Filibuster – a new Swedish text-to-speech system. Proceedings of Fonetik, TMH-QPSR 50(1), 33-36, Stockholm.
Kominek J. and Black A. (2003). CMU ARCTIC databases for speech synthesis. Technical Report CMU-LTI-03-177, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA. http://festvox.org/cmu_arctic/cmu_arctic_report.pdf
Källgren G., Gustafson-Capkova S. and Hartman B. (2006). Stockholm Umeå Corpus 2.0 (SUC 2.0). Department of Linguistics, Stockholm University, Stockholm.
Sjölander K. (2003). An HMM-based system for automatic segmentation and alignment of speech. Proceedings of Fonetik 2003, 93-96, Stockholm.
Sjölander K., Sönnebo L. and Tånnander C. (2008). Recent advancements in the Filibuster text-to-speech system. SLTC 2008.
Øverland H. (2000). Transcription Conventions for Norwegian. Technical Report. Nordisk Språkteknologi AS.

Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Acoustic characteristics of onomatopoetic expressions in child-directed speech
U. Sundberg and E. Klintfors
Department of Linguistics, Stockholm University, Stockholm

Abstract

The purpose of this study was to identify preliminary acoustic and phonological characteristics of onomatopoetic expressions (OE) in Swedish child-directed speech. The materials on one mother interacting with her 4-year-old child were transcribed and used for pitch contour measurements on OE. Measurements were also made on some non-onomatopoetic expressions to be used as controls. The results showed that OE were often composed of CV or CVC syllables, as well as that the syllables or words of the expressions were usually reduplicated. Also, the mother's voice was often modified when OE were used. It was common that the quality of the voice was creaky or that the mother whispered these expressions. There were also changes in intonation, and some of the expressions had higher fundamental frequency (f0) as compared to non-onomatopoetic expressions. In several ways, then, OE can be seen as highly modified child-directed speech.

Introduction

The video materials analyzed in the current study were collected within a Swedish-Japanese research project: A cross-language study of onomatopoetics in infant- and child-directed speech (initiated 2002 by Dr Y. Shimura, Saitama Univ., Japan, and Dr U. Sundberg, Stockholm Univ., Sweden). The aim of the project is to analyze characteristics of onomatopoetic expressions (OE) in the speech production of Swedish and Japanese 2-year-old and 4- to 5-year-old children, as well as in the child-directed speech (CDS) of their mothers.

The aim of the current study is to explore preliminary acoustic and phonological characteristics of OE in Swedish CDS. Therefore an analysis of pitch contours and syllable structure in OE of a mother talking to her 4-year-old child was carried out. Japanese is a language with a rich repertoire of OE in CDS as well as in adult-directed speech (ADS), and the characteristics and functions of OE in Japanese are rather well investigated. The current study aims to somewhat fill in that gap for Swedish child-directed speech.

Background

Onomatopoetic expressions (OE) may be defined as a word or a combination of words that imitate or suggest the source of the sound they are describing. Common occurrences include expressions for actions, such as the sound of something falling into water: splash, or a characteristic sound of an object/animal, such as oink. In general, the relation between a word form and the meaning of the word is arbitrary. OEs are different: a connection between how the word sounds and the object/action exists.

OE are rather few in human languages. In some languages, such as Japanese, OE are though frequent and of great significance (Yule, 2006). In Japanese there are more than 2000 onomatopoetic words (Noma, 1998). These words may be divided into categories, such as "Giseigo" – expressions for sounds made by people/animals (e.g. kyaakyaa 'a laughing or screaming female voice with high f0'), "Giongo" – expressions imitating inanimate objects in nature (e.g. zaazaa 'a shower/lots of raining water pouring down'), and "Gitaigo" – expressions for tactile/visual impressions we cannot perceive auditorily (e.g. niyaniya 'an ironic smile'). Furthermore, plenty of these words are lexicalized in Japanese (Hamano, 1998). Therefore, non-native speakers of Japanese often find it difficult to learn onomatopoetic words. The unique characteristics of onomatopoetic words are also dependent on specific use within diverse subcultures (Ivanova, 2006).

Method

Video/audio recordings of mother-child dyads were collected. The mothers were interacting with their children by using Swedish/Japanese fairy-tale books as inspiration for topics possibly generating use of OE. The mothers were thus instructed not to read aloud the text, but to discuss objects/actions illustrated in the books.

Materials

The materials were two video/sound recordings of one mother interacting with her 4-year-old child. The first recording (approx. 4 min.) was based on engagement in two Japanese books and the other recording (approx. 4 min.) was based on engagement in Pippi Longstocking. The mother uttered 65 OE within these recordings.

Analysis

The speech materials were transcribed with notifications of stress and f0 along a time scale of 10 sec. intervals. Pitch contour analysis of OE (such as voff 'bark') and of non-onomatopoetic words corresponding to the object/action (such as hund 'dog') was performed in Wavesurfer.

Results

The results showed that the mother's voice quality was more modified, using creaky/pressed voice, whispering and highly varying pitch as compared to non-onomatopoetic words in CDS. The creaky voice was manifested by low frequencies with identifiable pulses.

OE such as nöff 'oink', voff 'bark', kvack 'ribbit' and mjau 'meow' were often reduplicated. Further, the OE were frequently composed of CVC/CVCV syllables. If the syllable structure was CCV or CCVC, the second C was realized as an approximant/part of a diphthong, as in /kwak:/ or /mjau/. Every other consonant, every other vowel (e.g. vovve) was typical.

The non-onomatopoetic words chosen for analysis had an f0 range of 80-274 Hz, while the OE had an f0 range of 0-355 Hz. OE thus showed a wider f0 range than non-onomatopoetic words.

The interaction of the mother and the child was characterized by the mother making OE and asking the child: How does a … sound like? The child answered if she knew the expression. If the child did not know any expression – such as when the mother asked: How does it sound when the sun warms nicely? – she simply made a gesture (in this case a circle in the air) and looked at the mother. In the second recording the mother also asked questions on what somebody was doing. For example, several questions on Pippi did not directly concern OE, but were of the kind: What do you think ... is doing?

Concluding remarks

Presumably due to the fact that Swedish does not have any particular OE for e.g. how a turtle sounds, the mother made up her own expressions. She asked her child how the animal might sound, made a sound, and added maybe to express her own uncertainty. Sometimes, when the lack of OE was apparent, such as in the case of a sun, the mother rather described how it feels when the sun warms, haaaa. Alternatively, the mother used several expressions to refer to the same animal, such as kvack and ribbit 'ribbit' for the frog, or nöff 'oink' and a voiceless imitation of the sound of the pig.

Among some of the voiced OE a very high f0 was found, e.g. over 200 Hz for pippi. But since plenty of the expressions were voiceless, general conclusions on pitch contour characteristics are hard to make. The OE uttered with creaky voice had a low f0 between 112-195 Hz. Substantially more of the OE were uttered with creaky voice as compared to non-onomatopoetic words in CDS.

The OE used by the mother were more or less word-like: nöff 'oink' is a word-like expression, while the voiceless imitation of the sound of a pig is not. All the OE had a tendency to be reduplicated. Some were more reduplicated than others, such as pipipi, reflecting how one wants to describe the animal, as well as that pi is a short syllable easy and quick to reduplicate. Similarly, reduplicated voff 'bark' was used to refer to a small/intense dog, rather than to a big one.

In summary, OE contain all of the characteristics of CDS – but more of everything. Variations in intonation are large; the voice quality is highly modulated. Words are reduplicated, stressed and stretched, or produced very quickly. OE are often iconic and therefore easy to understand – they explain the objects via sound illustrations by adding contents into the concepts. It can be speculated that OE contribute to maintaining interaction by – likely for the child appealing – clear articulatory and acoustic contrasts.

Acknowledgements

We thank Åsa Schrewelius and Idah L-Mubiru, students in Logopedics, for data analysis.

References

Hamano, S. (1998) The sound-symbolic system of Japanese. CSLI Publications.
Ivanova, G. (2006) Sound symbolic approach to Japanese mimetic words. Toronto Working Papers in Linguistics 26, 103.
Noma, H. (1998) Languages richest in onomatopoetic words. Language Monthly 27, 30-34.
Yule, G. (2006) The study of language. Cambridge University Press, New York.


Phrase initial accent I in South Swedish
Susanne Schötz and Gösta Bruce
Department of Linguistics & Phonetics, Centre for Languages and Literature, Lund University

Abstract

The topic of this paper is the variability of pitch realisation of phrase-initial accent I. In our study we have observed a difference in variability for the varieties investigated. Central Swedish pitch patterns for phrase-initial accent I both to the East (Stockholm) and to the West (Gothenburg) display an apparent constancy, albeit with distinct patterns: East Central Swedish rising and West Central Swedish fall-rise. In South Swedish, the corresponding pitch patterns can be described as more variable. The falling default accentual pitch pattern in the South is dominating in the majority of the sub-varieties examined, even if a rising pattern and a fall-rise are not uncommon here. There seems to be a difference in geographical distribution, so that towards the northeast within the South Swedish region the percentage of a rising pattern has increased, while there is a corresponding tendency for the fall-rise to be a more frequent pattern towards the northwest. The occurrence of the rising pattern of initial accent I in South Swedish could be seen as an influence from and adaptation to East Central Swedish, and the fall-rise as an adaptation to West Swedish intonation.

Introduction

A distinctive feature of Swedish lexical prosody is the tonal word accent contrast between accent I (acute) and accent II (grave). It is well established that there is some critical variation in the phonetic realisation of the two word accents among regional varieties of Swedish. According to Eva Gårding's accent typology (Gårding & Lindblad 1973, Gårding 1977), based on Ernst A. Meyer's data (1937, 1954) on the citation forms of the word accents – disyllabic words with initial stress – there are five distinct accent types to be recognised (see Figure 1). These accent types by and large also coincide with distinct geographical regions of the Swedish-speaking area. For accent I, type 1A shows a falling pitch pattern, i.e. an early pitch peak location in the stressed syllable and then a fall down to a low pitch level in the next syllable (Figure 1). This is the default pitch pattern in this dialect type for any accent I word occurring in a prominent utterance position. However, taking also post-lexical prosody into account, there is some interesting variability to be found specifically for accent I occurring in utterance-/phrase-initial position.

Variability in the pitch realisation of phrase-initial accent I in South Swedish is the specific topic of this paper. The purpose of our contribution is to try to account for the observed variation of different pitch patterns accompanying an accent I word in this particular phrase position. In addition, we will discuss some internal variation in pitch accent realisation within the South Swedish region. These pitch patterns of accent type 1A will also be compared with the corresponding reference patterns of types 2A and 2B characteristic of Stockholm (Svea) and Gothenburg (Göta) respectively. See Figure 1 for their citation forms.

Figure 1. The five accent types in Eva Gårding's accent typology based on Meyer's original data.

Accentuation and phrasing

In our analysis (Bruce, 2007), the exploitation of accentuation for successive words of a phrase, i.e. in terms of post-lexical prosody, divides the regional varieties of Swedish into two distinct groups. In Central Swedish, both in the West (Göta, prototype: Gothenburg) and in the East (Svea, prototype: Stockholm), two distinct levels of intonational prominence – focal and non-focal accentuation – are regularly exploited. Thus, the expectation for an intonational phrase containing for example three accented words will be an alternation: focal accentuation + non-focal accentuation + focal accentuation. The other regional varieties of Swedish – South, Gotland, Dala, North and Finland Swedish – turn out to be different in this respect and make up another group. South Swedish, as a case in point, is expected to have equal prominence on the constituents of a corresponding phrase with three accented words. This means that for a speaker of South Swedish, focal accentuation is not regularly exploited as an option distinct from regular accentuation.

Figure 2 shows typical examples of a phrase containing three accented words: accent I + accent I + accent II (compound), for three female speakers representing East Central (Stockholm), West Central (Gothenburg) and South Swedish (Malmö) respectively. Note the distinct pitch patterns of the first and second accent I words of the phrase in the Central Swedish varieties – as a reflex of the distinction between focal and non-focal accentuation – in contrast with the situation in South Swedish, where these two words have got the same basic pitch pattern.

Intonational phrasing is expressed in different ways and more or less explicitly in the different varieties, partly constrained by dialect-specific features of accentuation (Figure 2). The rising pitch gesture in the beginning and the falling gesture at the end of the phrase in East Central Swedish is an explicit way of signalling phrase edges, to be found also in various other languages. In West Central Swedish, the rise after the accent I fall in the first word of the phrase could be seen as part of an initial pitch gesture, expressing that there is a continuation to follow. However, there is no falling counterpart at the end of the phrase, but instead a pitch rise. This rise at the end of a prominent word is analysed as part of a focal accent gesture and considered to be a characteristic feature of West Swedish intonation. In South Swedish, the falling gesture at the end of the phrase (like in East Central Swedish) has no regular rising counterpart in the beginning; there is instead a fall, which is the dialect-specific pitch realisation of accent I. All three varieties also display a downdrift in pitch to be seen across the phrase, which is to be interpreted as a signal of coherence within an intonational phrase.

Figure 2. Accentuation and phrasing in varieties of Swedish. Pitch contours of typical examples of a phrase containing three accented words for East Central Swedish (top), West Central Swedish (middle) and South Swedish (bottom).

The following sections describe a small study we carried out in order to gain more knowledge about the pitch pattern variation of phrase-initial accent I in South Swedish.

Speech material

The speech material was taken from the Swedish SpeechDat (Elenius, 1999) – a speech database of read telephone speech of 5000 speakers, registered by age, gender, current location and self-labelled dialect type according to Elert's suggested 18 Swedish dialectal regions (Elert, 1994). For our study, we selected a fair number of productions of the initial intonational phrase (underlined below) of the sentence Flyget, tåget och bilbranschen tävlar om lönsamhet och folkets gunst 'Airlines, train companies and the automobile industry are competing for profitability and people's appreciation'. The target item was the initial disyllabic accent I word flyget. In order to cover a sufficient number (11) of varieties of South Swedish spoken in and around defined localities (often corresponding to towns), our aim was to analyse 12 speakers from each variety, preferably balanced for age
and gender. In four of the varieties, the SpeechDat database did not include as many as 12 speakers. In these cases we settled for a smaller and less gender-balanced speaker set. In addition to the South Swedish varieties, we selected 12 speakers each from the Gothenburg (Göta) and Stockholm (Svea) areas to be used as reference varieties. Table 1 shows the number and gender distribution of the speakers. The geographical locations of the sub-varieties are displayed on a map of South Sweden in Figure 3.

Table 1. Number and gender distribution of speakers from each sub-variety of South Swedish and the reference varieties used in the study.

Sub-variety (≈ town)                    Female  Male  Total
Southern (Laholm)                            3     7     10
Ängelholm                                    7     4     11
Helsingborg                                  5     7     12
Landskrona                                   5     7     12
Malmö                                        6     6     12
Trelleborg                                   3     6      9
Ystad                                        7     5     12
Simrishamn                                   5     2      7
Kristianstad                                 7     5     12
Northeastern Skåne & Western Blekinge        4     5      9
Southern Småland                             5     7     12
Gothenburg (reference variety)               6     6     12
Stockholm (reference variety)                6     6     12
Total                                       69    73    142

Figure 3. Geographic location of the South Swedish and the two reference varieties used in the study. The location of the reference varieties on the map is only indicated and not according to scale.

Method

Our methodological approach combined auditory judgment, visual inspection and acoustic analysis of pitch contours using the speech analysis software Praat (Boersma and Weenink, 2009). Praat was used to extract the pitch contours of all phrases, to smooth them (using a 10 Hz bandwidth), and to draw pitch contours using a semitone scale. Auditory analysis included listening to the original sound as well as to the pitch (using a Praat function which plays back the pitch with a humming sound).

Identification of distinct types of pitch patterns

The first step was to identify the different types of pitch gestures occurring for phrase-initial accent I and to classify them into distinct categories. It should be pointed out that our classification here was made from a melodic rather than from a functional perspective. We identified the following four distinct pitch patterns:

1) Fall: a falling pitch contour often used in South Swedish and corresponding to the citation form of type 1A
2) Fall-rise: a falling-rising pattern typically occurring in the Gothenburg variety
3) Rise: a rising pattern corresponding to the pattern characteristic of the Stockholm variety
4) Level: a high level pitch contour with some representation in most varieties

We would like to emphasise that our division of the pitch contours under study into these four categories may be considered an arbitrary choice to a certain extent. Even if the classification of a particular pitch contour as falling or rising may be straightforward in most cases, we do not mean to imply that the four categories chosen by us should be conceived of as self-evident or predetermined. It should be admitted that there were a small number of unclear cases, particularly for the classification as high level or rising and as high level or fall-rise. These cases were further examined by the two authors together and finally decided on. Figure 4 shows typical pitch contours of one female and one male speaker for each of the four patterns.
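For reference, drawing contours on a semitone scale amounts to a logarithmic rescaling of F0 relative to a fixed reference frequency. A minimal sketch follows; the 100 Hz reference and the moving-average smoother are illustrative assumptions only (Praat's own smoothing works by low-pass filtering at the given bandwidth):

```python
import math

def hz_to_semitones(f0_hz, ref_hz=100.0):
    # 12 semitones per octave (doubling of frequency).
    return 12.0 * math.log2(f0_hz / ref_hz)

def smooth(contour, window=5):
    # Crude moving-average stand-in for bandwidth-based smoothing.
    half = window // 2
    out = []
    for i in range(len(contour)):
        segment = contour[max(0, i - half):i + half + 1]
        out.append(sum(segment) / len(segment))
    return out
```

On a semitone scale, speakers whose F0 ranges differ by an octave cover the same number of semitones, which makes pitch gestures comparable across female and male speakers.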



Categorisation of the speakers into distinct pitch pattern types

The results of our categorisation can be seen in Table 2. In the South Swedish varieties the fall pattern dominated (54 speakers), but the other three patterns were not uncommon, with 24, 23 and 17 speakers respectively. In the Gothenburg variety, eleven speakers used the fall-rise intonation, and only one was classified as belonging to the level category. Ten Stockholm speakers had produced the rise pattern, while two speakers used the level pattern.

Table 2. Results of the categorisation of the 142 speakers into the four distinct categories fall, fall-rise, rise and level, along with their distribution across the South Swedish, Gothenburg and Stockholm varieties.

Pattern     South Sweden  Gothenburg  Stockholm
Fall                  54           0          0
Fall-rise             17          11          0
Rise                  23           0         10
Level                 24           1          2
Total                118          12         12
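The categorisation above was made auditorily and visually by the authors. Purely as a computational analogue — with an arbitrary illustrative threshold, not a procedure from the study — the four melodic categories could be approximated from a word's pitch movement in semitones:

```python
def classify_contour(st, threshold=1.5):
    """Label a pitch contour (semitone values across the accent I word)
    as 'fall', 'fall-rise', 'rise' or 'level'. The 1.5 st threshold is an
    arbitrary illustrative choice, not taken from the study."""
    i_min = min(range(len(st)), key=st.__getitem__)
    initial_fall = st[0] - st[i_min]   # drop from onset to the minimum
    final_rise = st[-1] - st[i_min]    # climb from the minimum to offset
    if initial_fall >= threshold and final_rise >= threshold:
        return "fall-rise"
    if initial_fall >= threshold:
        return "fall"
    if st[-1] - st[0] >= threshold:
        return "rise"
    return "level"
```

A heuristic of this kind would face exactly the borderline cases noted above: a shallow rise is hard to separate from a high level contour, which is why the unclear cases were decided on jointly by the authors.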

Geographical distribution

Figure 5 shows the geographical distribution of the four patterns across varieties. Each pie chart in the figure represents the distribution of patterns within one sub-variety. The southern and western varieties of South Swedish display a majority of the fall pattern, although other patterns are represented as well. In the northwestern varieties, the fall pattern is much less common (only one speaker each in Ängelholm and southern Halland), while the fall-rise pattern – the most common one in the Gothenburg variety – is more frequent. Moreover, the level pattern is also more common here than in the varieties further to the south. The fall pattern is also common in Kristianstad and in northeastern Skåne and western Blekinge. In these two varieties the rise pattern – the category used by most speakers in Stockholm – is also rather common. As already mentioned, no fall pattern was observed in the two reference varieties. In Gothenburg the fall-rise pattern is used by all speakers except one, while Stockholm displays a vast majority of rise patterns and two level ones.

Figure 4. Two typical pitch contours of the initial phrase Flyget, tåget och bilbranschen 'Airlines, train companies and the automobile industry' for each of the four distinct pitch pattern types found in the South Swedish varieties of this study: fall, fall-rise, rise and level (solid line: female speaker; dashed line: male speaker).


Figure 6. Pitch contours of the initial phrase Flyget, tåget och bilbranschen 'Airlines, train companies and the automobile industry' produced by 118 South Swedish speakers.

Figure 5. Geographical distribution of the four pitch patterns of phrase-initial accent I observed in South Swedish varieties and two reference varieties (Stockholm and Gothenburg Swedish).

Additional observations

While we have observed a variability for pitch patterns accompanying the accent I word in initial position of the phrase under study in South Swedish, the other two accented words of the phrase show an apparent constancy for their pitch patterns. The second accent I word tåget has the regular falling pattern, while the third, final word of the phrase, bilbranschen (accent II compound), displays an expected rise-fall on the sequence of syllables consisting of the primary stress (the first syllable of the word) and the next syllable. This is true of basically all productions of each variety of South Swedish, as can be seen in Figure 6, showing the pitch contours of all South Swedish speakers.

Discussion

In our study of the phonetic realisation of an accent I word in phrase-initial position, we have observed a difference in variability for the varieties investigated. Even if it should be admitted that there may be some difficulties of classification of the pitch patterns involved, and that the data points in our study may be relatively few, the variability among pitch patterns for phrase-initial accent I is still clear. So while the Central Swedish pitch patterns for phrase-initial accent I both to the East (Stockholm) and to the West (Gothenburg) display an apparent constancy, albeit with distinct patterns – East Central Swedish rising and West Central Swedish fall-rise – the corresponding pitch patterns in South Swedish can be described as more variable. This would appear to be true of each of the sub-varieties in this group, but there is also an interesting difference between some of them. As we have seen, the falling default pitch pattern in the South is dominating in the majority of the sub-varieties examined, even if both a rising pattern and a fall-rise are not uncommon here. But there seems to be a difference in geographical distribution, so that towards the northeast within the South Swedish region the percentage of a rising pattern has increased, while there is a corresponding tendency for the fall-rise to be a more frequent pattern towards the northwest. A high level pitch, which can be seen as functionally equivalent to the rising pattern (and maybe even to the fall-rise), is a relatively frequent pattern only in some northern sub-varieties of the South Swedish region.

It is tempting to interpret the occurrence of the rising pattern of initial accent I in South Swedish as an influence from and adaptation to East Central Swedish. Also the occurrence of the fall-rise can be seen as an adaptation to West Swedish intonation. The addition of a rise after the falling gesture, resulting in a fall-rise (typical of West Central Swedish), would – for a South Swedish speaker – appear to be less of a concession than the substitution of a falling pitch gesture with a rising one. The added rise after the fall for phrase-initial accent I does not seem to alter the intonational phrasing in a fundamental way. But even if a rising pitch gesture for initial accent I may appear to be a more fundamental change of the phrase intonation, it still seems to be a feasible modification of South Swedish phrase intonation. The integration of a rising pattern in this phrase position does not appear to disturb the general structure of South Swedish phrase intonation. Our impression is further that having a rising pitch gesture on the first accent I word followed by a regular falling gesture on the second word (creating a kind of hat pattern, as it were) does not change the equal prominence to be expected for the successive words of the phrase under study either. Moreover, a pitch rise in the beginning of an intonation unit (as well as a fall at the end of a unit) could be seen as a default choice for intonational phrasing, if the language or dialect in question does not impose specific constraints dictated for example by features of accentuation.

Acknowledgements

This paper was initially produced as an invited contribution to a workshop on phrase-initial pitch contours organised by Tomas Riad and Sara Myrberg at the Scandinavian Languages Department, Stockholm University, in March 2009. It is related to the SIMULEKT project (cf. Beskow et al., 2008), a co-operation between Phonetics at Lund University and Speech Communication at KTH Stockholm, funded by the Swedish Research Council 2007-2009.

References

Beskow J., Bruce G., Enflo L., Granström B. and Schötz S. (alphabetical order) (2008) Recognizing and modelling regional varieties of Swedish. Proceedings of Interspeech 2008, Brisbane, Australia.
Boersma P. and Weenink D. (2009) Praat: doing phonetics by computer (version 5.1) [computer program]. http://www.praat.org/, visited 30-Mar-09.
Bruce G. (2007) Components of a prosodic typology of Swedish intonation. In Riad T. and Gussenhoven C. (eds) Tones and Tunes, Volume 1, 113-146. Berlin: Mouton de Gruyter.
Elenius K. (1999) Two Swedish SpeechDat databases – some experiences and results. Proceedings of Eurospeech 99, 2243-2246.
Elert C.-C. (1994) Indelning och gränser inom området för den nu talade svenskan – en aktuell dialektografi. In Edlund L. E. (ed) Kulturgränser – myt eller verklighet, 215-228. Umeå, Sweden: Diabas.
Gårding E. (1977) The Scandinavian word accents. Lund: Gleerup.
Gårding E. and Lindblad P. (1973) Constancy and variation in Swedish word accent patterns. Working Papers 7, 36-110. Lund: Lund University, Phonetics Laboratory.
Meyer E. A. (1937) Die Intonation im Schwedischen, I: Die Sveamundarten. Studies Scand. Philol. Nr. 10. Stockholm University.
Meyer E. A. (1954) Die Intonation im Schwedischen, II: Die Sveamundarten. Studies Scand. Philol. Nr. 11. Stockholm University.


Modelling compound intonation in Dala and Gotland Swedish
Susanne Schötz (1), Gösta Bruce (1), Björn Granström (2)
(1) Department of Linguistics & Phonetics, Centre for Languages & Literature, Lund University
(2) Department of Speech, Music & Hearing, School of Computer Science & Communication, KTH

Abstract

As part of our work within the SIMULEKT project, we are modelling compound word intonation in regional varieties of Swedish. The focus of this paper is on the Gotland and Dala varieties of Swedish and the current version of SWING (SWedish INtonation Generator). We examined productions by 75 speakers of the compound mobiltelefonen 'the mobile phone'. Based on our findings for pitch patterns in compounds we argue for a possible division into three dialect regions: 1) Gotland: a high upstepped pitch plateau, 2) Dala-Bergslagen: a high regular pitch plateau, and 3) Upper Dalarna: a single pitch peak in connection with the primary stress of the compound. The SWING tool was used to model and simulate compounds in the three intonational varieties. Future work includes perceptual testing to see if listeners are able to identify a speaker as belonging to the Gotland, Dala-Bergslagen or Upper Dalarna regions, depending on the pitch shape of the compound.

Introduction

Within the SIMULEKT project (Simulating Intonational Varieties of Swedish) (Bruce et al., 2007), we are studying the prosodic variation characteristic of different regions of the Swedish-speaking area. Figure 1 shows a map of these regions, corresponding to our present dialect classification scheme.

In our work, various forms of speech synthesis and the Swedish prosody model (Bruce & Gårding, 1978; Bruce & Granström, 1993; Bruce, 2007) play prominent roles. To facilitate our work with testing and further developing the model, we have designed a tool for analysis and modelling of Swedish intonation by resynthesis: SWING. The aim of the present paper is two-fold: to explore variation in compound word intonation specifically in two regions, namely Dala and Gotland Swedish, and to describe the current version of SWING and how it is used to model intonation in regional varieties of Swedish. We will exemplify this with the modelling of pitch patterns of compounds in Dala and Gotland Swedish.

Figure 1. Approximate geographical distribution of the seven main regional varieties of Swedish.

Compound word intonation

The classical accent typology by Gårding (1977) is based on Meyer's (1937, 1954) pitch curves in disyllabic simplex words with initial stress having either accent I or accent II. It makes a first major division of Swedish intonation into single-peaked (1) and double-peaked (2) types, based on the number of pitch peaks for a word with accent II. According to this typology, the double-peaked type is found in Central Swedish both to the West (Göta) and to the East (Svea) as well as in North Swedish. The single-peaked accent type is characteristic of the South Swedish, Dala and Gotland regional varieties. Within this accent type there is a further division into the two subtypes 1A and 1B, with some difference in pitch peak timing – earlier vs. later – relative to the stressed syllable. It has been shown that the pitch patterns of compound words can be used as an even better diagnostic than simplex words for distinguishing
between intonational varieties of Swedish (Riad 1998, Bruce 2001, 2007). A compound word in Swedish contains two stresses, primary stress (ˈ) on the first element and secondary stress (ˌ) on the final element. In most varieties of Swedish a compound takes accent II. The exception is South Swedish, where both accents can occur (Bruce, 2007). A critical issue is whether the secondary stress of a compound is a relevant synchronisation point for a pitch gesture or not. Figure 2 shows stylised pitch patterns of accent II compounds, identifying four different shapes characteristic of distinct regional varieties of Swedish (Bruce 2001). The target patterns for our discussion in this paper are the two types having either a single peak in connection with the primary stress of the compound or a high plateau between the primary and the secondary stresses of the word. These two accentual types are found mainly in South Swedish, and in the Dala region and on the isle of Gotland respectively.

It has been suggested that the pitch pattern of an accent II compound in the Gotland and Dala dialect types has basically the same shape, with the high pitch plateau extending roughly from the primary to the secondary stress of the word. The specific point of interest of our contribution is to examine this idea about the similarity of pitch patterns of compound words particularly in Gotland and Dala Swedish.

Figure 2. Schematic pitch patterns of accent II compound words in four main intonational varieties of Swedish (after Bruce, 2001). The first arrow marks the CV-boundary of the primary stress, and the second/third arrow marks the CV-boundary of the secondary stress. In a “short” compound the two stresses are adjacent, while in a “long” compound the stresses are not directly adjacent.

Speech material and method

The speech material was taken from the Swedish SpeechDat (Elenius, 1999), a database containing read telephone speech. It contains speech from 5000 speakers registered by age, gender, current location and self-labelled dialect type according to Elert’s suggested 18 Swedish dialectal regions (Elert, 1994). As the target word of our examination, we selected the initial “long” compound /moˈbilteleˌfonen/ from the sentence Mobiltelefonen är nittiotalets stora fluga, både bland företagare och privatpersoner ‘The mobile phone is the big hit of the nineties, both among business people and private persons’. Following Elert’s classification, we selected 75 productions of mobiltelefonen from the three dialect regions which can be labelled roughly as Gotland, Dala-Bergslagen and Upper Dalarna Swedish (25 speakers of each dialect).

F0 contours of all productions were extracted, normalised for time (expressed as a percentage of the word duration) and plotted on a semitone scale in three separate graphs: one for each dialectal region.

Tentative findings

Figure 3 shows the F0 contours of the speakers from the three dialectal regions examined. Even if there is variation to be seen among the F0 contours in each of these graphs, there is also some constancy to be detected. For both Gotland and Dala-Bergslagen a high pitch plateau, i.e. early rise + high level pitch + late fall, for the compound can be traced. A possible difference between the two dialect types may be that, while speakers representing Dala-Bergslagen Swedish have a regular high pitch plateau, Gotland speakers tend to have more of an upstepped pitch pattern for the plateau, i.e. early rise + high level pitch + late rise & fall.

Among the speakers classified as representing Upper Dalarna Swedish there is more internal variation of the F0 contours to be seen. However, this variation can be resolved into two basic distinct pitch patterns: either a single pitch peak in connection with the primary stress of the compound or a high pitch plateau. These two patterns would appear to have a geographical distribution within the area, so that the high pitch plateau is more likely to occur towards the South-East, i.e. in places closer to the Dala-Bergslagen dialect region.


Figure 3. Variation in compound word intonation. F0 contours of the compound word mobiltelefonen (accent II) produced by 25 speakers each of Gotland, Dala-Bergslagen and Upper Dalarna Swedish.

Figure 4 shows examples of typical F0 contours of the compound word mobiltelefonen produced by one speaker from each of the three dialect regions discussed. For the speaker from Gotland the high plateau is further boosted by a second rise before the final fall, creating an upstepped pitch pattern through the compound word. The example compound by the speaker from Dala-Bergslagen is instead characterised by a high pitch plateau, i.e. a high pitch level extending between the rise synchronised with the primary stress and the fall aligned with the secondary stress. A single pitch peak in connection with the primary stress of the compound, followed by a fall and a low pitch level in connection with the secondary stress of the word, is characteristic of the speaker representing Upper Dalarna Swedish.

Figure 4. Compound word intonation. Typical example F0 contours of mobiltelefonen produced by a speaker from each of the three dialect regions Gotland, Dala-Bergslagen and Upper Dalarna Swedish.

SWING

SWING (SWedish INtonation Generator) is a tool for analysis and modelling of Swedish intonation by resynthesis. It comprises several parts joined by the speech analysis software Praat (Boersma & Weenink, 2009), which also serves as graphical interface. Using an input annotated speech sample and an input rule file, SWING generates and plays PSOLA resynthesis – with rule-based and speaker-normalised intonation – of the input speech sample. Additional features include visual display of the output on the screen, and options for printing various kinds of information to the Praat console (Info window), e.g. rule names and values, the time and F0 of generated pitch points, etc. Figure 5 shows a schematic overview of the tool.

The input speech sample to be used with the tool is manually annotated. Stressed syllables are labelled prosodically and the corresponding vowels are transcribed orthographically. Figure


Figure 5. Schematic overview of the SWING tool.

6 displays an example compound word annotation, while Table 1 shows the prosodic labels that are handled by the current version of the tool.

Figure 6. Example of an annotated input speech sample.

Table 1. Prosodic labels used for annotation of speech samples to be analysed by SWING.

Label   Description
pa1     primary stressed (non-focal) accent 1
pa2     primary stressed (non-focal) accent 2
pa1f    focal accent 1
pa2f    focal accent 2
cpa1    primary stressed accent 1 in compounds
cpa2    primary stressed accent 2 in compounds
csa1    secondary stressed accent 1 in compounds
csa2    secondary stressed accent 2 in compounds

Rules

In SWING, the Swedish prosody model is implemented as a set of rule files – one for each regional variety of the model – with timing and F0 values for critical pitch points. These files are text files with a number of columns; the first contains the rule names, and the following comprise three pairs of values, corresponding to the timing and F0 of the critical pitch points of the rules. The three points are called ini (initial), mid (medial), and fin (final). Each point contains values for timing (T) and F0 (F0). Timing is expressed as a percentage into the stressed syllable, starting from the onset of the stressed vowel. Three values are used for F0: L (low), H (high) and H+ (extra high, used in focal accents). The pitch points are optional; they can be left out if they are not needed by a rule. New rules can easily be added and existing ones adjusted by editing the rule file. Table 2 shows an example of the rules for compound words in Gotland, Dala-Bergslagen and Upper Dalarna Swedish. Several rules contain an extra pitch gesture in the following (unstressed) segment of the annotated input speech sample. This extra part has the word ‘next’ attached to its rule name; see e.g. cpa2_next in Table 2.

Table 2. Rules for compound words in Gotland, Dala-Bergslagen and Upper Dalarna Swedish with timing (T) and F0 values for initial (ini), mid (mid) and final (fin) points (‘_next’: extra gesture; see Table 1 for additional rule name descriptions).

Rule        iniT  iniF0  midT  midF0  finT  finF0
Gotland
cpa2         50    L
cpa2_next    30    H
csa2         30    H      70    H+
csa2_next    30    L
Dala-Bergslagen
cpa2         50    L
cpa2_next    30    H
csa2         30    H
csa2_next    30    L
Upper Dalarna
cpa2          0    L      60    H
cpa2_next    30    L
csa2

Procedure

Analysis with the SWING tool is fairly straightforward. The user selects one input speech sample and one rule file to use with the tool, and which (if any) text (rules, pitch points, debugging information) to print to the Praat console. A Praat script generates resynthesis of the
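The rule-file format described above (rule name followed by up to three timing/F0 pairs) can be sketched with a small parser. This is an illustrative reconstruction under the stated column layout, not the actual SWING code; the real file format may differ in detail.

```python
def parse_rule_file(text):
    """Parse a SWING-style rule file: each line holds a rule name
    followed by up to three (timing %, F0 level) pairs, assigned in
    order to the ini, mid and fin pitch points. Missing points are
    simply absent, since pitch points are optional.
    (Illustrative sketch only.)"""
    rules = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, values = fields[0], fields[1:]
        points = {}
        for label, (t, f0) in zip(("ini", "mid", "fin"),
                                  zip(values[0::2], values[1::2])):
            points[label] = (float(t), f0)  # timing in %, F0 as L/H/H+
        rules[name] = points
    return rules

# Upper Dalarna compound rules from Table 2:
rules = parse_rule_file("cpa2 0 L 60 H\ncpa2_next 30 L")
# rules["cpa2"]["ini"] == (0.0, "L"); rules["cpa2"]["mid"] == (60.0, "H")
```

Keeping the rules in plain text files like this is what makes it easy to add new rules or adjust existing ones without touching the tool itself.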

input speech sample with a rule-based output pitch contour based on 1) the pitch range of the input speech sample, used for speaker normalisation, 2) the annotation, used to identify the time and pitch gestures to be generated, and 3) the rule file, containing the values of the critical pitch points. The Praat graphical user interface provides immediate audio-visual feedback of how well the rules work, and also allows for easy additional manipulation of pitch points with the Praat built-in Manipulation feature.

Modelling compounds with SWING

SWING is now being used in our work with testing and developing the Swedish prosody model for compound words. Testing is done by selecting an input sound sample and a rule file of the same intonational variety. If the model works adequately, there should be a close match between the F0 contour of the original version and the rule-based one generated by the tool. Figure 7 shows Praat Manipulation objects for the three dialect regions Gotland, Dala-Bergslagen and Upper Dalarna Swedish modelled with the corresponding rules for each dialect region.

Figure 7. Simulation of compound words with SWING. Praat Manipulation displays of mobiltelefonen of the three dialect regions Gotland, Dala-Bergslagen and Upper Dalarna Swedish (simulation: circles connected by solid line; original pitch: light-grey line).

The light grey lines show the original pitch of each dialect region, while the circles connected with the solid lines represent the rule-generated output pitch contours. As can be seen in Figure 7, the simulated (rule-based) pitch patterns clearly resemble the corresponding three typical compound word intonation patterns shown in Figure 4. There is also a close match between the original pitch of the input speech samples and the simulated pitch contour in all three dialectal regions.

Discussion and additional remarks

The present paper partly confirms earlier observations about pitch patterns of word accentuation in the regional varieties of Dala and Gotland Swedish, and partly adds new specific pieces of information, potentially extending our knowledge about compound word intonation of these varieties.

One point of discussion is the internal variation within Dala Swedish, with a differentiation of the pitch patterns of word accents into Upper Dalarna and Dala-Bergslagen intonational subvarieties. This division has earlier been suggested by Engstrand and Nyström (2002), revisiting Meyer’s pitch curves of the two word accents in simplex words, with speakers representing Upper Dalarna having a slightly earlier timing of the relevant pitch peak than speakers from Dala-Bergslagen. See also Olander’s study (2001) of Orsa Swedish as a case in point concerning word intonation in a variety of Upper Dalarna. Our study of compound word intonation clearly demonstrates the characteristic rising-falling pitch pattern in connection with the primary stress of a compound word in Upper Dalarna, as opposed to the high pitch plateau between the primary and secondary stresses of the compound in Dala-Bergslagen, even if there is also some variability among the different speakers investigated here. We would even like to suggest that compound word intonation in Dala-Bergslagen and Upper Dalarna Swedish is potentially distinct. It would also appear to be true that a compound in Upper Dalarna has got the same basic pitch shape as that of South Swedish. Generally, word intonation in Upper Dalarna Swedish and South Swedish would even seem to be basically the same.

Another point of discussion is the suggested similarity of word intonation between Dala and Gotland Swedish. Even if the same basic pitch pattern of an accent II compound – the pitch plateau – can be found for speakers representing varieties of both Dala-Bergslagen and Gotland, there is also an interesting difference to be discerned and further examined. As has been shown above, Gotland Swedish speakers tend to display more of an upstepped pitch shape for
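The generation step just described (rule values plus the speaker's pitch range producing absolute pitch points) can be sketched as follows. The mapping of L/H/H+ onto positions within the speaker's range is an assumption for illustration; the paper does not specify how SWING scales the abstract levels.

```python
def rule_to_pitch_points(points, syll_start, syll_dur, f0_min, f0_max):
    """Turn one rule's (timing %, level) points into absolute (time, F0)
    pitch points, placing L/H/H+ within the speaker's pitch range for
    speaker normalisation. The level fractions below are assumed,
    not taken from the SWING implementation."""
    level_frac = {"L": 0.0, "H": 0.8, "H+": 1.0}  # assumed range positions
    result = []
    for timing_pct, level in points:
        t = syll_start + syll_dur * timing_pct / 100.0
        f0 = f0_min + (f0_max - f0_min) * level_frac[level]
        result.append((t, f0))
    return result

# Upper Dalarna cpa2 rule (ini 0 L, mid 60 H) on a 0.25 s stressed
# syllable starting at 0.5 s, for a speaker with an 80-180 Hz range:
pts = rule_to_pitch_points([(0, "L"), (60, "H")], 0.5, 0.25, 80.0, 180.0)
# pts[0] == (0.5, 80.0); pts[1] is approx. (0.65, 160.0)
```

In the real tool these points would then be written into a Praat Manipulation object and resynthesised with PSOLA; here they simply illustrate how percent timings and abstract pitch levels become a concrete contour.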

the compound word, while Dala-Bergslagen speakers have a more regular high pitch plateau. Our preliminary simulation of compound word intonation for Dala and Gotland with the SWING tool is also encouraging. We are planning to run some perceptual testing to see whether listeners will be able to reliably identify a speaker as belonging to the Gotland, Dala-Bergslagen or Upper Dalarna regions depending on the specific pitch shape of the compound word.

Acknowledgements

The work within the SIMULEKT project is funded by the Swedish Research Council 2007-2009.

References

Boersma P. and Weenink D. (2009) Praat: doing phonetics by computer (version 5.1) [computer program]. http://www.praat.org/, visited 30-Mar-09.
Bruce G. (2001) Secondary stress and pitch accent synchronization in Swedish. In van Dommelen W. and Fretheim T. (eds) Nordic Prosody VIII, 33-44. Frankfurt am Main: Peter Lang.
Bruce G. (2007) Components of a prosodic typology of Swedish intonation. In Riad T. and Gussenhoven C. (eds) Tones and Tunes, Volume 1, 113-146. Berlin: Mouton de Gruyter.
Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody, 219-228. Lund: Department of Linguistics.
Bruce G. and Granström B. (1993) Prosodic modelling in Swedish speech synthesis. Speech Communication 13, 63-73.
Bruce G., Granström B., and Schötz S. (2007) Simulating Intonational Varieties of Swedish. Proc. of ICPhS XVI, Saarbrücken, Germany.
Elenius K. (1999) Two Swedish SpeechDat databases - some experiences and results. Proc. of Eurospeech 99, 2243-2246.
Elert C.-C. (1994) Indelning och gränser inom området för den nu talade svenskan - en aktuell dialektografi. In Edlund L.E. (ed) Kulturgränser - myt eller verklighet, 215-228. Umeå, Sweden: Diabas.
Engstrand O. and Nyström G. (2002) Meyer´s accent contours revisited. Proceedings from Fonetik 2002, the XVth Swedish Phonetics Conference, Speech, Music and Hearing, Quarterly Progress and Status Report 44, 17-20. KTH, Stockholm.
Gårding E. (1977) The Scandinavian word accents. Lund: Gleerup.
Gårding E. and Lindblad P. (1973) Constancy and variation in Swedish word accent patterns. Working Papers 7, 36-110. Lund: Lund University, Phonetics Laboratory.
Meyer E. A. (1937) Die Intonation im Schwedischen, I: Die Sveamundarten. Studies Scand. Philol. Nr. 10. Stockholm University.
Meyer E. A. (1954) Die Intonation im Schwedischen, II: Die Sveamundarten. Studies Scand. Philol. Nr. 11. Stockholm University.
Olander E. (2001) Word accents in the Orsa dialect and in Orsa Swedish. In Fonetik 2001, 132-135. Working Papers 49, Linguistics, Lund University.
Riad T. (1998) Towards a Scandinavian accent typology. In Kehrein W. and Wiese R. (eds) Phonology and morphology of the Germanic languages, 77-109. Tübingen: Max Niemeyer.


The acoustics of Estonian Swedish long close vowels as compared to Central Swedish and Finland Swedish Eva Liina Asu1, Susanne Schötz2 and Frank Kügler3 1Institute of Estonian and General Linguistics, University of Tartu 2Department of Linguistics and Phonetics, Centre for Languages and Literature, Lund University 3Department of Linguistics, Potsdam University

Abstract

This pilot study investigates the phonetic realisation of Estonian Swedish long close vowels, comparing them with Central Swedish and Finland Swedish counterparts. It appears that in the Rickul variety of Estonian Swedish there is a distinction between only three long close vowels. The analysed vowels of Estonian Swedish are more similar to those of Central Swedish than Finland Swedish, as measured by the Euclidean distance. Further research with more data is needed to establish the exact vowel space and phonetic characteristics of Estonian Swedish dialects.

Introduction

This study is a first step in documenting the phonetic characteristics of Estonian Swedish (ES), a highly endangered variety of Swedish spoken historically on the islands and western coast of Estonia. Despite its once flourishing status, ES is at present on the verge of extinction. Most of the Estonian Swedish community fled to Sweden during WWII. Today only a handful of elderly native ES speakers remain in Estonia, and about a hundred in Sweden.

ES has received surprisingly little attention and was not, for instance, included in the SweDia 2000 project (Bruce et al., 1999) because there were no speakers from younger generations. To our knowledge, ES has not been analysed acoustically before; all the existing work on its sound system has been conducted in the descriptive framework of dialect research (e.g. E. Lagman, 1979). Therefore, the aim of this study is to carry out the first acoustic analysis of ES by examining the quality of close vowels. In this pilot study we will focus on ES long close vowels, comparing them to those of Finland Swedish (FS) and Central Swedish (CS), the two varieties of Swedish that have had most influence on ES in recent times.

Swedish is unique among the world’s languages because of a number of phonologically distinct contrasts in the inventory of close vowels (cf. Ladefoged and Maddieson, 1996). It has been shown, however, that there is considerable variation in the realisation of these contrasts depending on the variety of Swedish (Elert, 2000, Kuronen, 2001). Thus, the study of close vowels seems like a good place to start the acoustic analysis of the sound system.

General Characteristics of Estonian Swedish

Swedish settlers started arriving in Estonia in the Middle Ages. During several centuries, they continued coming from various parts of Sweden and Finland, bringing different dialects which influenced the development of separate ES varieties. ES dialects are usually divided into four dialect areas on the basis of their sound system and vocabulary (see Figure 1).

Figure 1. The main dialect areas of Estonian Swedish in the 1930s (from E. Lagman 1979: 2).


The largest area is the Nuckö-Rickul-Ormsö area (including Dagö), followed by the Rågö-Korkis-Vippal area. Separate dialect areas are formed by the Island of Runö and the Island of Nargö. E. Lagman (1979: 5) claims that connections between the different dialect areas were not particularly lively, which made it possible for the separate dialects to retain their characteristic traits up to modern times.

Another factor which has shaped ES is the fact that until the 20th century, ES dialects were almost completely isolated from varieties of Swedish in Sweden, and therefore did not participate in several linguistic changes that occurred for instance in Standard Swedish (Tiberg, 1962: 13, Haugen, 1976), e.g. the Great Quantity Shift that took place in most Scandinavian varieties between 1250 and 1550. ES, as opposed to Standard Swedish, has retained the archaic ‘falling’ diphthongs, e.g. stain ‘sten’ (stone), haim ‘hem’ (home) (E. Lagman, 1979: 47).

Starting from the end of the 19th century, however, ES came gradually in closer contact with above all Stockholm Swedish and Finland Swedish. It was also around that time that the so called ‘high’ variety of ES (den estlandssvenska högspråksvarianten) appeared in connection with the development of the education system. This was the regional standard that was used as a common language within the ES community.

According to E. Lagman (1979: 5) the main features of ES dialects resemble most those of the variety of Swedish spoken in Nyland in South Finland. Lexically and semantically, the ES dialects have been found to agree with Finland Swedish and North Swedish (Norrbotten) dialects on the one hand, and with dialects in East Central (Uppland) and West (Götaland) Sweden and the Island of Gotland on the other hand (Bleckert, 1986: 91). It has been claimed that the influence of Estonian on the sound system of ES is quite extensive (Danell, 1905-34, ref. in H. Lagman, 1971: 13), although it has to be noted that the degree of language contact with Estonian varied considerably depending on the dialect area (E. Lagman, 1979: 4).

Swedish long close vowels

Of the three varieties of Swedish included in the present study, it is the CS close vowels that have been subject to the most extensive acoustic and articulatory analyses. Considerably less is known about FS vowels, and no acoustic data is so far available for ES vowels.

CS exhibits a phonological four-way contrast in close vowels /iː – yː – ʉː – uː/, where /iː/, /yː/ and /ʉː/ are front vowels with many similar articulatory and acoustic features, and /uː/ is a back vowel (Riad, 1997). While /iː/ is considered an unrounded vowel and /yː/ its rounded counterpart, /ʉː/ has been referred to as: (1) a labialised palatal vowel with a tongue position similar (but slightly higher) to [ø:], but with a difference in lip closure (Malmberg, 1966: 99-100), (2) a close rounded front vowel, more open than [iː] and [yː] (Elert, 2000: 28; 31; 49), (3) a further back vowel pronounced with pursed lips (hopsnörpning) rather than lip protrusion as is the case with /yː/ (Kuronen, 2000: 32), and (4) a protruded (framskjuten) central rounded vowel (Engstrand, 2004: 113).

The two vowels /iː/ and /yː/ display similar F1 and F2 values (Fant, 1969), and can be separated only by F3, which is lower for /y:/. Malmberg (1966: 101) argues that the only relevant phonetic difference between /ʉ:/ and /y:/ can be seen in the F2 and F3 values. An additional characteristic of long close vowels in CS is that they tend to be diphthongised. Lowering of the first three formants at the end of the diphthongised vowels /ʉː/ and /uː/ has been reported by e.g. Kuronen (2000: 81-82), while the diphthongisation of /iː/ and /yː/ results in a higher F1 and lower F2 at the end of the vowel (Kuronen, 2000: 88).

In FS, the close vowels differ somewhat from the CS ones, except /u:/, which is rather similar in both varieties. FS /iː/ and /yː/ are pronounced more open and further front than their CS counterparts. Acoustically, these vowels are realised with lower F1 and higher F2 values than in CS (Kuronen, 2000: 59). In FS, the close central /ʉː/ is pronounced further back than in CS (Kuronen, 2000: 60; 177). There is some debate as to whether the characteristics of FS are a result of language contact with Finnish (Kuronen, 2000: 60) or an independent dialectal development (Niemi, 1981).

The quality of the rounded front vowel /yː/ in the ‘high’ variety of ES is more open than in Standard Swedish (Lagman, 1979: 9). The rounded front vowel /yː/ is said to be missing in ES dialects (Tiberg, 1962: 45, E. Lagman, 1979: 53), and the word by (village) is pronounced with an /iː/. It seems, though, that the exact realisation of the vowel is heavily dependent on its segmental context and the dialect, and most probably also historical sound changes. Thus, in addition to [iː], /yː/ can be realised as [eː], [ɛː], or [ʉː] or as a diphthong


[iœː] or [iʉː] (for examples see E. Lagman, 1979: 53). Considering this variation, it is nearly impossible to predict how /yː/ might be realised in our ES data. Based on E. Lagman’s comment (1979: 5) about ES being most similar to the Nyland variety of FS, we can hypothesise that ES vowels would be realised closer to those of FS than CS. Yet, we would not expect exactly the same distribution of close vowels in ES as in FS or CS.

Materials and method

Speech data

As materials the word list from the SweDia 2000 database was used. The data comprised three repetitions of four words containing long close vowels: dis (mist), typ (type), lus (louse), sot (soot). When recording the ES speakers, the word list had to be adapted slightly because not all the words in the list appear in ES vocabulary. Therefore, dis was replaced by ris (rice), typ by nyp (pinch) and sot by mot (against).

Four elderly ES speakers (2 women and 2 men) were recorded in a quiet setting in Stockholm in March 2009. All the speakers had arrived in Sweden in the mid 1940s as youngsters and were between 80 and 86 years old (mean age 83) at the time of recording. They represent the largest dialect area of ES, the Rickul variety, having all been born there (also all their parents came from Rickul). The ES speakers were recorded using the same equipment as for collecting the SweDia 2000 database: a Sony portable DAT recorder TCD-D8 and Sony tie-pin type condenser microphones ECM-T140.

For the comparison with CS and FS, the word list data from the SweDia 2000 database from two locations was used: Borgå in Nyland was selected to represent FS, while CS was represented by Kårsta near Stockholm. From each of these locations the recordings from 3 older women and 3 older men were analysed. The Borgå speakers were between 53 and 82 years old (mean age 73), and the Kårsta speakers between 64 and 74 years old (mean age 67).

Analysis

The ES data was manually labelled and segmented, and the individual repetitions of the words containing long close vowels were extracted and saved as separate sound and annotation files. Equivalent CS and FS data was extracted from the SweDia database using a Praat script. The segmentation was manually checked and corrected.

A Praat script was used for obtaining the values for the first three formant frequencies (F1, F2, F3) of each vowel with the Burg method. The measurements were taken at the mid-point of each vowel. All formant values were subsequently checked, and implausible or deviant frequencies re-measured and corrected by hand. Mean values were calculated for the female and male speakers for each variety. One-Bark vowel circles were plotted for the female and male target vowels [iː, yː, ʉː, uː] of each variety on separate F1/F2 and F2/F3 plots using another Praat script.

In order to test for statistically significant differences between the dialects, a two-way ANOVA was carried out with the between-subjects factors dialect (3) and gender (2), and a dependent variable formant (3). Finally, a comparison of the inventory of long close vowels in the three varieties was conducted using the Euclidean distance, which was calculated for the first three formants based on values in Bark.

Results

Figure 2 plots the F1 and F2 values separately for female and male speakers for each of the three dialects. It can be seen that the distribution is roughly similar for both female and male speakers in all varieties.

There is a significant effect of dialect on F2 for the vowel /iː/ (F(2, 10) = 8.317, p<0.01). In ES, the F2 is significantly higher than in CS and FS. For the vowel /yː/ there is a significant effect of dialect on F1 (F(2, 10) = 7.022, p<0.05). The ES target /yː/ has a higher F1 than the other two varieties. The F2 of the vowel [ʉː] is significantly lower in FS than in ES and CS (F(2, 10) = 61.596, p<0.001); the vowel is realised furthest back in FS. For the vowel /uː/ there is a significant effect of dialect on both F1 (F(2, 10) = 4.176, p<0.05) and F2 (F(2, 10) = 22.287, p<0.001). F2 is lowest in FS.

It can be seen in Figure 2 that in CS, the three vowels /iː/, /yː/ and /ʉː/ cluster close together on the F1/F2 plot. The vowel qualities are, however, separated on the F3 dimension, as shown in Figure 3 where the F2 and F3 values are plotted against each other. FS /yː/ has a significantly lower F3 than that of ES and CS (F(2, 10) = 10.752, p<0.01).


Figure 2. F1/F2 plots of long close vowels for female and male speakers of Estonian Swedish, Finland Swedish and Central Swedish.

Figure 3. F2/F3 plots of long close vowels for female and male speakers of Estonian Swedish, Finland Swedish and Central Swedish.


Figure 4. The Euclidean distance for the first three formants (in Bark) for female and male speakers.

Figure 4 shows the Euclidean distance between dialects of long close vowels for female and male speakers. The black bars display the distance between ES and CS, grey bars between ES and FS, and white bars between CS and FS. Except for /iː/ in female speakers, the long close vowels of ES are closer to CS than to FS (a two-tailed t-test reveals a trend towards significance; t=-1.72, p=0.062).

Discussion

Our results show that at least this variety of ES (Rickul) has only three distinct close vowels: /iː/, /yː/ and /uː/. There is an almost complete overlap of the target vowels [yː] and [ʉː] in ES. The plotted F1/F2 vowel space of close ES vowels bears a striking resemblance to that of Estonian, which also distinguishes between the same three close vowels (cf. Eek and Meister, 1998).

As pointed out above, earlier descriptions of ES refer to the varying quality of /yː/ in different dialects (cf. E. Lagman 1979: 53). Auditory analysis of the vowel sound in the word nyp reveals that the vowel is actually realised as a diphthong [iʉː] by all our ES speakers, but as we only measured the quality of the second part of the diphthong (at only one point in the vowel), our measurements do not reflect diphthongisation. It is also possible that if a different test word had been chosen the quality of the /yː/ would have been different. Similarly, the present analysis does not capture the diphthongisation that is common in CS long close vowels.

As shown by earlier studies (e.g. Fant et al., 1969), the close front vowel space in CS is crowded on the F1/F2 dimension, and there is no clear separation of /iː/ and /yː/. In our data, there also occurs an overlap of [iː] and [yː] with [ʉː] for female CS speakers. All three vowels are, however, separated nicely by the F3 dimension. It is perhaps worth noting that the mean F2 for /iː/ is somewhat lower for CS female speakers than male speakers. This difference is probably due to one of the female speakers, who realised her /iː/ as the so-called Viby /iː/, which is pronounced as [ɨː]. Our results confirm that the FS /ʉː/ is a close central vowel that is acoustically closer to [uː] than to [yː] (cf. Kuronen, 2000: 136), and significantly different from the realisations of the target vowel /ʉː/ in the other two varieties under question.

The comparison of ES with CS and FS by means of the Euclidean distance allowed us to assess the proximity of ES vowels to the other two varieties. Interestingly, the results of the comparison point to less distance between ES and CS than between ES and FS. This is contrary to our initial hypothesis based on E. Lagman’s (1979: 5) observation that the main dialectal features of ES resemble most FS. However, this does not necessarily mean that language contact between CS and ES must account for these similarities. Given that the ES vowels also resemble Estonian vowels, a detailed acoustic comparison with Estonian vowels would yield a more coherent picture on this issue.

Conclusions

This paper has studied the acoustic characteristics of long close vowels in Estonian Swedish (ES) as compared to Finland Swedish (Borgå) and Central Swedish (Kårsta). The data for the analysis was extracted from the elicited word list used for the SweDia 2000 database. The same materials were used for recording the Rickul variety of ES. The analysis showed that the inventory of long close vowels in ES includes three vowels. Comparison of the vowels in the three varieties in terms of the Euclidean distance revealed that the long close vowels in ES are more similar to those of CS than FS.

58 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Much work remains to be done in order to reach a comprehensive phonetic analysis of ES vowels. More speakers need to be recorded from different varieties of ES to examine in closer detail the dialectal variation within ES. In the following work on ES vowels, we are planning to carry out dynamic formant analysis in order to capture possible diphthongisation as well as speaker variation.

Acknowledgements

We would like to thank Joost van de Weijer for help with statistical analysis, Gösta Bruce for advice and discussion of various aspects of Swedish vowels, and Francis Nolan for his comments on a draft of this paper. We are also very grateful to our Estonian Swedish subjects who willingly put up with our recording sessions. We owe a debt to Ingegerd Lindström and Göte Brunberg at Svenska Odlingens Vänner – the Estonian Swedes' cultural organisation in Stockholm – for hosting the Estonian Swedish recording sessions. The work on this paper was supported by the Estonian Science Foundation grant no. 7904, and a scholarship from the Royal Swedish Academy of Letters, History and Antiquities.

References

Bleckert L. (1986) Ett komplement till den europeiska språkatlasen (ALE): Det estlandssvenska materialet till första volymserien (ALE I 1). Swedish Dialects and Folk Traditions 1985, Vol. 108. Uppsala: The Institute of Dialect and Folklore Research.
Bruce G., Elert C.-C., Engstrand O., and Eriksson A. (1999) Phonetics and phonology of the Swedish dialects – a project presentation and a database demonstrator. Proceedings of ICPhS 99 (San Francisco), 321–324.
Danell G. (1905–34) Nuckömålet I–III. Stockholm.
Eek A., Meister E. (1998) Quality of Standard Estonian vowels in stressed and unstressed syllables of the feet in three distinctive quantity degrees. Linguistica Uralica 34, 3, 226–233.
Elert C.-C. (2000) Allmän och svensk fonetik. Stockholm: Norstedts.
Engstrand O. (2004) Fonetikens grunder. Lund: Studentlitteratur.
Fant G., Henningson G., Stålhammar U. (1969) Formant Frequencies of Swedish Vowels. STL-QPSR 4, 26–31.
Haugen E. (1976) The Scandinavian languages: an introduction to their history. London: Faber & Faber.
Kuronen M. (2000) Vokaluttalets akustik i sverigesvenska, finlandssvenska och finska. Studia Philologica Jyväskyläensia 49. Jyväskylä: University of Jyväskylä.
Kuronen M. (2001) Acoustic character of vowel pronunciation in Sweden-Swedish and Finland-Swedish. Working Papers, Dept. of Linguistics, Lund University 49, 94–97.
Ladefoged P., Maddieson I. (1996) The Sounds of the World's Languages. Oxford: Blackwell.
Lagman E. (1979) En bok om Estlands svenskar. Estlandssvenskarnas språkförhållanden. 3A. Stockholm: Kulturföreningen Svenska Odlingens Vänner.
Lagman H. (1971) Svensk-estnisk språkkontakt. Studier över estniskans inflytande på de estlandssvenska dialekterna. Stockholm.
Malmberg B. (1966) Nyare fonetiska rön och andra uppsatser i allmän och svensk fonetik. Lund: Gleerups.
Niemi S. (1981) Sverigesvenskan, finlandssvenskan och finskan som kvalitets- och kvantitetsspråk. Akustiska iakttagelser. Folkmålsstudier 27, Meddelanden från Föreningen för nordisk filologi. Åbo: Åbo Akademi, 61–72.
Riad T. (1997) Svensk fonologikompendium. University of Stockholm.
Tiberg N. (1962) Estlandssvenska språkdrag. Lund: Carl Bloms Boktryckeri A.-B.


Fenno-Swedish VOT: Influence from Finnish?
Catherine Ringen¹ and Kari Suomi²
¹Department of Linguistics, University of Iowa, Iowa City, Iowa, USA
²Phonetics, Faculty of Humanities, University of Oulu, Oulu, Finland

Abstract

This paper presents results of an investigation of VOT in the speech of twelve speakers of Fenno-Swedish. The data show that in utterance-initial position, the two-way contrast is usually realised as a contrast between prevoiced and unaspirated stops. Medially and finally, the contrast is that of a fully voiced stop and a voiceless unaspirated stop. However, a large amount of variation was observed for some speakers in the production of /b d g/, with many tokens being completely voiceless and overlapping phonetically with tokens of /p t k/. Such tokens, and the lack of aspiration in /p t k/, set Fenno-Swedish apart from the varieties spoken in Sweden. In Finnish, /b d g/ are marginal and do not occur in many varieties, and /p t k/ are voiceless unaspirated. We suggest that Fenno-Swedish VOT has been influenced by Finnish.

Method

Twelve native speakers of Fenno-Swedish (6 females, 6 males) were recorded in Turku, Finland. The ages of the male speakers varied between 22 and 32 years, those of the female speakers between 24 and 48 years. Fenno-Swedish was the first language and the language of education of the subjects, as well as of both of their parents. The speakers came from all three main areas in which Swedish is spoken in Finland: Uusimaa/Nyland (the southern coast of Finland), Turunmaa/Åboland (the south-western coast) and Pohjanmaa/Österbotten (the western coast). The speakers are all fluent in Finnish.

There were 68 target words containing one or more stops. A list was prepared in which the target words occurred twice, with six filler words added to the beginning of the list. The recordings took place in an anechoic chamber at the Centre for Cognitive Neuroscience of Turku University. The words were presented to the subjects on a computer screen. The subjects received each new word by clicking the mouse and were instructed to click only when they had finished uttering a target word. The subjects were instructed to speak naturally, and their productions were recorded directly to a hard disk (22.5 kHz, 16 bit) using high quality equipment. Measurements were made using broad-band spectrograms and oscillograms.

Results

The full results, including the statistical tests, are reported in Ringen and Suomi (submitted). Here we concentrate on those aspects of the results that suggest an influence on Fenno-Swedish from Finnish.

The set /p t k/

The stops /p t k/ were always voiceless unaspirated. For the utterance-initial /p t k/, the mean VOTs were 20 ms, 24 ms and 41 ms, respectively. These means are considerably less than those reported by Helgason and Ringen (2008) (49 ms, 65 ms and 78 ms, respectively) for Central Standard Swedish (CS Swedish), for a set of target words that were identical to those in our study, with only a few exceptions. On the other hand, the mean VOTs of Finnish word-initial /p/, /t/ and /k/ reported by Suomi (1980) were 9 ms, 11 ms and 20 ms, respectively (10 male speakers). These means are smaller than those of our Fenno-Swedish speakers, but the difference may be due to the fact that while the initial stops in our study were utterance-initial, the Finnish target words were embedded in a constant frame sentence.

In medial intervocalic position, the mean VOT was 10 ms for /p/, 18 ms for /t/ and 25 ms for /k/. In Helgason and Ringen (2008), the corresponding means for CS Swedish were 14 ms, 23 ms and 31 ms, and in Suomi (1980) 11 ms, 16 ms and 25 ms for Finnish. The differences are small, and there is the difference in the elicitation methods, but it can nevertheless be noted that the Fenno-Swedish and Finnish figures are very close to each other, and that the CS Swedish VOTs are at least numerically longer than the Fenno-Swedish ones. At any rate, these comparisons do not suggest that Fenno-Swedish and Finnish are different with respect to the VOT of medial /p t k/. (Our Fenno-Swedish speakers produced medial intervocalic /p t k/ in quantitatively two ways: either as short or long, e.g. baka was pronounced as either [baaka] (eight speakers) or [baakka] (four speakers). However, this alternation had no effect on VOT.) Word-final /p t k/ were fully voiceless. To the ears of the second author, the Fenno-Swedish /p t k/ sound very much like Finnish /p t k/.

The set /b d g/

In the realization of the utterance-initial /b d g/, there was much variation among the speakers. For eight of the twelve speakers, 95.2% of the utterance-initial lenis tokens were prevoiced, whereas for the remaining four speakers, only 70.6% of the tokens were prevoiced. Half of the speakers in both subgroups were female and half were male. For the group of eight speakers, the mean VOT was -85 ms, and the proportion of tokens with non-negative VOT was 4.8%. The results for this group are similar to those of Helgason and Ringen (2008) for CS Swedish, who observed a grand mean VOT of -88 ms and report that, for all six speakers pooled, 93% of the initial /b d g/ had more than 10 ms prevoicing. For the CS Swedish speaker with the shortest mean prevoicing, 31 of her 38 initial /b d g/ tokens had more than 10 ms of prevoicing. For our group of eight Fenno-Swedish speakers and for all of the six CS Swedish speakers in Helgason and Ringen, then, extensive prevoicing was the norm, with only occasional non-prevoiced renditions.

For the group of four Fenno-Swedish speakers, the mean VOT was only -40 ms, and the proportion of tokens with non-negative VOT was 29.4%. For an extreme speaker in this respect, the mean VOT was -28 ms and 38% of the /b d g/ tokens had non-negative VOT. In fact, at least as far as VOT is concerned, many /b d g/ tokens overlapped phonetically with tokens of /p t k/. A linear discriminant analysis was run on all initial stops produced by the group of eight speakers and on all initial stops produced by the group of four speakers to determine how well the analysis can classify the stop tokens as instances of /p t k/ or /b d g/ on the basis of VOT. For the group of eight speakers, 97.3% of the tokens were correctly classified: the formally /p t k/ stops were all correctly classified as /p t k/, and 4.8% of the formally /b d g/ stops were incorrectly classified as /p t k/. For the group of four speakers, 82.9% of the tokens were correctly classified: 1.0% of the formally /p t k/ stops were incorrectly classified as /b d g/, and 29.4% of the formally /b d g/ stops were incorrectly classified as /p t k/. For these four speakers, then, /b d g/ often had positive VOT values in the small positive lag region also frequently observed in /p t k/.

Medial /b d g/ were extensively or fully voiced. In 9.4% of the tokens, the voiced proportion of the occlusion was less than 50%, in 10.4% of the tokens the voiced proportion was 50% or more but less than 75%, in 10.4% of the tokens the voiced proportion was 75% or more but less than 100%, and in 70.0% of the tokens the voiced proportion was 100%. Thus, while the medial /b d g/ were on the whole very homogeneous and mostly fully voiced, 3.6% of them were fully voiceless. These were nearly all produced by three of the four speakers who also produced many utterance-initial /b d g/ tokens with non-negative VOT values. The fully voiceless tokens were short and long /d/'s and short /g/'s; there was no instance of a voiceless /b/. A discriminant analysis was run on all medial stops, with closure duration, speaker sex, quantity, place of articulation, duration of voicing during occlusion and positive VOT as the independent variables. 96.4% of the tokens were correctly classified (98.7% of /p t k/ and 94.5% of /b d g/). The order of magnitude of the independent variables as separators of the two categories was duration of voicing during occlusion > positive VOT > closure duration > place > quantity; sex had no separating power.

Great variation was observed in both short and long final /b d g/, and therefore the speakers were divided into two subgroups, both consisting of six speakers. This was still somewhat Procrustean, but less so than making no division would have been. For group A the mean voiced proportion of the occlusion was 89% (s.d. = 18%), for group B it was 54% (s.d. = 31%). As the standard deviations suggest, there was less inter-speaker and intra-speaker variation in the A group than in the B group. Among the group A speakers, the mean voiced proportion of occlusion across the places of articulation ranged from 73% to 98% in the short /b d g/ and from 63% to 99% in the long /b d g/; among the group B speakers the corresponding ranges were 36% to 62% and 46% to 64%. An extreme example of intra-speaker variation is a male speaker in group B for whom four of the 24 final /b d g/ tokens were completely voiceless and nine were completely voiced. Discriminant analyses were again run on all final stops, separately for the two groups. For group B, with voicing duration during the occlusion, occlusion duration and quantity as independent variables, 96.5% of the tokens were correctly classified (99.4% of /p t k/ and 93.1% of /b d g/). For group A, with the same independent variables, all stops were correctly classified (100.0%). For both groups A and B, the order of strength of the independent variables as separators of the two categories was: voicing duration > closure duration > quantity.

Stop clusters

Four cluster types were investigated: (1) /kt/, /pt/ (as in läkt, köpt), (2) /kd/, /pd/ (as in väckte, köpte which, on a generative analysis, are derived from vä/k+d/e and kö/p+d/e), (3) /gt/ (as in vägt, byggt) and (4) /gd/ (as in vägde). Clusters (1)–(2) were always almost completely voiceless, and consequently there is no phonetic evidence that the two types are distinct. The cluster /gd/ was usually nearly fully voiced throughout. But in the realisation of the /gt/ cluster there was again much variation among the speakers. The /t/ was always voiceless, but the /g/ ranged from fully voiceless (in 43% of the tokens) to fully voiced (33%), and only the beginning of /g/ was voiced in the remaining 24% of the tokens. For two speakers all six tokens of /g/ were fully voiceless, for three speakers four tokens were fully voiceless, and for five speakers, on the other hand, four or more tokens were fully voiced. As an example of intra-speaker variation, one speaker produced two fully voiced, two partially voiced and two fully voiceless /g/ tokens. On the whole, the speakers used the whole continuum, but favoured the extreme ends: /g/ was usually either fully voiced or fully voiceless; the intermediate degrees of voicing were less common. This dominant bipartite distribution of tokens along a phonetic continuum is very different from the more or less Gaussian distribution one usually finds in corresponding studies of a single phonological category.

Phonological conclusions

Fenno-Swedish contrasts voiced /b d g/ with voiceless unaspirated /p t k/. On the basis of our acoustic measurements and some knowledge of how these are related to glottal and supraglottal events, we conclude that the contrast in Fenno-Swedish is one of [voice] vs no laryngeal specification. Note that our definition of voiced stops refers, concretely, to the presence of considerable prevoicing in utterance-initial stops, and to extensive voicing during the occlusion in other positions with, at most, a short positive VOT after occlusion offset. What we refer to as voiceless stops, in turn, refers, again concretely, to short positive VOT in utterance-initial stops, and to voiceless occlusion in medial and final stops (allowing for a very short period of voicing at the beginning of the occlusion, if the preceding segment is voiced), with at most a very short positive VOT.

Suomi (1980: 165) concluded for the Finnish voiceless unaspirated /p t k/ that their "degree of voicing [is] completely determined by the supraglottal constrictory articulation". These stops have no glottal abduction or pharyngeal expansion gesture, a circumstance that leads to voicelessness of the occlusion (p. 155ff). Despite the different terminology, this amounts to concluding that the Finnish /p t k/ have no laryngeal specification. Thus in the two studies, thirty years apart, essentially the same conclusions were reached concerning Fenno-Swedish and the Finnish /p t k/.

Contact influence?

Historically, Finnish lacks a laryngeal contrast in the stop system, the basic stops being /p t k/, which are voiceless unaspirated. In the past, all borrowed words were adapted to this pattern, e.g. parkki 'bark' (< Swedish bark), tilli 'dill' (< Sw. dill), katu 'street' (< Sw. gata). Standard Spoken Finnish (SSF) also has a type of /d/ which is usually fully voiced. However, this /d/ is not a plosive proper, but something between a plosive and a flap, and is called a semiplosive by Suomi, Toivanen and Ylitalo (2008). Its place is apical alveolar, and the duration of its occlusion is very short, about half of that of /t/, ceteris paribus (Lehtonen 1970: 71; Suomi 1980: 103). During the occlusion, the location of the apical contact with the alveoli also moves forward when the preceding vowel is a front vowel and the following vowel is a back vowel (Suomi 1998).

What is now /d/ in the native vocabulary was a few centuries ago /ð/ for all speakers. When Finnish was first written down, the mostly Swedish-speaking clerks symbolised /ð/ variably, e.g. with the grapheme sequence ⟨dh⟩. When the texts were read aloud, again usually by educated people whose native tongue was Swedish, ⟨dh⟩ was pronounced as it would be pronounced in Swedish. At the same time, /ð/ was vanishing from the vernacular, and it was either replaced by other consonants, or it simply disappeared. Today, /ð/ has vanished and the former /ð/ is represented by a number of other consonants or by complete loss, and /d/ does not occur. But /d/ does occur in modern SSF as a result of conscious normative attempts to promote "good speaking". The second author, for example, did not have /d/ in his speech in early childhood but learnt it at school. In fully native words, /d/ occurs only word-medially, e.g. sydän 'heart'; in recent loanwords it is also found word-initially, e.g. demokraatti, desimaali, devalvaatio, diktaattori.

Under the influence of foreign languages, nowadays most notably English, /b/ and /g/ are entering Standard Spoken Finnish as separate phonemes in recent loanwords, e.g. baari, bakteeri, baletti, banaani; gaala, galleria, gaselli. But such words are not yet pronounced with [b] and [g] by all speakers, nor in all speaking situations. On the whole, it can be concluded that /d/ and especially /b/ and /g/ must be infrequent utterance-initially in Finnish discourse, especially in informal registers, and consequently prevoicing is seldom heard in Finnish. Instead, utterance-initial stops predominantly have short-lag VOT. Even word-medially, voiced stops, with the exception of the semiplosive /d/, are rather infrequent, because they only occur in recent loanwords and not for all speakers and not in all registers. Word-finally — and thus also utterance-finally — voiced plosives do not occur at all, because loanwords with a voiced final stop in the lending language are borrowed with an epenthetic /i/ in Finnish, e.g. blogi (< Engl. blog).

Our Fenno-Swedish speakers' /p t k/ had short positive VOTs very similar to those observed for Finnish, assuming that the differences between our utterance-initial /p t k/ and the word-initial Finnish /p t k/ reported in Suomi (1980) are due to the difference in position in the utterance. In utterance-initial position, the Fenno-Swedish /p t k/ are unaspirated while the CS Swedish /p t k/ are aspirated. We suggest that the Fenno-Swedish /p t k/ have been influenced by the corresponding Finnish stops. Reuter (1977: 27) states that "the [Fenno-Swedish] voiceless stops p, t and k are wholly or partially unaspirated […]. Despite this, they should preferably be pronounced with a stronger explosion than in Finnish, so that one clearly hears a difference between the voiceless stops and the voiced b, d and g" (translation by KS). As pointed out by Leinonen (2004b), an implication of this normative exhortation is that speakers of Fenno-Swedish often pronounce the voiceless stops in the same way as do speakers of Finnish. Leinonen's own measurements suggest that this is the case.

Many of our Fenno-Swedish speakers exhibited instability in the degree of voicing in /b d g/. We suggest that this, too, is due to influence from Finnish.

The Fenno-Swedish speakers' medial short and long /d/ had considerably shorter closure durations than did their medial /b/ and /g/. In word-final position, this was not the case. The Finnish semiplosive /d/ occurs word-medially, as does the geminate /dd/ in a few recent loanwords (e.g. addikti 'an addict'). But the Finnish semiplosive does not occur word-finally. Thus, both short and long Fenno-Swedish /d/ have a relatively short duration in medial position, exactly where Finnish /d/ and /dd/ occur, but do not exhibit this typologically rare feature in final position, where Finnish could not exert an influence. With respect to voicing, the Fenno-Swedish short medial /d/ behaved very much like Finnish /d/. The mean voiced proportion of the occlusion was 90%, and in Suomi (1980: 103), all tokens of the medial Finnish /d/ were fully voiced. According to Kuronen and Leinonen (2000), /d/ is dentialveolar in CS Swedish, but alveolar in Fenno-Swedish. Finnish /d/ is clearly alveolar and apical (Suomi 1998). Kuronen & Leinonen have confirmed (p.c.) that they mean that Fenno-Swedish /d/ is more exactly apico-alveolar.

From a wider perspective, the suggestion that the Fenno-Swedish /p t k/ have been influenced by the corresponding Finnish stops is not implausible. First, it should be impressionistically apparent to anyone familiar with both Fenno-Swedish and CS Swedish that, on the whole, they sound different, segmentally and prosodically; for empirical support for such an impression, see Kuronen and Leinonen (2000; 2008). Second, it should also be apparent to anyone familiar with both Finnish and Swedish that CS Swedish sounds more different from Finnish than does Fenno-Swedish; in fact, apart from the Fenno-Swedish segments not found in Finnish, Fenno-Swedish sounds very much like Finnish. Third, Leinonen (2004a) argues convincingly that CS Swedish has no influence on Fenno-Swedish pronunciation today. Leinonen compared what are three sibilants in CS Swedish with what are two sibilants and an affricate in Fenno-Swedish. He observed clear differences among the varieties in each of these consonants, and found little support for an influence of the CS Swedish consonants on these consonants in Fenno-Swedish. Thus, to the extent that Fenno-Swedish differs from CS Swedish (or, more generally, any varieties of Swedish spoken in Sweden), a very likely cause of the difference is influence from Finnish. In addition to the influence of the Finnish /p t k/ on the Fenno-Swedish /p t k/, we suggest that any variation towards voiceless productions of /b d g/ is also due to Finnish influence.

Our results for utterance-initial stops in a language in which /b d g/ stops are predominantly prevoiced are not altogether unprecedented. They resemble the results of Caramazza and Yeni-Komshian (1974) on Canadian French and those of van Alphen and Smits (2004) for Dutch. Caramazza and Yeni-Komshian observed substantial overlap between the VOT distributions of /b d g/ and /p t k/: a large proportion (58%) of the /b d g/ tokens were produced without prevoicing, while /p t k/ were all produced without aspiration. The authors argued that the Canadian French VOT values are shifting as a result of the influence of Canadian English. van Alphen and Smits observed that, overall, 25% of the Dutch /b d g/ were produced without prevoicing by their 10 speakers, and, as in the present study, there was variation among the speakers: five of the speakers prevoiced very consistently, with more than 90% of their /b d g/ tokens being prevoiced, while for the other five speakers there was less prevoicing and considerable inter-speaker variation; one speaker produced only 38% of /b d g/ with prevoicing. van Alphen & Smits' list of target words contained words with initial lenis stops before consonants, which ours did not. They found that the amount of prevoicing was lower when the stops were followed by a consonant. If we compare the results for the prevocalic lenis stops in the two studies, the results are almost identical (86% prevoicing for van Alphen & Smits, 87% for our speakers).

The authors are puzzled by the question: given the importance of prevoicing as the most reliable cue to the voicing distinction in Dutch initial plosives, why do speakers not produce prevoicing more reliably? As a possible explanation of this seemingly paradoxical situation, van Alphen and Smits suggest that Dutch is undergoing a sound change that may be caused or strengthened by the large influence from English through the educational system and the media. It may be, however, that van Alphen and Smits' speakers have also been in contact with speakers of dialects of Dutch with no prevoicing (and aspirated stops) or with speakers of German.

There is evidence that speakers are very sensitive to the VOTs they are exposed to. Nielsen (2006, 2007) has shown that speakers of American English produced significantly longer VOTs in /p/ after they were asked to imitate speech with artificially lengthened VOTs in /p/, and the increased aspiration was generalised to new instances of /p/ (in new words) and to the new segment /k/. There is also evidence that native speakers of a language shift VOTs in their native language as a result of VOTs in the language spoken around them (Caramazza and Yeni-Komshian, 1974; van Alphen and Smits, 2004). Sancier and Fowler (1997) show that the positive VOTs in the speech of a native Brazilian Portuguese speaker were longer after an extended stay in the United States and shorter again after an extended stay in Brazil. The authors conclude that the English long-lag /p t k/ influenced the amount of positive VOT in the speaker's native Brazilian Portuguese.

All of our Fenno-Swedish speakers, like the majority of Fenno-Swedish speakers, are fluent in Finnish (as was observed before and after the recordings). Fenno-Swedish is a minority language in Finland, and hence for most speakers it is very difficult not to hear and speak Finnish. Consequently, most speakers of Fenno-Swedish are in contact, on a daily basis, with a language in which there is no aspiration and in which prevoicing is not often heard. On the basis of the information available to us on our speakers' place of birth, age and sex, it is not possible to detect any systematic pattern in the variation in the degree of voicing in /b d g/ as a function of these variables.

In brief, the situation in Fenno-Swedish may be parallel, mutatis mutandis, to that observed in Canadian French and Dutch. Assuming that prevoicing in Fenno-Swedish /b d g/ has been more systematic in the past than it is among our speakers (which can hardly be verified experimentally), influence from Finnish is an explanation for the variability of prevoicing in Fenno-Swedish that cannot be ruled out easily. Without such influence it is difficult to see why speakers would choose to collapse phonemic categories in their native language.²

Acknowledgements

We are grateful to Mikko Kuronen and Kari Leinonen for their very useful and informative comments on earlier versions of this paper. We also want to thank Viveca Rabb and Urpo Nikanne for assistance in recruiting subjects, Maria Ek for help with many aspects of this project, Teemu Laine, Riikka Ylitalo, and Juhani Järvikivi for technical assistance, Heikki Hämäläinen for use of the lab, Pétur Helgason for valuable discussions and assistance with our word list and, finally, our subjects. The research of C. Ringen was supported, in part, by a Global Scholar Award and a Stanley International Programs/Obermann Center Research Fellowship (2007) from the University of Iowa, and by NSF Grant BCS-0742338.

Notes

1. We use the broad notations /b d g/ and /p t k/ to refer to the phonetically more voiced and to the phonetically less voiced stops of a language, respectively, without committing ourselves to any claims about cross-linguistic similarities. E.g., when we talk about /p t k/ in Fenno-Swedish and about /p t k/ in English, we do not claim that the consonants are alike in the two languages. Similarly, the notation /t/ as applied to Finnish overlooks the fact that the Finnish stop is laminodentialveolar.
2. The fact that there was phonetic overlapping between tokens of /b d g/ and /p t k/ with respect to parameters related to voicing does not exclude the possibility that the two sets were still distinguished by some properties not measured in this study. Nevertheless, overlapping in parameters directly related to the laryngeal contrast is very likely to reduce the salience of the phonetic difference between the two sets of stops.

References

Caramazza, A. and Yeni-Komshian, G. (1974) Voice onset time in two French dialects. Journal of Phonetics 2, 239-245.
Helgason, P. and Ringen, C. (2008) Voicing and aspiration in Swedish stops. Journal of Phonetics 36, 607-628.
Kuronen, M. and Leinonen, K. (2000) Fonetiska skillnader mellan finlandssvenska och rikssvenska. Svenskans beskrivning 24. Linköping Electronic Conference Proceedings. URL: http://www.ep.liu.se/ecp/006/011/.
Kuronen, M. and Leinonen, K. (2008) Prosodiska särdrag i finlandssvenska. In Nordman M., Björklund S., Laurén Ch., Mård-Miettinen K. and Pilke N. (eds) Svenskans beskrivning 29. Skrifter utgivna av Svensk-Österbottniska samfundet 70, 161-169. Vasa.
Lehtonen, J. (1970) Aspects of Quantity in Standard Finnish. Studia Philologica Jyväskyläensia VI, Jyväskylä.
Leinonen, K. (2004a) Finlandssvenskt sje-, tje- och s-ljud i kontrastiv belysning. Jyväskylä Studies in Humanities 17, Jyväskylä University. Ph.D. Dissertation.
Leinonen, K. (2004b) Om klusilerna i finlandssvenskan. In Melander, B., Melander Marttala, U., Nyström, C., Thelander M. and Östman C. (eds) Svenskans beskrivning 26, 179-188. Uppsala: Hallgren & Fallgren.
Nielsen, K. (2006) Specificity and generalizability of spontaneous phonetic imitation. In Proceedings of the ninth international conference on spoken language processing (Interspeech) (paper 1326), Pittsburgh, USA.
Nielsen, K. (2007) Implicit phonetic imitation is constrained by phonemic contrast. In Trouvain, J. and Barry, W. (eds) Proceedings of the 16th International Congress of Phonetic Sciences. Universität des Saarlandes, Saarbrücken, Germany, 1961–1964.
Reuter, M. (1977) Finlandssvenskt uttal. In Petterson B. and Reuter M. (eds) Språkbruk och språkvård, 19-45. Schildts.
Ringen, C. and Suomi, K. (submitted) Voicing in Fenno-Swedish stops.
Sancier, M. and Fowler, C. (1997) Gestural drift in a bilingual speaker of Brazilian Portuguese and English. Journal of Phonetics 25, 421-436.
Suomi, K. (1980) Voicing in English and Finnish stops. Publications of the Department of Finnish and General Linguistics of the University of Turku 10. Ph.D. Dissertation.
Suomi, K. (1998) Electropalatographic investigations of three Finnish coronal consonants. Linguistica Uralica XXXIV, 252-257.
Suomi, K., Toivanen, J. and Ylitalo, R. (2008) Finnish sound structure. Studia Humaniora Ouluensia 9. URL: http://herkules.oulu.fi/isbn9789514289842.
van Alphen, P. M. and Smits, R. (2004) Acoustical and perceptual analysis of the voicing distinction in Dutch initial plosives: The role of prevoicing. Journal of Phonetics 32, 455-491.
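The discriminant analyses reported above can be illustrated with a minimal sketch. This is not the authors' analysis: it uses synthetic data with only two invented predictors (duration of voicing during the occlusion and positive VOT, in ms) and a hand-rolled Fisher linear discriminant, rather than the full predictor set and statistics package used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic tokens: /b d g/ with long occlusion voicing and near-zero VOT,
# /p t k/ with little occlusion voicing and a short positive VOT (all in ms).
# The distributions are invented, not the paper's measurements.
n = 200
bdg = np.column_stack([rng.normal(80, 15, n), rng.normal(5, 3, n)])
ptk = np.column_stack([rng.normal(10, 8, n), rng.normal(25, 8, n)])

# Fisher's linear discriminant: w = Sw^-1 (m1 - m2), with the decision
# threshold midway between the projected class means.
m1, m2 = bdg.mean(axis=0), ptk.mean(axis=0)
sw = np.cov(bdg, rowvar=False) + np.cov(ptk, rowvar=False)
w = np.linalg.solve(sw, m1 - m2)
threshold = ((bdg @ w).mean() + (ptk @ w).mean()) / 2

correct = np.sum(bdg @ w > threshold) + np.sum(ptk @ w <= threshold)
accuracy = correct / (2 * n)
print(f"correctly classified: {accuracy:.1%}")
```

With well-separated voicing durations the training accuracy comes out high, mirroring the paper's finding that voicing duration during the occlusion was the strongest separator of the two stop sets.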

65 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Grammaticalization of prosody in the brain
Mikael Roll and Merle Horne
Department of Linguistics and Phonetics, Lund University, Lund

Abstract

Based on the results from three Event-Related Potential (ERP) studies, we show how the degree of grammaticalization of prosodic features influences their impact on syntactic and morphological processing. Thus, results indicate that only lexicalized word accents influence morphological processing. Furthermore, it is shown how an assumed semi-grammaticalized left-edge boundary tone activates main clause structure without, however, inhibiting subordinate clause structure in the presence of competing syntactic cues.

Introduction

In the rapid online processing of speech, prosodic cues can in many cases be decisive for the syntactic interpretation of utterances. According to constraint-based processing models, the brain activates different possible syntactic structures in parallel, and relevant syntactic, semantic, and prosodic features work as constraints that increase or decrease the activation of particular structures (Gennari and MacDonald, 2008). How important a prosodic cue is for the activation of a particular syntactic structure depends to a large extent on the frequency of their co-occurrence. Another factor we assume to play an important role is how ‘grammaticalized’ the association is between the prosodic feature and the syntactic structure, i.e. to what degree it has been incorporated into the language norm.

Sounds that arise as side effects of the articulatory constraints on speech production may gradually become part of the language norm (Ohala, 1993). In the same vein, speakers seem to universally exploit the tacit knowledge of the biological conditions on speech production in order to express different pragmatic meanings (Gussenhoven, 2002). For instance, due to conditions on the exhalation phase of the breathing process, the beginning of utterances is normally associated with more energy and higher fundamental frequency than the end. Mimicking this tendency, the ‘Production Code’ might show the boundaries of utterances by associating the beginning with high pitch and the end with low pitch, although it might not be physically necessary. According to Gussenhoven, the Production Code has been grammaticalized in many languages in the use of a right edge H% to show non-finality in an utterance, as well as a left-edge %H, to indicate topic-refreshment.

In the present study, we will first examine the processing of a Swedish left-edge H tone that would appear to be on its way to becoming incorporated into the grammar. The H will be shown to activate main clause structure in the online processing of speech, but without inhibiting subordinate clause structure when co-occurring with the subordinating conjunction att ‘that’ and subordinate clause word order. The processing dissociation will be related to the low impact the tone has on normative judgments in competition with the conjunction att and subordinate clause word order constraints. This shows that the tone has a relatively low degree of grammaticalization, probably related to the fact that it is confined to the spoken modality, lacking any counterpart in written language (such as commas, which correlate with right-edge boundaries). We will further illustrate the influence of lexicalized and non-lexicalized tones associated with Swedish word accents on morphological processing.

The effects of prosody on syntactic and morphological processing were monitored online in three experiments using electroencephalography (EEG) and the Event-Related Potentials (ERP) method. EEG measures changes in the electric potential at a number of electrodes (here 64) over the scalp. The potential changes are due to electrochemical processes involved in the transmission of information between neurons. The ERP method time locks this brain activity to the presentation of stimuli, e.g. words or morphemes. In order to obtain regular patterns corresponding to the processing of specific stimuli rather than to random brain activity, ERPs from at least forty trials per condition are averaged and statistically analyzed for twenty or more participants. In the averaged ERP-waveform, recurrent responses to stimuli in the form of positive (plotted downwards) or negative potential peaks, referred to as ‘components’, emerge.

In this contribution, we will review results related to the ‘P600’ component, a positive


peak around 600 ms following stimuli that trigger reprocessing due to garden path effects or syntactic errors (Osterhout and Holcomb, 1992). The P600 often gives rise to a longer sustained positivity from around 500 to 1000 ms or more.

Figure 1. Waveform and F0 contour of an embedded main clause sentence with (H, solid line) or without (∅, dotted line) a left-edge boundary tone.

A semi-grammaticalized tone

In Central Swedish, a H tone is phonetically associated with the last syllable of the first prosodic word of utterances (Horne, 1994; Horne et al., 2001). Roll (2006) found that the H appears in embedded main clauses but not in subordinate clauses. It thus seems that this ‘left-edge boundary tone’ functions as a signal that a main clause is about to begin.

Swedish subordinate clauses are distinguished from main clauses by their word order. Whereas main clauses have the word order S–V–SAdv (Subject–Verb–Sentence Adverb), as in Afghanerna intog inte Persien ‘(literally) The Afghans conquered not Persia’, where the sentence adverb inte ‘not’ follows the verb intog ‘conquered’, in subordinate clauses, the sentence adverb instead precedes the verb (S–SAdv–V), as in …att afghanerna inte intog Persien ‘(lit.) …that the Afghans not conquered Persia’. In spoken Swedish, main clauses with postverbal sentence adverbs are often embedded instead of subordinate clauses in order to express embedded assertions, although many speakers consider it normatively unacceptable. For instance, the sentence Jag sa att [afghanerna intog inte Persien] ‘(lit.) I said that the Afghans conquered not Persia’ would be interpreted as an assertion that what is expressed by the embedded main clause within brackets is true.

Roll et al. (2009a) took advantage of the word order difference between main and subordinate clauses in order to study the effects of the left-edge boundary tone on the processing of clause structure. Participants listened to sentences similar to the one in Figure 1, but with the sentence adverb ju ‘of course’ instead of inte ‘not’, and judged whether the word order was correct. The difference between the test conditions was the presence or absence of a H left-edge boundary tone in the first prosodic word of the embedded clause, as seen in the last syllable of the subject afghanerna ‘the Afghans’ in Figure 1. Roll et al. hypothesized that a H left-edge boundary tone would increase the activation of main clause structure, and thus make the S–V–SAdv word order relatively more expected than in the corresponding clause without a H associated with the first word. When there was no H tone in the embedded clause, the sentence adverb yielded a biphasic positivity in the ERPs, interpreted as a P345-P600 sequence (Figure 2). In easily resolved garden path sentences, the P600 has been regularly observed to be preceded by a positive peak between 300 and 400 ms (P345). The biphasic sequence has been interpreted as the discovery and reprocessing of unexpected structures that are relatively easy to reprocess (Friederici et al. 2001).
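The trial-averaging logic behind the ERP method described in the Introduction (time-locking brain activity to stimuli and averaging many trials so that random activity cancels out) can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' recording pipeline; the trial counts, array shapes, and component latency are invented for the example:

```python
import numpy as np

# Synthetic example: 40 trials x 64 electrodes x 600 samples,
# each trial time-locked to stimulus onset at sample 0.
rng = np.random.default_rng(0)
n_trials, n_electrodes, n_samples = 40, 64, 600

# Random background EEG plus a small positive deflection
# (a "component") around sample 300 on every trial.
eeg = rng.normal(0.0, 10.0, (n_trials, n_electrodes, n_samples))
component = np.exp(-((np.arange(n_samples) - 300) ** 2) / (2 * 30.0 ** 2))
eeg += 2.0 * component  # broadcast over trials and electrodes

# Averaging over trials cancels the random activity and leaves the
# time-locked component visible in the ERP waveform.
erp = eeg.mean(axis=0)  # shape: (64, 600)
peak_sample = int(erp.mean(axis=0).argmax())
print(erp.shape, peak_sample)
```

Averaging over 40 trials reduces the noise standard deviation by a factor of about 6.3 (the square root of 40), which is why the component emerges from activity that is much larger than it on any single trial.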


The effect found in Roll et al. (2009a) hence indicates that in the absence of a left-edge boundary tone, a sentence adverb showing main clause word order (S–V–SAdv) is syntactically relatively unexpected, and thus triggers reprocessing of the syntactic structure. The word order judgments confirmed the effect. Embedded main clauses associated with a left-edge boundary tone were rated as acceptable in 68% of the cases, whereas those lacking a tone were accepted only 52% of the time. Generally speaking, the acceptability rate was surprisingly high, considering that embedded main clauses are limited mostly to spoken language and are considered inappropriate by many speakers.

Figure 2. ERPs at the sentence adverb ju ‘of course’ in sentences with embedded main clauses of the type Besökaren menar alltså att familjen känner ju… ‘(lit.) The visitor thinks thus that the family feels of course…’ In the absence of a preceding left-edge boundary tone (∅, grey line), there was increased positivity at 300–400 (P345) and 600–800 ms (P600).

In other words, the tone was used in processing to activate main clause word order and, all other things being equal, it influenced the normative judgment related to correct word order. However, if the tone were fully grammaticalized as a main clause marker, it would be expected not only to activate main clause structure, but also to inhibit subordinate clause structure. That is to say, one would expect listeners to reanalyze a subordinate clause as an embedded main clause after hearing an initial H tone.

In a subsequent study (Roll et al., 2009b), the effects of the left-edge boundary tone were tested on both embedded main clauses and subordinate clauses. The embedded main clauses were of the kind presented in Figure 1. Corresponding sentences with subordinate clauses were recorded, e.g. Stofilerna anser alltså att afghanerna inte intog Persien… ‘(lit.) The old fogies think thus that the Afghans not conquered Persia…’ Conditions with embedded main clauses lacking left-edge boundary tones and subordinate clauses with an initial H tone were obtained by cross-splicing the conditions in the occlusion phase of [t] in att ‘that’ and intog ‘conquered’ or inte ‘not.’

For embedded main clauses, the ERP-results were similar to those of Roll et al. (2009a), but the effect was even clearer: A rather strong P600 effect was found between 400 and 700 ms following the onset of the sentence adverb inte ‘not’ for embedded main clauses lacking a left-edge boundary tone (Figure 3).

Figure 3. ERPs at the disambiguation point for the verb in–tog ‘conquered’ (embedded main clauses, EMC, solid lines) or the sentence adverb in–te ‘not’ (subordinate clauses, SC, dotted line) in sentences like Stofilerna anser alltså att afghanerna in–tog/te… ‘(lit.) The old fogies thus think that the Afghans con/no–quered/t…’ with (H, black line) or without (∅, grey line) a left-edge boundary tone. Embedded main clauses showed a P600 effect at 400–700 ms following the sentence adverb, which was reduced in the presence of a left-edge tone.

Thus, it was confirmed that the left-edge boundary tone increases the activation of main clause structure, and therefore reduces the syntactic processing load if a following sentence adverb indicates main clause word order. As mentioned above, however, if the tone were fully grammaticalized, the reverse effect would also be expected for subordinate clauses: A left-edge boundary tone should inhibit the expectation of subordinate clause structure. However, the tone did not have any effect at all on the processing of the sentence adverb in subordinate clauses. The left-edge boundary tone thus activates main clause structure, albeit without inhibiting subordinate clause structure.
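The cross-splicing manipulation described above can be illustrated schematically as an array operation: material up to a splice point inside the silent stop occlusion is taken from one recording, and the remainder from the other. The sample rate, durations, and constant "signal" values below are invented placeholders, not the actual stimuli:

```python
import numpy as np

# Two hypothetical recordings of the same carrier sentence, one with
# and one without the left-edge boundary tone, as sample arrays.
# Constant values stand in for real waveform content.
fs = 16000
with_tone = np.concatenate([np.full(8000, 1.0), np.full(24000, 2.0)])
without_tone = np.concatenate([np.full(8000, 3.0), np.full(24000, 4.0)])

# Splice point: a sample index inside the stop occlusion of "att",
# where the signal is near-silent, so the cut is inaudible.
splice = 8000

# Cross-spliced condition: the onset (including the H tone) comes from
# one recording, the continuation from the other.
cross = np.concatenate([with_tone[:splice], without_tone[splice:]])
print(len(cross))
```

Splicing inside a stop occlusion is the standard trick for this kind of stimulus construction: because the occlusion is (near-)silent, the two halves can be joined without an audible discontinuity.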


Interestingly, in this experiment, involving both main and subordinate embedded clauses, the presence of a left-edge boundary tone did not influence acceptability judgments. Rather, the speakers made their grammaticality decisions based only on word order, where subordinate clause word order was accepted in 89% and embedded main clause word order in around 40% of the cases. Hence, the left-edge boundary tone would appear to be a less grammaticalized marker of clause type than word order is. In the next section, we will review the processing effects of a prosodic feature that is, in contrast to the initial H, strongly grammaticalized, namely Swedish word accent 2.

A lexicalized tone

In Swedish, every word has a lexically specified word accent. Accent 2 words have a H* tone associated with the stressed syllable, distinguishing them from Accent 1 words, in which a L* is instead associated with the stressed syllable (Figure 4). Accent 2 is historically a lexicalization of the postlexical word accent assigned to bi-stressed words (Riad, 1998). Following Rischel (1963), Riad (in press) assumes that it is the suffixes that lexically specify whether the stressed stem vowel should be associated with Accent 2. In the absence of an Accent 2-specification, Accent 1 is assigned postlexically by default. A stem such as lek– ‘game’ is thus unspecified for word accent. If it is combined with the Accent 2-specified indefinite plural suffix –ar, the stressed stem syllable is associated with a H*, resulting in the Accent 2-word lekar ‘games’ shown in Figure 4. If it is instead combined with the definite singular suffix –en, which is assumed to be unspecified for word accent, the stem is associated with a L* by a default postlexical rule, producing the Accent 1 word leken ‘the game’, with the intonation contour shown by the dotted line in Figure 4.

Neurocognitively, a lexical specification for Accent 2 would imply a neural association between the representations of the Accent 2 tone (H*) and the grammatical suffix. ERP-studies on morphology have shown that a lexical specification that is not satisfied by the combination of a stem and an affix results in an ungrammatical word that needs to be reprocessed before interpreting in the syntactic context (Lück et al., 2006). Therefore, affixes with lexical specifications left unsatisfied give rise to P600 effects.
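The suffix-driven accent assignment described above can be summarised as a small decision rule. This toy sketch only encodes the analysis as reported (suffixes either carry a lexical Accent 2 specification or are unspecified, with Accent 1 assigned as the postlexical default); the lexicon covers just the examples in the text:

```python
# Toy model of the analysis described above: suffixes either carry a
# lexical Accent 2 (H*) specification or are unspecified; unspecified
# combinations receive Accent 1 (L*) by postlexical default.
ACCENT2_SUFFIXES = {"ar"}      # e.g. indefinite plural -ar
UNSPECIFIED_SUFFIXES = {"en"}  # e.g. definite singular -en

def word_accent(stem: str, suffix: str) -> str:
    """Return the predicted word accent for a stem + suffix combination."""
    if suffix in ACCENT2_SUFFIXES:
        return "Accent 2 (H*)"
    return "Accent 1 (L*)"  # postlexical default

print(word_accent("lek", "ar"))  # lekar 'games'
print(word_accent("lek", "en"))  # leken 'the game'
```

The point of the rule's asymmetry is that only one accent is ever lexically specified: nothing needs to be stored for Accent 1 words, which simply fall through to the default.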

Figure 4. Waveform and F0-contour of a sentence containing the Accent 2 word lekar ‘games’ associated with a H* (solid line). The F0 contour for the Accent 1 word leken ‘the game’ is shown by the dotted line (L*).


A H*-specified suffix such as –ar in lekar ‘games’ would hence be expected to produce an ungrammatical word if combined with a stem associated with a clashing L*. The word would have to be reprocessed, which would be reflected in a P600 effect. No such effect would be expected for suffixes that usually co-occur with Accent 1, such as –en in leken ‘the game’, since they are assumed to be unspecified for word accent.

Roll et al. (2009c) found the expected dissociation when they compared 160 sentences containing words with either the H*-specified suffix –ar or the unspecified suffix –en, and stems phonetically associated with a H* or a L*, obtained by cross-splicing. The effects were compared with another 160 sentences containing words involving declension errors, such as lekor or leket, which have 1st and 5th declension instead of 2nd declension suffixes, and therefore yield a clash between the lexical specification of the suffix and the stem. The results were similar for declension mismatching words and words with a H*-specifying suffix inaccurately assigned an Accent 1 L* (Figure 5). In both cases, the mismatching suffix gave rise to a P600 effect at 450 to 900 ms that was stronger in the case of declension-mismatching suffixes. The combination of the lexically unspecified singular suffix –en and a H* did not yield any P600 effect, since there was no specification-mismatch, although –en usually co-occurs with Accent 1.

Figure 5. ERPs for the Accent 2-specifying plural suffix –ar (L*PL), the word accent-unspecified definite singular suffix –en (L*SG), as well as the inappropriately declining –or (L*DECL) and –et (L*NEU), combined with a stem associated with Accent 1 L*. The clash of the H*-specification of –ar with the L* of the stem produced a P600 similar to that of the declension errors at 450–900 ms.

The study showed that a lexicalized prosodic feature has similar effects on morphological processing as other lexicalized morphological features such as those related to declension marking.

Summary and conclusions

In the present contribution, two prosodic features with different degrees of grammaticalization and their influence on language processing have been discussed. It was suggested that a Swedish left-edge boundary tone has arisen from the grammaticalization of the physical conditions on speech production represented by Gussenhoven’s (2002) Production Code. Probably stemming from a rise naturally associated with the beginning of phrases, the tone has become associated with the syntactic structure that is most common in spoken language and most expected at the beginning of an utterance, namely the main clause. The tone has also been assigned a specific location, the last syllable of the first prosodic word. When hearing the tone, speakers thus increase the activation of main clause structure.

However, the tone does not seem to be fully grammaticalized, i.e. it does not seem to be able to override syntactic cues to subordination in situations where both main and subordinate embedded clauses occur. Even when hearing the tone, speakers seem to be nevertheless biased towards subordinate clause structure after hearing the subordinate conjunction att ‘that’ and a sentence adverb in preverbal position in the embedded clause. However, embedded main clauses are easier to process in the context of a left-edge boundary tone. Thus we can assume that the H tone activates main clause structure. Further, the boundary tone influenced acceptability judgments, but only in the absence of word order variation in the test sentences. The combination of syntactic cues such as the conjunction att and subordinate word order (S–SAdv–V) thus appears to constitute decisive cues to clause structure and cancel out the potential influence the initial H could have had in reprocessing an embedded clause as a main clause.

A fully grammaticalized prosodic feature was also discussed, Swedish Accent 2, whose association with the stem is accounted for by a H* lexically specified for certain suffixes, e.g. plural –ar (Riad, in press). When the H*-specification of the suffix clashed with a L*

inappropriately associated with the stem in the test words, the words were reprocessed, as reflected in a P600 effect in the ERPs. Significantly lower acceptability judgments confirmed that the effect was due to the ungrammatical form of these test words. Similar effects were obtained for declension errors.

The results reviewed in this paper indicate that prosodic features with a low degree of grammaticalization can nevertheless influence processing of speech by e.g. increasing the activation of a particular syntactic structure without, however, inhibiting the activation of parallel competing structures. The studies involving the semi-grammaticalized left-edge boundary tone show clearly how prosodic cues interact with syntactic cues in the processing of different kinds of clauses. In the processing of subordinate embedded clauses, syntactic cues were seen to override and cancel out the potential influence of the prosodic cue (H boundary tone). In embedded main clauses, however, the prosodic cue facilitated the processing of word order. In contrast to these results related to the left-edge boundary tone, the findings from the study on word accent processing show how this kind of prosodic parameter has a very different status as regards its degree of grammaticalization. The Swedish word accent 2 was seen to affect morphological processing in a way similar to other morphological features, such as declension class, and can therefore be regarded as fully lexicalized.

Acknowledgements

This work was supported by grant 421-2007-1759 from the Swedish Research Council.

References

Gennari, S. and MacDonald, M. C. (2008) Semantic indeterminacy of object relative clauses. Journal of Memory and Language 58, 161–187.
Gussenhoven, C. (2002) Intonation and interpretation: Phonetics and phonology. In Speech Prosody 2002, 47–57.
Horne, M. (1994) Generating prosodic structure for synthesis of Swedish intonation. Working Papers (Dept. of Linguistics, Lund University) 43, 72–75.
Horne, M., Hansson, P., Bruce, G. and Frid, J. (2001) Accent patterning on domain-related information in Swedish travel dialogues. International Journal of Speech Technology 4, 93–102.
Lück, M., Hahne, A., and Clahsen, H. (2006) Brain potentials to morphologically complex words during listening. Brain Research 1077(1), 144–152.
Ohala, J. J. (1993) The phonetics of sound change. In Jones C. (ed.) Historical linguistics: Problems and perspectives, 237–278. New York: Longman.
Osterhout, L. and Holcomb, P. J. (1992) Event-related brain potentials elicited by syntactic anomaly. Journal of Memory and Language 31, 785–806.
Riad, T. (1998) The origin of Scandinavian tone accents. Diachronica 15(1), 63–98.
Riad, T. (in press) The morphological status of accent 2 in North Germanic simplex forms. In Proceedings from Nordic Prosody X, Helsinki.
Rischel, J. (1963) Morphemic tone and word tone in Eastern Norwegian. Phonetica 10, 154–164.
Roll, M. (2006) Prosodic cues to the syntactic structure of subordinate clauses in Swedish. In Bruce, G. and Horne, M. (eds.) Nordic prosody: Proceedings of the IXth conference, Lund 2004, 195–204. Frankfurt am Main: Peter Lang.
Roll, M., Horne, M., and Lindgren, M. (2009a) Left-edge boundary tone and main clause verb effects on embedded clauses—An ERP study. Journal of Neurolinguistics 22(1), 55–73.
Roll, M., Horne, M., and Lindgren, M. (2009b) Activating without inhibiting: Effects of a non-grammaticalized prosodic feature on syntactic processing. Submitted.
Roll, M., Horne, M., and Lindgren, M. (2009c) Effects of prosody on morphological processing. Submitted.


Focal lengthening in assertions and confirmations
Gilbert Ambrazaitis
Linguistics and Phonetics, Centre for Languages and Literature, Lund University

Abstract

This paper reports on duration measurements in a corpus of 270 utterances by 9 Standard Swedish speakers, where focus position is varied systematically in two different speech acts: assertions and confirmations. The goal is to provide information needed for the construction of a perception experiment, which will test the hypothesis that Swedish has a paradigmatic contrast between a rising and a falling utterance-level accent, which are both capable of signalling focus, the falling one being expected in confirmations. The results of the present study are in line with this hypothesis, since they show that focal lengthening occurs in both assertions and confirmations, even if the target word is produced with a falling pattern.

Introduction

This paper is concerned with temporal aspects of focus signalling in different types of speech acts – assertions and confirmations – in Standard Swedish. According to Büring (2007), most definitions of focus have been based on either of two ‘intuitions’: first, ‘new material is focussed, given material is not’; second, ‘the material in the answer that corresponds to the wh-constituent in the (constituent) question is focussed’ (henceforth, ‘Question-Answer’ definition). In many cases, first of all in studies treating focus in assertions, there is no contradiction between the two definitions; examples for usages of focus that are compatible with both definitions are Bruce (1977), Heldner and Strangert (2001), or Ladd (2008), where focus is defined, more or less explicitly, with reference to ‘new information’, while a question-answer paradigm is used to elicit or diagnose focus. In this study, focus is basically understood in the same sense as in, e.g. Ladd (2008). However, reference to the notion of ‘newness’ in defining focus is avoided, since it might seem inappropriate to speak of ‘new information’ in confirmations. Instead, the ‘Question-Answer’ definition is adopted, however, in a generalised form not restricted to wh-questions. Focus signalling or focussing is then understood as a ‘highlighting’ of the constituent in focus. Focus can refer to constituents of different size (e.g. individual words or entire phrases), and be signalled by different, e.g. morphosyntactic, means, but only narrow focus (i.e. focus on individual words) as signalled by prosodic means is of interest for this paper.

For Swedish, Bruce (1977) demonstrated that focus is signalled by a focal accent – a tonal rise that follows the word accent gesture. In the Lund model of Swedish intonation (e.g. Bruce et al., 2000) it is assumed that focal accent may be present or absent in a word, but there is no paradigmatic contrast of different focal accents. However, the Lund model is primarily based on the investigation of a certain type of speech act, namely assertions (Bruce, 1977). This paper is part of an attempt to systematically include further speech acts in the investigation of Swedish intonation.

In Ambrazaitis (2007), it was shown that confirmations may be produced without a rising focal accent (H-). It was argued, however, that the fall found in confirmations not merely reflects a ‘non-focal’ accent, but rather an utterance-level prominence, which paradigmatically contrasts with a H-. Therefore, in Ambrazaitis (in press), it is explored if and how focus can be signalled prosodically in confirmations. To this end, the test sentence “Wallander förlänger till november.” (‘Wallander is continuing until November.’) was elicited both as an assertion and as a confirmation, with focus either on the initial, medial, or final content word. An example for a context question eliciting final focus in a confirmation is ‘Until when is Wallander continuing, actually? Until November, right?’. As a major result, one strategy of signalling a confirmation was by means of a lowered H- rise on the target word. However, another strategy was, like in Ambrazaitis (2007), to realise the target word with a lack of a H- rise, i.e. with a falling F0 pattern (cf. Figure 1, upper panel). The initial word was always produced with a rise, irrespective of whether the initial word itself was in focus or not. Initial, pre-focal rises have been widely observed in Swedish and received different interpretations (e.g. Horne, 1991; Myrberg, in press; Roll et al., 2009). For the present paper, it is sufficient to note that an initial rise is not necessarily associated with focus.
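A typical normalisation behind mean-contour plots like Figure 1 (the study time- and register-normalises F0, resampling each word to a fixed number of measurement points and expressing F0 in semitones relative to a speaker's base F0) can be sketched as follows. The 10-point sampling, the raw contour, and the base F0 value below are invented examples, not the study's data:

```python
import numpy as np

def semitones(f0_hz, base_hz):
    """Convert F0 values (Hz) to semitones relative to a base F0."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / base_hz)

def time_normalise(contour, n_points=10):
    """Resample an F0 contour to a fixed number of points by linear
    interpolation over normalised time (0..1)."""
    contour = np.asarray(contour, dtype=float)
    old_t = np.linspace(0.0, 1.0, len(contour))
    new_t = np.linspace(0.0, 1.0, n_points)
    return np.interp(new_t, old_t, contour)

# Hypothetical raw contour for one word (Hz) and a speaker base F0.
raw = [110, 120, 140, 160, 150, 130, 120]
norm = time_normalise(semitones(raw, base_hz=100.0))
print(norm.shape)
```

Register normalisation in semitones makes contours comparable across speakers with different pitch ranges; time normalisation makes them comparable across words of different durations, so that averaging over tokens is meaningful.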


Figure 1. Mean F0 contours of the three content words in the test sentence “Wallander förlänger till november”; breaks in the curves symbolise word boundaries; time is normalised (10 measurements per word); semitones refer to an approximation of individual speakers’ base F0; adapted from Ambrazaitis (in press). Upper panel: two strategies of focus signalling on the medial word in a confirmation (medial rise, n=20; medial fall, n=17). Lower panel: Focus on the initial, medial, and final word in a confirmation (initial rise, n=38; medial fall, n=17; final fall, n=16); for medial and final focus, only the falling strategy is shown.

That is, in confirmations with intended focus on the medial or the final word, one strategy was to produce a (non-focal) rise on the initial, and two falling movements, one each on the medial and the final word. As the lower panel in Figure 1 shows, the mean curves of these two cases look very similar; moreover, they look similar to the pattern for initial focus, which was always produced with a rising focal accent. One possible reason for this similarity could be that medial or final focus, in fact, were not marked at all in these confirmations, i.e. that the entire utterance would be perceived as lacking any narrow focus. Another possibility is that all patterns displayed in Figure 1 (lower panel) would be perceived with a focal accent on the initial word. Informal listening, however, indicates that in many cases, an utterance-level prominence, indicating focus, can be perceived on the medial or the final word. Thus, future perception experiments should test whether focus can be signalled by the falling pattern found in confirmations, and furthermore, which acoustic correlates of this fall serve as perceptual cues of focus in confirmations. Prior to that, the acoustic characteristics of the falling pattern need to be established in more detail.

It is known for a variety of languages that prosodically focussed words in assertions are not only marked tonally, i.e. by a pitch accent, but also temporally, i.e. by lengthening (e.g. Bruce, 1981, Heldner and Strangert, 2001, for Swedish; Cambier-Langeveld and Turk, 1999, for English and Dutch; Kügler, 2008, for German). Moreover, Bruce (1981) suggests that increased duration is not merely an adaptation to the more complex tonal pattern, but rather a focus cue on its own, besides the tonal rise.

The goal of this study is to examine the data from Ambrazaitis (in press) on focus realisation in assertions and confirmations in more detail as regards durational patterns. The results are expected to provide information as to whether duration should be considered as a possible cue to focus and to speech act in future perception experiments. The hypothesis is that, if focus is signalled in confirmations, and if lengthening is a focus cue independent of the tonal pattern, then focal lengthening should be found, not only in assertions, but also in confirmations. Furthermore, it could still be the case that durational patterns differ in confirmations and assertions.

Method

The following two sections on the material and the recording procedure are, slightly modified, reproduced from Ambrazaitis (in press).

Material

The test sentence used in this study was “Wallander förlänger till november” (‘Wallander is continuing until November’). In the case of a confirmation, the test sentence was preceded by “ja” (‘yes’). Dialogue contexts were constructed in order to elicit the test sentence with focus on the first, second, or third content word, in each case both as an assertion and as a confirmation. These dialogue contexts consisted of a situational frame context, which was the same for all conditions (‘You are a police officer meeting a former colleague. You are talking about retirement and the possibility to continue working.’), plus six different context questions, one
Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

for each condition (cf. the example in the Introduction). While the frame context was presented to the subjects exclusively in written form, the context question was only presented auditorily. For that, the context questions were pre-recorded by a 30-year-old male native speaker of Swedish.

Recording procedure and subjects
The data collection was performed using a computer program, which both presented the contexts and test sentences to the subjects and organised the recording. First, for each trial, only the frame context was displayed on the screen in written form. The subjects had to read the context silently and to try to imagine the situation described in the context. When ready, they clicked on a button to continue with the trial. Then, the pre-recorded context question was played to them via headphones, and simultaneously, the test sentence appeared on the screen. The subject’s task was to answer the question using the test sentence in a normal conversational style. The subjects were allowed to repeat each trial until they were satisfied. Besides the material for this study, the recording session included a number of further test cases not reported on in this paper. Five repetitions of each condition were recorded, and the whole list of items was randomised. One recording session took about 15 minutes per speaker. Nine speakers of Standard Swedish were recorded (5 female) in an experimental studio at the Humanities Laboratory at Lund University. Thus, a corpus of 270 utterances relevant to this study (6 conditions, 5 repetitions per speaker, 9 speakers) was collected.

Data analysis
A first step in data analysis is reported in Ambrazaitis (in press). There, the goal was to provide an overview of the most salient characteristics of the F0 patterns produced in the different conditions. To this end, F0 contours were time and register normalised, and mean contours were calculated in order to illustrate the general characteristics of the dominant patterns found in the different conditions (cf. examples in Figure 1). The F0 patterns were classified according to the F0 movement found in connection with the stressed syllable of the target word, as either ‘falling’ or ‘rising’.

In order to obtain duration measurements in the present study, the recorded utterances were segmented into 10 quasi-syllables using spectrograms and waveform diagrams. The boundaries between the segments were set as illustrated by the following broad phonetic transcriptions: [a], [land], [], [fœ], [lŋ], [], [tl], [n], [vmb], []. In the case of [land] and [vmb], the final boundary was set at the time of the plosive burst, if present, or at the onset of the post-stress vowel.

It has been shown for Swedish that focal lengthening in assertions is non-linear, in that the stressed syllable is lengthened more than the unstressed syllables (Heldner and Strangert, 2001). Therefore, durational patterns were analysed on two levels: first, taking into account entire word durations; second, concentrating on the stressed syllables only. In both cases, the analyses focussed on the three content words and hence disregarded the word “till”.

For each word, two repeated-measures ANOVAs were calculated, one with word duration as the dependent variable, the other for stressed syllable duration. In each of the six ANOVAs, there were three factors: SPEECH ACT (with two levels: assertion, confirmation), FOCUS (POSITION) (three levels: focus on initial, medial, final word), and finally REPETITION (five repetitions, i.e. five levels).

All data were included in these six ANOVAs, irrespective of possible mispronunciations or the intonation patterns produced (cf. the two strategies for confirmations, Figure 1), in order to obtain a general picture of the effects of focus and speech act on duration. However, the major issue is whether focus in confirmations may be signalled by a falling F0 pattern. Therefore, in a second step, durational patterns were looked at with respect to the classification of F0 patterns made in Ambrazaitis (in press).

Results
Figure 2 displays mean durations of the three test words for the six conditions (three focus positions in two speech acts). The figure only shows word durations, since, on an approximate descriptive level, the tendencies for stressed syllable durations are similar; the differences between durational patterns based on entire words and stressed syllables only will, however, be accounted for in the inferential statistics.

The figure shows that the final word (“november”) is generally produced relatively long even when unfocussed, i.e. longer than medial or initial unfocussed words, reflecting the well-known phenomenon of final lengthening. Moreover, the medial word (“förlänger”) is


generally produced relatively short. However, influences of position or individual word characteristics are not treated in this study. The figure also shows that each word is produced longer when it is in focus than when it is pre-focal or post-focal (i.e. when another word is focussed). This focal lengthening effect can, moreover, be observed in both speech acts, although the effect appears to be smaller in confirmations than in assertions. For unfocussed words, there seem to be no duration differences between the two speech acts.

[Figure 2 here: word durations (ms, 250–550) for conditions ass 1–3 and con 1–3; words “Wallander”, “förlänger”, “november”]
Figure 2. Mean durations of the three test words for the six conditions (ass = assertion; con = confirmation; numbers 1, 2, 3 = initial, medial, final focus), pooled across 45 repetitions by 9 speakers.

These observations are generally supported by the inferential statistics (cf. Table 1), although most clearly for the medial word: a significant effect was found for the factors SPEECH ACT and FOCUS, as well as for the interaction of the two factors, both for word duration ([fœlŋ]) and stressed syllable duration ([lŋ]); no other significant effects were found for the medial word. According to post-hoc comparisons (with Bonferroni correction), [fœlŋ] and [lŋ] were realised with a longer duration in assertions than in confirmations (p<.001 in both cases) only when the medial word itself was in focus. In assertions, both the word and the stressed syllable were longer when the word was in focus than when another word was in focus (p<.001 for all comparisons). In confirmations, the general result was the same (entire word: focal > post-focal (p=.016); focal > pre-focal (p=.003); stressed syllable: focal > post-focal (p=.023); focal > pre-focal (p=.001)).

The situation is similar for the final word, the major difference in the test results being that the interaction of FOCUS and REPETITION was significant for word durations (cf. Table 1). Resolving this interaction shows that significant differences between repetitions only occur for final focus, and furthermore, that they seem to be restricted to confirmations. A possible explanation is that the two different strategies of focussing the final word in confirmations (rise vs. fall) are reflected in this interaction (cf. Figure 3 below). As in the case of the medial word, post-hoc comparisons reveal that both [nvmb] and [vmb] were realised with a longer duration in assertions than in confirmations (p<.001 in both cases) only when the final word itself was in focus. Also, the entire word is longer when in focus than in the two post-focal conditions, i.e. the initial or medial word being focussed (assertions: p<.001 in both cases; confirmations: p=.018 for final vs. initial focus, p=.007 for final vs. medial focus). The picture is, however, different when only the stressed syllable is measured. In confirmations, no significant differences are found for [vmb] in the different focus conditions, while in assertions, the duration of [vmb] differs in all three focus conditions (final focus > medial focus (p=.001);

Table 1. Results of the six repeated-measures ANOVAs: degrees of freedom (Greenhouse-Geisser corrected where sphericity cannot be assumed), F-values, and p-values. Factor REPETITION was never significant; no interactions besides the one shown were significant, an exception being FOCUS*REPETITION for [nvmb] (F(8,64)=2.21; p=.038).

                [valand]        [fœlŋ]          [nvmb]          [land]          [lŋ]            [vmb]
SPEECH ACT      F(1,8)=9.15     F(1,8)=17.36    F(1,8)=36.33    F(1,8)=26.59    F(1,8)=73.23    F(1,8)=25.99
                p=.016          p=.003          p<.001          p=.001          p<.001          p=.001
FOCUS           F(1.13,9.07)    F(2,16)         F(2,16)         F(1.03,8.26)    F(2,16)         F(2,16)
                =19.25          =39.13          =34.57          =16.57          =36.92          =25.94
                p=.001          p<.001          p<.001          p=.003          p<.001          p<.001
SP. ACT *       n.s.            F(2,16)=20.82   F(2,16)=28.03   F(2,16)=3.75    F(2,16)=22.59   F(2,16)=31.06
FOCUS                           p<.001          p<.001          p=.046          p<.001          p<.001
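The repeated-measures ANOVAs summarized in Table 1 can be sketched in miniature. The following is an illustrative one-way repeated-measures F-test (speakers as subjects, FOCUS as the within-subject factor) on invented durations, not the authors' analysis script; a full three-factor model would add SPEECH ACT and REPETITION terms.

```python
import numpy as np

def rm_anova_1way(data):
    """One-way repeated-measures ANOVA F-test.

    data: (n_subjects, n_levels) array of per-subject condition means.
    The error term is the subject-by-condition interaction.
    """
    n, k = data.shape
    grand = data.mean()
    subj_means = data.mean(axis=1, keepdims=True)
    cond_means = data.mean(axis=0, keepdims=True)
    ss_factor = n * ((cond_means - grand) ** 2).sum()
    resid = data - subj_means - cond_means + grand  # subject x condition residuals
    ss_error = (resid ** 2).sum()
    df_factor, df_error = k - 1, (n - 1) * (k - 1)
    f_value = (ss_factor / df_factor) / (ss_error / df_error)
    return f_value, df_factor, df_error

# Invented word durations (ms): 9 speakers x 3 focus conditions,
# with the first condition lengthened by about 60 ms
rng = np.random.default_rng(1)
durations = 400 + rng.normal(0, 20, (9, 3)) + np.array([60.0, 0.0, 10.0])
f_value, df1, df2 = rm_anova_1way(durations)
print(f"F({df1},{df2}) = {f_value:.2f}")
```

With nine speakers and three focus levels, this yields the F(2,16) degrees of freedom seen in the FOCUS rows of Table 1.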

final > initial (p<.001); medial > initial (p=.019)).

Finally, for the initial word, the interaction of FOCUS and SPEECH ACT was not significant for word duration (cf. Table 1). That is, [valand] was produced longer in assertions than in confirmations, both when in focus and in pre-focal position (cf. also Figure 2). Post-hoc tests for FOCUS show that [valand] is realised with a longer duration when the word is in focus than when focus is on the medial (p=.011) or final word (p=.003). However, when only the stressed syllable is taken into account, the interaction of SPEECH ACT and FOCUS is significant (cf. Table 1). As shown by post-hoc comparisons, the situation is, however, more complex than for the interactions found for the other words: First, [land] is realised longer in assertions than in confirmations not only when the initial word is in focus (p=.002), but also when the final word is in focus (p=.029). Second, in assertions, the duration of [land] differs in all three focus conditions (initial focus > medial focus (p=.015); initial > final (p=.036); final > medial (p=.039)), while in confirmations, [land] is significantly longer in focus than in the two pre-focal conditions only (initial > medial (p=.005); initial > final (p=.016)), i.e. no significant difference is found between the two pre-focal conditions.

In the analysis so far, all recordings have been included irrespective of the variation of F0 patterns produced within an experimental condition. As mentioned in the Introduction, confirmations were produced with either of two strategies, classified in Ambrazaitis (in press) as either ‘rising’ (presence of a (lowered) H- accent on the target word) or ‘falling’ (absence of a H- accent on the target word), cf. Figure 1. This raises the question as to whether the focal lengthening found in confirmations (cf. Figure 2) is present in both the rising and the falling variants. Figure 3 displays the results for confirmations in a rearranged form, where the F0 pattern is taken into account.

[Figure 3 here: word durations (ms, 250–550) for confirmation classes initial rise (38), medial fall (17), medial rise (20), final fall (16), final rise (27); words “Wallander”, “förlänger”, “november”]
Figure 3. Mean durations of the three test words in confirmations, divided into classes according to the intended focus position (initial, medial, final word) and F0 pattern produced on the target word (rise, fall); n in parentheses.

For the medial word, Figure 3 indicates that, first, the word seems to be lengthened in focus even when it is produced with a falling pattern (cf. “förlänger” in conditions ‘medial fall’ vs. ‘final fall’, ‘final rise’, and ‘initial rise’), and second, the focal lengthening effect still tends to be stronger when the word is produced with a rise (‘medial fall’ vs. ‘medial rise’). However, for the final word, focal lengthening seems to be present only when the word is produced with a rise. Finally, the initial word seems to be lengthened not only when it is in focus itself, but also when medial or final focus is produced with a fall, as compared to medial or final focus produced with a rise.

Discussion
The goal of this study was to examine the durational patterns in a data corpus where focus was elicited in two different speech acts, assertions and confirmations. It is unclear from the previous F0 analysis (cf. Figure 1 and Ambrazaitis, in press) whether focus was successfully signalled in confirmations, when these were produced without a ‘rising focal accent’ (H-). The general hypothesis to be tested in future perception experiments is that focus in confirmations may even be signalled by a falling pattern, which would support the proposal by Ambrazaitis (2007) that there is a paradigmatic utterance-level accent contrast in Standard Swedish between a rising (H-) and a falling accent.

The present results are in line with this general hypothesis, since they have shown that focal lengthening can be found not only in assertions, but also in confirmations, although the degree of focal lengthening seems to be smaller in confirmations than in assertions. In fact, the speech act hardly affects the duration of unfocussed words, meaning that speech act signalling interacts with focus signalling. Most importantly, the results also indicate that focal lengthening may even be found when the target word is produced with a falling F0 pattern, although no inferential statistics have been reported for this case. In fact, in these cases, duration differences seem to be more salient than F0 differences (cf. ‘medial fall’ and ‘final fall’ in Figures 1 and 3).


This summary of the results, however, best matches the durational patterns found for the medial word. Heldner and Strangert (2001) conclude that the medial position is least affected by factors other than the focal accent itself, e.g. final lengthening. Based on the present results, it seems obvious that even the duration of the initial word is influenced by more factors than focus, since even if the initial word is pre-focal, its duration seems to vary depending on whether the medial or the final word is focussed (when only the stressed syllable is measured), or, in confirmations, whether medial or final focus is produced with a fall or a rise. More research is needed in order to reach a better understanding of these patterns. In part, durational patterns of initial words could possibly be related to the role the initial position plays in signalling phrase- or sentence prosody (Myrberg, in press; Roll et al., 2009).

Finally, for the final word, the evidence for focal lengthening in confirmations is weaker, a tendency opposite to the one found by Heldner and Strangert (2001) for assertions, where final words in focus tended to be lengthened more than words in other positions. In the present study, no focal lengthening was found for the final word in confirmations when the word was produced with a falling pattern. However, the relative difference in duration between the final and the medial word was still larger as compared to the case of intended medial focus produced with a fall (cf. the duration relations of ‘medial fall’ and ‘final fall’ in Figure 3).

Some of the duration differences found in this study are small and probably irrelevant from a perceptual point of view. However, the general tendencies indicate that duration is a possible cue to perceived focus position in confirmations and thus should be taken into account in the planned perception experiment.

Acknowledgements
Thanks to Gösta Bruce and Merle Horne for their valuable advice during the planning of the study and the preparation of the paper, to Mikael Roll for kindly recording the context questions, and, of course, to all my subjects!

References
Ambrazaitis G. (2007) Expressing ‘confirmation’ in Swedish: the interplay of word and utterance prosody. Proceedings of the 16th ICPhS (Saarbrücken, Germany), 1093–96.
Ambrazaitis G. (in press) Swedish and German intonation in confirmations and assertions. Proceedings of Nordic Prosody X (Helsinki, Finland).
Bruce G. (1977) Swedish word accents in sentence perspective. Lund: Gleerup.
Bruce G. (1981) Tonal and temporal interplay. In Fretheim T. (ed) Nordic Prosody II – Papers from a symposium, 63–74. Trondheim: Tapir.
Bruce G., Filipsson M., Frid J., Granström B., Gustafson K., Horne M., and House D. (2000) Modelling of Swedish text and discourse intonation in a speech synthesis framework. In Botinis A. (ed) Intonation. Analysis, modelling and technology, 291–320. Dordrecht: Kluwer.
Büring D. (2007) Intonation, semantics and information structure. In Ramchand G. and Reiss C. (eds) The Oxford Handbook of Linguistic Interfaces, 445–73. Oxford: Oxford University Press.
Cambier-Langeveld T. and Turk A.E. (1999) A cross-linguistic study of accentual lengthening: Dutch vs. English. Journal of Phonetics 27, 255–80.
Heldner M. and Strangert E. (2001) Temporal effects of focus in Swedish. Journal of Phonetics 29, 329–61.
Horne M. (1991) Why do speakers accent “given” information? Proceedings of Eurospeech 91: 2nd European Conference on Speech Communication and Technology (Genoa, Italy), 1279–82.
Kügler F. (2008) The role of duration as a phonetic correlate of focus. Proceedings of the 4th Conference on Speech Prosody (Campinas, Brazil), 591–94.
Ladd D.R. (2008) Intonational phonology, 2nd ed. Cambridge: Cambridge University Press.
Myrberg S. (in press) Initiality accents in Central Swedish. Proceedings of Nordic Prosody X (Helsinki, Finland).
Roll M., Horne M., and Lindgren M. (2009) Left-edge boundary tone and main clause verb effects on syntactic processing in embedded clauses – an ERP study. Journal of Neurolinguistics 22, 55–73.


On utterance-final intonation in tonal and non-tonal dialects of Kammu
David House1, Anastasia Karlsson2, Jan-Olof Svantesson2, Damrong Tayanin2
1 Dept of Speech, Music and Hearing, CSC, KTH, Stockholm, Sweden
2 Dept of Linguistics and Phonetics, Centre for Languages and Literature, Lund University, Sweden

Abstract
In this study we investigate utterance-final intonation in two dialects of Kammu, one tonal and one non-tonal. While the general patterns of utterance-final intonation are similar between the dialects, we do find clear evidence that the lexical tones of the tonal dialect restrict the pitch range and the realization of focus. Speaker engagement can have a strong effect on the utterance-final accent in both dialects.

Introduction
Kammu, a Mon-Khmer language spoken primarily in northern Laos by approximately 600,000 speakers, but also in Thailand, Vietnam and China, is a language that has developed lexical tones rather recently, from the point of view of language history. Tones arose in connection with loss of the contrast between voiced and voiceless initial consonants in a number of dialects (Svantesson and House, 2006). One of the main dialects of this language is a tone language with high or low tone on each syllable, while the other main dialect lacks lexical tones. The dialects differ only marginally in other respects. This makes the different Kammu dialects well-suited for studying the influence of lexical tones on the intonation system.

In previous work using material gathered from spontaneous storytelling in Kammu, the utterance-final accent stands out as being especially rich in information, showing two types of focal realizations depending on the expressive load of the accent and the speaker’s own engagement (Karlsson et al., 2007). In another study of the non-tonal Kammu dialect, it was generally found that in scripted speech, the highest F0 values were located on the utterance-final word (Karlsson et al., 2008). In this paper, we examine the influence of tone, focus and, to a certain extent, speaker engagement on the utterance-final accent by using the same scripted speech material recorded by speakers of a non-tonal dialect and by speakers of a tonal dialect of Kammu.

Data collection and method
Recordings of both scripted and spontaneous speech spoken by tonal and non-tonal speakers of Kammu were carried out in November 2007 in northern Laos and in February 2008 in northern Thailand. 24 speakers were recorded, ranging in age from 14 to 72 years.

The scripted speech material comprised 47 read sentences. The sentences were composed in order to control for lexical tone, to elicit focus in different positions and to elicit phrasing and phrase boundaries. Kammu speakers are bilingual, with Lao or Thai being their second language. Since Kammu lacks a written script, informants were asked to translate the material from Lao or Thai to Kammu. This resulted in some instances of slightly different but still compatible versions of the target sentences. The resulting utterances were checked and transcribed by one of the authors, Damrong Tayanin, who is a native speaker of Kammu. The speakers were requested to read each target sentence three times.

For the present investigation six of the 47 read sentences were chosen for analysis. The sentences are transcribed below using the transcription convention for the tonal dialect.

1) nàa wɛ̀ɛt hmràŋ (she bought a horse)
2) nàa wɛ̀ɛt hmràŋ yɨ̀m (she bought a red horse)
3) tɛ́ɛk pháan tráak (Tɛɛk killed a buffalo)
4) tɛ́ɛk pháan tráak yíaŋ (Tɛɛk killed a black buffalo)
5) Ò àh tráak, àh sɨ́aŋ, àh hyíar (I have a buffalo, a pig and a chicken)
6) Ò àh hmràŋ, àh mɛ̀ɛw, àh prùul (I have a horse, a cat and a badger)

Sentences 1 and 2 contain only words with a low lexical tone while sentences 3 and 4 contain only words with a high lexical tone. Sentences 2 and 4 differ from 1 and 3 only in that they end with an additional color adjective following the noun (red and black respectively). Sentences 2 and 4 were designed to elicit focal accent on the final syllable. Sentences 5 and 6 convey a listing of three nouns (animals). The nouns all have high lexical tone in sentence 5 and low lexical tone in sentence 6.

Of all the speakers recorded, one non-tonal speaker and four tonal speakers were excluded from this study as they had problems reading and translating from the Lao/Thai script. All the other speakers were able to fluently translate the Lao/Thai script into Kammu and were included in the analysis. Thus there were 9 non-tonal speakers (2 women and 7 men) and 10 tonal speakers (6 women and 4 men) included in this study. The speakers ranged in age from 14 to 72.

The subjects were recorded with a portable Edirol R-09 digital recorder and a lapel microphone. The utterances were digitized at 48 kHz sampling rate and 16-bit amplitude resolution and stored in .wav file format. Most of the speakers were recorded in quiet hotel rooms. One speaker was recorded in his home and one in his native village.

Using the WaveSurfer speech analysis program (Sjölander and Beskow, 2000), the waveform, spectrogram and fundamental frequency contour of each utterance was displayed. Maximum and minimum F0 values were then annotated manually for successive syllables for each utterance. For sentences 1-4, there was generally very little F0 movement in the syllables leading up to the penult (pre-final syllable). Therefore, measurements were restricted to maximum F0 values on the pre-penultimate syllables, while both maximum and minimum values were measured on the penultimate and ultimate syllables. For sentences 5 and 6, the initial syllable and the three nouns often showed considerable F0 movement and were therefore provided with both maximum and minimum F0 values. For the other syllables, which showed little F0 movement, measurements were again restricted to F0 maximum values.

To be able to compare the different speakers, the measured F0 values in Hz were converted to semitones (St) and then normalized in accordance with Fant and Kruckenberg (2004). A fixed semitone scale is used where the unit St is defined by

St = 12[ln(Hz/100)/ln 2]   (1)

which results in a reference value of St=0 semitones at 100 Hz, St=12 at 200 Hz and St=-12 at 50 Hz. Normalization is performed by subtracting each subject’s average F0 in St (measured across the three utterances of each target sentence) from the individual St values.

Results
Plots for sentences 1-4 showing the F0 measurement points in normalized semitones are presented in Figure 1 for the non-tonal dialect and in Figure 2 for the tonal dialect. Alignment is from the end of the sentences. Both dialects exhibit a pronounced rise-fall excursion on the utterance-final syllable. There is, however, a clear difference in F0 range between the two dialects. The non-tonal dialect exhibits a much wider range of the final excursion (6-7 St) than does the tonal dialect (3-4 St).

If we are to find evidence of the influence of focus on the F0 excursions, we would expect differences between sentence pairs 1 and 2 on the one hand and 3 and 4 on the other hand, where the final words in 2 and 4 receive focal accent. In the non-tonal dialect (Figure 1) we find clear evidence of this difference. In the tonal dialect there is some evidence of the effect of focal accent but the difference is considerably reduced and not statistically significant (see Table 1).

Table 1. Normalized maximum F0 means for the sentence pairs. 1st and 2nd refers to the F0 max for the 1st and 2nd sentence in the pair. Diff is the difference of the normalized F0 values for the pair, and the ANOVA column shows the results of a single factor ANOVA. Measurement position in the utterance is shown in parenthesis.

Sentence pair    1st     2nd     diff    ANOVA
Non-tonal
1-2 (final)      1.73    3.13    -1.40   p<0.05
3-4 (final)      1.49    2.84    -1.35   p<0.05
5-6 (1st noun)   1.60    2.63    -1.03   p<0.05
5-6 (2nd noun)   2.47    1.86     0.61   n.s.
5-6 (final)      2.02    4.00    -1.98   p<0.05
Tonal
1-2 (final)      0.56    0.76    -0.20   n.s.
3-4 (final)      0.76    1.54    -0.78   n.s.
5-6 (1st noun)   2.01    1.22     0.79   p<0.05
5-6 (2nd noun)   2.33    1.42     0.91   p<0.05
5-6 (final)      1.59    1.98    -0.39   n.s.
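The semitone conversion of Eq. (1) and the speaker normalization can be sketched as follows; this is a minimal illustration, and the function names and example F0 values are ours, not from the paper.

```python
import math

def hz_to_st(f0_hz):
    """Fixed semitone scale of Eq. (1): St = 12 * ln(f0/100) / ln 2."""
    return 12.0 * math.log(f0_hz / 100.0) / math.log(2.0)

def normalize(st_values, speaker_mean_st):
    """Subtract the speaker's average F0 in St from each measured value."""
    return [st - speaker_mean_st for st in st_values]

# Reference points given in the text: 100 Hz -> 0 St, 200 Hz -> 12 St, 50 Hz -> -12 St
print([round(hz_to_st(f), 2) for f in (100.0, 200.0, 50.0)])  # [0.0, 12.0, -12.0]

# Normalizing three hypothetical F0 maxima against their own mean
st = [hz_to_st(f) for f in (180.0, 220.0, 160.0)]
print([round(v, 2) for v in normalize(st, sum(st) / len(st))])
```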

Figure 1: Normalized F0 measurement points for sentences 1-4 from nine speakers of the non-tonal dialect. Lexical tone in parenthesis refers to the tonal dialect.

Figure 3: Normalized F0 measurement points for sentences 5-6 from nine speakers of the non-tonal dialect. Lexical tone in parenthesis refers to the tonal dialect.

Figure 2: Normalized F0 measurement points for sentences 1-4 from ten speakers of the tonal dialect. Lexical tone is indicated in parenthesis.

Figure 4: Normalized F0 measurement points for sentences 5-6 from ten speakers of the tonal dialect. Lexical tone is indicated in parenthesis.

Plots for sentences 5 and 6 are presented in Figure 3 for the non-tonal dialect and in Figure 4 for the tonal dialect. Alignment is from the end of the sentences. Both dialects show a similar intonation pattern exhibiting rise-fall excursions on each of the three nouns comprising the listing of the three animals in each sentence. As is the case for sentences 1-4, there is also here a clear difference in F0 range between the two dialects. The non-tonal dialect exhibits a much wider overall range (6-9 St) than does the tonal dialect (3-4 St).

The nouns in sentence 5 have high tone in the tonal dialect, while those in sentence 6 have low tone. A comparison of the F0 maximum of the nouns in the three positions for the non-tonal dialect (Figure 3) shows that the noun in the first and third position has a higher F0 in sentence 6 than in sentence 5, but in the second position the noun has a higher F0 in sentence 5 than in sentence 6. A comparison of the F0 maximum of the nouns in the three positions for the tonal dialect (Figure 4) shows that the F0 maximum for the high tone (sentence 5) is higher than the low tone (sentence 6) in the first and second position but not in the third position.


Discussion
In general terms, the results presented here show considerable similarities in the basic intonation patterns of the utterance-final accent in the two dialects. There is a pronounced final rise-fall excursion which marks the end of the utterance. The presence of lexical tone in the tonal dialect does, however, restrict the intonation in certain specific ways. The most apparent overall difference is found in the restricted range of the tonal dialect. This is an interesting, but perhaps not such an unexpected difference. As the tonal differences have lexical meaning for the speakers of the tonal dialect, it may be important for speakers to maintain control over the absolute pitch of the syllable, which can result in the general reduction of pitch range.

The lexical tones and the reduction of pitch range seem to have implications for the realization of focal accent. In the non-tonal dialect, the final color adjectives of sentences 2 and 4 showed a much higher F0 maximum than did the final nouns of sentences 1 and 3. Here the speakers are free to use rather dramatic F0 excursions to mark focus. The tonal speakers, on the other hand, seem to be restricted from doing this. It is only the final color adjective of sentence 4 which is given a markedly higher F0 maximum than the counterpart noun in sentence 3. Since the adjective of sentence 4 has high lexical tone, this fact seems to allow the speakers to additionally raise the maximum F0. As the adjective of sentence 2 has low lexical tone, the speakers are not free to raise this F0 maximum. Here we see evidence of interplay between lexical tone and intonation.

In the listing of animals in sentences 5 and 6, there is a large difference in the F0 maximum of the final word in the non-tonal dialect. The word “badger” in sentence 6 is spoken with a much higher F0 maximum than the word “chicken” in sentence 5. This can be explained by the fact that the word “badger” is semantically marked compared to the other common farm animals in the list. It is quite natural in Kammu farming culture to have a buffalo, a pig, a chicken, a horse and a cat, but not a badger! Some of the speakers even asked to confirm what the word was, and therefore it is not surprising if the word often elicited additional speaker engagement. This extra engagement also shows up in the tonal speakers’ versions of “badger”, raising the low lexical tone to a higher F0 maximum than the word “chicken” in sentence 5, which has high lexical tone. Here, speaker engagement is seen to override the tonal restriction, although the overall pitch range is still restricted compared to the non-tonal dialect. Thus we see an interaction between speaker engagement, tone and intonation.

Conclusions
In this study we see that the general patterns of intonation for these sentences are similar in the two dialects. However, there is clear evidence of the lexical tones of the tonal dialect restricting the pitch range and the realization of focus, especially when the lexical tone is low. Speaker engagement can have a strong effect on the utterance-final accent, and can even neutralize pitch differences of high and low lexical tone in certain cases.

Acknowledgements
The work reported in this paper has been carried out within the research project “Separating intonation from tone” (SIFT), supported by the Bank of Sweden Tercentenary Foundation with additional funding from Crafoordska stiftelsen.

References
Fant G. and Kruckenberg A. (2004) Analysis and synthesis of Swedish prosody with outlooks on production and perception. In Fant G., Fujisaki H., Cao J. and Xu Y. (eds.) From traditional phonology to modern speech processing, 73–95. Beijing: Foreign Language Teaching and Research Press.
Karlsson A., House D., Svantesson J-O. and Tayanin D. (2007) Prosodic phrasing in tonal and non-tonal dialects of Kammu. Proceedings of the 16th International Congress of Phonetic Sciences (Saarbrücken, Germany), 1309–1312.
Karlsson A., House D. and Tayanin D. (2008) Recognizing phrase and utterance as prosodic units in non-tonal dialects of Kammu. In Proceedings, FONETIK 2008, 89–92. Department of Linguistics, University of Gothenburg.
Sjölander K. and Beskow J. (2000) WaveSurfer – an open source speech tool. In Proceedings of ICSLP 2000, 6th Intl Conf on Spoken Language Processing, 464–467, Beijing.
Svantesson J-O. and House D. (2006) Tone production, tone perception and Kammu tonogenesis. Phonology 23, 309–333.

Reduplication with fixed tone pattern in Kammu
Jan-Olof Svantesson1, David House2, Anastasia Mukhanova Karlsson1 and Damrong Tayanin1
1 Department of Linguistics and Phonetics, Lund University
2 Department of Speech, Music and Hearing, KTH, Stockholm

Abstract
In this paper we show that speakers of both tonal and non-tonal dialects of Kammu use a fixed tone pattern high–low for intensifying reduplication of adjectives, and also that speakers of the tonal dialect retain the lexical tones (high or low) while applying this fixed tone pattern.

Background
Kammu (also known as Khmu, Kmhmu’, etc.) is an Austroasiatic language spoken in the northern parts of Laos and in adjacent areas of Vietnam, China and Thailand. The number of speakers is at least 500,000. Some dialects of this language have a system of two lexical tones (high and low), while other dialects have preserved the original toneless state. The toneless dialects have voiceless and voiced syllable-initial stops and sonorants, which have merged in the tonal dialects, so that the voiceless ~ voiced contrast has been replaced with a high ~ low tone contrast. For example, the minimal pair klaaŋ ‘eagle’ vs. glaaŋ ‘stone’ in non-tonal dialects corresponds to kláaŋ vs. klàaŋ with high and low tone, respectively, in tonal dialects. Other phonological differences between the dialects are marginal, and all dialects are mutually comprehensible. See Svantesson (1983) for general information on the Kammu language and Svantesson and House (2006) for Kammu tonogenesis.

This state, with two dialects that more or less constitute a minimal pair for the distinction between tonal and non-tonal languages, makes Kammu an ideal language for investigating the influence of lexical tone on different prosodic properties of a language. In this paper we will deal with intensifying full reduplication of adjectives from this point of view.

Intensifying or attenuating reduplication of adjectives occurs in many languages in the Southeast Asian area, including Standard Chinese, several Chinese dialects and Vietnamese. As is well known, Standard Chinese uses full reduplication combined with the suffixes -r-de to form adjectives with an attenuated meaning (see e.g. Duanmu 2000: 228). The second copy of the adjective always has high tone (denoted ¯), irrespective of the tone of the base:

jiān ‘pointed’ > jiān-jiān-r-de
hóng ‘red’ > hóng-hōng-r-de
hǎo ‘good’ > hǎo-hāo-r-de
màn ‘slow’ > màn-mān-r-de

Thus, the identity of the word, including the tone, is preserved in Standard Chinese reduplication, the tone being preserved in the first copy of it.

In Kammu there is a similar reduplication pattern, intensifying the adjective meaning. For example, blia ‘pretty’ (non-tonal dialect) is reduplicated as blia-blia ‘very pretty’. This reduplication has a fixed tone pattern, the first copy being higher than the second one (although, as will be seen below, a few speakers apply another pattern).

Material and method
We investigate two questions:
(1) Is the high–low pattern in intensifying reduplication used by speakers of both tonal and non-tonal dialects?
(2) For speakers of tonal dialects: is the lexical tone of the adjective preserved in the reduplicated form?

For this purpose we used recordings of tonal and non-tonal dialect speakers that we made in northern Laos in November 2007, and in northern Thailand in February 2008. A total of 24 speakers were recorded, their ages ranging between 14 and 72 years. The recordings included two sentences with reduplicated adjectives:

naa blia-blia ‘she is very pretty’
naa thaw-thaw ‘she is very old’

This is the form in the non-tonal dialect; in the tonal dialect, the reduplicated words are plìa-plìa with low lexical tone, and tháw-tháw with high; the word nàa ‘she’ has low tone in the tonal dialect. Each speaker was asked to record the sentences three times, but for some, only one or two recordings were obtained or

82 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University possible to analyse (see table 1 for the number blia-blia was recorded than when thaw-thaw was of recordings for each speaker). Two of the recorded just after that; see House et al. speakers (Sp2 and Sp4) were recorded twice. (forthc.) for the role of the speakers’ engage- For four of the 24 speakers no useable record- ment for Kammu intonation. ings were made. Of the remaining 20 speakers, 8 speak the non-tonal dialect and 12 the tonal Conclusion dialect. The maximal fundamental frequency was The results show that the great majority of the measured in each copy of the reduplicated speakers we recorded used the expected fixed words using the Praat analysis program. pattern, high–low, for intensifying reduplica- tion, independent of their dialect type, tonal or non-tonal. Furthermore, the speakers of tonal Results and discussion dialects retain the contrast between high and The results are shown in table 1. low lexical tone when they apply this fixed tone Concerning question (1) above, the results pattern for adjective reduplication. show that most speakers follow the pattern that the first copy of the reduplicated adjective has Acknowledgements higher F0 than the second one. 14 of the 20 speakers use this high–low pattern in all their The work reported in this paper has been car- productions. These are the 5 non-tonal speakers ried out within the research project Separating Sp1, Sp3, Sp5, Sp6, Sp10 and the 9 tonal intonation from tone (SIFT), supported by the speakers Sp2, Sp4, Sp13, Sp16, Sp20, Sp21, bank of Sweden Tercentenary Foundation (RJ), Sp22, Sp24, Sp25. Two speakers, Sp9 (non- with additional funding from Crafoordska tonal male) and Sp17 (tonal female) use a com- stiftelsen. pletely different tone pattern, low–high. 
The remaining speakers mix the patterns, Sp8 and References Sp18 use high–low for blia-blia but low–high Duanmu, San (2000) The phonology of Stan- for thaw-thaw, and the two speakers Sp11 and Sp23 seem to mix them more or less randomly. dard Chinese. Oxford: Oxford University As seen in table 1, the difference in F0 between Press. the first and second copy is statistically signifi- House, David; Karlsson, Anastasia; Svantes- cant in the majority of cases, especially for son, Jan-Olof and Tayanin, Damrong those speakers who always follow the expected (forthc.) The phrase-final accent in Kammu: high–low pattern. Some of the non-significant effects of tone, focus and engagement. Pa- results can probably be explained by the large per submitted to InterSpeech 2009. variation and the small number of measure- Svantesson, Jan-Olof (1983) Kammu phonol- ments for each speakers. ogy and morphology. Lund: Gleerup The second question is whether or not the Svantesson, Jan-Olof and House, David (2006) tonal speakers retain the tone difference in the Tone production, tone perception and reduplicated form. In the last column in table 1, Kammu tonogenesis. Phonology 23, 309– we show the difference between the mean F0 333. values (on both copies) of the productions of thaw-thaw/tháw-tháw and blia-blia/plìa-plìa. For 11 of the 12 speakers of tonal dialects, F0 was, on the average, higher on tháw-tháw than on plìa-plìa, but only 2 of the 8 speakers of non- tonal dialects had higher F0 on thaw-thaw than on blia-blia. An exact binomial test shows that the F0 difference is significant (p = 0.0032) for the tonal speakers but not for the non-tonal ones (p = 0.144). One might ask why the majority of non- tonal speakers have higher F0 on blia-blia than on thaw-thaw. One possible reason is that blia- blia was always recorded before thaw-thaw, and this may have led to higher engagement when


Table 1. F0 means for each reduplicated word and each speaker. The columns n1 and n2 show the number of repetitions of the reduplicated words, 1st and 2nd refer to the F0 means in the first and second copy of the reduplicated adjective, and diff is the difference between them. The test column shows the results of t-tests for this difference (df = n1 + n2 – 2). The column difference shows the F0 difference between each speaker’s productions of thaw-thaw and blia-blia (means of first and second copy).

                blia/plìa                        thaw/tháw
        n1  1st  2nd  diff  test         n2  1st  2nd  diff  test        difference

non-tonal male
Sp1      3  178  120   58  p < 0.001     2  172  124   48  p < 0.05          –1
Sp3      3  149  115   34  p < 0.001     3  147  109   38  p < 0.001         –4
Sp5      3  216  151   65  p < 0.01      3  205  158   47  p < 0.01          –1
Sp6      3  186  161   25  p < 0.01      3  181  142   39  p < 0.05          –9
Sp8      3  176  146   30  n.s.          1  155  172  –17  —                  2
Sp9      3  126  147  –21  n.s.          3  105  127  –22  p < 0.05         –20

non-tonal female
Sp10     2  291  232   59  n.s.          2  287  234   53  n.s.              –2
Sp11     3  235  224   11  n.s.          3  232  234   –2  n.s.               3

tonal male
Sp13     2  173  140   33  n.s.          3  213  152   61  p < 0.05          25
Sp20     4  119  106   13  n.s.          3  136  119   17  n.s.              14
Sp22     3  192  136   57  p < 0.05      3  206  134   72  n.s.               6
Sp23     2  190  192   –2  n.s.          2  207  210   –3  n.s.              17
Sp24     3  159  132   27  p < 0.01      3  159  129   30  p < 0.01          –2

tonal female
Sp2      6  442  246  196  p < 0.001     6  518  291  227  p < 0.001         61
Sp4      5  253  202   51  n.s.          6  257  232   25  p < 0.05          17
Sp16     3  326  211  115  p < 0.05      3  351  250  101  p < 0.05          31
Sp17     3  236  246  –10  n.s.          3  251  269  –18  n.s.              19
Sp18     3  249  208   41  p < 0.05      3  225  236  –11  n.s.               2
Sp21     5  339  210  129  p < 0.001     6  316  245   71  p < 0.001          6
Sp25     3  240  231    9  n.s.          3  269  263    6  p < 0.01          31
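The exact binomial test on these per-speaker directions can be reproduced with a few lines of code. This is a sketch of a one-sided sign test on the minority count, which is our reading of the test used, since it matches the reported p-values:

```python
from math import comb

def sign_test(k, n):
    """One-sided exact binomial test under H0: p = 0.5.
    k of n speakers show the effect in the expected direction;
    returns the tail probability of the observed minority count."""
    m = min(k, n - k)
    return sum(comb(n, i) for i in range(m + 1)) / 2 ** n

# Tonal dialects: 11 of 12 speakers had higher F0 on tháw-tháw
print(round(sign_test(11, 12), 4))  # 0.0032
# Non-tonal dialects: 2 of 8 speakers had higher F0 on thaw-thaw
print(round(sign_test(2, 8), 3))    # 0.145 (reported as 0.144)
```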


Exploring data driven parametric synthesis

Rolf Carlson 1, Kjell Gustafson 1,2
1 KTH, CSC, Department of Speech, Music and Hearing, Stockholm, Sweden
2 Acapela Group Sweden AB, Solna, Sweden

Abstract

This paper describes our work on building a formant synthesis system based on both rule generated and database driven methods. Three parametric synthesis systems are discussed: our traditional rule based system, a speaker adapted system, and finally a gesture system. The gesture system is a further development of the adapted system in that it includes concatenated formant gestures from a data-driven unit library. The systems are evaluated technically, comparing the formant tracks with an analysed test corpus. The gesture system results in a 25% error reduction in the formant frequencies due to the inclusion of the stored gestures. Finally, a perceptual evaluation shows a clear advantage in naturalness for the gesture system compared to both the traditional system and the speaker adapted system.

Introduction

Current speech synthesis efforts, both in research and in applications, are dominated by methods based on concatenation of spoken units. Research on speech synthesis is to a large extent focused on how to model efficient unit selection and unit concatenation and how optimal databases should be created. The traditional research efforts on formant synthesis and articulatory synthesis have been significantly reduced to a very small discipline due to the success of waveform based methods. Despite the well motivated current research path resulting in high quality output, some efforts on parametric modelling are carried out at our department. The main reasons are flexibility in speech generation and a genuine interest in the speech code. We try to combine corpus based methods with knowledge based models and to explore the best features of each of the two approaches. This report describes our progress in this synthesis work.

Parametric synthesis

Underlying articulatory gestures are not easily transformed to the acoustic domain described by a formant model, since the articulatory constraints are not directly included in a formant-based model. Traditionally, parametric speech synthesis has been based on very labour-intensive optimization work. The notion of analysis by synthesis has not been explored except by manual comparisons between hand-tuned spectral slices and a reference spectrum. When increasing our ambitions to multi-lingual, multi-speaker and multi-style synthesis, it is obvious that we want to find at least semi-automatic methods to collect the necessary information, using speech and language databases. The work by Holmes and Pearce (1990) is a good example of how to speed up this process. With the help of a synthesis model, the spectra are automatically matched against analysed speech. Automatic techniques such as this will probably also play an important role in making speaker-dependent adjustments. One advantage of these methods is that the optimization is done in the same framework as that to be used in the production. The synthesizer constraints are thus already imposed in the initial state.

If we want to keep the flexibility of the formant model but reduce the need for detailed formant synthesis rules, we need to extract formant synthesis parameters directly from a labelled corpus. Already more than ten years ago, at Interspeech in Australia, Mannell (1998) reported a promising effort to create a diphone library for formant synthesis. The procedure included a speaker-specific extraction of formant frequencies from a labelled database. In a sequence of papers from Utsunomiya University, Japan, automatic formant tracking has been used to generate speech synthesis of high quality using formant synthesis and an elaborate voice source (e.g. Mori et al., 2002). Hertz (2002) and Carlson and Granström (2005) report recent research efforts to combine data-driven and rule-based methods. The approaches take advantage of the fact that a unit library can better model detailed gestures than the general rules.

In a few cases we have seen a commercial interest in speech synthesis using the formant model. One motivation is the need to generate speech using a very small footprint. Perhaps formant synthesis will again be an important research subject because of its flexibility and also because of how the formant synthesis approach can be compressed into a limited application environment.

A combined approach for acoustic speech synthesis

The efforts to combine data-driven and rule-based methods in the KTH text-to-speech system have been pursued in several projects. In a study by Högberg (1997), formant parameters were extracted from a database and structured with the help of classification and regression trees. The synthesis rules were adjusted according to predictions from the trees. In an evaluation experiment the synthesis was tested and judged to be more natural than the original rule-based synthesis.

Sjölander (2001) expanded the method into replacing complete formant trajectories with manually extracted values, and also included consonants. According to a feasibility study, this synthesis was perceived as more natural sounding than the rule-only synthesis (Carlson et al., 2002). Sigvardson (2002) developed a generic and complete system for unit selection using regression trees, and applied it to data-driven formant synthesis. In Öhlin & Carlson (2004) the rule system and the unit library are more clearly separated, compared to our earlier attempts. However, by keeping the rule-based model we also keep the flexibility to make modifications and the possibility to include both linguistic and extra-linguistic knowledge sources.

Figure 1 illustrates the approach in the KTH text-to-speech system. A database is used to create a unit library, and the library information is mixed with the rule-driven parameters. Each unit is described by a selection of extracted synthesis parameters together with linguistic information about the unit’s original context and linguistic features such as stress level. The parameters can be extracted automatically and/or edited manually.

In our traditional text-to-speech system the synthesizer is controlled by rule-generated parameters from the text-to-parameter module (Carlson et al., 1982). The parameters are represented by time and value pairs including labels and prosodic features such as duration and intonation. In the current approach some of the rule-generated parameter values are replaced by values from the unit library. The process is controlled by the unit selection module, which takes into account not only parameter information but also linguistic features supplied by the text-to-parameter module. The parameters are normalized and concatenated before being sent to the GLOVE synthesizer (Carlson et al., 1991).

Figure 1. Rule-based synthesis system using a data-driven unit library. [Block diagram with components: input text, text-to-parameter analysis, rule-generated parameters, database, extracted parameters and features, unit library, unit selection, unit-controlled parameters, concatenation, synthesizer, speech output.]

Creation of a unit library

In the current experiments a male speaker recorded a set of 2055 diphones in a nonsense word context. A unit library was then created based on these recordings.

When creating a unit library of formant frequencies, automatic methods of formant extraction are of course preferred, due to the amount of data that has to be processed. However, available methods do not always perform adequately. With this in mind, an improved formant extraction algorithm, using segmentation information to lower the error rate, was developed (Öhlin & Carlson, 2004). It is akin to the algorithms described in Lee et al. (1999), Talkin (1989) and Acero (1999).

Segmentation and alignment of the waveform were first performed automatically with nAlign (Sjölander, 2003). Manual correction was required, especially on vowel–vowel transitions. The waveform is divided into (overlapping) time frames of 10 ms. At each frame, an LPC model of order 30 is created; the poles are then searched through with the Viterbi algorithm in order to find the path (i.e. the formant trajectory) with the lowest cost. The cost is defined as the weighted sum of a number of partial costs: the bandwidth cost, the frequency deviation cost, and the frequency change cost.
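A toy sketch of this kind of Viterbi search over per-frame pole candidates is given below. The candidate (frequency, bandwidth) pairs, the cost weights and the reference frequency are all invented for illustration; the actual values and cost definitions of the Öhlin & Carlson tracker are not reproduced here.

```python
# Toy Viterbi search over per-frame formant candidates. In a real tracker
# the candidates would be LPC pole frequencies and bandwidths per 10 ms frame.
W_BAND, W_DEV, W_CHANGE = 1.0, 0.001, 0.01  # invented cost weights
F_REF = 500.0  # assumed reference frequency for one formant (Hz)

def track(frames):
    """frames: list (one entry per frame) of [(freq_hz, bandwidth_hz), ...].
    Returns the frequency sequence with the lowest summed cost."""
    def local(f, bw):
        # bandwidth cost + squared deviation from the reference frequency
        return W_BAND * bw + W_DEV * (f - F_REF) ** 2

    cost = [local(f, bw) for f, bw in frames[0]]
    back = [None]  # back[t][i] = best predecessor of candidate i at frame t
    for t in range(1, len(frames)):
        prev = frames[t - 1]
        new_cost, ptr = [], []
        for f, bw in frames[t]:
            # frequency change cost penalizes jumps between adjacent frames
            trans = [cost[j] + W_CHANGE * abs(f - prev[j][0])
                     for j in range(len(prev))]
            j = min(range(len(trans)), key=trans.__getitem__)
            new_cost.append(local(f, bw) + trans[j])
            ptr.append(j)
        cost = new_cost
        back.append(ptr)

    # backtrack from the cheapest final candidate
    i = min(range(len(cost)), key=cost.__getitem__)
    path = []
    for t in range(len(frames) - 1, 0, -1):
        path.append(frames[t][i][0])
        i = back[t][i]
    path.append(frames[0][i][0])
    return path[::-1]

frames = [[(480, 60), (1500, 200)],
          [(500, 70), (1480, 90)],
          [(520, 50), (1450, 300)]]
print(track(frames))  # [480, 500, 520]
```

The smoothness constraint comes entirely from the transition term: a low-bandwidth candidate far from the previous frame's frequency can still lose to a slightly broader but continuous one.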


The bandwidth cost is equal to the bandwidth in Hertz. The frequency deviation cost is defined as the square of the distance to a given reference frequency, which is formant, speaker, and phoneme dependent. This requires the labelling of the input before the formant tracking is carried out. Finally, the frequency change cost penalizes rapid changes in formant frequencies to make sure that the extracted trajectories are smooth.

Although only the first four formants are used in the unit library, five formants are extracted. The fifth formant is then discarded. The justification for this is to ensure reasonable values for the fourth formant. The algorithm also introduces eight times over-sampling before the averaging, giving a reduction of the variance of the estimated formant frequencies. After the extraction, the data is down-sampled to 100 Hz.

Synthesis Systems

Three parametric synthesis systems were explored in the experiments described below. The first was our rule-based traditional system, which has been used for many years in our group as a default parametric synthesis system. It includes rules for both prosodic and context dependent segment realizations. Several methods to create formant trajectories have been explored during the development of this system. Currently simple linear trajectories in a logarithmic domain are used to describe the formants. Slopes and target positions are controlled by the transformation rules.

The second rule system, the adapted system, was based on the traditional system and adapted to a reference speaker. This speaker was also used to develop the data-driven unit library. Default formant values for each vowel were estimated based on the unit library, and the default rules in the traditional system were changed accordingly. It is important to emphasize that it is the vowel space that was data driven and adapted to the reference speaker and not the rules for contextual variation.

Finally, the third synthesis system, the gesture system, was based on the adapted system, but includes concatenated formant gestures from the data-driven unit library. Thus, both the adapted system and the gesture system are data-driven systems with varying degree of mix between rules and data. The next section will discuss in more detail the concatenation process that we employed in our experiments.

Figure 2. Mixing proportions between a unit and a rule generated parameter track. X=100% equals the phoneme duration. [Diagram: the unit proportion falls from 100% to 0%, and the rule proportion rises from 0% to 100%, across x% of the phoneme.]

Parameter concatenation

The concatenation process in the gesture system is a simple linear interpolation between the rule generated formant data and the possible joining units from the library. At the phoneme border the data is taken directly from the unit. The impact of the unit data is gradually reduced inside the phoneme. At a position X the influence of the unit has been reduced to zero (Figure 2). The X value is calculated relative to the segment duration and measured in % of the segment duration. The parameters in the middle of a segment are thus dependent on both rules and two units.

Technical evaluation

A test corpus of 313 utterances was selected to compare predicted and estimated formant data and analyse how the X position influences the difference. The utterances were collected in the IST project SpeeCon (Großkopf et al., 2002) and the speaker was the same as the reference speaker behind the unit library. As a result, the adapted system also has the same reference speaker. In total 4853 phonemes (60743 10 ms frames) including 1602 vowels (17508 frames) were used in the comparison.

A number of versions of each utterance were synthesized, using the traditional system, the adapted system and the unit system with varying values of X percent. The label files from the SpeeCon project were used to make the duration of each segment equal to the recordings. An X value of zero in the unit system will have the same formant tracks as the adapted system. Figure 3 shows the results of calculating the city-block distance between the synthesized and measured first three formants in the vowel frames.
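The linear crossfade used in the concatenation step can be sketched as follows for one side of a phoneme. This is a reconstruction, not the KTH implementation: in the real system a unit joins at each phoneme border, so the middle of a segment mixes the rules with two units, and the function and variable names here are ours.

```python
def mix_unit_and_rule(rule_track, unit_track, x_percent):
    """Linearly fade a unit parameter track into a rule-generated one.
    Both tracks span one phoneme, frame by frame. The first frame (the
    phoneme border) is taken entirely from the unit; the unit's weight
    falls linearly to zero at x_percent of the phoneme duration."""
    n = len(rule_track)
    fade = max(1, round(n * x_percent / 100))
    out = []
    for i, (r, u) in enumerate(zip(rule_track, unit_track)):
        w = max(0.0, 1.0 - i / fade)  # unit weight: 1 at the border, 0 at X
        out.append(w * u + (1.0 - w) * r)
    return out

# Invented tracks: a constant rule value of 100 and a unit value of 200
print(mix_unit_and_rule([100, 100, 100, 100], [200, 200, 200, 200], 50))
# [200.0, 150.0, 100.0, 100.0]
```

With X = 0 the output equals the rule track throughout, which matches the statement above that an X value of zero reproduces the adapted system.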


Figure 3. Comparison between synthesized and measured data (frame by frame). [Bar chart: frame-by-frame city-block distance in Hz (three formants mean) for the adapted system, the gesture system at X = 20%, 40%, 60%, 70%, 80% and 100%, and the traditional system.]

Figure 4 presents a detailed analysis of the data for the unit system with X=70%. The first formant has an average distance of 68 Hz with a standard deviation of 43 Hz. Corresponding data for F2 is (107 Hz, 81 Hz), F3 (111 Hz, 68 Hz) and F4 (136 Hz, 67 Hz).

Clearly the adapted speaker has a quite different vowel space compared to the traditional system. Figure 5 presents the distance calculated on a phoneme-by-phoneme basis. The corresponding standard deviations are 66 Hz, 58 Hz and 46 Hz for the three systems.

As expected, the difference between the traditional system and the adapted system is quite large. The gesture system results in about a 25% error reduction in the formant frequencies due to the inclusion of the stored gestures. However, whether this reduction corresponds to a difference in perceived quality cannot be predicted on the basis of these data. The difference between the adapted and the gesture system is quite interesting and of the same magnitude as the adaptation data. The results clearly indicate how the gesture system is able to mimic the reference speaker in more detail than the rule-based system. The high standard deviation indicates that a more detailed analysis should be performed to find the problematic cases. Since the test data as usual is hampered by errors in the formant tracking procedures, we will inherently introduce an error in the comparison. In a few cases, despite our efforts, we have a problem with pole and formant number assignments.

Figure 4. Comparisons between synthesized and measured data for each formant (phoneme by phoneme). [Bar chart: distance in Hz for F1–F4 for the traditional, adapted and gesture 70% systems.]
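The frame-wise distance measure behind these comparisons, city-block over the first three formants with a three-formants mean, can be sketched as below. This is our reconstruction of the metric as described, not the evaluation code itself.

```python
def cityblock_formant_distance(measured, synthesized):
    """Mean city-block (L1) distance in Hz between measured and synthesized
    formant frequencies, averaged over frames and the three formants.
    Both arguments: lists of (F1, F2, F3) tuples, one per 10 ms frame."""
    total = sum(abs(m - s)
                for mf, sf in zip(measured, synthesized)
                for m, s in zip(mf, sf))
    return total / (3 * len(measured))

# Invented example frames
measured = [(500, 1500, 2500), (510, 1490, 2520)]
synthesized = [(530, 1440, 2560), (540, 1450, 2500)]
print(cityblock_formant_distance(measured, synthesized))  # 40.0
```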

Figure 5. Comparison between synthesized and measured data (phoneme by phoneme). [Bar chart: phoneme-by-phoneme city-block distance in Hz (three formants mean) for the traditional, adapted and gesture 70% systems.]

Perceptual evaluation

A pilot test was carried out to evaluate the naturalness of the three synthesis systems: traditional, adapted and gesture. 9 subjects working in the department were asked to rank the three systems according to perceived naturalness using a graphic interface. The subjects had been exposed to parametric speech synthesis before. Three versions of twelve utterances, including single words, numbers and sentences, were ranked. The traditional rule-based prosodic model was used for all stimuli. In total 324 = 3*12*9 judgements were collected. The result of the ranking is presented in Figure 6.


The perceptual results showed an advantage in naturalness for the gesture system, which includes both speaker adaptation and a diphone database of formant gestures, compared to both the traditional reference system and the speaker adapted system. However, it is also apparent from the synthesis quality that a lot of work still needs to be put into the automatic building of a formant unit library.

Figure 6. Rank distributions for the traditional, adapted and gesture 70% systems. [Bar chart: responses (%) at bottom, middle and top rank.]

The outcome of the experiment should be considered with some caution due to the selection of the subject group. However, the results indicate that the gesture system has an advantage over the other two systems and that the adapted system is ranked higher than the traditional system. The maximum rankings are 64%, 72% and 71% for the traditional, adapted and gesture systems, respectively. Our initial hypothesis was that these systems would be ranked with the traditional system at the bottom and the gesture system at the top. This is in fact true in 58% of the cases with a standard deviation of 21%. One subject contradicted this hypothesis in only one out of 12 cases, while another subject did the same in as many as 9 cases. The hypothesis was confirmed by all subjects for one utterance and by only one subject for another one.

The adapted system is based on data from the diphone unit library and was created to form a homogeneous base for combining rule-based and unit-based synthesis as smoothly as possible. It is interesting that even these first steps, creating the adapted system, are regarded to be an improvement. The diphone library has not yet been matched to the dialect of the reference speaker, and a number of diphones are missing.

Final remarks

This paper describes our work on building formant synthesis systems based on both rule-generated and database driven methods. The technical and perceptual evaluations show that this approach is a very interesting path to explore further, at least in a research environment.

Acknowledgements

The diphone database was recorded using the WaveSurfer software. David Öhlin contributed in building the diphone database. We thank John Lindberg and Roberto Bresin for making the evaluation software available for the perceptual ranking. The SpeeCon database was made available by Kjell Elenius. We thank all subjects for their participation in the perceptual evaluation.

References

Acero, A. (1999) “Formant analysis and synthesis using hidden Markov models”, In: Proc. of Eurospeech ’99, pp. 1047-1050.
Carlson, R., and Granström, B. (2005) “Data-driven multimodal synthesis”, Speech Communication, Volume 47, Issues 1-2, September-October 2005, Pages 182-193.
Carlson, R., Granström, B., and Karlsson, I. (1991) “Experiments with voice modelling in speech synthesis”, Speech Communication, 10, 481-489.
Carlson, R., Granström, B., Hunnicutt, S. (1982) “A multi-language text-to-speech module”, In: Proc. of the 7th International Conference on Acoustics, Speech, and Signal Processing (ICASSP’82), Paris, France, vol. 3, pp. 1604-1607.
Carlson, R., Sigvardson, T., Sjölander, A. (2002) “Data-driven formant synthesis”, In: Proc. of Fonetik 2002, Stockholm, Sweden, STL-QPSR 44, pp. 69-72.
Großkopf, B., Marasek, K., v. d. Heuvel, H., Diehl, F., Kiessling, A. (2002) “SpeeCon - speech data for consumer devices: Database specification and validation”, Proc. LREC.
Hertz, S. (2002) “Integration of Rule-Based Formant Synthesis and Waveform Concatenation: A Hybrid Approach to Text-to-Speech Synthesis”, In: Proc. IEEE 2002 Workshop on Speech Synthesis, 11-13 September 2002, Santa Monica, USA.


Högberg, J. (1997) “Data driven formant synthesis”, In: Proc. of Eurospeech 97.
Holmes, W. J. and Pearce, D. J. B. (1990) “Automatic derivation of segment models for synthesis-by-rule”, Proc. ESCA Workshop on Speech Synthesis, Autrans, France.
Lee, M., van Santen, J., Möbius, B., Olive, J. (1999) “Formant Tracking Using Segmental Phonemic Information”, In: Proc. of Eurospeech ’99, Vol. 6, pp. 2789–2792.
Mannell, R. H. (1998) “Formant diphone parameter extraction utilising a labeled single speaker database”, In: Proc. of ICSLP 98.
Mori, H., Ohtsuka, T., Kasuya, H. (2002) “A data-driven approach to source-formant type text-to-speech system”, In: ICSLP-2002, pp. 2365-2368.
Öhlin, D. (2004) “Formant Extraction for Data-driven Formant Synthesis” (in Swedish). Master Thesis, TMH, KTH, Stockholm.
Öhlin, D., Carlson, R. (2004) “Data-driven formant synthesis”, In: Proc. Fonetik 2004, pp. 160-163.
Sigvardson, T. (2002) “Data-driven Methods for Parameter Synthesis – Description of a System and Experiments with CART-Analysis” (in Swedish). Master Thesis, TMH, KTH, Stockholm, Sweden.
Sjölander, A. (2001) “Data-driven Formant Synthesis” (in Swedish). Master Thesis, TMH, KTH, Stockholm.
Sjölander, K. (2003) “An HMM-based System for Automatic Segmentation and Alignment of Speech”, In: Proc. of Fonetik 2003, Umeå Universitet, Umeå, Sweden, pp. 93–96.
Talkin, D. (1989) “Looking at Speech”, In: Speech Technology, No 4, April/May 1989, pp. 74–77.


Uhm… What’s going on? An EEG study on perception of filled pauses in spontaneous Swedish speech

Sebastian Mårback 1, Gustav Sjöberg 1, Iris-Corinna Schwarz 1 and Robert Eklund 2,3
1 Dept of Linguistics, Stockholm University, Stockholm, Sweden
2 Dept of Clinical Neuroscience, Karolinska Institute/Stockholm Brain Institute, Stockholm, Sweden
3 Voice Provider Sweden, Stockholm, Sweden

Abstract Corley, MacGregor & Donaldson (2007) showed that the presence of filled pauses in Filled pauses have been shown to play a utterances correlated with memory and significant role in comprehension and long- perception improvement. In an event-related term storage of speech. Behavioral and potential (ERP) study on memory, recordings neurophysiological studies suggest that filled of utterances with filled pauses before target pauses can help mitigate semantic and/or words were played back to the subjects. syntactic incongruity in spoken language. The Recordings of utterances with silent pauses purpose of the present study was to explore were used as comparison. In a subsequent how filled pauses affect the processing of memory test subjects had to report whether spontaneous speech in the listener. Brain target words, presented to them one at a time, activation of eight subjects was measured by had occurred during the previous session or not. electroencephalography (EEG), while they The subjects were more successful in listened to recordings of Wizard-of-Oz travel recognizing words preceded by filled pauses. booking dialogues. EEG scans were performed starting at the onset The results show a P300 component in the of the target words. A clearly discernable N400 Primary Motor Cortex, but not in the Broca or component was observed for semantically Wernicke areas. A possible interpretation could unpredictable words as opposed to predictable be that the listener is preparing to engage in ones. This effect was significantly reduced speech. However, a larger sample is currently when the words were preceded by filled pauses. being collected. These results suggest that filled pauses can affect how the listener processes spoken Introduction language and have long-term consequences for Spontaneous speech contains not only words the representation of the message. 
with lexical meaning and/or grammatical Osterhout & Holcomb (1992) reported from function but also a considerable number of an EEG experiment where subjects were elements, commonly thought of as not part of presented with written sentences containing the linguistic message. These elements include either transitive or intransitive verbs. In some so-called disfluencies, some of which are filled of the sentences manipulation produced a pauses, repairs, repetitions, prolongations, garden path sentence which elicited a P600 truncations and unfilled pauses (Eklund, 2004). wave in the subjects, indicating that P600 is The term ‘filled pause’ is used to describe non- related to syntactic processing in the brain. words like “uh” and “uhm”, which are common Kutas & Hillyard (1980) presented subjects in spontaneous speech. In fact they make up with sentences manipulated according to degree around 6% of words in spontaneous speech of semantic congruity. Congruent sentences in (Fox Tree, 1995; Eklund, 2004). which the final word was contextually Corley & Hartsuiker (2003) also showed predictable elicited different ERPs than that filled pauses can increase listeners’ incongruent sentences containing unpredictable attention and help them interpret the following final words. Sentences that were semantically utterance segment. Subjects were asked to press incongruent elicited a clear N400 whereas buttons according to instructions read out to congruent sentences did not. them. When the name of the button was We predicted that filled pauses evoke either preceded by a filled pause, their response time an N400 or a P600 potential as shown in the was shorter than when it was not preceded by a studies above. This hypothesis has explanatory filled pause. value for the mechanisms of the previously

mentioned attention-enhancing function of filled pauses (Corley & Hartsuiker, 2003). Moreover, the present study is explorative in nature in that it uses spontaneous speech, in contrast to most previous EEG studies of speech perception.

Given the present knowledge of the effect of filled pauses on listeners' processing of subsequent utterance segments, it is clear that direct study of the immediate neurological reactions to filled pauses proper is of interest. The aim of this study was to examine listeners' neural responses to filled pauses in Swedish speech. Cortical activity was recorded using EEG while the subjects listened to spontaneous speech in travel booking dialogs.

Method

Subjects
The study involved eight subjects (six men and two women) with a mean age of 39 years and an age range of 21 to 73 years. All subjects were native speakers of Swedish and reported typical hearing capacity. Six of the subjects were right-handed, while two considered themselves to be left-handed. Subjects were paid a small reward for their participation.

Apparatus
The cortical activation of the subjects was recorded using instruments from Electrical Geodesics Inc. (EGI), consisting of a Hydrocel GSN Sensor Net with 128 electrodes. These high-impedance net types permit EEG measurement without requiring gel application, which allows fast and convenient testing. The amplifier Net Amps 300 amplified the signal from the high-impedance nets. To record and analyze the EEG data, the EGI software Net Station 4.2 was used. The experiment was programmed in Psychology Software Tools' E-Prime 1.2.

Stimuli
The stimuli consisted of high-fidelity audio recordings from arranged phone calls to a travel booking service. The recordings were made in a "Wizard-of-Oz" setup using speakers (two males/two females) who were asked to make travel bookings according to instructions (see Eklund, 2004, section 3.4 for a detailed description of the data collection). The dialogs were edited so that only the party booking the trip (customer/client) was heard and the responding party's (agent's) speech was replaced with silence. The exact times of a total of 54 filled pauses of varying duration (200 to 1100 ms) were noted. Of these, 37 were utterance-initial and 17 were utterance-medial. The times were used to manually identify the corresponding sequences in the EEG scans, which was necessary due to the nature of the stimuli. ERP data from a period of 1000 ms starting at stimulus onset were selected for analysis.

Procedure
The experiment was conducted in a sound-attenuated, radio-wave-insulated and softly lit room with subjects seated in front of a monitor and a centrally positioned loudspeaker. Subjects were asked to remain as still as possible, to blink as little as possible, and to keep their eyes fixed on the screen. The subjects were instructed to imagine that they were taking part in the conversation − assuming the role of the agent in the travel booking setting − but to remain silent. The total duration of the sound files was 11 min and 20 sec. The experimental session contained three short breaks, offering the subjects the opportunity to correct for any seating discomfort.

Processing of data
In order to analyze the EEG data for ERPs, several stages of data processing were required. A band-pass filter was set to 0.3−30 Hz to remove body-movement artefacts and eye blinks. A period of 100 ms immediately prior to stimulus onset was used as baseline. The data segments were then divided into three groups, each 1100 ms long, representing utterance-initial filled pauses, utterance-medial filled pauses and all filled pauses, respectively. Data with artefacts caused by bad electrode channels and muscle movements such as blinking were removed and omitted from analysis. Bad channels were then replaced with interpolated values from other electrodes in their vicinity. The cortex areas of interest roughly corresponded to Broca's area (electrodes 28, 34, 35, 36, 39, 40, 41, 42), Wernicke's area (electrodes 52, 53, 54, 59, 60, 61, 66, 67), and Primary Motor Cortex (electrodes 6, 7, 13, 29, 30, 31, 37, 55, 80, 87, 105, 106, 111, 112). The average voltage of each session was recorded and used as a subjective zero, as shown in Figure 1.
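In the study these processing steps were performed in Net Station; purely as an illustration, the same pipeline (band-pass filtering at 0.3−30 Hz, a 100 ms pre-stimulus baseline, 1100 ms epochs, averaging) can be sketched in Python with NumPy/SciPy. The sampling rate is an assumption, as the paper does not state it:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250  # sampling rate in Hz (assumed; not stated in the paper)

def bandpass(eeg, low=0.3, high=30.0, fs=FS):
    """Band-pass filter one channel to 0.3-30 Hz, as in the paper."""
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg)

def epoch(eeg, onsets_s, fs=FS, pre_s=0.1, post_s=1.1):
    """Cut 1100 ms segments at each stimulus onset and subtract the
    mean of the 100 ms pre-stimulus baseline from each segment."""
    pre, post = int(pre_s * fs), int(post_s * fs)
    segments = []
    for t in onsets_s:
        i = int(t * fs)
        if i - pre < 0 or i + post > len(eeg):
            continue  # skip onsets too close to the recording edges
        seg = eeg[i:i + post] - eeg[i - pre:i].mean()
        segments.append(seg)
    return np.stack(segments)

def erp(eeg, onsets_s):
    """Average the baseline-corrected segments into one ERP curve."""
    return epoch(bandpass(eeg), onsets_s).mean(axis=0)
```

Artefact rejection and bad-channel interpolation, as described above, would precede the averaging step in a full pipeline.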




Figure 1. Sensor net viewed from above (nose up). Groups of electrodes roughly corresponding to Broca's area marked in black, Wernicke's area in bright grey and Primary Motor Cortex in dark grey.

Finally, nine average curves were calculated on the basis of selected electrode groups. The groups were selected according to their scalp localization and to visual data inspection.

Results
After visual inspection of the single-electrode curves, only curves generated by the initial filled pauses (Figure 2) were selected for further analysis. No consistent pattern could be detected in the other groups. In addition to the electrode sites that can be expected to reflect language-related brain activation in Broca's and Wernicke's areas, a group of electrodes located at and around Primary Motor Cortex was also selected for analysis, as a P300 component was observed in that area. The P300 peaked at 284 ms with 2.8 µV. The effect differed significantly (p<.001) from the baseline average. In the Wernicke area a later and weaker − however statistically significant (p<.05) − potential was observed than in Primary Motor Cortex. At around 300 ms post-onset a negative tendency was visible, which turned to a very weak positive trend after approximately 800 ms. No clear trend for positive or negative voltage could be observed in the Broca area.

Figure 2. The average curves of initial filled pauses. Positive voltages in µV shown up. The intersection of the y-axis and the x-axis marks the baseline. A P300 component, peaking at 284 ms with 2.8 µV, is observed in Primary Motor Cortex (PMC). In Wernicke's area (W) the activation occurs somewhat later and the reaction is considerably weaker. In Broca's area (B) no consistent activation is observed.
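A component peak such as the P300 reported above can be characterized by locating the largest positive deflection of the averaged curve within a post-onset search window. A minimal sketch (the 200−400 ms window is our assumption, not a parameter from the study):

```python
import numpy as np

def find_peak(erp_uv, fs, window=(0.2, 0.4)):
    """Return (latency_ms, amplitude_uV) of the largest positive
    deflection of an averaged ERP curve inside a search window,
    e.g. 200-400 ms post-onset for a P300-like component."""
    lo, hi = int(window[0] * fs), int(window[1] * fs)
    i = lo + int(np.argmax(erp_uv[lo:hi]))
    return 1000.0 * i / fs, erp_uv[i]
```

The window bounds what counts as "the" component, so it should be chosen from prior literature rather than read off the data.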


Discussion
Contrary to our initial hypotheses, no palpable activation was observed in either Broca- or Wernicke-related areas.
However, the observed − and somewhat unexpected − effect in Primary Motor Cortex is no less interesting. The presence of a P300 component in or around the Primary Motor Cortex could suggest that the listener is preparing to engage in speech, and that filled pauses could act as a cue to the listener to initiate speech.
The fact that it can often be difficult to determine where the boundary between medial filled pauses and the rest of the utterance lies could provide an explanation as to why it is difficult to discern distinct ERPs connected to medial filled pauses.
In contrast to the aforementioned study by Corley, MacGregor & Donaldson (2007), which used the word following the filled pause as ERP onset and examined whether the presence of a filled pause helped the listener or not, this study instead examined the immediate neurological response to the filled pause per se.
Previous research material has mainly consisted of laboratory speech. However, it is potentially problematic to use results from such studies to draw far-reaching conclusions about the processing of natural speech. In this respect the present study − using spontaneous speech − differs from past studies, although ecologically valid speech material makes it harder to control for confounding effects.
A larger number of participants could result in more consistent results and a stronger effect. Future studies could compare ERPs generated by utterance-initial filled pauses on the one hand and initial function words and/or initial content words on the other, as syntactically related function words and semantically related content words have been shown to generate different ERPs (Kutas & Hillyard, 1980; Luck, 2005; Osterhout & Holcomb, 1992). Results from such studies could provide information about how the brain deals with filled pauses in terms of semantics and syntax.

Acknowledgements
The data and basic experimental design used in this paper were provided by Robert Eklund as part of his post-doctoral fMRI project on disfluency perception at Karolinska Institute/Stockholm Brain Institute, Dept. of Clinical Neuroscience, MR-Center, with Professor Martin Ingvar as supervisor. This study was further funded by the Swedish Research Council (VR 421-2007-6400) and the Knut and Alice Wallenberg Foundation (KAW 2005.0115).

References
Corley, M. & Hartsuiker, R.J. (2003). Hesitation in speech can… um… help a listener understand. In Proceedings of the 25th Meeting of the Cognitive Science Society, 276−281.
Corley, M., MacGregor, L.J. & Donaldson, D.I. (2007). It's the way that you, er, say it: Hesitations in speech affect language comprehension. Cognition, 105, 658−668.
Eklund, R. (2004). Disfluency in Swedish human-human and human-machine travel booking dialogues. PhD thesis, Linköping Studies in Science and Technology, Dissertation No. 882, Department of Computer and Information Science, Linköping University, Sweden.
Fox Tree, J.E. (1995). The effect of false starts and repetitions on the processing of subsequent words in spontaneous speech. Journal of Memory and Language, 34, 709−738.
Kutas, M. & Hillyard, S.A. (1980). Reading senseless sentences: Brain potentials reflect semantic incongruity. Science, 207, 203−205.
Luck, S.J. (2005). An introduction to the event-related potential technique. Cambridge, MA: MIT Press.
Osterhout, L. & Holcomb, P.J. (1992). Event-related potentials elicited by syntactic anomaly. Journal of Memory and Language, 31, 785−806.


HöraTal – a test and training program for children who have difficulties in perceiving and producing speech

Anne-Marie Öster
Speech, Music and Hearing, CSC, KTH, Stockholm

Abstract
A computer-aided analytical speech perception test and a training program have been developed. The test is analytical and seeks to evaluate the ability to perceive a range of sound contrasts used in the Swedish language. The test is tailored for measurements with children who have not yet learnt to read, by using easy speech stimuli, words selected on the basis of familiarity, and pictures that represent the test items unambiguously. The test is intended to be used with small children, from 4 years of age, who have difficulties in perceiving and producing spoken language.
Prelingually hearing-impaired children in particular show very different abilities to learn spoken language. The potential to develop intelligible speech is unrelated to their pure-tone audiograms. The development of this test is an effort to find a screening tool that can predict the ability to develop intelligible speech. The information gained from this test will provide supplementary information about speech perception skills, auditory awareness, and the potential for intelligible speech, and specify important recommendations for individualized speech-training programs.
The intention is that this test should be normalized for various groups of children so that the result of one child can be compared to group data. Some preliminary results and reference data from normal-hearing children, aged 4 to 7 years, with normal speech development are reported.

Introduction
There exist several speech perception tests that are used with prelingually and profoundly hearing-impaired children and children with specific language impairment to assess their speech processing capabilities: the GASP test (Erber, 1982), the Merklein Test (Merklein, 1981), Nelli (Holmberg and Sahlén, 2000) and the Maltby Speech Perception Test (Maltby, 2000). Results from these tests provide information concerning education and habilitation that supplements the audiogram and the articulation index, because they indicate a person's ability to perceive and to discriminate between speech sounds. However, no computerized tests have so far been developed in Swedish for use with young children who have difficulties in perceiving and producing spoken language.
The development of the computer-aided analytical speech perception test described in this paper is an effort to address the great need for early diagnosis. It should supply information about speech perception skills and auditory awareness. More importantly, a goal of this test is to measure the potential of children to produce intelligible speech given their difficulties in perceiving and producing spoken language. The expectation is that the result of this test will give important recommendations for individual treatment with speech-training programs (Öster, 2006).

Decisive factors for speech tests with small children
The aim of an analytical speech perception test is to investigate how sensitive a child is to the differences in speech patterns that are used to define word meanings and sentence structures (Boothroyd, 1995). Consequently, it is important to use stimuli that represent those speech features that are phonologically important. Since the speech reception skills of profoundly hearing-impaired children are quite limited, and since small children in general have a restricted vocabulary and reading proficiency, the selection of the speech material was crucial. The words selected had to be familiar and meaningful to the child, be representable in pictorial form and contain the phonological contrasts of interest. Thus, presenting sound contrasts as nonsense syllables, so that perception is not dependent on the child's vocabulary, was not a solution. It has been shown that nonsense syllables tend to be difficult for children to respond to and that they often substitute the nearest word they know (Maltby, 2000). Other important factors to pay attention to were:


• what order of difficulty of stimulus presentation is appropriate
• what are familiar words for children at different ages and with different hearing losses
• what is the most unambiguous way to illustrate the chosen test words

Moreover, the task had to be meaningful, natural and well understood by the child; otherwise he/she will not cooperate. Finally, the test must rapidly give a reliable result, as small children do not have particularly good attention and motivation.

Description of HöraTal-Test
In the development of this analytical speech perception test the above-mentioned factors were taken into consideration (Öster et al., 2002). The words contain important Swedish phonological contrasts, and each contrast is tested in one of eighteen different subtests by 6 word pairs presented twice. Table 1 summarizes the test, showing the phonological contrast evaluated in each subtest, an explanation of the discrimination task and one example from each subtest. The words used were recorded by one female speaker. An illustrated word (the target) is presented to the child on a computer screen together with the female voice reading the word. The task of the child is to discriminate between the two sounds that follow, presented without illustrations, and to decide which one is the same as the target word, see Figure 1.

Table 1. The eighteen subtests included in the test.

The child answers by pointing with the mouse or with his/her finger to one of the boxes on the screen. The results are presented as percent correct responses on each subtest, showing a profile of a child's functional hearing, see Figure 2 (Öster, 2008).

Figure 1. An example of the presentation of test stimuli on the computer screen. In this case the phonological contrast of vowels differing at low frequencies is tested through the words båt–bot [boːt–buːt] (boat–remedy).

Figure 2. Example of a result profile for a child. Percent correct responses are shown for each subtest.
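Since each subtest consists of 6 word pairs presented twice (12 trials), the result profile is simply percent correct per subtest. A minimal sketch of the computation (the response data below are invented for illustration):

```python
def profile(responses):
    """responses: dict mapping subtest name -> list of booleans
    (True = correct), e.g. 12 trials per subtest (6 pairs x 2).
    Returns percent correct per subtest, as shown in the profile."""
    return {name: 100.0 * sum(r) / len(r) for name, r in responses.items()}

# Invented responses for two of the eighteen subtests.
scores = profile({
    "Vowels differing at low frequencies": [True] * 10 + [False] * 2,
    "Vowel quantity": [True] * 6 + [False] * 6,  # 50% = guessing level
})
```

A score near 50% on a two-alternative task is at chance level, which is how the guessing threshold mentioned later in the paper arises.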


Confusion matrices are also available where all errors are displayed. Such results are useful for the speech therapist for screening purposes and give good indications of the child's difficulties in perceiving and subsequently producing the sounds of the Swedish language.

Description of HöraTal-Training
A training program consisting of computerized game-like exercises has been developed. Two different levels are available. A figure is guided around in a maze searching for fruit. Every time the figure reaches a fruit, two words are read by the female voice. The child should decide whether the two words are the same or not. Correct answers are rewarded with diamonds shown at the bottom of the screen. There is a time limit for the exercise, and obstacles are placed in the figure's way to jump over. If the figure smashes into an obstacle or if the time runs out, one of three "lives" is lost. If the child succeeds in collecting enough points (diamonds), he/she is passed to the second and more difficult level.

Figure 3. An example of HöraTal-Training showing one of the mazes with obstacles and fruits to collect.

Preliminary results

Studies of children with specific language impairment and prelingually hearing-impaired children
During the development of HöraTal-Test, 54 children of different ages and with different types of difficulty in understanding and producing speech took part in the development and evaluation of the different versions (Öster et al., 1999). Eighteen normally hearing children with specific language impairment were between 4 and 7 years of age, and twelve were between 9 and 19 years of age. Nine children had a moderate hearing impairment and were between 4 and 6 years old, and fifteen children had a profound hearing impairment and were between 6 and 12 years of age. Four of these had a cochlear implant and were between 6 and 12 years of age. Table 2 shows a summary of the children who tried out some of the subtests.

Table 2. Description of the children who participated in the development of the test. Average pure-tone hearing threshold levels (at 500, 1000 and 2000 Hz), age and number of children are shown.

  Normal-hearing children with specific language impairment:
    4-7 years of age, No. = 18; 9-19 years of age, No. = 12
  Hearing-impaired children:
    < 60 dBHL, 4-6 years of age, No. = 9; > 60 dBHL, 6-12 years of age, No. = 15

Figure 4 shows profiles for all 24 hearing-impaired children on some subtests. Black bars show mean results of the whole group and striped bars show the profile of one child with a 60 dBHL pure-tone average (500, 1000 and 2000 Hz).

Figure 4. Results for the hearing-impaired children (N=24) on the subtests: number of syllables; gross discrimination of long vowels; vowels differing at low frequencies; vowels differing at high frequencies; vowel quantity; discrimination of voiced consonants; discrimination of voiceless consonants; manner of articulation; place of articulation; voicing; nasality. Black bars show average results of the whole group and striped bars show the result of one child with 60 dBHL.

The result indicates that the child has greater difficulties on the whole to perceive important acoustical differences between speech sounds than the mean result of the 24 children. Many

of his results on the subtests were at the level of guessing (50%). The result might serve as a good type of screening for what the child needs to train in the speech clinic.
The results for the children with specific language impairment showed very large differences. All children had difficulties with several contrasts. Especially consonantal features and duration seemed to be difficult for them to perceive.

Studies of normal-hearing children with normal speech development
Two studies have been done to obtain reference data for HöraTal-Test from children with normal hearing and normal speech development. One study reported results from children aged between 4;0 and 5;11 years (Gadeborg & Lundgren, 2008) and the other study (Möllerström & Åkerfeldt, 2008) tested children between 6;0 and 7;11 years.
In the first study (Gadeborg & Lundgren, 2008) 16 four-year-old children and 19 five-year-old children participated. One of the conclusions of this study was that the four-year-old children were not able to pass the test. Only 3 of the four-year-old children managed to do the whole test. The rest of the children did not want to finish the test, were too unconcentrated or did not understand the concept of same/different.
13 of the five-year-old children did the whole test, which here comprised 16 subtests, as the first two subtests were used in this study to introduce the test to the children.

Figure 5. Mean result on 16 subtests (the first two subtests were used as introduction) for 13 normal-hearing five-year-old children with normal speech development.

The children had high mean scores on the subtests (94.25% correct answers), which indicates that a five-year-old child with normal hearing and normal speech development should receive high scores on HöraTal-Test.
In the other study (Möllerström & Åkerfeldt, 2008) 36 children aged between 6;0 and 7;11 years were assessed with all subtests of HöraTal-Test to provide normative data. In general, the participants obtained high scores on all subtests. Seven-year-old participants performed better on average (98.89%) than six-year-olds (94.28%).

Figure 6. Mean result on all 18 subtests for 18 six-year-old normal-hearing children with normal speech development (grey bars) and 18 seven-year-old normal-hearing children with normal speech development (black bars).

Conclusions
The preliminary results reported here indicate that this type of computerized speech test gives valuable information about which speech sound contrasts a hearing- or speech-disordered child has difficulties with. The child's results on the different subtests, consisting of both acoustic and articulatory differences between contrasting sounds, form a useful basis for an individual diagnosis of the child's difficulties. This can be of great relevance for the work of speech therapists.
The intention is that this test should be normalized for various groups of children so that the result of one child can be compared to group data. The test provides useful supplementary information to the pure-tone audiogram. Hopefully it will meet the long-felt need for such a test for early diagnostic purposes in recommending and designing pedagogical habilitation programs for small children with difficulties in perceiving and producing speech.
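Once normative data exist, the intended comparison of one child's profile against group data could be as simple as flagging subtests that fall well below the group mean. A sketch with invented numbers (the 10-point margin is purely illustrative):

```python
def below_norm(child, norms, margin=10.0):
    """child, norms: dicts of subtest -> percent correct (norms = the
    group mean for the child's age group). Returns the subtests where
    the child scores more than `margin` percentage points below the
    group mean -- candidates for targeted training."""
    return [s for s, norm in norms.items() if child.get(s, 0.0) < norm - margin]

# Invented example: a chance-level result on one subtest is flagged.
flagged = below_norm(
    {"Vowel quantity": 50.0, "Nasality": 92.0},
    {"Vowel quantity": 94.0, "Nasality": 95.0},
)
# "Vowel quantity" is flagged (50 < 94 - 10); "Nasality" is not (92 > 85).
```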



References
Boothroyd, A. (1995). Speech perception tests and hearing-impaired children. In G. Plant & K-E. Spens (eds.), Profound deafness and speech communication. London: Whurr Publishers Ltd, 345−371.
Erber, N.P. (1982). Auditory training. Alexander Graham Bell Association for the Deaf.
Gadeborg, J. & Lundgren, M. (2008). Hur barn i åldern 4;0−5;11 år presterar på taluppfattningstestet HöraTal. En analys av resultaten från en talperceptions- och en talproduktionsuppgift. Uppsala universitet, Enheten för logopedi.
Holmberg, E. & Sahlén, B. (2000). Nya Nelli. Pedagogisk Design, Malmö.
Maltby, M. (2000). A new speech perception test for profoundly deaf children. Deafness & Education International, 2(2), 86−101.
Merklein, R.A. (1981). A short speech perception test. The Volta Review, January, 36−46.
Möllerström, H. & Åkerfeldt, M. (2008). HöraTal-Test: Hur presterar barn i åldrarna 6;0−7;11 år och korrelerar resultaten med fonemisk medvetenhet? Uppsala universitet, Enheten för logopedi.
Öster, A-M., Risberg, A. & Dahlquist, M. (1999). Diagnostiska metoder för tidig bedömning/behandling av barns förmåga att lära sig förstå tal och lära sig tala. Slutrapport Dnr 279/99. KTH, Institutionen för tal, musik och hörsel.
Öster, A-M., Dahlquist, M. & Risberg, A. (2002). Slutförandet av projektet "Diagnostiska metoder för tidig bedömning/behandling av barns förmåga att lära sig förstå tal och lära sig tala". Slutrapport Dnr 2002/0324. KTH, Institutionen för tal, musik och hörsel.
Öster, A-M. (2006). Computer-based speech therapy using visual feedback with focus on children with profound hearing impairments. Doctoral thesis in speech and music communication, TMH, CSC, KTH.
Öster, A-M. (2008). Manual till HöraTal Test 1.1. Frölunda Data AB, Göteborg.



Transient visual feedback on pitch variation for Chinese speakers of English
Rebecca Hincks¹ and Jens Edlund²
¹Unit for Language and Communication, KTH, Stockholm
²Centre for Speech Technology, KTH, Stockholm

Abstract
This paper reports on an experimental study comparing two groups of seven Chinese students of English who practiced oral presentations with computer feedback. Both groups imitated teacher models and could listen to recordings of their own production. The test group was also shown flashing lights that responded to the standard deviation of the fundamental frequency over the previous two seconds. The speech of the test group increased significantly more in pitch variation than that of the control group. These positive results suggest that this novel type of feedback could be used in training systems for speakers who have a tendency to speak in a monotone when making oral presentations.

Introduction
First-language speech that is directed to a large audience is normally characterized by more pitch variation than conversational speech (Johns-Lewis, 1986). In studies of English and Swedish, high levels of variation correlate with perceptions of speaker liveliness (Hincks, 2005; Traunmüller & Eriksson, 1995) and charisma (Rosenberg & Hirschberg, 2005; Strangert & Gustafson, 2008).
Speech that is delivered without pitch variation affects a listener's ability to recall information, and is not favored by listeners. This was established by Hahn (2004), who studied listener response to three versions of the same short lecture: delivered with correct placement of primary stress or focus, with incorrect or unnatural focus, and with no focus at all (monotone). She demonstrated that monotonous delivery, as well as delivery with misplaced focus, significantly reduced a listener's ability to recall the content of instructional speech, as compared to speech delivered with natural focus placement. Furthermore, listeners preferred incorrect or unnatural focus to speech with no focus at all.
A number of researchers have pointed to the tendency for Asian L1 individuals to speak in a monotone in English. Speakers of tone languages have particular difficulties using pitch to structure discourse in English. Because in tonal languages pitch functions to distinguish lexical rather than discourse meaning, they tend to strip pitch movement for discourse purposes from their production of English. Pennington and Ellis (2000) tested how well speakers of Cantonese were able to remember English sentences based on prosodic information, and found that even though the subjects were competent in English, the prosodic patterns that disambiguate sentences such as Is HE driving the bus? from Is he DRIVing the bus? were not easily stored in the subjects' memories. Their conclusion was that speakers of tone languages simply do not make use of prosodic information in English, possibly because for them pitch patterns are something that must be learned arbitrarily as part of a word's lexical representation.
Many non-native speakers have difficulty using intonation to signal meaning and structure in their discourse. Wennerstrom (1994) studied how non-native speakers used pitch and intensity contrastively to show relationships in discourse. She found that "neither in … oral-reading or in … free-speech tasks did the L2 groups approach the degree of pitch increase on new or contrastive information produced by native speakers" (p. 416). This more monotone speech was particularly pronounced for the subjects whose native language was Thai, like Chinese a tone language. Chinese-native teaching assistants use significantly fewer rising tones than native speakers in their instructional discourse (Pickering, 2001) and thereby miss opportunities to ensure mutual understanding and establish common ground with their students. In a specific study of Chinese speakers of English, Wennerstrom (1998) found a significant relationship between the speakers' ability to use intonation to distinguish rhetorical units in oral presentations and their scores on a test of English proficiency.
Pickering (2004) applied Brazil's (1986) model of intonational paragraphing to the instructional speech of Chinese-native teaching assistants at an American university. By comparing intonational patterns in lab instructions given by native and non-native TAs, she showed that the non-natives lacked the ability to create intonational paragraphs and thereby to facilitate the students' understanding of the instructions. The analysis of intonational units in Pickering's work was "hampered at the outset by a compression of overall pitch range in the [international teaching assistant] teaching presentations as compared to the pitch ranges found in the [native speaker teaching assistant] data set" (2004). The Chinese natives were speaking more monotonously than their native-speaking colleagues.
One pedagogic solution to the tendency for Chinese native speakers of English to speak monotonously as they hold oral presentations would be simply to give them feedback when they have used significant pitch movement in any direction. The feedback would be divorced from any connection to the semantic content of the utterance, and would basically be a measure of how non-monotonously they are speaking. While a system of this nature would not be able to tell a learner whether he or she has made pitch movement that is specifically appropriate or native-like, it should stimulate the use of more pitch variation in speakers who underuse the potential of their voices to create focus and contrast in their instructional discourse. It could be seen as a first step toward more native-like intonation, and furthermore toward becoming a better public speaker. In analogy with other learning activities, we could say that such a system aims to teach students to swing the club without necessarily hitting the golf ball perfectly the first time. Because the system would give feedback on the production of free speech, it would stimulate and provide an environment for the autonomous practice of authentic communication such as the oral presentation.
Our study was inspired by three points concluded from previous research:
1. Public speakers need to use varied pitch movement to structure discourse and engage with their listeners.
2. Second language speakers, especially those of tone languages, are particularly challenged when it comes to the dynamics of English pitch.
3. Learning activities are ideally based on the student's own language, generated with an authentic communicative intent.
These findings generated the following primary research question: Will on-line visual feedback on the presence and quantity of pitch variation in learner-generated utterances stimulate the development of a speaking style that incorporates greater pitch variation?
Following previous research on technology in pronunciation training, comparisons were made between a test group that received visual feedback and a control group that was able to access auditory feedback only. Two hypotheses were tested:
1. Visual feedback will stimulate a greater increase in pitch variation in training utterances as compared to auditory-only feedback.
2. Participants with visual feedback will be able to generalize what they have learned about pitch movement and variation to the production of a new oral presentation.

Method
The system we used consists of a base system allowing students to listen to teacher recordings (targets), read transcripts of these recordings, and make their own recordings of their attempts to mimic the targets. Students may also make recordings of free readings. The interface keeps track of the students' actions, and some of this information, such as the number of times a student has attempted a target, is continuously presented to the student.
The pitch meter is fed data from an online analysis of the recorded speech signal. The analysis used in these experiments is based on the /nailon/ online prosodic analysis software (Edlund & Heldner, 2006) and the Snack sound toolkit. As the student speaks, a fundamental frequency estimation is continuously extracted using an incremental version of getF0/RAPT (Talkin, 1995). The estimated frequency is transformed from Hz to logarithmic semitones. This gives us a kind of perceptual speaker normalization, which affords easy comparison of pitch variation between different speakers.
After the semitone transformation, the next step is a continuous and incremental calculation of the standard deviation of the student's pitch over the last 10 seconds. The result is a measure of the student's recent pitch variation.
For the test students, the base system was extended with a component providing online, instantaneous and transient feedback visualizing the degree of pitch variation the student is currently producing. The feedback is presented in a meter that is reminiscent of the amplitude

bars used in the equalizers of sound systems: the current amount of variation is indicated by the number of bars that are lit up in a stack of bars, and the highest variation over the past two seconds is indicated by a lingering top bar. The meter has a short, constant latency of 100 ms.

The test group and the control group each consisted of 7 students of engineering, 4 women and 3 men each. The participants were recruited from English classes at KTH, and were exchange students from China, in Sweden for stays of six months to two years. Participants' proficiency in English was judged by means of an internal placement test to be at the upper intermediate to advanced level. The participants spoke a variety of dialects of Chinese but used Mandarin with each other and for their studies in China. They did not speak Swedish and were using English with their teachers and classmates.

Each participant began the study by giving an oral presentation of about five minutes in length, either for their English classes or for a smaller group of students. Audio recordings were made of the presentations using a small clip-on microphone that recorded directly into a computer. The presentations were also video-recorded, and participants watched the presentations together with one of the researchers, who commented on presentation content, delivery and language. The individualized training material for each subject was prepared from the audio recordings. A set of 10 utterances, each of about 5-10 seconds in length, was extracted from the participants' speech. The utterances were mostly non-consecutive and were chosen on the basis of their potential to provide examples of contrastive pitch movement within the individual utterance. The researcher recorded her own (native American-speaking) versions of them, making an effort to use her voice as expressively as possible and making more pitch contrasts than in the original student version. For example, a modeled version of a student's flat utterance could be represented as: "And THIRDly, it will take us a lot of TIME and EFfort to READ each piece of news."

The participants were assigned to the control or test groups following the preparation of their individualized training material. Participants were ranked in terms of the global pitch variation in their first presentation, as follows: they were first split into two lists according to gender, and each list was ordered according to initial global pitch variation. Participants were randomly assigned pair-wise from the list to the control or test group, ensuring gender balance as well as balance in initial pitch variation. Four participants who joined the study at a later date were distributed in the same manner.

Participants completed approximately three hours of training in half-hour sessions; some participants chose to occasionally have back-to-back sessions of one hour. The training sessions were spread out over a period of four weeks. Training took place in a quiet and private room at the university language unit, without the presence of the researchers or other onlookers. For the first four or five sessions, participants listened to and repeated the teacher versions of their own utterances. They were instructed to listen to and repeat each of their 10 utterances between 20 and 30 times. Test group participants received the visual feedback described above and were encouraged to speak so that the meter showed a maximum amount of green bars. The control group was able to listen to recordings of their production but received no other feedback.

Upon completion of the repetitions, both groups were encouraged to use the system to practice their second oral presentation, which was to be on a different topic than the first presentation. For this practice, the part of the interface designated for 'free speech' was used. In these sessions, once again the test participants received visual feedback on their production, while control participants were only able to listen to recordings of their speech. Within 48 hours of completing the training, the participants held another presentation, this time about ten minutes in length, for most of them as part of the examination of their English courses. This presentation was audio recorded.

Results

We measured development in two ways: over the roughly three hours of training per student, in which case we compared pitch variation in the first and the second half of the training for each of the 10 utterances used for practice, and in generalized form, by comparing pitch variation in two presentations, one before and one after training. Pitch estimations were extracted using the same software used to feed the pitch variation indicator used in training, an incremental version of the getF0/RAPT (Talkin, 1995) algorithm. Variation was calculated in a manner consistent with Hincks (2005) by calculating the standard deviation over a moving 10-second window.

In the case of the training data, recordings containing noise only or those that were empty were detected automatically and removed. For each of the 10 utterances included in the training material, the data were split into a first and a second half, and the recordings from the first half were spliced together to create one continuous sound file, as were the recordings from the second half. The averages of the windowed standard deviation of the first and the second half of training were compared.

Figure 1. Average pitch variation over 10 seconds of speech for the two experimental conditions during the 1st presentation, the 1st half of the training, the 2nd half of the training and the 2nd presentation. The test group shows a statistically significant effect of the feedback they were given.

The mean standard deviations for each data set and each of the two groups are shown in Figure 1. The y-axis displays the mean standard deviation per moving 10-second frame of speech in semitones, and the x-axis the four points of measurement: the first presentation, the first half of training, the second half of training, and the second oral presentation. The experimental group shows a greater increase in pitch variation across all points of measurement following training. Improvement is most dramatic in the first half of training, where the difference between the two groups jumps significantly from nearly no difference to one of more than 2.5 semitones. The gap between the two groups narrows somewhat in the production of the second presentation.

The effect of the feedback method (test group vs. control group) was analyzed using an ANOVA with time of measurement (1st presentation, 1st half of training, 2nd half of training, 2nd presentation) as a within-subjects factor. The sphericity assumption was met, and the main effect of time of measurement was significant (F = 8.36, p < .0005, η² = 0.45), indicating that the speech of the test group receiving visual feedback increased more in pitch variation than that of the control group. The between-subjects effect of feedback method was significant (F = 6.74, p = .027, η² = 0.40). The two hypotheses are confirmed by these findings.

Discussion

Our results are in line with other research that has shown that visual feedback on pronunciation is beneficial to learners. The visual channel provides information about linguistic features that can be difficult for second language learners to perceive audibly. The first language of our Chinese participants uses pitch movement to distinguish lexical meaning; these learners can therefore experience difficulty in interpreting and producing pitch movement at a discourse level in English (Pennington & Ellis, 2000; Pickering, 2004; Wennerstrom, 1994). Our feedback gave each test participant visual confirmation when they had stretched the resources of their voices beyond their own baseline values. It is possible that some participants had been using other means, particularly intensity, to give focus to their English utterances. The visual feedback rewarded them for using pitch movement only, and could have been a powerful factor in steering them in the direction of an adapted speaking style. While our data were not recorded in a way that would allow for an analysis of the interplay between intensity and pitch as Chinese speakers give focus to English utterances, this would be an interesting area for further research.

Given greater resources in terms of time and potential participants, it would have been interesting to compare the development of pitch variation with other kinds of feedback. For example, we could have displayed pitch tracings of the training utterances to a third group of participants. It has not been an objective of our study, however, to prove that our method is superior to showing pitch tracings. We simply feel that circumventing the contour visualization process allows for the more autonomous use of speech technology. A natural development in future research will be to have learners practice presentation skills without teacher models.

It is important to point out that we cannot determine from these data that speakers became better presenters as a result of their participation in this study. A successful presentation entails, of course, very many features, and using pitch well is only one of them. Other vocal features that are important are the ability to clearly articulate the sounds of the language, the rate of speech, and the ability to speak with an intensity that is appropriate to the spatial setting. In addition, there are numerous other features regarding the interaction of content, delivery and audience that play a critical role in how the presentation is received. Our presentation data, gathered as they were from real-life classroom settings, are in all likelihood too varied to allow for a study that attempted to find a correlation between pitch variation and, for example, the perceived clarity of a presentation. However, we do wish to explore perceptions of the speakers. We also plan to develop feedback gauges for other intonational features, beginning with rate of speech. We see potential to develop language-specific intonation pattern detectors that could respond to, for example, a speaker's tendency to use French intonation patterns when speaking English. Such gauges could form a type of toolbox that students and teachers could use as a resource in the preparation and assessment of oral presentations.

Our study contributes to the field in a number of ways. It is, to the best of our knowledge, the first to rely on a synthesis of online fundamental frequency data in relation to learner production. We have not shown the speakers the absolute fundamental frequency itself, but rather how much it has varied over time as represented by the standard deviation. This variable is known to characterize discourse intended for a large audience (Johns-Lewis, 1986), and is also a variable that listeners can perceive if they are asked to distinguish lively speech from monotone (Hincks, 2005; Traunmüller & Eriksson, 1995). In this paper, we have demonstrated that it is a variable that can effectively stimulate production as well. Furthermore, the variable itself provides a means of measuring, characterizing and comparing speaker intonation. It is important to point out that enormous quantities of data lie behind the values reported in our results. Measurements of fundamental frequency were made 100 times a second, for stretches of speech up to 45 minutes in length, giving tens of thousands of data points per speaker for the training utterances. By converting the Hertz values to the logarithmic semitone scale, we are able to make valid comparisons between speakers with different vocal ranges. This normalization is an aspect that appears to be neglected in commercial pronunciation programs such as Auralog's Tell Me More series, where pitch curves of speakers of different mean frequencies can be indiscriminately compared. There is a big difference in the perceptual force of a rise in pitch of 30 Hz for a speaker of low mean frequency and one with high mean frequency, for example. These differences are normalized by converting to semitones.

Secondly, our feedback can be used for the production of long stretches of free speech rather than short, system-generated utterances. It is known that intonation must be studied at a higher level than that of the word or phrase in order for speech to achieve proper cohesive force over longer stretches of discourse. By presenting the learners with information about their pitch variation in the previous ten seconds of speech, we are able to incorporate and reflect the vital movement that should occur when a speaker changes topic, for example. In an ideal world, most teachers would have the time to sit with students, examine displays of pitch tracings, and discuss how peaks of the tracings relate to each other with respect to theoretical models such as Brazil's intonational paragraphs (Levis & Pickering, 2004). Our system cannot approach that level of detail, and in fact cannot make the connection between intonation and its lexical content. However, it can be used by learners on their own, in the production of any content they choose. It also has the potential for future development in the direction of more fine-grained analyses.

A third novel aspect of our feedback is that it is transient and immediate. Our lights flicker and then disappear. This is akin to the way we naturally process speech: not as something that can be captured and studied, but as sound waves that last no longer than the milliseconds it takes to perceive them. It is also more similar to the way we receive auditory and sensory feedback when we produce speech – we only hear and feel what we produce in the very instance we produce it; a moment later it is gone. Though at this point we can only speculate, it would be interesting to test whether transient feedback might be more easily integrated and automatized than higher-level feedback, which is more abstract and may require more cognitive processing and interpretation. The potential difference between transient and enduring feedback has interesting theoretical implications that could be further explored.

This study has focused on Chinese speakers because they are a group where many speakers can be expected to produce relatively monotone speech, and where the chances of achieving a measurable development in a short period of time were deemed to be greatest. However, there are all kinds of speaker groups who could benefit from presentation feedback. Like many communicative skills that are taught in advanced language classes, the lessons can apply to native speakers as well. Teachers who produce monotone speech are a problem to students everywhere. Nervous speakers can also tend to use a compressed speaking range, and could possibly benefit from having practiced delivery with an expanded range. Clinically, monotone speech is associated with depression, and can also be a problem that speech therapists need to address with their patients. However, the primary application we envisage here is an aid for practicing, or perhaps even delivering, oral presentations.

It is vital to use one's voice well when speaking in public. It is the channel of communication, and when used poorly, communication can be less than successful. If listeners either stop listening, or fail to perceive what is most important in a speaker's message, then all actors in the situation are in effect wasting time. We hope to have shown in this paper that stimulating speakers to produce more pitch variation in a practice situation has an effect that can transfer to new situations. People can learn to be better public speakers, and technology should help in the process.

Acknowledgements

This paper is an abbreviated version of an article to be published in Language Learning and Technology in October 2009. The technology used in the research was developed in part within the Swedish Research Council project #2006-2172 (Vad gör tal till samtal).

References

Brazil, D. (1986). The Communicative Value of Intonation in English. Birmingham, UK: University of Birmingham, English Language Research.
Edlund, J., & Heldner, M. (2006). /nailon/ – Software for Online Analysis of Prosody. Proceedings of Interspeech 2006.
Hahn, L. D. (2004). Primary Stress and Intelligibility: Research to Motivate the Teaching of Suprasegmentals. TESOL Quarterly, 38(2), 201-223.
Hincks, R. (2005). Measures and perceptions of liveliness in student oral presentation speech: a proposal for an automatic feedback mechanism. System, 33(4), 575-591.
Johns-Lewis, C. (1986). Prosodic differentiation of discourse modes. In C. Johns-Lewis (Ed.), Intonation in Discourse (pp. 199-220). Breckenham, Kent: Croom Helm.
Levis, J., & Pickering, L. (2004). Teaching intonation in discourse using speech visualization technology. System, 32, 505-524.
Pennington, M., & Ellis, N. (2000). Cantonese Speakers' Memory for English Sentences with Prosodic Cues. The Modern Language Journal, 84(iii), 372-389.
Pickering, L. (2001). The Role of Tone Choice in Improving ITA Communication in the Classroom. TESOL Quarterly, 35(2), 233-255.
Pickering, L. (2004). The structure and function of intonational paragraphs in native and non-native speaker instructional discourse. English for Specific Purposes, 23, 19-43.
Rosenberg, A., & Hirschberg, J. (2005). Acoustic/Prosodic and Lexical Correlates of Charismatic Speech. Paper presented at Interspeech 2005, Lisbon.
Strangert, E., & Gustafson, J. (2008). Subject ratings, acoustic measurements and synthesis of good-speaker characteristics. Paper presented at Interspeech 2008, Brisbane, Australia.
Talkin, D. (1995). A robust algorithm for pitch tracking (RAPT). In W. B. Kleijn & K. K. Paliwal (Eds.), Speech Coding and Synthesis (pp. 495-518). Elsevier.
Traunmüller, H., & Eriksson, A. (1995). The perceptual evaluation of F0 excursions in speech as evidenced in liveliness estimations. Journal of the Acoustical Society of America, 97(3), 1905-1915.
Wennerstrom, A. (1994). Intonational meaning in English discourse: A study of non-native speakers. Applied Linguistics, 15(4), 399-421.
Wennerstrom, A. (1998). Intonation as Cohesion in Academic Discourse: A Study of Chinese Speakers of English. Studies in Second Language Acquisition, 20, 1-25.
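The core computation the paper describes — an F0 estimate converted from Hz to semitones, then a running standard deviation over the last 10 seconds of voiced speech — can be sketched as follows. This is an illustrative reconstruction, not the /nailon/ implementation; the 100-frames-per-second rate matches the paper, but the reference frequency and the handling of unvoiced frames are our assumptions.

```python
import math
from collections import deque

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Log-transform F0 so pitch intervals are comparable across speakers
    with different vocal ranges (12 semitones = 1 octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

class PitchVariationMeter:
    """Running standard deviation of pitch (in semitones) over the most
    recent `window_s` seconds of voiced frames."""

    def __init__(self, window_s=10.0, frames_per_s=100):
        self.window = deque(maxlen=int(window_s * frames_per_s))

    def update(self, f0_hz):
        """Feed one F0 frame; return the current variation measure."""
        if f0_hz > 0:  # skip unvoiced frames (tracker reports no F0)
            self.window.append(hz_to_semitones(f0_hz))
        n = len(self.window)
        if n < 2:
            return 0.0
        mean = sum(self.window) / n
        return math.sqrt(sum((x - mean) ** 2 for x in self.window) / n)

# The paper's normalization point: the same 30 Hz rise is a much larger
# pitch movement for a low-pitched speaker than for a high-pitched one
# (the speaker means of 100 Hz and 220 Hz are illustrative values).
def rise_in_semitones(start_hz, rise_hz):
    return 12.0 * math.log2((start_hz + rise_hz) / start_hz)

low_voice = rise_in_semitones(100.0, 30.0)   # about 4.5 semitones
high_voice = rise_in_semitones(220.0, 30.0)  # about 2.2 semitones
```

A perfectly monotone speaker would hold the meter at 0, while a voice alternating between two pitches an octave apart would drive the windowed standard deviation toward 6 semitones.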


Phonetic correlates of unintelligibility in Vietnamese-accented English
Una Cunningham
School of Arts and Media, Dalarna University, Falun

Abstract

Vietnamese speakers of English are often able to communicate much more efficiently in writing than in speaking. Many have quite high proficiency levels, with full command of advanced vocabulary and complex syntax, yet they have great difficulty making themselves understood when speaking English to both native and non-native speakers. This paper explores the phonetic events associated with breakdowns in intelligibility, and looks at compensatory mechanisms which are used.

Intelligibility

The scientific study of intelligibility has passed through a number of phases. Two strands that have shifted in their relative prominence concern the matter of to whom non-native speakers are to be intelligible. In one strand the emphasis is on the intelligibility of non-native speakers to native English-speaking listeners (Flege, Munro et al. 1995; Munro and Derwing 1995; Tajima, Port et al. 1997). This was the context in which English was taught and learned – the majority of these studies have been carried out in what are known as inner circle countries, which, in turn, reflects the anglocentrism which has characterised much of linguistics. The other strand focuses on the position of English as a language of international communication, where intelligibility is a two-way affair between a native or non-native English-speaking speaker and a native or non-native English-speaking listener (Irvine 1977; Flege, MacKay et al. 1999; Kirkpatrick, Deterding et al. 2008; Rooy 2009).

The current study is a part of a larger study of how native speakers of American English, Swedish, Vietnamese, Urdu and Ibo are perceived by listeners from these and other language backgrounds. Vietnamese-accented speech in English has been informally observed to be notably unintelligible for native English-speaking listeners, and even for Vietnamese listeners there is great difficulty in choosing which of four words has been uttered (Cunningham 2009).

There are a number of possible ways in which intelligibility can be measured. Listeners can be asked to transcribe what they hear or to choose from a number of alternatives. Stimuli vary from spontaneous speech through texts to sentences to wordlists. Sentences with varying degrees of semantic meaning are often used (Kirkpatrick, Deterding et al. 2008) to control for the effect of contextual information on intelligibility. Few intelligibility studies appear to be concerned with the stimulus material. The question of what makes an utterance unintelligible is not addressed in these studies. The current paper is an effort to come some way towards examining this issue.

Learning English in Vietnam

The pronunciation of English presents severe challenges to Vietnamese-speaking learners. Not only is the sound system of Vietnamese very different from that of English, but there are also extremely limited opportunities for hearing and speaking English in Vietnam. In addition, there are limited resources available to teachers of English in Vietnam, so teachers are likely to pass on their own English pronunciation to their students.

University students of English are introduced to native-speaker models of English pronunciation, notably Southern educated British, but they do not often have the opportunity to speak with non-Vietnamese speakers of English. Most studies of Vietnamese accents in English have been based in countries where English is a community language, such as the U.S. (Tang 2007) or Australia (Nguyen 1970; Ingram and Nguyen 2007). This study is thus unusual in considering the English pronunciation of learners who live in Vietnam. The speech material presented here was produced by members of a group of female students from Hanoi.

Vietnamese accents of English

The most striking feature of Vietnamese-accented English is the elision of consonants, in particular in the syllable coda.
This can obviously be related to the phonotactic constraints operational in Vietnamese, and it is clearly a problem when speaking English, which places a heavy semantic load on the coda in verb forms and other suffixes. Consonant clusters are generally simplified in Vietnamese-accented English to a degree that is not compatible with intelligibility. Even single final consonants are often absent or substituted for by another consonant which is permitted in the coda in Vietnamese.

Other difficulties in the intelligibility of Vietnamese-accented English are centred in vowel quality. English has a typologically relatively rich vowel inventory, and this creates problems for learners with many L1s, including Vietnamese. The distinction between the vowels of KIT and FLEECE, to use the word class terminology developed by John Wells (Wells 1982), or ship and sheep, to allude to the popular pronunciation teaching textbook (Baker 2006), is particularly problematic. Other problematic vowel contrasts are that between NURSE and THOUGHT (e.g. work vs walk) and between TRAP and DRESS (e.g. bag vs beg). The failure to perceive or produce these vowel distinctions is a major hindrance to the intelligibility of Vietnamese-accented English.

Vowel length is not linguistically significant in Vietnamese, and the failure to notice or produce pre-fortis clipping is another source of unintelligibility. Another interesting effect that is attributable to transfer from Vietnamese is the use in English of the rising sắc tone on syllables that have a voiceless stop in the coda. This can result in a pitch prominence that may be interpreted as stress by listeners.

Vietnamese words are said to be generally monosyllabic, and are certainly written as monosyllables with a space between each syllable. This impression is augmented (or possibly explained) by the apparent paucity of connected speech phenomena in Vietnamese and consequently in Vietnamese-accented English.

Analysis

A number of features of Vietnamese-accented English will be analysed here. They are a) the vowel quality distinction between the words sheep and ship, b) the vowel duration distinction between seat and seed, and c) the causes of global unintelligibility in semantically meaningful sentences taken from an earlier study (Munro and Derwing 1995).

Vowel quality

The semantic load of the distinction between the KIT and FLEECE vowels is significant. This opposition seems to be observed in most varieties of English, and it is one that has been identified as essential for learners of English to master (Jenkins 2002). Nonetheless, this distinction is not very frequent in the languages of the world. Consequently, like any kind of new distinction, a degree of effort and practice is required before learners with many first languages, including Vietnamese, can reliably perceive and produce this distinction.

Figure 1. F1 vs F2 in Bark for S15 for the words bead, beat, bid, bit.

Fig. 1 shows the relationship between F1 and F2 in Bark for the vowels in the words beat, bead, bit and bid for S15, a speaker of Vietnamese (a 3rd year undergraduate English major student at a university in Hanoi). This speaker does not make a clear spectral distinction between the vowels. As spectral quality is the most salient cue to this distinction for native speakers of English (Ladefoged 2006; Cruttenden 2008), the failure to make a distinction is obviously a hindrance to intelligibility.

Vowel duration

Enhanced pre-fortis clipping is used in many varieties of English as a primary cue to post-vocalic voicelessness (Ladefoged 2006; Cruttenden 2008). It has been well documented that phonologically voiced (lenis) post-vocalic consonants are often devoiced by native speakers of many varieties of English (e.g. Cruttenden 2008). This means that in word sets such as bead, beat, bid, bit native speakers will signal postvocalic voicing primarily by shortening the vowel in beat and bit. In addition, native speakers will have a secondary durational cue to the bead, beat vs bid, bit vowel distinction, where the former vowel is systematically longer than the latter (Cruttenden 2008).

Figure 2. Average vowel and stop duration in ms for S15 for 5 instances of the words bead, beat, bid, bit.

So, as is apparent from Figure 2, speaker S15 manages to produce somewhat shorter vowels in bid and bit than in bead and beat. This is the primary cue that this speaker is using to dissimilate the vowels, although not, unfortunately, the cue expected as most salient by native listeners. But there is no pre-fortis clipping apparent. This important cue to post-vocalic voicing is not produced by this speaker. In conjunction with the lack of spectral distinction between the vowels of bead, beat vs. bid, bit seen in Figure 1, the result is that these four words are perceived as indistinguishable by native and non-native listeners (Cunningham 2009).

Sentence intelligibility

A number of factors work together to confound the listener of Vietnamese-accented English in connected speech. Not only is it difficult to perceive vowel identity and post-vocalic voicing, as in the above examples, but there are a number of other problems. Consider the sentence My friend's sheep is often green. This is taken from the stimuli set used for the larger study mentioned above. The advantage of sentences of this type is that the contextual aids to interpretability are minimised while connected speech phenomena are likely to be elicited. There are here a number of potential pitfalls for the Vietnamese speaker of English. The cluster at the end of friend's, especially in connection with the initial consonant in sheep, can be expected to prove difficult and to be simplified in some way. The quality and duration of the vowels in sheep and green can be expected to cause confusion (as illustrated in Figures 1 and 2 above). The word often is liable to be pronounced with a substitution of /p/ for the /f/ at the end of the first syllable, as voiceless stops are permissible in coda position in Vietnamese while fricatives are not.

Let us then see what happens when speaker V1, a 23-year-old male graduate student of English from Hanoi, reads this sentence. In fact he manages the beginning of the utterance well, with an appropriate (native-like) elision of the /d/ in friend's. Things start to go wrong after that with the word sheep. Figure 3 shows a spectrogram of this word using Praat (Boersma and Weenink 2009). As can be seen, the final consonant comes out as an ungrooved [s].

Figure 3. The word sheep as pronounced by speaker V1.

Now there is a possible explanation for this. As mentioned above, final /f/ as in the word if is often pronounced as [ip]. This pronunciation is viewed as characteristic of Vietnamese-accented English in Vietnam – teacher and thus learner awareness of this is high, and the feature is stigmatised. Thus the avoidance of the final /p/ of sheep may be an instance of hypercorrection. It is clearly detrimental to V1's intelligibility.

Another problematic part of this sentence by V1 is that he elides the /z/ in the word is. There is silence on either side of this vowel. Again, this leads to intelligibility problems. The final difficulty for the listener in this utterance is a matter of VOT in the word green. V1 has no voice before the vowel begins, as is shown in Figure 4. The stop is apparently voiceless and the release is followed by a 112 ms voiceless aspiration. This leads to the word being misinterpreted by listeners as cream.
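The Bark-scaled formant comparison behind Figure 1 can be reproduced from formant values in Hz with, for example, Traunmüller's (1990) approximation of the Bark scale. This is an assumption for illustration — the paper does not state which conversion was used — and the formant values below are illustrative, not measurements from S15.

```python
def hz_to_bark(f_hz):
    """Traunmüller's (1990) approximation of the auditory Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# Illustrative (not measured) formant values for a FLEECE-like and a
# KIT-like vowel; plotting F1 against F2 in Bark, as in Figure 1,
# compares vowels on an auditory rather than a linear-Hz scale.
f1_fleece, f2_fleece = hz_to_bark(300.0), hz_to_bark(2300.0)
f1_kit, f2_kit = hz_to_bark(400.0), hz_to_bark(2000.0)
```

The Bark transform compresses the higher frequencies, so a given Hz difference between two F2 values counts for less, auditorily, than the same difference between two F1 values.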

Figure 4. The word green as pronounced by speaker V1. The marking shows aspiration after the release of the initial stop.

Conclusion

So it can be seen that the intelligibility of these Vietnamese speakers of English is a major problem for them and their interlocutors. Not only do they have non-native pronunciation features that are clear instances of transfer from their L1, Vietnamese, they also have other, spontaneous, modifications of the target sounds. This is part of the general variability that characterises non-native pronunciation, but when the sounds produced are as far from the target sounds as they are in the speech of V1, communication is an extreme effort.

References

Baker, A. (2006). Ship or sheep: an intermediate pronunciation course. Cambridge: Cambridge University Press.
Boersma, P. and D. Weenink (2009). Praat: doing phonetics by computer.
Cruttenden, A. (2008). Gimson's Pronunciation of English. London: Hodder Arnold.
Cunningham, U. (2009). Quality, quantity and intelligibility of vowels in Vietnamese-accented English. In E. Waniek-Klimczak (ed.), Issues in Accents of English II: Variability and Norm. Newcastle: Cambridge Scholars Publishing Ltd.
Flege, J. E., I. R. A. MacKay, et al. (1999). Native Italian speakers' perception and production of English vowels. Journal of the Acoustical Society of America 106(5): 2973-2987.
Flege, J. E., M. J. Munro, et al. (1995). Factors affecting strength of perceived foreign accent in a 2nd language. Journal of the Acoustical Society of America 97(5): 3125-3134.
Ingram, J. C. L. and T. T. A. Nguyen (2007). Vietnamese accented English: Foreign accent and intelligibility judgement by listeners of different language backgrounds. University of Queensland.
Irvine, D. H. (1977). Intelligibility of English speech to non-native English speakers. Language and Speech 20: 308-316.
Jenkins, J. (2002). A sociolinguistically based, empirically researched pronunciation syllabus for English as an international language. Applied Linguistics 23(1): 83-103.
Kirkpatrick, A., D. Deterding, et al. (2008). The international intelligibility of Hong Kong English. World Englishes 27(3-4): 359-377.
Ladefoged, P. (2006). A Course in Phonetics. Boston, Mass.: Thomson.
Munro, M. J. and T. M. Derwing (1995). Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech. Language and Speech 38: 289-306.
Nguyen, D. L. (1970). A contrastive phonological analysis of English and Vietnamese. Canberra: Australian National University.
Rooy, S. C. V. (2009). Intelligibility and perceptions of English proficiency. World Englishes 28(1): 15-34.
Tajima, K., R. Port, et al. (1997). Effects of temporal correction on intelligibility of foreign-accented English. Academic Press Ltd.
Tang, G. M. (2007). Cross-linguistic analysis of Vietnamese and English with implications for Vietnamese language acquisition and maintenance in the United States. Journal of Southeast Asian American Education and Advancement 2.
Wells, J. C. (1982). Accents of English. Cambridge: Cambridge University Press.


Perception of Japanese quantity by Swedish speaking learners: A preliminary analysis

Miyoko Inoue
Department of Japanese Language & Culture, Nagoya University, Japan

Abstract
Swedish learners' perception of Japanese quantity was investigated by means of an identification task. Swedish informants performed similarly to native Japanese listeners in short/long identification of both vowel and consonant. The Swedish and Japanese listeners reacted similarly both to the durational variation and to the F0 change, despite the different use of F0 fall in relation to quantity in their L1.

Introduction
A quantity language is a language that has a phonological length contrast in vowels and/or consonants. Japanese and Swedish are known as such languages, and they employ both vowels and consonants for the long/short contrast (Han, 1965, for Japanese; Elert, 1964, for Swedish). Both of the languages use duration as a primary acoustic cue to distinguish the long/short contrast.

Quantity in Japanese is known to be difficult to acquire for learners (e.g. Toda, 2003); however, informants in previous research have mainly been speakers of non-quantity languages. In their research on L2 quantity in Swedish, McAllister et al. (2002) concluded that the degree of success in learning L2 Swedish quantity seemed to be related to the role of the duration feature in learners' L1. It can, then, be anticipated that Swedish learners of Japanese may be relatively successful in acquiring Japanese quantity. The present study, thus, aims to investigate whether Swedish learners are able to perform similarly to the Japanese in the perception of Japanese quantity of vowels and consonants.

In addition to duration, there can be other phonetic features that might supplement the quantity distinction in quantity languages. In Swedish, quality is such an example, but such a feature may not necessarily be utilized in other languages. For example, quality does not seem to be used in the quantity contrast in Japanese (Arai et al., 1999).

Fundamental frequency (F0) could be such a supplementary feature in Japanese. Kinoshita et al. (2002) and Nagano-Madsen (1992) reported that the perception of quantity in L1 Japanese was affected by the F0 pattern. In their experiments, when there was an F0 fall within a vowel, Japanese speakers tended to perceive the vowel as 'long'. On the other hand, a vowel with long duration was heard as 'short' when the onset of the F0 fall was at the end of the vowel (Nagano-Madsen, 1992).

These results are in line with phonological and phonetic characteristics of word accent in Japanese. It is the first mora that can be accented in a long vowel, and the F0 fall is timed with the boundary of the accented and the post-accent morae of the vowel. Since the second mora in a long vowel does not receive the word accent, an F0 fall should not occur at the end of a long vowel.

In Swedish, quantity and word accent seem only to be indirectly related to the stress, in such a way that the stress is signaled by quantity and the F0 contour of word accent is timed with the stress.

In the current research, it is also examined whether Swedish learners react differently to stimuli with and without F0 change. Responses to unaccented and accented words will be compared. An unaccented word in Japanese typically has only a gradual F0 declination, while an accented word is characterized by a clear F0 fall. It can be anticipated that Swedish learners would perform differently from native Japanese speakers.

Methodology
An identification task was conducted in order to examine the categorical boundary between long and short vowels and consonants and the consistency of the categorization. The task was carried out in the form of a forced-choice test, and the results were compared between Swedish and Japanese informants.

Stimuli¹
The measured data of the prepared stimuli are shown in Table 1 and


Table 2. The original sound, accented and unaccented /mamama, mama:ma/ (for the long/short vowel) and /papapa, papap:a/ (for the long/short consonant), was recorded by a female native Japanese speaker. For the accented version, the 2nd mora was accented for both the ma- and the pa-series. The stimuli were made by manipulating a part of the recorded tokens with Praat (Boersma and Weenink, 2004) so that the long sound shifts to short in 7 steps. Thus, a total of 28 tokens (2 series x 7 steps x 2 accent types) were prepared. The F0 peak, the location of the F0 peak in V2 and the final F0 were fixed at the average value of the long and short sounds.

Table 1. The measurements of the stimuli in ma-series (adopted from Kanamura, 2008: 30 (Table 2-2) and 41 (Table 2-5), with permission). The unaccented and the accented stimuli are differentiated by the utterance-final F0 (given below the table).

No.  Ratio  V2 Duration (ms)  Word Duration (ms)
1    0.25    78                582
2    0.40   128                627
3    0.55   168                673
4    0.70   213                718
5    0.85   259                764
6    1.00   303                810
7    1.15   349                855
Peak F0: 330 Hz; Peak F0 location in V2: 48%; Final F0: 242 Hz (unaccented) / 136 Hz (accented).

Table 2. The measurements of the stimuli in pa-series. The unaccented and the accented stimuli are differentiated by the utterance-final F0 (given below the table).

No.  Ratio  C3 Duration (ms)  Word Duration (ms)
1    0.25    85                463
2    0.40   136                514
3    0.55   188                566
4    0.70   239                617
5    0.85   290                668
6    1.00   341                719
7    1.15   392                770
Peak F0: 295 Hz; Peak F0 location in V2: 96%; Final F0: 231 Hz (unaccented) / 116 Hz (accented).

Informants
The informants were 23 Swedish learners of Japanese (SJ) at different institutions in Japan and Sweden. The length of studying Japanese varied from 3 to 48 months.² Thirteen native speakers of standard Japanese (NJ) also participated in the task for comparison.

Procedure
An identification task was conducted using ExperimentMFC of Praat. Four sessions (ma-/pa-series x 2 accent types) were held for each informant. In each session, an informant listened to 70 stimuli (7 steps x 10 times) in random order and answered whether the stimulus played was, for example, /mamama/ or /mama:ma/ by clicking on a designated key.

Calculation of the categorical boundary and the 'steepness' of the categorization function
The location of the categorical boundary between long and short, and also the consistency ('steepness') of the categorization function, was calculated following Ylinen et al. (2005). The categorical boundary is indicated in milliseconds. The value of steepness is interpreted in such a way that the smaller the value, the stronger the consistency of the categorization function.

Results
It was reported in Kanamura (2008) that several of the Chinese informants did not show correspondence between the long/short responses and the duration of V2 in the mamama/mama:ma stimuli. She excluded the data of such informants from the analysis. No such inconsistency between the response and the duration was found for the Swedes or for the Japanese in the current study, and thus none of the data were omitted in this regard. However, the data of one Japanese informant was eliminated from the result of ma-series, since the calculated boundary location of that person was extremely high compared with the others.

Perception of long and short vowel (ma-series)
Figure 1 indicates the percentage of 'short' responses to the stimuli. The leftmost stimulus on the x-axis (labeled '0.25') is the shortest sound and the rightmost the longest ('1.15'). The plotted responses traced s-shaped curves, and the curves turned out to be fairly close to each other. Differences are found at 0.55 and 0.70, and the 'short' responses by NJ differ visibly between unaccented and accented stimuli. The 'short' response to the accented stimuli at 0.55 dropped to a little below 80%, but that to the unaccented stimuli remained as high as almost 100%. SJ's responses to unaccented and accented stimuli at 0.55 appeared close to each other. Besides, the s-curves of SJ looked somewhat more gradual than those of NJ.

Figure 1. The percentage of "short" responses for stimuli with the shortest to the longest V2 (from left to right on the x-axis).

Table 3 shows the mean category boundary and the steepness of the categorization function. Two-way ANOVAs were conducted for the factors Group (SJ, NJ) and Accent Type (Unaccented, Accented), separately for the category boundary and the steepness.

Table 3. The category boundary location (ms) and the steepness of the categorization function in the unaccented (flat) and the accented stimuli of ma-series.

               Unaccented (flat)    Accented
               SJ       NJ          SJ       NJ
Boundary (ms)  199.6    200.0       191.3    182.6
SD             13.6     15.1        13.4     13.2
Steepness      27.8     16.3        27.6     18.8
SD             7.7      8.9         9.7      10.5

For the categorical boundary, the interaction between Group and Accent Type tended to be significant (F(1,33)=3.31, p<.10). There was a simple main effect of Group at the Accented condition, which also tended to be significant (F(1,33)=3.15, p<.10). Meanwhile, simple main effects of Accent Type at SJ (F(1,33)=5.57, p<.05) and at NJ (F(1,33)=24.34, p<.01) were significant. For both speaker groups, the location of the perceptual boundary was earlier in accented than in unaccented words. In sum, there was no difference between the two groups, but there was between the two accent types. 'Long' responses were invoked by shorter duration of V2 in the accented stimuli.

As for the steepness, there was no significant interaction for the factors Group and Accent Type. A significant main effect was found for Group (F(1,34)=11.48, p<.01). The mean steepness was greater for SJ than for NJ. This means that the categorization function of NJ is more consistent than that of SJ, in line with the interpretation of Figure 1 above.

Perception of long and short consonant (pa-series)
The ratio of 'short' responses to the stimuli in pa-series is given in Figure 2. As in ma-series, the plotted responses make s-shaped curves. A noticeable difference in the 'short' response rate was found at 0.55 and 0.70 on the x-axis. The difference between SJ and NJ seems to be greater for the accented stimuli: SJ's 'short' response drops below 60% at 0.55 while NJ marks around 80%, which makes SJ's curve more gradual than NJ's. But the curves of SJ and NJ for unaccented stimuli almost overlap with each other.

Figure 2. The percentage of "short" responses for stimuli with the shortest to the longest C3 (from left to right on the x-axis).

The categorical boundary and the steepness are shown in Table 4. Two-way mixed ANOVAs were carried out for the categorical boundary and the steepness with the factors Group (SJ, NJ) and Accent Type (Unaccented, Accented). The categorical boundary exhibited a tendency of interaction between the factors (F(1,34)=3.50, p<.10). A simple main effect of Group at the Accented condition also tended to be significant (F(1,34)=3.67, p<.10). It was natural that the difference caused by Accent Type did not reach statistical significance, because the consonant duration here corresponded to the closure (silence) duration, and there was of course no F0 information relevant to the quantity.
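The boundary and steepness values reported above can be obtained by fitting a sigmoid to each listener's identification responses. The following Python sketch shows one common way to do this; it is an illustration only, not the authors' actual procedure (which follows Ylinen et al., 2005), and the response proportions are invented:

```python
import numpy as np
from scipy.optimize import curve_fit

def identification_curve(dur_ms, boundary_ms, steepness_ms):
    # Logistic identification function: the proportion of 'long'
    # responses rises from 0 to 1 with duration. boundary_ms is the
    # 50% crossover; steepness_ms is the slope parameter, where a
    # smaller value means a steeper, more consistent categorization
    # (matching the paper's interpretation of 'steepness').
    return 1.0 / (1.0 + np.exp(-(dur_ms - boundary_ms) / steepness_ms))

# V2 durations of the seven ma-series steps (Table 1) and one
# hypothetical listener's proportion of 'long' responses per step.
durations = np.array([78.0, 128.0, 168.0, 213.0, 259.0, 303.0, 349.0])
p_long = np.array([0.0, 0.0, 0.1, 0.7, 1.0, 1.0, 1.0])

(boundary, steepness), _ = curve_fit(identification_curve,
                                     durations, p_long, p0=(200.0, 25.0))
print(boundary, steepness)  # boundary near 200 ms for this listener
```

Averaging the fitted boundary and steepness over listeners in each group would then give values of the kind shown in Tables 3 and 4.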


The result for steepness was similar to that of ma-series. There was no significant interaction between the factors Group and Accent Type (F(1,34)=0.00, n.s.), but there was a significant main effect of Group (F(1,34)=11.47, p<.01). The mean of NJ was smaller than that of SJ. Thus, the Japanese made a clearer distinction between long and short than the Swedes.

Table 4. The category boundary location (ms) and the steepness of the categorization function in the unaccented (flat) and the accented stimuli of pa-series.

               Unaccented (flat)    Accented
               SJ       NJ          SJ       NJ
Boundary (ms)  210.2    210.0       203.7    219.5
SD             22.0     31.0        19.6     29.8
Steepness      36.2     22.0        39.9     26.1
SD             15.5     8.7         16.3     21.1

Conclusion
The Swedish informants generally performed similarly to the Japanese. They were quite successful, as opposed to what has been reported on speakers of other L1s. This might be due to the use of duration in L1 phonology in Swedish, but further inter-linguistic comparison is necessary to obtain a clearer view.

It was expected that the two speaker groups would perform differently for the accented stimuli, but this was not the case. The F0 change caused an earlier long/short boundary for both of the groups. Further investigation is needed to find out whether this has been acquired during the course of learning or whether it originated from the characteristics of Swedish.

Acknowledgements
I would like to thank Kumi Kanamura for the permission to use the sound stimuli from her research in Kanamura (2008), and also the students and teachers at Lund University/Gifu University, Nagoya University and Stockholm University for their kind cooperation in the experiments. The research presented here was supported by The Scandinavia-Japan Sasakawa Foundation for the year 2008.

Notes
1. The stimuli of ma-series were adopted from Kanamura (2008) with permission.
2. According to personal interviews, Japanese classes at high schools were usually held only once a week, as opposed to several hours per week at universities. The length of studying Japanese excluding high school is 3-24 months.

References
Arai, T., Behne, D. M., Czigler, P. and Sullivan, K. (1999) Perceptual cues to vowel quantity: Evidence from Swedish and Japanese. Proceedings of the Swedish Phonetics Conference, Fonetik 99 (Göteborg) 81, 8-11.
Boersma, P. and Weenink, D. (2004) Praat: Doing phonetics by computer (Version 4.2) [Computer program]. Retrieved March 4, 2004, from http://www.praat.org.
Elert, C.-C. (1964) Phonologic Studies of Quantity in Swedish. Stockholm: Almqvist & Wiksell.
Han, M. S. (1965) The feature of duration in Japanese. Study of Sound (Journal of Phonetic Society of Japan) 10, 65-80.
Kanamura, K. (2008) Why the acquisition of long and short vowels in L2 Japanese is difficult?: The perception and recognition of Japanese long and short vowels by Chinese learners [in Japanese]. Doctoral dissertation, Nagoya University.
Kinoshita, K., Behne, D. M. and Arai, T. (2002) Duration and F0 as perceptual cues to Japanese vowel quantity. Proceedings of the International Conference on Spoken Language Processing, ICSLP-2002 (Denver, Colorado, USA), 757-760.
McAllister, R., Flege, J. E. and Piske, T. (2002) The influence of L1 on the acquisition of Swedish quantity by native speakers of Spanish, English and Estonian. Journal of Phonetics 30, 229-258.
Nagano-Madsen, Y. (1992) Mora and prosodic coordination: A phonetic study of Japanese, Eskimo and Yoruba. Travaux de l'Institut de Linguistique de Lund 27. Lund: Lund University Press.
Toda, T. (2003) Second Language Speech Perception and Production: Acquisition of Phonological Contrasts in Japanese. Lanham, MD: University Press of America.
Ylinen, S., Shestakova, S., Alku, P. and Huotilainen, M. (2005) The perception of phonological quantity based on durational cues by native speakers, second-language users and nonspeakers of Finnish. Language and Speech 48(3), 313-338.


Automatic classification of segmental second language speech quality using prosodic features

Eero Väyrynen (1), Heikki Keränen (2), Juhani Toivanen (3) and Tapio Seppänen (4)
(1, 2, 4) MediaTeam, University of Oulu
(3) MediaTeam, University of Oulu & Academy of Finland

Abstract
An experiment is reported exploring whether the general auditorily assessed segmental quality of second language speech can be evaluated with automatic methods, based on a number of prosodic features of the speech data. The results suggest that prosodic features can predict the occurrence of a number of segmental problems in non-native speech.

Introduction
Our research question is: is it possible, by looking into the supra-segmentals of a second language variety, to gain essential information about the segmental aspects, at least in a probabilistic manner? That is, if we know what kinds of supra-segmental features occur in a second language speech variety, can we predict what some of the segmental problems will be?

The aim of this research is to find out whether supra-segmental speech features can be used to construct a segmental model of Finnish second language speech quality. Multiple nonlinear polynomial regression methods (for general reference see e.g. Khuri (2003)) are used in an attempt to construct a model capable of predicting segmental speech errors based solely on global prosodic features that can be automatically derived from speech recordings.

Speech data
The speech data used in this study was produced by 10 native Finnish speakers (5 male and 5 female) and 5 native English speakers (2 male and 3 female). Each of them read two texts: first, a part of the Rainbow passage, and second, a conversation between two people. Each rendition was then split roughly from the middle into two smaller parts to form a total of 60 speech samples (4 for each person). The data was collected by Emma Österlund, M.A.

Segmental analysis
The human rating of the speech material was done by a linguist who was familiar with the types of problems usually encountered by Finns when learning and speaking English. The rating was not based on a scale rating of the overall fluency or a part thereof, but instead on counting the number of errors in individual segmental or prosodic units. As a guideline for the analysis, the classification by Morris-Wilson (1992) was used to make sure that especially the most common errors encountered by Finns learning English were taken into account.

The main problems for the speakers were, as was expected for native Finnish speakers, problems with voicing (often with the sibilants), missing friction (mostly /v, θ, ð/), voice onset time and aspiration (the plosives /p, t, k, b, d, g/), and affricates (post-alveolar instead of palato-alveolar). There were also clear problems with coarticulation, assimilation, linking, rhythm and the strong/weak form distinction, all of which caused unnatural pauses within word groups.

The errors were divided into two rough categories, segmental and prosodic, the latter comprising any unnatural pauses and word-level errors – problems with intonation were ignored. Subsequently, only the data on the segmental errors was used for the acoustic analysis.

Acoustic analysis
For the speech data, features were calculated using the f0Tool software (Seppänen et al. 2003). The f0Tool is a software package for automatic prosodic analysis of large quantities of speech data. The analysis algorithm first distinguishes between the voiced and voiceless parts of the speech signal using a cepstrum-based voicing detection logic (Ahmadi & Spanias 1999) and then determines the f0 contour for the voiced parts of the signal with a high-precision time-domain pitch detection algorithm (Titze & Haixiang 1993). From the speech signal, over forty acoustic/prosodic parameters were computed automatically. The parameters were:


A) general f0 features: mean, 1%, 5%, 50%, 95%, and 99% values of f0 (Hz), 1%-99% and 5%-95% f0 ranges (Hz)
B) features describing the dynamics of f0 variation: average continuous f0 rise and fall (Hz), average f0 rise and fall steepness (Hz/cycle), max continuous f0 rise and fall (Hz), max steepness of f0 rise and fall (Hz/cycle)
C) additional f0 features: normalised segment f0 distribution width variation, f0 variance, trend corrected mean proportional random f0 perturbation (jitter)
D) general intensity features: mean, median, min, and max RMS intensities, 5% and 95% values of RMS intensity, min-max and 5%-95% RMS intensity ranges
E) additional intensity features: normalised segment intensity distribution width variation, RMS intensity variance, mean proportional random intensity perturbation (shimmer)
F) durational features: average lengths of voiced segments, unvoiced segments shorter than 300ms, silence segments shorter than 250ms, unvoiced segments longer than 300ms, and silence segments longer than 250ms; max lengths of voiced, unvoiced, and silence segments
G) distribution and ratio features: percentages of unvoiced segments shorter than 50ms, between 50-250ms, and between 250-700ms; ratio of speech to long unvoiced segments (speech = voiced + unvoiced<300ms); ratio of voiced to unvoiced segments; ratio of silence to speech (speech = voiced + unvoiced<300ms)
H) spectral features: proportions of low frequency energy under 500 Hz and under 1000 Hz

Nonlinear regression
A multiple polynomial nonlinear regression of segmental speech errors was performed. First, the raw prosodic features x_p, p = 1, 2, ..., 46, were scaled to zero mean and to the interval −1 ≤ x_p ≤ 1. The scaled features were then transformed using first and second order Legendre polynomials and cross terms to introduce nonlinearity:

    P^1_p = x_p,
    P^2_p = (3 x_p^2 − 1) / 2,
    P^3_pq = x_p x_q,  p < q.

The resulting total of 1127 new features P was then searched to find the best performing regression coefficients, as described next.

A sequential forwards-backwards floating search (SFFS) (Pudil et al. 1994) was used to find a set of the 15 best features a_k ∈ P by minimising the sum of squared errors (SSE) given by solving a standard multiple linear regression procedure:

    y_i = β_0 + β_1 a_i1 + β_2 a_i2 + ... + β_k a_ik + ε_i,

where i = 1, 2, ..., n are independent samples, β_k are the regression parameters, and ε_i is a random error term. An LMS solution

    (A^T A) β̂ = A^T y

was used to produce the regression estimates

    ŷ_i = β̂_0 + β̂_1 a_i1 + β̂_2 a_i2 + ... + β̂_k a_ik.

The best features were then transformed using a robust PCA method (Hubert et al. 2005) to remove any linear correlations. The 15 PCA-transformed features (scaled to the interval −1 ≤ x ≤ 1) were then searched again with SFFS to select a set of 8 final PCA-transformed features.

The motivation for this process was to combat any overlearning of the data by limiting the number of resulting regression coefficients to as small a number as possible. A person-independent hold-out cross-validation process was also used throughout the feature selection procedure to ensure generalization of the resulting models. In the hold-out procedure, each person's samples were rotated out from the database in turn, and a regression model was trained using the remaining people's samples. The resulting models were then used to predict the corresponding set of samples held out.

The final regression model was then validated by inspecting the total cross-validated regression residual and each of the individual p values of the final cross-validation training round. The p values represent the probability

the data is drawn from a distribution consistent with the null hypothesis, where the prosodic data is assumed to contain no explanatory linear components at all. Finally, the consistency of the speaker-independent regression coefficients was inspected to ensure the validity and stability of the model.

Results
The first feature selection resulted in a feature vector that contains no second order polynomials. The 15 features are described in Table 1. A selected cross term is indicated as "feature X feature".

Table 1. Selected features in the first search

trend corrected mean proportional random f0 perturbation (jitter)
average lengths of unvoiced segments longer than 300ms
normalised segment intensity distribution width variation
50% values of f0 X percentages of unvoiced segments between 250-700ms
1% values of f0 X max lengths of silence segments
average continuous f0 rise X average continuous f0 fall
average continuous f0 fall X max steepness of f0 fall
average continuous f0 fall X max lengths of voiced segments
max continuous f0 fall X ratio of speech to long unvoiced segments (speech = voiced + unvoiced<300ms)
max steepness of f0 rise X ratio of speech to long unvoiced segments (speech = voiced + unvoiced<300ms)
RMS intensity variance X average lengths of silence segments longer than 250ms
unvoiced segments longer than 300ms X percentages of unvoiced segments shorter than 50ms
max lengths of voiced segments X max lengths of silence segments
max lengths of voiced segments X normalised segment f0 distribution width variation
ratio of silence to speech (speech = voiced + unvoiced<300ms) X normalised segment f0 distribution width variation

No second order Legendre polynomials were included by the selection procedure, while many cross terms were included in the regression. The omission of second order nonlinearity and the heavy reliance on cross terms suggest that differentiating information is perhaps coded more as co-occurrences of features than as strict nonlinear combinations of raw prosodic features.

After the robust PCA transformation, a feature vector containing the 8 best linearly independent features was selected as the final regression model. The resulting regression coefficients of each person-independent training round were found to be consistent, with little or no variation. The final model's relevant person-independent cross-validation training regression statistics are shown in Table 2. In the table, the range of R2 and p values for the training is shown together with the cross-validated R2 and p value for the testing. The p values indicate that the null hypothesis can be rejected, i.e. the prosodic features do contain a model capable of predicting speech proficiency. The R2 values further show that a majority of the variance is explainable by the chosen features.

Table 2. Final regression statistics

Person           R2              p
training         0.752 – 0.858   <0.001
cross-validated  0.659           <0.001

The cross-validated residual of the final 8 robust PCA feature regression is shown in Figure 1, and the corresponding scatterplot of human and regression estimates of errors in Figure 2.
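As a concrete illustration of the feature expansion and LMS fit described in the Nonlinear regression section, the following sketch reproduces the dimensionality of the expansion (46 raw features yield 46 + 46 + 1035 = 1127 terms) and a least-squares fit on a hypothetical 15-term subset. The data here is random stand-in data, and the SFFS search and robust PCA steps are omitted:

```python
import numpy as np

def legendre_expand(X):
    """Expand scaled features (columns in [-1, 1]) with first and
    second order Legendre polynomials plus pairwise cross terms:
    P1_p = x_p, P2_p = (3*x_p**2 - 1)/2, P3_pq = x_p*x_q (p < q)."""
    n, d = X.shape
    cross = [X[:, p] * X[:, q] for p in range(d) for q in range(p + 1, d)]
    return np.column_stack([X, (3.0 * X**2 - 1.0) / 2.0] + cross)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 46))  # 60 samples x 46 raw features
P = legendre_expand(X)
print(P.shape)  # (60, 1127)

# Least-squares fit of a 15-term model plus intercept, i.e. the LMS
# solution of (A^T A) beta = A^T y, on a hypothetical column subset:
A = np.column_stack([np.ones(len(X)), P[:, :15]])
y = rng.normal(size=len(X))
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta  # regression estimates
```

In the study itself, the 15 columns would be the SFFS-selected features and y the per-sample segmental error counts; the sketch only shows the mechanics of the expansion and the fit.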


Figure 1. Cross-validated regression residuals. The data is ordered in ascending human error, with the circles indicating the residual errors.

The resulting regression residuals indicate that perhaps some nonrandom trend is still present. Some relevant information not included by the regression model is therefore possibly still present in the residual errors. It may be that the prosodic features used do not include this information, or that more coefficients in the model could be justified.

Figure 2. A scatterplot of human errors against the corresponding cross-validated regression estimates. The solid line shows for reference where a perfect linear correspondence is located, and the dashed line is a least squares fit of the data.

The scatterplot shows a linear dependence of 0.76 between the human and regression estimates, with 66% of the variance explained.

Conclusion
The results suggest that segmental fluency or "correctness" in second language speech can be modelled using prosodic features only. It seems that segmental and supra-segmental second language speech skills are interrelated. Parameters describing the dynamics of prosody (notably, the steepness and magnitude of f0 movements – see Table 1) are strongly correlated with the evaluated segmental quality of the second language speech data. Generally, it may be the case that segmental and supra-segmental (prosodic, intonational) problems in second language speech occur together: a command of one pronunciation aspect may improve the other. Some investigators actually argue that good intonation and rhythm in a second language will, almost automatically, lead to good segmental features (Pennington, 1989). From a technological viewpoint, it can be concluded that a model capable of estimating segmental errors can be constructed using prosodic features. Further research is required to evaluate whether a robust test and index of speech proficiency can be constructed. Such an objective measure can be seen as a speech technology application of great interest.

References
Ahmadi, S. & Spanias, A.S. (1999) Cepstrum-based pitch detection using a new statistical V/UV classification algorithm. IEEE Transactions on Speech and Audio Processing 7(3), 333–338.
Hubert, M., Rousseeuw, P.J. & Vanden Branden, K. (2005) ROBPCA: a new approach to robust principal component analysis. Technometrics 47, 64–79.
Khuri, A.I. (2003) Advanced Calculus with Applications in Statistics, Second Edition. Wiley, Inc., New York, NY.
Morris-Wilson, I. (1992) English segmental phonetics for Finns. Loimaa: Finn Lectura.
Pennington, M.C. (1989) Teaching pronunciation from the top down. RELC Journal, 20–38.
Pudil, P., Novovičová, J. & Kittler, J. (1994) Floating search methods in feature selection. Pattern Recognition Letters 15(11), 1119–1125.
Seppänen, T., Väyrynen, E. & Toivanen, J. (2003) Prosody-based classification of emotions in spoken Finnish. Proceedings of the 8th European Conference on Speech Communication and Technology, EUROSPEECH-2003 (Geneva, Switzerland), 717–720.
Titze, I.R. & Haixiang, L. (1993) Comparison of f0 extraction methods for high-precision voice perturbation measurements. Journal of Speech and Hearing Research 36, 1120–1133.


Children's vocal behaviour in a pre-school environment and resulting vocal function

Mechtild Tronnier (1) and Anita McAllister (2)
(1) Department of Culture and Communication, University of Linköping
(2) Department of Clinical and Experimental Medicine, University of Linköping

Abstract

This study aims to shed some light on the relationship between the degree of hoarseness in children’s voices observed at different times during a day in pre-school and different aspects of their speech behaviour. Behavioural aspects include speech activity, phonation time, F0 variation, speech intensity and the relationship between speech intensity and background noise intensity. The results show that the children behave differently and that the same type of behaviour has a varied effect on the different children. It can be seen from two children with otherwise very similar speech behaviour that the fact that one of them produces speech at a higher intensity level also brings about an increase of hoarseness by the end of the day in pre-school. The speech behaviour of the child with the highest degree of hoarseness, on the other hand, cannot be observed to put an extreme load on the vocal system.

Introduction

Speaking with a loud voice in noisy environments in order to make oneself heard demands some vocal effort and has been shown to harm the voice in the long run.

In several studies on vocal demands in different professions it has been shown that pre-school teachers are rather highly affected and that voice problems are common (Fritzell, 1996; Sala, Airo, Olkinuora, Simberg, Ström, Laine, Pentti & Suonpää, 2002; Södersten, Granqvist, Hammarberg & Szabo, 2002). This problem is to a large extent based on the need of the members of this professional group to make themselves heard over the surrounding noise, mainly produced by the children present.

It is reasonable to assume that children’s voices are affected by background noise just as much as adult voices. As children most of the time contribute to the noise in a pre-school setting themselves – rather than other environmental factors such as traffic – they are exposed to the noise source even more potently, as they are closer to it. Another factor pointing in the same direction is their shorter body length compared to pre-school teachers.

In an earlier study by McAllister et al. (2008, in press), the perceptual evaluation of pre-school children showed that the girls’ voices received higher values on breathiness, hyperfunction and roughness by the end of the day, which for the boys was only the case for hyperfunction.

In the present study the interest is directed to the speech behaviour of children in relation to the background noise and the effect on vocal function. Diverse acoustic measurements were carried out for this purpose.

The investigation of speech activity was chosen to show the individuals’ liveliness in the pre-school context and includes voiced and voiceless speech segments, as well as non-speech voiced segments such as laughter, throat clearing, crying etc. In addition, measurements of phonation time were chosen, reflecting the vocal load. It should be noted that in some parts of speech intended voicing could fail due to the irregularity of the vocal fold vibrations – the hoarseness of the speaker’s voice. Therefore, both measurements, speech activity and phonation time, were considered important. In a study on phonation time in different professions, Masuda et al. (1993) showed that the proportion for pre-school teachers corresponded to 20% of working time, which is considered a high level compared to e.g. nurses with a corresponding level of 5.3% (Ohlsson, 1988). Having these findings in mind, the degree of children’s speech activity and phonation time and the consequences for perceived voice quality is an interesting issue.

Other factors for the analysis of vocal load consist of F0, including F0-variation, and speech intensity, including intensity variation. A vocal trauma can result from using a high fundamental frequency and high vocal loudness (and hyperfunction, which this study is not focusing on in particular), as Södersten et al. point out.

One further aspect that may increase the risk for voice problems is the need of a speaker to be heard over background noise. Therefore,

the relationship between speech intensity and background noise intensity is investigated. According to the results of Södersten et al., the subjects’ speech was 9.1 dB louder than the environmental noise, in an already noisy environment.

Material and Method

The material investigated in the present study is part of the data gathered for the project Barn och buller (Children and noise). The project is a cooperation between the University of Linköping and KTH, Stockholm, within the larger BUG project (Barnröstens utveckling och genusskillnader; Child Voice Development and Gender Differences; http://www.speech.kth.se/music/projects/BUG/abstract.html). It consists of the data of selected recordings of four five-year-old children attending different pre-schools in Linköping. These children were recorded using a binaural technique (Granqvist, 2001) three times during one day at the pre-school: at arrival in the morning (m) and gathering, during lunch (l), and in the afternoon during play time (a). The binaural recording technique makes it possible to extract one audio file containing the child’s speech activity (1) and one file containing the surrounding sound (2). Each recording consisted of two parts. First a recording under a controlled condition was made, where the children were asked to repeat the following phrases three times: “En blå bil. En gul bil. En röd bil”. Furthermore, spontaneous speech produced during the subsequent activities at the pre-school was recorded for approximately one hour.

The recordings of the controlled condition, comprising the phrase repetitions, were used in an earlier study (McAllister et al., in press) in which three professional speech pathologists perceptually assessed the degree of hoarseness, breathiness, hyperfunction and roughness. Assessment was carried out by marking the degree of each of the four voice qualities plus an optional parameter on a Visual Analog Scale (VAS).

The averaged VAS-ratings by the speech pathologists for the four children regarding the comprehensive voice quality hoarseness were used as a selection criterion in the present investigation. The selected children showed different tendencies regarding the variation in hoarseness over the day at pre-school (see e.g. Table 1):
• child A showed a marked increase of hoarseness,
• child B showed some increase of hoarseness,
• child C showed no increase of hoarseness,
• child D showed a clear decrease of hoarseness.

The development of the children’s voices over the day was compared to the development of several acoustic measures of the recordings of spontaneous speech, shedding light on the children’s speech behaviour and activity and the use of the voice.

The speech activity of each child during each recording session was calculated by setting the number of obtained intensity counts in relation to the potential counts of the whole recording, according to an analysis in PRAAT with a sampling rate of 100 Hz (in %). Furthermore, phonation time was calculated by setting the number of obtained F0-measures in relation to the potential counts of the whole recording, according to an analysis in PRAAT with a sampling rate of 100 Hz (in %).

An analysis of the fundamental frequency and the intensity was carried out in PRAAT with a sampling rate of 100 Hz for file (1), which contains the child’s speech. Intensity measures were also normalised in comparison to a calibration tone and with regard to a microphone distance from the mouth of 15 cm.

For both F0- and intensity measurements, the mean value, standard deviation and median (in Hz and dB) were calculated. For the sake of interpretation of the results regarding the measurements of fundamental frequency, additional F0-measurements of controlled speech obtained from the BUG-material for each child are given in the results.

Concerning the background noise investigation, the intensity was calculated in PRAAT with a sampling rate of 100 Hz for file (2). Intensity measurements for this channel were normalised in comparison to a calibration tone.

Descriptive statistics were used for the interpretation of the measurements. The degree of hoarseness is not a direct consequence of the speech behaviour reflected by the acoustic measurements presented in the same rows in the tables below, because the recordings of the controlled condition were made before the recordings of spontaneous speech.

Results

In this section the results of the diverse measurements are presented. In the tables, the perceptual ratings of degree of hoarseness obtained from an earlier study are shown, too. They are, however, not considered any further here but

are relevant for the next section, the discussion.

Speech activity and phonation time

Speech activity increases for children A, B and D between morning and afternoon and is highest for children B and C (Table 1). Child C has a higher activity in the morning and during lunch, but decreases activity in the afternoon.

Phonation time is in general highest for child B, who also shows the strongest increase over the day. Children A and D show the lowest phonation time; however, child A shows a clearly higher level in the afternoon. Child C has a degree of phonation time in between, with the highest measure at lunchtime.

Table 1. Degree of hoarseness, speech activity and phonation time.

Child, recording   Hoarseness (mm VAS)   Speech activity (%)   Phonation time (%)
A, m               53                    20.6                  9.5
A, l               81                    19.4                  9.9
A, a               72.5                  30.5                  13
B, m               16.5                  26.5                  14.5
B, l               21                    33.3                  17.7
B, a               25.5                  39.7                  25.2
C, m               29.5                  34.1                  11.9
C, l               26                    34.2                  14.2
C, a               28.5                  24.9                  12.3
D, m               20.5                  16.9                  8.1
D, l               18.5                  21.2                  9.9
D, a               10                    28.8                  9.7

[Figure 1: scatter plot of phonation time (%, x-axis, 0–30) against speech activity (%, y-axis, 0–45), with regression line; R² = 0.6381.]

Figure 1. Correlation between speech activity and phonation time.

There is generally a good correlation between speech activity and phonation time, as can be seen in Figure 1. This means that both measures point in the same direction, giving an indication of whether a child is an active speaker or not.

Fundamental frequency (F0)

Table 2 shows not only the results from spontaneous speech but also measures of the mean fundamental frequency obtained from the BUG-recording with controlled speech.

Child B produces speech at a relatively high mean F0 with a large F0-range in the morning and decreases mean F0 and range over the rest of the day. Furthermore, child B produces speech at a clearly higher mean F0 in spontaneous speech compared to controlled speech.

Child C presents a relatively strong increase of mean F0 over the day; however, the range is broad in the morning and at lunch but less broad in the afternoon. Mean F0 is relatively high for spontaneous speech compared to the controlled condition in the afternoon but only moderately higher at the other times of the day.

Child D shows a moderate increase of F0 and maintains a fairly stable F0-range over the day. Mean F0 is higher in the morning and at lunch for spontaneous speech compared to controlled speech; however, for the afternoon recording F0 is much higher for the controlled condition.

Table 2. Degree of hoarseness, mean fundamental frequency, F0 standard deviation, F0 median and mean F0 for controlled speech.

Child, recording   Hoarseness (mm VAS)   F0 mean (Hz)   F0 sd (Hz)   F0 median (Hz)   F0 mean, controlled (Hz)
A, m               53                    322            77           308              354
A, l               81                    331            79           325              307
A, a               72.5                  328            74           315              275
B, m               16.5                  369            100          358              266
B, l               21                    308            88           295              236
B, a               25.5                  305            85           296              236
C, m               29.5                  290            108          292              284
C, l               26                    302            110          306              285
C, a               28.5                  335            92           328              279
D, m               20.5                  312            91           298              270
D, l               18.5                  321            88           311              279
D, a               10                    332            90           318              354

No clear tendency over the day can be observed for child A; possibly a moderate F0-increase at lunch and a slight F0-decrease in the afternoon. The range varies little between morning and lunch and decreases somewhat in the afternoon. Even the relationship between the F0-measurements of the different conditions shows quite a discrepancy: higher F0 occurs for the controlled recording in the morning, but at the other two instances F0 is higher for the spontaneous recordings, where the difference is largest in the afternoon.

Child A uses in general a narrow F0-range, whereas child C shows the broadest F0-range.

Speech intensity

Child B produces speech with the highest intensity in general and child C with the lowest (Table 3). Child B presents little variation in intensity at lunch and in the afternoon, which is equally high for both times of the day. Also, the median of intensity is clearly higher at lunch and in the afternoon compared to the other children, and for all recordings higher than the mean. This means that child B is producing speech at high vocal loudness most of the time.

Table 3. Degree of hoarseness, mean intensity, standard deviation and median of intensity.

Child, recording   Hoarseness (mm VAS)   Intensity mean (dB)   Intensity sd (dB)   Intensity median (dB)
A, m               53                    71                    18                  72
A, l               81                    72                    18                  75
A, a               72.5                  71                    19                  72
B, m               16.5                  76                    23                  80
B, l               21                    73                    17                  76
B, a               25.5                  78                    17                  83
C, m               29.5                  64                    16                  65
C, l               26                    65                    17                  67
C, a               28.5                  70                    17                  72
D, m               20.5                  72                    17                  77
D, l               18.5                  71                    17                  75
D, a               10                    70                    17                  72

Speech intensity and background noise

It can be seen in Table 4 that the intensity of the children’s speech is lower than the intensity of the background noise in most cases. However, child C, who is exposed to the highest level of background noise, produces speech with the lowest intensity level. Child B, who also is exposed to a fairly high level of background noise, on the other hand produces speech at a relatively high intensity level, which in the afternoon is even higher than the level of the background noise. In the case of the lowest measured levels of background noise (70 dB and 71 dB), the children exposed to those levels produce speech either slightly stronger (child D) or at the same intensity level (child A).

Table 4. Degree of hoarseness, mean background noise intensity, mean speech intensity and the difference between the intensity levels.

Child, recording   Hoarseness (mm VAS)   Background intensity mean (dB)   Child’s speech intensity mean (dB)   Difference (dB)
A, m               53                    75                               71                                   -4
A, l               81                    79                               72                                   -7
A, a               72.5                  71                               71                                   0
B, m               16.5                  82                               76                                   -6
B, l               21                    76                               73                                   -3
B, a               25.5                  73                               78                                   5
C, m               29.5                  81                               64                                   -17
C, l               26                    81                               66                                   -15
C, a               28.5                  78                               70                                   -8
D, m               20.5                  70                               72                                   2
D, l               18.5                  75                               71                                   -4
D, a               10                    73                               71                                   -2

Discussion

In this section the relationship between the children’s speech behaviour presented in the results and the degree of hoarseness obtained from an earlier study (McAllister et al.) is discussed. As there is a good correlation between speech activity and phonation time (see Figure 1), these parameters will be discussed together.

Hoarseness vs. speech activity and phonation time

The child with the highest increase of speech activity and phonation time over the day – child B – also shows a clear increase of hoarseness. Child B, however, is not the child with the highest degree of hoarseness. The child with the highest degree and increase of hoarseness – child A – does not show the highest degree of speech activity and phonation time. Child A reveals most speech activity and highest phonation time in the afternoon, but the highest degree of hoarseness around lunchtime. Child C is an active child, but does not present us with a change for the worse in terms of hoarseness. However, this child exhibits a slightly higher degree of hoarseness than child B. Child D shows a fairly low level and an increase of speech activity over the day, but a decrease in hoarseness.
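The correlation underlying Figure 1 can be recomputed directly from the values in Table 1. The following Python sketch (a minimal illustration; the values are transcribed from Table 1) computes the squared Pearson correlation between speech activity and phonation time, which agrees with the R² = 0.6381 given in Figure 1 up to rounding of the transcribed values:

```python
# Speech activity (%) and phonation time (%) per recording,
# transcribed from Table 1 (children A-D; morning, lunch, afternoon).
activity  = [20.6, 19.4, 30.5, 26.5, 33.3, 39.7,
             34.1, 34.2, 24.9, 16.9, 21.2, 28.8]
phonation = [9.5, 9.9, 13.0, 14.5, 17.7, 25.2,
             11.9, 14.2, 12.3, 8.1, 9.9, 9.7]

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

print(round(r_squared(activity, phonation), 2))  # ≈ 0.64, cf. R² = 0.6381 in Figure 1
```

That the two measures correlate this strongly is what motivates discussing them together.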


The parameters speech activity and phonation time alone do therefore not give a clear explanation of vocal fatigue. However, child B, who suffers from vocal fatigue by the end of the day, presents us with an amount of phonation time comparable to what has been found for pre-school teachers (Masuda et al., 1993).

Hoarseness vs. fundamental frequency

Child B presents us with a particularly high mean F0 – much higher than F0 under the controlled condition – and a broad F0-range in the morning, followed by an increase of hoarseness later in the day. However, child C, with a fairly high F0-increase over the day and a high F0-range, is not affected by a change of hoarseness for the worse. Child A, with fairly stable mean F0 and F0-range over the day, presents us with a higher degree of hoarseness after the morning recordings. Child D, with comparably stable mean F0 and F0-range, on the other hand improved in voice condition over the day.

The use of a high F0 and a high F0-range alone does not seem to account for voice deterioration.

Hoarseness vs. speech intensity

The child (B) producing speech at the highest loudness level is the one that suffers most from increased voice problems later in the day. Strenuous speech production with high intensity therefore seems to be an important parameter to take into consideration when accounting for voice problems in children.

Hoarseness vs. speech intensity and background noise

The children react in different ways to the level of background noise. Being exposed to a high level of background noise, one of the active children (B) seems to be triggered into loud voice use, whereas another speech-active child (C) does not behave in the same way, but produces a much softer voice. The child reacting with a stronger voice (B) also responds with increased hoarseness later in the day.

As has been presented in the results, the children never produce speech at a loudness of 9.1 dB above the background noise, the level that had been found by Södersten et al. to occur for pre-school teachers. However, normalisation with regard to microphone distance for the children’s speech might be a question to be considered, since no comparable normalisation has been carried out for the recordings of the background noise.

General discussion

It can be found that child B represents a typical child in a risk zone who suffers from voice problems by the end of the day due to hazardous voice use: being a lively child in a very noisy environment leads to making use of a loud, strong voice with a relatively high fundamental frequency and a high fundamental frequency range, resulting in vocal fatigue at the end of the day, reflected by an increase of hoarseness. The results for this child agree with Södersten et al. in that producing a high fundamental frequency at high vocal loudness can lead to vocal trauma.

Child A, on the other hand, does not show any particularly unusual voice use. However, the degree of hoarseness was already very high in the first recording made in the morning and increases further during the day. The high degree of hoarseness could have had an influence on the calculation of the different acoustic measurements; e.g. phonation time is fairly low, because the algorithm is not able to pick out the periodical parts of speech. Even the low measure of intensity might have been affected, since voiced sounds show stronger intensity, which might be lacking due to the child’s high degree of hoarseness. It should however be noted that child A does not present us with a high number of measurements of speech activity, so that a low degree of phonation time is unlikely to be based on period-picking problems. This child might have a predisposition for a hoarse voice, or an already established voice problem.

Child C seems to be typical of a lively child with high speech activity (Table 1) and a broad F0-range (Table 2). On the other hand, this child shows the lowest speech intensity (Table 3), which seems to be a good prerequisite for preventing voice problems. This child presents us with a somewhat higher degree of hoarseness than child B, but there is no change for the worse over the day. When taking a look at how child C uses her voice, one can find that she is humming a lot by herself.

Child D is most lively in the afternoon, when the degree of hoarseness is lowest (Table 1). Obviously, this child seems to need to warm up the voice during the day, which results in the highest activity later in the day combined with the best voice condition.
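Returning to the comparison with Södersten et al.: whether any child reaches the 9.1 dB margin can be read off Table 4 directly. A minimal Python sketch (the intensity means are transcribed from Table 4) reproduces the difference column and its maximum:

```python
# Mean background noise intensity and mean child speech intensity (dB)
# per recording, transcribed from Table 4 (children A-D; m, l, a).
background = [75, 79, 71, 82, 76, 73, 81, 81, 78, 70, 75, 73]
speech     = [71, 72, 71, 76, 73, 78, 64, 66, 70, 72, 71, 71]

# Child minus background: positive values mean the child was louder
# than the surrounding noise at that recording.
diffs = [s - b for s, b in zip(speech, background)]
print(diffs)       # [-4, -7, 0, -6, -3, 5, -17, -15, -8, 2, -4, -2]
print(max(diffs))  # 5 -> well below the 9.1 dB found for pre-school teachers
```

Only child B in the afternoon exceeds the background level at all, and then by 5 dB.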


In summary, the children behave in different ways. Being an active child in a noisy environment can lead to a high level of speech activity and phonation time, and a high F0 and F0-range. However, an increase of hoarseness rather seems to occur if speech is produced with high intensity on top of high levels of the other parameters. The child reacting with a louder voice also responds with increased hoarseness later in the day. As has been shown above, another speech-active child’s voice was not affected in the same direction, as it produces speech at a much weaker intensity.

Conclusions

A lively child with high speech activity in a noisy environment, feeling the need to compete with the background noise by producing loud speech, is at risk of suffering from vocal fatigue. An equally lively child without the need to make her/himself heard in an equally noisy environment is spared a similar outcome. Putting a high vocal load on the voice by producing speech at a high intensity level is therefore likely to be the key parameter leading to a raised level of hoarseness. A child with a predisposition for hoarseness seems to be at risk of suffering from even stronger hoarseness later in the day, even if there are no signs of extreme use of the voice.

References

Fritzell, B. (1996). Voice disorders and occupations. Logopedics Phoniatrics Vocology, 21, 7-21.
Granqvist, S. (2001). The self-to-other ratio applied as a phonation detector. Paper presented at the IV Pan-European Voice Conference, Stockholm, August 2001.
Masuda, T., Ikeda, Y., Manako, H., and Komiyama, S. (1993). Analysis of vocal abuse: fluctuations in phonation time and intensity in 4 groups of speakers. Acta Oto-Laryngologica, 113(4), 547-552.
McAllister, A., Granqvist, S., Sjölander, P., and Sundberg, J. (2008, in press). Child voice and noise: A pilot study of noise in day cares and the effects on 10 children’s voice quality according to perceptual evaluation. Journal of Voice. doi:10.1016/j.jvoice.2007.10.017
Ohlsson, A.-C. (1988). Voice and Working Environments. Gothenburg: Gothenburg University (doctoral dissertation).
Sala, E., Airo, E., Olkinuora, P., Simberg, S., Ström, U., Laine, A., Pentti, J., & Suonpää, J. (2002). Vocal loading among day care center teachers. Logopedics Phoniatrics Vocology, 27, 21-28.
Södersten, M., Granqvist, S., Hammarberg, B., and Szabo, A. (2002). Vocal behavior and vocal loading factors for preschool teachers at work studied with binaural DAT recordings. Journal of Voice, 16(3), 356-371.


Major parts-of-speech in child language – division in open and close class words

E. Klintfors, F. Lacerda and U. Sundberg Department of Linguistics, Stockholm University, Stockholm

Abstract

The purpose of this study was to assess relations between major parts-of-speech in 14- to 43-months-old infants. Therefore a division into open class and close class words was made. Open class words consist of nouns, verbs and adjectives, while the group of close class words is mainly constituted of grammatical words such as conjunctions, prepositions and adverbs. The data was collected using the Swedish Early Communicative Development Inventory, a version of the MacArthur Communicative Development Inventory. The number of open and close class words was estimated by summing items from diverse semantic categories. The study was based on a mixture of longitudinal and cross-sectional data from 28 completed forms. The results showed that while the total number of items in the children’s vocabularies grew as the child got older, the proportional division into open vs. close class words – approximately 90-10% – was unchanged.

Introduction

This study is performed within the multidisciplinary research project Modeling Interactive Language Learning¹ (MILLE, supported by the Bank of Sweden Tercentenary Foundation). The goal of the project is to study how general-purpose mechanisms may lead to the emergence of linguistic structure (e.g. words) under the pressure of exposure to the ambient language. The human subject part of the project uses data from infant speech perception and production experiments and from adult-infant interaction. The non-human animal part of the project uses data from gerbil discrimination and generalization experiments on natural speech stimuli. And finally, within the modeling part of the project, mathematical models simulating infants’ and animals’ performances are implemented. In these models the balance between variance in the input and the formation of phonological-like categories under the pressure of different amounts of available memory representation space is of interest.

The aim of the current study is to explore the major parts-of-speech in child language. Therefore an analysis of questionnaire data based on parental reports of their infants’ communicative skills regarding open and close class words was carried out.

¹ A collaboration between the Department of Linguistics, Stockholm University (SU, Sweden), the Department of Psychology, Carnegie Mellon University (CMU, USA), and the Department of Speech, Music and Hearing, Royal Institute of Technology (KTH, Sweden).

Background

The partition into words that belong to the so-called open class and those that belong to the close class is a basic division in major parts-of-speech. The open class is “open” in the sense that there is no upper limit to how many units the class may contain, while the close class has relatively few members. The open and close class words also tend to have different functions in the language: the open class words often carry content, while the close class words modify the relations of the semantically loaded content words.

Why would children pay attention to open class words? Children, as well as adults, look for meaning in what they see and hear. Therefore, the areas of interest and the cognitive development of the child are naturally factors that constrain what is learned first. Close class words seldom refer to something concrete that can be pointed out in the physical world in the way open class words do (Strömqvist, 2003). Also, close class words are not expected to be learned until the child has reached a certain grammatical maturity (Håkansson, 1998). Perceptual prominence and frequency are other factors that influence what is learned first (Strömqvist, 1997). Prosodic features such as length and stress make some content words more salient than others. Also, if a word occurs

more often in the language input of the child, it is easier to recognize.

Estimations of children’s use of open vs. close class words may be based on counts of types and occurrences. For example, in a longitudinal study of four Swedish children and their parents it was shown that the 20 most frequent word types stand for approximately 35-45% of all the word occurrences in child language, as well as in adults’ speech directed towards children (Strömqvist, 1997). And even more notably, there were almost no open class words among these 20 most frequent words in child language or in child-directed speech (CDS) in the Swedish material. On the contrary, close class words such as de, du, va, e, ja, den, å, så constituted the most common word forms. These word forms were most often unstressed and phonologically/phonetically reduced (e.g. the words were monosyllabic, and the vowels were centralized). Nevertheless, it should be mentioned that the transcriptions used were not disambiguated, in the sense that one sound might stand for much more than the child is able to articulate. For example, the frequent e might be generalized to signify är (eng. is), det (eng. it/that), ner (eng. down) etc.

In the current study, the questionnaires based on parental reports prompted for word types produced by the child. The use of words was differentiated by whether the word in question was used “occasionally” or “often” by the child, but no estimations of the number of word occurrences were made. Therefore the materials used in the current study allow only for comparison of the types of words used.

Based on the earlier study by Strömqvist we should thus expect our data to show a large and maybe growing proportion of close class words. For example, the proportion of open vs. close class words measured at three different time points, corresponding to growing vocabulary sizes, could progress as follows: 90-10%, 80-20%, 70-30% etc. But on the other hand, the typically limited number of close class words in languages should be reflected in the sample, and therefore our data should – irrespective of the child’s vocabulary size – reveal a large and stable proportion of open class words as compared to close class words, measured at different time points corresponding to growing vocabulary sizes (e.g. 90-10%, 90-10%, 90-10% etc.).

Eriksson and Berglund (1995) indicate that SECDI can to a certain extent be used for screening purposes to detect and follow up children who show tendencies of delayed or atypical language development. The current study is a step in the direction of finding reference data for typical development of open vs. close class words. Atypical development of close class words might thus give information on potentially deviant grammatical development.

Method

The Swedish Early Communicative Development Inventory (SECDI), based on parental reports, exists in two versions: one version on words & gestures for 8- to 16-months-old children and the other version on words & sentences for 16- to 28-months-old children. In this study the latter version, divided into checklists of 711 words belonging to 21 semantic categories, was used. The inventory may be used to estimate receptive and productive vocabulary, use of gestures and grammar, maximal length of utterance, as well as pragmatic abilities (Eriksson & Berglund, 1995).

Subjects

The subjects were 24 Swedish children (13 girls and 11 boys, age range 6.1 to 20.6 months at the start point of the project) randomly selected from the national Swedish address register (SPAR). Swedish was the primary language spoken in all the families, with the exception of two mothers who primarily spoke French and Russian respectively. The parents of the subjects were not paid to participate in the study. Children who only participated during the first part of the collection of longitudinal data (they had only filled in the version of SECDI for 8- to 16-months-old children) were excluded from the current study, resulting in 28 completed forms filled in by 17 children (10 girls, 7 boys, age range 14 to 43 months at the time point of data collection). The data collected was a mixture of longitudinal and cross-sectional data as follows: 1 child completed 4 forms, 1 child completed 3 forms, 6 children completed 2 forms, and 9 children completed 1 form.

Materials

To estimate the number of open class words, the sections A2 to A12, as well as A14 and A15, were included. The semantic categories of these sections are listed in Table 1. Section A1-Sound effects/animal sounds (e.g. mjau) and A13-Games/routines (e.g. god natt, eng. good night) were not considered as representative


open class words and were therefore excluded from the analysis. The sections A16-A21 constituted the group of close class words, belonging to the semantic categories listed in Table 2.

Table 1. The semantic categories included for estimation of the number of open class words.

Section   Semantic category       Examples of words
A2        Animals (real/toys)     anka (eng. duck)
A3        Vehicles (real/toys)    bil (eng. car)
A4        Toys                    boll (eng. ball)
A5        Food and beverage       apelsin (eng. orange)
A6        Clothes                 jacka (eng. jacket)
A7        Body parts              mun (eng. mouth)
A8        Small objects/things    blomma (eng. flower)
A9        Furniture and rooms     badkar (eng. bathtub)
A10       Objects outdoors        gata (eng. street)
A11       Places to go            affär (eng. store)
A12       People                  flicka (eng. girl)
A14       Actions                 arbeta (eng. work)
A15       Adjectives              arg (eng. angry)

Table 2. The semantic categories included for estimation of the number of close class words.

Section   Semantic category       Examples of words
A16       Pronouns                de (eng. they)
A17       Time expressions        dag (eng. day)
A18       Prepositions/location   bakom (eng. behind)
A19       Amount and articles     alla (eng. everybody)
A20       Auxiliary verbs         ha (eng. have)
A21       Connectors/questions    och (eng. and)

Procedure

The materials were collected 2004-2007 by members of the Development group, Phonetic laboratory, Stockholm University. The subjects and their parents visited the lab approximately once a month. Each visit started off with an eye-tracking session to explore specific speech perception research questions, and then a video recording (app. 15-20 minutes) of adult-infant interaction was made. Towards the end of the visit, one of the experimenters entered the studio and filled in the questionnaire based on parental information while the parent was playing with the child. Occasionally, if the parent had to leave the lab immediately after the recording session, she/he returned the questionnaire to the lab within about one week (Klintfors, Lacerda & Sundberg, 2007).

Results

The results based on 28 completed forms showed that the child with the smallest vocabulary (4 open class words) had not yet started to use words from the close class. The child who produced the most open class words (564 open class words) had developed her/his use of close class words into 109 close class words.

[Figure 1: number of open class words (light line) and close class words (dark line), shown on the y-axis, plotted for each completed form, listed on the x-axis.]

Figure 1. Total number of open class and close class words produced per completed form.

When a child knows approximately 100 open class words, she/he knows about 10 close class words – in other words, the close class words constitute 10% of the total vocabulary (Figure 1). Further, when a child knows about 300 open class words, she/he knows about 35 close class words – that is, the close class words constitute 12% of the total vocabulary. And finally, when a child knows approximately 600 open class words, she/he knows about 100 close class words, corresponding to 17% of the total vocabulary.

Discussion

The results showed that children’s vocabularies initially contain proportionally more open class words as compared to close class words. Thereafter, the larger the vocabulary size, the bigger the proportion of it that is devoted to close class words. The proportion of open vs. close class words corresponding to total vocabulary sizes of 100, 300, and 600 words was as follows: 90-10%, 88-12%, 83-17%.

Children might pay more attention to open class words since content words are typically stressed and more prominent (e.g. the vowel space of content words is expanded) in CDS (Kuhl et al., 1997; van de Weijer, 1998). Further, the open class words often refer to concrete objects in the physical world and might therefore be learned earlier (Gentner & Boroditsky, 2001). Nor are children expected to use close class words until they have reached a certain grammatical maturity (Håkansson, 1998).

The youngest subjects in the current study were 1.2 years old, and some of the forms completed early on – with vocabularies < 50 words – did not contain any close class words. Shortly thereafter – for vocabularies > 50 words – all the children can be assumed to have reached grammatical maturity. The current study does thus not reveal the exact time point for starting to use close class words. Nevertheless, the age group of the current study ranged between 1.2 years and 3.6 years and likely captured the time point for the onset of the word spurt. The onset of the word spurt has been documented to take place sometime between the end of the first and the end of the third year of life (Bates et al., 1994). Therefore, the proportional increase of close class words being almost twice as large (17%) for a vocabulary size of 300 to 600 words, as

were Pronouns, Time expressions, Prepositions/words for spatial locations, words for Amount and articles, Auxiliary verbs, Connectors and question words. It may thus be speculated that the children in the current study have started to perceive and explore the grammatical status of the close class words.

Acknowledgements

Research supported by the Bank of Sweden Tercentenary Foundation and the European Commission. We thank Ingrid Broomé, Andrea Dahlman, Liz Hultby, Ingrid Rådholm, and Amanda Thorell for data analysis within their B-level term paper in Logopedics.

References

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reilly, J., Hartung, J. (1994). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, 21, 85-123.
Eriksson, M.
and Berglund, E. (1995) Instru- compared to vocabulary size from 100 to 300 ments, scoring manual and percentile levels words (10%) is not surprising. of the Swedish Early Communicative De- One reason for expecting close class words velopment Inventory, SECDI, FoU- to later enter the children’s vocabularies is that nämnden. Högskolan i Gävle. children might have more difficult to under- Gentner, D. and Boroditsky, L. (2001) Individ- stand the abstract meaning of close class words. uation, relativity, and early word learning. But closer inspection of the results shows that In Bowerman, M. and Levinson, S. (eds) children start to use close class words although Language acquisition and conceptual devel- the size of their vocabularies is still relatively opment, 215-256. Cambridge University small. For example, one of the subjects showed Press, UK. at one occasion to have one close class word Håkansson, G. (1998). Språkinlärning hos barn. and five open class words. But a question to be Studentlitteratur. Klintfors, E., Lacerda, F., and Sundberg, U. asked next is how close class words are used in (2007) Estimates of Infants’ vocabulary child language. That is, has the child unders- composition and the role of adult- tood the abstract functions of the words used? It instructions for early word-learning. is reasonable that children use close class words Proceedings of Fonetik 2007, TMH-QPSR to express other functions than the original (Stockholm, Sweden) 50, 69-72. function of the word in question. For example, Kuhl, K. P., Andruski, J. E., Chistovich, I. A, the word upp (eng. up) might not be the unders- Chistovich, L. A., Kozhevnikova, E. V., tood as the abstract content of the preposition Ryskina V. L., Stolyarova, E. I., Sundberg, up, but instead used to refer to the action lyft U. and Lacerda, F. (1997) Cross-language mig upp (eng. lift med up). 
Using the particle of analysis of phonetic units in language ad- a verb and omitting the verb referring to the ac- dressed to infants. Science, 277, 684-686. tion is typical in child language (Håkansson, Strömqvist, S. (1997) Om tidig morfologisk utveckling. In Söderberg, R. (ed) Från joller 1998). Thus, the close class words are often till läsning och skrivning, 61-80. Gleerups. phonotactically less complex (compare upp to Strömqvist, S. (2003) Barns tidiga lyfta) and therefore likely more available for the språkutveckling. In Bjar, L. and Liberg, S. child. But the use of the word per se does not (eds) Barn utvecklar sitt språk, 57-77. indicate that the child has understood the Studentlitteratur. grammatical role of the close class words in the Weijer, J. van de (1998) Language input for language. The close class words used by the 14- word discovery. Ph.D thesis. Max Planck to 43-month-old children in the current study Series in Psycholinguistics 9.


Language-specific speech perception as mismatch negativity in 10-month-olds' ERP data

Iris-Corinna Schwarz1, Malin Forsén2, Linnea Johansson2, Catarina Lång2, Anna Narel2, Tanya Valdés2, and Francisco Lacerda1
1Department of Linguistics, Stockholm University
2Department of Clinical Science, Intervention, and Technology, Speech Pathology Division, Karolinska Institutet, Stockholm

Abstract

Discrimination of native and nonnative speech contrasts, the heart of the concept of language-specific speech perception, is sensitive to developmental change in speech perception during infancy. Using the mismatch negativity paradigm, seven Swedish language environment 10-month-olds were tested on their perception of six different consonantal and tonal Thai speech contrasts, native and nonnative to the infants. Infant brain activation in response to the speech contrasts was measured with event-related potentials (ERPs). They show mismatch negativity at 300 ms, significant for contrast change in the native condition, but not for contrast change in the nonnative condition. Differences in native and nonnative speech discrimination are clearly reflected in the ERPs and confirm earlier findings obtained by behavioural techniques. ERP measurement thus suitably complements infant speech discrimination research.

Introduction

Speech perception bootstraps language acquisition and forms the basis for later language development. During the first six months of life, infants are 'citizens of the world' (Kuhl, 2004) and perform well in both nonnative and native speech discrimination tasks (Burnham, Tyler, & Horlyck, 2002). For example, 6-month-old English and German infants tested on a German, but not English, contrast [dut]-[dyt], and an English, but not German, contrast [dɛt]-[dæt] discriminated both contrasts equally well (Polka & Bohn, 1996). Around the age of six months, a perceptual shift occurs in favour of the native language, earlier for vowels than for consonants (Polka & Werker, 1994). Around that time infants' nonnative speech discrimination performance starts to decline (Werker & Lalonde, 1988), while they continue to build their native language skills (Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). For example, 10- to 12-month-old Canadian English environment infants could neither discriminate the nonnative Hindi contrast [ʈɑ]-[tɑ] nor the nonnative Thompson¹ contrast [ki]-[qi], whereas their 6- to 8-month-old counterparts still could (Werker & Tees, 1984).

This specialisation in the native language holds around six months of age even on a suprasegmental language level: American English language environment infants younger than six months are equally sensitive to all stress patterns of words, and do not only prefer the ones predominantly present in their native language, as infants older than six months do (Jusczyk, Cutler, & Redanz, 1993).

During the first year of life, infants' speech perception changes from language-general to language-specific in several features. Adults are already so specialised in their native language that their ability to discriminate nonnative speech contrasts is greatly diminished and can only partially be retrained (Tees & Werker, 1984; Werker & Tees, 1999, 2002). By contrasting native and nonnative discrimination performance, the degree of language-specificity in speech perception is shown and developmental change can be described (Burnham, 2003). In the presence of experience with the native language, language-specific speech perception refines, whereas in the absence of experience nonnative speech perception declines. This study focuses on 10-month-olds, whose speech perception is language-specific.

A common behavioural paradigm used to test discrimination abilities in infants younger than one year is the conditioned head-turn method (e.g., Polka, Colantonio, & Sundara, 2001). This method requires the infant to be able to sit on the parent's lap and to control head movement. Prior to the experimental test phase, a training phase needs to be incorporated into the experiment to build up the association between perceived changes in contrast presentation and reward display in the infants. The number of trials that it takes the infant to reach criterion during training significantly reduces the possible number of later test trials, since the total test time of 10 min maximum remains invariant in infants.

Can electroencephalography (EEG) measurement provide a physiological correlate to the behavioural discrimination results? The answer is yes. Brain activation waves in response to stimulus presentation are called event-related potentials (ERPs) and often show a stimulus-typical curve (Teplan, 2002). This can be, for example, a negativity response in the ERP, called mismatch negativity (MMN), reflecting stimulus change in a series of auditory signals (Näätänen, 2000). MMN reflects automatic change detection processes on the neural level (Kushnerenko, Ceponiene, Balan, Fellman, & Näätänen, 2002). It is also used in neonate testing, as it is the earliest measurable cognitive ERP component (Näätänen, 2000). The general advantage of using ERPs in infant research lies exactly in the automaticity of these processes, which demands neither attention nor training (Cheour, Leppänen, & Kraus, 2000).

For example, mismatch negativity represents 6-month-olds' discrimination of consonant duration changes in Finnish nonsense words (Leppänen et al., 2002). Similarly, differences in the stress patterns of familiar words are reflected in the ERPs of German and French 4-month-olds (Friederici, Friedrich, & Christophe, 2007). This reveals early language-specific speech perception at least in suprasegmental aspects of language.

How infant language development from language-general to language-specific discrimination of speech contrasts can be mapped onto neural response patterns was demonstrated in a study with 7- and 11-month-old American English environment infants (Rivera-Gaxiola, Silva-Pereyra, & Kuhl, 2005). The infants could be classed into different ERP pattern groups, showing not only negativity at discrimination but also positive differences. Discrimination of Spanish voice-onset time (VOT) differences was present in the 7-month-olds but not in the 11-month-olds (Rivera-Gaxiola et al., 2005).

Hypothesis

If ERPs, and especially mismatch negativity, are confirmed by the current study as physiological correlates of behavioural infant speech discrimination data, 10-month-old Swedish language environment children would discriminate native, but not nonnative, contrast changes, as they should perceive speech in a language-specific manner at this stage of their development.

Method

Participants

Seven 10-month-old infants (four girls and three boys) participated in the study. Their average age was ten months and one week, with an age range of ten to eleven months. The participants' contact details were obtained from the governmental residence address registry. Families with 10-month-old children living in Greater Stockholm were randomly chosen and invited to participate via mail. They expressed their interest in the study by returning a form, on the basis of which the appointment was booked over the phone. All children were growing up in a monolingual Swedish-speaking environment. As a reward for participation, all families received certificates with a photo of the infant wearing the EEG net.

Stimuli

Speech stimuli were, in combination with the vowel /a/, the Thai bilabial stops /b̬/, /b/, and /pʰ/ and the dental/alveolar plosives /d̬/, /d/, and /tʰ/ in mid-level tone (0), as well as the velar plosive [ka] in low (1), high falling (2), and low rising (4) tone. Thai distinguishes three voicing levels. In the example of the bilabial stops, this means that /b̬/ has a VOT of 97 ms, /b/ of 6 ms, and /pʰ/ of 64 ms (Burnham, Francis, & Webster, 1996). Out of these three stimulus sets, contrast pairs were selected that are either contrastive (native) or not contrastive (nonnative) in Swedish. The consonantal contrasts [ba]-[pʰa] and [da]-[tʰa] are contrastive in Thai and in Swedish, whereas the consonantal contrasts [b̬a]-[ba] and [d̬a]-[da] are only contrastive in Thai. Both consonantal contrast sets were mid-tone exemplars, but the third set of contrasts was tonal. It presents the change between high falling and low rising tone in the contrast [ka2]-[ka4] and between low and low rising tone in the contrast [ka1]-[ka4]. Although the two tonal contrasts must be considered nonnative to Swedish infants, non-Thai listeners in general seem to rely on complex acoustic variables when trying to discriminate tone (Burnham et al., 1996), which makes it difficult to predict the discrimination of tonal contrast change.

After recording, all speech stimuli were presented to an expert panel consisting of two Thai native speakers and one trained phonetician in order to select the three best exemplars per stimulus type out of ten (see Table 1 for the variation of utterance duration between the selected exemplars).

Table 1. Utterance duration in ms for all selected exemplars per stimulus type and average duration in ms per stimulus type (Thai tone is demarcated by number).

     b̬a0  ba0  pʰa0  da0  d̬a0  tʰa0  ka1  ka2  ka4
1    606  576  667   632  533  607   613  550  535
2    626  525  646   634  484  528   558  502  538
3    629  534  629   599  508  585   593  425  502
M    620  545  647   622  508  573   588  492  525

Procedure

All infant participants were seated in their parent's lap, facing a TV screen on which silenced short cartoon movie clips played during the experiment to entertain the infants and keep them as motionless as possible. The infants were permitted to eat, breastfeed, and sleep, as well as suck on dummies or other objects, during stimulus exposure.

Within each trial, the first contrast of each pair was repeated two to five times until the second contrast was presented twice after a brief interstimulus interval of 300 ms. Each stimulus type (native consonantal, nonnative consonantal, nonnative tonal) was presented twelve times within a block. Within each block, there were 36 change trials and nine no-change trials. A change trial repeated identical exemplars for the first contrast and then presented the identical exemplar of the second contrast twice. A no-change trial had identical first and second sound exemplars, presented randomly between four and seven times. A completed experiment consisted of three blocks of 45 trials.

Depending on the randomisation of the first contrast between two and five repetitions, the duration of the entire experiment varied between 10 and 13 min. Infant and parent behaviour was monitored through an observer window, and the experiment was aborted in the case of increasing infant fussiness; this happened in one case after 8 min of stimulus exposure.

Equipment

The EEG recordings took place in a radiation-insulated, near-soundproof test chamber at the Phonetics Lab at Stockholm University.

Infant brain activation was measured by EGI Geodesic Hydrocel GSN sensor nets with 124 electrodes in the infant net sizes. These net types permit EEG measurement without requiring gel application, which makes them particularly compatible with infant research; potassium chloride and generic baby shampoo serve as conductive lubricants instead. All electrode impedances were kept below 50 kΩ at measurement onset. All EEG channel data was amplified with an EGI NetAmps 300 amplifier and recorded with a sampling rate of one sample every 4 ms. The program Netstation 4.2.1 was used to record and analyse the ERPs.

The stimuli were presented with KOSS loudspeakers, mounted at a distance of about 100 cm in front of the child. The volume was set to 55 dB at the source. The experiment was programmed and controlled by the E-Prime 1.2 software.

Data treatment

The EEG recordings were filtered with a band-pass filter of 0.3 to 50 Hz and clipped into 1000 ms windows starting at the onset of the second contrast. These windows were then cleaned from all 10 ms segments during which the ERP curve changed faster than 200 µV, to remove measurement artefacts caused by body movement and eye blinks. If more than 80% of the segments of one single electrode were marked as artefacts, the entire data from that electrode was not included in the average.

Results

In accordance with other infant speech perception ERP studies (e.g., Friederici et al., 2007), the international 10-20 electrode system was selected to structure the EEG data. Within this system, the analysis focused on electrode T3, situated at the temporal lobe in the left hemisphere, as MMN in 8-month-old infants has previously been found to be largest in T3 (Pang et al., 1998).

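The segment-level artefact screening and the 80% electrode-exclusion rule described under Data treatment can be sketched in a few lines. This is a minimal illustration, not the Netstation implementation; the segment length in samples and the reading of "changed faster than 200 µV" as a within-segment peak-to-peak swing are assumptions.

```python
# Sketch of the artefact screening described under Data treatment.
# Assumptions (not from the paper): with one sample every 4 ms, a 1000 ms
# window holds 250 samples; the window is screened in consecutive segments.

SEG_LEN = 10          # samples per screening segment (assumption)
THRESHOLD_UV = 200.0  # artefact criterion in microvolts

def screen_electrode(window_uv):
    """Return (bad_segments, keep_electrode) for one ERP window in µV."""
    segments = [window_uv[i:i + SEG_LEN]
                for i in range(0, len(window_uv), SEG_LEN)]
    # A segment is an artefact if its peak-to-peak swing exceeds the threshold.
    bad = [max(s) - min(s) > THRESHOLD_UV for s in segments]
    # Exclude the whole electrode if more than 80% of segments are artefacts.
    keep = sum(bad) / len(bad) <= 0.80
    return bad, keep
```

A flat window keeps the electrode in the average; a window where most segments swing by more than 200 µV drops it.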

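The block structure described under Procedure (36 change trials in which the first contrast repeats two to five times and the second is then presented twice, plus nine no-change trials with four to seven repetitions of one exemplar) can be illustrated with a short sketch. The stimulus labels and the randomisation scheme are placeholders, not the actual E-Prime script.

```python
import random

def make_block(rng, pairs):
    """Generate one block: 36 change trials and 9 no-change trials."""
    trials = []
    for _ in range(36):
        first, second = rng.choice(pairs)
        # First contrast repeated 2-5 times, then the second presented twice.
        trials.append([first] * rng.randint(2, 5) + [second] * 2)
    for _ in range(9):
        # No-change trial: one exemplar repeated 4-7 times.
        exemplar = rng.choice(pairs)[0]
        trials.append([exemplar] * rng.randint(4, 7))
    rng.shuffle(trials)
    return trials

# Placeholder contrast pairs standing in for the three stimulus types.
pairs = [("ba", "pha"), ("b_a", "ba"), ("ka2", "ka4")]
block = make_block(random.Random(0), pairs)
```

Every trial in such a block contains between four and seven stimulus presentations, matching the 10-13 min range of total experiment duration reported above.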
Comparing native and nonnative consonantal contrast change trials, the curve for [ba]-[pʰa] and [da]-[tʰa], which are contrastive both in Thai and in Swedish, shows a dip between 200 and 350 ms (Figure 1). However, the curve for [b̬a]-[ba] and [d̬a]-[da], which are contrastive in Thai but not in Swedish, shows no negativity response. The infants were not able to detect the contrast change in the nonnative consonantal condition, as the flat graph indicates. The two curves differ significantly between 200 and 350 ms (p<.001), demonstrating that the neural responses of 10-month-old Swedish language environment infants discriminate native but not nonnative consonantal contrast changes. These discrimination abilities show in the typical neural mismatch negativity response.

Figure 1. The graph compares the ERPs for the native and nonnative consonantal contrast change trials during 1000 ms after stimulus onset of the second contrast. The ERP for the native condition shows mismatch negativity in µV between 200 and 350 ms, typical for the discrimination of auditory change. The ERP for the nonnative condition, however, only shows a stable continuation of the curve.

This negativity response is mismatch negativity and at the same time a sign that the 10-month-olds discriminated the contrast change. It peaks at 332 ms with 6.3 µV.

Comparing the nonnative tonal contrast change condition to its no-change condition, the previous result of a lack of discrimination for nonnative contrasts in 10-month-olds is replicated (Figure 2). The graph for the nonnative tonal contrast change remains relatively flat and stable throughout the early window typical for mismatch negativity responses, while nonetheless being negative. The no-change curve for the tonal condition, on the other hand, responds with increasing positive neural activation to the repetition of the first contrast, peaking just after 400 ms with 12.8 µV. Neither of the tonal conditions shows mismatch negativity.

Figure 2. The graph shows the ERPs for the nonnative tonal change and no-change trials in µV during 1000 ms after stimulus onset of the second contrast (which is of course identical to the first in the no-change condition). No mismatch negativity can be observed in either condition.

Discussion

This study shows that ERPs, and in particular the concept of mismatch negativity, reliably reflect infant speech discrimination abilities previously demonstrated mostly in behavioural experiments, and confirms ERP data as a physiological correlate to those. With Swedish 10-month-olds, it replicated the findings of Rivera-Gaxiola and colleagues (2005), whose 11-month-old American infants discriminated English, but not Spanish, contrasts. The Swedish infant participants discriminated only Thai contrasts that are also legitimate in Swedish, but not those that are illegitimate in Swedish. This result is in line with our prediction that contrasts that sound native would be discriminated, but not those considered to be nonnative. At ten months, infants' speech perception is language-specific, which results in good discrimination abilities for native speech sounds and a loss of discrimination abilities for nonnative speech sounds. This establishes a neural basis for the well-known findings from, for example, conditioned head-turn studies.

On a side note: in order to be able to truly speak of a decline or a loss of the nonnative speech discrimination abilities in the Swedish 10-month-old infants, further studies with infants younger than six months are necessary; such studies are currently under way to provide the required developmental comparison.

MMN was strongest in the native consonantal change condition in this study. Even though the tonal stimuli are generally very interesting for infants and potentially not processed the same way as speech, the absence of clear mismatch negativity shows that the 10-month-olds' brains did not react in the same way to change in tones as they did to change in consonants in the native condition. Furthermore, repetition of the same tone elicited higher absolute activation than change in a series of tonal speech sounds.

Interestingly, MMN is in this study the only response pattern to a detected change in a series of speech sounds. Rivera-Gaxiola and colleagues (2005) had found subgroups with positive or negative ERP discrimination curves in 11-month-olds, whereas MMN is a strong and stable indicator of discrimination in our participants. And this is the case even with a varying repetition distribution of the first contrast, taking up between 50% and 70% of a trial (corresponding to two to five repetitions), in comparison to the fixed presentation of the second contrast (two times). Leppänen and colleagues (2002) presented the standard stimulus with 80% probability and each of their two deviant stimuli with 10% probability, with a 610 ms interstimulus interval, to observe MMN in 6-month-olds. The MMN effect is therefore quite robust to changes in trial setup, at least in 10-month-olds.

Conclusions

The ERP component mismatch negativity (MMN) is a reliable sign of the detection of change in a series of speech sounds in 10-month-old Swedish language environment infants. For the consonantal contrasts, the infants' neural response shows discrimination for native, but not nonnative, contrasts. Neither do the infants indicate discrimination of the nonnative tonal contrasts. This confirms previous findings (Rivera-Gaxiola et al., 2005) and provides physiological evidence for language-specific speech perception in 10-month-olds.

Acknowledgements

This study is the joint effort of the language development research group at the Phonetics Lab at Stockholm University and was funded by the Swedish Research Council (VR 421-2007-6400), the Knut and Alice Wallenberg Foundation (Grant no. KAW 2005.0115), and the Bank of Sweden Tercentenary Foundation (MILLE, RJ K2003-0867), the contribution of all of which we acknowledge gratefully. We would also like to thank all participating families: without your interest and dedication, our research would not be possible.

Footnotes

1 Thompson is an Interior Salish (Native Indian) language spoken in south central British Columbia. In native terms, it is called Nthlakampx or Inslekepmx. The example contrast differs in place of articulation.

References

Burnham, D. (2003). Language specific speech perception and the onset of reading. Reading and Writing: An Interdisciplinary Journal, 16(6), 573-609.
Burnham, D., Francis, E., & Webster, D. (1996). The development of tone perception: Cross-linguistic aspects and the effect of linguistic context. Paper presented at Pan-Asiatic Linguistics: Fourth International Symposium on Language and Linguistics, Vol. 1: Language and Related Sciences, Institute of Language and Culture for Rural Development, Mahidol University, Salaya, Thailand.
Burnham, D., Tyler, M., & Horlyck, S. (2002). Periods of speech perception development and their vestiges in adulthood. In P. Burmeister, T. Piske & A. Rohde (Eds.), An integrated view of language development: Papers in honor of Henning Wode (pp. 281-300). Trier: Wissenschaftlicher Verlag.
Cheour, M., Leppänen, P. H. T., & Kraus, N. (2000). Mismatch negativity (MMN) as a tool for investigating auditory discrimination and sensory memory in infants and children. Clinical Neurophysiology, 111(1), 4-16.
Friederici, A. D., Friedrich, M., & Christophe, A. (2007). Brain responses in 4-month-old infants are already language-specific. Current Biology, 17(14), 1208-1211.
Jusczyk, P. W., Cutler, A., & Redanz, N. J. (1993). Infants' preference for the predominant stress patterns of English words. Child Development, 64(3), 675-687.
Kuhl, P. K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5(11), 831-843.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044), 606-608.
Kushnerenko, E., Ceponiene, R., Balan, P., Fellman, V., & Näätänen, R. (2002). Maturation of the auditory change detection response in infants: A longitudinal ERP study. NeuroReport, 13(15), 1843-1848.
Leppänen, P. H. T., Richardson, U., Pihko, E., Eklund, K., Guttorm, T. K., Aro, M., et al. (2002). Brain responses to changes in speech sound durations differ between infants with and without familial risk for dyslexia. Developmental Neuropsychology, 22(1), 407-422.
Näätänen, R. (2000). Mismatch negativity (MMN): perspectives for application. International Journal of Psychophysiology, 37(1), 3-10.
Pang, E. W., Edmonds, G. E., Desjardins, R., Khan, S. C., Trainor, L. J., & Taylor, M. J. (1998). Mismatch negativity to speech stimuli in 8-month-olds and adults. International Journal of Psychophysiology, 29(2), 227-236.
Polka, L., & Bohn, O.-S. (1996). A cross-language comparison of vowel perception in English-learning and German-learning infants. Journal of the Acoustical Society of America, 100(1), 577-592.
Polka, L., Colantonio, C., & Sundara, M. (2001). A cross-language comparison of /d/-/th/ perception: Evidence for a new developmental pattern. Journal of the Acoustical Society of America, 109(5 Pt 1), 2190-2201.
Polka, L., & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 421-435.
Rivera-Gaxiola, M., Silva-Pereyra, J., & Kuhl, P. K. (2005). Brain potentials to native and nonnative speech contrasts in 7- and 11-month-old American infants. Developmental Science, 8(2), 162-172.
Tees, R. C., & Werker, J. F. (1984). Perceptual flexibility: Maintenance or recovery of the ability to discriminate nonnative speech sounds. Canadian Journal of Psychology, 38(4), 579-590.
Teplan, M. (2002). Fundamentals of EEG measurement. Measurement Science Review, 2(2), 1-11.
Werker, J. F., & Lalonde, C. E. (1988). Cross-language speech perception: Initial capabilities and developmental change. Developmental Psychology, 24(5), 672-683.
Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7, 49-63.
Werker, J. F., & Tees, R. C. (1999). Influences on infant speech processing: Toward a new synthesis. Annual Review of Psychology, 50, 509-535.
Werker, J. F., & Tees, R. C. (2002). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 25, 121-133.


Development of self-voice recognition in children

Sofia Strömbergsson
Department of Speech, Music and Hearing, School of Computer Science and Communication, KTH, Stockholm

Abstract

The ability to recognize the recorded voice as one's own was explored in two groups of children, one aged 4-5 and the other aged 7-8. The task for the children was to identify which one of four voice samples represented their own voice. The results indicate that 4- to 5-year-old children perform as well as 7- to 8-year-old children when identifying their own recorded voice. Moreover, a time span of 1-2 weeks between recording and identification does not affect the younger children's performance, while the older children perform significantly worse after this time span. Implications for the use of recordings in speech and language therapy are discussed.

Introduction

To many people, the recorded voice often sounds unfamiliar. We are used to hearing our voice through air and bone conduction simultaneously as we speak, and as the recorded speech lacks the bone conduction filtering, its acoustic properties are different from what we are used to (Maurer & Landis, 1990). But even though people recognize that the recorded voice sounds different from the voice as we normally hear it, they most often still recognize the recording as their own voice. In a recent study on brain hemisphere lateralization of self-voice recognition in adult subjects (Rosa et al, 2008), a mean accuracy of 95% showed that adults rarely mistake their own recorded voice for someone else's voice.

Although there have been a few studies on adults' perception of their own recorded voice, children's self-perception of their recorded voices is relatively unexplored. Some studies have been made of children's ability to recognize other familiar and unfamiliar voices. For example, it has been reported that children's ability to recognize previously unfamiliar voices improves with age, and does not approach adult performance levels until the age of 10 (Mann et al, 1979). Studies of children's ability to identify familiar voices have revealed that children as young as three years old perform well above chance, and that this ability also improves with age (Bartholomeus, 1973; Spence et al, 2002). However, the variability among the children is large. These reports suggest that there is a developmental aspect to the ability to recognize or identify recorded voices, and that there might be a difference in how children perform on speaker identification tasks when compared to adults.

Shuster (1998) presented a study where children and adolescents (age 7-14) with deviant speech production of /r/ were recorded when pronouncing words containing /r/. The recordings were then edited so that the /r/ sounded correct. A recording in the listening script prepared for a particular child could thus be either an original recording or a "corrected" recording, spoken either by the child himself/herself or by another speaker. The task for the children was to judge both the correctness of the /r/ and the identity of the speaker. One of the findings in this study was that the children had difficulty identifying the speaker as himself/herself when hearing a "corrected" version of one of their own recordings. The author speculates that the editing process could have introduced or removed something, thereby making the recording less familiar to the speaker. Another confounding factor could be the 1-2 week time span between the recording and the listening task; this could also have made the task more difficult than if the children had heard the "corrected" version directly after the recording. Unfortunately, no studies of how the time span between recording and listening might affect children's performance on speaker identification tasks have been found, and any effects caused by this factor remain unclear.

Of the few studies that have been done to explore children's perception of recorded voices – of their own recorded voice in particular – many were done over twenty years ago. Since then, there has been a considerable increase in the number of recording devices that can potentially be present in children's environments. This strongly motivates renewed and deeper exploration into children's self-perception of their recorded voice, and possible developmental changes in this perceptual ability. If it is found that children indeed recognize their recorded voice as their own, this may have important implications for the use of recordings in speech and language intervention.

Purpose

The purpose of this study is to explore children's ability to recognize recordings of their own voice as their own, and whether this ability varies depending on the age of the child and the time between the recording and the listening. The research questions are:

1. Are children with normal hearing able to recognize their own recorded voice as their own, and identify it when presented together with 3 other child voices?
2. Will this ability be affected by the time span between recording and listening?
3. Will the performance be affected by the age of the child?

as references. None of the reference children were known to the children in the test groups.

Recording/Identification procedure

A computer program was used to present the words in the scripts in random order, and for each word:

1. Play a reference voice (adult) that reads a target word, while displaying a picture that illustrates the word.
2. Record the subject's production of the same word (with the possibility of listening to the recording and re-recording until both child and experimenter are satisfied).
3. Play the subject's production and 3 reference children's productions of the same word, in random order, letting the subject select one of these as his/her own. (See Figure 1.)

It is hypothesized that the older children will perform better than the younger children, and that both age groups will perform better when listening immediately after the recording than when listening 1-2 weeks after the recording.


Participants 45 children with Swedish as their mother tongue, and with no known hearing problems and with no previous history of speech and lan- Figure 1: The listening/identification setup. guage problems or therapy were invited to par- In both test sessions, the children were fitted ticipate. The children were divided into two age with a headset and the experimenter with head- groups, with 27 children aged 4-5 years (rang- phones to supervise the recordings. The chil- ing from 4;3 to 5;11, mean age 5;3) in the dren were instructed to select the character they younger group and 18 children aged 7-8 years believed represented their own voice by point- (ranging from 7;3 to 8;9, mean age 8;0) in the ing at the screen; the actual selection was man- older group. Only children whose parents did aged by the experimenter by mouse clicking. not know of or suspect any hearing or language The children were given two introductory train- problems in the child were invited. All children ing items, to assure understanding of the task. were recruited from pre-schools in Stockholm. In the first test session, the children per- formed both the recording and the voice identi- Material fication task, i.e. step 1-3. For the recordings, A recording script of 24 words was constructed all children were instructed to speak with their (see Appendix). The words in the script all be- normal voice, and utterances were re-recorded gan with /tV/ or /kV/, and all had primary stress until both child and experimenter were satis- on the first syllable. fied. In the second test session, after a period of Three 6-year old children (two girls and one 1-2 weeks, the children performed only the boy, included by the same criteria as the chil- identification task, i.e. step 3. Apart from gen- dren participating in the study) were recorded eral encouragement, the experimenter provided
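For concreteness, the three-step trial procedure described above can be sketched as a simple program loop. This is a minimal illustration only; the function and path names are hypothetical, not taken from the original test software:

```python
import random

def run_trial(word, play, record, choose):
    """One trial of the recording/identification procedure:
    step 1: play an adult reference production (with a picture),
    step 2: record the child's own production of the word,
    step 3: present the child's recording among three reference-child
    recordings in random order and log which one the child picks."""
    play(f"reference/{word}")                              # step 1
    own = record(word)                                     # step 2
    candidates = [own] + [f"ref_child{i}/{word}" for i in (1, 2, 3)]
    random.shuffle(candidates)                             # step 3: 4-alternative forced choice
    picked = choose(candidates)
    return picked == own                                   # correct own-voice identification?
```

With 24 words per session, the per-child score is simply the number of trials for which `run_trial` returns true.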

no feedback regarding the children's performance during the voice identification task. All actions – recording, listening and selecting – were logged by the computer program.

Results
Table 1 displays the mean correct own-voice identification for all 45 children on both test occasions. The standard deviation reveals a large variation within the groups; the performance varies between 4 and 24 in both the first and the second test. However, the average results on both test occasions are well above chance level. A closer look at the individual results reveals that two children perform at chance level (or worse), while 12 children (27% of the children) perform with more than 90% accuracy.

Table 1. Mean correct responses on the first and second test, for both age groups (max score/test = 24).

           First test        Second test
 Younger   18.8 (SD: 5.5)    17.9 (SD: 6.2)
 Older     21.0 (SD: 2.2)    16.6 (SD: 5.5)
 Mean      19.7 (SD: 4.6)    17.3 (SD: 5.9)

No difference was found between the younger and the older children's performance on the first test (t(37.2) = 1.829, p = 0.076) or on the second test (t(43) = 0.716, p = 0.478).

For the older children, a significant difference was found between the children's performance on the first test and their performance on the second test (t(17) = 4.370, p < 0.001), while for the younger children, no significant difference was found between the results from the two tests (t(26) = 1.517, p = 0.141).

Discussion
The high average performance rates confirm that children aged 4-5 years and 7-8 years are indeed able to recognize their recorded voice as their own. However, large variation was found among the children, with a few children performing at chance level (or worse) and more children performing with 90% accuracy or more.

Contrary to the hypothesis, no significant difference was found between the age groups. Thus, no support was found for a development in children's self-voice recognition ability between the ages of 4-8 years.

A significant difference was found between the older children's performance on the first and the second test. The hypothesis that children would perform better when listening immediately after the recording than after a period of 1-2 weeks could thus be confirmed for the older children, but not for the younger children. This might suggest a developmental aspect to what cues children use when identifying their own voice; younger children might attend to more stable characteristics, while older children recognize their own voice by other, more time-sensitive, features. As the age of the older children in this study matches the age of the children in Shuster (1998), the results presented here support the suggested interpretation of Shuster's results: children's difficulty in identifying themselves as the speaker in "corrected" versions of their own recordings could be explained by the time span between recording and identification.

In this study, the children's speech production was only controlled to the extent that the experimenter instructed the children to speak with their normal voice, both when introducing the children to the task and whenever the experimenter judged that the child was somehow "playing" with his/her voice. However, some children tended to be more playful than others, and it is unlikely that all recordings reflect the children's normal speech behavior (whatever that is). Although this might certainly have an impact on the results – the children might recognize their speaking behavior rather than their own voice – this would have been difficult to avoid. Moreover, considering that speech play is often encouraged in clinical settings, one could argue that this is also ecologically valid.

The large variation among the children could be due to differences in attention, concentration or understanding of the task, but may also be explained by a difference in aptitude for the task at hand. A closer inspection of the recordings and results of the three children with the worst results (with a total score of 15 or below) revealed that two of these children actually produced slightly deviant speech (despite their parents' assurance that their children had normal speech and language development). This was also noted by the experimenter at the time of the recordings, judging both from the recordings and the children's spontaneous speech. One of the children (a girl aged 5;8) produced [j] for /r/. Another child (a boy aged 5;4) exhibited the same /r/-deviation, together
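As a check on the chance-level comparison: with 24 items and four alternatives per item, guessing yields an expected score of 6, and a score such as 19/24 is vanishingly unlikely by chance. A small illustrative computation (standard binomial tail; not part of the original analysis):

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability of scoring
    k or more correct by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_items, p_chance = 24, 1 / 4       # 24 words, 4-alternative forced choice
print(n_items * p_chance)            # expected chance score: 6.0
print(binom_tail(19, n_items, p_chance) < 1e-6)   # a 19/24 score is far above chance
```

This is why even the lower second-test group means in Table 1 can be described as well above chance level.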

with a few cluster simplification patterns, such as [tva] for "tavla" (picture). For the third of these children (a boy aged 4;4), and for all of the other children included in the study, no speech production deviations were noted or could be detected in the recordings. This might suggest a correlation between deviant speech production and difficulties in recognizing the recorded voice as one's own. However, a contradictory example was also found that had to be excluded from the study. Dentalisation (i.e. systematic substitution of [t], [d] and [n] for /k/, /g/ and /ng/, respectively) was noted for one girl who could not participate in a second test, and who was therefore excluded from this study. Interestingly, this girl scored 23 of 24 on the first test. These single cases certainly do not present a uniform picture of the relation between deviant speech production and the ability to recognize the recorded voice as one's own, but rather illustrate the need for further investigation of this relation.

The results in this study give support to the use of recordings in a clinical setting, e.g. when promoting awareness in the child of deviations in his/her speech production. An example of an effort in this direction is presented in Shuster (1998), where children were presented with original and "corrected" versions of their own speech production. The great variation between children in their ability to recognize their recorded voice as their own requires further exploration.

Conclusions
The findings in this study indicate that children aged 4-5 and 7-8 years can indeed recognize their own recorded voice as their own; average performance results are well above chance. However, there is a large variability among the children, with a few children performing at chance level or worse, and many children performing with more than 90% accuracy. No significant difference was found between the younger and the older children's performance, suggesting that self-voice perception does not improve between these ages. Furthermore, a time span of 1-2 weeks between recording and identification seems to make the identification task more difficult for the older children, whereas the same time span does not affect the younger children's results. The findings here support the use of recordings in clinical settings.

Acknowledgements
This work was funded by The Swedish Graduate School of Language Technology (GSLT).

References
Bartholomeus, B. (1973) Voice identification by nursery school children, Canadian Journal of Psychology/Revue canadienne de psychologie 27, 464-472.
Mann, V. A., Diamond, R. and Carey, S. (1979) Development of voice recognition: Parallels with face recognition, Journal of Experimental Child Psychology 27, 153-165.
Maurer, D. and Landis, T. (1990) Role of bone conduction in the self-perception of speech, Folia Phoniatrica 42, 226-229.
Rosa, C., Lassonde, M., Pinard, C., Keenan, J. P. and Belin, P. (2008) Investigations of hemispheric specialization of self-voice recognition, Brain and Cognition 68, 204-214.
Shuster, L. I. (1998) The perception of correctly and incorrectly produced /r/, Journal of Speech, Language, and Hearing Research 41, 941-950.
Spence, M. J., Rollins, P. R. and Jerger, S. (2002) Children's Recognition of Cartoon Voices, Journal of Speech, Language, and Hearing Research 45, 214-222.

Appendix

 Orthography   Transcription   In English
 1)  k         /ko/            (the letter k)
 2)  kaka      /kka/           cake
 3)  kam       /kam/           comb
 4)  karta     /ka/            map
 5)  katt      /kat/           cat
 6)  kavel     /kvl/           rolling pin
 7)  ko        /ku/            cow
 8)  kopp      /kp/            cup
 9)  korg      /korj/          basket
 10) kula      /kla/           marble
 11) kulle     /kl/            hill
 12) kung      /k/             king
 13) tåg       /to/            train
 14) tak       /tk/            roof
 15) tant      /tant/          lady
 16) tavla     /tvla/          picture
 17) tidning   /tin/           newspaper
 18) tiger     /tir/           tiger
 19) tomte     /tmt/           Santa Claus
 20) topp      /tp/            top
 21) tub       /tb/            tube
 22) tumme     /tm/            thumb
 23) tunga     /ta/            tongue
 24) tupp      /tp/            rooster


Studies on using the SynFace talking head for the hearing impaired

Samer Al Moubayed1, Jonas Beskow1, Ann-Marie Öster1, Giampiero Salvi1, Björn Granström1, Nic van Son2, Ellen Ormel2, Tobias Herzke3
1 KTH Centre for Speech Technology, Stockholm, Sweden. 2 Viataal, Nijmegen, The Netherlands. 3 HörTech gGmbH, Germany.
[email protected], {beskow, annemarie, giampi}@speech.kth.se, [email protected], [email protected], [email protected]

Abstract
SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing impaired people, where groups of hearing impaired subjects with different impairment levels, from mild to severe and cochlear implants, are tested. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech with stereo babble.

Introduction
There is a growing number of hearing impaired persons in society today. In the ongoing EU-project Hearing at Home (HaH) (Beskow et al., 2008), the goal is to develop the next generation of assistive devices that will allow this group - which predominantly includes the elderly - equal participation in communication and empower them to play a full role in society. The project focuses on the needs of hearing impaired persons in home environments.

For a hearing impaired person, it is often necessary to be able to lip-read as well as hear the person they are talking with in order to communicate successfully. Often, only the audio signal is available, e.g. during telephone conversations or certain TV broadcasts. One of the goals of the HaH project is to study the use of visual lip-reading support by hard of hearing people for home information, home entertainment, automation, and care applications.

The SynFace Lip-Synchronized Talking Agent
SynFace (Beskow et al, 2008) is a supportive technology for hearing impaired persons, which aims to re-create the visible articulation of a speaker in the form of an animated talking head. SynFace employs a specially developed real-time phoneme recognition system, based on a hybrid of recurrent artificial neural networks (ANNs) and Hidden Markov Models (HMMs), that delivers information regarding the speech articulation to a speech animation module that renders the talking face to the computer screen using 3D graphics.

SynFace has previously been trained on four languages: English, Flemish, German and Swedish. The training used the multilingual SpeechDat corpora. To align the corpora, the HTK (Hidden Markov Model ToolKit) based RefRec recogniser (Lindberg et al, 2000) was trained to derive the phonetic transcription of the corpus. Table 1 presents the % correct frame of the recognizers of the four languages SynFace contains.

Table 1. Complexity and % correct frame of the recognizers of different languages in SynFace.

 Language   Connections   % correct frame
 Swedish    541,250       54.2
 English    184,848       53.0
 German     541,430       61.0
 Flemish    186,853       51.0

User Studies
SynFace has previously been evaluated by subjects in many ways in Agelfors et al (1998), Agelfors et al (2006) and Siciliano et al (2003). In the present study, a large scale test of the use of SynFace as an audio-visual support
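The "% correct frame" figure in Table 1 is the share of time-aligned frames whose recognized phoneme label matches the reference transcription. A minimal sketch of the measure (the function name is ours, for illustration; it is not from the SynFace code):

```python
def percent_correct_frame(ref_labels, rec_labels):
    """Frame-level accuracy: the percentage of time frames for which
    the recognized phoneme label equals the reference label at the
    same (aligned) frame position."""
    assert len(ref_labels) == len(rec_labels)
    hits = sum(r == h for r, h in zip(ref_labels, rec_labels))
    return 100.0 * hits / len(ref_labels)

# 8 frames, 2 mismatches -> 75% correct frame
print(percent_correct_frame(list("aaabbbcc"), list("aaabbxcx")))  # -> 75.0
```

Note that this is a stricter measure than phoneme accuracy, since every frame of a misrecognized or misaligned segment counts as an error.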

for hearing impaired people with different hearing loss levels was carried out. The tests investigate how much subjects benefit from the use of SynFace in terms of speech intelligibility, and how difficult it is for a subject to understand speech with the help of SynFace. Following is a detailed description of the methods used in these tests.

Method
SRT, or Speech Reception Threshold, is the speech signal SNR at which the listener is able to understand 50% of the words in the sentences. In this test, the SRT value is measured once with the speech signal alone, without SynFace, with two types of noise, and once with the use of (when looking at) SynFace. If the SRT level has decreased when using SynFace, the subject has benefited from the use of SynFace, since the subject could understand 50% of the words at a higher noise level than when listening to the audio signal alone.

To calculate the SRT level, a recursive procedure described by Hagerman & Kinnefors (1995) is used, where the subject listens to successive sentences of 5 words, and depending on how many words the subject recognizes correctly, the SNR level of the signal is changed so that the subject can only understand 50% of the words.

The SRT value is estimated for each subject in five conditions; a first estimation is used as training, to eliminate any training effect. This was recommended in Hagerman & Kinnefors (1995). Two SRT values are estimated in the condition of speech signal without SynFace, but with two types of noise: stationary noise, and babble noise (containing 6 speakers). The other two estimations are for the same types of noise, but with the use of SynFace, that is, when the subject is looking at the screen with SynFace, and listening in the headphones to a noisy signal.

In the effort scaling test, the ease of using SynFace by hearing impaired persons was targeted. To establish this, the subject has to listen to sentences in the headphones, sometimes when looking at SynFace and sometimes without looking at SynFace, and choose a value on a pseudo-continuous scale, ranging from 1 to 6, telling how difficult it is to listen to the speech signal transmitted through the headphones.

Small Scale SRT Study on Normal Hearing Subjects
A first small scale SRT intelligibility experiment was performed on normal hearing subjects ranging in age between 26 and 40. This experiment was established in order to confirm the improvement in speech intelligibility of the current SynFace using the SRT test.

The tests were carried out using five normal hearing subjects. The stimuli consisted of two SRT measurements, where each measurement used a list of 10 sentences, and stationary noise was added to the speech signal. A training session was performed before the real test to control the learning effect, and two SRT measurements were performed after that, one without looking at SynFace and one looking at SynFace. Figure 1 shows the SRT levels obtained in the different conditions, where each line corresponds to a subject.

It is clear in the figure that all 5 subjects required a lower SNR level when using SynFace compared to the audio-only condition, and the SRT for all of them decreased in the audio+SynFace condition. An ANOVA analysis and successive multiple comparison analysis confirm that there is a significant decrease (improvement) of SRT (p < 0.001) for SynFace over the audio-alone condition.

Figure 1: SRT value for five subjects in the audio-only condition and audio+SynFace condition.

Hearing Impaired Subjects
These tests are performed on five groups of hearing impaired subjects with different hearing impairment levels (mild, moderate, and subjects with cochlear implants). Every group consists of 15 subjects. Table 2 shows information and the location of the user groups.
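The adaptive SRT tracking can be illustrated with a simplified up-down sketch. This is an illustration only: the step rule below is ours, and the exact adaptive rule of Hagerman & Kinnefors (1995) differs in its details:

```python
def track_srt(words_correct_at, snr_start=4.0, n_sentences=10, step_db=2.0):
    """Simplified adaptive SRT tracking: after each 5-word sentence the
    SNR is lowered when more than half of the words were correct and
    raised otherwise, so the track hovers around the 50%-correct point.
    `words_correct_at(snr)` returns 0..5 correct words for one sentence."""
    snr, track = snr_start, []
    for _ in range(n_sentences):
        n_correct = words_correct_at(snr)
        snr += step_db * (2.5 - n_correct) / 2.5   # >2.5 correct -> harder (lower SNR)
        track.append(snr)
    half = len(track) // 2
    return sum(track[half:]) / (len(track) - half)  # SRT: mean of the later, converged part

# deterministic toy listener whose intelligibility collapses below -4 dB SNR:
listener = lambda snr: 5 if snr >= -4 else 0
print(round(track_srt(listener), 1))   # converges near -4 dB
```

The design point is the same as in the test described above: because the SNR is driven by the listener's responses, the track self-adjusts to each subject, and a lower tracked SNR (lower SRT) means better speech reception.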


Table 2. Description of the hearing impaired test subject groups.

                       Swedish    German            Flemish
 # Subjects            15         15+15             15+15
 Hearing impairment    Moderate   Mild + Moderate   Moderate + Cochlear Implants
 Location              KTH        HörTech           Viataal

Preliminary Analysis
Mean results of the SRT measurement tests are presented in Figure 2. The figure shows the level of the SRT value for the different hearing impairment groups (cochlear implants with a noticeably higher level than the other groups), as well as the difference in SRT value with and without using SynFace. The mean values do not show a significant decrease or increase in the SRT level when using SynFace compared with the audio-only conditions. Nevertheless, when looking at the performance of the subjects individually, a high inter-subject variability is clear, which means that certain subjects have benefited from the use of SynFace. Figure 3 shows the sorted delta SRT value per subject for the Swedish moderate hearing impaired subjects and the Dutch cochlear implant subjects in the speech with babble noise condition. In addition to the high variability among subjects, and the high range scaling between the groups with different hearing impairment levels, it is clear that, in the case of babble noise, most of the Swedish moderate hearing impaired subjects show benefit (negative delta SRT).

Regarding the results of the effort scaling, subjects at all locations do not show a significant difference in scaling value between the conditions of speech with and speech without SynFace. But again, the scaling value shows a high inter-subject variability.

Another investigation we carried out was to study the effect of the SRT measurement list length on the SRT value. For the hearing impaired subjects, the SRT measurement used lists of 20 sentences, where every sentence contained 5 words, and one training measurement was done at the beginning to eliminate any training effect. Still, when looking at the average trend of the SRT value over time for each sentence, the SRT value was decreasing; this can be explained as an ongoing training throughout the measurement for each subject. But when looking at the individual SRT value per test, calculated after the 10th and the 20th sentence of each measurement, an observation was that for some of the measurements, the SRT value of the same measurement increased at the 20th sentence compared to the 10th sentence. Figure 4 presents the difference of the SRT value at the 20th sentence and the 10th sentence for 40 SRT measurements, which shows that although most of the measurements had a decreasing SRT value, some of them had an increasing one. This means that the longer measurement is not always better (decreasing the learning effect).

We suspect that this can be a result of the 20-sentence measurements being too long for the hearing impaired subjects, and that they might be getting tired and losing concentration when the measurement is as long as 20 sentences, hence requiring a higher SNR.
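The per-subject benefit measure plotted in Figure 3 can be computed as follows (a trivial sketch; the function name and the example SRT values are ours, for illustration):

```python
def delta_srt(srt_with_synface, srt_without_synface):
    """Delta SRT per subject (with SynFace - without SynFace), in dB SNR.
    A negative delta means the subject reached 50% intelligibility at a
    worse SNR with SynFace, i.e. benefited from it."""
    deltas = [w - wo for w, wo in zip(srt_with_synface, srt_without_synface)]
    return sorted(deltas)   # sorted per subject, as plotted in Figure 3

# three illustrative subjects: two benefit, one does not
deltas = delta_srt([-5.0, -4.0, 1.0], [-3.0, -3.0, 0.0])
print(deltas)                         # -> [-2.0, -1.0, 1.0]
print(sum(d < 0 for d in deltas))     # -> 2 subjects with benefit
```

Sorting the deltas makes the cross-subject variability visible at a glance, which is exactly what group means hide.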

[Figure 2 here: bar chart of SRT values from the sentence test; conditions: ICRA (stationary) noise and babble noise, each with and without SynFace; groups: Sweden KTH (moderate HI), Germany HörTech (mild HI), Germany HörTech (moderate HI), Netherlands Viataal (moderate HI), Netherlands Viataal (CI).]

Figure 2. Mean SRT value for each of the subject groups with and without the use of SynFace and with two types of noise: stationary and babble.

Figure 3. The delta SRT value (with SynFace - without SynFace) per subject with babble noise. Left: the Swedish moderate hearing impaired group. Right: the Dutch cochlear implant subjects.

Figure 4. The delta SRT at the 20th item in the list and the 10th for 40 SRT measurements.

Discussion
Overall, the preliminary analysis of the results of both the SRT test and the effort scaling showed limited beneficial effects for SynFace. However, the Swedish participants showed an overall beneficial effect for the use of SynFace in the SRT test when listening to speech with babble noise.

Another possible approach when examining the benefit of using SynFace may be looking at individual results as opposed to group means. The data shows that some people benefit from the exposure to SynFace. In the ongoing analysis of the tests, we will try to see if there are correlations in the results for different tests per subject, and hence to study if there are certain features which characterize subjects who show consistent benefit from SynFace throughout all the tests.

Conclusions
The paper reports on the methods used for the large scale hearing impaired tests with the SynFace lip-synchronized talking head. Preliminary analysis was performed on the results from the user studies with hearing impaired subjects carried out at three sites. Although SynFace showed a consistent advantage for the normal hearing subjects, it did not show a consistent advantage with the hearing impaired subjects, but there were SynFace benefits for some of the subjects in all the tests, especially for the speech-in-babble-noise condition.

Acknowledgements
This work has been carried out under the Hearing at Home (HaH) project. HaH is funded by the EU (IST-045089). We would like to thank other project members at KTH, Sweden; HörTech, OFFIS, and ProSyst, Germany; VIATAAL, the Netherlands, and Telefonica I&D, Spain.

References
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Synthetic faces as a lipreading support. In Proceedings of ICSLP'98.
Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., & Thomas, N. (2006). User Evaluation of the SYNFACE Talking Head Telephone. Lecture Notes in Computer Science, 4061, 579-586.
Beskow, J., Granström, B., Nordqvist, P., Al Moubayed, S., Salvi, G., Herzke, T., & Schulz, A. (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Proceedings of Interspeech. Brisbane, Australia.
Hagerman, B., & Kinnefors, C. (1995). Efficient adaptive methods for measuring speech reception threshold in quiet and in noise. Scand Audiol, 24, 71-77.
Lindberg, B., Johansen, F. T., Warakagoda, N., Lehtinen, G., Kai, Z., Gank, A., Elenius, K., & Salvi, G. (2000). A noise robust multilingual reference recogniser based on SpeechDat(II). In Proc of ICSLP 2000 (pp. 370-373). Beijing.
Siciliano, C., Faulkner, A., & Williams, G. (2003). Lipreadability of a synthetic talking face in normal hearing and hearing-impaired listeners. In AVSP 2003 – International Conference on Audio-Visual Speech Processing.


On extending VTLN to phoneme-specific warping in automatic speech recognition Daniel Elenius and Mats Blomberg Department of Speech, Music and Hearing, KTH, Stockholm

Abstract
Phoneme- and formant-specific warping has been shown to decrease formant and cepstral mismatch. These findings have not yet been fully implemented in speech recognition. This paper discusses a few reasons why this may be. A small experimental study is also included, in which phoneme-independent warping is extended towards phoneme-specific warping. The results of this investigation did not show a significant decrease in error rate during recognition. This is in line with earlier experiments on the methods discussed in the paper.

Introduction
In ASR (automatic speech recognition), mismatch between training and test conditions degrades performance. Therefore much effort has been invested in reducing this mismatch using normalization of the input speech and adaptation of the acoustic models towards the current test condition.

Phoneme-specific frequency scaling of a speech spectrum between speaker groups has been shown to reduce formant distance (Fant, 1975) and cepstral distance (Potamianos and Narayanan, 2003). Frequency scaling has also been performed as a part of vocal tract length normalization (VTLN) to reduce spectral mismatch caused by speakers having different vocal tract lengths (Lee and Rose, 1996). However, in contrast to the findings above, this scaling is normally made without regard to sound class. How come phoneme-specific frequency scaling in VTLN has not yet been fully implemented in ASR systems?

Formant frequency mismatch was reduced by about one-half when formant- and vowel-category-specific warping was applied compared to uniform scaling (Fant, 1975). Phoneme-specific warping without formant-specific scaling has also been beneficial in terms of reducing cepstral distance (Potamianos and Narayanan, 2003). In that study it was also found that warp factors differed more between phonemes for younger children than for older ones. The authors did not implement automatic selection of warp factors to be used during recognition. One reason presented was that the gain in practice could be limited by the need of correctly estimating a large number of warp factors. Phone clustering was suggested as a method to limit the number of warping factors that need to be estimated.

One method used in ASR is VTLN, which performs frequency warping during analysis of an utterance to reduce spectral mismatch caused by speakers having different vocal tract lengths (Lee and Rose, 1996). They steered the degree of warping by a time-independent warping factor which optimized the likelihood of the utterance given an acoustic model, using the maximum likelihood criterion. The method has been used frequently in recognition experiments with both adults and children (Welling, Kanthak and Ney, 1999; Narayanan and Potamianos, 2002; Elenius and Blomberg, 2005; Giuliani, Gerosa and Brugnara, 2006). A limitation of this approach is that time-invariant warping results in all phonemes, as well as non-speech segments, sharing a common warping factor.

In recent years increased interest has been directed towards time-varying VTLN (Miguel et al., 2005; Maragakis et al., 2008). The former method estimates a frame-specific warping factor during a memory-less Viterbi decoding process, while the latter uses a two-pass strategy where warping factors are estimated based on an initial grouping of speech frames. The former method focuses on revising the hypothesis of what was said during warp estimation, while the latter focuses on sharing the same warp factor within each given group. Phoneme-specific warping can be implemented to some degree with either of these methods: either explicitly, by forming phoneme-specific groups, or implicitly, by estimating frame-specific warp factors.

However, none of the methods above presents a complete solution for phoneme-specific warping. One reason is that more than one instantiation of a phoneme can occur far apart in time. This introduces a long-distance dependency due to a shared warping factor. For the frame-based method using a memory-less Viterbi process this is not naturally accounted for.

A second reason is that in an unsupervised two-pass strategy, initial mismatch causes recognition errors which limit the performance. Ultimately, initial errors in assigning frames to group identities will bias the final recognition phase towards the erroneous identities assigned in the first pass.

The objective of this paper is to assess the impact of phoneme-specific warping on an ASR system. First a discussion is held regarding issues with phoneme-specific warping. Then an experiment is set up to measure the accuracy of a system performing phoneme-specific VTLN. The results are then presented on a connected-digit task where the recognizer was trained on adults and evaluated on children's speech.

Phoneme-specific VTLN
This section describes some of the challenges in phoneme-specific vocal tract length normalization.

Selection of frequency warping function
In Fant (1975) a case was made for vowel-category- and formant-specific scaling, in contrast to uniform scaling. This requires formant tracking and subsequent calculation of formant-specific scaling factors, which is possible during manual analysis. Following an identical approach under unsupervised ASR would include automatic formant tracking, which is a non-trivial problem without a final solution (Vargas et al., 2008).

Lee and Rose (1996) avoided explicit warping of formants by applying a common frequency warping function to all formants. Since the function is equal for all formants, no formant-frequency estimation is needed when applying this method. The warping function can be linear, piece-wise linear or non-linear. Uniform frequency scaling of the frequency interval of the formants is possible using a linear or piece-wise linear function. This could also be extended to a rough formant scaling, using a non-linear function, under the simplified assumption that the formant regions do not overlap.

This paper is focused on uniform scaling of all formants. For this aim a piece-wise linear warping function is used, where the amount of warping is steered by a warping factor.

Warp factor estimation
Given a specific form of the frequency warping to be performed, the question still remains of the degree of warping. In Lee and Rose (1996) this was steered by a common warping factor for all sound classes. The amount of warping was determined by selecting the warping factor that maximized the likelihood of the warped utterance given an acoustic model. In the general case this maximization lacks a simple closed form, and therefore the search involves an exhaustive search over a set of warping factors.

An alternative to warping the utterance is to transform the model parameters of the acoustic model towards the utterance, thereby generating a warp-specific model. In this case, warp factor selection amounts to selecting the model that best fits the data, which is a standard classification problem. So, given a set of warp-specific models, one can select the model that results in the maximum likelihood of the utterance.

Phoneme-specific warp estimation
Let us consider extending the method above to a phoneme-specific case. Instead of a scalar warping factor, a vector of warping factors can be estimated, with one factor per phoneme. The task is now to find the parameter vector that maximizes the likelihood of the utterance given the warped models. In theory this results in an exhaustive search over all combinations of warping factors. For 20 phonemes with 10 warp candidates, this amounts to 10^20 likelihood calculations. This is not practically feasible, and thereby an approximate method is needed.

In Miguel et al. (2005) a two-pass strategy was used. During the first pass a preliminary segmentation is made. This is then held constant during warp estimation, to allow separate warp estimates to be made for each phoneme. Both a regular recognition phase and K-means grouping have been used in their region-based extension to VTLN.

The group-based warping method above relies on a two-pass strategy where a preliminary classification is used during warp factor estimation, which is then applied in a final recognition phase. Initial recognition errors can ultimately cause a warp to be selected that maximizes the likelihood of an erroneous identity. Application of this warping factor will then

bias the final recognition towards the erroneous identities. The severity of this hazard depends on the number of categories used and the kind of confusions made.

An alternative to a two-pass approach is to successively revise the hypothesis of what has been said as different warping factors are evaluated. Following this line of thought leads to warp factor estimation in parallel with determination of what was said. For a speech recognizer using Viterbi decoding this can be implemented by adding a warp dimension to the phoneme-time trellis (Miguel et al., 2005). This leads to a frame-specific warping factor. Unconstrained, this would lead to a large amount of computation, so a constraint on the time derivative of the warp factor was used to limit the search space.

A slowly varying warping factor might not be realistic even though individual articulators move slowly. One reason is that, given multiple sources of sound, a switch between them can cause an abrupt change in the warping factor. Such a switch can for instance occur between speakers, to or from non-speech frames, or at a change in place and manner of articulation. The change could be produced by a small movement which causes a substantial change in the air-flow path. To some extent this could perhaps be taken into account using parallel warp candidates in the beam search used during recognition.

In this paper, model-based warping is performed. For each warp setting, the likelihood of the utterance given the set of warped models is calculated using the Viterbi algorithm. The warp set is chosen that results in the maximum likelihood of the utterance given the warped models. In contrast to the frame-based method, long-distance dependencies are taken into account. This is handled by warping the phoneme models used to recognize what was said, whereby each instantiation of a model during recognition is forced to share the same warping factor. This was not the case in the frame-based method, which used a memory-less Viterbi decoding scheme for warp factor selection.

Separate recognitions for each combination of warping factors were used to avoid relying on an initial recognition phase, as was done in the region-based method. To cope with the huge search space, two approaches were taken in the current study: reducing the number of individual warp factors by clustering phonemes together, and supervised adaptation to a target group.

Experimental study

Phoneme-specific warping has been explored in terms of WER (word error rate) in an experimental study. The investigation was made on a connected-digit string task. For this aim a recognition system was trained on adult speakers. This system was then adapted towards children by performing VTLT (vocal tract length transformation). A comparison between phoneme-independent and phoneme-specific adaptation through warping the models of the recognizer was conducted. Unsupervised warping during test was also conducted, using two groups of phonemes with separate warping factors. The groups were formed by separating silence, /t/ and /k/ from the rest of the phonemes.

Speech material

The corpora used for training and evaluation contain prompted digit strings recorded one at a time. Recordings were made using directional microphones close to the mouth. The experiments were performed for Swedish using two different corpora, SpeeCon and PF-STAR, for adults and children respectively.

PF-STAR consists of children's speech in multiple languages (Batliner et al., 2005). The Swedish part consists of 198 children of 4 to 8 years repeating oral prompts spoken by an adult speaker. In this study only connected-digit strings were used, in order to concentrate on acoustic modeling rather than language models. Each child was orally prompted to speak 10 three-digit strings, amounting to 30 digits per speaker. Recordings were performed in a separate room at daycare and after-school centers. Sound was picked up by a head-set mounted cardioid microphone, a Sennheiser ME 104, and the signal was digitized at 24 bits @ 32 kHz using an external USB-based A/D converter. In the current study the recordings were down-sampled to 16 bits @ 16 kHz to match SpeeCon.

SpeeCon consists of both adults and children down to 8 years (Großkopf et al., 2002). In this study, only digit-string recordings were used. The subjects were prompted using text on a computer screen in an office environment. Recordings were made using the same kind of microphone as in PF-STAR. An analog high-pass filter with a cut-off frequency of 80 Hz was used, and digital conversion was performed

using 16 bits at 16 kHz. Two sets, for training and evaluation respectively, were formed, consisting of 60 speakers each to match PF-STAR.

Recognition system

The adaptation scheme was implemented in a phone-level HMM (Hidden Markov Model) system for connected digit-string recognition. Each string was assumed to be framed by silence (/sil/) and to consist of an arbitrary number of digit words. These were modeled as concatenations of three-state phone models, ended by an optional short-pause model. The short-pause model consisted of one state, which shared its pdf (probability density function) with the centre state of the silence model.

The distribution of speech features in each state was modeled using GMMs (Gaussian Mixture Models) with 16 mixtures and diagonal covariance matrices. The feature vector consisted of 13 * 3 elements, corresponding to static parameters and their first and second order time derivatives. The static coefficients consisted of the normalized log energy of the signal and MFCCs (Mel Frequency Cepstrum Coefficients). These were extracted using a cosine transform of a mel-scaled filter bank consisting of 38 channels covering the interval 0 to 7.6 kHz.

Training and recognition experiments were conducted using the HTK speech recognition software package (Young et al., 2005). Phoneme-specific adaptation of the acoustic models and the warp factor search were performed by separate programs. The adaptation was performed by applying the same piece-wise linear VTLT in the model space as was used in the feature space by Pitz and Ney (2005).

Results

The WER (word error rate) of recognition experiments with unsupervised adaptation to the test utterance is shown in Table 1. The baseline experiment using phoneme-independent warping resulted in a WER of 13.2%. Introducing two groups ({/sil/, /t/, /k/} and {the rest of the models}) with separate warping factors lowered the error rate to 12.9%. This required an exhaustive search over all combinations of the two warping factors. Under the assumption that the warping factors could instead be estimated separately, the performance increase was reduced by 0.2% absolute. Further division by forming a third group of unvoiced fricatives {/s/, /S/, /f/ and /v/} was also attempted, but with no improvement in recognition over the above. In this case /v/ in "två" is mainly unvoiced.

Table 1. Recognition results with model-group-specific warping factors and unsupervised likelihood maximization for each test utterance. The groups were formed by separating /sil/, /t/ and /k/ from the rest of the models.

Method                            WER (%)
VTLN, 1 warping factor            13.2
Speech                            13.4
2 groups (separate estimation)    13.1
2 groups (joint maximization)     12.9

Phoneme-specific adaptation of an adult recognizer to children resulted in the warping factors given in Figure 1. The method gave silence a warping factor of 1.0, which is reasonable. In general, voiced phonemes were more strongly warped than unvoiced ones.

Figure 1. Phoneme-specific warp factors adapting adult models to children, sorted in increasing warp factor.

Further division of the adaptation data into age groups resulted in the age- and phoneme-specific warping factors shown in Figure 2. In general, the least warping of adult models was needed for 8-year-old children compared to younger children.

Figure 2. Phoneme- and age-specific warping factors, optimized on the likelihood of the adaptation data. The phonemes are sorted in increasing warp factor for 6-year-old speakers.
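The joint maximization over group-specific warping factors reported in Table 1 amounts to one complete recognition per combination of warp candidates. A minimal sketch of that search loop follows; the function and parameter names are ours, and `score_fn` merely stands in for a full Viterbi recognition pass returning the log-likelihood of the utterance given the models warped according to the assignment.

```python
from itertools import product

def grid_search_group_warps(groups, candidates, score_fn):
    """Exhaustive joint maximization over one warp factor per group.

    `groups` is a list of phoneme-group labels; `candidates` the warp
    factors tried for each group; `score_fn(assignment)` stands in for
    one full Viterbi pass and returns the utterance log-likelihood
    given models warped according to `assignment` (group -> factor).
    """
    best, best_score = None, float("-inf")
    for combo in product(candidates, repeat=len(groups)):
        assignment = dict(zip(groups, combo))
        score = score_fn(assignment)
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score
```

With two groups and N candidate factors this costs N^2 recognitions; with one factor per each of 20 phonemes the cost would grow to N^20, which is why grouping (or the separate-estimation approximation in Table 1) is needed.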


The warping factors found for the child and age groups were then applied to the test data to measure the effect on the WER. The result of this experiment is given in Table 2. Introducing phoneme-specific warping did not substantially reduce the number of errors compared to a shared warping factor for all phonemes.

Table 2. Recognition results with adult models adapted to children using a fixed warping vector, with one warp factor per phoneme, for all utterances. Phoneme-dependent and -independent warping is denoted Pd and Pi respectively.

Method            WER (%)
Fix Pi            13.7
Fix Pd            13.2
Fix Pd per age    13.2

Discussion

Time-invariant VTLN has in recent years been extended towards phoneme-specific warping. The increase in recognition accuracy in experimental studies has, however, not yet reflected the large reduction in mismatch shown by Fant (1975).

One reason for the discrepancy can be that unconstrained warping of different phonemes can cause unrealistic transformations of the phoneme space. For instance, the low left and upper right regions could swap places if a high and a low warping factor, respectively, were chosen.

Conclusion

In theory, phoneme-specific warping has a large potential for improving ASR accuracy. This potential has not yet been turned into significantly increased accuracy in speech recognition experiments. One difficulty to manage is the large search space resulting from estimating a large number of parameters. Further research is still needed to explore the remaining approaches to incorporating phoneme-dependent warping into ASR.

Acknowledgements

The authors wish to thank the Swedish Research Council for funding the research presented in this paper.

References

Batliner, A., Blomberg, M., D'Arcy, S., Elenius, D. and Giuliani, D. (2005) The PF_STAR Children's Speech Corpus. Interspeech 2005, 2761-2764.
Elenius, D. and Blomberg, M. (2005) Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year Old Children. Proc. Interspeech 2005, 2749-2752.
Fant, G. (1975) Non-uniform vowel normalization. STL-QPSR (Quarterly Progress and Status Report), Department for Speech, Music and Hearing, Stockholm, Sweden.
Giuliani, D., Gerosa, M. and Brugnara, F. (2006) Improved Automatic Speech Recognition Through Speaker Normalization. Computer Speech & Language 20 (1), 107-123.
Großkopf, B., Marasek, K., van den Heuvel, H., Diehl, F. and Kiessling, A. (2002) SpeeCon - speech data for consumer devices: Database specification and validation. Second International Conference on Language Resources and Evaluation 2002.
Lee, L. and Rose, R. (1996) Speaker Normalization Using Efficient Frequency Warping Procedures. Proc. Int. Conf. on Acoustics, Speech and Signal Processing 1996, Vol. 1, 353-356.
Maragakis, M. G. and Potamianos, A. (2008) Region-Based Vocal Tract Length Normalization for ASR. Interspeech 2008, 1365-1368.
Miguel, A., Lleida, E., Rose, R. C., Buera, L. and Ortega, A. (2005) Augmented state space acoustic decoding for modeling local variability in speech. Proc. Int. Conf. Spoken Language Processing, Sep. 2005.
Narayanan, S. and Potamianos, A. (2002) Creating Conversational Interfaces for Children. IEEE Transactions on Speech and Audio Processing 10 (2).
Pitz, M. and Ney, H. (2005) Vocal Tract Normalization Equals Linear Transformation in Cepstral Space. IEEE Trans. on Speech and Audio Processing 13 (5), 930-944.
Potamianos, A. and Narayanan, S. (2003) Robust Recognition of Children's Speech. IEEE Transactions on Speech and Audio Processing 11 (6), 603-616.
Welling, L., Kanthak, S. and Ney, H. (1999) Improved Methods for Vocal Tract Normalization. ICASSP 99, Vol. 2, 161-164.


Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P. (2005) The HTK Book. Cambridge University Engineering Department.
Vargas, J. and McLaughlin, S. (2008) Cascade Prediction Filters With Adaptive Zeros to Track the Time-Varying Resonances of the Vocal Tract. IEEE Transactions on Audio, Speech, and Language Processing 16 (1), 1-7.
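The two-segment piece-wise linear warping used for VTLN/VTLT in the paper above can be sketched as a warping of the frequency axis. The break point at 7/8 of the Nyquist frequency is our assumption (a common choice in VTLN implementations); the paper does not state the exact break frequency. Applying the warp in the model space, as done in the experiments, follows Pitz and Ney (2005) and is not shown.

```python
def piecewise_linear_warp(f, alpha, f_nyquist=8000.0, f_break_ratio=0.875):
    """Two-segment piece-wise linear frequency warping (sketch).

    Below the break point the scaling is uniform (f' = alpha * f);
    above it, a second linear segment maps the remaining band so that
    the Nyquist frequency is preserved. The break point at 7/8 of
    Nyquist is an assumed value, not one stated in the paper.
    """
    f_break = f_break_ratio * f_nyquist
    if f <= f_break:
        return alpha * f
    # second segment: connects (f_break, alpha * f_break)
    # to (f_nyquist, f_nyquist)
    slope = (f_nyquist - alpha * f_break) / (f_nyquist - f_break)
    return alpha * f_break + slope * (f - f_break)
```

For a warp factor above 1, the lower segment scales frequencies uniformly upwards, while the upper segment compresses the remaining band so that the Nyquist frequency maps onto itself.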


Visual discrimination between Swedish and Finnish among L2-learners of Swedish

Niklas Öhrström, Frida Bulukin Wilén, Anna Eklöf and Joakim Gustafsson
Department of Linguistics, Stockholm University

Abstract

A series of speech reading experiments was carried out to examine the ability to discriminate visually between Swedish and Finnish among L2 learners of Swedish with Spanish as their mother tongue. This group was compared with native speakers of Swedish and with a group with no knowledge of Swedish or Finnish. The results showed tendencies for familiarity with Swedish to increase the ability to discriminate between Swedish and Finnish.

Introduction

Audition is the main modality for speech decoding. Nevertheless, visual information about the speech gestures while listening provides complementary cues to speech perception. This use of visual information plays a significant role especially under noisy conditions (Sumby and Pollack, 1954; Erber, 1969). However, McGurk and MacDonald (1976) showed that the visual signal is incorporated in the auditory speech percept even at favorable S/N ratios. They used dubbed tapes with a face pronouncing the syllables [gaga] and [baba]. When listeners saw the face articulating [gaga] while the audio track was changed to [baba], the majority reported having heard [dada]. Later, Traunmüller and Öhrström (2007) demonstrated that this phenomenon also holds for vowels, where the auditory percept is influenced by strong visual cues such as lip rounding. These findings are clear evidence that speech perception in face-to-face communication is a bimodal rather than a unimodal process.

In case of no available acoustic speech signal, the listener must rely fully on visual speech cues, i.e. speech reading. Visual information alone is in most cases not sufficient for speech processing, since many speech sounds fall into the same visually discriminable category. Homorganic speech sounds are difficult to distinguish, while labial features, such as degree of lip rounding or lip closure, are easily distinguishable (Amcoff, 1970). It has been shown that performance in speech reading varies greatly across perceivers (Kricos, 1996). Generally, females perform better than males (Johnson et al., 1988).

The importance of the visual signal in speech perception has recently been stressed in two articles. Soto-Faraco et al. (2007) carried out a study in which subjects were presented silent clips of a bilingual speaker uttering sentences in either Spanish or Catalan. After the first clip, another one was presented, and the subjects' task was to decide whether the language had been switched from one clip to the other. The subjects either had Spanish or Catalan as their first language, or belonged to a group of people from Italy and England with no knowledge of Spanish or Catalan. In the first group, bilinguals performed best, but people with either Spanish or Catalan as their mother tongue also performed better than chance level. The second group did not perform better than chance level, ruling out the possibility that the performance of the first group was due to paralinguistic or extralinguistic signals; their performance was based on linguistic knowledge of one of the presented languages. Later, Weikum et al. (2007) carried out a similar study in which the speaker switched between English and French. Here the subjects were 4-, 6- and 8-month-old infants acquiring English. According to their results, the 4- and 6-month-olds performed well, while the 8-month-olds performed worse. One interpretation is that the 4- and 6-month-olds discriminate on the basis of psycho-optic differences, while the 8-month-olds are about to lose this ability as a result of acquiring the visually discriminable categories of English. These are important findings, since they highlight the visual categories as part of the linguistic competence. The two studies are not fully comparable, since they deal with different languages, which do not necessarily differ in the same way. It cannot be excluded that French and English are so dissimilar visually that it might be possible to discriminate between them on the basis of psycho-optic differences only. It does, however, suggest that we might relearn to discriminate the L1 from an unknown language.


This study deals with L2 learning. Can we learn an L2 to the extent that it becomes visually discriminable from an unknown language? Can we establish new visual language-specific categories as adults? To answer these questions, the visual discriminability between Finnish and Swedish was examined among (i) people with Swedish as their L1, (ii) people with Swedish as their L2 (immigrants from Latin America) and (iii) people with no knowledge of either Swedish or Finnish (Spanish citizens). The Swedish and Finnish speech sound inventories differ in many ways. Both languages use front and back rounded vowels, but Finnish lacks the difference between in-rounded and out-rounded vowels (Volotinen, 2008). This feature is easy to perceive visually, since in-rounding involves teeth hidden behind the lips, while out-rounding does not. The Finnish vowel system recruits only three degrees of openness, while Swedish makes use of four. Temporal aspects may be difficult to perceive visually (Wada et al., 2003). Swedish is often referred to as a stress-timed language (Engstrand, 2004): the distances between stressed syllables are kept more or less constant. In addition, Swedish makes use of long and short vowels in stressed syllables, with complementary consonant length striving to keep stressed syllables at constant length. Finnish is often referred to as a quantity language, in which vowels as well as consonants can be long or short regardless of stress; unlike in Swedish, a long vowel can be followed by a long consonant. Swedish abounds in quite complex consonant clusters, while Finnish is more restrained in that respect.

Method

Speech material

The study is almost a replica of that of Soto-Faraco et al. (2007). One bilingual Finnish-Swedish male was chosen as speaker. His Swedish pronunciation was judged by the authors to be at L1 level; his Finnish pronunciation was judged (by two Finnish students) to be almost at L1 level. The speaker was videotaped while pronouncing four Swedish sentences: (i) Fisken är huvudföda för trädlevande djur, (ii) Vid denna sorts trottoarer brukar det vara pölfritt, (iii) Denna är ej upplåten för motorfordon utan är normalt enbart avsedd för rullstolar, (iv) En motorväg är en sån väg som består av två körbanor med i normalfallet två körfält; and four Finnish sentences: (i) Teiden luokitteluperusteet vaihtelevat maittain, (ii) Jänisten sukupuolet ovat samannäköisiä, (iii) Yleensä tiet ovat myös numeroitu, ja usein tieluokka voidaan päätellä numerosta, (iv) Kalojen tukirangan huomattavin osa muodostuu selkärangasta ja kallosta.

Subjects

Three groups were examined. Group 1 consisted of 22 (12 female and 10 male) L2 speakers of Swedish, aged 23-63 years (mean = 37.5 years). They were all Spanish-speaking immigrants from Latin America. Group 2 consisted of 12 (6 male and 6 female) L1 speakers of Swedish, aged 18-53 years (mean = 38.8 years). Group 3 consisted of 10 (4 female and 6 male) L1 speakers of Spanish, aged 24-47 years (mean = 37.7 years). They were all residents of San Sebastián (Spain), with Spanish as their L1 and no knowledge of Swedish or Finnish.

Procedure

Each group was presented 16 sentence pairs in quasi-randomized order. The subjects' task was to judge whether or not the second sentence of each pair was in the same language as the first. In group 1, information was collected about the subjects' education in Swedish (i.e. number of semesters at the School of SFI) and their age on arrival in Sweden. The subjects in group 1 were also asked to estimate their use of Swedish as compared with their use of Spanish on a four-degree scale.

Results

Group 1 (L2 speakers of Swedish)

Group 1 achieved a result of, on average, 10.59 correct answers out of 16 possible (sd = 2.17). A one-sample t-test revealed that their performance was significantly above chance level (p < 0.001, t21 = 3.819). Performance was positively correlated with use of Swedish as compared to Spanish (Pearson ρ = 0.353, p = 0.107, N = 22), but this correlation did not reach significance. Performance was negatively correlated with age on arrival in Sweden (ρ = -0.328, p = 0.136) and weakly negatively correlated with years spent in Sweden (ρ = -0.217, p = 0.333); these correlations also failed to attain significance.
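The group comparisons above rest on one-sample t-tests of each group's mean score against chance level, which for a same/different task over 16 pairs corresponds to 8 correct answers. A minimal sketch of that computation follows; the scores in the example are invented for illustration and are not the study's data.

```python
import math

def one_sample_t(scores, mu0=8.0):
    """One-sample t-test of the mean of `scores` against the chance
    level mu0 (8 of 16 for a same/different task over 16 pairs).
    Returns the t statistic and the degrees of freedom."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)  # sample variance
    t = (mean - mu0) / math.sqrt(var / n)
    return t, n - 1

# illustrative data (not from the study): five subjects' scores out of 16
t, df = one_sample_t([10, 11, 9, 12, 10])
```

The resulting t value is then compared against the Student t distribution with n - 1 degrees of freedom to obtain the p-values reported above.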


Group 2 (L1 speakers of Swedish)

Group 2 achieved a result of, on average, 11.25 correct answers out of 16 possible (sd = 1.35). A one-sample t-test revealed that their performance was significantly above chance level (p < 0.001, t11 = 4.437).

Group 3 (no knowledge of Swedish)

Group 3 achieved a result of, on average, 9.50 correct answers out of 16 possible (sd = 1.96). A one-sample t-test revealed that their performance was significantly above chance level (p < 0.05, t9 = 2.262).

Summary of results

An ANOVA was performed to reveal any significant differences between the groups. None of the differences reached significance, but the difference between groups 2 and 3 approached significance (p = 0.101).

Discussion

There was a negative correlation between subjects' age of arrival and performance. Studies have shown that learning L2 pronunciation is more difficult for older learners than for younger ones (Ellis, 1994). It is likely that age is a factor that impedes visual perception of an L2 as well.

Subjects in group 3 performed better than chance level. These results cannot be fully explained, and they are not in line with the results obtained by Soto-Faraco et al. (2007), whose subjects with no knowledge of the target languages did not reach a result above chance level. There are several possible explanations. The visual differences between Swedish and Finnish (as produced by our speaker) could be so large that the difference might be perceived on the basis of psycho-optic signals alone. Another possible explanation has to do with extra- or paralinguistic visual signals that have become language dependent (for our speaker).

Group 2 achieved a higher score than group 3. The difference approached significance, suggesting that knowledge of at least one of the target languages favors visual discriminability. These tendencies are in line with Soto-Faraco et al. (2007).

In group 1, there were tendencies for the estimated amount of use of Swedish to correlate well with performance. This factor was stronger than the number of semesters spent learning Swedish at SFI. This is important, since the learning and establishment of new visual categories does not stop when the course is over. If we acknowledge the visually discriminable categories as part of the linguistic competence, it must be a goal for L2 learners to master these categories.

Acknowledgements

We would like to thank Pilvi Mattila and Liisa Nousiainen for their evaluation of the speaker's Finnish pronunciation. Thanks are also due to Hernan Quintero, Lena Höglund Santamarta and Xila Philström for help with practical issues concerning the experimental procedure in a foreign language.

References

Amcoff S. (1970) Visuell perception av talljud och avläsestöd för hörselskadade [Visual perception of speech sounds and lip-reading support for the hearing-impaired]. Rapport Nr. 7, LSH Uppsala: Pedagogiska institutionen.
Ellis R. (1994) The Study of Second Language Acquisition. Oxford: Oxford University Press.
Engstrand O. (2004) Fonetikens grunder [The foundations of phonetics]. Lund: Studentlitteratur.
Erber N. (1969) Interaction of audition and vision in the recognition of oral speech stimuli. Journal of Speech and Hearing Research 12, 423-425.
Johnson F.M. et al. (1988) Sex differences in lipreading. Bulletin of the Psychonomic Society 26, 106-108.
Kricos P. (1996) Differences in visual intelligibility across talkers. In Stork D. and Hennecke M. (eds.) Speechreading by Humans and Machines. Heidelberg, Germany: Springer-Verlag, 43-55.
McGurk H. and MacDonald J. (1976) Hearing lips and seeing voices. Nature 264, 746-748.
Soto-Faraco S. et al. (2007) Discriminating languages by speech-reading. Perception and Psychophysics 69, 218-231.
Sumby W.H. and Pollack I. (1954) Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 26, 212-215.
Traunmüller H. and Öhrström N. (2007) Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics 35, 244-258.
Volotinen H. (2008) För öppet och tonlöst. Uttalssvårigheter i svenskan på universitetsnivån [Too open and voiceless: pronunciation difficulties in Swedish at the university level]. Pro gradu thesis in the Swedish language, Department of Languages, University of Jyväskylä.


Wada Y. et al. (2003) Audiovisual integration in temporal perception. International Journal of Psychophysiology 50, 117-124.
Weikum W. et al. (2007) Visual language discrimination in infancy. Science 316, 1159.


Estimating speaker characteristics for speech recognition

Mats Blomberg and Daniel Elenius
Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm

Abstract

A speaker-characteristic-based hierarchic tree of speech recognition models is designed. The leaves of the tree contain model sets which are created by transforming a conventionally trained set using leaf-specific speaker profile vectors. The non-leaf models are formed by merging the models of their child nodes. During recognition, a maximum likelihood criterion is followed to traverse the tree from the root to a leaf. The computational load for estimating one-dimensional (vocal tract length) and four-dimensional (vocal tract length, two spectral slope parameters and model variance scaling) speaker profile vectors is reduced to a fraction of that of an exhaustive search among all leaf nodes. Recognition experiments on children's connected digits using adult models exhibit similar recognition performance for the exhaustive and the one-dimensional tree search. Further error reduction is achieved with the four-dimensional tree. The estimated speaker properties are analyzed and discussed.

Introduction

Knowledge of speech production can play an important role in speech recognition by imposing constraints on the structure of trained and adapted models. In contrast, current conventional, purely data-driven speaker adaptation techniques put little constraint on the models. This makes them sensitive to recognition errors, and they require a sufficiently high initial accuracy in order to improve the quality of the models.

Several speaker-characteristic properties have been proposed for this type of adaptation. The most commonly used is compensation for mismatch in vocal tract length, performed by Vocal Tract Length Normalization (VTLN) (Lee and Rose, 1998). Other, less explored, candidates are voice source quality, articulation clarity, speech rate, accent, emotion, etc.

However, there are at least two problems connected to the approach. One is to establish the quantitative relation between the property and its acoustic manifestation. The second problem is that the estimation of these features quickly becomes computationally heavy, since each candidate value has to be evaluated in a complete recognition procedure, and the number of candidates needs to be sufficiently high in order to reach the required precision of the estimate. This problem becomes particularly severe if more than one property is to be jointly optimized, since the number of evaluation points equals the product of the numbers of individual candidates for each property. Two-stage techniques, e.g. Lee and Rose (1998) and Akhil et al. (2008), reduce the computational requirements, unfortunately at the price of lower recognition performance, especially if the accuracy of the first recognition stage is low.

In this work, we approach the problem of excessive computational load by representing the range of the speaker profile vector as quantized values in a multi-dimensional binary tree. Each node contains an individual value, or an interval, of the profile vector and a corresponding model set. The standard exhaustive search for the best model among the leaf nodes can now be replaced by a traversal of the tree from the root to a leaf, which results in a significant reduction of the amount of computation.

There is an important argument for structuring the tree on speaker-characteristic properties rather than on acoustic observations. If we know the acoustic effect of modifying a certain property of this kind, we can predict models of speaker profiles outside their range in the adaptation corpus. This extrapolation is generally not possible with the standard acoustic-only representation.

In this report, we evaluate the prediction performance by training the models on adult speech and evaluating the recognition accuracy on children's speech. The achieved results exhibit a substantial reduction in computational load while maintaining performance similar to that of an exhaustive grid search technique.

In addition to the recognized identity, the speaker properties are also estimated. As these can be represented in acoustic-phonetic terms, they are easier to interpret than the standard model parameters used in a recognizer. This

provides a mechanism for feedback from speech recognition research to speech production knowledge.

Method

Tree generation

The tree is generated using a top-down design in the speaker profile domain, followed by a bottom-up merging process in the acoustic model domain. Initially, the root node is loaded with the full, sorted list of values for each dimension in the speaker profile vector. A number of child nodes are created, whose lists are obtained by binary splitting of each dimension list in the mother node. This tree generation process proceeds until each dimension list has a single value, which defines a leaf node. In this node, the dimension values define a unique speaker profile vector. This vector is used to predict a profile-specific model set by controlling the transformation of a conventionally trained original model set. When all child node models of a certain mother node are created, they are merged into a model set at their mother node. The merging procedure is repeated upwards in the tree until the root model is reached. Each node in the tree now contains a model set which is defined by its list of speaker profile values. All models in the tree have equal structure and number of parameters.

Search procedure

During recognition of an utterance, the tree is used to select the speaker profile whose model set maximizes the score of the utterance. The recognition procedure starts by evaluating the child nodes of the root. The maximum-likelihood scoring child node is selected for further search. This is repeated until a stop criterion is met, which can be that the leaf level or a specified intermediate level is reached. Another selection criterion may be the maximum scoring node along the selected root-to-leaf path (path-max). This would account for the possibility that the nodes close to the leaves fit partial properties of a test speaker well but have to be combined with sibling nodes to give an overall good match.

Model transformations

We have selected a number of speaker properties to evaluate our multi-dimensional estimation approach. The current set contains a few basic properties described below. These are similar, although not identical, to our work in Blomberg and Elenius (2008). Further development of the set will be addressed in future work.

VTLN

An obvious candidate as one element in the speaker profile vector is Vocal Tract Length Normalisation (VTLN). In this work, a standard two-segment piece-wise linear warping function projects the original model spectrum into its warped spectrum. The procedure can be performed efficiently as a matrix multiplication in the standard acoustic representation of current speech recognition systems, MFCC (Mel Frequency Cepstral Coefficients), as shown by Pitz and Ney (2005).

Spectral slope

Our main intention with this feature is to compensate for differences in the voice source spectrum. However, since the operation currently is performed on all models, unvoiced and non-speech models will also be affected. The feature will thus perform an overall compensation of mismatch in spectral slope, whether caused by the voice source or the transmission channel.

We use a first-order low-pass function to approximate the gross spectral shape of the voice source function. This corresponds to the effect of the parameter Ta in the LF voice source model (Fant, Liljencrants and Lin, 1985). In order to correctly modify a model in this feature, it is necessary to remove the characteristics of the training data and to insert those of the test speaker. A transformation of this feature thus involves two parameters: an inverse filter for the training data and a filter for the test speaker.

This two-stage normalization technique gives us the theoretically attractive possibility to use separate transformations for the vocal tract transfer function and the voice source spectrum (at least in these parameters). After the inverse filter, there remains (in theory) only the vocal tract transfer function. Performing frequency warping at this position in the chain will thus not affect the original voice source of the model. The new source characteristics are inserted after the warping and are also unaffected. In contrast, conventional VTLN implicitly warps the voice source spectrum identically to the vocal tract transfer function. Such an assumption is, to our knowledge, not supported by speech production theory.

Model variance

An additional source of difference between adults' and children's speech is the larger intra- and inter-speaker variability of the latter category (Potamianos and Narayanan, 2003). We account for this effect by increasing the model variances. This feature will also compensate for mismatch which cannot be modeled by the other profile features. Universal variance scaling is implemented by multiplying the diagonal covariance elements of the mixture components by a constant factor.

Experiments

One- and four-dimensional speaker profiles were used for evaluation. The single-dimension speaker profile was frequency warping (VTLN). The four-dimensional profile consisted of frequency warping, the two voice source parameters and the variance scaling factor.

Speech corpora

The task of connected digit recognition in the mismatched case of child test data using adult training data was selected. Two corpora, the Swedish PF-Star children's corpus (PF-Star-Sw) (Batliner et al., 2005) and TIDIGITS, were used for this purpose. In this report, we present the PF-Star results. Results on TIDIGITS will be published in other reports.

PF-Star-Sw consists of 198 children aged between 4 and 8 years. In the digit subset, each child was aurally prompted for ten 3-digit strings. Recordings were made in a separate room at day-care and after-school centers. Downsampling and re-quantization of the original specification of PF-Star-Sw was performed to 16 bits / 16 kHz.

Since PF-Star-Sw does not contain adult speakers, the training data was taken from the adult Swedish part of the SPEECON database (Großkopf et al., 2002). In that corpus, each speaker uttered one 10-digit string and four 5-digit strings, using text prompts on a computer screen. The microphone signal was processed by an 80 Hz high-pass filter and digitized with 16 bits / 16 kHz. The same type of head-set microphone was used for PF-Star-Sw and SPEECON.

Training and evaluation sets consist of 60 speakers, resulting in a training data size of 1800 digits and a children's test data of 1650 digits. The latter size is due to the failure of some children to produce all the three-digit strings.

The low age of the children combined with the fact that the training and testing corpora are separate makes the recognition task quite difficult.

Pre-processing and model configuration

A phone model representation of the vocabulary has been chosen in order to allow phoneme-dependent transformations. A continuous-distribution HMM system with word-internal, three-state triphone models is used. The output distribution is modeled by 16 diagonal covariance mixture components.

The cepstrum coefficients are derived from a 38-channel mel filterbank with 0-7600 Hz frequency range, 10 ms frame rate and 25 ms analysis window. The original models are trained with 18 MFCCs plus normalized log energy, and their delta and acceleration features. In the transformed models, reducing the number of MFCCs to 12 compensates for cepstral smoothing and results in a standard 39-element vector.

Test conditions

The frequency warping factor was quantized into 16 log-spaced values between 1.0 and 1.7, representing the amount of frequency expansion of the adult model spectra. The two voice source factors and the variance scaling factor, being judged as less informative, were quantized into 8 log-spaced values. The pole cut-off frequencies were varied between 100 and 4000 Hz and the variance scale factor ranged between 1.0 and 3.0.

The one-dimensional tree consists of 5 levels and 16 leaf nodes. The four-dimensional tree has the same number of levels and 8192 leaves. The exhaustive grid search was not performed for four dimensions, due to prohibitive computational requirements.

The node selection criterion during the tree search was varied to stop at different levels. An additional rule was to select the maximum-likelihood node of the traversed path from the root to a leaf node. These were compared against an exhaustive search among all leaf nodes.
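The tree construction and greedy maximum-likelihood traversal described in the Method section can be sketched as follows. This is a toy reconstruction, not the authors' implementation: the per-node "model" is reduced to the mean of the node's value list, and the utterance likelihood is replaced by a stand-in score (closeness to a hypothetical test speaker's true warp factor). In the real system each node holds a merged set of transformed HMMs and the score is the recognition likelihood of the utterance.

```python
import numpy as np

def build_tree(values):
    # Each node holds its value list and a stand-in "model" (here simply the
    # mean of the values; the paper instead merges the transformed HMM sets
    # of the child nodes).
    node = {"values": values, "model": float(np.mean(values))}
    if len(values) > 1:
        mid = len(values) // 2
        node["children"] = [build_tree(values[:mid]), build_tree(values[mid:])]
    return node

def tree_search(root, score, counter):
    # Greedy maximum-likelihood descent from the root to a leaf.
    node = root
    while "children" in node:
        best, best_score = None, None
        for child in node["children"]:
            counter[0] += 1                    # one model-set evaluation
            s = score(child["model"])
            if best_score is None or s > best_score:
                best, best_score = child, s
        node = best
    return node

# 16 log-spaced warp-factor candidates between 1.0 and 1.7 (Test conditions)
warps = list(np.geomspace(1.0, 1.7, 16))
true_warp = 1.3                                # hypothetical test speaker
score = lambda m: -abs(m - true_warp)          # toy stand-in for the likelihood

counter = [0]
leaf = tree_search(build_tree(warps), score, counter)
```

With 16 leaves, the greedy descent evaluates 2 candidate models per level over 4 levels, i.e. 8 evaluations instead of the 16 needed by an exhaustive leaf search, which matches the iteration counts reported in Table 1.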

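The quantized profile grid described under Test conditions can be reproduced numerically. The sketch below is an assumption about the exact grid values; the text only gives the ranges, the counts and the log spacing, and we assume the two voice source factors are the two pole cut-off frequencies.

```python
import numpy as np

# Log-spaced candidate values for each profile dimension (ranges from the text)
warp = np.geomspace(1.0, 1.7, 16)        # frequency warping factor
cut1 = np.geomspace(100.0, 4000.0, 8)    # pole cut-off, inverse (training) filter, Hz
cut2 = np.geomspace(100.0, 4000.0, 8)    # pole cut-off, test-speaker filter, Hz
vscale = np.geomspace(1.0, 3.0, 8)       # universal variance scaling factor

# 16 * 8 * 8 * 8 = 8192 leaves, as stated for the four-dimensional tree
leaves = warp.size * cut1.size * cut2.size * vscale.size

# Variance scaling itself: multiply the diagonal covariance elements of a
# mixture component by the chosen factor (toy diagonal covariance below)
cov = np.array([0.5, 1.0, 2.0])
cov_scaled = vscale[4] * cov
```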

Training and recognition experiments were conducted using HTK (Young et al., 2005). Separate software was developed for the transformation and the model tree algorithms.

Results and discussion

Recognition results for the one- and four-element speaker profiles are presented in Table 1 for different search criteria, together with a baseline result for non-transformed models. The error rate of the one-dimensional tree-based search was as low as that of the exhaustive search at a fraction (25-50%) of the computational load. This result is especially positive, considering that the latter search is guaranteed to find the global maximum-likelihood speaker vector.

Even the profile-independent root node provides substantial improvement compared to the baseline result. Since there is no estimation procedure involved, this saves considerable computation.

In the four-dimensional speaker profile, the computational load is less than 1% of the exhaustive search. A minimum error rate is reached at stop levels two and three levels below the root. Four features yield consistent improvements over the single feature, except for the root criterion. Clearly, vocal tract length is very important, but spectral slope and variance scaling also make a positive contribution.

Table 1. Number of recognition iterations and word error rate for one- and four-dimensional speaker profiles.

Search alg.   No. iterations     WER (%)
              1-D      4-D      1-D     4-D
Baseline        1        1      32.2    32.2
Exhaustive     16     8192      11.5     -
Root            1        1      11.9    13.9
Level 1         2       16      12.2    11.1
Level 2         4       32      11.5    10.2
Level 3         6       48      11.2    10.2
Leaf            8       50      11.2    10.4
Path-max        9       51      11.9    11.6

Histograms of warp factors for individual utterances are presented in Figure 1. The distributions for the exhaustive and 1-dimensional leaf search are very similar, which corresponds well with their small difference in recognition error rate. The 4-dimensional leaf search distribution differs from these, mainly in the peak region. The cause of its bimodal character calls for further investigation. A possible explanation may lie in the fact that the reference models are trained on both male and female speakers. Distinct parts have probably been assigned in the trained models for these two categories. The two peaks might reflect that some utterances are adjusted to the female parts of the models while others are adjusted to the male parts. This might be better captured by the more detailed four-dimensional estimation.

[Figure: number of utterances vs. warp factor (1.0-1.8); series: 1-dim exhaustive, 1-dim tree, 4-dim tree]
Figure 1. Histogram of estimated frequency warp factors for the three estimation techniques.

Figure 2 shows scatter diagrams of average warp factor per speaker vs. body height for the one- and four-dimensional search. The largest difference between the plots occurs for the shortest speakers, for which the four-dimensional search shows more realistic values. This indicates that the latter makes more accurate estimates in spite of its larger deviation from a Gaussian distribution in Figure 1. This is also supported by a stronger correlation between warp factor and height (-0.55 vs. -0.64).

[Figure: two scatter plots, warp factor (1.0-1.6) vs. body height (80-160)]
Figure 2. Scatter diagrams of warp factor vs. body height for one- (left) and four-dimensional (right) search. Each sample point is an average of all utterances of one speaker.

The operation of the spectral shape compensation is presented in Figure 3 as an average function over the speakers and for the two speakers with the largest positive and negative deviation from the average. The average function indicates a slope compensation of the frequency region below around 500 Hz. This
shape may be explained as children having a steeper spectral voice source slope than adults, but there may also be influence from differences in the recording conditions between PF-Star and SPEECON.

[Figure: compensation filter gain in dB (-18 to 18) vs. frequency (0-10000 Hz); maximum, average and minimum speakers]
Figure 3. Transfer function of the voice source compensation filter as an average over all test speakers and the functions of two extreme speakers.

The model variance scaling factor has an average value of 1.39 with a standard deviation of 0.11. This should not be interpreted as a ratio between the variability among children and that of adults. This value is rather a measure of the remaining mismatch after compensation of the other features.

Conclusion

A tree-based search in the speaker profile space provides recognition accuracy similar to an exhaustive search at a fraction of the computational load and makes it practically possible to perform joint estimation in a larger number of speaker characteristic dimensions. Using four dimensions instead of one increased the recognition accuracy and improved the property estimation. The distribution of the estimates of the individual property features can also provide insight into the function of the recognition process in speech production terms.

Acknowledgements

This work was financed by the Swedish Research Council.

References

Akhil, P. T., Rath, S. P., Umesh, S. and Sanand, D. R. (2008) A Computationally Efficient Approach to Warp Factor Estimation in VTLN Using EM Algorithm and Sufficient Statistics, Proc. Interspeech.
Batliner, A., Blomberg, M., D'Arcy, S., Elenius, D., Giuliani, D. (2005) The PF-STAR Children's Speech Corpus, Proc. InterSpeech, 2761-2764.
Blomberg, M. and Elenius, D. (2008) Investigating Explicit Model Transformations for Speaker Normalization. Proc. ISCA ITRW Speech Analysis and Processing for Knowledge Discovery, Aalborg, Denmark.
Fant, G. and Kruckenberg, A. (1996) Voice source properties of the speech code. TMH-QPSR 37(4), KTH, Stockholm, 45-56.
Fant, G., Liljencrants, J. and Lin, Q. (1985) A four-parameter model of glottal flow. STL-QPSR 4/1985, KTH, Stockholm, 1-13.
Großkopf, B., Marasek, K., v. d. Heuvel, H., Diehl, F., Kiessling, A. (2002) SPEECON - speech data for consumer devices: Database specification and validation, Proc. LREC.
Lee, L. and Rose, R. C. (1998) A Frequency Warping Approach to Speaker Normalisation, IEEE Trans. on Speech and Audio Processing, 6(1): 49-60.
Pitz, M. and Ney, H. (2005) Vocal Tract Normalization Equals Linear Transformation in Cepstral Space, IEEE Trans. on Speech and Audio Processing, 13(5): 930-944.
Potamianos, A. and Narayanan, S. (2003) Robust Recognition of Children's Speech, IEEE Trans. on Speech and Audio Processing, 11(6): 603-616.



Auditory white noise enhances cognitive performance under certain conditions: Examples from visuo-spatial working memory and dichotic listening tasks

Göran G. B. W. Söderlund, Ellen Marklund, and Francisco Lacerda
Department of Linguistics, Stockholm University, Stockholm

Abstract

This study examines when external auditory noise can enhance performance in a dichotic listening and a visuo-spatial working memory task. Noise is typically conceived of as being detrimental to cognitive performance; however, given the mechanism of stochastic resonance (SR), a certain amount of noise can benefit performance. In particular, we predict that low performers will be aided by noise whereas high performers decline in performance under the same condition. Data from two experiments will be presented; participants were students at Stockholm University.

Introduction

Aim

The aim of this study is to further investigate the effects of auditory white noise on attention and cognitive performance in a normal population. Earlier research from our laboratory has found that noise exposure can, under certain prescribed settings, be beneficial for performance in cognitive tasks, in particular for individuals with attentional problems such as Attention Deficit/Hyperactivity Disorder (ADHD) (Söderlund et al., 2007). Positive effects of noise were also found among inattentive or low-achieving children in a normal population of school children (Söderlund & Sikström, 2008). The purpose of this study is to include two cognitive tasks that have not earlier been performed under noise exposure. The first task is the dichotic listening paradigm, which measures auditory attention and cognitive control. The second task is a visuo-spatial working memory test, which measures working memory performance. Participants were students at Stockholm University.

Background

It has long been known that, under most circumstances, cognitive processing is easily disturbed by environmental noise and non-task-compatible distractors (Broadbent, 1958). The effects hold across a wide variety of tasks, distractors and participant populations (e.g. Boman et al., 2005; Hygge et al., 2003). In contrast to the main body of evidence regarding distractors and noise, there have been a number of reports of counterintuitive findings. ADHD children performed better on arithmetic when exposed to rock music (Abikoff et al., 1996; Gerjets et al., 2002). Children with low socio-economic status and from crowded households performed better on memory tests when exposed to road traffic noise (Stansfeld et al., 2005). These studies did not, however, provide a satisfactory theoretical account of the beneficial effect of noise, only referring to a general increase of arousal and a general appeal counteracting boredom.

Signaling in the brain is noisy, but the brain possesses a remarkable ability to distinguish the information-carrying signal from the surrounding, irrelevant noise. A fundamental mechanism that contributes to this process is the phenomenon of stochastic resonance (SR). SR is the counterintuitive phenomenon of noise-improved detection of weak signals in the central nervous system. SR makes a weak signal, below the hearing threshold, detectable when external auditory noise is added (Moss et al., 2004). In humans, SR has also been found in the sensory modalities of touch (Wells et al., 2005), hearing (Zeng et al., 2000), and vision (Simonotto et al., 1999), in all of which moderate noise has been shown to improve sensory discrimination. However, the effect is not restricted to sensory processing, as SR has also been found in higher functions: e.g., auditory noise improved the speed of arithmetic computations in a group of school children (Usher & Feingold, 2000). SR is usually quantified by plotting detection of a weak signal, or cognitive performance, as a function of noise intensity. This relation exhibits an inverted U-curve, where performance peaks at a moderate noise level. That is, moderate noise is beneficial for performance whereas too much, or too little, noise attenuates performance.
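The inverted U-curve can be illustrated with a minimal threshold-detector simulation. This is our toy example, not a model from the SR literature: a signal of 0.8 never crosses a detection threshold of 1.0 on its own, but adding moderate Gaussian noise makes it detectable, while very little or very much noise degrades discrimination (hit rate minus false-alarm rate).

```python
import numpy as np

rng = np.random.default_rng(0)
signal, threshold, trials = 0.8, 1.0, 100_000   # signal alone is sub-threshold

def discrimination(sigma):
    noise = rng.normal(0.0, sigma, trials)
    hits = np.mean(signal + noise > threshold)   # detections with signal present
    false_alarms = np.mean(noise > threshold)    # detections with noise alone
    return hits - false_alarms

# Low, moderate and high noise levels: the moderate level discriminates best
curve = {sigma: discrimination(sigma) for sigma in (0.05, 0.5, 5.0)}
```

Plotting `discrimination` over a finer range of `sigma` values traces out the inverted U described above.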


According to the Moderate Brain Arousal (MBA) model (Sikström & Söderlund, 2007), a neurocomputational model of cognitive performance in ADHD, noise in the environment introduces internal noise into the neural system to compensate for reduced neural background activity in ADHD. This reduced neural activity is believed to depend on a hypo-functioning dopamine system in ADHD (Solanto, 2002). The MBA model suggests that the amount of noise required for optimal cognitive performance is modulated by dopamine levels and therefore differs between individuals. Dopamine modulates neural responses and function by increasing the signal-to-noise ratio through enhanced differentiation between background, efferent firing and afferent stimulation (Cohen et al., 2002). Thus, persons with low levels of dopamine will perform worse in tasks that demand a large signal-to-noise ratio. It is proposed that inattentive and/or low-performing participants will benefit from noise whereas attentive or high performers will not.

Experiments

Experiment 1. Dichotic listening

Dichotic listening literally means listening to two different verbal signals (typically the syllables ba, da, ga, pa, ta, ka) presented at the same time, one in the left ear and one in the right ear. The common finding is that participants are more likely to report the syllable presented in the right ear, a right ear advantage (REA) (Hugdahl & Davidson, 2003). During stimulus-driven bottom-up processing, language stimuli are normally (in right-handers) perceived by the left hemisphere, which receives information from the contralateral right ear in a dichotic stimulus presentation situation. If participants are instructed to attend to and report the stimuli presented to the left ear, the forced-left ear condition, this requires top-down processing to shift attention from the right to the left ear. Abundant evidence from Hugdahl's research group has shown that this attentional shift is possible for healthy persons but not for clinical groups distinguished by attentional problems, as in schizophrenia, depression, and ADHD, who generally fail to make this shift from right to left ear (Hugdahl et al., 2003). The forced-left situation produces a conflict that requires cognitive control to be resolved.

The purpose of the present experiment is to find out whether noise exposure will facilitate cognitive control in either the forced-left ear or the forced-right ear condition in a group of students. Four noise levels will be used to determine the most appropriate noise level.

Experiment 2. Visuo-spatial memory

The visuo-spatial working memory (vsWM) test is a sensitive measure of cognitive deficits in ADHD (Westerberg et al., 2004). This test determines working memory capacity without being affected by previous skills or knowledge.

Earlier research has shown that performing vsWM tasks mainly activates the right hemisphere, which indicates that the visuo-spatial ability is lateralized (Smith & Jonides, 1999). Research from our group has found that white noise exposure improves vsWM performance in both ADHD and control children (Söderlund et al., manuscript). This finding raises the question whether lateralized noise (left or right ear) exposure during vsWM encoding will affect performance differently.

The purpose of the second experiment is to find out whether the effects of noise exposure to the left ear will differ from exposure to the right ear. Control conditions will be noise exposure to both ears and no noise. The prediction is that noise exposure to the left ear will affect performance in either a positive or a negative direction, whereas exposure to the right ear will be close to the baseline condition, no noise.

Methods

Experiment 1. Dichotic listening

Participants

Thirty-one students from Stockholm University, aged between 18 and 36 years (M = 28.6), seventeen women and fourteen men. Twenty-nine were right-handed and two left-handed.

Design and material

The design was a 2 x 4, where attention (forced left vs. forced right ear) and noise level (no noise, 50, 60, and 72 dB) were independent variables (within-subject manipulations). The dependent variable was the number of correctly recalled syllables. The speech signal was 64 dB. Inter-stimulus intervals were 4 seconds and 16 syllables were presented in each condition. Four control syllables (same in both ears) were presented in each trial, so the maximum score was 12 syllables. Participants were divided into two groups after their aggregate performance in the most demanding, forced-left ear condition, over the four noise levels.

Procedure

Participants sat in a silent room in front of a computer screen and responded by pressing the first letter of the perceived syllable on a keyboard. Syllables and noise were presented through earphones. Conditions and syllables were presented in random order and the experiment was programmed in E-Prime 1.2 (Psychology Software Tools). Nine trials of stimuli were presented; the first was the non-forced baseline condition. The remaining eight trials were either forced left ear or forced right ear, presented under the four noise conditions. The testing session lasted for approximately 20 minutes.

Experiment 2. Visuo-Spatial WM task

Participants

Twenty students at Stockholm University aged 18-44 years (M = 32.3), 9 men and 11 women. 3 were left-handed and 17 right-handed.

Design and material

The design was a 4 x 2, where noise (no noise, noise left ear, noise right ear, noise both ears) was the within-subject manipulation and performance level (high vs. low performers) was the between-group manipulation. The noise level was set in accordance with earlier studies to 77 dB. The visuo-spatial WM task (Spanboard 2009) consists of red dots (memory stimuli) that are presented one at a time on a computer screen in a four-by-four grid. Inter-stimulus intervals were 4 seconds; the target is shown for 2 sec with a 2 sec pause before the next target turns up. Participants are asked to recall the locations, and the order in which the red dots appear. The working memory load increases after every second trial and the WM capacity is estimated based on the number of correctly recalled dots. Participants were divided into two groups after their performance in the spanboard task, an aggregate measure of their results in all four conditions. The results for high performers were between 197-247 points (n = 9) and for low performers between 109-177 points (n = 11).

Procedure

Participants sat in a silent room in front of a computer screen and responded by using the mouse pointer. The noise was presented in headphones. Recall time is not limited and participants click on a green arrow when they decide to continue. Every participant performs the test four times, once in each noise condition. The order of noise conditions was randomized.

Results

Experiment 1. Dichotic listening

In the non-forced baseline condition a significant right ear advantage was shown. A main effect of attention was found in favor of the right ear; more syllables were recalled in the forced-right ear condition in comparison with the forced-left ear condition (Figure 1). There was no main effect of noise, while noise affected the conditions differently. An interaction between attention and noise was found (F(28,3) = 5.66, p = .004). In the forced-left condition noise exposure did not affect performance at all; the small increase in the lowest noise condition was non-significant. A facilitating effect of noise was found in the forced-right ear condition, the more noise the better performance (F(30,1) = 5.63, p = .024).

[Figure: "Recall as a Function of Attention and Noise"; mean number of recalled syllables (4.0-8.0) for left and right ear across noise levels; noise p = .024, noise x attention p = .004]
Figure 1. Number of correctly recalled syllables as a function of noise. Noise levels were N1 = no noise, N2 = 50 dB, N3 = 60 dB, and N4 = 72 dB.

When participants were divided according to performance in the dichotic listening task, noise improved performance for both groups in the forced-right ear condition. A trend was found in
the forced-left condition, where high performers deteriorated with noise and there was no change for low performers (p = .094).

Experiment 2. Visuo-spatial WM task

No effect of noise was present when the entire sample was investigated. However, when participants were divided into two groups based on their test performance, a two-way ANOVA revealed a trend towards an interaction between lateralized noise and group (F(16,3) = 2.95, p = .065). When noise exposure to both ears was excluded from the ANOVA, the interaction did reach significance (F(17,2) = 4.13, p = .024). Further interactions were found between group and no noise vs. noise left ear (F(18,1) = 8.76, p = .008) and between left ear and right ear (F(18,1) = 4.59, p = .046). No interaction was found between group and noise right ear vs. noise both ears (Figure 2).

[Figure: "Recall as a Function of Noise and Group"; number of correctly recalled dots (25-60) for high and low performers in each noise condition; overall noise x group p = .065]
Figure 2. Number of correctly recalled dots as a function of lateralized noise (77 dB; left ear, right ear, both ears, or no noise).

Noteworthy is that the low-performing group consisted of nine women and two men whereas the high-performing group consisted of two women and seven men. However, the gender and noise interaction did not reach significance but indicated a trend (p = .096).

Paired-samples t-tests showed that the noise increment in the left ear for the low-performing group was significant (t(10) = 2.25, p = .024, one-tailed) and the decrement for the high-performing group in the same condition was significant as well (t(8) = 1.98, p = .042, one-tailed).

Conclusions

The rationale behind these two studies was to investigate the effects of noise in two cognitive tasks that put high demands on executive functions and working memory in a normal population. Results showed that there was an effect of noise in both experiments. In the dichotic listening experiment there was a main effect of noise derived from the forced-right ear condition. In the visuo-spatial working memory task there was no main effect of noise. However, when the group is split into high and low performers, significant results emerge in accordance with the predictions.

The most intriguing result in the present study is the lateralization effect of noise exposure in the visuo-spatial working memory task. Firstly, we have shown that the noise effect is cross-modal; auditory noise exerted an effect on a visual task. Secondly, the pattern of high and low performers was inverted (in all conditions). The lateralization effect could have two possible explanations: the noise exposure to the right hemisphere either interacts with the right-dominant lateralized task-specific activation in the visuo-spatial task, or it activates crucial attentional networks like the right dorsolateral pre-frontal cortex (Corbetta & Shulman, 2002).

In the dichotic listening experiment we only got effects in the easy, forced-right ear condition, which has a large signal-to-noise ratio. The more demanding forced-left ear condition may be improved by noise exposure in inattentive, ADHD participants; this will be tested in upcoming experiments.

To obtain such prominent group effects despite the homogeneity of the group, consisting of university students, demonstrates a large potential for future studies on participants with attentional problems, such as in ADHD.

Acknowledgements

Data collection was made by students from the Department of Linguistics and from students at the speech therapist program. This research was funded by a grant from the Swedish Research Council (VR 421-2007-2479).
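The paired-samples t statistics reported in the Results follow the standard formula t = mean(d) / (sd(d) / sqrt(n)) over the within-subject differences d, with n - 1 degrees of freedom. A small self-contained sketch with made-up scores (not the study's data):

```python
import math

def paired_t(a, b):
    # Paired-samples t statistic; a one-tailed p-value would be obtained
    # from the t distribution with n - 1 degrees of freedom.
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n), n - 1

# Illustrative data only: recall scores with vs. without noise for four subjects
t, df = paired_t([7, 9, 8, 10], [5, 6, 7, 8])
```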


References

Abikoff, H., Courtney, M. E., Szeibel, P. J., & Koplewicz, H. S. (1996). The effects of auditory stimulation on the arithmetic performance of children with ADHD and nondisabled children. Journal of Learning Disabilities, 29(3), 238-246.
Boman, E., Enmarker, I., & Hygge, S. (2005). Strength of noise effects on memory as a function of noise source and age. Noise Health, 7(27), 11-26.
Broadbent, D. E. (1958). The effects of noise on behaviour. Elmsford, NY, US: Pergamon Press, Inc.
Cohen, J. D., Braver, T. S., & Brown, J. W. (2002). Computational perspectives on dopamine function in prefrontal cortex. Current Opinion in Neurobiology, 12(2), 223-229.
Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3), 201-215.
Gerjets, P., Graw, T., Heise, E., Westermann, R., & Rothenberger, A. (2002). Deficits of action control and specific goal intentions in hyperkinetic disorder. II: Empirical results / Handlungskontrolldefizite und störungsspezifische Zielintentionen bei der Hyperkinetischen Störung: II: Empirische Befunde. Zeitschrift für Klinische Psychologie und Psychotherapie: Forschung und Praxis, 31(2), 99-109.
Hugdahl, K., & Davidson, R. J. (2003). The asymmetrical brain. Cambridge, MA, US: MIT Press.
Hugdahl, K., Rund, B. R., Lund, A., Asbjornsen, A., Egeland, J., Landro, N. I., et al. (2003). Attentional and executive dysfunctions in schizophrenia and depression: evidence from dichotic listening performance. Biological Psychiatry, 53(7), 609-616.
Hygge, S., Boman, E., & Enmarker, I. (2003). The effects of road traffic noise and meaningful irrelevant speech on different memory systems. Scandinavian Journal of Psychology, 44(1), 13-21.
Moss, F., Ward, L. M., & Sannita, W. G. (2004). Stochastic resonance and sensory information processing: a tutorial and review of application. Clinical Neurophysiology, 115(2), 267-281.
Sikström, S., & Söderlund, G. B. W. (2007). Stimulus-dependent dopamine release in attention-deficit/hyperactivity disorder. Psychological Review, 114(4), 1047-1075.
Simonotto, E., Spano, F., Riani, M., Ferrari, A., Levero, F., Pilot, A., et al. (1999). fMRI studies of visual cortical activity during noise stimulation. Neurocomputing: An International Journal. Special double volume: Computational neuroscience: Trends in research 1999, 26-27, 511-516.
Smith, E. E., & Jonides, J. (1999). Storage and executive processes in the frontal lobes. Science, 283(5408), 1657-1661.
Solanto, M. V. (2002). Dopamine dysfunction in AD/HD: integrating clinical and basic neuroscience research. Behavioral Brain Research, 130(1-2), 65-71.
Stansfeld, S. A., Berglund, B., Clark, C., Lopez-Barrio, I., Fischer, P., Ohrstrom, E., et al. (2005). Aircraft and road traffic noise and children's cognition and health: a cross-national study. Lancet, 365(9475), 1942-1949.
Söderlund, G. B. W., & Sikström, S. (2008). Positive effects of noise on cognitive performance: Explaining the Moderate Brain Arousal Model. Proceedings from ICBEN, International Commission on the Biological Effects of Noise.
Söderlund, G. B. W., Sikström, S., & Smart, A. (2007). Listen to the noise: Noise is beneficial for cognitive performance in ADHD. Journal of Child Psychology and Psychiatry, 48(8), 840-847.
Usher, M., & Feingold, M. (2000). Stochastic resonance in the speed of memory retrieval. Biological Cybernetics, 83(6), L11-16.
Wells, C., Ward, L. M., Chua, R., & Timothy Inglis, J. (2005). Touch noise increases vibrotactile sensitivity in old and young. Psychological Science, 16(4), 313-320.
Westerberg, H., Hirvikoski, T., Forssberg, H., & Klingberg, T. (2004). Visuo-spatial working memory span: a sensitive measure of cognitive deficits in children with ADHD. Child Neuropsychology, 10(3), 155-161.
Zeng, F. G., Fu, Q. J., & Morse, R. (2000). Human hearing enhanced by noise. Brain Research, 869(1-2), 251-255.



Factors affecting visual influence on heard vowel roundedness: Web experiments with Swedes and Turks

Hartmut Traunmüller
Department of Linguistics, University of Stockholm

Abstract

The influence of various general and stimulus-specific factors on the contribution of vision to heard roundedness was investigated by means of web experiments conducted in Swedish. The original utterances consisted of the syllables /ɡyːɡ/ and /ɡeːɡ/ of a male and a female speaker. They were synchronized with each other in all combinations, resulting in four stimuli that were incongruent in vowel quality, two of them additionally in speaker sex. One of the experiments was also conducted in Turkish, using the same stimuli. The results showed that visible presence of lip rounding has a weaker effect on audition than its absence, except for conditions that evoke increased attention, such as when a foreign language is involved. The results suggest that female listeners are more susceptible to vision under such conditions. There was no significant effect of age and of discomfort felt by being exposed to dubbed speech. A discrepancy in speaker sex did not lead to reduced influence of vision. The results also showed that habituation to dubbed speech has no deteriorating effect on normal auditory-visual integration in the case of roundedness.

Introduction

In auditory speech perception, the perceptual weight of the information conveyed by the visible face of a speaker can be expected to vary with many factors.

1) The particular phonetic feature and system
2) Language familiarity
3) The individual perceiver
4) The individual speaker and speech style
5) Visibility of the face / audibility of the voice
6) The perceiver's knowledge about the stimuli
7) Context
8) Cultural factors

Most studies within this field have been concerned with the perception of place of articulation in consonants, like McGurk and MacDonald (1976). These studies have shown that the presence/absence of labial closure tends to be perceived by vision. As for vowels, it is known that under ideal audibility and visibility conditions, roundedness is largely heard by vision, while heard openness (vowel height) is hardly at all influenced by vision (Traunmüller & Öhrström, 2007). These observations make it clear that the presence/absence of features tends to be perceived by vision if their auditory cues are subtle while their visual cues are prominent. Differences between phonetic systems are also relevant. When, e.g., an auditory [ɡ] is presented in synchrony with a visual [b], this is likely to fuse into a [ɡ͡b] only for perceivers who are competent in a language with a [ɡ͡b]. Others are more likely to perceive a [ɡ] or a consonant cluster. The observed lower visual influence in speakers of Japanese as compared with English (Sekiyama and Burnham, 2008) represents a more subtle case, whose cause may lie outside the phonetic system.

The influence of vision is increased when the perceived speech sounds foreign (Sekiyama and Tohkura, 1993; Hayashi and Sekiyama, 1998; Chen and Hazan, 2007). This is referred to as the "foreign-language effect".

The influence of vision varies substantially between speakers and speaking styles (Munhall et al., 1996; Traunmüller and Öhrström, 2007). The influence of vision also varies greatly between perceivers. There is variation with age. Pre-school children are less sensitive (Sekiyama and Burnham, 2004) although even pre-linguistic children show influence of vision (Burnham and Dodd, 2004). There is also a subtle sex difference: Women tend to be more susceptible to vision (Irwin et al., 2006; Traunmüller and Öhrström, 2007).

The influence of vision increases with decreasing audibility of the voice, e.g. due to noise, and decreases with decreasing visibility of the face, but only very little with increasing distance up to 10 m (Jordan, 2000).

Auditory-visual integration works even when there is a discrepancy in sex between a voice and a synchronized face (Green et al., 1991) and it is also robust with respect to what the perceiver is told about the stimuli. A minor effect on vowel perception has, nevertheless, been observed when subjects were told the sex
represented by an androgynous voice (Johnson, Strand and D'Imperio, 1999).

Auditory-visual integration is robust to semantic factors (Sams et al., 1998) but it is affected by context, e.g. the vocalic context of consonants (Shigeno, 2002). It can also be affected by the experimental method (e.g., blocked vs. random stimulus presentation).

It has been suggested that cultural conventions, such as socially prescribed gaze avoidance, may affect the influence of vision (Sekiyama and Tohkura, 1993; Sekiyama, 1997). Exposure to dubbed films is another cultural factor that can be suspected to affect the influence of vision. The dubbing of foreign movies is a widespread practice that often affects nearly all speakers of certain languages. Since in dubbed speech the sound is largely incongruent with the image, habituation requires learning to disrupt the normal process of auditory-visual integration. Considering also that persons who are not habituated often complain about discomfort and mental pain when occasionally exposed to dubbed speech, it deserves to be investigated whether the practice of dubbing deteriorates auditory-visual integration more permanently in the exposed populations.

The present series of web experiments had the primary aim of investigating (1) the effects of the perceiver's knowledge about the stimuli and (2) those of a discrepancy between face and voice (male/female) on the heard presence or absence of lip rounding in front vowels. Additional factors considered, without being experimentally balanced, were (3) sex and (4) age of the perceiver, (5) discomfort felt from dubbed speech, (6) noticed/unnoticed phonetic incongruence and (7) listening via loudspeaker or headphones.

The experiments were conducted in Swedish, but one experiment was also conducted in Turkish. The language factor that may disclose itself in this way has to be interpreted with caution, since (8) the "foreign-language effect" remains confounded with (9) effects due to the difference between the phonetic systems.

Most Turks are habituated to dubbed speech, since dubbing foreign movies into Turkish is fairly common. Some are not habituated, since such dubbing is not pervasive. This allows investigating (10) the effect of habituation to dubbed speech. Since dubbing into Swedish is only rarely practiced - with performances intended for children - adult Swedes are rarely habituated to dubbed speech.

Method

Speakers

The speakers were two native Swedes, a male doctoral student, 29 years (index ♂), and a female student, 21 years (index ♀). These were two of the four speakers who served for the experiments reported in Traunmüller and Öhrström (2007). For the present experiment, a selection of audiovisual stimuli from this experiment was reused.

Speech material

The original utterances consisted of the Swedish nonsense syllables /ɡyːɡ/ and /ɡeːɡ/. Each auditory /ɡyːɡ/ was synchronized with each visual /ɡeːɡ/ and vice-versa. This resulted in 2 times 4 stimuli that were incongruent in vowel quality, half of them being, in addition, incongruent in speaker (male vs. female).

Experiments

Four experiments were conducted with instructions in Swedish. The last one of these was also translated and conducted in Turkish, using the same stimuli. The number of stimuli was limited to 5 or 6 in order to facilitate the recruitment of subjects.

Experiment 1

Sequence of stimuli (in each case first vowel by voice, second vowel by face):
e♂e♂, e♀y♂ x, y♂e♀ x, e♂y♂ n, y♀e♀ n

For each of the five stimuli, the subjects were asked for the vowel quality they heard.
"x" indicates that the subjects were also asked for the sex of the speaker.
"n" indicates that the subjects were also asked whether the stimulus was natural or dubbed.

Experiment 2

In this experiment, there were two congruent stimuli in the beginning. After these, the subjects were informed that they would next be exposed to two stimuli obtained by cross-dubbing these. The incongruent stimuli and their order of presentation were the same as in Exp. 1. Sequence of stimuli:
e♀e♀, y♂y♂, e♀y♂, y♂e♀, e♂y♂ n, y♀e♀ n
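The cross-dubbing that yields the stimulus set can be enumerated mechanically. A sketch with invented labels ("m"/"f" for the speakers), just to make the "2 times 4" count explicit:

```python
# Enumerate the cross-dubbed stimuli described above: each auditory /ɡyːɡ/
# combined with each visual /ɡeːɡ/, and vice versa, over the two speakers.
# The labels are invented for illustration.
from itertools import product

speakers = ["m", "f"]
pairs = []
for aud_vowel, vis_vowel in [("y", "e"), ("e", "y")]:  # and vice versa
    for aud_spk, vis_spk in product(speakers, speakers):
        pairs.append((f"{aud_vowel}{aud_spk}", f"{vis_vowel}{vis_spk}"))

print(len(pairs))  # 2 x 4 = 8 stimuli incongruent in vowel quality
incongruent_sex = [p for p in pairs if p[0][1] != p[1][1]]
print(len(incongruent_sex))  # half are also incongruent in speaker sex
```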


Experiment 3

This experiment differed from Exp. 1 in an inverted choice of speakers. Sequence of stimuli:
e♀e♀, e♂y♀ x, y♀e♂ x, e♀y♀ n, y♂e♂ n

Experiment 4

This experiment differed from Exp. 1 only in the order of stimulus presentation. It was conducted not only in Swedish but also in Turkish. Sequence of stimuli:
e♂e♂, y♂e♀ x, e♀y♂ x, y♀e♀ n, e♂y♂ n

Subjects

For the experiments with instructions in Swedish, most subjects were recruited via web fora:
Forumet.nu > Allmänt forum,
Flashback Forum > Kultur > Språk,
Forum för vetenskap och folkbildning,
KP-webben > Allmänt prat (Exp. 1).

Since young adult males dominate on these fora, except the last one, where girls aged 10-14 years dominate, some additional adult female subjects were recruited by distribution of slips in cafeterias at the university and in a concert hall. Most of the subjects of Exp. 2 were recruited by invitation via e-mail. This was also the only method used for the experiment with instructions in Turkish. This method resulted in a more balanced representation of the sexes, as can be seen in Figure 1.

Procedure

The instructions and the stimuli were presented in a window 730 x 730 px in size if not changed by the subject. There were eight or nine displays of this kind, each with a heading 'Do you also hear with your eyes?'. The whole session could be run through in less than 3 minutes if there were no cases of hesitation.

The questions asked concerned the following:

First general question, multiple response:
• Swedish (Turkish) first language
• Swedish (Turkish) best known language
• Swedish (Turkish) most heard language

Further general questions, alternative response:
• Video ok | not so in the beginning | not ok
• Listening by headphones | by loudspeaker
• Heard well | not so | undecided
• Used to dubbed speech | not so | undecided
• Discomfort (obehag, rahatsızlık) from dubbed speech | not so | undecided
• Male | Female
• Age in years ....
• Answers trustable | not so

If one of the negations shown here in italics was chosen, the results were not evaluated. Excluded were also cases in which more than one vowel failed to be responded to.

The faces were shown on a small video screen, width 320 px, height 285 px. The height of the faces on screen was roughly 55 mm (♂) and 50 mm (♀). The subjects were asked to look at the speaker and to tell what they 'heard' 'in the middle (of the syllable)'. Each stimulus was presented twice, but repetition was possible.

Stimulus-specific questions:

Vowel quality (Swedes):
• i | e | y | ö | undecided (natural stimuli)
• y | i | yi | undecided (aud. [y], vis. [e])
• e | ö | eö | undecided (aud. [e], vis. [y])
Swedish non-IPA letter: ö [ø].

Vowel quality (Turks):
• i | e | ü | ö | undecided (natural stimuli)
• ü | üy | i | ı | undecided (aud. [y], vis. [e])
• e | eö | ö | undecided (aud. [e], vis. [y])
Turkish non-IPA: ü [y], ö [ø], ı [ɯ] and y [j].
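The exclusion rules described above can be expressed as a small filter. A sketch with invented field names and answer codes (the paper does not specify a data format):

```python
# Sketch of the subject-exclusion rules described above: discard a session
# if any disqualifying answer was chosen, or if more than one vowel
# question was left unanswered. All field names and codes are invented.
def evaluate(session):
    disqualifying = {"video_not_ok", "heard_not_well", "answers_not_trustable"}
    if disqualifying & set(session["general_answers"]):
        return False  # a negated general answer excludes the subject
    missing = sum(1 for v in session["vowel_responses"] if v is None)
    return missing <= 1  # more than one missing vowel response -> exclude

ok = {"general_answers": ["heard_well"],
      "vowel_responses": ["y", "i", None, "e", "e"]}
bad = {"general_answers": ["video_not_ok"],
       "vowel_responses": ["y", "i", "e", "e", "e"]}
print(evaluate(ok), evaluate(bad))  # True False
```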

Fig. 1. Population pyramids for (from left to right) Exp. 1, 2, 3, 4 (Swedish version) and 4 (Turkish version). Evaluated subjects only. Males left, females right.


• Female sounding male | male looking female | undecided (when voice♂ & face♀)
• Male sounding female | female looking male | undecided (when voice♀ & face♂)
• Natural | dubbed | undecided (when speaker congruent, vowel incongruent)

Upon completing the responses, these were transmitted by e-mail to the experimenter, together with possible comments by the subject, who was invited to an explanatory demonstration (http://legolas.ling.su.se/staff/hartmut/webexperiment/xmpl.se.htm, ...tk.htm).

Subjects participating via Swedish web fora were informed within 15 minutes or so about how many times they had heard by eye.

Results

The most essential stimulus-specific results are summarized in Table 1 for Exp. 1, 2 and 4 and in Table 2 for Exp. 3. Subjects who had not indicated the relevant language as their first or their best known language have been excluded.

It can be seen in Tables 1 and 2 that for each auditory-visual stimulus combination, there were only minor differences in the results between Exp. 1, 2, and the Swedish version of Exp. 4. For combinations of auditory [e] and visual [y], the influence of vision was, however, clearly smaller than that observed within the frame of the previous experiment (Traunmüller and Öhrström, 2007), while it was clearly greater in the Turkish version of Exp. 4. Absence of lip rounding in the visible face had generally a stronger effect than visible presence of lip rounding, in particular among Swedes. In the Turkish version, there was a greater influence of vision also for the combination of auditory [y] and visual [e], which was predominantly perceived as an [i] and only seldom as [yj] among both Turks and Swedes. The response [ɯ] (Turkish only) was also rare. The proportion of visually influenced responses substantially exceeded the proportion of stimuli perceived as undubbed, especially so among Turks.

The results from Exp. 3, in which the speakers had been switched (Table 2), showed the same trend that can be seen in Exp. 1, although visible presence of roundedness had a prominent effect with the female speaker within the frame of the previous experiment.

A subject-specific measure of the overall influence of vision was obtained by counting the responses in which there was any influence of vision and dividing by four (the number of incongruent stimuli presented).

A preliminary analysis of the results from Exp. 1 to 3 did not reveal any substantial effects of habituation to dubbing, discomfort from dubbing, sex or age.

Table 1. Summary of stimulus-specific results from Exp. 1, 2, and 4: percentage of cases showing influence of vision on heard roundedness in the syllable nucleus (monophthong or diphthong). No influence was assumed when the response was 'undecided'. O: order of presentation. Nat: percentage of stimuli perceived as natural (undubbed), shown for stimuli without incongruence in sex. Corresponding results from the previous experiment (Traunmüller and Öhrström, 2007) are shown in the leftmost column of figures.

Stimulus       Prev. exp.   Exp. 1       Exp. 2 (informed)   Exp. 4 (Swedes)   Exp. 4 (Turks)
Voice & Face   n=42         n=185        n=99                n=84              n=71
               20♂, 22♀     122♂, 63♀    57♂, 42♀            73♂, 11♀          30♂, 41♀
                            O   %  Nat   O   %  Nat          O   %  Nat         %  Nat
y♀ & e♀        83           4  82   72   4  81   66          3  81   61        99   64
e♂ & y♂        50           3  26   15   3  25   11          4  23    9        79   23
y♂ & e♀        -            2  80        2  65               1  75             94
e♀ & y♂        -            1  41        1  42               2  52             83

Table 2. Summary of stimulus-specific results for Exp. 3, arranged as in Table 1.

Stimulus       Prev. exp.   Exp. 3
Voice & Face   n=42         n=47
               20♂, 22♀     41♂, 6♀
                            O   %  Nat
y♂ & e♂        79           4  81   66
e♀ & y♀        86           3  34   11
y♀ & e♂        -            2  85
e♂ & y♀        -            1  28
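The subject-specific measure described above — responses showing any visual influence, divided by the four incongruent stimuli — can be sketched as follows (response codes invented for illustration):

```python
# Sketch of the subject-specific "influence of vision" measure: count the
# responses to the four incongruent stimuli that show any visual influence
# and divide by four. The response coding here is invented.
def influence_of_vision(responses, influenced_codes):
    """responses: one answer per incongruent stimulus (exactly four)."""
    assert len(responses) == 4
    hits = sum(1 for r in responses if r in influenced_codes)
    return hits / 4.0

# e.g. for auditory [y] + visual [e], responses 'i' or 'yi' show influence;
# for auditory [e] + visual [y], 'ö' or 'eö' do.
influenced = {"i", "yi", "ö", "eö"}
print(influence_of_vision(["i", "y", "eö", "e"], influenced))  # 0.5
```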


For Exp. 4, the results of Chi-square tests of the effects of subject-specific variables on the influence of vision are listed in Table 3.

Table 3. Effects of general variables (use of headphones, habituated to dubbing, discomfort from dubbing, sex and age) on "influence of vision".

              Swedes                Turks
              n            p        n            p
Phone use     25 of 83     0.02     12 of 68     0.11
Habituated     7 of 81     0.7      53 of 68     0.054
Discomfort    50 of 76     0.4      33 of 63     0.93
Female        11 of 84     0.9      39 of 69     0.045
Age                84      0.24          69

Use of headphones had the effect of reducing the influence of vision among both Swedes (significantly) and Turks (not significantly).

Habituation to dubbed speech had no noticeable effect among the few Swedes (7 of 81, 9% habituated), but it increased(!) the influence of vision to an almost significant extent among Turks (53 of 68, 78% habituated).

Discomfort felt from dubbed speech had no significant effect on the influence of vision. Such discomfort was reported by 66% of the Swedes and also by 52% of the Turks.

Among Turks, females were significantly more susceptible to vision than males, while there was no noticeable sex difference among Swedes.

Discussion

The present series of web experiments disclosed an asymmetry in the influence of vision on the auditory perception of roundedness: Absence of lip rounding in the visible face had a stronger effect on audition than visible presence of lip rounding. This is probably due to the fact that lip rounding (protrusion) is equally absent throughout the whole visible stimulus when there is no rounded vowel. When there is a rounded vowel, there is a dynamic rounding gesture, which is most clearly present only in the middle of the stimulus. Allowing for some asynchrony, such a visible gesture is also compatible with the presence of a diphthong such as [eø], which was the most common response given by Turks to auditory [e] dubbed on visual [y]. The reason for the absence of this asymmetry in the previous experiment (Traunmüller and Öhrström, 2007) can be seen in the higher demand of visual attention. In this previous experiment, the subjects had to identify randomized stimuli, some of which were presented only visually. This is likely to have increased the influence of visual presence of roundedness.

The present experiments had the aim of investigating the effects of
1) a male/female face/voice incongruence,
2) the perceiver's knowledge about the stimuli,
3) sex of perceiver,
4) age of the perceiver,
5) discomfort from dubbed speech,
6) noticed/unnoticed incongruence,
7) listening via loudspeaker or headphones,
8) language and foreignness,
9) habituation to dubbed speech.

1) The observation that a drastic incongruence between face and voice did not cause a significant reduction of the influence of vision agrees with previous findings (Green et al., 1991). It confirms that auditory-visual integration occurs after extraction of the linguistically informative quality in each modality, i.e. after demodulation of voice and face (Traunmüller and Öhrström, 2007b).

2) Since the verbal information about the dubbing of the stimuli was given in cases in which the stimuli were anyway likely to be perceived as dubbed, the negative results obtained here are still compatible with the presence of a small effect of cognitive factors on perception, such as observed by Johnson et al. (1999).

3) The results obtained with Turks confirm that women are more susceptible to visual information. However, the results also suggest that this difference is likely to show itself only when there is an increased demand of attention. This was the case in the experiments by Traunmüller and Öhrström (2007) and it is also the case when listening to a foreign language, an unfamiliar dialect or a foreigner's speech, which holds for the Turkish version of Exp. 4. The present results do not suggest that a sex difference will emerge equally clearly when Turks listen to Turkish samples of speech.

4) The absence of a consistent age effect within the range of 10-65 years agrees with previous investigations.

5) The absence of an effect of discomfort from dubbed speech on influence of vision suggests that the experience of discomfort arises as an after-effect of speech perception.

6) Among subjects who indicated a stimulus as dubbed, the influence of vision was reduced. This was expected, but it is in contrast with the fact that there was no significant reduction when there was an obvious discrepancy in sex. It appears that only the discrepancy in
the linguistically informative quality is relevant here, and that an additional discrepancy between voice and face can even make it more difficult to notice the relevant discrepancy. This appears to have happened with the auditory e♀ dubbed on the visual y♂ (see Table 1).

7) The difference between subjects listening via headphones and those listening via loudspeaker is easily understood: The audibility of the voice is likely to be increased when using headphones, mainly because of a better signal-to-noise ratio.

8) The greater influence of vision among Turks as compared with Swedes most likely reflects a "foreign language effect" (Hayashi and Sekiyama, 1998; Chen and Hazan, 2007). To Turkish listeners, syllables such as /ɡyːɡ/ and /ɡiːɡ/ sound foreign, since long vowels occur only in open syllables and a final /ɡ/ never in Turkish. Minor differences in vowel quality are also involved. A greater influence of vision might perhaps also result from a higher functional load of the roundedness distinction, but this load is not likely to be higher in Turkish than in Swedish.

9) The results show that habituation to dubbed speech has no deteriorating effect on normal auditory-visual integration in the case of roundedness. The counter-intuitive result showing Turkish habituated subjects to be influenced more often by vision remains to be explained. It should be taken with caution since only 15 of the 68 Turkish subjects were not habituated to dubbing.

Acknowledgements

I am grateful to Mehmet Aktürk (Centre for Research on Bilingualism at Stockholm University) for the translation and the recruitment of Turkish subjects. His service was financed within the frame of the EU-project CONTACT (NEST, proj. 50101).

References

Burnham, D., and Dodd, B. (2004) Auditory-visual speech integration by prelinguistic infants: Perception of an emergent consonant in the McGurk effect. Developmental Psychobiology 45, 204–220.
Chen, Y., and Hazan, V. (2007) Language effects on the degree of visual influence in audiovisual speech perception. Proc. of the 16th International Congress of Phonetic Sciences, 2177–2180.
Green, K. P., Kuhl, P. K., Meltzoff, A. N., and Stevens, E. B. (1991) Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception and Psychophysics 50, 524–536.
Hayashi, T., and Sekiyama, K. (1998) Native-foreign language effect in the McGurk effect: a test with Chinese and Japanese. AVSP'98, Terrigal, Australia. http://www.isca-speech.org/archive/avsp98/
Irwin, J. R., Whalen, D. H., and Fowler, C. A. (2006) A sex difference in visual influence on heard speech. Perception and Psychophysics 68, 582–592.
Johnson, K., Strand, A. E., and D'Imperio, M. (1999) Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics 27, 359–384.
Jordan, T. R. (2000) Effects of distance on visual and audiovisual speech recognition. Language and Speech 43, 107–124.
McGurk, H., and MacDonald, J. (1976) Hearing lips and seeing voices. Nature 264, 746–748.
Munhall, K. G., Gribble, P., Sacco, L., and Ward, M. (1996) Temporal constraints on the McGurk effect. Perception and Psychophysics 58, 351–362.
Sams, M., Manninen, P., Surakka, V., Helin, P., and Kättö, R. (1998) McGurk effect in Finnish syllables, isolated words, and words in sentences: Effects of word meaning and sentence context. Speech Communication 26, 75–87.
Sekiyama, K., and Burnham, D. (2004) Issues in the development of auditory-visual speech perception: adults, infants, and children. In INTERSPEECH-2004, 1137–1140.
Sekiyama, K., and Tohkura, Y. (1993) Inter-language differences in the influence of visual cues in speech perception. Journal of Phonetics 21, 427–444.
Shigeno, S. (2002) Influence of vowel context on the audio-visual perception of voiced stop consonants. Japanese Psychological Research 42, 155–167.
Traunmüller, H., and Öhrström, N. (2007) Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics 35, 244–258.
Traunmüller, H., and Öhrström, N. (2007b) The auditory and the visual percept evoked by the same audiovisual stimuli. In AVSP-2007, paper L4-1.

Breathiness differences in male and female speech. Is H1-H2 an appropriate measure?

Adrian P. Simpson
Institute of German Linguistics, University of Jena, Jena, Germany

172 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Abstract

A well-established difference between male and female voices, at least in an Anglo-Saxon context, is the greater degree of breathy voice used by women. The acoustic measure that has most commonly been used to validate this difference is the relative strength of the first and second harmonics, H1-H2. This paper suggests that sex-specific differences in harmonic spacing combined with the high likelihood of nasality being present in the vocalic portions make the use of H1-H2 an unreliable measure in establishing sex-specific differences in breathiness.

Introduction

One aspect of male and female speech that has attracted a good deal of interest is the difference in voice quality, in particular, breathy voice. Sex-specific differences in breathy voice have been examined from different perspectives. Henton and Bladon (1985) examine behavioural differences, whereas in the model proposed by Titze (1989), differences in vocal fold dimensions predict a constant dc flow during female voicing. In an attempt to improve the quality of female speech synthesis, Klatt and Klatt (1990) use a variety of methods to analyse the amount of aspiration noise in the male and female source.

A variety of methods have been proposed to measure breathiness:
● Relative lowering of fundamental frequency (Pandit, 1957).
● Presence of noise in the upper spectrum (Pandit, 1957; Ladefoged and Antoñanzas Barroso, 1985; Klatt and Klatt, 1990).
● Presence of tracheal poles/zeroes (Klatt and Klatt, 1990).
● Relationship between the strength of the first harmonic H1 and the amplitude of the first formant A1 (Fischer-Jørgensen, 1967; Ladefoged, 1983).
● Relationship between the strength of the first and second harmonic, H1-H2 (Fischer-Jørgensen, 1967; Henton and Bladon, 1985; Huffman, 1987; Ladefoged and Antoñanzas Barroso, 1985; Klatt and Klatt, 1990).

It is the last of these measures that has most commonly been applied to measuring sex-specific voice quality differences.

In this paper I set out to show that relating the strength of the first harmonic to other spectral measures as a way of comparing breathiness between male and female speakers is unreliable. The line of argumentation is as follows. The frequency of the first nasal formant (FN1) can be estimated to lie in the region of 200–350 Hz for both male and female speakers (Stevens et al., 1987; Maeda, 1993). At a typical male fundamental frequency of 120 Hz this will be expressed in an enhancement of the second and third harmonics. By contrast, at a typical female fundamental frequency of over 200 Hz it may well be the first harmonic that is more affected by FN1. Comparison of H1 and H2 as a measure of breathiness has to be carried out on opener vowel qualities in order to minimise the effect of the first oral resonance, F1. Lowering of the soft palate is known to increase with the degree of vowel openness. Although the ratio of the opening into the oral cavity and that into the nasal port is crucial for the perception of nasality (Laver, 1980), acoustic correlates of nasality are present whenever the velopharyngeal port is open. It cannot be excluded, then, that any attempt to compare the male and female correlates of breathiness in terms of the first harmonic might be confounded by the sex-specific effects of FN1 on the first two harmonics, in particular a relative strengthening of the first female and the second male harmonic. Establishing that female voices are breathier than male voices using the relative intensities of the first two harmonics might then be a self-fulfilling prophecy.

Data

The data used in this study are drawn from two sources. The first data set was collected as part of a study comparing nasometry and spectrography in a clinical setting (Benkenstein, 2007). Seven male and fifteen female speakers were recorded producing word lists, short texts and the map task using the Kay Elemetrics Nasometer 6200. This method uses two microphones separated by an attenuating plate (20 dB separation) that capture the acoustic output of the nose and the mouth. A nasalance measure is calculated from the relative intensity of the two signals following bandpass filtering of both at ca. 500 Hz. The present study is interested only in a small selection of repeated disyllabic oral and nasal words from this corpus, so only the combined and unfiltered oral and nasal signals will be used.

The second data set used is the publicly available Kiel Corpus of Read Speech (IPDS, 1994), which contains data from 50 German speakers (25 female and 25 male) reading collections of sentences and short texts. Spectral analysis of consonant-vowel-consonant sequences analogous to those from the first dataset was carried out to ensure that the signals from the first dataset had not been adversely affected by the relatively complex recording setup involved with the nasometer together with the subsequent addition of the oral and nasal signals.
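A nasalance score of the kind just described, the intensity of the nasal signal relative to the combined nasal and oral signals, can be sketched in a few lines. This is an illustration of the idea only, not the Nasometer's actual implementation: the function names are mine, and the bandpass filtering around 500 Hz is assumed to have been applied to both inputs already.

```python
import math

def rms(signal):
    """Root-mean-square amplitude of a signal (a sequence of samples)."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def nasalance(nasal, oral):
    """Nasalance as percent nasal amplitude of the combined nasal + oral amplitude."""
    n, o = rms(nasal), rms(oral)
    return 100.0 * n / (n + o)

# Toy example: a strongly nasal segment (large nose output, small mouth output).
sr = 8000
nasal_seg = [0.8 * math.sin(2 * math.pi * 120 * t / sr) for t in range(800)]
oral_seg = [0.2 * math.sin(2 * math.pi * 120 * t / sr) for t in range(800)]
print(round(nasalance(nasal_seg, oral_seg), 1))  # 80.0
```

An oral vowel would show the opposite balance and a correspondingly low score.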

Sex-specific harmonic expression of nasality

The rest of this paper will concern itself with demonstrating sex-specific differences in the harmonic expression of nasality. In particular, I will show how FN1 is responsible for a greater enhancement of the second male and the first (i.e. fundamental) female harmonic. Further, I will show that FN1 is expressed in precisely those contexts where one would want to measure H1-H2. In order to show how systematic the patterns across different speakers are, individual DFT spectra from the same eight (four male, four female) speakers will be used. It is important to emphasise that there is nothing special about this subset of speakers – any of the 22 speakers could have been used to illustrate the same patterns.

We begin with spectral differences found within nasals, i.e. uncontroversial cases of nasality. Figure 1 contains female and male spectra taken at the centre of the alveolar nasal in the word mahne (“warn”). From the strength of the lower harmonics, FN1 for both the male and female speakers is around 200–300 Hz, which is commensurate with sweep-tone measurements of nasals taken from Båvegård et al. (1993). Spectrally, this is expressed as a strengthening primarily of the first female and the second male harmonic.

Figure 1: DFT spectra calculated at midpoint of [n] in mahne for four female (top) and four male speakers.

It is reasonable to assume that the velum will be lowered throughout the production of the word mahne. Figure 2 shows spectra taken from the same word tokens as shown in Figure 1, this time calculated at the midpoint of the long open vowel in the first syllable. While there is a good deal of interindividual variation in the position and amplitude of F1 and F2, due to qualitative differences as well as to the amount of vowel nasalisation present, there is clear spectral evidence of FN1, again to be found in the increased intensity of the second male and the first female harmonic.
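The arithmetic behind this sex-specific pattern is easy to verify. The sketch below (mine, with illustrative values: the 200–300 Hz FN1 region reported above, and toy f0 values) shows which harmonic of a given f0 falls inside the nasal formant region, together with a direct computation of H1-H2 from a signal; note that at a male f0 of 120 Hz the third harmonic (360 Hz) lies just above the 300 Hz ceiling used here.

```python
import math

def harmonics_in_band(f0, lo=200.0, hi=300.0, n=10):
    """Harmonic numbers (1 = fundamental) whose frequencies fall in [lo, hi] Hz."""
    return [k for k in range(1, n + 1) if lo <= k * f0 <= hi]

def spectral_level_db(signal, freq, sample_rate):
    """Level in dB of the component at `freq`, evaluated from the DFT definition."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq * t / sample_rate)
             for t, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * t / sample_rate)
             for t, x in enumerate(signal))
    return 20 * math.log10(2 * math.hypot(re, im) / n + 1e-12)

def h1_h2(signal, f0, sample_rate):
    """H1-H2: level of the first harmonic minus level of the second, in dB."""
    return (spectral_level_db(signal, f0, sample_rate)
            - spectral_level_db(signal, 2 * f0, sample_rate))

# A 200-300 Hz resonance covers H2 at a typical male f0 but H1 at a female f0:
print(harmonics_in_band(120.0))   # [2]  (H2 = 240 Hz)
print(harmonics_in_band(210.0))   # [1]  (H1 = 210 Hz)

# Toy two-harmonic signal in which H1 is twice the linear amplitude of H2:
sr, f0 = 8000, 200
sig = [math.sin(2 * math.pi * f0 * t / sr)
       + 0.5 * math.sin(2 * math.pi * 2 * f0 * t / sr) for t in range(800)]
print(round(h1_h2(sig, f0, sr), 1))  # 6.0
```

The point of the paper follows directly: an FN1 boost of H2 (male) depresses measured H1-H2, while an FN1 boost of H1 (female) inflates it.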

Figure 2: DFT spectra calculated at midpoint of the open vowel in the first syllable of mahne (same speakers).

Neither nasals nor vocalic portions in a phonologically nasal environment would be chosen as suitable contexts for measuring H1-H2. However, they do give us a clear indication of the spectral correlates of both vocalic and consonantal nasality, and in particular we were able to establish systematic sex-specific harmonic differences.

Let us now turn to the type of context where one would want to measure H1-H2. The first syllable of the word Pate (“godfather”) contains a vowel in a phonologically oral context with an open quality which maximises F1 and hence minimises its influence on the lower harmonics. Figure 3 shows DFT spectra calculated at the midpoint of the long open vowel in tokens of the word Pate for the same set of speakers. In contrast to the categorically identical tokens in the nasalised environment in Figure 2, it is somewhat easier to estimate F1 and F2. However, the most striking similarity with the spectra in Figure 2 is evidence of a resonance in the region 200–300 Hz, suggesting that here too, nasality is present, and as before is marked primarily by a strengthened female fundamental and a prominent second male harmonic.

Figure 3: DFT spectra calculated at midpoint of the open vowel in the first syllable of Pate (same speakers).

Discussion

Increased spectral tilt is a reliable acoustic indication of breathy voice. However, as I have attempted to show in this paper, using the strength of the first harmonic without taking into consideration the possibility that nasality may also be acoustically present makes it an inappropriate point of reference when studying sex-specific differences. Indeed, data such as those shown in Figure 3 could have been used to show that in German, too, female speakers are more breathy than males. The female spectral tilt measured using H1-H2 is significantly steeper than that of the males.

So, what of other studies? It is hard to make direct claims about other studies in which individual spectra are not available. However, it is perhaps significant that Klatt and Klatt (1990) first added 10 dB to each H1 value before calculating the difference to H2, ensuring that the H1-H2 difference is almost always positive (Klatt and Klatt, 1990: 829; see also e.g. Trittin and de Santos y Lleó, 1995 for Spanish). The male average calculated at the midpoint of the vowels in reiterant [ʔɑ] and [hɑ] syllables is 6.2 dB. This is not only significantly less than the female average of 11.9 dB, but also indicates that the male H2 in the original spectra is consistently stronger, once the 10 dB are subtracted again.

I have not set out to show in this paper that female speakers are less breathy than male speakers. I am also not claiming that H1-H2 is an unreliable measure of intraindividual differences in breathiness, as it has been used in a linguistic context (Bickley, 1982). However, it seems that the method has been transferred from the study of intraindividual voice quality differences to an interindividual context without considering the implications of other acoustic factors that confound its validity.

References

Benkenstein R. (2007) Vergleich objektiver Verfahren zur Untersuchung der Nasalität im Deutschen. Peter Lang, Frankfurt.
Båvegård M., Fant G., Gauffin J. and Liljencrants J. (1993) Vocal tract sweeptone data and model simulations of vowels, laterals and nasals. STL-QPSR 34, 43–76.
Fischer-Jørgensen E. (1967) Phonetic analysis of breathy (murmured) vowels in Gujarati. Indian Linguistics, 71–139.
Henton C. G. and Bladon R. A. W. (1985) Breathiness in normal female speech: Inefficiency versus desirability. Language and Communication 5, 221–227.
Huffman M. K. (1987) Measures of phonation type in Hmong. J. Acoust. Soc. Amer. 81, 495–504.
IPDS (1994) The Kiel Corpus of Read Speech. Vol. 1, CD-ROM#1. Institut für Phonetik und digitale Sprachverarbeitung, Kiel.
Klatt D. H. and Klatt L. C. (1990) Analysis, synthesis, and perception of voice quality variations among female and male talkers. J. Acoust. Soc. Amer. 87, 820–857.
Ladefoged P. (1983) The linguistic use of different phonation types. In: Bless D. and Abbs J. (eds) Vocal Fold Physiology: Contemporary Research and Clinical Issues, 351–360. San Diego: College Hill.
Ladefoged P. and Antoñanzas Barroso N. (1985) Computer measures of breathy phonation. Working Papers in Phonetics, UCLA 61, 79–86.
Laver J. (1980) The phonetic description of voice quality. Cambridge: Cambridge University Press.
Maeda S. (1993) Acoustics of vowel nasalization and articulatory shifts in French nasalized vowels. In: Huffman M. K. and Krakow R. A. (eds) Nasals, nasalization, and the velum, 147–167. San Diego: Academic Press.
Pandit P. B. (1957) Nasalization, aspiration and murmur in Gujarati. Indian Linguistics 17, 165–172.
Stevens K. N., Fant G. and Hawkins S. (1987) Some acoustical and perceptual correlates of nasal vowels. In: Channon R. and Shockey L. (eds) In Honor of Ilse Lehiste: Ilse Lehiste Pühendusteos, 241–254. Dordrecht: Foris.
Titze I. R. (1989) Physiologic and acoustic differences between male and female voices. J. Acoust. Soc. Amer. 85, 1699–1707.
Trittin P. J. and de Santos y Lleó A. (1995) Voice quality analysis of male and female Spanish speakers. Speech Communication 16, 359–368.


Emotions in speech: an interactional framework for clinical applications

Ani Toivanen¹ & Juhani Toivanen²
¹ University of Oulu
² MediaTeam, University of Oulu & Academy of Finland

Abstract

The expression of emotion in human communicative interaction has been studied extensively in different theoretical paradigms (linguistics, phonetics, psychology). However, there appears to be a lack of research focusing on emotion expression from a genuinely interactional perspective, especially as far as the clinical applications of the research are concerned. In this paper, an interactional, clinically oriented framework for an analysis of emotion in speech is presented.

Introduction

Human social communication rests to a great extent on non-verbal signals, including the (non-lexical) expression of emotion through speech. Emotions play a significant role in social interaction, both displaying and regulating patterns of behavior and maintaining the homeostatic balance in the organism. In everyday communication, certain emotional states, for example, boredom and nervousness, are probably expressed mainly non-verbally since socio-cultural conventions demand that patently negative emotions be concealed (a face-saving strategy in conversation).

Today, the significance of emotions is largely acknowledged across scientific disciplines, and “Descartes’ error” (i.e. the view that emotions are “intruders in the bastion of reason”) is being corrected. The importance of emotions/affect is nowadays understood better, also from the viewpoint of rational decision-making (Damasio, 1994).

Basically, emotion in speech can be broken down to specific vocal cues. These cues can be investigated at the signal level and at the symbolic level. Such perceptual features of speech/voice vs. emotion/affect as “tense”, “lax”, “metallic” and “soft”, etc. can be traced back to a number of continuously variable acoustic/prosodic features of the speech signal (Laver, 1994). These features are f0-related, intensity-related, temporal and spectral features of the signal, including, for example, average f0 range, average RMS intensity, average speech/articulation rate and the proportion of spectral energy below 1,000 Hz. At the symbolic level, the distribution of tone types and focus structure in different syntactic patterns can convey emotional content.

The vocal parameters of emotion may be partially language-independent at the signal level. For example, according to the “universal frequency code” (Ohala, 1983), high pitch universally depicts supplication, uncertainty and defenselessness, while low pitch generally conveys dominance, power and confidence. Similarly, high pitch is common when the speaker is fearful, such an emotion being typical of a “defenseless” state.

An implicit distinction is sometimes made between an emotion/affect and an attitude (or stance in modern terminology), as it is assumed that the expression of attitude is controlled by the cognitive system that underpins fluent speech in a normal communicative situation, while true emotional states are not necessarily subject to such constraints (the speech effects in real emotional situations may be biomechanically determined by reactions not fully controlled by the cognitive system). It is, then, possible that attitude and emotion are expressed in speech through at least partly different prosodic cues (which is the taking-off point for the symbolic/signal dichotomy outlined above). However, this question is not a straightforward one as the theoretical difference between emotion and attitude has not been fully established.

Emotions in speech

By now, a voluminous literature exists on the emotion/prosody interface, and it can be said that the acoustic/prosodic parameters of emotional expression in speech/voice are understood rather thoroughly (Scherer, 2003). The general view is that pitch (fundamental frequency, f0) is perhaps the most important parameter of the vocal expression of emotion (both productively and perceptually); energy (intensity), duration and speaking rate are the other relevant parameters.
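One of the signal-level features mentioned above, the proportion of spectral energy below 1,000 Hz, can be made concrete with a small sketch. This is an illustration of the idea only (mine, not drawn from the paper); a practical system would use an FFT with windowing rather than this direct, slow DFT.

```python
import math

def low_band_energy_ratio(signal, sample_rate, cutoff=1000.0):
    """Proportion of spectral energy below `cutoff` Hz (plain DFT, no window)."""
    n = len(signal)
    low = total = 0.0
    for k in range(1, n // 2):  # positive frequencies, DC excluded
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(signal))
        power = re * re + im * im
        total += power
        if k * sample_rate / n < cutoff:
            low += power
    return low / total

# Toy signal: equal-amplitude components at 500 Hz and 2,000 Hz.
sr = 8000
sig = [math.sin(2 * math.pi * 500 * t / sr) + math.sin(2 * math.pi * 2000 * t / sr)
       for t in range(800)]
print(round(low_band_energy_ratio(sig, sr), 2))  # 0.5
```

A breathy or sad voice with steep spectral tilt would concentrate energy in the low band and push this ratio towards 1.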


Somewhat surprisingly, although the emotion/vocal cue interface in speech has been investigated extensively, there is no widely accepted definition or taxonomy of emotion. Apparently, there is no standard psychological theory of emotion that could decide the issue once and for all: the number of basic (and secondary) emotions is still a moot point. Nevertheless, certain emotions are often considered to represent “basic emotions”: at least fear, anger, happiness, sadness, surprise and disgust are among the basic emotions (Cornelius, 1996).

Research on the vocal expression of emotion has been largely based on scripted non-interactional material; a typical scenario involves a group of actors simulating emotions while reading out an emotionally neutral sentence or text. There are now also databases containing natural emotional speech, but these corpora (necessarily) tend to contain blended/uncertain and mixed emotions rather than “pure” basic emotions (see Scherer, 2003, for a review).

Emotions in speech: clinical investigations

The vocal cues of affect have also been investigated in clinical settings, i.e. with a view to charting the acoustic/prosodic features of certain emotional states (or states of emotional disorders or mental disorders). For example, it is generally assumed that clinical depression manifests itself in speech in a way which is similar to sadness (a general, “non-morbid” emotional state). Thus, a decreased average f0, a decreased f0 minimum, and a flattened f0 range are common, along with decreased intensity and a lower rate of articulation (Scherer, 2000). Voiced high frequency spectral energy generally decreases. Intonationally, sadness/depression may typically be associated with downward directed f0 contours.

Psychiatric interest in prosody has recently shed light on the interrelationship between schizophrenia and (deficient or aberrant) prosody. Several investigators have argued that schizophrenics recognize emotion in speech considerably worse than members of the normal population. Productively, the situation appears quite similar, i.e. schizophrenics cannot convey affect through vocal cues as consistently and effectively as normal subjects (Murphy & Cutting, 1990). In the investigation by Murphy & Cutting (1990), a group of schizophrenics were to express basic emotions (neutral, angry, surprise, sad) while reading out a number of sentences. The raters (normal subjects) had significant difficulty recognizing the simulated emotions (as opposed to portrayals of the same emotions by a group representing the normal population).

In general, it has been found that speech and communication problems typically precede the onset of psychosis; dysarthria and dysprosody appear to be common. Affective flattening is indeed a diagnostic component of psychosis (along with, for example, grossly disorganized speech), and anomalous prosody (e.g. a lack of any observable speech melody) may thus be an essential part of the dysprosody evident in psychosis (Golfarb & Bekker, 2009). Moreover, schizophrenics’ speech seems to contain more pauses and hesitation features than normal speech (Covington et al., 2005). Interestingly, although depressed persons’ speech also typically contains a decreased amount of speech per speech situation, the distribution of pauses appears to be different from schizophrenic speech: schizophrenics typically pause in “wrong” (syntactically/semantically unmotivated) places, while the pausing is more logical and grammatical in depressed speech. Schizophrenic speech thus seems to reflect the erratic semantic structure of what is said (Clemmer, 1980).

It would be fascinating to think that certain prosodic features (or their absence) could be of help for the general practitioner when diagnosing mental disorders. Needless to say, such features could never be the only diagnostic tool but, in the best scenario, they would provide some assistive means for distinguishing between some alternative diagnostic possibilities.

Emotions in speech: an interactional clinical approach

In the following sections we outline a preliminary approach to investigating emotional speech and interaction within a clinical context. What follows is, at this stage, a proposal rather than a definitive research agenda.

Prosodic analysis: 4-Tone EVo

Our first proposal concerns the prosodic annotation procedure for speech material produced in a (clinical) setting inducing emotionally laden speech. As is well known, ToBI labeling (Beckman & Ayers, 1993) is commonly used in the prosodic transcription of (British and American) English (and the system is used increasingly for the prosodic annotation of other languages, too), and good inter-transcriber consistency can be achieved as long as the voice quality analyzed represents normal (modal) phonation. Certain speech situations, however, seem to consistently produce voice qualities different from modal phonation, and the prosodic analysis of such speech data with traditional ToBI labeling may be problematic. Typical examples are breathy, creaky and harsh voice qualities. Pitch analysis algorithms, which are used to produce a record of the fundamental frequency (f0) contour of the utterance to aid the ToBI-style labeling, yield a messy or missing f0 track on non-modal voice segments. Non-modal voice qualities may represent habitual speaking styles or idiosyncrasies of speakers but they are often prosodic characteristics of emotional discourse (sadness, anger, etc.). It is likely, for example, that the speech of a depressed subject is to a significant extent characterized by low f0 targets and creak. Therefore, some special (possibly emotion-specific) speech genres (observed and recorded in clinical settings) might be problematic for traditional ToBI labeling.

A potential modified system would be “4-Tone EVo” – a ToBI-based framework for transcribing the prosody of modal/non-modal voice in (emotional) English. As in the original ToBI system, intonation is transcribed as a sequence of pitch accents and boundary pitch movements (phrase accents and boundary tones). The original ToBI break index tier (with four strengths of boundaries) is also used. The fundamental difference between 4-Tone EVo and the original ToBI is that four main tones (H, L, h, l) are used instead of two (H, L). In 4-Tone EVo, H and L are high and low tones, respectively, as are “h” and “l”, but “h” is a high tone with non-modal phonation and “l” a low tone with non-modal phonation. Basically, “h” is H without a clear pitch representation in the record of the f0 contour, and “l” is a similar variant of L.

Preliminary tests for (emotional) English prosodic annotation have been made using the model, and the results seem promising (Toivanen, 2006). To assess the usefulness of 4-Tone EVo, informal interviews with British exchange students (speakers of southern British English) were used (with permission obtained from the subjects). The speakers described, among other things, their reactions to certain personal dilemmas (the emotional overtone was, predictably, rather low-keyed).

The discussions were recorded in a sound-treated room; the speakers’ speech data was recorded directly to hard disk (44.1 kHz, 16 bit) using a high-quality microphone. The interaction was visually recorded with a high-quality digital video recorder directly facing the speaker. The speech data consisted of 574 orthographic words (82 utterances) produced by three female students (20-27 years old). Five Finnish students of linguistics/phonetics listened to the tapes and watched the video data; the subjects transcribed the data prosodically using 4-Tone EVo. The transcribers had been given a full training course in 4-Tone EVo style labeling. Each subject transcribed the material independently of one another.

As in the evaluation studies of the original ToBI, a pairwise analysis was used to evaluate the consistency of the transcribers: the label of each transcriber was compared against the labels of every other transcriber for the particular aspect of the utterance. The 574 words were transcribed by the five subjects; thus a total of 5740 (574 x 10 pairs of transcribers) transcriber-pair-words were produced. The following consistency rates were obtained: presence of pitch accent (73 %), choice of pitch accent (69 %), presence of phrase accent (82 %), presence of boundary tone (89 %), choice of phrase accent (78 %), choice of boundary tone (85 %), choice of break index (68 %).

The level of consistency achieved for 4-Tone EVo transcription was somewhat lower than that reported for the original ToBI system. However, the differences in the agreement levels seem quite insignificant bearing in mind that 4-Tone EVo uses four tones instead of two!

Gaze direction analysis

Our second proposal concerns the multimodality of a (clinical) situation, e.g. a patient interview, in which (emotional) speech is produced. It seems necessary to record the interactive situation as fully as possible, also visually. In a clinical situation, where the subject’s overall behavior is being (at least indirectly) assessed, it is essential that other modalities than speech be analyzed and annotated. Thus, as far as emotion expression and emotion evaluation in interaction are concerned, the coding of the visually observable behavior of the subject should be a standard procedure. We suggest that, after recording the discourse event with a video recorder, the gaze of the subject is annotated as follows. The gaze of the subject (patient) may

be directed towards the interlocutor (+directed gaze) or shifted away from the interlocutor (-directed gaze). The position of the subject relative to the interlocutor (interviewer, clinician) may be neutral (0-proxemics), closer to the interlocutor (+proxemics) or withdrawn from the interlocutor (-proxemics). Preliminary studies indicate that the inter-transcriber consistency even for the visual annotation is promising (Toivanen, 2006).

Post-analysis: meta-interview

Our third proposal concerns the interactionality and negotiability of a (clinical) situation yielding emotional speech. We suggest that, at some point, the subject is given an opportunity to evaluate and assess his/her emotional (speech) behavior. Therefore, we suggest that the interviewer (the clinician) will watch the video recording together with the subject (the patient) and discuss the events of the situation. The aim of the post-interview is to study whether the subject can accept and/or confirm the evaluations made by the clinician. An essential question would seem to be: are certain (assumed) manifestations of emotion/affect “genuine” emotional effects caused by the underlying mental state (mental disorder) of the subject, or are they effects of the interactional (clinical) situation reflecting the moment-by-moment developing communicative/attitudinal stances between the speakers? That is, to what extent is the speech situation, rather than the underlying mental state or mood of the subject, responsible for the emotional features observable in the situation? We believe that this kind of post-interview would enrich the clinical evaluation of the subject’s behavior. Especially after a treatment, it would be useful to chart the subject’s reactions to his/her recorded behavior in an interview situation: does he/she recognize certain elements of his/her behavior as being due to his/her pre-treatment mental state/disorder?

Conclusion

The outlined approach to a clinical evaluation of an emotional speech situation reflects the Systemic Approach: emotions, along with other aspects of human behavior, serve to achieve intended behavioral and interactional goals in co-operation with the environment. Thus, emotions are always reactions also to the behavioral acts unfolding in the moment-by-moment face-to-face interaction (in real time). In addition, emotions often reflect the underlying long-term affective state of the speaker (possibly including mental disorders in some subjects). An analysis of emotions in a speech situation must take these aspects into account, and a speech analyst doing research on clinical speech material should see and hear beyond “prosodemes” and given emotional labels when looking into the data.

References

Beckman M.E. and Ayers G.M. (1993) Guidelines for ToBI Labeling. Linguistics Department, Ohio State University.
Clemmer E.J. (1980) Psycholinguistic aspects of pauses and temporal patterns in schizophrenic speech. Journal of Psycholinguistic Research 9, 161-185.
Cornelius R.R. (1996) The science of emotion. Research and tradition in the psychology of emotion. New Jersey: Prentice-Hall.
Covington M., He C., Brown C., Naci L., McClain J., Fjorbak B., Semple J. and Brown J. (2005) Schizophrenia and the structure of language: the linguist’s view. Schizophrenia Research 77, 85-98.
Damasio A. (1994) Descartes’ error. New York: Grosset/Putnam.
Golfarb R. and Bekker N. (2009) Noun-verb ambiguity in chronic undifferentiated schizophrenia. Journal of Communication Disorders 42, 74-88.
Laver J. (1994) Principles of phonetics. Cambridge: Cambridge University Press.
Murphy D. and Cutting J. (1990) Prosodic comprehension and expression in schizophrenia. Journal of Neurology, Neurosurgery and Psychiatry 53, 727-730.
Ohala J. (1983) Cross-language use of pitch: an ethological view. Phonetica 40, 1-18.
Scherer K.R. (2000) Vocal communication of emotion. In: Lewis M. and Haviland-Jones J. (eds.) Handbook of Emotions, 220-235. New York: The Guilford Press.
Scherer K.R. (2003) Vocal communication of emotion: a review of research paradigms. Speech Communication 40, 227-256.
Toivanen J. (2006) Evaluation study of “4-Tone EVo”: a multimodal transcription model for emotion in voice in spoken English. In: Toivanen J. and Henrichsen P. (eds.) Current Trends in Research on Spoken Language in the , 139-140. Oulu University & CMOL, Copenhagen Business School: Oulu University Press.


Earwitnesses: The effect of voice differences in identification accuracy and the realism in confidence judgments

Elisabeth Zetterholm¹, Farhan Sarwar² and Carl Martin Allwood³
¹ Centre for Languages and Literature, Lund University
² Department of Psychology, Lund University
³ Department of Psychology, University of Gothenburg

Abstract

Individual characteristic features in voice and speech are important in earwitness identification. A target-absent lineup with six foils was used to analyze the influence of voice and speech features on recognition. The participants’ responses for two voice foils were particularly successful in the sense that they were most often rejected. These voice foils were characterized by the features articulation rate and pitch in relation to the target voice. For the same two foils the participants as a collective also showed marked underconfidence and an especially good ability to separate correct and incorrect identifications by means of their confidence judgments for their answers to the identification question. For the other four foils the participants showed very poor ability to separate correct from incorrect identification answers by means of their confidence judgments.

Introduction

This study focuses on the effect of some voice and speech features on the accuracy and the realism of confidence in earwitnesses’ identifications. More specifically, the study analyzes the influence of characteristic features in the speech and voices of the target speaker and the foils in a target-absent lineup on identification responses and the realism in the confidence that the participants feel for these responses. This theme has obvious relevance for forensic contexts.

Previous research with voice parades has often consisted of speech samples from laboratory speech, which is not spontaneous (Cook & Wilding, 1997; Nolan, 2003). In spontaneous speech in interaction with others, the assumption is that the speakers might use another speaking style compared with laboratory speech. In forensic research spontaneous speech is of more interest since that is a more realistic situation.

Sex, age and dialect seem to be strong and dominant features in earwitness identification (Clopper et al., 2004; Eriksson et al., 2008; Lass et al., 1976; Walden et al., 1978). In these studies, there is nothing about how the witness’ confidence and its realism is influenced by these features.

The study presented in this paper focuses on the influence of differences and similarities in voice and speech between a target voice and six foils in a lineup. A week passed between the original presentation of the target speaker (at, for example, the crime event) and the lineup, which means that there is also a memory effect for the listeners participating in the lineup. Spontaneous speech is used in all recordings, and only male native Swedish speakers.

Confidence and realism in confidence

In this study, a participant’s confidence in his or her response to a specific voice in a voice parade, with respect to whether the voice belongs to the target or not, relates to whether this response is correct or not. Confidence judgments are said to be realistic when they match the correctness (accuracy) of the identification responses. Various aspects of realism can be measured (Yates, 1994). For example, the over-/underconfidence measure indicates whether the participant’s (or group’s) level of confidence matches the level of the accuracy of the responses made. It is more concretely computed as: Over-/underconfidence = (the mean confidence) minus (the mean accuracy).

Another aspect of the realism is measured by the slope measure. This measure concerns a participant’s (or group’s) ability, by means of one’s confidence judgments, to separate, as clearly as possible, correct from incorrect judgments. This measure is computed as: Slope = (the mean confidence for correct judgments) minus (the mean confidence for incorrect judgments). The relation between a participant’s level of confidence for a voice with a
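The two measures defined above can be written out directly. The following is a minimal sketch (mine, not from the paper); the 0–1 confidence scale and the toy numbers are only illustrative.

```python
def over_underconfidence(confidences, correct):
    """(Mean confidence) minus (mean accuracy); positive = overconfidence."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)

def slope(confidences, correct):
    """Mean confidence for correct answers minus mean confidence for
    incorrect answers; a larger slope means better separation."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    return sum(right) / len(right) - sum(wrong) / len(wrong)

# Five identification answers: confidence (0-1) and correctness (1/0).
conf = [0.9, 0.8, 0.6, 0.7, 0.5]
ok = [1, 1, 0, 1, 0]
print(round(over_underconfidence(conf, ok), 2))  # 0.1  (0.7 - 0.6)
print(round(slope(conf, ok), 2))                 # 0.25 (0.8 - 0.55)
```

A negative over-/underconfidence value would indicate underconfidence, the pattern reported above for the two most often rejected foils.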

specific characteristic compared with the participant’s confidence for the other voices in the lineup might indicate how important that characteristic is for the participant’s judgment. In a forensic context the level of realism indicates how useful a participant’s confidence judgments are in relation to targets’ and foils’ voices with specific features.

Method

Participants

100 participants took part in the experiment. The mean age was 27 years. There were 42 males and 58 females. 15 participants had another mother tongue than Swedish, but they all speak and understand Swedish. Four of them arrived in Sweden as teenagers or later. Five participants reported minor impaired hearing and four participants reported a minor speech impediment, one of them stuttering.

Materials

A dialogue between two male speakers was recorded. They played the role of two burglars planning to break into a house. This recording was about 2 minutes long and was used as the familiarization passage, that is, as the original experienced event later witnessed about. The speakers were 27 and 22 years old respectively when recorded and both speak with a Scanian dialect. The 22-year-old speaker is the target and he speaks most of the time in the presented passage.

The lineup in this study used recordings of six male speakers. It was spontaneous speech recorded as a dialogue with another male speaker. This male speaker was the same in all recordings, and that was an advantage since he was able to direct the conversation. They all talked about the same topic, and they all had some kind of relation to it since it was an ordinary situation. As a starting point, to get different points of view on the subject talked about and as a basis for their discussion, they all read an article from a newspaper. It had nothing to do with forensics. The recordings used in the lineups were each about 25 sec long and only a part of the original recordings, that is, the male conversation partner is not audible in the lineups.

All the six male speakers have a Scanian dialect with a characteristic uvular /r/ and a slight diphthongization. They were chosen for this study because (as described in more detail below) they share, or do not share, different features with the target speaker. These are features such as pitch, articulation rate, speaking style, overall tempo and voice quality.

The target speaker has a mean F0 (mean fundamental frequency) of 107 Hz, see Table 1. The speech tempo is high overall and he has an almost forced speaking style with a lot of hesitation sounds and repetition of syllables when he is excited in the familiarization passage. The acoustic analysis confirms a high articulation rate.

Foils 1 and 2 are quite close in their speech and voices in the auditory analysis. Both speak with a slightly creaky voice quality, although foil 2 has a higher mean F0. Their articulation rate is quite high and close to the target speaker. Foils 3 and 6 speak with a slower speech tempo, and a low and a high pitch respectively is audible. In the acoustic analysis it is also obvious that both foil 3 and foil 6 have an articulation rate which is lower than the target speaker’s. Foil 4 is the speaker who is closest to the target speaker concerning pitch and speaking style. He speaks with a forced, almost stuttering voice when he is keen to explain something. His articulation rate is high and he also uses a lot of hesitation sounds and filled pauses. Foil 5 has quite a high articulation rate, similar to the target speaker, but he has a higher pitch and his dialect is not as close to the target speaker’s as that of the other foils.

All the speakers, including the target speaker, are almost the same age, see Table 1. The results of the acoustic measurements of mean fundamental frequency (F0) and standard deviation are also shown in the table. The perceptual auditory impression concerning the pitch is confirmed in the acoustic analysis.

Table 1. Age, F0 mean and standard deviations (SDs) for the target speaker and the six foils.

          Age   F0, mean   SD
 target    22   107 Hz     16 Hz
 foil 1    23   101 Hz     26 Hz
 foil 2    21   124 Hz     28 Hz
 foil 3    23    88 Hz     15 Hz
 foil 4    23   109 Hz     19 Hz
 foil 5    23   126 Hz     21 Hz
 foil 6    25   121 Hz     17 Hz

Procedure

The experimental sessions took place in classes at the University of Lund and in classes with

final year students at a high school in a small town in Northern . The experiment conductor visited the classes twice. The first time, the participants listened to the 2-minute dialogue (the original event). The only instruction they got was to listen to the dialogue; nothing was said about focusing on the voices or on the linguistic content. The second time, one week later, they listened to the six male voices (the foils), in an order randomized for each listener group. Each voice was played twice. The target voice was absent in the test. The participants were told that there were six male voices in the lineup; this was also obvious when looking at the answer sheets. There were six different listener groups for the 100 participants presented in this paper, and the number of participants per group varied between seven and 27 people.

For each voice, the participants had to decide whether the voice was the same as the one who talked most in the dialogue the week before. There were two choices on the answer sheet for each voice: 'I do recognize the voice' or 'I do not recognize the voice'. They were not told whether the target voice was absent or not, nor were they initially told that a voice would be played more than once. There was no training session. The participants were told that they could listen to each voice twice before answering, but this was not recommended.

Directly after judging whether a specific voice was the target or not, the participants also had to estimate their confidence in the answer. The confidence judgment was made on a scale ranging from 0% (explained as 'absolutely sure this voice sample is not the target') via 50% ('guessing') to 100% ('absolutely sure this voice sample is the target').

Results and Discussion
Since this is an ongoing project, the results presented here are only those from the first 100 participants. We expect 300 participants by the end of the study.

Figure 1 shows the number of 'I do recognize the voice' answers, or 'yes' answers. Since the voice lineups were target-absent lineups, a 'yes' answer equals an incorrect answer, that is, an incorrect identification.

When looking at the answers by presentation order, which, as noted above, was randomized for each group, it is obvious that there is a tendency not to choose the first voice. There were no training sessions, and that might have had an influence. Each voice was played twice, which means that the participants had a reasonable amount of time to listen to it. The voices heard in the middle of the test, as well as the last voice played, were chosen most often.

Only 10 participants, 10% of all listeners, had no 'yes' answers at all; that is, these participants had all answers correct. The average confidence level for these 10 participants was 76%, which can be compared with the average of 69% for the remaining 90 participants. No listener had more than 4 'yes' answers, which means that no one answered 'yes' throughout.

Figure 1. Numbers of 'yes' answers (i.e., errors) for voices 1-6 in presentation order.

In Figure 2 the results are shown for each of foils 1-6. Most of the participants selected foil 4 as the target speaker. The same results are shown in Table 2. Foil 4 is closest to the target speaker, and they share more than one feature in voice and speech. The speaking style, with the forced, almost stuttering voice and hesitation sounds, is striking and might remind the listeners of the target speaker planning the burglary. Foils 3 and 6 were chosen least often; these foils differ from the target speaker in pitch, articulation rate and overall speaking style. It is not surprising that there was almost no difference in results between foils 1 and 2: these male speakers have very similar voices and speech. They are also quite close to the target speaker in the auditory analysis (i.e., according to an expert holistic judgment). Foil 5 received many 'yes' answers as well. He resembles the target speaker concerning articulation rate, but not as obviously as foil 4. He also has a higher pitch and a slightly different dialect compared with the target speaker.
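For concreteness, the two realism measures defined in the introduction, and the reversal of the confidence scale applied to 'no' answers, might be computed along the following lines (a sketch with toy data; the function names are ours, not the authors'):

```python
def reverse_for_no(confidence, answered_yes):
    """Reverse the 0-100 confidence scale for 'no' answers, so that e.g.
    'no' with 0% ('sure it is not the target') becomes 100% confidence."""
    return confidence if answered_yes else 100 - confidence

def over_underconfidence(confidence, correct):
    """Over-/underconfidence = mean confidence minus mean accuracy (in %)."""
    n = len(confidence)
    return sum(confidence) / n - 100.0 * sum(correct) / n

def slope(confidence, correct):
    """Slope = mean confidence for correct answers minus mean confidence
    for incorrect answers (assumes at least one of each)."""
    right = [c for c, ok in zip(confidence, correct) if ok]
    wrong = [c for c, ok in zip(confidence, correct) if not ok]
    return sum(right) / len(right) - sum(wrong) / len(wrong)

# Toy data: four responses, two of them correct.
conf = [80, 60, 90, 50]
acc = [True, False, True, False]
print(over_underconfidence(conf, acc))  # 70 - 50 = 20.0
print(slope(conf, acc))                 # 85 - 55 = 30.0
```

A positive over-/underconfidence value indicates overconfidence, a negative one underconfidence, and a large positive slope indicates good separation of correct from incorrect answers.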


Figure 2. Numbers of 'yes' answers (i.e., errors) for each of foils 1-6.

The results indicate that the participants were confused about foil 4, and this is an expected result. The confusion can be explained by both the auditory and the acoustic analyses: the similarities in overall speaking style, articulation rate and pitch between the target speaker and foil 4 are striking.

The results show that the mean accuracy over all the foils was 66.17, with a standard deviation (SD) of 47.35. The results for each of the foils are shown in Table 2. These results were analyzed with a one-way ANOVA, and the outcome shows that the difference in how often the six foils were (correctly) rejected was significant, F(5, 594) = 12.69, p < .001. A post hoc Tukey test further revealed that participants rejected foil 3 (M = 85) and foil 6 (M = 87) significantly more often than the other foils.

We next look at the results for the confidence judgments and their realism. When analyzing the confidence values, we first reversed the confidence scale for all participants who gave a 'no' answer to the identification question. This means that a participant who answered 'no' and then gave '0%' ('absolutely sure that this voice sample is not the target') received a 100% confidence score when the scale was reversed. Similarly, a participant who gave 10 as a confidence value received 90, and a participant who gave 70 received 30 after the transformation; 50 remained 50. In this way the meaning of the confidence ratings could be interpreted in the same way for all participants, irrespective of their answers to the identification question.

The mean confidence for all the foils was 69.93 (SD = 25.00). Table 2 shows that there was no great difference in the level of the confidence judgments for the given identification answers for the respective foils. A one-way ANOVA showed no significant difference between the foils with respect to their confidence (F = .313).

Turning next to the over-/underconfidence measure (O/U-confidence), the average O/U-confidence computed over all the foils was 3.77 (SD = 52.62), that is, a modest level of overconfidence. Table 2 shows the means and SDs for O/U-confidence for each foil. It can be noted that the participants showed quite good realism with respect to their level of over-/underconfidence for item 5, and an especially high level of overconfidence for item 4. Moreover, the participants showed underconfidence for items 3 and 6, that is, the same items that showed the highest level of correctness.

A one-way ANOVA showed that there was a significant difference between the foils with respect to the participants' O/U-confidence, F(5, 394) = 9.47, p < .001. Post hoc Tukey tests revealed that the confidence of the participants who rejected foil 3 and foil 6 showed significantly lower over-/underconfidence than the confidence of the participants for foil 1, foil 2 and foil 4.

Table 2. Means (SDs) for accuracy (correctness), confidence, over-/underconfidence and slope for the six foils.

             Foil 1    Foil 2    Foil 3    Foil 4    Foil 5    Foil 6
Accuracy     57.00     57.00     85.00     48.00     63.00     87.00
             (49.76)   (49.76)   (35.89)   (50.21)   (48.52)   (33.79)
Confidence   69.90     70.00     70.20     70.40     67.49     71.70
             (22.18)   (24.82)   (28.92)   (21.36)   (23.81)   (28.85)
O/U-conf.    12.9%     13.0%     -14.8%    22.4%     4.4%      -15.3%
Slope        -5.07     -2.04     18.27     0.43      1.88      12.57

Table 2 also shows the means for the slope measure (the ability to separate correct from incorrect answers to the identification question by means of the confidence judgments) for the six foils. The overall slope for all data was 2.21; that is, the participants on average showed a very poor ability to separate correct from incorrect answers by means of their confidence judgments. However, it is of great interest to note that the only items for which the participants showed a clear ability to separate correct from incorrect answers were the two foils (3 and


6) for which they showed the highest level of correct answers to the identification question.

Turning next to the realism of the participants' confidence judgments, it is of interest that the participants in this study, overall and in contrast to some other studies on earwitnesses (e.g., Olsson et al., 1998), showed only a modest level of overconfidence. However, a recent review of this area shows that the level of realism found depends on the specific measure used and on various specific features of the voices involved. For example, more familiar voices are associated with better realism in the confidence judgments (Yarmey, 2007). Had a different mixture of voices been used in the present study, the general level of realism in the O/U-confidence measure might have been different.

We next discuss the variation between the foils with respect to their level of overconfidence. It can be discerned from Table 2 that the level of overconfidence follows the respective foil's level of accuracy: when the level of identification accuracy is high, the level of O/U-confidence is lower, or even turns into underconfidence. Thus, a contributing reason for the variation in overconfidence between the foils (in addition to the similarity of the foils' voices to that of the target) may be that the participants expected to be able to identify a foil as the target, and when they could not do so, this resulted in less confidence in their answers. Another speculation is that the participants' general confidence level may have been the most important factor. If these speculations are correct, it is possible that the different speech features of the foils' voices did not contribute very much to the participants' level of confidence or degree of overconfidence; instead, the participants' confidence may have been regulated by other factors, as speculated above.

The results for the slope measure showed that the participants evidenced some ability to separate correct from incorrect answers by means of their confidence judgments for the two foils 3 and 6, that is, the foils for which the participants showed the highest level of accuracy in their identifications. These two foils were also the foils that may be argued to be perceptually (i.e., "experientially") most separate from the target voice. For the other four foils the participants did not evidence any ability at all to separate correct from incorrect identification answers by means of their confidence judgments.

Finally, provided that they hold up in future research, the results suggest that earwitnesses' confidence judgments are not a very reliable cue to the correctness of their identifications, at least not in the situation investigated in this study, namely target-absent lineups where the target voice occurs in dialogues both in the original event and in the foils' voice samples. The results showed that although the average level of overconfidence was fairly modest when computed over all foils, the level of over-/underconfidence varied a lot between the different foils. Still, it should be noted that for the two foils where the participants had the best accuracy level, they also tended to give higher confidence judgments for correct answers than for incorrect answers. However, more research is obviously needed to confirm the reported results.

Summary and conclusions
In this study the original event consisted of a dialogue between two persons and, similarly, the recordings for the foils were dialogues. This is an important feature of this study and something that contributes to increasing the ecological validity in this research area, since previous research has often used monologue readings of text both as the original events and as the recognition stimuli. To what extent this feature of the study influenced the results is not clear, since we did not have a comparison condition in this context.

Characteristic voice features had an impact upon the listeners in this study. The results so far are expected, since the participants seem to have been confused and to have thought that the voice of foil 4 was the target speaker, compared with the other foils in the lineup. Foil 4 was the most similar to the target concerning pitch and speaking style. It might be that the speaking style and the forced voice were a kind of hang-up for the listeners. Even though all male speakers had almost the same dialect and the same age as the target speaker, there were obvious differences in their voices and speech behavior. The listeners were not told what to focus on when listening the first time. As noted above, we do not know whether the use of a dialogue with forensic content had an effect upon the result. The recordings in the lineup were completely different in their content.

In brief, the results of this study suggest that prominent characteristic features in voice and speech are important in an earwitness identification situation. In a forensic situation it would be important to be aware of characteristic features in the voice and speech.

Acknowledgements
This work was supported by a grant from Crafoordska stiftelsen, Lund.

References
Clopper, C.G. and Pisoni, D.B. (2004) Effects of talker variability on perceptual learning of dialects. Language and Speech, 47 (3), 207-239.
Cook, S. and Wilding, J. (1997) Earwitness testimony: Never mind the variety, hear the length. Applied Cognitive Psychology, 11, 95-111.
Eriksson, J.E., Schaeffler, F., Sjöström, M., Sullivan, K.P.H. and Zetterholm, E. (submitted 2008) On the perceptual dominance of dialect. Perception & Psychophysics.
Lass, N.J., Hughes, K.R., Bowyer, M.D., Waters, L.T. and Bourne, V.T. (1976) Speaker sex identification from voiced, whispered, and filtered isolated vowels. Journal of the Acoustical Society of America, 59 (3), 675-678.
Nolan, F. (2003) A recent voice parade. Forensic Linguistics, 10, 277-291.
Olsson, N., Juslin, P. and Winman, A. (1998) Realism of confidence in earwitness versus eyewitness identification. Journal of Experimental Psychology: Applied, 4, 101-118.
Walden, B.E., Montgomery, A.A., Gibeily, G.J., Prosek, R.A. and Schwartz, D.M. (1978) Correlates of psychological dimensions in talker similarity. Journal of Speech and Hearing Research, 21, 265-275.
Yarmey, A.D. (2007) The psychology of speaker identification and earwitness memory. In R.C. Lindsay, D.F. Ross, J.D. Read and M.P. Toglia (Eds.), Handbook of eyewitness psychology, Volume 2: Memory for people (pp. 101-136). Mahwah, NJ: Lawrence Erlbaum Associates.
Yates, J.F. (1994) Subjective probability accuracy analysis. In G. Wright and P. Ayton (Eds.), Subjective probability (pp. 381-410). New York: John Wiley & Sons.


Perception of voice similarity and the results of a voice line-up
Jonas Lindh
Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Sweden

Abstract
The perception of voice similarity is not the same as picking out a speaker in a line-up. This study investigates the similarities and differences between a perception experiment in which people judged voice similarity and the results of a voice line-up experiment. The results give us an idea of what listeners do when they try to identify a voice and of which parameters play an important role. The results show that there are similarities between the voice similarity judgments and the line-up results. They differ, however, in several respects when we look at speaking parameters. This finding has implications for how to consider the similarities between foils and suspects when setting up a line-up, as well as for how we perceive voice similarities in general.

Introduction
Aural/acoustic methods are common in forensic speaker comparison cases. It is possible to divide speaker comparison into two different branches depending on the listener. The first is the expert witness's aural examination of speech samples. In this case the expert tries to quantify and assess the similarities and dissimilarities between speakers based on linguistic, phonological and phonetic features, and finally to evaluate the distinctiveness of those features (French & Harrison, 2007). The second branch is the speaker comparison made by naive listeners, for example victims of a crime who heard a voice/speaker but could not see the perpetrator. In both cases, some kind of voice quality is used as a parameter. However, it has not been thoroughly investigated whether this parameter can be separated from so-called articulation or speaking parameters, such as articulation rate (AR) or pausing, which have been shown to be useful when comparing speakers (Künzel, 1997). To study this more closely, a web-based perception experiment was set up in which listeners were asked to judge voice similarity in a pairwise comparison test. The speech was played backwards to remove speaking characteristics and force listeners to concentrate on voice quality similarity.

The speech material used in the present study was originally produced for an earwitness study in which 7-speaker line-ups were used to test voice recognition reliability in earwitnesses. The speakers in that study were male and matched for general speaker characteristics such as sex, age and dialect. The results from the earwitness study and the judgments of voice similarity were then compared. It was found, for example, that the occurrence of false acceptances (FA) was not randomly distributed but systematically biased towards certain speakers. Such results raise obvious questions: Why were these particular speakers chosen? Are their speaker characteristics particularly similar to those of the intended target? Would an aural voice comparison test single out the same speakers? The results have implications for the existence of speech characteristics still present in backward speech. It can also be shown that speakers who are judged to be wolves (a term from speaker verification, denoting a voice that is rather similar to many models) can be picked more easily in a line-up if they also possess speech characteristics that are similar to the target's.

Method
To be able to collect sufficiently large amounts of data, two different web tests were designed. One of the web-based forms was only released to people who could ensure a controlled environment in which the test was to take place; such a controlled environment could, for example, be a student lab or equivalent. A second form was created and published to as many people as possible throughout the web, a so-called uncontrolled test group. The two groups' results were treated separately and later correlated to see whether the data turned out to be similar enough for the results to be pooled.

The ear witness study
To gain a better understanding of earwitness performance, a study was designed in which children aged 7-8 and 11-12 and adults served as informants. A total of 240 participants were

equally distributed between the three age groups and exposed to an unfamiliar voice. Each participant was asked to come along with an experimenter to a clothes shop, where they stopped outside a fitting cubicle. Behind the curtain they could hear an unfamiliar voice planning of a crime (PoC). The recording they heard was played on a pair of high-quality loudspeakers and was approximately 45 seconds long. After two weeks, the witnesses were asked to identify the target voice in a line-up (7 voices). Half of the witnesses were exposed to a target-present line-up (TP), and the other half to a target-absent line-up (TA). The line-up was also played to the witness on loudspeakers from a computer, with the voices presented on a PowerPoint slide. First an excerpt of about 25 seconds from a recording of a city walk was played; after that, a shorter part of the excerpt of about 12-15 seconds was used. The witnesses first had to say whether they thought the voice was present in the line-up and, if so, to point the voice out. Secondly they were asked about their confidence and about what they remembered of what the voice in the cubicle had said. This was done to see whether it was possible to predict identification accuracy by analyzing memory for content (Öhman, Eriksson & Granhag, 2009).

To be able to quantify speaking parameters, pausing and articulation rate were measured. Articulation rate is here defined as produced syllables per unit time, excluding pauses. Pauses are defined as clearly measurable silences longer than 150 ms.

The test material
The recordings consisted of spontaneous speech elicited by asking the speakers to describe a walk through the centre of Gothenburg, based on a series of photos presented to them. The 9 speakers (7, plus 1 in the TA line-up, plus the target) were all selected as a very homogeneous group, with the same dialectal background (Gothenburg area) and age group (between 28 and 35). The speakers were selected from a larger set of 24 speakers on the basis of a speaker similarity perception test using two groups of undergraduate students as subjects. The subjects had to make similarity judgments in a pairwise comparison test where the first item was always the target speaker intended for the line-up test. The subjects were also asked to estimate the ages of the speakers. The recordings used for these tests were 16 kHz/16 bit wave files.

The web based listening tests
The listening tests had to be interactive, with the results from the geographically dispersed listeners gathered automatically. Google Docs provides a form for creating web-based question sheets that collect the answers in a spreadsheet as they are submitted, and that was the form of data collection we chose for the perception part of the study. However, if one cannot provide a controlled environment, the results cannot be trusted completely. As an answer to this problem, two identical web-based listening tests were created: one intended for a guaranteed controlled environment and one openly published test, here referred to as uncontrolled. The two test groups are treated separately and correlated before being merged in a final analysis.

In the perception test for the present study, 9 voices were presented pairwise on a web page, and listeners were asked to judge the similarity on a scale from 1 to 5, where 1 was said to represent "Extremely similar or same" and 5 "Not very similar". Since we wanted to minimize the influence of any particular language or speaking style, the speech samples were played backwards. The listeners were also asked to submit information about their age, first language and dialectal background (if Swedish was their first language). There was also a space where they could leave comments after completing the test, and some participants used this opportunity. The speech samples used in the perception test were the first halves of the 25-second samples used in the earwitness line-ups, except for the pairs where both samples were from the same speaker; in these cases the other item was the second half of the 25-second sample. Each test consisted of 45 comparisons and took approximately 25 minutes to complete. 32 listeners (7 male, 25 female) performed the controlled listening test and 20 (6 male, 14 female) the uncontrolled test.

Results and Discussion
The results will be presented separately in the first two subsections, and then the comparison is made, with a short discussion, in the last section.

The overall results of the ear witness study
The original purpose of the study was to compare performance between the age groups.
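The speaking parameters defined in the Method section, articulation rate as produced syllables with pauses excluded and pauses as silences longer than 150 ms, could be derived from labeled time intervals roughly as follows (a sketch; the interval format and function name are our assumptions, not part of the study):

```python
def speaking_parameters(intervals, min_pause=0.150):
    """Compute articulation rate and pausing measures from intervals of the
    form (start_s, end_s, syllable_count). A syllable_count of None marks a
    silence; only silences longer than min_pause (150 ms) count as pauses."""
    speech_dur = 0.0
    syllables = 0
    pause_dur = 0.0
    n_pauses = 0
    for start, end, syls in intervals:
        dur = end - start
        if syls is None:
            if dur > min_pause:
                pause_dur += dur
                n_pauses += 1
        else:
            speech_dur += dur
            syllables += syls
    total = speech_dur + pause_dur
    minutes = total / 60.0
    return {
        "articulation_rate": syllables / speech_dur,   # syllables/s, pauses excluded
        "pause_dur_per_min": pause_dur / minutes,      # seconds of pause per minute
        "pauses_per_min": n_pauses / minutes,
        "pause_percent": 100.0 * pause_dur / total,
    }

# Toy example: 4 s of speech with 20 syllables and one 1 s pause.
params = speaking_parameters([(0.0, 2.0, 10), (2.0, 3.0, None), (3.0, 5.0, 10)])
```

The returned keys mirror the measures plotted below (articulation rate, pause duration per minute, number of pauses per minute, and pause percentage); how sub-150 ms silences are folded into the totals is one possible design choice among several.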


Here we are only interested in the general tendencies of the false acceptances (the picking of the wrong voice) and of the true, i.e. correct, identifications. In Figure 1 we present the false acceptances given by the different age groups and by all together.

Figure 1. False acceptances for each figurant speaker in the three age groups and the sum (all), for both target-absent (TA) and target-present (TP) line-ups.

In Figure 1 it is very clear that false acceptance is biased toward certain speakers, such as speaker CF, followed by MM and JL. It is noticeable that the number of correct acceptances in TP was 27, which can explain the decrease in FA for MM and JL; the degree of FA for speaker CF, however, is even higher in TP (28).

Figure 2. Articulation rate (produced syllables per second) for the speakers in the line-up.

In Figure 2 we can see that the target (PoC) was produced with a fast articulation rate. Several speakers follow with rather average values around 5 syllables per second. The speaker with the highest AR compared to PoC is CF. In Figure 3 we take a closer look at pausing; pauses tend to increase in duration with high articulation rate (Goldman-Eisler, 1961).

Figure 3. Pausing (pause duration per minute, number of pauses per minute, and pause percentage of total utterance duration) for the speakers in the line-up.

The pausing measurements show a bias towards speaker CF, which might explain some of the false acceptances.

The perception test results
Both listening tests separately (controlled and uncontrolled) show significant inter-rater agreement (Cronbach's alpha = 0.98 for the controlled and 0.959 for the uncontrolled test). When both datasets are pooled, the inter-rater agreement remains at the same high level (alpha = 0.975), indicating that listeners in both subgroups judged the voices in the same way. This justifies using the pooled data from

188 Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

both groups (52 subjects altogether) for the further analysis of the perception test results.

Figure 4. Mean voice similarity judgment by listeners comparing each speaker against target PoC. The closer to 1, the more similar the voice according to the judgments.

The voice similarity judgments indicate the same as the line-up regarding speaker CF, who is judged to be closest to the target, followed by JL and MM. It is also noticeable that those speakers are among the speakers who get the highest mean overall similarity judgments compared to all the other speakers.

Table 1. Speaker ranks based on mean similarity judgment for both listener groups pooled.

Speaker    JÅ   JL   KG   MM   MS   PoC  NS   TN   CF
JÅ         1    4    5    3    6    8    9    7    2
JL         3    1    8    5    7    4    2    9    6
KG         5    9    1    2    3    7    8    6    4
MM         4    5    2    1    3    8    9    7    6
MS         7    8    6    5    2    9    3    1    4
PoC        5    3    6    4    9    1    7    8    2
NS         6    2    8    5    3    7    1    9    4
TN         6    9    5    4    1    7    8    2    3
CF         2    9    6    7    3    5    8    4    1
Mean rank  4.3  5.6  5.2  4.0  4.1  6.2  6.1  5.9  3.6
Std dev    2.0  3.2  2.4  1.8  2.6  2.5  3.2  2.9  1.7

The mean rank in Table 1 indicates how each speaker is ranked compared to the other voices in the similarity judgments.

Comparison of results and discussion

The purpose of the study was to compare the general results from the line-up study with the results of the perception experiment presented here. A comparison between the results shows that CF is generally judged as most similar to the target speaker (even more so than the actual target in the TP line-up). We have also found that the result can partly be explained by the similarity in speaking tempo parameters. However, since the result is also confirmed in the perception experiment, it must mean either that the tempo parameters are still evident in backward speech or that there is something else that makes listeners choose certain speakers. Perhaps the indication that these speakers are generally highly ranked, or "wolves" (a term from speaker verification, see Melin, 2006), in combination with similar aspects of tempo, makes the judgments biased. More research to isolate voice quality is needed to answer these questions in more detail.

Acknowledgements

Many thanks to the participants in the listening test. My deepest gratitude to the AllEars project and to Lisa Öhman and Anders Eriksson for providing me with data from the line-up experiments before publication.

References

French, P. and Harrison, P. (2007) Position Statement concerning use of impressionistic likelihood terms in forensic speaker comparison cases, with a foreword by Peter French & Philip Harrison. International Journal of Speech Language and the Law [Online] 14:1.
Goldman-Eisler, F. (1961) The significance of changes in the rate of articulation. Language and Speech 4, 171–174.
Künzel, H. (1997) Some general phonetic and forensic aspects of speaking tempo. Forensic Linguistics 4, 48–83.
Öhman, L., Eriksson, A. and Granhag, P-A. (2009) Earwitness identification accuracy in children vs. adults. Unpublished abstract.
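The mean ranks and standard deviations reported in Table 1 can be recomputed directly from the rank matrix. A short sketch (NumPy assumed; speaker order as in the table):

```python
import numpy as np

# Rank matrix from Table 1; rows and columns both follow the order
# JÅ, JL, KG, MM, MS, PoC, NS, TN, CF.
ranks = np.array([
    [1, 4, 5, 3, 6, 8, 9, 7, 2],
    [3, 1, 8, 5, 7, 4, 2, 9, 6],
    [5, 9, 1, 2, 3, 7, 8, 6, 4],
    [4, 5, 2, 1, 3, 8, 9, 7, 6],
    [7, 8, 6, 5, 2, 9, 3, 1, 4],
    [5, 3, 6, 4, 9, 1, 7, 8, 2],
    [6, 2, 8, 5, 3, 7, 1, 9, 4],
    [6, 9, 5, 4, 1, 7, 8, 2, 3],
    [2, 9, 6, 7, 3, 5, 8, 4, 1],
])
mean_rank = ranks.mean(axis=0)        # column-wise mean rank per speaker
std_dev = ranks.std(axis=0, ddof=1)   # sample standard deviation
```

Rounded to one decimal, the column means reproduce the Mean rank row (e.g. 3.6 for CF) and the sample standard deviations reproduce the Std dev row (e.g. 1.7 for CF).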


Project presentation: Spontal – multimodal database of spontaneous speech in dialog

Jonas Beskow, Jens Edlund, Kjell Elenius, Kahl Hellmer, David House & Sofia Strömbergsson
KTH Speech Music & Hearing, Stockholm, Sweden

Abstract

We describe the ongoing Swedish speech database project Spontal: Multimodal database of spontaneous speech in dialog (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gestures involving the face and body are important in everyday, face-to-face communicative interaction, and that there is a great need for data with which we can measure these more precisely.

Introduction

Spontal: Multimodal database of spontaneous speech in dialog is an ongoing Swedish speech database project which began in 2007 and will be concluded in 2010. It is funded by the Swedish Research Council, KFI – Grant for large databases (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gestures involving the face and body are key components in everyday face-to-face interaction – arguably the context in which speech was born – and focuses in particular on spontaneous conversation.

Although we have a growing understanding of the vocal and visual aspects of conversation, we are lacking in data with which we can make more precise measurements. There is currently very little data with which we can measure with precision multimodal aspects such as the timing relationships between vocal signals and facial and body gestures, as well as acoustic properties that are specific to conversation, as opposed to read speech or monologue, such as the acoustics involved in floor negotiation, feedback and grounding, and resolution of misunderstandings.

The goal of the Spontal project is to address this situation through the creation of a Swedish multimodal spontaneous speech database rich enough to capture important variations among speakers and speaking styles, meeting the demands of current research on conversational speech.

Scope

60 hours of dialog consisting of 120 half-hour sessions will be recorded in the project. Each session consists of three consecutive 10-minute blocks. The subjects are all native speakers of Swedish and balanced (1) for gender, (2) as to whether the interlocutors are of opposing gender, and (3) as to whether they know each other or not. This balance will result in 15 dialogs of each configuration: 15x2x2x2 for a total of 120 dialogs. Currently (April, 2009), about 33% of the database has been recorded. The remainder is scheduled for recording during 2010. All subjects permit, in writing, (1) that the recordings are used for scientific analysis, (2) that the analyses are published in scientific writings, and (3) that the recordings can be replayed in front of audiences at scientific conferences and suchlike.

In the base configuration, the recordings are comprised of high-quality audio and high-definition video, with about 5% of the recordings also making use of a motion capture system using infra-red cameras and reflective markers for recording facial gestures in 3D. In addition, the motion capture system is used on virtually all recordings to capture body and head gestures, although resources to treat and annotate this data have yet to be allocated.

Instruction and scenarios

Subjects are told that they are allowed to talk about absolutely anything they want at any point in the session, including meta-comments on the recording environment and suchlike, with the intention to relieve subjects from feeling forced to behave in any particular manner.

The recordings are formally divided into three 10-minute blocks, although the conversation is allowed to continue seamlessly over the blocks, with the exception that subjects are informed, briefly, about the time after each 10-minute block. After 20 minutes, they are also asked to open a wooden box which has been placed on the floor beneath them prior to the recording. The box contains objects whose identity or function is not immediately obvious. The subjects may then hold, examine and


discuss the objects taken from the box, but they may also choose to continue whatever discussion they were engaged in or talk about something entirely different.

Technical specifications

The audio is recorded on four channels using a matched pair of Bruel & Kjaer 4003 omnidirectional microphones for high audio quality, and two Beyerdynamic Opus 54 cardioid headset microphones to enable subject separation for transcription and dialog analysis. The two omnidirectional Bruel & Kjaer microphones are placed approximately 1 meter from each subject. Two JVC HD Everio GZ-HD7 high-definition video cameras are placed to obtain a good view of each subject, from a height that is approximately the same as the heads of both of the participating subjects. They are placed about 1.5 meters behind the subjects to minimize interference. The cameras record in mpeg-2 encoded full HD with a resolution of 1920x1080i and a bitrate of 26.6 Mbps. To ensure audio, video and motion-capture synchronization during post-processing, a record player is included in the setup. The turntable is placed between the subjects and a bit to the side, in full view of the motion capture cameras. A marker placed near the edge of the platter rotates at a constant speed (33 rpm) and enables high-accuracy synchronization of the frame rate in post-processing. The recording setup is illustrated in Figure 1.

Figure 1. Setup of the recording equipment used to create the Spontal database.

Figure 2 shows a frame from each of the two video cameras aligned next to each other, so that the two dialog partners are both visible. The opposing video camera can be seen in the centre of the image, and a number of tripods holding the motion capture cameras are visible. The synchronization turntable is visible in the left part of the left pane and the right part of the right pane. The table between the subjects is covered in textiles, a necessary precaution as the motion capture system is sensitive to reflecting surfaces. For the same reason, subjects are asked to remove any jewelry, and other shiny objects are masked with masking tape.

Figure 3 shows a single frame from the video recording and the corresponding motion-capture data from a Spontal dialog. As in Figure 2, we see the reflective markers for the motion-capture system on the hands, arms, shoulders, trunk and head of the subject. Figure 4 is a 3D data plot of the motion capture data from the same frame, with connecting lines between the markers on the subject's body.
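The turntable scheme gives a simple shared time base: at a constant rotational speed, the marker's angular displacement between two observations is proportional to elapsed time. A sketch of the conversion (function names are illustrative, not from the project):

```python
def seconds_from_rotation(delta_degrees: float, rpm: float = 33.0) -> float:
    """Convert the observed rotation of the platter marker into elapsed
    time. At rpm revolutions per minute the platter turns rpm * 6 degrees
    per second (33 rpm -> 198 degrees/s)."""
    return delta_degrees / (rpm * 6.0)

def stream_offset(angle_video: float, angle_mocap: float,
                  rpm: float = 33.0) -> float:
    """Offset between two streams that nominally observed the marker at
    the same instant, from the marker angles (in degrees) each stream saw."""
    delta = (angle_video - angle_mocap) % 360.0
    return seconds_from_rotation(delta, rpm)
```

At 33 rpm, one frame of 25 fps video corresponds to 7.92 degrees of rotation, so a marker angle measured to within a degree pins the capture instant down to about 5 ms, well under a frame period.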


Figure 2. Example showing one frame from each of the two video cameras, taken from the Spontal database.

Annotation

The Spontal database is currently being transcribed orthographically. Basic gesture and dialog-level annotation will also be added (e.g. turn-taking and feedback). Additionally, automatic annotation and validation methods are being developed and tested within the project. The transcription activities are being performed in parallel with the recording phase of the project, with special annotation tools written for the project facilitating this process.

Specifically, the project aims at annotation that is efficient, coherent, and to the largest extent possible objective. To achieve this, automatic methods are used wherever possible. The orthographic transcription, for example, follows a strict method: (1) automatic speech/non-speech segmentation, (2) orthographic transcription of the resulting speech segments, (3) validation by a second transcriber, (4) automatic phone segmentation based on the orthographic transcriptions. Pronunciation variability is not annotated by the transcribers, but is left for the automatic segmentation stage (4), which uses a pronunciation lexicon capturing most standard variations.

Figure 3. A single frame from one of the video cameras.

Figure 4. 3D representation of the motion capture data corresponding to the video frame shown in Figure 3.

Concluding remarks

A number of important contemporary trends in speech research raise demands for large speech corpora. A shining example is the study of everyday spoken language in dialog, which has many characteristics that differ from written language or scripted speech. Detailed analysis of spontaneous speech can also be fruitful for phonetic studies of prosody as well as of reduced and hypoarticulated speech. The Spontal database will make it possible to test hypotheses on the visual and verbal features employed in communicative behavior covering a variety of functions. To increase our understanding of traditional prosodic functions such as prominence lending and grouping and phrasing, the database will enable researchers to study visual and acoustic interaction over several subjects and dialog partners. Moreover, dialog functions such as the signaling of turn-taking, feedback, attitudes and emotion can be studied from a multimodal dialog perspective.

In addition to basic research, one important application area of the database is to gain knowledge for use in creating an animated talking agent (talking head) capable of displaying realistic communicative behavior, with the long-term aim of using such an agent in conversational spoken language systems. The project is planned to extend through 2010, at which time the recordings and basic orthographic transcription will be completed, after which the database will be made freely available for research purposes.
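The four-step transcription method described under Annotation can be sketched as a pipeline. This is an illustrative toy only: the detector, transcriber and lexicon below are stand-ins, not the project's actual tools.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float                 # seconds
    end: float
    text: str = ""
    validated: bool = False
    phones: list = field(default_factory=list)

def segment_speech(frame_probs, frame_s=0.01, threshold=0.5):
    """Step 1: automatic speech/non-speech segmentation. `frame_probs` is
    a per-frame speech probability from some detector (a stand-in here)."""
    segments, start = [], None
    for i, p in enumerate(list(frame_probs) + [0.0]):  # sentinel closes a final run
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append(Segment(start * frame_s, i * frame_s))
            start = None
    return segments

def transcribe(segments, texts):
    """Step 2: orthographic transcription of the resulting speech segments."""
    for seg, text in zip(segments, texts):
        seg.text = text
    return segments

def validate(segments):
    """Step 3: validation by a second transcriber (trivially simulated)."""
    for seg in segments:
        seg.validated = bool(seg.text.strip())
    return segments

def phone_segment(segments, lexicon):
    """Step 4: automatic phone segmentation from the orthography, via a
    pronunciation lexicon; unknown words get a placeholder phone."""
    for seg in segments:
        seg.phones = [p for w in seg.text.split() for p in lexicon.get(w, ["?"])]
    return segments

# Toy run: two speech runs in a short clip, transcribed and phone-segmented.
probs = [0.1, 0.2, 0.9, 0.8, 0.9, 0.1, 0.0, 0.9, 0.7, 0.2]
segs = phone_segment(validate(transcribe(segment_speech(probs),
                                         ["hej", "ett två"])),
                     {"hej": ["h", "e", "j"], "ett": ["e", "t"],
                      "två": ["t", "v", "o"]})
```

The point of the strict ordering is that only step 2 (and the check in step 3) needs a human; steps 1 and 4 are automatic, which is what keeps the annotation efficient and as objective as possible.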

Acknowledgements

The work presented here is funded by the Swedish Research Council, KFI – Grant for large databases (VR 2006-7482). It is performed at KTH Speech Music and Hearing (TMH) and the Centre for Speech Technology (CTT) within the School of Computer Science and Communication.


A first step towards a text-independent speaker verification Praat plug-in using Mistral/Alize tools

Jonas Lindh
Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg

Abstract

Text-independent speaker verification can be a useful tool as a substitute for passwords or as an increased security check. The tool can also be used in forensic phonetic casework. A text-independent speaker verification Praat plug-in was created using tools from the open source Mistral/Alize toolkit. A gatekeeper setup was created for 13 department employees and tested for verification. 2 different universal background models were trained, and the same test set was evaluated against each. The results are promising and give implications for the usefulness of such a tool in research on voice quality.

Introduction

Automatic methods are increasingly being used in forensic phonetic casework, but most often in combination with aural/acoustic methods. It is therefore important to get a better understanding of how the two kinds of systems compare. For several studies on voice quality judgement, but also as a tool for visualisation and demonstration, a text-independent speaker comparison system was implemented as a plugin to the phonetic analysis program Praat (Boersma & Weenink, 2009). The purpose of this study was to make as easy-to-use an implementation as possible, so that people with phonetic knowledge could use the system to demonstrate the technique or perform research. A state-of-the-art technique, the so-called GMM-UBM approach (Reynolds, 2000), was applied with tools from the open source toolkit Mistral (formerly Alize) (Bonastre et al., 2005; 2008). This paper describes the surface of the implementation and the tools used, without any deeper analysis, to give an overview. A small test was then made on high-quality recordings to see what difference the availability of training data for the universal background model makes. The results show that for demonstration purposes a very simple world model including only the speakers you have trained as targets is sufficient. However, for research purposes a larger world model should be trained to be able to show more correct scores.

Mistral (Alize), an open source toolkit for building a text-independent speaker comparison system

The NIST speaker recognition evaluation campaign started already in 1996 with the purpose of driving the technology of text-independent speaker recognition forward, as well as testing the performance of the state-of-the-art approach and discovering the most promising algorithms and new technological advances (from http://www.nist.gov/speech/tests/sre/ Jan 12, 2009). The aim is to have an evaluation at least every second year, and some tools are provided to facilitate the presentation of the results and the handling of the data (Martin and Przybocki, 1999). A few labs have been evaluating their developments since the very start, with increasing performance over the years. These labs have generally always performed best in the evaluation. However, an evaluation is a rather tedious task for a single lab, and the question of some kind of coordination came up. This coordination could be just to share information, system scores or other material in order to improve the results. On the other hand, the more natural choice for being able to share and interpret results is open source. On this basis Mistral, and more specifically the ALIZE SpkDet packages, were developed and released as open source software under a so-called LGPL licence (Bonastre et al., 2005; 2008).

Method

A standard setup was made for placing data within the plugin. On top of the tree structure, several scripts controlling executable binaries, configuration files, data etc. were created, with basic button interfaces that show up in a given Praat configuration. The scripts were made according to the different necessary steps that have to be covered to create a test environment.
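The GMM-UBM recipe behind these scripts can be illustrated independently of the Mistral binaries. A minimal sketch using scikit-learn (an assumption for illustration only — the plug-in itself calls the ALIZE tools, and uses 512 mixtures rather than the 2 used here for speed): a UBM is trained on pooled background data, a target model is derived by MAP adaptation of the UBM means, and a test recording is scored by the log-likelihood ratio between the two.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_components=2, seed=0):
    """Train the universal background model on pooled background data."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=seed).fit(pooled_features)

def map_adapt_means(ubm, target_features, relevance=16.0):
    """Derive a target model by MAP adaptation of the UBM means only
    (weights and covariances stay shared with the UBM)."""
    resp = ubm.predict_proba(target_features)            # (frames, mixtures)
    n_k = resp.sum(axis=0)                               # soft frame counts
    f_k = resp.T @ target_features / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]           # adaptation factors
    model = GaussianMixture(n_components=ubm.n_components,
                            covariance_type="diag")
    model.weights_ = ubm.weights_
    model.covariances_ = ubm.covariances_
    model.precisions_cholesky_ = ubm.precisions_cholesky_
    model.means_ = alpha * f_k + (1.0 - alpha) * ubm.means_
    return model

def llr(test_features, target_model, ubm):
    """Average per-frame log-likelihood ratio; above 0 favours the target."""
    return target_model.score(test_features) - ubm.score(test_features)
```

With features from the claimed speaker the ratio comes out positive; with features from a background-like speaker it stays near zero or below, which is the pattern of raw scores the plug-in reports.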


Steps for a fully functional text-independent system in Praat

First of all, some kind of parameterization has to be made of the recordings at hand. In this first implementation SPro (Guillaume, 2004) was chosen for parameter extraction, as support for it was already implemented in the Mistral programs. There are 2 ways to extract parameters: either you choose a folder with audio files (preferably wave format; however, other formats are supported) or you record a sound in Praat directly. If the recording is supposed to be a user of the system (or a target), a scroll list with a first option "New User" can be chosen. This function will control the sampling frequency and resample if the sample frequency is other than 16 kHz (currently the default), and perform a frame selection by excluding silent frames longer than 100 ms before 19 MFCCs are extracted and stored in a parameter file. The parameters are then automatically energy normalized before storage. The name of the user is then also stored in a list of users for the system. If you want to add more users you go through the same procedure again. When you are done you can choose the next option in the scroll list, called "Train Users". This procedure will control the list of users and then normalize and train the users using a background model (UBM) trained using the Maximum Likelihood Criterion. The individual models are trained to maximise the a posteriori probability that the claimed identity is the true identity given the data (MAP training). This procedure requires that you already have a trained UBM. However, if you do not, you can choose the function "Train World", which will take your list of users (if you have not added others to be included in the world model) and train one with the default of 512 Gaussian mixture models (GMM). The last option on the scroll list is "Recognise User", which will test the recording against all the models trained by the system. A list of raw (not normalised) log likelihood ratio scores gives you feedback on how well the recording fitted each of the models. In a commercial or fully-fledged verification system you would also have to test and decide on a threshold; as that is not the main purpose here, we will only speculate on a possible threshold for this demo system.

Preliminary UBM performance test

To get a first impression of how well the implementation worked, a small pilot study was made using 2 different world models. For this purpose 13 colleagues (4 females and 9 males) at the department of linguistics were recorded using a headset microphone. To enroll them as users they had to read a short passage from a well-known text (a comic about a boy ending up with his head in the mud). The recordings from the reading task were between 25-30 seconds. 3 of the speakers were later recorded to test the system using the same kind of headset. 1 male and 1 female speaker were then also recorded to be used as impostors. For the test utterances the subjects were told to produce an utterance close to "Hej, jag heter X, jag skulle vilja komma in, ett två tre fyra fem." ("Hi, I am X, I would like to enter, one two three four five."). The tests were run twice. In the first test only the enrolled speakers were used for the UBM. In the second the UBM was trained on excerpts from interviews with 109 young male speakers from the Swedia dialect database (Eriksson, 2004). The enrolled speakers were not included in the second world model.

Results and discussion

At the enrollment of speakers, some mistakes in the original scripts were discovered, such as in how to handle clipping in recordings, as well as in the feedback to the user while training models. The scripts were updated to take care of this, and afterwards enrollment was done without problems. In the first test, only the intended target speakers were used to train a UBM before they were enrolled.

Figure 1. Result for test 1, speaker RA against all enrolled models. Row 1 shows male (M) or female (F) model, row 2 the model name and row 3 the test speaker.

In Figure 1 we can observe that the speaker is correctly accepted with the only positive LLR (0.44). The closest following is the model of speaker JA (-0.08).
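The accept/reject layer that a fully-fledged system would add on top of such raw scores is a one-liner. A sketch using the two top scores from Figure 1 (the zero threshold is the informal criterion discussed in the text, not a tuned value):

```python
def decide(llr_scores, threshold=0.0):
    """Accept every enrolled model whose raw LLR exceeds the threshold."""
    return {model: score > threshold for model, score in llr_scores.items()}

# The two highest scores when test speaker RA is matched against all
# enrolled models (Figure 1).
decisions = decide({"RA": 0.44, "JA": -0.08})
```

Only the correct model (RA) is accepted. In practice the threshold would be tuned on held-out data to trade false acceptances against false rejections, which is exactly the step this demo deliberately leaves open.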


Figure 2. Result for test 1, speaker JA against all enrolled models.

For the second test speaker (JA) there is a lower acceptance score (0.25) for the correct model. However, the closest model (TL) also has a positive LLR (0.13).

Figure 3. Result for test 1, speaker HV against all enrolled models.

For the third test speaker (HV) the correct model is highest ranked again; however, the LLR (0.009) is low.

Figure 4. Result for test 1, impostor speaker MH against all enrolled models.

The first impostor speaker has no positive values, and the system seems to successfully keep the door closed.

Figure 5. Result for test 1, female impostor speaker HE against all enrolled models.

The female impostor was more successful in test 1. She gained positive LLRs for 2 models of enrolled speakers.

In test 2 the world model was exchanged and the models retrained. This world model was trained on excerpts of spontaneous speech from 109 young male speakers, recorded with a quality similar to that of the enrolled speakers.

Figure 6. Result for test 2, speaker RA against all enrolled models.

The increase in data for world model training had no significant effect in this case.

Figure 7. Result for test 2, speaker JA against all enrolled models.


For the test of speaker JA, the new world model improved the test result significantly. The correct model now gets a very high score (0.53), and even though the second best has a positive LLR (0.03), it is very low.

Figure 8. Result for test 2, speaker HV against all enrolled models.

Also for this test the new world model improves the correct LLR and creates a larger distance to the other models.

Figure 9. Result for test 2, impostor speaker MH against all enrolled models.

In the male impostor test for test 2 we obtained a rather peculiar result, where the male impostor gets a positive LLR for a female target model. The lack of female training data in the world model is most probably the explanation for this.

Figure 10. Result for test 2, female impostor speaker HE against all enrolled models.

When it comes to the female impostor, it becomes even clearer that female training data is missing from the world model. All scores except 1 are positive, and some of the scores are very high.

Conclusions

This first step included a successful implementation of open source tools, building a test framework and scripting procedures for text-independent speaker comparison. A small pilot study on performance with high-quality recordings was made. We can conclude that it is not sufficient to train a UBM using only male speakers if you want the system to be able to handle any incoming voice. However, for demonstration purposes and comparison between small amounts of data the technique is sufficient.

References

Boersma, P. & Weenink, D. (2009) Praat: doing phonetics by computer (Version 5.1.04) [Computer program]. Retrieved April 4, 2009, from http://www.praat.org/
Bonastre, J-F, Wils, F. & Meigner, S. (2005) ALIZE, a free toolkit for speaker recognition. In Proceedings of ICASSP, 2005, pp. 737–740.
Bonastre, J-F, Scheffer, N., Matrouf, C., Fredouille, A., Larcher, A., Preti, A., Pouchoulin, G., Evans, B., Fauve, B. & Mason, J.S. (2008) ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition. In Odyssey 2008 – The Speaker and Language Recognition Workshop, 2008.
Eriksson, A. (2004) SweDia 2000: A Swedish dialect database. In Babylonian Confusion Resolved. Proc. Nordic Symposium on the Comparison of Spoken Languages, ed. by P. J. Henrichsen, Copenhagen Working Papers in LSP 1 – 2004, 33–48.
Guillaume, G. (2004) SPro: speech signal processing toolkit. Software available at http://gforge.inria.fr/projects/spro
Martin, A. F. and Przybocki, M. A. (1999) The NIST 1999 Speaker Recognition Evaluation – An Overview. Digital Signal Processing 10: 1–18.
Reynolds, D. A., Quatieri, T. F. and Dunn, R. B. (2000) Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, 2000.


Modified re-synthesis of initial voiceless plosives by concatenation of speech from different speakers

Sofia Strömbergsson
Department of Speech, Music and Hearing, School of Computer Science and Communication, KTH, Stockholm

Abstract

This paper describes a method of re-synthesising utterance-initial voiceless plosives, given an original utterance by one speaker and a speech database of utterances by many other speakers. The system removes an initial voiceless plosive from an utterance and replaces it with another voiceless plosive selected from the speech database. (For example, if the original utterance was /tat/, the re-synthesised utterance could be /k+at/.) In the method described, techniques used in general concatenative speech synthesis were applied in order to find those segments in the speech database that would yield the smoothest concatenation with the original segment. Results from a small listening test reveal that the concatenated samples are most often correctly identified, but that there is room for improvement in naturalness. Some routes to improvement are suggested.

Introduction

In normal as well as deviant phonological development in children, there is a close interaction between perception and production of speech. In order to change a deviant (non-adult) way of pronouncing a sound/syllable/word, the child must realise that his/her current production is somehow insufficient (Hewlett, 1992). There is evidence of a correlation between the amount of attention a child (or infant) pays to his/her own speech production and the phonetic complexity of his/her speech production (Locke & Pearson, 1992). As expressed by these authors (p. 120): "the hearing of one's own articulations clearly is important to the formation of a phonetic guidance system".

Children with phonological disorders produce systematically deviant speech, due to an immature or deviant cognitive organisation of speech sounds. Examples of such systematic deviations might be stopping of fricatives, consonant cluster reductions and assimilations. Some of these children might well perceive phonological distinctions that they themselves do not produce, while others have problems both in perceiving and producing a phonological distinction.

Based on the above, it seems reasonable to assume that enhanced feedback on one's own speech might be particularly valuable to a child with phonological difficulties, in increasing his/her awareness of his/her own speech production. Hearing a re-synthesised ("corrected") version of his/her own deviant speech production might be a valuable assistance to the child in gaining this awareness. In an effort in this direction, Shuster (1998) manipulated ("corrected") children's deviant productions of /r/, and then let the subjects judge the correctness and speaker identity of speech samples played to them (which could be either original/incorrect or edited/corrected speech, spoken by themselves or another speaker). The results from this study showed that the children had most difficulty judging their own incorrect utterances accurately, but also that they had difficulty recognizing the speaker as themselves in their own "corrected" utterances. These results show that exercises of this type might lead to important insights into the nature of the phonological difficulties these children have, as well as providing implications for clinical intervention.

Applications of modified re-synthesis

Apart from the above-mentioned study by Shuster (1998), where the author used linear predictive parameter modification/synthesis to edit (or "correct") deviant productions of /r/, a more common application of modified re-synthesis is to create stimuli for perceptual experiments. For example, specific speech sounds in a syllable have been transformed into intermediate and ambiguous forms between two prototypical phonemes (Protopapas, 1998). These stimuli have then been used in experiments on categorical perception. Others have modulated the phonemic nature of specific segments, while preserving the global intonation, syllabic rhythm and broad phonotactics of natural utterances, in order to study what

acoustic cues (e.g. phonotactics, syllabic rhythm) are most salient in identifying languages (Ramus & Mehler, 1999). In these types of applications, however, stimuli have been created once, and there has been no need for real-time processing.

The computer-assisted language learning system VILLE (Wik, 2004) includes an exercise that involves modified re-synthesis. Here, the segments in the speech produced by the user are manipulated in terms of duration, i.e. stretched or shortened, immediately after recording. On the surface, this application shares several traits with the application suggested in this paper. However, more extensive manipulation is required to turn one phoneme into another, which is the goal of the system described here.

Purpose

The purpose of this study was to find out if it is at all possible to remove the initial voiceless plosive from a recorded syllable and replace it with an "artificial" segment so that it sounds natural. The "artificial" segment is artificial in the sense that it was never produced by the speaker, but constructed or retrieved from somewhere else. As voiceless plosives generated by formant synthesizers are known to lack naturalness (Carlson & Granström, 2005), retrieving the target segment from a speech database was considered the better option.

Method

Material

The Swedish version of the Speecon corpus (Iskra et al, 2002) was used as a speech database, from which target phonemes were selected. This corpus contains data from 550 adult speakers of both genders and of various ages. The speech in this corpus was simultaneously recorded at 16 kHz/16 bit sampling frequency by four different microphones, in different environments. For this study, only the recordings made by a close headset microphone (Sennheiser ME104) were used. No restrictions were placed on gender, age or recording environment. From this data, only utterances starting with an initial voiceless plosive (/p/, /t/ or /k/) and a vowel were selected. This resulted in a speech database consisting of 12 857 utterances (see Table 1 for details). Henceforth, this speech database will be referred to as "the target corpus".

For the remainder part of the re-synthesis, a small corpus of 12 utterances spoken by a female speaker was recorded with a Sennheiser m@b 40 microphone at 16 kHz/16 bit sampling frequency. The recordings were made in a relatively quiet office environment. Three utterances (/tat/, /kak/ and /pap/) were recorded four times each. This corpus will be referred to as "the remainder corpus".

Table 1. Number of utterances in the target corpus.

                         Nbr of utterances
Utterance-initial /pV/    2 680
Utterance-initial /tV/    4 562
Utterance-initial /kV/    5 614
Total                    12 857

Re-synthesis

Each step in the re-synthesis process is described in the following paragraphs.

Alignment

For aligning the corpora (the target corpus and the remainder corpus), the NALIGN aligner (Sjölander, 2003) was used.

Feature extraction

For the segments in the target corpus, features were extracted at the last frame before the middle of the vowel following the initial plosive. For the segments in the remainder corpus, features were extracted at the first frame after the middle of the vowel following the initial plosive. The extracted features were the same as described by Hunt & Black (1996), i.e. MFCCs, log power and F0. The Snack tool SPEATURES (Sjölander, 2009) was used to extract 13 MFCCs. F0 and log power were extracted using the Snack tools PITCH and POWER, respectively.

Calculation of join cost

Join costs between all possible speech segment combinations (i.e. all combinations of a target segment from the target corpus and a remainder segment from the remainder corpus) were calculated as the sum of

1. the Euclidean distance (Taylor, 2008) in F0
2. the Euclidean distance in log power


3. the Mahalanobis distance (Taylor, 2008) for the MFCCs

The F0 distance was weighted by 0.5. A penalty of 10 was given to those segments from the target corpus where the vowel following the initial plosive was not /a/, i.e. a different vowel than the one in the remainder corpus. The F0 weighting factor and the vowel-penalty value were arrived at after iterative tuning. The distances were calculated using a combination of Perl and Microsoft Excel.

Concatenation

For each possible segment combination ((/p|t|k/) + (/ap|at|ak/), i.e. 9 possible combinations in total), the join costs were ranked. The five combinations with the lowest costs within each of these nine categories were then concatenated using the Snack tool CONCAT. Concatenation points were located at zero-crossings within a range of 15 samples after the middle of the vowel following the initial plosive. (If no zero-crossing was found within that range, the concatenation point was set to the middle of the vowel.)

Evaluation

Seven adult subjects were recruited to perform a listening test. All subjects were native Swedes, with no known hearing problems, and naïve in the sense that they had not been involved in any work related to speech synthesis development. A listening test was constructed in Tcl/Tk to present the 45 stimuli (i.e. the five concatenations with the lowest costs for each of the nine different syllables) and 9 original recordings of the different syllables. The 54 stimuli were all repeated twice (resulting in a total of 108 items) and presented in random order. The task for the subjects was to decide what syllable they heard (by selecting one of the nine possible syllables) and to judge the naturalness of the utterance on a scale from 0 to 100. The subjects could play the stimuli as many times as they wanted. Before starting the actual test, 6 training items were presented, after which the subjects had the opportunity to ask questions about the test procedure.

Statistical analysis

Inter-rater agreement was assessed via the intraclass correlation coefficient, ICC (2, 7), for syllable identification accuracy and naturalness rating separately. Pearson correlations were used to assess intra-rater agreement for each listener separately.

Results

The results of the evaluation are presented in Table 1.

Table 1. Evaluation results for the concatenated and original speech samples. The first column displays the percentage of correctly identified syllables, and the second column displays the average naturalness judgments (max = 100).

               % correct syll    Naturalness
Concatenated             94%     49 (SD: 20)
Original                100%     89 (SD: 10)

The listeners demonstrated high inter-rater agreement on naturalness rating (ICC = 0.93), but lower agreement on syllable identification accuracy (ICC = 0.79). Average intra-rater agreement for all listeners was 0.71 on naturalness rating and 0.72 on syllable identification accuracy.

Discussion

Considering that the purpose of this study was to study the possibilities of generating understandable and close to natural sounding concatenations of segments from different speakers, the results are actually quite promising. The listeners' syllable identification accuracy of 94% indicates that comprehensibility is not a big problem. Although the total naturalness judgement average of 49 (of 100) is not at all impressive, an inspection of the individual samples reveals that some concatenated samples actually receive higher naturalness ratings than original samples. Thus, the results confirm that it is indeed possible to generate close to natural sounding samples by concatenating speech from different speakers. However, considering that the long-term goal is a working system that can be implemented and used to assist phonological therapy with children, the system is far from complete.

As of now, the amount of manual intervention required to run the re-synthesis process is large. Different tools were used to complete different steps (various Snack tools, Microsoft Excel), and Perl scripts were used as interfaces between these steps. Thus, there is still a long way to real-time processing. Moreover, the system is still limited to voiceless plosives in sentence-initial positions, and ideally, it should be more general.
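In outline, the join cost and splice-point computations described above amount to the following. This is an illustrative Python re-implementation (the authors used Perl, Microsoft Excel and the Snack tools CONCAT, PITCH, POWER and SPEATURES); the dictionary-based segment representation and function names are ours, not the paper's.

```python
import numpy as np

def join_cost(target, remainder, mfcc_inv_cov, f0_weight=0.5, vowel_penalty=10.0):
    """Join cost = weighted Euclidean distance in F0 + Euclidean distance in
    log power + Mahalanobis distance over the 13 MFCCs, plus a penalty of 10
    when the target's following vowel is not /a/ (the remainder-corpus vowel)."""
    cost = f0_weight * abs(target["f0"] - remainder["f0"])          # F0, weighted by 0.5
    cost += abs(target["log_power"] - remainder["log_power"])       # log power
    d = np.asarray(target["mfcc"]) - np.asarray(remainder["mfcc"])
    cost += np.sqrt(d @ mfcc_inv_cov @ d)                           # Mahalanobis distance
    if target["vowel"] != "a":                                      # vowel-mismatch penalty
        cost += vowel_penalty
    return float(cost)

def concat_point(samples, vowel_mid, search_range=15):
    """Concatenation point: first zero-crossing within `search_range` samples
    after the middle of the vowel; fall back to the vowel middle if none is found."""
    for i in range(vowel_mid, min(vowel_mid + search_range, len(samples) - 1)):
        if samples[i] == 0 or samples[i] * samples[i + 1] < 0:      # sign change
            return i
    return vowel_mid
```

For each of the nine (/p|t|k/) + (/ap|at|ak/) categories, the candidate pairs would then be sorted by this cost and the five cheapest pairs spliced at the computed points.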

However, considering that children usually master speech sounds in word-initial and word-final positions later than in word-medial positions (Linell & Jennische, 1980), this limitation should not be disqualifying on its own.

The speech data in this work came from adult speakers. New challenges can be expected when faced with children's voices, e.g. increased variability in the speech database (Gerosa et al., 2007). Moreover, variability in the speech of the intended user - the child in the therapy room - can also be expected. (Not to mention the variability from child to child in motivation, ability and willingness to comply with the therapist's intervention plans.)

The evaluation showed that there is much room for improving naturalness, and fortunately, some improvement strategies can be suggested. First, further tuning of the weighting factors might be a way to ensure that the combinations that are ranked the highest are also the ones that sound the best; as of now, this is not always the case. During the course of this investigation, attempts were made at increasing the size of the target corpus by including word-initial voiceless plosives within utterances as well. However, these efforts did not improve the quality of the output concatenated speech samples. The current system does not involve any spectral smoothing; this might be a way to polish the concatenation joints to improve naturalness.

Looking beyond the context of modified re-synthesis to assist therapy with children with phonological impairments, the finding that it is indeed possible to generate natural sounding concatenations of segments from different speakers might be valuable in concatenative synthesis development in general. This might be useful in the context of extending a speech database if the original speaker is no longer available, e.g. with new phonemes. However, it seems reasonable to assume that the method is only applicable to voiceless segments.

Acknowledgements

This work was funded by The Swedish Graduate School of Language Technology (GSLT).

References

Carlson, R. & Granström, B. (2005) Data-driven multimodal synthesis. Speech Communication 47, 182-193.
Gerosa, M., Giuliani, D. & Brugnara, F. (2007) Acoustic variability and automatic recognition of children's speech. Speech Communication 49, 847-860.
Hewlett, N. (1992) Processes of development and production. In Grunwell, P. (ed.) Developmental Speech Disorders, 15-38. London: Whurr.
Hunt, A. and Black, A. (1996) Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of ICASSP 96 (Atlanta, Georgia), 373-376.
Iskra, D., Grosskopf, B., Marasek, K., Van Den Heuvel, H., Diehl, F., and Kiessling, A. (2002) Speecon - speech databases for consumer devices: Database specification and validation.
Linell, P. & Jennische, M. (1980) Barns uttalsutveckling [Children's pronunciation development]. Stockholm: Liber.
Locke, J.L. & Pearson, D.M. (1992) Vocal Learning and the Emergence of Phonological Capacity. A Neurobiological Approach. In C.A. Ferguson, L. Menn & C. Stoel-Gammon (Eds.), Phonological Development. Models, research, implications. York: York Press.
Protopapas, A. (1998) Modified LPC resynthesis for controlling speech stimulus discriminability. 136th Annual Meeting of the Acoustical Society of America, Norfolk, VA, October 13-16.
Ramus, F. & Mehler, J. (1999) Language identification with suprasegmental cues: A study based on speech resynthesis. Journal of the Acoustical Society of America 105, 512-521.
Shuster, L. I. (1998) The perception of correctly and incorrectly produced /r/. Journal of Speech, Language and Hearing Research 41, 941-950.
Sjölander, K. (1997-2004) The Snack sound toolkit. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden. Online: http://www.speech.kth.se/snack/, accessed on April 12, 2009.
Sjölander, K. (2003) An HMM-based system for automatic segmentation and alignment of speech. Proceedings of Fonetik 2003 (Umeå University, Sweden), PHONUM 9, 93-96.
Taylor, P. (2008) Text-to-Speech Synthesis. Cambridge University Press.
Wik, P. (2004) Designing a virtual language tutor. Proceedings of Fonetik 2004 (Stockholm University, Sweden), 136-139.


Cross-modal Clustering in the Acoustic-Articulatory Space

G. Ananthakrishnan and Daniel Neiberg
Centre for Speech Technology, CSC, KTH, Stockholm
[email protected], [email protected]

Abstract

This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.

Introduction

Trying to estimate articulatory measurements from acoustic data has been of special interest for a long time and is known as acoustic-to-articulatory inversion. Though this mapping between the two modalities was expected to be a one-to-one mapping, early research presented some interesting evidence showing non-uniqueness in this mapping. Bite-block experiments have shown that speakers are capable of producing sounds perceptually close to the intended sounds even though the jaw is fixed in an unnatural position (Gay et al., 1981). Mermelstein (1967) and Schroeder (1967) have shown, through analytical articulatory models, that the inversion is unique to a class of area functions rather than a unique configuration of the vocal tract.

With the advent of measuring techniques like Electromagnetic Articulography (EMA) and X-Ray Microbeam, it became possible to collect simultaneous measurements of acoustics and articulation during continuous speech. Several attempts have been made by researchers to perform acoustic-to-articulatory inversion by applying machine learning techniques to the acoustic-articulatory data (Yehia et al., 1998 and Kjellström and Engwall, 2009). The statistical methods applied to the mapping problem brought a new dimension to the concept of non-uniqueness in the mapping. In the deterministic case, one can say that if the same acoustic parameters are produced by more than one articulatory configuration, then the particular mapping is considered to be non-unique. It is almost impossible to show this using real recorded data, unless more than one articulatory configuration produces exactly the same acoustic parameters. However, not finding such instances does not imply that non-uniqueness does not exist.

Qin and Carreira-Perpiñán (2007) proposed that the mapping is non-unique if, for a particular acoustic cluster, the corresponding articulatory mapping may be found in more than one cluster. Evidence of non-uniqueness in certain acoustic clusters for phonemes like /r/, /l/ and /w/ was presented. The study by Qin quantized the acoustic space using the perceptual Itakura distance on LPC features. The articulatory space was clustered using a non-parametric Gaussian density kernel with a fixed variance. The problem with such a definition of non-uniqueness is that one does not know the optimal method and level of quantization for clustering the acoustic and articulatory spaces.

A later study by Neiberg et al. (2008) argued that the different articulatory clusters should not only map onto a single acoustic cluster but should also map onto acoustic distributions with the same parameters, for it to be called non-unique. Using an approach based on finding the Bhattacharyya distance between the distributions of the inverse mapping, they found that phonemes like /p/, /t/, /k/, /s/ and /z/ are highly non-unique.

In this study, we wish to observe how clusters in the acoustic space map onto the articulatory space. For every cluster in the acoustic space, we intend to find the uncertainty in finding a corresponding articulatory cluster. It must be noted that this uncertainty is not necessarily the non-uniqueness in the acoustic-to-articulatory mapping. However, finding this uncertainty would give an intuitive understanding

about the difficulties in the mapping for different phonemes.

Clustering the acoustic and articulatory spaces separately, as was done in previous studies by Qin and Carreira-Perpiñán (2007) as well as Neiberg et al. (2008), leads to hard boundaries in the clusters. The cluster labels for the instances near these boundaries may be estimated incorrectly, which may cause an overestimation of the uncertainty. This situation is illustrated in Fig. 1 using synthetic data, where we can see both the distributions of the synthetic data and the Maximum A-posteriori Probability (MAP) estimates for the clusters. We can see that, because of the incorrect clustering, it seems as if data belonging to one cluster in mode A belongs to more than one cluster in mode B.

[Figure 1: four scatter panels, Mode A and Mode B; the synthetic data above, their MAP hard clusterings below.]

Figure 1. The figures above show a synthesized example of data in two modalities. The figures below show how MAP hard clustering may bring about an effect of uncertainty in the correspondence between clusters in the two modalities.

In order to mitigate this problem, we have suggested a method of cross-modal clustering where both the available modalities are made use of by allowing soft boundaries for the clusters in each modality. Cross-modal clustering has been dealt with in detail in several contexts of combining multimodal data. Coen (2005) proposed a self-supervised method where he used acoustic and visual features to learn perceptual structures based on temporal correlations between the two modalities. He used the concept of slices, which are topological manifolds encoding dynamic states. Similarly, Bolelli et al. (2007) proposed a clustering algorithm using Support Vector Machines (SVMs) for clustering interrelated text data sets.

The method proposed in this paper does not make use of correlations, but mainly uses co-clustering properties between the two modalities in order to perform the cross-modal clustering. Thus, even non-linear (uncorrelated) dependencies may also be modeled using this simple method.

Theory

We assume that the data is a Gaussian Mixture Model (GMM). The acoustic space Y = {y_1, y_2, …, y_N} with 'N' data points is modelled using 'I' Gaussians, {λ_1, λ_2, …, λ_I}, and the articulatory space X = {x_1, x_2, …, x_N} is modelled using 'K' Gaussians, {γ_1, γ_2, …, γ_K}. 'I' and 'K' are obtained by minimizing the Bayesian Information Criterion (BIC). Suppose we know which articulatory Gaussian a particular data point belongs to, say γ_k. The correct acoustic Gaussian λ_n for the n-th data point, having acoustic features y_n and articulatory features x_n, is given by the maximum cross-modal a posteriori probability:

    λ_n = argmax_{1≤i≤I} P(λ_i | x_n, y_n, γ_k)
        = argmax_{1≤i≤I} p(x_n, y_n | λ_i, γ_k) · P(λ_i | γ_k) · P(γ_k)        (1)

The knowledge about the articulatory cluster can then be used to improve the estimate of the correct acoustic cluster, and vice versa, as shown below:

    γ_n = argmax_{1≤k≤K} p(x_n, y_n | λ_i, γ_k) · P(γ_k | λ_i) · P(λ_i)        (2)

where P(λ|γ) is the cross-modal prior and p(x, y|λ, γ) is the joint cross-modal distribution. If the first estimates of the correct clusters are MAP, then the estimates of the correct clusters of the speech segments are improved recursively.

[Figure 2: the same four panels as Fig. 1, clustered with soft boundaries.]

Figure 2. The figure shows improved performance and soft boundaries for the synthetic data using cross-modal clustering; here the effect of uncertainty in correspondences is less.
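A minimal Python sketch of the classification step in Eq. (1). The factorization of the joint cross-modal density p(x, y | λ, γ) into the two per-modality Gaussians is a simplifying assumption made here for illustration only; the paper does not specify the form of the joint distribution, and the function names are ours.

```python
import numpy as np

def gauss_pdf(v, mean, cov):
    """Multivariate Gaussian density N(mean, cov) evaluated at v."""
    d = v - mean
    k = len(v)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm)

def mcmap_acoustic_label(y_n, x_n, gamma_k, ac_gauss, art_gauss,
                         p_lam_given_gam, p_gam):
    """Eq. (1): lambda_n = argmax_i p(x_n, y_n | lambda_i, gamma_k)
    * P(lambda_i | gamma_k) * P(gamma_k), with the joint density factorized
    into the acoustic and articulatory Gaussians (an assumption of this sketch).
    ac_gauss / art_gauss are lists of (mean, cov) pairs."""
    x_term = gauss_pdf(x_n, *art_gauss[gamma_k])   # constant over i
    scores = [gauss_pdf(y_n, *ac_gauss[i]) * x_term
              * p_lam_given_gam[i, gamma_k] * p_gam[gamma_k]
              for i in range(len(ac_gauss))]
    return int(np.argmax(scores))
```

Eq. (2) is the mirror image: maximize over the K articulatory Gaussians using the prior P(γ_k | λ_i) and P(λ_i). Alternating the two updates gives the recursive improvement described in the text.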


[Figure 3 panels: "Formants for vowel /e/" (F1-F2 formant space, Hz) and "Mid-sagittal plane of the mouth" (horizontal and vertical EMA positions in 1/100 mm, with the Upper Lip, Lower Lip, Velum, Tongue Dorsum, Tongue Body, Tongue Tip and Lower Jaw coils marked).]

Figure 3. The figure on the left side shows the formant space for the vowel /e/ and the corresponding articulatory positions on the right hand side. The ellipses show the locations of the estimated Gaussians. One can observe that a single cluster in the acoustic space may be spread over several clusters in the articulatory space.

[Figure 4 panels: "Formants for vowel /@/" (F1-F2 formant space, Hz) and the Lower Jaw measurements along the mid-sagittal plane of the mouth (1/100 mm).]

Figure 4. The figure on the left side shows the formants for the phoneme /@/. The figure on the right indicates the measurements of the lower jaw for the corresponding instances. The ellipses indicate the locations of the estimated Gaussians. We can see that some clusters are located within a small region in the articulatory space, while a few other clusters are spread all over.

Finally, a soft clustering is obtained which maximizes the cross-modal a posteriori probability in both the modes. We call this method Maximum Cross-Modal A-Posteriori Probability (MCMAP). Proving the convergence of the algorithm is beyond the scope of this paper. However, the algorithm converged for all the experiments within 50 iterations. In Fig. 2, we can see that the estimate of the correct cluster is slightly better than a simple a posteriori probability in Fig. 1.

The uncertainty of the clustering for acoustic-to-articulatory inversion in a particular phoneme can be estimated from the cross-modal prior, i.e. P(γ|λ). We propose that the measure of uncertainty in the cross-modal cluster correspondence, 'U', is given by the entropy of P(γ|λ):

    U_{λ_i} = − Σ_{k=1}^{K} P(γ_k | λ_i) · log_K P(γ_k | λ_i)                  (3)

    U = Σ_{i=1}^{I} U_{λ_i} · P(λ_i)

The log to the base 'K' is taken in order to normalize for the effect of different numbers of articulatory clusters. The entropy is a good measure of the uncertainty in prediction, and thus forms an intuitive measure for our purpose.
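The uncertainty of Eq. (3) is straightforward to compute once the cross-modal prior P(γ|λ) is available. In the sketch below, the prior is estimated from paired hard cluster labels by row-normalizing a co-occurrence table (a simplification of the paper's soft MCMAP boundaries, assumed here for illustration), and the entropy uses log base K so that U lies in [0, 1].

```python
import numpy as np

def crossmodal_prior(ac_labels, art_labels, I, K):
    """Estimate P(gamma_k | lambda_i) by row-normalizing the co-occurrence
    counts of paired (acoustic, articulatory) cluster labels."""
    counts = np.zeros((I, K))
    for i, k in zip(ac_labels, art_labels):
        counts[i, k] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return counts / np.where(rows == 0, 1.0, rows)

def uncertainty(p_gam_given_lam, p_lam):
    """Eq. (3): U_{lambda_i} = -sum_k P(gamma_k|lambda_i) log_K P(gamma_k|lambda_i),
    averaged over the acoustic clusters with weights P(lambda_i)."""
    K = p_gam_given_lam.shape[1]
    p = np.clip(p_gam_given_lam, 1e-300, 1.0)      # avoid log(0); zero entries contribute 0
    u_lam = -(p_gam_given_lam * np.log(p) / np.log(K)).sum(axis=1)
    return float(u_lam @ p_lam)
```

A row of P(γ|λ) spread evenly over many articulatory clusters (as reported for the velum) gives U near 1; a row concentrated on one cluster gives U near 0.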

It is always between 0 and 1, so comparisons between different cross-modal clusterings are easy: 1 indicates very high uncertainty, while 0 indicates a one-to-one mapping between corresponding clusters in the two modalities.

Experiments and Results

The MOCHA-TIMIT database (Wrench, 1999) was used to perform the experiments. The data consists of simultaneous measurements of acoustic and articulatory data for a female speaker. The articulatory data consisted of 14 channels, which included the X- and Y-axis positions of EMA coils on 7 articulators: the Lower Jaw (LJ), Upper Lip (UL), Lower Lip (LL), Tongue Tip (TT), Tongue Body (TB), Tongue Dorsum (TD) and Velum (V). Only vowels were considered for this study, and the acoustic space was represented by the first 5 formants, obtained from 25 ms acoustic windows shifted by 10 ms. The articulatory data was low-pass filtered and downsampled in order to correspond with the acoustic data rate. The uncertainty (U) in clustering was estimated using Equation 3 for the British vowels, namely /, æ, e, , :, u:, :, :, , :, , /. The articulatory data was first clustered for all the articulatory channels together and then clustered individually for each of the 7 articulators.

Fig. 3 shows the clusters in both the acoustic and articulatory space for the vowel /e/. We can see that data points corresponding to one cluster in the acoustic space (F1-F2 formant space) correspond to more than one cluster in the articulatory space. The ellipses, which correspond to the initial clusters, are replaced by different clustering labels estimated by the MCMAP algorithm. So though the acoustic features had more than one cluster in the first estimate, after cross-modal clustering, all the instances are assigned to a single cluster.

Fig. 4 shows the correspondences between the acoustic clusters and the LJ for the vowel /@/. We can see that the uncertainty is less for some of the clusters, while it is higher for some others. Fig. 5 shows the comparative measures of the overall uncertainty (over all the articulators) of the articulatory clusters corresponding to each of the acoustic clusters for the different vowels tested. Fig. 6 shows the correspondence uncertainty of individual articulators.

[Figure 5: bar chart, "Overall Uncertainty for the Vowels", uncertainty 0-0.9 per British vowel.]

Figure 5. The figure shows the overall uncertainty (for the whole articulatory configuration) for the British vowels.

[Figure 6: bar chart, "Uncertainty for Individual Articulators" (V, TD, TB, TT, LL, UL, LJ) per British vowel.]

Figure 6. The figure shows the uncertainty for individual articulators for the British vowels.

Discussion

From Fig. 5 it is clear that the shorter vowels seem to have more uncertainty than longer vowels, which is intuitive. The higher uncertainty is seen for the short vowels /e/ and //, while there is almost no uncertainty for the long vowels /:/ and /:/. The overall uncertainty for the entire configuration is usually around the lowest uncertainty for a single articulator. This is intuitive, and shows that even though certain articulator correspondences are uncertain, the correspondences are more certain for the overall configuration. When the uncertainty for individual articulators is observed, it is apparent that the velum has a high uncertainty of more than 0.6 for all the vowels. This is due to the fact that nasalization is not easily observable in the formants. So even though different clusters are formed in the articulatory space, they are seen in the same cluster in the acoustic space. The uncertainty is much less in the lower lip correspondence for the long vowels /:/, /u:/ and /:/, while it is high for // and /e/. The TD shows lower uncertainty for the back vowels /u:/ and /:/. The uncertainty for TD is higher

for the front vowels like /e/ and //. The uncertainty for the tongue tip is lower for vowels like // and /:/, while it is higher for /:/ and //. These results are intuitive, and show that it is easier to find correspondences between acoustic and articulatory clusters for some vowels, while it is more difficult for others.

Conclusion and Future Work

The proposed algorithm helps in improving the clustering ability using information from multiple modalities. A measure for finding the uncertainty in correspondences between acoustic and articulatory clusters has been suggested, and empirical results on certain British vowels have been presented. The results presented are intuitive and show the difficulties in making predictions about the articulation from acoustics for certain sounds. It follows that certain changes in the articulatory configurations cause variation in the formants, while certain other articulatory changes do not change the formants.

It is apparent that the empirical results presented depend on the type of clustering and the initialization of the algorithm. This must be explored. Future work must also be done on extending this paradigm to include other classes of phonemes as well as different languages and subjects. It would be interesting to see if these empirical results can be generalized or are special to certain subjects, languages and accents.

Acknowledgements

This work is supported by the Swedish Research Council project 80449001, Computer-Animated Language Teachers.

References

Bolelli, L., Ertekin, S., Zhou, D. and Giles, C. L. (2007) K-SVMeans: A Hybrid Clustering Algorithm for Multi-Type Interrelated Datasets. International Conference on Web Intelligence, 198-204.
Coen, M. H. (2005) Cross-Modal Clustering. Proceedings of the Twentieth National Conference on Artificial Intelligence, 932-937.
Gay, T., Lindblom, B. and Lubker, J. (1981) Production of bite-block vowels: acoustic equivalence by selective compensation. J. Acoust. Soc. Am. 69, 802-810.
Kjellström, H. and Engwall, O. (2009) Audiovisual-to-articulatory inversion. Speech Communication 51(3), 195-209.
Mermelstein, P. (1967) Determination of the Vocal-Tract Shape from Measured Formant Frequencies. J. Acoust. Soc. Am. 41, 1283-1294.
Neiberg, D., Ananthakrishnan, G. and Engwall, O. (2008) The Acoustic to Articulation Mapping: Non-linear or Non-unique? Proceedings of Interspeech, 1485-1488.
Qin, C. and Carreira-Perpiñán, M. Á. (2007) An Empirical Investigation of the Nonuniqueness in the Acoustic-to-Articulatory Mapping. Proceedings of Interspeech, 74-77.
Schroeder, M. R. (1967) Determination of the geometry of the human vocal tract by acoustic measurements. J. Acoust. Soc. Am. 41(2), 1002-1010.
Wrench, A. (1999) The MOCHA-TIMIT articulatory database. Queen Margaret University College, Tech. Rep. Online: http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html.
Yehia, H., Rubin, P. and Vatikiotis-Bateson, E. (1998) Quantitative association of vocal tract and facial behavior. Speech Communication 26(1-2), 23-43.



Swedish phonetics 1939-1969

Paul Touati
French Studies, Centre for Languages and Literature, Lund University

Abstract

The aim of the current project ("Swedish Phonetics '39-'69") is to provide an account of the historical, social, discursive, and rhetorical conditions that determined the emergence of phonetic science in Sweden between 1939 and 1969. The inquiry is based on an investigation in four areas: how empirical phonetic data were analysed in the period, how the discipline gained new knowledge about phonetic facts through improvements in experimental settings, how technological equipment specially adapted to phonetic research was developed, and how diverging phonetic explanations became competing paradigms. Understanding of the development of phonetic knowledge may be synthesised in the persona of particularly emblematic phoneticians: Bertil Malmberg embodied the boom that happened in the field of Swedish phonetics during this period. The emergence of internationally recognized Swedish research in phonetics was largely his work. This investigation is based on two different corpora. The first corpus is the set of 216 contributions, the full authorship of Malmberg published between 1939 and 1969. The second corpus is his archive, owned by Lund University. It includes semi-official and official letters, administrative correspondence, funding applications (…). The two are complementary. The study of both is necessary for achieving a systematic description of the development of phonetic knowledge in Sweden.

Research in progress

The aim of the current project ("Swedish Phonetics '39-'69") is to provide an account of the historical, social, discursive, and rhetorical conditions that determined the emergence of phonetic science in Sweden during a thirty-year period, situated between 1939 and 1969 (see Touati 2009; Touati forthcoming). The inquiry is based on a systematic investigation essentially in four areas: how empirical phonetic data were analysed in the period, how the discipline gained new knowledge about phonetic facts through improvements in experimental settings, how technological equipment specially adapted to phonetic research was developed, and how diverging phonetic explanations became competing paradigms.

The claim sustaining this investigation is that knowledge is a product of continually renewed and adjusted interactions between a series of instances, such as fundamental research, institutional strategies and the ambitions of individual researchers. In this perspective, the inquiry will demonstrate that phonetic knowledge was grounded first by discussions on the validity of the questions to be asked, then by an evaluation in which results were "proposed, negotiated, modified, rejected or ratified in and through discursive processes" (Mondada 1995), and finally became facts when used in scientific articles. Therefore, in order to understand the construction of this knowledge, it seems important to undertake both a study of the phonetic content and of the rhetorical and discursive form used in articles explaining and propagating phonetic facts. A part of this research is in this way related to studies on textuality (Bronckart 1996), especially those devoted to "academic writing" (Berge 2003; Bondi & Hyland 2006; Del Lungo Camiciotti 2005; Fløttum & Rastier 2003; Ravelli & Ellis 2004; Tognini-Bonelli), and to studies on metadiscourse and interactional resources (Hyland 2005; Hyland & Tse 2006; Kerbrat-Orecchioni 2005; Ädel 2006).

Understanding of the development of phonetic knowledge may be synthesised in the persona of particularly emblematic phoneticians. Among these, none has better than Bertil Malmberg [1913-1994] embodied the boom that happened in the field of Swedish phonetics. The emergence of internationally recognized Swedish research in phonetics was largely his work. As Rossi (1996: 99) wrote: "Today, all phoneticians identify with this modern concept of phonetics that I will hereafter refer to, following Saussure, as speech linguistics. The much admired and respected B. Malmberg significantly contributed to the development of this concept in Europe".


The starting date, “the terminus a quo”, chosen for the study is set in 1939, the year of the first publication by Malmberg, entitled "Vad är fonologi?" (What is phonology?). The “terminus ad quem” is fixed to 1969, when Malmberg left phonetics for the newly created chair in general linguistics at the University of Lund.

Corpora
Malmberg's authorship continued in an unbroken flow until the end of his life. The list of his publications, compiled as “Bertil Malmberg Bibliography” by Gullberg (1993), amounts to 315 titles (articles, monographs, manuals, reports).
The first corpus on which I propose to conduct my analysis is the set of 216 contributions, the full authorship of Bertil Malmberg published between 1939 and 1969. The second corpus is his archive, owned by Lund University (“Lunds universitetsarkiv, Inst. f. Lingvistik, Prefektens korrespondens, B. Malmberg”). It includes semi-official and official letters, administrative correspondence, inventories, funding applications, administrative orders, and transcripts of meetings. In its totality, this corpus reflects the complexity of the social and scientific life at the Institute of Phonetics. Malmberg enjoyed writing. He sent and received considerable numbers of letters. Among his correspondents were the greatest linguists of his time (Benveniste, Delattre, Dumézil, Fant, Halle, Hjelmslev, Jakobson, Martinet), as well as colleagues, students, and representatives of the non-scientific public. Malmberg was a perfect polyglot. He took pleasure in using the language of his correspondent. The letters are in Swedish, German, English, Spanish, Italian, and French, the latter obviously the language for which he had a predilection.
The first corpus consists of texts in phonetics. They will be analysed primarily in terms of their scientific content (content-oriented analysis). The second corpus will be used to describe the social and institutional context (context-oriented analysis). The two are complementary. The study of both is necessary for achieving a systematic description of the development of phonetic knowledge. While the articles published in scientific journals are meant to ensure the validity of the obtained knowledge by following strict research and writing procedures, the merit of the correspondence is to unveil, in a unique, often friendly, sometimes astonishing way, that, on the contrary, knowledge is unstable and highly subject to negotiation.

The phonetician
Bertil Malmberg was born on April 22, 1913 in the city of Helsingborg, situated in Scania in southern Sweden (see also Sigurd 1995). In the autumn of 1932, then aged nineteen, he began to study at the University of Lund. He obtained his BA in 1935. During the following academic year (1936-1937), he went to Paris to study phonetics with Pierre Fouché [1891-1967]. That same year he discovered phonology through the teaching of André Martinet [1908-1999]. Back in Lund, he completed his higher education on October 5, 1940, when he defended a doctoral dissertation focused on a traditional topic of philology. He was appointed "docent" in Romance languages on December 6, 1940.
After a decade of research, on November 24, 1950, Malmberg finally reached the goal of his ambitions, both personal and institutional. The first chair in phonetics in Sweden, in phonetic sciences, was established at the University of Lund and placed at the disposal of Malmberg. Phonetics had thus become an academic discipline and received its institutional recognition. Letters of congratulation came from far and wide. Two of them deserve special mention. They are addressed to Malmberg by two major representatives of contemporary linguistics, André Martinet and Roman Jakobson. Martinet's letter is sent from Columbia University: « Cher Monsieur, / Permettez-moi tout d’abord de vous féliciter de votre nomination. C’est le couronnement bien mérité de votre belle activité scientifique au cours des années 40. Je suis heureux d’apprendre que vous allez pouvoir continuer dans de meilleures conditions l’excellent travail que vous faites en Suède. » (“Dear Sir, / Allow me first of all to congratulate you on your appointment. It is the well-deserved crowning of your fine scientific activity during the 1940s. I am happy to learn that you will be able to continue, under better conditions, the excellent work you are doing in Sweden.”) Jakobson's letter, sent from Harvard University early in 1951, highlights the fact that the appointment of Malmberg meant the establishment of a research centre in phonetics and phonology in Sweden: “[...] our warmest congratulations to your appointment. Finally phonetics and phonemics have an adequate center in Sweden”. As can be seen, both are delighted not only by Malmberg's personal success but also by the success of phonetics as an academic discipline.


A point of departure: Two articles
Malmberg started his prolific authorship in 1939 with an article dedicated to the new Prague phonology. For this inaugural article, Malmberg set himself the objective of informing Swedish language teachers about a series of fundamental phonological concepts such as function, phoneme, opposition and correlation, concepts advanced by the three Russian linguists R. Jakobson [1896-1982], S. Karcevskij [1884-1955] and N.S. Trubetzkoy [1890-1938]. To emphasize the revolutionary aspects of Prague phonology, Malmberg started off by clarifying the difference between phonetics and phonology: « Alors que la phonétique se préoccupe des faits sonores à caractère langagier et qu’elle se propose de décrire de manière objective, voire expérimentale, les différentes phases de la production de la parole et ce faisant du rôle des organes phonatoires, la phonologie fixe son attention uniquement sur la description des propriétés de la parole qui ont un rôle fonctionnel » (Malmberg 1939: 204). (“Whereas phonetics is concerned with sound facts of a linguistic character and sets out to describe objectively, even experimentally, the different phases of speech production and, in doing so, the role of the speech organs, phonology fixes its attention solely on the description of those properties of speech that have a functional role.”)
He praised Prague phonology in its effort to identify and systematize functional linguistic forms, but did not hesitate to pronounce severe criticism against Trubetzkoy and his followers when they advocated phonology as a science strictly separate from phonetics. Malmberg sustained the idea that phonology, if claimed to be a new science, must engage in the search for relationships between functional aspects of sounds and their purely phonetic properties within a given system – a particular language. His article includes examples borrowed from French, German, Italian, Swedish and Welsh. For Malmberg, there is no doubt that « la phonologie et la phonétique ne sont pas des sciences différentes mais deux points de vue sur un même objet, à savoir les formes sonores du langage » (Malmberg 1939: 210) (“phonology and phonetics are not different sciences but two points of view on one and the same object, namely the sound forms of language”).
The first article in experimental phonetics was published the following year (Malmberg 1940) but was based on research conducted during his stay in Paris in 1937-1938. Advised and encouraged by Fouché, Malmberg tackles, in this first experimental work, an important problem of Swedish phonetics, namely the description of the musical accents. In the resulting article, he presents the experimental protocol as follows:

« On prononce le mot ou la phrase en question dans une embouchure reliée par un tube élastique à un diaphragme phonographique pourvu d’un style inscripteur et placé devant un cylindre enregistreur. Le mouvement vibratoire de la colonne d’air s’inscrit sur un papier noirci sous la forme d’une ligne sinueuse. » (Malmberg 1940: 63) (“The word or phrase in question is pronounced into a mouthpiece connected by an elastic tube to a phonographic diaphragm fitted with a recording stylus and placed in front of a recording cylinder. The vibratory movement of the air column is inscribed on smoked paper in the form of a sinuous line.”)

The first analysed words, anden vs. anden (‘soul’ vs. ‘duck’), revealed a difference in tonal manifestation. More examples of minimal pairs of word accents, displayed in figures and curves, confirmed the observation. Malmberg closed his article by putting emphasis on the significance of word accents as an important research area in experimental phonetics: « Il y aurait ici un vaste champ de travail pour la phonétique expérimentale, surtout si on considère toutes les variations dialectales et individuelles qui existent dans le domaine des langues suédoise et norvégienne. » (Malmberg 1940: 76) (“There would be a vast field of work here for experimental phonetics, especially if one considers all the dialectal and individual variations that exist in the domain of the Swedish and Norwegian languages.”)

Final year in phonetics (1968)
Malmberg's correspondence during the year 1968 is particularly interesting. It contains abundant information, not least about the variety of people writing to Malmberg, the issues raised, and the way Malmberg introduced events of his private life, health, and social and institutional activities into his letters. Some examples will demonstrate the rich contents of the letters, here presented chronologically (in the archive, the correspondence is ordered alphabetically by the initial of the correspondent's surname):
January: The year 1968 began as it should! On January 4, representatives of students sent a questionnaire concerning the Vietnam War to Professor Malmberg. They emphasized that they particularly desired answers from the group of faculty professors. Malmberg apparently neglected to answer. Hence a reminder was dispatched on January 12. A few days later, Malmberg received a prestigious invitation to participate in a "table ronde", a panel discussion on structuralism and sociology. The invitation particularly stressed the expected presence of Lévi-Strauss and other eminent professors of sociology and law. We learn from Malmberg's response, dated February 12, that he had declined the invitation for reasons of poor health. January also saw the beginning of an important correspondence between Malmberg and Max Wajskop [1932-1993]. In this correspondence, some of Malmberg's letters were
going to play a decisive role for the genesis of phonetics in Belgium, and subsequently also in the francophone world at large (see Touati, forthcoming).
February: On February 8, Sture Allén [born 1928] invites Malmberg to participate in a radio program on "Vad är allmän språkvetenskap?" (What is general linguistics?). Once more, he had to decline for reasons of health (letter of February 12). On February 19, the university administration informs him that there were three applicants for the post as lecturer in phonetics at the University of Lund: C.-C. Elert, E. Gårding and K. Hadding-Koch.
March: The school specialized in the education of deaf children in Lund asks Malmberg to help reflect on new structures for its future functioning.
April: Malmberg gives a lecture on “Fonetiska aspekter på uttalsundervisningen i skolor för hörande och hörselskadade” (Phonetic aspects of the teaching of pronunciation in schools for the hearing and the hearing-impaired).
September: The student union informs Malmberg that a Day's Work for the benefit of the students in Hanoi will be organised during the week of September 21 to 29.
October: A letter of thanks is sent to Hans Vogt, professor at the University of Oslo, for the assessment done in view of Malmberg's appointment to the chair of general linguistics in Lund. That same month he received a rather amusing letter from a young man from Sundsvall who asks for his autograph as well as a signed photograph. The young man adds that his vast collection already boasts the autograph of the King of Sweden. Malmberg grants the wishes of his young correspondent on November 4.
November: A "docent" at the University of Uppsala who disagrees with his colleagues about the realization of schwa asks Malmberg to serve as an expert and make a judgment in the matter. In late November, Malmberg has to go to Uppsala to attend lectures given by applicants for two new Swedish professorships in phonetics, at the University of Uppsala and the University of Umeå, respectively. The candidates, who will all become renowned professors in phonetics, are Claes-Christian Elert, Kerstin Hadding-Koch, Björn Lindblom and Sven Öhman. (In 1968 and 1969, there is a strong process of institutionalization of phonetics taking place in Sweden.)

A theoretical dead-end
In an article just three pages long, Malmberg (1968) traces a brief history of phonemes. The argument opens on a structuralist credo: « On est d’accord pour voir dans les éléments du langage humain de toute grandeur et à tous les niveaux de la description scientifique (contenu, expression, différentes fonctions, etc…), des éléments discrets. » (“It is agreed that the elements of human language, of whatever size and at all levels of scientific description (content, expression, different functions, etc.), are discrete elements.”) Malmberg continues with a summary of the efforts of classical phoneticians to produce that monument of phonetic knowledge - the International Phonetic Alphabet - created with the ambition to reflect a universal and physiological (articulatory, according to Malmberg) description of phonemes. The authorities quoted here are Passy, Sweet, Sievers, Forchhammer and Jones. He continues by referring to « l’idée ingénieuse [qui] surgira de décomposer les dits phonèmes […] en unités plus petites et par là même plus générales et de voir chaque phonème comme une combinaison […] de traits distinctifs » (“the ingenious idea [that] would arise of decomposing the said phonemes [...] into smaller and thereby more general units, and of seeing each phoneme as a combination [...] of distinctive features”). In other words, he refers to "Preliminaries to Speech Analysis" by Jakobson, Fant and Halle (1952), a publication which may be considered a turning point in phonetics. Indeed, later in his presentation, Malmberg somehow refutes his own argument about the ability of acoustic properties to be used as an elegant, simple and unitary way of modelling the sounds of language. He highlights the fact that spectrographic analysis reveals the need to appeal to a notion such as the locus in order to describe, in its complexity and variation, the acoustic structure of consonants. Malmberg completed his presentation by emphasizing the following:

« Mais rien n’est stable dans le monde des sciences. En phonétique l’intérêt est en train de se déplacer dans la direction des rapports stimulus et perception […] Si dans mon travail sur le classement des sons du langage de 1952, j’avais espéré retrouver dans les faits acoustiques cet ordre qui s’était perdu en cours de route avec l’avancement des méthodes physiologiques, je deviens maintenant de plus en plus enclin à chercher cet ordre non plus dans les spectres qui les spécifient mais sur le niveau perceptuel. Ma conclusion avait été fondée sur une fausse idée des rapports entre son et impression auditive. Je crois avoir découvert, en travaillant par exemple sur différents problèmes de la prosodie, que ces rapports sont bien plus compliqués que je l’avais pensé au début » (Malmberg 1968: 165). (“But nothing is stable in the world of the sciences. In phonetics, interest is shifting in the direction of the relations between stimulus and perception [...] If, in my 1952 work on the classification of the sounds of language, I had hoped to rediscover in the acoustic facts the order that had been lost along the way with the advancement of physiological methods, I am now becoming more and more inclined to seek that order no longer in the spectra that specify them but at the perceptual level. My conclusion had been founded on a false idea of the relations between sound and auditory impression. I believe I have discovered, by working for example on various problems of prosody, that these relations are far more complicated than I had thought at the outset.”)


As can be seen from reading these lines, Malmberg had the courage to recognize that he had underestimated the difficulties pertaining to the relationship between sound and auditory impression. It seems that Malmberg had a premonition of the cognitive and central role played by the perception of sounds, but he was not able to recognise it properly, since he remained a captive of his structuralist paradigm.

To conclude
In a letter dated October 30, 1968, addressed to his friend and colleague, the Spanish linguist A. Quilis, Malmberg says that he suffers from a limitation: “Tengo miedo un poco del aspecto muy técnico y matemático de la fonética moderna. A mi edad estas cosas se aprenden difícilmente.” (“I am a little afraid of the very technical and mathematical aspect of modern phonetics. At my age these things are learned with difficulty.”) By the end of 1968, Malmberg is thus well aware of the evolution of phonetics and thereby of what had become his own scientific limitations. Empirical phonetic research had taken a radical technological orientation (see Grossetti & Boë 2008). It is undoubtedly with some relief that he took up his new assignment as professor of general linguistics.

Notes
1. And of course , the other grand old man of Swedish phonetics.

References
Ädel A. (2006) Metadiscourse in L1 and L2 English. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Berge K.L. (2003) The scientific text genres as social actions: text theoretical reflections on the relations between context and text in scientific writing. In Fløttum K. & Rastier F. (eds) Academic discourse. Multidisciplinary approaches, 141-157. Oslo: Novus Press.
Bronckart J.-P. (1996) Activité langagière, textes et discours. Pour un interactionnisme socio-discursif. Lausanne-Paris: Delachaux & Niestlé.
Fløttum K. & Rastier F. (eds) (2003) Academic discourse. Multidisciplinary approaches. Oslo: Novus Press.
Grossetti M. & Boë L.-J. (2008) Sciences humaines et recherche instrumentale : qui instrumente qui ? L’exemple du passage de la phonétique à la communication parlée. Revue d’anthropologie des connaissances 3, 97-114.
Gullberg M. (1993) Bertil Malmberg Bibliography. Working Papers 40, 5-24.
Hyland K. (2005) Metadiscourse. Exploring Interaction in Writing. London-New York: Continuum.
Hyland K. & Bondi M. (eds.) (2006) Academic Discourse Across Disciplines. Bern: Peter Lang.
Kerbrat-Orecchioni C. (2005) Le discours en interaction. Paris: Armand Colin.
Malmberg B. (1939) Vad är fonologi? Moderna Språk XXXIII, 203-213.
Malmberg B. (1940) Recherches expérimentales sur l’accent musical du mot en suédois. Archives néerlandaises de Phonétique expérimentale, Tome XVI, 62-76.
Malmberg B. (1968) Acoustique, audition et perception linguistique. Autour du problème de l’analyse de l’expression du langage. Revue d’Acoustique 3-4, 163-166.
Mondada L. (1995) La construction discursive des objets de savoir dans l'écriture de la science. Réseaux 71, 55-77.
Ravelli L.J. & Ellis R.A. (2004) Analysing Academic Writing: Contextualized Frameworks. London: Continuum.
Rossi M. (1996) The evolution of phonetics: A fundamental and applied science. Speech Communication 18(1), 96-102.
Sigurd B. (1995) Bertil Malmberg in memoriam. Working Papers 44, 1-4.
Tognini-Bonelli E. & Del Lungo Camiciotti G. (eds.) (2005) Strategies in Academic Discourse. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Touati P. (2009) De la construction discursive et rhétorique du savoir phonétique en Suède : Bertil Malmberg, phonéticien (1939-1969). In Bernardini P., Egerland V. & Grandfeldt J. (eds) Mélanges plurilingues offerts à Suzanne Schlyter à l’occasion de son 65ème anniversaire. Lund: Études romanes de Lund 85, 417-439.
Touati P. (Forthcoming) De la médiation épistolaire dans la construction du savoir scientifique. Le cas d’une correspondance entre phonéticiens. Revue d’anthropologie des connaissances.


How do Swedish encyclopedia users want pronunciation to be presented?

Michaël Stenberg
Centre for Languages and Literature, Lund University

Abstract
This paper about the presentation of pronunciation in Swedish encyclopedias is part of a doctoral dissertation in progress. It reports on a panel survey of how users view the presentation of pronunciation by transcriptions and recordings, so-called audio pronunciations. The following main issues are dealt with: What system should be used to render stress and segments? For what words should pronunciation be given (only entry headwords or other words as well)? What kind of pronunciation should be presented (standard vs. local, original language vs. swedicized)? How detailed should a phonetic transcription be? How should ‘audio pronunciations’ be recorded (human vs. synthetic speech, native vs. Swedish speakers, male vs. female speakers)? Results show that a clear majority preferred IPA transcriptions to ‘respelled pronunciation’ given in ordinary orthography. An even vaster majority (90%) did not want stress to be marked in entry headwords but in separate IPA transcriptions. Only a small number of subjects would consider using audio pronunciations made up of synthetic speech.

Introduction
In spite of phonetic transcriptions having been used for more than 130 years to show pronunciation in Swedish encyclopedias, very little is known about users’ preferences and their opinion of existing methods of presenting pronunciation. I therefore decided to procure information on this. Rather than asking a random sample of more than 1,000 persons, as in customary opinion polls, I chose to consult a smaller panel of persons with a high probability of being experienced users of encyclopedias. This meant a qualitative method and more qualified questions than in a mass survey.

Method
A questionnaire made up of 24 multiple-choice questions was compiled. Besides these, there were four introductory questions about age and linguistic background. In order to evaluate the questions, a pilot test was first made, with five participants: library and administrative staff, and students of linguistics, though not specializing in phonetics. This pilot test, conducted in March 2009, resulted in some of the questions being revised for the sake of clarity.
The survey proper was carried out in March―April 2009. Fifty-four subjects between 19 and 80 years of age, all of them affiliated to Lund University, were personally approached. No reward was offered for participating. Among them were librarians, administrative staff, professors, researchers and students. Their academic studies comprised Linguistics (including General Linguistics and Phonetics), Logopedics, Audiology, Semiology, Cognitive Science, English, Nordic Languages, German, French, Spanish, Italian, Polish, Russian, Latin, , Japanese, Translation Program, Comparative Literature, Film Studies, Education, Law, Social Science, Medicine, Biology and Environmental Science. A majority of the subjects had Swedish as their first language; however, the following languages were also represented: Norwegian, Dutch, German, Spanish, Portuguese, Romanian, Russian, Bulgarian and Hebrew.
The average time for filling in the 11-page questionnaire was 20 minutes. Each question had 2―5 answer options. As a rule, only one of them should be marked, but for questions where more than one option was chosen, each subject’s score was evenly distributed over the options marked. Some follow-up questions were not to be answered by all subjects. In a small number of cases, questions were mistakenly omitted. The percentage of answers for a certain option has always been based on the actual number of subjects who answered each question. For many of the questions, an opportunity for comments was provided.
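The tallying rule described in the Method section (one vote per respondent per question, split evenly when several options were marked, and percentages based only on those who actually answered) can be sketched in a few lines. This is an illustrative reconstruction, not code from the study, and the data below are invented:

```python
from collections import defaultdict

def tally(answers):
    """Score one survey question.

    `answers` holds, per subject, the list of options he or she marked;
    an empty list means the subject skipped (or mistakenly omitted) the
    question. Each responding subject contributes one point, split evenly
    over the options marked; percentages are computed over the actual
    number of subjects who answered the question."""
    scores = defaultdict(float)
    respondents = 0
    for marked in answers:
        if not marked:
            continue                      # omitted answers are excluded from the base
        respondents += 1
        share = 1.0 / len(marked)         # even split over the options marked
        for option in marked:
            scores[option] += share
    return {option: 100.0 * s / respondents for option, s in scores.items()}

# Invented example: four subjects, one of whom marked two options
# and one of whom skipped the question.
percentages = tally([["a"], ["a"], ["a", "b"], []])
```

With these invented answers, option (a) receives 2.5 of the 3 valid points, i.e. about 83%, and (b) the remaining 17%.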
In a few cases, comments made by subjects have led to reinterpretation of their answers: if the choice of a given option did not coincide with a comment on it, the answer has been interpreted in accordance with the comment.

Questions and results
The initial question concerned the main motive for seeking pronunciation advice in encyclopedias. As might have been expected, a vast majority, 69%, reported that they personally wanted to know the pronunciation of items they were looking up, but, interestingly enough, for 13% the reason was to resolve disputes about pronunciation. Others used the pronunciation advice to feel more secure in company or to prepare themselves for speaking in public.
When it came to the purpose of the advice given, almost half of the subjects (44%) wanted it to be descriptive (presenting one or more existing pronunciations). The other options were prescriptive and guiding, the latter principle being adopted by several modern encyclopedias.
For entries consisting of personal names, a striking majority, 97%, wanted pronunciation to be given not only for second (family) names but also for first names, at least for persons who are always referred to by both names. This result is quite contrary to the prevalent tradition in Sweden, where pronunciation is provided exclusively for second names. Somewhat surprisingly, a majority of 69% wanted pronunciation (or stress) only to be given for entry headings, not for scientific terms mentioned later. Of the remaining 31%, however, 76% wanted stress to be marked in scientific terms, e.g., Calendula officinalis, mentioned either initially only or also further down in the article text.

Notation of prosodic features
The next section covered stress and tonal features. 46% considered it sufficient to mark main stress, whereas main plus secondary stress was preferred by 31%. The rest demanded that even a third degree of stress be featured. Such a system was used in John Wells’s Longman Pronunciation Dictionary, but was abandoned with its 3rd edition (2008).
70% of the subjects wanted tonal features to be displayed, and 75% of those thought Swedish accent 1 and 2 and the corresponding Norwegian tonelag features would suffice to be shown.
A number of systems for marking stress exist, both within phonetic transcriptions in square brackets and outside these, in words written in normal orthography. In table 1 the examples of systems for marking stress in entry headings are given. However, subjects showed a strong tendency to dislike having stress marked in entry headings. As many as 90% favoured a separate IPA transcription instead. According to the comments made, the reason was that they did not want the image of the orthographic word to be disturbed by signs that could possibly be misread.
Table 1 shows five different ways of marking stress in orthographic words that the panel had to evaluate. The corresponding IPA transcriptions of the four words would be [noˈbɛl], [ˈmaŋkəl], [ˈramˌløːsa] and [ɧaˈmɑːn].

Table 1. Examples of systems for marking main stress in orthographic words: (a) IPA system as used by Den Store Danske Encyklopædi, (b) Nationalencyklopedin & Nordisk Familjebok 2nd edn. system, (c) SAOL (the Swedish Academy Wordlist), Svensk uppslagsbok & NE:s ordbok system, (d) Bra Böckers lexikon & Lexikon 2000 system, (e) Brockhaus, Meyers & Duden Aussprachewörterbuch system.1

(a) Noˈbel  ˈMankell  ˈRamlösa  schaˈman
(b) Nobe´l  Ma´nkell  Ra´mlösa  schama´n
(c) Nobel´  Man´kell  Ram´lösa  schama´n
(d) Nobel  Mankell  Ramlösa  schaman
(e) Nobel  Mankell  Ramlösa  schama̱n

In case stress was still to be marked in entry headings, the subjects’ preferences for the above systems were as follows:

(a): 51%
(b): 11%
(c): 9%
(d): 6%
(e): 20%

As the figures show, this meant strong support for IPA, whereas three of the systems widely used in Sweden were largely dismissed. System (e) is a German one, used in works with Max Mangold on the board of editors. It has the same economic advantages as (c), and is well suited for Swedish, where quantity is in complementary distribution between vowels and consonants in stressed syllables. System (d), which does not account for quantity, can be seen as a simplification of (e). It seems to have been introduced in Sweden by Bra Böckers Lexikon, a very widespread Swedish encyclopedia, having the Danish work Lademanns Leksikon as its
model, published from 1973 on and now superseded by Lexikon 2000. The only Swedish encyclopedia where solely IPA transcriptions in brackets are used appears to be Respons (1997—8), a minor work of c. 30,000 entries, which is an adaptation of the Finnish Studia, aimed at young people. Its pronunciation system is, however, conceived in Sweden.
It ought to be mentioned that SAOB (Svenska Akademiens ordbok), the vast dictionary of the Swedish language, which began to be published in 1898 (sic!) and is still being published, uses a system of its own. The above examples would be represented as follows: nåbäl3, maŋ4kel, ram3lø2sa, ʃama4n. The digits 1—4 represent different degrees of stress and are placed in the same way as the stress marks in system (c) above, their position thus denoting quantity, from which the quality of the a’s could, in turn, be derived. The digits also express accent 1 (in Mankell) and accent 2 (in Ramlösa). Being complex, this system has not been used in any encyclopedia.

Notation of segments
For showing the pronunciation of segments, there was a strong bias, 80%, in favour of the IPA, possibly with some modifications, whereas the remaining 20% wanted only ordinary letters to be used. Two questions concerned the narrowness of transcriptions. Half of the subjects wanted transcriptions to be as narrow as in a textbook of the language in question, 31% narrow enough for a word to be identified by a native speaker if pronounced in accordance with the transcription. The remaining 19% thought that narrowness should be allowed to vary from language to language. Those who were of this opinion had the following motives for making a narrower transcription for a certain language: the language is widely studied in Swedish schools (e.g., English, French, German, Spanish), 47%; the language is culturally and geographically close to Sweden (e.g., Danish, Finnish), 29%; the pronunciation of the language is judged to be easy for speakers of Swedish without knowledge of the language in question (e.g., Italian, Spanish, Greek), 24%. More than one option had often been marked.

What pronunciation to present?
One section dealt with the kinds of pronunciation to present. An important dimension is swedicized—foreign, another one standard—local. Like loanwords, many foreign geographical names, e.g., Hamburg, London, Paris, Barcelona, have obtained a standard, swedicized pronunciation, whereas other ones, sometimes—but not always—less well-known, e.g., Bordeaux, Newcastle, Katowice, have not. The panel was asked how to treat the two types of names. A majority, 69%, wanted a swedicized pronunciation, if established, to be given, otherwise the original pronunciation. However, the remaining 31% would even permit the editors themselves to invent a pronunciation considered easier for speakers of Swedish in ‘difficult’ cases where no established swedifications exist, like Łódź and Poznań. Three subjects commented that they wanted both the original and the swedicized pronunciation to be given for Paris, Hamburg, etc.
In most of Sweden, /r/ + dentals are amalgamated into retroflex sounds, [ʂ], [ʈ], [ɖ] etc. In Finland, however, and in southern Sweden, where /r/ is always realized as [ʁ] or [ʀ], the /r/ and the dentals are pronounced separately. One question put to the panel was whether such /r/ + dental sequences should be transcribed as retroflex sounds—as in the recently published Norstedts svenska uttalsordbok (a Swedish pronunciation dictionary)—or as sequences of [r] and dentals—as in most encyclopedias. The scores were 44% and 50% respectively, with an additional 6% answering by an option of their own: the local pronunciation of a geographical name should decide. No one in the panel was from Finland, but 71% of those members with Swedish as their first language were speakers of dialects lacking retroflex sounds.
Particularly for geographical names, two different pronunciations often exist side by side: one used by the local population, and another, a so-called reading pronunciation, used by people from outside, and sometimes by the inhabitants when speaking to strangers. The latter could be described as the result of somebody—who has never heard the name pronounced—reading it and making a guess at its pronunciation. Often the reading pronunciation has become some sort of national standard. A Swedish example is the ancient town of Vadstena, on site pronounced [ˈvasˌsteːna], elsewhere mostly [ˈvɑːdˌsteːna]. The reading pronunciation was preferred by 62% of the subjects, the local one by 22%. The remainder also opted for the local pronunciation, provided it did not contain any phonetic features alien to speakers of standard Swedish.

For English, Spanish and Portuguese, different standards exist in Europe, the Americas and other parts of the world. The panel was asked whether words in these languages should be transcribed in one standard variety for each language (e.g., Received Pronunciation, Madrid Spanish and Lisbon Portuguese), in one European and one American pronunciation for each language, or whether the local standard pronunciation (e.g., Australian English) should as far as possible be provided. The scores obtained were 27%, 52% and 21%, respectively. Obviously, the panel felt a need to distinguish between European and American pronunciations, which is done in Nationalencyklopedin. It could be objected that native speakers of the languages in question use their own variety, irrespective of topic. On the other hand, it may be controversial to transcribe a living person’s name in a way alien to him-/herself. For example, the name Berger is pronounced [ˈbɜːdʒə] in Britain but [ˈbɜ˞ːgər] in the U.S.

Audio pronunciations
There were five questions about audio pronunciations, i.e. clickable recordings. The first was whether such recordings should be read by native speakers in the standard variety of the language in question (as is done in the digital versions of Nationalencyklopedin) or by one and the same speaker with a swedicized pronunciation. Two thirds chose the first option.
The next question dealt with speaker sex. More than 87% wanted both male and female speakers, evenly distributed, while 4% preferred female and 8% male speakers. One of the subjects opting for male speakers commented that men, or women with voices in the lower frequency range, were preferable since they were easier to perceive for many persons with a hearing loss.
Then subjects were asked if they would like to use a digital encyclopedia where pronunciation was presented by means of synthetic speech recordings. 68% were clearly against, and of the remaining 32%, some expressed reservations like ‘Only if extremely natural’, ‘If I have to’ and ‘I prefer natural speech’.
In the following question, the panel was asked how it would most frequently act when seeking pronunciation information in a digital encyclopedia with both easily accessible audio pronunciations and phonetic transcriptions. No less than 71% declared that they would use both possibilities (which seems to be a wise strategy), 19% would just listen, whereas the remaining 10% would stick to the transcriptions.
This section concluded with a question about the preferred way to describe the speech sounds represented by the signs. Should it be done by means of articulation descriptions like ‘voiced bilabial fricative’, by example words from languages where the sound appears, as in ‘[β] Spanish saber, jabón’, or by clickable recordings? Or by a combination of these? The scores were approximately 18%, 52% and 31%, respectively. Several subjects preferred combinations. In such cases, each subject’s score was evenly distributed over the options marked.

Familiarity with IPA alphabet
In order to provide an idea of how familiar the panel members were with the IPA alphabet, they were finally presented with a chart of 36 frequently used IPA signs and asked to mark those they felt sure of how to pronounce. The average number of signs marked turned out to be 17. Of the 54 panel members, 6 did not mark any sign at all. The top scores were [æ]: 44, [ʃ] and [o]: both 41, [u]: 40, [ə]: 39, [a]: 37 and [ʒ]: 35. Somewhat surprisingly, [ʔ] obtained no less than 17 marks.

Discussion
Apparently, Sweden and Germany are the two countries where the need for pronunciation information in encyclopedias is best satisfied. Many important works in other countries either do not supply pronunciation at all (Encyclopædia Britannica), or do so only sparingly (Grand Larousse universel and Den Store Danske Encyklopædi), instead referring their users to specialized pronunciation dictionaries. This solution is unsatisfactory because (i) such works are not readily available; (ii) they are difficult for a layman to use; (iii) you have to consult several works with different notations; (iv) you will be unable to find the pronunciation of many words, proper names in particular.
An issue that pronunciation editors have to consider, but that was not taken up in the survey, is how formal—casual the presented pronunciation should be. It is a rather theoretical problem, complicated to explain to panel members if they are not able to listen to any recordings. Normally, citation forms are given, but it can be of importance to have set rules for how
coarticulation and sandhi phenomena should be treated.
Another tricky task for pronunciation editors is to characterize the pronunciation of the phonetic signs. As one subject pointed out in a comment, descriptions like ‘voiced bilabial fricative’ do not tell you much unless you have been through an elementary course of phonetics. Neither do written example words serve their purpose to users without knowledge of the languages in question. It is quite evident that audio recordings of the written example words, in various languages for each sign so as to illustrate its phonetic range, would really add something.
The panel favoured original-language pronunciation both in transcriptions (69% or more) and in audio recordings (67%). At least in Sweden, learners of foreign languages normally aim at a pronunciation as native-like as possible. However, this might not always be valid for encyclopedia users. When speaking your mother tongue, pronouncing single foreign words in a truly native-like way may appear snobbish or affected. Newsreaders usually do not change their base of articulation when encountering a foreign name. A general solution is hard to find. Since you do not know for what purpose users are seeking pronunciation advice, adopting a fixed level of swedicization would not be satisfactory. The Oxford BBC Guide to pronunciation has solved this problem by supplying two pronunciations: an anglicized one, given as ‘respelled pronunciation’, and another one, closer to the original language, transcribed in IPA.

Interim strategies
One question was an attempt to explore the strategies most frequently used by the subjects when they had run into words they did not know how to pronounce, in other words to find out what was going on in their minds before they began to seek pronunciation advice. The options and their scores were as follows:

(a) I guess at a pronunciation and then use it silently to myself: 51%
(b) I imagine the word pronounced in Swedish and then I use that pronunciation silently to myself: 16%
(c) I can’t relax before I know how to pronounce the word; therefore, I avoid all conjectures and immediately try to find out how the word is pronounced: 22%
(d) I don’t imagine any pronunciation at all but memorize the image of the written word and link it to the concept it represents: 11%

It can be doubted whether (d) is a plausible option for people using alphabetical script. One subject commented that it was not. Anyway, it seems that it would be more likely to be used by those brought up in the tradition of iconographic script. Researchers of the reading process might be able to judge.
The outcome is that the panel is rather reluctant to use Swedish pronunciation, even tentatively, for foreign words, like saying for example [ˈʃɑːkəˌspeːarə] for Shakespeare or [ˈkɑːmɵs] for Camus, pronunciations that are sometimes heard from Swedish children. Rather, they prefer to make guesses like [ˈgriːnwɪtʃ] for Greenwich, as is frequently done in Sweden.

Conclusion
Sweden has grand traditions in the field of presenting pronunciation in encyclopedias, but this does not mean that they should be left unchanged. It is quite evident from the panel’s answers that the principle of not giving pronunciation for first names is totally outdated.
The digital revolution provides new possibilities. Since there is now space galore, it allows for showing more than one pronunciation, e.g., one standard and one regional variety. Besides allowing audio recordings of entry headings, it makes for better descriptions of the sounds represented by the various signs, by complementing written example words in various languages with sound recordings of them.
IPA transcriptions should be favoured when producing new encyclopedias. The Internet has contributed to an increased use of the IPA, especially on Wikipedia, but since the authors of those transcriptions do not always have sufficient knowledge of phonetics, the correctness of certain transcriptions ought to be questioned. The extent to which transcriptions should be used, and how detailed they should be, must depend on the kind of reference book and on the group of users aimed at. Nevertheless, account must always be taken of the many erroneous pronunciations that exist and continue to spread, e.g., [ˈnætʃənəl] for the English word national, a result of Swedish influence.


Acknowledgements
I wish to thank all members of the panel for their kind help. Altogether, they have spent more than two working days on answering my questions, without being paid for it.

Notes
1. In Bra Böckers Lexikon and Lexikon 2000, system (d), dots under vowel signs, is used for denoting main stress also in transcriptions within brackets, where segments are rendered in IPA.
2. Also available free of charge on the Internet.

References
Bra Böckers Lexikon (1973–81 and later edns.) Höganäs: Bra Böcker.
Den Store Danske Encyklopædi (1994–2001) Copenhagen: Danmarks Nationalleksikon.
Duden Aussprachewörterbuch, 6th edn., revised and updated (2005) Mannheim: Dudenverlag.
Elert, C.-C. (1967) Uttalsbeteckningar i svenska ordlistor, uppslags- och läroböcker. In Språkvård 2. Stockholm: Svenska språknämnden.
Garlén, C. (2003) Svenska språknämndens uttalsordbok. Stockholm: Svenska språknämnden. Norstedts ordbok.
Lexikon 2000 (1997–9) Malmö: Bra Böcker.
Nationalencyklopedin (1989–96) Höganäs: Bra Böcker.
Nordisk familjebok, 2nd edn. (1904–26) Stockholm: Nordisk familjeboks förlag.
Olausson, L. and Sangster, C. (2006) Oxford BBC Guide to pronunciation. Oxford: Oxford Univ. Press.
Respons (1997–8) Malmö: Bertmarks.
Rosenqvist, H. (2004) Markering av prosodi i svenska ordböcker och läromedel. In Ekberg, L. and Håkansson, G. (eds.) Nordand 6. Sjätte konferensen om Nordens språk som andraspråk. Lund: Lunds universitet, Institutionen för nordiska språk.
Svenska Akademiens ordbok (1898–) Lund: C.W.K. Gleerup.2
Svenska Akademiens ordlista, 13th edn. (2006) Stockholm: Norstedts akademiska förlag (distributor).2
Svensk uppslagsbok, 2nd edn. (1947–55) Malmö: Förlagshuset Norden.


LVA-technology – The illusion of “lie detection”1

Francisco Lacerda
Department of Linguistics, Stockholm University

Abstract
The new speech-based lie-detection LVA-technology is being used in some countries to screen applicants, passengers or customers in areas like security, medicine, technology and risk management (anti-fraud). However, a scientific evaluation of this technology and of the principles on which it relies indicates, not surprisingly, that it is neither valid nor reliable. This article presents a scientific analysis of the LVA-technology and demonstrates that it simply cannot work.

Introduction
After the attacks of September 11, 2001, the demand for security technology was considerably (and understandably) boosted. Among the security solutions emerging in this context, the Nemesysco Company’s applications claim to be capable of determining a speaker’s mental state from the analysis of samples of his or her voice. In popular terms, Nemesysco’s devices can be generally described as “lie-detectors”, presumably capable of detecting lies using short samples of an individual’s recorded or on-line captured speech. However, Nemesysco claims their products can do much more than this. Their products are supposed to provide a whole range of descriptors of the speaker’s emotional status, such as exaggeration, excitement and “outsmarting”, using a new “method for detecting emotional status of an individual” through the analysis of samples of her speech. The key component is Nemesysco’s patented LVA-technology (Liberman, 2003). The technology is presented as unique and applicable in areas such as security, medicine, technology and risk management (anti-fraud). Given the consequences that applications in these areas may have for the lives of screened individuals, a scientific assessment of this LVA-technology should be in the public’s and the authorities’ interest.

Nemesysco’s claims
According to Nemesysco’s web site, “LVA identifies various types of stress, cognitive processes and emotional reactions which together comprise the ‘emotional signature’ of an individual at a given moment, based solely on the properties of his or her voice”i. Indeed, “LVA is Nemesysco’s core technology adapted to meet the needs of various security-related activities, such as formal police investigations, security clearances, secured area access control, intelligence source questioning, and hostage negotiation”ii and “(LVA) uses a patented and unique technology to detect ‘brain activity traces’ using the voice as a medium. By utilizing a wide range spectrum analysis to detect minute involuntary changes in the speech waveform itself, LVA can detect anomalies in brain activity and classify them in terms of stress, excitement, deception, and varying emotional states, accordingly”. Since the principles and the code used in the technology are described in the publicly available US 6,638,217 B1 patent, a detailed study of the method was possible and its main conclusions are reported here.

Deriving the “emotional signature” from a speech signal
While assessing a person’s mental state using the linguistic information provided by the speaker (essentially by listening to and interpreting the person’s own description of her or his state of mind) might, in principle, be possible if based on an advanced speech recognition system, Nemesysco’s claim that the LVA-technology can derive “mental state” information from “minute involuntary changes in the speech waveform itself” is at least astonishing from both a phonetic and a general scientific perspective. How the technology accomplishes this is however rather unclear. No useful infor-

1 This text is a modified version of Lacerda (2009), “LVA-technology – A short analysis of a lie”, available online at http://blogs.su.se/frasse/, and is intended for a general educated but not specialized audience.

mation is provided on the magnitude of these “minute involuntary changes”, but the wording conveys the impression that these are very subtle changes in the amplitude and time structure of the speech signal. A reasonable assumption is to expect the order of magnitude of such “involuntary changes” to be at least one or two orders of magnitude below typical values for speech signals, inevitably leading to the first issue in the series of ungrounded claims made by Nemesysco. If the company’s reference to “minute changes” is to be taken seriously, then such changes are at least 20 dB below the speech signal’s level and therefore masked by typical background noise. For a speech waveform captured by a standard microphone in a common reverberant room, the magnitude of these “minute changes” would be comparable to that of the disturbances caused by reflections of the acoustic energy from the walls, ceiling and floor of the room. In theory, it could be possible to separate the amplitude fluctuations caused by room acoustics from fluctuations associated with the presumed “involuntary changes”, but the success of such a separation procedure is critically dependent on the precision with which the acoustic signal is represented and on the precision and adequacy of the models used to represent the room acoustics and the speaker’s acoustic output. This is a very complex problem that requires multiple sources of acoustic information to be solved. Also, the reliability of the solutions to the problem is limited by factors like the precision with which the speaker’s direct wave-front (originating from the speaker’s mouth, nostrils, cheeks, throat, breast and other radiating surfaces) and the room acoustics can be described. Yet another issue raised by such “sound signatures” is that they are not even physically possible, given the masses and the forces involved in speech production. The inertia of the vocal tract walls, velum and vocal folds and the very characteristics of the phonation process lead to the inevitable conclusion that Nemesysco’s claims of picking up that type of “sound signatures” from the speaker’s speech waveform are simply not realistic. It is also possible that these “minute changes” are thought of as spreading over several periods of vocal-fold vibration. In this case they would be observable, but typically not “involuntary”. Assuming for a moment that the signal picked up by Nemesysco’s system would not be contaminated with room acoustics and background noise, the particular temporal profile of the waveform is essentially created by the vocal tract’s response to the pulses generated by the vocal folds’ vibration. However, these pulses are neither “minute” nor “involuntary”. The changes observed in the details of the waveforms can simply be the result of the superposition of pulses that interfere at different delays. In general, the company’s descriptions of the methods and principles are circular, inconclusive and often incorrect. This conveys the impression of superficial knowledge of acoustic phonetics, obviously undermining the credibility of Nemesysco’s claims that the LVA-technology performs a sophisticated analysis of the speech signal. As to the claim that the products marketed by Nemesysco would actually be able to detect the speaker’s emotional status, there is no known independent evidence to support it. Given the current state of knowledge, unless the company is capable of presenting scientifically sound arguments, or at least of producing independent and replicable empirical data showing that there is a significant difference between their systems’ hit and false-alarm rates, Nemesysco’s claims are unsupported.

How LVA-technology works
This section examines the core principles of Nemesysco’s LVA-technology, as available in the Visual Basic code in the method’s patent.

Digitizing the speech signal
For a method claiming to use information from minute details in the speech wave, it is surprising that the sampling frequency and the sample size are as low as 11.025 kHz and 8 bits per sample. By itself, this sampling frequency is acceptable for many analysis purposes but, without knowing which information the LVA-technology is supposed to extract from the signal, it is not possible to determine whether 11.025 kHz is appropriate or not. In contrast, the 8-bit samples inevitably introduce clearly audible quantization errors that preclude the analysis of “minute details”. With 8-bit samples, only 256 levels are available to encode the sampled signal’s amplitude, rather than the 65536 quantization levels associated with 16-bit samples. In acoustic terms, this reduction in sample length is associated with a 48 dB increase of

the background noise relative to what would have been possible using a 16-bit/sample representation. It is puzzling that such crude signal representations are used by a technology claiming to work on “details”. But the degradation of the amplitude resolution becomes even worse, as a “filter” that introduces a coarser quantization in 3-unit steps reduces the 256 levels of the 8-bit representation to only 85 quantization levels (ranging from -42 to +42). This very low sample resolution (something around 6.4 bit/sample), resulting in a terrible sound quality, is indeed the basis for all the subsequent signal processing carried out by the LVA-technology. The promise of an analysis of “minute” details in the speech waveform cannot be taken seriously. Figure 1 displays a visual analogue of the signal degradation introduced by the LVA-technology.

Figure 1. Visual analogues of the LVA-technology’s speech signal input. The 256×256 pixel image, corresponding to 16-bit samples, is sampled down to 16×16 pixels (8-bit samples) and finally down-sampled to approximately 9×9 pixels, representing the ±42 levels of amplitude encoding used by the LVA-technology.

The core analysis procedure
In the next step, the LVA-technology scans that crude speech signal representation for “thorns” and “plateaus”, using triplets of consecutive samples.

“Thorns”
According to Nemesysco’s definition, thorns are counted every time the middle sample is higher than the maximum of the first and third samples, provided all three samples are above an arbitrary threshold of +15. Similarly, a thorn is also detected when the middle sample value is lower than the minimum of both the first and the third samples in the triplet and all three samples are below -15. In short, thorns are local maxima if the triplet is above +15 and local minima if the triplet is below -15. Incidentally, this is not compatible with the illustration provided in fig. 2 of the patent, where any local maxima or minima are counted as thorns, provided the three samples fall outside the (-15;+15) threshold interval.

“Plateaus”
Potential plateaus are detected when the samples in a triplet have a maximum absolute amplitude deviation that is less than 5 units. The ±15 threshold is not used in this case, but to count as a plateau the number of samples in the sequence must be between 5 and 22. The number of occurrences of plateaus and their lengths are the information stored for further processing.

A blind technology
Although Nemesysco presents a rationale for the choice of these “thorns” and “plateaus” that simply does not make sense from a signal processing perspective, there are several interesting properties associated with these peculiar variables. The crucial temporal information is completely lost during this analysis. Thorns and plateaus are simply counted within arbitrary chunks of the poorly represented speech signal, which means that a vast class of waveforms created by shuffling the positions of the thorns and plateaus are indistinguishable from each other in terms of totals of thorns and plateaus. Many of these waveforms may not even sound like speech at all. This inability to distinguish between different waveforms is a direct consequence of the information loss accomplished by the signal degradation and the loss of temporal information. In addition to this, the absolute values of the amplitudes of the thorns can be arbitrarily increased up to the ±42 maximum level, creating yet another variant of physically different waveforms that are interpreted as identical from the LVA-technology’s perspective.
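To make the scan concrete, the coarse requantization and the thorn/plateau rules can be sketched in a few lines of Python. This is a sketch of the procedure as described above, not Nemesysco’s actual patent code; the function names and toy signals are mine, while the numerical criteria (3-unit requantization steps, the ±42 range, the ±15 thorn threshold, the 5-unit flatness criterion and the 5–22 sample plateau length) are the ones quoted in the text. The way runs are split into plateaus is one plausible reading, since the patent itself is ambiguous on this point.

```python
def requantize(samples):
    """Coarse 3-unit requantization of 8-bit samples (-128..127) onto the
    85 levels -42..+42 described above.  (Dropping 16-bit for 8-bit
    already raises the quantization noise floor by
    20*log10(65536/256) ~ 48 dB.)"""
    return [max(-42, min(42, round(s / 3))) for s in samples]

def count_thorns(x):
    """Thorns: a middle sample strictly above both neighbours with the
    whole triplet above +15, or strictly below both neighbours with the
    whole triplet below -15."""
    count = 0
    for a, b, c in zip(x, x[1:], x[2:]):
        if b > max(a, c) and min(a, b, c) > 15:
            count += 1
        elif b < min(a, c) and max(a, b, c) < -15:
            count += 1
    return count

def plateau_lengths(x):
    """Plateaus: runs of 5-22 samples in which adjacent samples differ
    by less than 5 units.  Note that a slowly rising ramp satisfies this
    'flatness' criterion, as pointed out in the text."""
    lengths, run = [], 1
    for prev, cur in zip(x, x[1:]):
        if abs(cur - prev) < 5:
            run += 1
        else:
            if 5 <= run <= 22:
                lengths.append(run)
            run = 1
    if 5 <= run <= 22:
        lengths.append(run)
    return lengths

print(count_thorns([0, 20, 30, 20, 0]))        # 1: one local max above +15
print(plateau_lengths(list(range(0, 40, 4))))  # [10]: a ramp counts as "flat"
```

Note how little of the waveform survives this stage: any two signals with the same thorn count and the same multiset of plateau lengths become indistinguishable, which is exactly the “blind technology” point.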


The measurement of the plateaus appears to provide only very crude information and is affected by some flaws. Indeed, the program code allows for triplets to be counted as both thorns and plateaus. Whether this is intentional or just a programming error is impossible to determine, because there is no theoretical model behind the LVA-technology against which this could be checked. In addition, what is counted as a plateau does not even have to look like a plateau. An increasing or decreasing sequence of samples where the differences between adjacent samples are less than ±5 units will count as a plateau. Only the length and the duration of these plateaus are used and, because the ±5 criterion is actually a limitation on the derivative of the amplitude function, large amplitude drifts can occur in sequences that are still viewed by the LVA-technology as if they were flat. Incidentally, given that these plateaus can be up to 22 samples long, the total span of the amplitude drift within a plateau can be as large as 88 units, which would allow for a ramp to sweep through the whole range of possible amplitudes (-42 to +42). This is hardly compatible with the notion of high-precision technology suggested by Nemesysco. Finally, in addition to the counting of plateaus, the program also computes the square root of the cumulative absolute deviation for the distribution of the plateau lengths. Maybe the intention was to compute the standard deviation of the sample distribution and this is yet another programming error, but since there is no theoretical rationale it is impossible to discuss this issue.

Assessing the speaker’s emotional state
The rest of the LVA-technology simply uses the information provided by these four variables: (1) the number of thorns per sample, (2) the number of counts of plateaus per sample, (3) the average length of the plateaus and (4) the square root of their cumulative absolute deviation. From this point on, the program code is no longer related to any measurable physical events. In the absence of a theoretical model, the discussion of this final stage and its outcome is obviously meaningless. It is enough to point out that the values of the variables used to issue the final statements concerning the speaker’s emotional status are as arbitrary as any other and of course contain no more information than what already was present in the four variables above.

Examples of waveforms that become associated with “LIES”2
Figure 2 shows several examples of a synthetic vowel that was created by superimposing a glottal pulse extracted from a natural production, with the appropriate delays to generate different fundamental frequencies. After calibration with glottal pulses simulating a vowel with a 120 Hz fundamental frequency, the same glottal pulses are interpreted as indicating a “LIE” if the fundamental frequency is lowered to 70 Hz, whereas a rise in fundamental frequency from 120 Hz to 220 Hz is detected as “outsmart”. Also a fundamental frequency as low as 20 Hz is interpreted as signalling a “LIE”, relative to the 120 Hz calibration. Using the 20 Hz waveform as calibration and testing with the 120 Hz wave is detected as “outsmart”. A calibration with the 120 Hz wave above, followed by the same wave contaminated by some room acoustics, is also interpreted as “outsmart”.

The illusion of a serious analysis
The examples above suggest that the LVA-technology generates outputs contingent on the relationship between the calibration and the test signals. Although the signal analysis performed by the LVA-technology is a naive and ad hoc measurement of essentially irrelevant aspects of the speech signal, the fact that some of the “detected emotions” are strongly dependent on the statistical properties of the “plateaus” leads to outcomes that vaguely reflect variations in F0. For instance, the algorithm’s output tends to be a “lie” when the F0 of the test signal is generally lower than that of the calibration. The main reason for this is that the program issues “lie”-warnings when the number of detected “plateaus” during the analysis phase exceeds by a certain threshold the number of “plateaus” measured during calibration. When the F0 is low, the final portions of the vocal tract’s damped responses to the more sparse glottal pulses will tend to reach lower amplitudes in between consecutive pulses. Given the technology’s very crude amplitude quantization, these low-amplitude oscillations are lost and the sequences are interpreted as plateaus that are longer (and therefore fewer within the analysis window) than those measured in speech segments produced with a higher F0. Such momentary changes in the structure of the plateaus are interpreted by the program’s arbitrary code as indicating “deception”. Under typical circumstances, flagging “lie” in association with a lowering of F0 will give the illusion that the program is doing something sensible, because F0 tends to be lower when a speaker produces fillers during hesitations than when the speaker’s speech flows normally. Since the “lie-detector” is probably calibrated with responses to questions about obvious things, the speaker will tend to answer using a typical F0 range that will generally be higher than when the speaker has to answer questions under split-attention loads. Of course, when asked about events that demand recalling information, the speaker will tend to produce fillers or speak at a lower speech rate, thereby increasing the probability of being flagged by the system as attempting to “lie”, although in fact hesitations or a lowering of F0 are known to be no reliable signs of deception.
Intentionally or by accident, the illusion of seriousness is further enhanced by the random character of the LVA outputs. This is a direct consequence of the technology’s responses to both the speech signal and all sorts of spurious acoustic and digitization accidents. The instability is likely to confuse both the speaker and the “certified examiner”, conveying the impression that the system really is detecting some brain activity that the speaker cannot control3 and may not even be aware of! It may even give the illusion of robustness, as the performance is equally bad in all environments.

Figure 2. Synthetic vowels constructed by algebraic addition of delayed versions of a natural glottal pulse. These waveforms generate different “emotional outputs” depending on the relationship between the F0 of the waveform being tested and the F0 of the “calibration” waveform.

The UK’s DWP’s evaluation of LVA
The UK’s Department for Work and Pensions has recently published statistics on the results of a large and systematic evaluation of the LVA-technologyiii, assessing 2785 subjects and costing £2.4 millioniv. The results indicate that the areas under the ROC curves for seven districts vary from 0.51 to 0.73. The best of these

2 The amplitudes of the waveforms used in this demonstration are encoded at 16 bit per sample.
3 Ironically this is true because the output is determined by random factors associated with room acoustics, background noise, digitization problems, distortion, etc.

results corresponds to a d' of about 0.9, which is a rather poor performance. But the numbers reported in the table probably reflect the judgements of the “Nemesysco-certified” personnel4, in which case the meaningless results generated by the LVA-technology may have been overridden by the personnel’s uncontrolled “interpretations” of the direct outcomes after listening to recordings of the interviews.

Table 1. Evaluation results published by the UK’s DWP.

District        N     True pos.  False pos.  True neg.  False neg.  AUC of ROC curve
Jobcentre Plus  787   354        182         145        106         0.54
Birmingham      145   60         49          3          33          0.73
Derwentside     316   271        22          11         12          0.72
Edinburgh       82    60         8           8          6           0.66
Harrow          268   193        15          53         7           0.52
Lambeth         1101  811        108         153        29          0.52
Wealden         86    70         7           8          1           0.51
Overall         2785  1819       391         381        194         0.65

Conclusions
The essential problem of this LVA-technology is that it does not extract relevant information from the speech signal. It lacks validity. Strictly, the only procedure that might make sense is the calibration phase, where variables are initialized with values derived from the four variables above. This is formally correct but rather meaningless, because the waveform measurements lack validity and their reliability is low because of the huge information loss in the representation of the speech signal used by the LVA-technology. The association of ad hoc waveform measurements with the speaker’s emotional state is extremely naive and ungrounded wishful thinking that makes the whole calibration procedure simply void.
In terms of “lie-detection”, the algorithm relies strongly on the variables associated with the plateaus. Given the phonetic structure of speech signals, this predicts that, in principle, lowering the fundamental frequency and changing the phonation mode towards a more creaky voice type will tend to count as an indication of lie, in relation to a calibration made under modal phonation. Of course this does not have anything to do with lying. It is just the consequence of a common phonetic change in speaking style, in association with the arbitrary construction of the “lie”-variable, which happens to give more weight to plateaus, which in turn are associated with the lower waveform amplitudes towards the end of the glottal periods, in particular when the fundamental frequency is low.
The overall conclusion from this study is that, from the perspectives of acoustic phonetics and speech signal processing, the LVA-technology stands out as a crude and absurd processing technique. Not only does it lack a theoretical model linking its measurements of the waveform with the speaker’s emotional status, but the measurements themselves are so imprecise that they cannot possibly convey useful information. And it will not make any difference if Nemesysco “updates” its LVA-technology. The problem is the concept’s lack of validity. Without validity, “success stories” of “percent detection rates” are simply void. Indeed, these “hit-rates” will not even be statistically significantly different from the associated “false-alarms”, given the method’s lack of validity. Until proof to the contrary, the LVA-technology should simply be regarded as a hoax and should not be used for any serious purposes (Eriksson & Lacerda, 2007).

References
Eriksson, A. and Lacerda, F. (2007) Charlatanry in forensic speech science: A problem to be taken seriously. International Journal of Speech, Language and the Law 14, 169-193.
Liberman, A. (2003) Layered Voice Analysis (LVA). US patent 6,638,217 B1, October 28, 2003.

4 An inquiry on the methodological details of the evaluation was sent to the DWP on 23 April 2009, but the methodological information has not yet been provided.
i http://www.nemesysco.com/technology.html
ii http://www.nemesysco.com/technology-lvavoiceanalysis.html
iii http://spreadsheets.google.com/ccc?key=phNtm3LmDZEME67-nBnsRMw
iv http://www.guardian.co.uk/news/datablog/2009/mar/19/dwp-voice-risk-analysis-statistics
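The relation between the reported AUC values and d' can be checked numerically. Assuming the standard equal-variance Gaussian signal-detection model, AUC = Φ(d'/√2), so d' = √2·Φ⁻¹(AUC); this model choice is my assumption, since the evaluation does not state which one was used. A small Python sketch (the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def dprime_from_auc(auc):
    """Equal-variance Gaussian model: AUC = Phi(d'/sqrt(2)),
    hence d' = sqrt(2) * Phi^{-1}(AUC)."""
    return sqrt(2) * NormalDist().inv_cdf(auc)

# The DWP districts' AUCs ranged from 0.51 to 0.73:
for auc in (0.51, 0.54, 0.65, 0.73):
    print(f"AUC {auc:.2f} -> d' {dprime_from_auc(auc):.2f}")
```

Under this model an AUC of 0.73 gives a d' just under 0.9 (consistent with the figure quoted above), while an AUC of 0.51 gives a d' barely above zero, i.e. essentially chance performance.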


Author index

Al Moubayed 140
Allwood 180
Ambrazaitis 72
Ananthakrishnan 202
Asu 54
Beskow 28, 140, 190
Blomberg 144, 154
Bruce 42, 48
Bulukin Wilén 150
Carlson 86
Cunningham 108
Edlund 102, 190
Eklöf 150
Eklund 92
Elenius, D. 144, 154
Elenius, K. 190
Enflo 24
Engwall 30
Forsén 130
Granström 48, 140
Gustafson, J. 28
Gustafson, K. 86
Gustafsson 150
Hartelius 18
Hellmer 190
Herzke 140
Hincks 102
Horne 66
House 78, 82, 190
Inoue 112
Johansson 130
Karlsson 78, 82
Keränen 116
Klintfors 40, 126
Krull 18
Kügler 54
Lacerda 126, 130, 160, 220
Lång 130
Lindblom 8, 18
Lindh 186, 194
Mårback 92
Marklund 160
McAllister 120
Narel 130
Neiberg 202
Öhrström 150
Ormel 140
Öster 96, 140
Pabst 24
Riad 12
Ringen 60
Roll 66
Salvi 140
Sarwar 180
Schalling 18
Schötz 42, 48, 54
Schwarz 92, 130
Seppänen 116
Simpson 172
Sjöberg 92
Sjölander 36
Söderlund 160
Stenberg 214
Strömbergsson 136, 190, 198
Sundberg, J. 24
Sundberg, U. 40, 126
Suomi 60
Svantesson 78, 82
Tånnander 36
Tayanin 78, 82
Toivanen, A. 176
Toivanen, J. 116, 176
Touati 208
Traunmüller 166
Tronnier 120
Valdés 130
van Son 140
Väyrynen 116
Wik 30
Zetterholm 180


Department of Linguistics
Phonetics group