
Working Papers

ISSN 0280-526X

Lund University Centre for Languages and Literature

General Linguistics Phonetics

Working Papers

52. 2006

Proceedings from Fonetik 2006 Lund, June 7–9, 2006

Edited by Gilbert Ambrazaitis and Susanne Schötz


Working Papers
Department of Linguistics and Phonetics
Centre for Languages and Literature
Lund University
Box 201
S-221 00 LUND
Fax +46 46 2224210
http://www.sol.lu.se/

This issue was edited by Gilbert Ambrazaitis and Susanne Schötz

© 2006 The Authors and the Department of Linguistics and Phonetics, Centre for Languages and Literature, Lund University

ISSN 0280-526X

Printed in Sweden, Mediatryck, Lund, 2006


Preface

The Proceedings of the Nineteenth Swedish Phonetics Conference, Fonetik 2006, comprise this volume of the Working Papers from the Department of Linguistics and Phonetics at Lund University. Fonetik 2006, held at the Centre for Languages and Literature, June 7–9, 2006, is one in the series of yearly conferences for phoneticians and speech scientists in Sweden, which regularly also attract participants from Denmark, Finland and Norway, and sometimes from other countries as well.

There are 38 contributions represented in this volume. A large variety of topics is covered in the papers, and we think that the volume gives a representative overview of current phonetics research in Sweden.

We would like to thank all contributors to the Proceedings. We would also like to acknowledge the valuable support from The Swedish Phonetics Foundation (Fonetikstiftelsen) and from The Centre for Languages and Literature and the Lund University Faculty of the Humanities and Theology.

Lund, May 2006

The Organizing Committee

Gilbert Ambrazaitis, Gösta Bruce, Johan Frid, Per Lindblad, Susanne Schötz, Anders Sjöström, Joost van de Weijer, Elisabeth Zetterholm


Previous Swedish Phonetics Conferences (from 1986)

I      1986  Uppsala University
II     1988  Lund University
III    1989  KTH Stockholm
IV     1990  Umeå University (Lövånger)
V      1991  Stockholm University
VI     1992  Chalmers and Göteborg University
VII    1993  Uppsala University
VIII   1994  Lund University (Höör)
––     1995  (XIIIth ICPhS in Stockholm)
IX     1996  KTH Stockholm (Nässlingen)
X      1997  Umeå University
XI     1998  Stockholm University
XII    1999  Göteborg University
XIII   2000  Skövde University College
XIV    2001  Lund University (Örenäs)
XV     2002  KTH Stockholm
XVI    2003  Umeå University (Lövånger)
XVII   2004  Stockholm University
XVIII  2005  Göteborg University


Contents

Emilia Ahlberg, Julia Backman, Josefin Hansson, Maria Olsson, and Anette Lohmander: Acoustic Analysis of Phonetically Transcribed Initial Sounds in Babbling Sequences from Infants with and without Cleft Palate ...... 1
Gilbert Ambrazaitis and Gösta Bruce: Perception of South Swedish Word Accents ...... 5
Jonas Beskow, Björn Granström, and David House: Focal Accent and Facial Movements in Expressive Speech ...... 9
Ulla Bjursäter: A Study of Simultaneous-masking and Pulsation-threshold Patterns of a Steady-state Synthetic Vowel: A Preliminary Report ...... 13
Petra Bodén and Julia Grosse: Youth Language in Multilingual Göteborg ...... 17
Rolf Carlson, Kjell Gustafson, and Eva Strangert: Prosodic Cues for Hesitation ...... 21
Frantz Clermont and Elisabeth Zetterholm: F-pattern Analysis of Professional Imitations of “hallå” in three Swedish ...... 25
Una Cunningham: Describing Swedish-accented English ...... 29
Wim A. van Dommelen: Quantification of Speech Rhythm in Norwegian as a Second Language ...... 33
Jens Edlund and Mattias Heldner: /nailon/ – Online Analysis of Prosody ...... 37
Olov Engwall: Feedback from Real & Virtual Language Teachers ...... 41
Lisa Gustavsson, Ellen Marklund, Eeva Klintfors, and Francisco Lacerda: Directional Hearing in a Humanoid Robot ...... 45
Gert Foget Hansen and Nicolai Pharao: Microphones and Measurements ...... 49
Mattias Heldner and Jens Edlund: Prosodic Cues for Interaction Control in Spoken Dialogue Systems ...... 53
Pétur Helgason: SMTC – A Swedish Map Task Corpus ...... 57
Snefrid Holm: The Relative Contributions of Intonation and Duration to Degree of Foreign Accent in Norwegian as a Second Language ...... 61

Merle Horne: The Filler EH in Swedish ...... 65
Per-Anders Jande: Modelling Pronunciation in Discourse Context ...... 69
Christian Jensen: Are Verbs Less Prominent? ...... 73
Yuni Kim: Variation and Finnish Influence in Intonation ...... 77
Diana Krull, Hartmut Traunmüller, and Pier Marco Bertinetto: Local Speaking Rate and Perceived Quantity: An Experiment with Italian Listeners ...... 81
Jonas Lindh: A Case Study of /r/ in the Västgöta Dialect ...... 85
Jonas Lindh: Preliminary Descriptive F0-statistics for Young Male Speakers ...... 89
Robert McAllister, Miyoko Inoue, and Sofie Dahl: L1 Residue in L2 Use: A Preliminary Study of Quantity and Tense-lax ...... 93
Yasuko Nagano-Madsen and Takako Ayusawa: Cross-speaker Variations in Producing Attitudinally Varied Utterances in Japanese ...... 97
Daniel Neiberg, Kjell Elenius, Inger Karlsson, and Kornel Laskowski: Emotion Recognition in Spontaneous Speech ...... 101
Susanne Schötz: Data-driven Formant Synthesis of Speaker Age ...... 105
Rein Ove Sikveland: How do we Speak to Foreigners? – Phonetic Analyses of Speech Communication between L1 and L2 Speakers of Norwegian ...... 109
Maria Sjöström, Erik J. Eriksson, Elisabeth Zetterholm, and Kirk P. H. Sullivan: A Switch of Dialect as Disguise ...... 113
Gabriel Skantze, David House, and Jens Edlund: Prosody and Grounding in Dialog ...... 117
Eva Strangert and Thierry Deschamps: The Prosody of Public Speech – A Description of a Project ...... 121
Katrin Stölten: Effects of Age on VOT: Categorical Perception of Swedish Stops by Near-native L2 Speakers ...... 125
Kari Suomi: Stress, Accent and Vowel Durations in Finnish ...... 129
Bosse Thorén: Phonological Demands vs. System Constraints in an L2 Setting ...... 133
Hartmut Traunmüller: Cross-modal Interactions in Visual as Opposed to Auditory Perception of Vowels ...... 137

Marcus Uneson: Knowledge-light Letter-to-Sound Conversion for Swedish with FST and TBL ...... 141
Sidney Wood: The Articulation of Uvular Consonants: Swedish ...... 145
Niklas Öhrström and Hartmut Traunmüller: Acoustical Prerequisites for Visual Hearing ...... 149

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics
Working Papers 52 (2006), 1–4

Acoustic Analysis of Phonetically Transcribed Initial Sounds in Babbling Sequences from Infants with and without Cleft Palate

Emilia Ahlberg, Julia Backman, Josefin Hansson, Maria Olsson, and Anette Lohmander Institute of Neuroscience and Physiology/Speech Pathology, Sahlgrenska Academy at Göteborg University {gusahemi|gusbacju|gushanjos}@student.gu.se, [email protected], [email protected]

Abstract The aim of this study was to compare acoustic analysis of initial sound in babbling with corresponding phonetic transcription. Two speech pathologists had transcribed eight babbling sequences with total disagreement about whether the initial phoneme was a vowel or a plosive. After discussion, however, a consensus judgment was reached. To decide about the initial phoneme, an acoustic analysis was performed. Because of the deficient quality of some of the recordings, the results of the acoustic analysis were not completely reliable. However, the results were in relatively good agreement with the consensus judgments and indicate that the two methods should be used as complements to each other.

1 Background and introduction
Perceptual judgment is the method most commonly used in the clinical practice of speech pathology. However, what a human listener perceives depends on many different factors; among these, a listener’s expectations are of great importance. Categorical perception refers to the tendency of humans to classify speech sounds into different categories (Lieberman and Blumstein 1988; Reisberg 2001). This may cause different listeners to assign a phone that lies between categories to different phonemic groups. In addition, perception is always subjective. Training with direct feedback is considered to be of great importance for creating a common frame of reference for two different judges and for increasing perceptual awareness (Shriberg, 1972). Children with cleft palate often have difficulties attaining sufficient intra-oral pressure when producing plosives. In babbling, the phonetic features of the speech sounds are also less distinctive than in later speech due to maturation. In the acoustic analysis plosives may therefore appear with a less distinct mark in the spectrogram. The purpose of this study was to examine the agreement between perceptual and acoustic analysis in judging the initial sound of babbling sequences of infants with and without cleft palate.

2 Method

2.1 Material
Babbling sequences recorded at 18 months of age from 41 children with and without cleft palate had been phonetically transcribed by two listeners. The transcriptions were made independently, but in close connection with the transcriptions a consensus judgment was made. Consensus rules had been designed in advance to keep the judgments as consistent as possible. Eight of the independent transcriptions showed total disagreement between the two listeners regarding the initial phoneme. These, as well as the corresponding consensus transcriptions, were chosen for acoustic analysis.

2.2 Acoustic analysis
Acoustic analysis of the eight babbling sequences was made to establish the initial sound (vowel or plosive). The judges were unaware of whether the children had a cleft palate or not. The analysis was made using the software Praat (Boersma & Weenink, 2005). Since specific signs of a plosive could not always be found, the analysis had to be complemented by more subtle features: the formant transitions and the intensity (darkness) of the spectrogram were inspected in detail. Noise reduction and, in some cases, frequency filtering were also applied.
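The inspection described above was carried out in Praat. Purely as an illustration of the same kind of inspection (not the procedure actually used in the study), a wideband spectrogram of the initial portion of a recording can be plotted in Python; the file name and analysis settings below are placeholders:

import numpy as np
from scipy.io import wavfile
from scipy import signal
import matplotlib.pyplot as plt

# Hypothetical file name; the study's recordings are not distributed with the paper.
fs, x = wavfile.read("babbling_sequence_01.wav")
x = x.astype(float)
x = x / (np.max(np.abs(x)) + 1e-12)            # normalize amplitude

# Short analysis window (5 ms) gives a wideband-style spectrogram in which
# bursts and formant transitions are visible, comparable to a Praat display.
f, t, Sxx = signal.spectrogram(x, fs, window="hann",
                               nperseg=int(0.005 * fs),
                               noverlap=int(0.004 * fs))
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
plt.ylim(0, 5000)
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Initial portion: look for a burst and formant transitions")
plt.show()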

2.3 Statistical method
To compare data from the judgments made by perceptual and acoustic analysis, the statistics software SPSS was used. The agreement between the different methods, as well as between the judges, was calculated using Cohen’s Kappa.
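The Kappa values reported in Table 3 (Section 3) can be reproduced directly from the judgments listed in Table 2. The following Python sketch uses scikit-learn, which is a substitution of my own for the SPSS computation, not the authors' script:

from sklearn.metrics import cohen_kappa_score

# Judgments for the eight babbling sequences, read off Table 2.
judge1    = ["C", "V", "C", "C", "C", "V", "C", "V"]
judge2    = ["V", "C", "V", "V", "V", "C", "V", "C"]
consensus = ["C", "V", "C", "C", "C", "C", "V", "V"]
acoustic  = ["C", "V", "C", "C", "C", "C", "C", "V"]

print("Acoustic vs consensus:", cohen_kappa_score(acoustic, consensus))
print("Judge 1 vs judge 2:   ", cohen_kappa_score(judge1, judge2))
print("Acoustic vs judge 1:  ", cohen_kappa_score(acoustic, judge1))

These calls reproduce the .71, -.88 and .71 entries of Table 3.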

3 Results and discussion
The acoustic analyses are presented in Table 1. In Table 2 the results from the acoustic and perceptual (phonetic transcription) analyses are presented.

Table 1. Description of the acoustic analysis of the initial sound from eight babbling sequences.
1. Plosive. Possibly approximant since there is no obvious burst.
2. Vowel. Woman and child are speaking at the same time. The woman, however, is using “baby talk”, meaning her F0 is higher and she thereby sounds more like the child. Still, no formant transitions typical of plosives are seen in the utterance.
3. Plosive. Rather distinct burst, where formant transitions are seen.
4. Plosive. Distinct burst and formant transitions are seen.
5. Plosive. Obvious weakening of the formants initially, which implies a more closed mouth and could indicate a plosive.
6. Plosive. At a detailed level a trace of a plosive is seen; it could also be an approximant. The result is very uncertain since an adult and the child are speaking at the same time.
7. Plosive. A small formant transition appears after filtering of noise. Weak formants that could signify a nasal vowel or a more closed mouth are observed. A click/bang sound is heard at exactly the same time as the possible plosive. Uncertain result.
8. Vowel. Nothing in the acoustic analysis indicates a plosive.


Table 2. The results of the three perceptual judgments (judge 1, judge 2 and consensus judgment) and the acoustic analysis. P = plosive, V = vowel.
Child  CP/Control  Judge 1  Judge 2  Consensus  Acoustic analysis
1      CP          C        V        C          C
2      CP          V        C        V          V
3      Control     C        V        C          C
4      Control     C        V        C          C
5      Control     C        V        C          C
6      CP          V        C        C          C
7      Control     C        V        V          C
8      Control     V        C        V          V

A calculation of agreement between the consensus judgment and the acoustic analysis resulted in a Kappa value of .71 (>.75 is considered good concordance) (Table 3).

Table 3. The values for agreement between the acoustic analysis and the different listener conditions. Cons = consensus judgment, J1 = judge 1, J2 = judge 2. A negative value indicates disagreement.
Judgments     Cohen’s Kappa
Acous – Cons    .71
J1 – J2        -.88
Acous – J1      .71
Acous – J2     -.56
Cons – J1       .47
Cons – J2      -.41

4 Conclusions
According to the statistical analysis the agreement between the acoustic analysis and the consensus judgment is relatively good. Even though the Kappa value is .71, seven out of the eight judgments were equivalent. The limited number of samples makes it difficult to draw conclusions. However, the results show that the consensus judgment was reliable. They also imply that judge 1 had better agreement with the acoustic analysis than judge 2. In fact, judge 1 had as good agreement with the acoustic analysis as the consensus judgment, although based on different samples. According to the acoustic analysis some initial sounds appeared more like approximants than plosives, which can explain the uncertainty in the perceptual judgments. Since there were only two possible options (plosive or vowel) in the acoustic analysis, no consideration was given to the approximant-like signs in the spectrogram. This also makes the perceptual judgments, as well as the concordance between the perceptual judgments and the acoustic analysis, less reliable.

Three of the eight babbling sequences were produced by children with cleft palate. Children with cleft palate have difficulties building up sufficient intra-oral pressure, which can result in a less distinct burst in the spectrogram. In the acoustic analysis two specific sounds were interpreted as plosives (possibly approximants). These sounds were produced by children with CP, which may explain the approximant-like appearance.

Since the result of the acoustic analysis is interpreted by a human, it is a subjective judgment. The anatomy of the speaker, the design of the room and the recording method are examples of factors that are of importance for the analysis. In theory, the acoustic signs are well described, but they are difficult to interpret when used clinically. Interfering sounds and noise in the spectrogram are difficult to separate from the babbling. This is also important for the results of the perceptual judgments and could explain disagreement between perceptual judges. The fact that the judges in this study disagree is not unique. In a study by Shriberg (1972), the conclusion is drawn that training with a key is important for the concordance between judges. In this study the judges had experience of transcribing together, but without direct feedback or a key.

In conclusion, the results from this study show that neither the perceptual nor the acoustic judgment gave reliable answers. However, one could assume that the two methods complement each other. In order to increase the reliability, both perceptually and acoustically, it is important that the recordings are of sufficient quality. This can be achieved by using high-quality equipment, by carefully considering the placement of the microphone, and by using a recording room designed so that the risk of disturbing sounds is minimal (for example by using soft toys when recording children). To get a more valid judgment it is also suggested to exclude utterances where competing sounds cannot be avoided.

References
Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (Version 4.3.33) [Computer program]. Retrieved October 7, 2005, from http://www.praat.org/.
Lieberman, P. & S.E. Blumstein, 1988. Speech Physiology, Speech Perception and Acoustic Phonetics. Cambridge: Cambridge University Press.
Lindblad, P., 1998. Talets Akustik och Perception. Kompendium, Göteborgs Universitet.
Reisberg, D., 2001. Cognition – Exploring the Science of the Mind. New York: W.W. Norton & Company, Inc.
Shriberg, L.D., 1972. Articulation Judgments: Some Perceptual Considerations. Journal of Speech and Hearing Research 15, 876-882.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics
Working Papers 52 (2006), 5–8

Perception of South Swedish Word Accents

Gilbert Ambrazaitis and Gösta Bruce Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University {Gilbert.Ambrazaitis|Gosta.Bruce}@ling.lu.se

Abstract A perceptual experiment concerning South Swedish word accents (accent I, accent II) is described. By means of editing and resynthesis techniques the F0 pattern of a test word in a phrase context has been systematically manipulated: initial rise (glide vs. jump) and final concatenation (6 timing degrees of the accentual fall). The results indicate that both a gliding rise and a late fall seem necessary for the perception of accent II, while there appear to be no such specific, necessary cues for the perception of accent I.

1 Introduction
In the original Swedish intonation model (Bruce & Gårding, 1978) the two tonal word accents (accent I and accent II) are assigned bitonal representations in terms of High plus Low (HL), representing the accentual F0 fall. These Highs and Lows are timed differently, however, in relation to the stressed syllable depending on dialect type. For all dialect types, the HL of accent I precedes the HL of accent II. In South Swedish, the HL of accent I is aligned with the stressed syllable, while the HL of accent II is instead aligned with the post-stress syllable. A problem with the latter representation is that the stressed syllable in accent II words has no direct tonal representation. Thus this modelling does not reflect what should be the most perceptually salient part of the pitch pattern of accent II. Figure 1 shows prototypical F0 contours of the two word accents (minimal pair) in a prominent position of an utterance as produced by a male speaker of South Swedish (the second author). This particular problem of intonational modelling has been the starting-point of a phonetic experiment aimed at examining what is perceptually relevant in the F0 contours of accent I and accent II in the South Swedish dialect type. More specifically, our plan has been to run a perceptual experiment, where the intention was to find out what are the necessary and sufficient cues for the identification of both word accents.

Figure 1. Prototypical F0 contours of the two word accents in a prominent position of an utterance as produced by a male speaker of South Swedish: Jag har sett anden i dimman. (‘I have seen the duck/spirit in the fog.’) Thick line: acc. I (‘duck’); thin line: acc. II (‘spirit’).

2 Method
We asked subjects to judge whether they perceive the test word anden as either meaning ‘the duck’ (accent I) or ‘the spirit’ (accent II), in naturally produced and synthetically manipulated test utterances. We chose to put the test word in a non-final accented position of an utterance containing two accented words (test word and context word; see Table 1), for several reasons. First, we wanted to have the possibility of removing the accentual F0 fall of the test word while maintaining an utterance-final falling pattern. Second, we chose two different context words – one with accent I (drömmen, ‘the dream’), one with accent II (dimman, ‘the fog’) – in order to provide a “dialectal anchor” for the listeners. Third, by having the test word non-finally, we avoided phrase-final creaky voice on the test word, thus facilitating the editing of F0. Regarding semantic factors, we tried to choose context words which would be as “neutral” as possible, i.e. which would not bias the ratings of the test word. The test material was recorded by a male speaker of South Swedish (the second author) in the anechoic chamber at the Centre for Languages and Literature, Lund University.

Table 1. The structure of the test material, i.e. the four recorded test utterances. (‘I have seen the duck/spirit in the dream/fog.’)
                Test word       Context word                           Used for
Jag har sett    anden (accI)    i drömmen (accI) / i dimman (accII).   A: control stimuli
Jag har sett    anden (accII)   i drömmen (accI) / i dimman (accII).   B: primary stimuli

2.1 Stimuli
We created 12 F0 contours and implemented them in two recorded utterances (B in Table 1), by means of F0 editing and resynthesis using the PSOLA manipulation option in Praat. Figure 2 displays the 12 contours for one of the utterances (dimman) as an example. The starting point was a stylization of the originally recorded F0 contours, i.e. with accent II on the test word (glide/dip4 in Figure 2). Based on this stylized accent II contour, three contours with a successively later F0 fall were created (dip3, dip2, dip1), each one aligned at successive segmental boundaries: in dip3, the fall starts at the vowel onset of the post-stress syllable (schwa), in dip2 at the following /n/ onset, and in dip1 at the onset of /i/. Thus, a continuum of concatenations between the two accented words was created. Two further steps were added to this continuum: one by completely removing the fall, yielding a contour that exhibits a high plateau between the two accented words (dip0), and one by shifting back the whole rise-fall pattern of the original accent II, yielding a typical accent I pattern (dip5). For each dip position, we also created a contour that lacks the initial gliding rise on /a(n)/, by simply transforming it into a “jump” from low F0 in sett to high F0 right at the onset of anden. It should be pointed out that the difference between glide and jump is marginal for dip5 (i.e. accent I), and was implemented for the sake of symmetry only. Additionally, we generated 4 control stimuli which were based on the A-recordings (cf. Table 1). These are, however, not further considered in this paper.

Figure 2. Stimulus structure, exemplified in the dimman context: 6 dip levels (0...5) x 2 rise types (jump, glide). These 12 F0 contours were implemented in both recordings (dimman and drömmen), yielding 24 stimuli.
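The F0 editing and PSOLA resynthesis described above were carried out directly in Praat. As a rough sketch of that kind of workflow (an assumption on my part, using the parselmouth Python interface to Praat; the file names, pitch range and stylized breakpoints are placeholders, not the values used for the actual stimuli):

import parselmouth
from parselmouth.praat import call

# Hypothetical recording of "Jag har sett anden i dimman" (accent II version).
snd = parselmouth.Sound("anden_dimman_accII.wav")

# Manipulation object for PSOLA resynthesis (time step, pitch floor, pitch ceiling).
manipulation = call(snd, "To Manipulation", 0.01, 60, 300)
pitch_tier = call(manipulation, "Extract pitch tier")

# Replace the original F0 points with a stylized contour. The (time, F0) breakpoints
# below are invented placeholders; in the study they were anchored at segment
# boundaries to create the dip0...dip5 continuum and the glide/jump variants.
call(pitch_tier, "Remove points between", snd.xmin, snd.xmax)
for t, f0 in [(0.10, 110), (0.55, 115), (0.80, 170), (1.05, 170), (1.60, 95)]:
    call(pitch_tier, "Add point", t, f0)

call([pitch_tier, manipulation], "Replace pitch tier")
resynth = call(manipulation, "Get resynthesis (overlap-add)")
resynth.save("stimulus_glide_dip0.wav", "WAV")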

2.2 Procedure
All 24+4=28 stimuli were rated 4 times. The whole list of 112 stimuli was randomized and presented to the listeners in 8 blocks of 14 stimuli each, via headphones. The listeners heard each stimulus only once and had to rate it as either referring to a duck (and) or a spirit (ande), within 3 seconds, by marking it on a paper sheet. The whole test was included in a wav-file and took 11:31 minutes. Instructions were given orally and in written form. A training test with two blocks of 4 stimuli each was run before the actual experiment. 20 South Swedish native speakers, 5 male, 15 female, aged 19-32, with no reported hearing impairments, volunteered as subjects.

2.3 Data analysis
Based on the four repetitions of each stimulus, an accent II score in % (henceforth %accII) was calculated per stimulus and subject. These %accII scores were used as raw data in the analyses. Means and standard deviations, pooled over all 20 listeners, were calculated for every stimulus. A three-way repeated-measures ANOVA was run for the 24 primary stimuli to test for effects of the following factors: FINAL WORD (2 levels: drömmen, dimman), RISE TYPE (2 levels: jump, glide), and CONCATENATION (6 levels: dip0…dip5).
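As a sketch of this analysis step (not the authors' actual script; the long-format layout, column names and the pandas/statsmodels tooling are assumptions of mine), the %accII scores and the three-way repeated-measures ANOVA could be computed roughly as follows:

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format response table: one row per stimulus presentation.
# 'accII' is 1 if the listener heard 'ande' (accent II), 0 for 'and' (accent I).
responses = pd.read_csv("responses.csv")   # columns: subject, final_word, rise_type, dip, accII

# Accent II score in % per stimulus and subject (four repetitions per stimulus).
scores = (responses
          .groupby(["subject", "final_word", "rise_type", "dip"])["accII"]
          .mean()
          .mul(100)
          .reset_index(name="pct_accII"))

# Three-way repeated-measures ANOVA: FINAL WORD x RISE TYPE x CONCATENATION.
aov = AnovaRM(scores, depvar="pct_accII", subject="subject",
              within=["final_word", "rise_type", "dip"]).fit()
print(aov)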

3 Results
The mean %accII ratings are displayed in Figure 3. The stimuli that were intended to represent clear cases of accent I (dip5) and accent II (glide/dip4) were convincingly rated as expected. The graphs for the two different contexts look very similar, and accordingly, FINAL WORD had no significant effect (p>.8). Also, as would be expected from Figure 3, both RISE TYPE and CONCATENATION have a significant effect (p<.001 each). However, the difference in rise type is not reflected in a constant %accII difference, which is especially salient in dip5. Accordingly, we also found a significant interaction between RISE TYPE and CONCATENATION (p<.001).

Figure 3. Mean accent II ratings in % for the two final-word conditions (drömmen, accent I; dimman, accent II), 6 dip levels (concatenation), and 2 rise types: glide (straight line) and jump (dotted line).

4 Discussion
Referring back to the issue about necessary and sufficient cues for word accent identification (cf. Introduction), we will comment on a number of points in the light of our experiment.
Is the gliding rise through the stressed syllable necessary for the perception of accent II? – Replacing this glide by an F0 jump up to the stressed syllable results in a sizeable decrease in the votes for accent II (cf. glide/dip4 vs. jump/dip4). This suggests that the gliding rise is necessary for the unambiguous perception of accent II.
Is a late-timed fall necessary for the perception of accent II? – Replacing the F0 fall through the post-stress syllable by a high plateau in the target word yields a tendency towards accent I (cf. glide/dip4 vs. glide/dip0). This suggests that the fall is necessary. However, the fall must not be substantially earlier than in the original accent II word, since this would correspond to accent I. Thus, both the gliding rise and the late fall seem necessary for the unambiguous perception of accent II. When one of them is removed, the ratings tend towards accent I. When both these cues are absent, the tendency becomes rather strong (cf. jump/dip0).
What is necessary, and what is sufficient, for the perception of accent I? – Accent I is most convincingly represented by stimuli with an early fall (dip5). However, the discussion above has already shown that this early fall cannot be a necessary cue, since a number of stimuli lacking this early fall have received high accent I ratings (cf. also jump/dip1-2). Furthermore, a high-starting stressed syllable (jump) favors accent I ratings, but cannot be regarded as sufficient, since the absence of an accent II-like fall appears necessary (cf. dip3). Thus, our results are most easily explained by the hypothesis that there are no specific necessary cues for accent I at all, but that simply the absence of accent II cues is sufficient for the perception of accent I. It is still remarkable, though, that the absence of only one accent II cue alone (e.g. the late fall) results in more votes for accent I than for accent II.
Why does a glide followed by a plateau trigger a considerable number of votes for accent I? – We do not have a definite answer to this question. One possibility is that the conditions of phrase intonation play a role. In the early part of a phrase, the expectation is a rising pattern. Thus a phrase-initial accent I may be realized as a rising glide even in South Swedish, as long as there is no immediately following F0 fall. Another possibility is that the glide-plateau gesture represents a typical accent I pattern of another dialect type (Svea or Stockholm Swedish), even if the context word at the end has a stable South Swedish pattern.
What does our experiment tell us about the markedness issue? – From the perspective of perceptual cues, accent II in South Swedish appears to be more “special” than accent I. This lends some support to the traditional view of accent II being the marked member of the opposition (cf. Elert, 1964; Engstrand, 1995; Riad, 1998).

Acknowledgements Joost van de Weijer assisted us with advice concerning methodology and statistics.

References
Bruce, G. & E. Gårding, 1978. A prosodic typology for Swedish dialects. In E. Gårding et al. (eds.), Nordic Prosody. Lund: Lund University, Department of Linguistics, 219-228.
Elert, C.-C., 1964. Phonologic Studies of Quantity in Swedish. Stockholm: Almqvist & Wiksell.
Engstrand, O., 1995. Phonetic interpretation of the word accent contrast in Swedish. Phonetica 52, 171-179.
Riad, T., 1998. Towards a Scandinavian accent typology. In W. Kehrein & R. Wiese (eds.), Phonology and Morphology of the Germanic Languages. Tübingen: Niemeyer, 77-109.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics
Working Papers 52 (2006), 9–12

Focal Accent and Facial Movements in Expressive Speech

Jonas Beskow, Björn Granström, and David House Dept. of Speech, Music and Hearing, Centre for Speech Technology (CTT), KTH, Stockholm {beskow|bjorn|davidh}@speech.kth.se

Abstract In this paper, we present measurements of visual, facial parameters obtained from a speech corpus consisting of short, read utterances in which focal accent was systematically varied. The utterances were recorded in a variety of expressive modes including Certain, Confirming, Questioning, Uncertain, Happy, Angry and Neutral. Results showed that in all expressive modes, words with focal accent are accompanied by a greater variation of the facial parameters than are words in non-focal positions. Moreover, interesting differences between the expressions in terms of different parameters were found.

1 Introduction
Much prosodic information related to prominence and phrasing, as well as communicative information such as signals for feedback, turn-taking, emotions and attitudes, can be conveyed by, for example, nodding of the head, raising and shaping of the eyebrows, eye movements and blinks. We have been attempting to model such gestures in a visual speech synthesis system, not only because they may transmit important non-verbal information, but also because they make the face look alive. In earlier work, we have concentrated on introducing eyebrow movement (raising and lowering) and head movement (nodding) to an animated talking agent. Lip configuration and eye aperture are two additional parameters that we have experimented with. Much of this work has been done by hand-manipulation of parametric synthesis and evaluated using perception test paradigms. We have explored three functions of prosody, namely prominence, feedback and interrogative mode, useful in e.g. multimodal spoken dialogue systems (Granström, House & Beskow, 2002). This type of experimentation and evaluation has established the perceptual importance of eyebrow and head movement cues for prominence and feedback. These experiments do not, however, provide us with quantifiable data on the exact timing or amplitude of such movements used by human speakers. Nor do they give us information on the variability of the movements in communicative situations. This kind of information is important if we are to be able to implement realistic facial gestures and head movements in our animated agents. In this paper we will report on methods for the acquisition of visual and acoustic data, and present measurement results obtained from a speech corpus in which focal accent was systematically varied in a variety of expressive modes.

2 Data collection and corpus
We wanted to be able to obtain articulatory data as well as other facial movements at the same time, and it was crucial that the accuracy of the measurements was good enough for resynthesis of an animated head. The opto-electronic motion tracking system that we use, the Qualisys MacReflex system, has an accuracy better than 1 mm with a temporal resolution of 60 Hz. The data acquisition and processing is similar to earlier facial measurements carried out at CTT by e.g. Beskow et al. (2003). The set-up can be seen in Fig. 1, left picture.

Figure 1. Data collection setup with video and IR-cameras, microphone and a screen for prompts (left) and a test subject with the IR-reflecting markers (right).

The subject could either pronounce sentences presented on the screen outside the window or be engaged in a (structured) dialogue with another person as shown in the figure. By attaching infrared (IR) reflecting markers to the subject’s face (see Fig. 1), the system is able to register the 3D coordinates for each marker. We used a number of markers to register lip movements as well as other facial movements such as eyebrows, cheek and chin. The speech material used for the present study consisted of 39 short, content neutral sentences such as “Båten seglade förbi” (The boat sailed by) and “Grannen knackade på dörren” (The neighbor knocked on the door), all with three content words which could each be focally accented. To elicit visual prosody in terms of prominence, these short sentences were recorded with varying focal accent position, usually on the subject, the verb and the object respectively, thus making a total of 117 sentences. The utterances were recorded in a variety of expressive modes including Certain, Confirming, Questioning, Uncertain, Happy, Angry and Neutral. This database is part of a larger database collected in the EU PF-Star project (Beskow et al., 2004).

3 Measurement procedure
In the present database a total of 29 IR-sensitive markers were attached to the speaker’s face, of which 4 markers were used as reference markers (on the ears and on the forehead). The marker setup (as shown in Fig. 1) largely corresponds to the feature point (FP) configuration of the MPEG-4 facial animation standard. In the present study, we chose to base our quantitative analysis of facial movement on the MPEG-4 Facial Animation Parameter (FAP) representation. Specifically, we chose a subset of 31 FAPs out of the 68 FAPs defined in the MPEG-4 standard, including only the ones that we were able to calculate directly from our measured point data. We wanted to obtain a measure of how (in what FAPs) focus was realised by the recorded speaker for the different expressive modes. In an attempt to quantify this, we introduce the Focal Motion Quotient, FMQ, defined as the standard deviation of a FAP parameter taken over a word in focal position, divided by the average standard deviation of the same FAP in the same word in non-focal position. This quotient was then averaged over all sentence-triplets spoken with a given expressive mode.

4 Results and discussion
As a first step in the analysis, the FMQs for all the 31 measured FAPs were averaged across the 39 sentences. These data are displayed in Fig. 2 for the analyzed expressive modes, i.e. Angry, Happy, Confirming, Questioning, Certain, Uncertain and Neutral. As can be seen, the FMQ mean is always above one, irrespective of which facial movement, FAP, is studied. This means that a shift from a non-focal to a focal pronunciation on average results in greater dynamics in all facial movements for all expressive modes. It should be noted that these are results from only one speaker and averages across the whole database. It is however conceivable that facial movements will at least reinforce the perception of focal accent. The mean FMQ taken over all expressive modes is 1.6. The expressive mode yielding the largest mean FMQ is Happy (1.9) followed by Confirming (1.7), while Questioning has the lowest mean FMQ value of 1.3. If we look at the individual parameters and the different expressive modes, some FMQs are significantly greater, especially for the Happy expression, up to 4 for parameter 34, “raise right mid eyebrow”.
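The FMQ defined in Section 3 above is a simple ratio of standard deviations. The following minimal Python sketch (my own illustration with invented trajectory values, not the authors' analysis code) shows the computation for one FAP and one word:

import numpy as np

def focal_motion_quotient(fap_focal, fap_nonfocal_versions):
    """FMQ for one FAP and one word: std of the FAP trajectory over the word in
    focal position, divided by the average std of the same FAP over the same
    word in the non-focal renditions of the sentence."""
    sd_focal = np.std(fap_focal)
    sd_nonfocal = np.mean([np.std(v) for v in fap_nonfocal_versions])
    return sd_focal / sd_nonfocal

# Hypothetical FAP 34 ("raise right mid eyebrow") trajectories, sampled at 60 Hz,
# for one word in focal position and in two non-focal renditions.
focal = np.array([0.0, 0.4, 1.1, 1.8, 1.2, 0.3])
nonfocal = [np.array([0.0, 0.1, 0.2, 0.2, 0.1, 0.0]),
            np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])]
print(focal_motion_quotient(focal, nonfocal))   # > 1 means more movement under focus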

Figure 2. The focal motion quotient, FMQ, averaged across all sentences, for all 31 measured MPEG-4 FAPs (jaw, lip, cheek, eyebrow and head movement parameters) and for the expressive modes Angry, Happy, Confirming, Questioning, Certain, Uncertain and Neutral (see text for definitions and details).

In order to more clearly see how different kinds of parameters affect the movement pattern, a grouping of the FAPs is made. In Fig. 3 the “Articulation” parameters are the ones primarily involved in the realization of speech sounds (the first 20 in Fig. 2). The “Smile” parameters are the 4 FAPs relating to the mouth corners. “Brows” correspond to the eight eyebrow parameters and “Head” are the three head movement parameters. The extent and type of greater facial movement related to focal accent clearly varies with the expressive mode. Especially for Happy, Certain and Uncertain, FMQs above 2 can be observed. The Smile group is clearly exploited in the Happy mode, but also in Confirming, which supports the finding in Granström, House & Swerts (2002) where Smile was the most prominent cue for confirming, positive feedback, referred to in the introduction. These results are also consistent with Nordstrand et al. (2004) which showed that lip corner displacement was more strongly influenced by utterance emotion than by individual vowel features.


Figure 3. The effect of focus on the variation of several groups of MPEG-4 FAP parameters (articulation, smile, brows, head), for the different expressive modes (Angry, Happy, Confirming, Questioning, Certain, Uncertain, Neutral).

While much more detailed data on facial movement patterns is available in the database, we wanted to show the strong effects of focal accent on basically all facial movement patterns. Modelling the timing of the facial gestures and head movements relating to differences between focal and non-focal accent and to differences between expressive modes promises to be a fruitful area of future research.

Acknowledgements This paper describes research in the CTT multimodal communication group including also Loredana Cerrato, Mikael Nordenberg, Magnus Nordstrand and Gunilla Svanfeldt which is gratefully acknowledged. Special thanks to Bertil Lyberg for making available the Qualisys Lab at Linköping University. The work was supported by the EU/IST projects SYNFACE, PF-Star and CHIL, and CTT, supported by VINNOVA, KTH and participating Swedish companies and organizations.

References
Beskow, J., L. Cerrato, B. Granström, D. House, M. Nordstrand & G. Svanfeldt, 2004. The Swedish PF-Star Multimodal Corpora. Proc. LREC Workshop, Multimodal Corpora: Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces, Lisbon, 34-37.
Beskow, J., O. Engwall & B. Granström, 2003. Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements. Proc. ICPhS 2003, Barcelona, 431-434.
Granström, B., D. House & J. Beskow, 2002. Speech and gestures for talking faces in conversational dialogue systems. In B. Granström, D. House & I. Karlsson (eds.), Multimodality in Language and Speech Systems. Dordrecht: Kluwer Academic Publishers, 209-241.
Granström, B., D. House & M. Swerts, 2002. Multimodal feedback cues in human-machine interactions. Proc. Speech Prosody 2002, Aix-en-Provence, 347-350.
Nordstrand, M., G. Svanfeldt, B. Granström & D. House, 2004. Measurement of articulatory variation in expressive speech for a set of Swedish vowels. Journal of Speech Communication 44, 187-196.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics
Working Papers 52 (2006), 13–16

A Study of Simultaneous-masking and Pulsation-threshold Patterns of a Steady-state Synthetic Vowel: A Preliminary Report

Ulla Bjursäter Department of Linguistics, Stockholm University [email protected]

Abstract This study will be a partial remake of Tyler & Lindblom’s “Preliminary study of simultaneous-masking and pulsation-threshold patterns of vowels” (1982), using today’s technology. A steady-state vowel as masker and pure tones as signals will be presented using simultaneous-masking (SM) and pulsation-threshold (PT) procedures in an adjustment method to collect the vowel masking pattern. Vowel intensity is changed in three steps of 15 dB. For SM, each 15 dB change is expected to result in about a 10-13 dB change in signal thresholds. For PT, the change in signal thresholds with vowel intensity is expected to be about 3-4 dB. These results would correspond with the results from the Tyler & Lindblom study. Depending on the technology outcome, further experiments can be made, involving representations of dynamic stimuli like CV-transitions and diphthongs.

1 Introduction
This study is an attempt to partially replicate Tyler & Lindblom’s “Preliminary study of simultaneous-masking and pulsation-threshold patterns of vowels” (1982). Their intention was to investigate the effect of the two different masking types as well as the role of suppression in the coding of speech spectra. Suppression, or lateral inhibition, refers to the reduction in the reaction to one stimulus by the introduction of a second (Oxenham & Plack, 1998). The ability of one tone to suppress the activity of another tone of adjacent frequency has been thoroughly documented in auditory physiology (Delgutte, 1990; Moore, 1978). In speech, suppression can be used to investigate formant frequencies. In the original article, the authors (Tyler & Lindblom, 1982) constructed an experiment in which pure tones were masked by steady-state synthetic vowels, using simultaneous-masking and pulsation-threshold procedures. Their vowels were synthesized on an OVE 1b speech synthesizer (Fant, 1960) with formant frequencies, bandwidths and intensities approximating values for Swedish. In this study only one of the vowels from the original experiment is synthesized, using Madde, a singing synthesizer, instead of OVE 1b. In this experiment, the original vowel masking patterns will be used on the Swedish vowel /y/, a vowel that, according to Tyler & Lindblom (1982), is particularly useful in testing the role of suppression in speech as it has three closely spaced high-frequency formants (F2, F3 and F4). F2 and F4 have about the same frequency as in the vowels /i/ and /e/, and a distinct perception of these three vowels must depend on good frequency resolution of F3 (Carlson et al., 1970; Bladon & Fant, 1978; Tyler & Lindblom, 1982).

Tyler, Curtis & Preece (1978) have shown that vowel masking patterns from a forward masking (FM) procedure preserve the formants better than patterns obtained using simultaneous masking (SM).

2 Method

2.1 Subject
All collected data will be from an experienced listener who will receive about 30 minutes of practice with the test procedure.

2.2 Procedure
The subject’s hearing level will be controlled and set in connection with the experiment in the phonetics laboratory at the Department of Linguistics, SU, using Philips SBC HP890 headphones. The tone will be presented at different intensities to get a baseline point. The tests are constructed in LabView, a graphical programming language. The Swedish vowel /y/ is synthesized in Madde, an additive, real-time singing synthesizer. The vowel formant frequencies and bandwidths used in this study (Table 1) are the same as those used in Tyler & Lindblom (1982), and defined in Carlson, Fant & Granström (1975) and in Carlson, Granström & Fant (1970).

Table 1. Formant frequencies (F1, F2, F3 and F4) and bandwidths in Hz, and Q-values, for the vowel /y/.
/y/   Frequency   Bandwidth   Q
F1    255 Hz      62.75 Hz    4.06
F2    1930 Hz     146.5 Hz    13.17
F3    2420 Hz     171.0 Hz    14.15
F4    3300 Hz     215.0 Hz    13.35
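The vowel itself is synthesized in Madde. Purely as an illustration of what a source-filter synthesis with the Table 1 values amounts to (not the Madde implementation; the sample rate, gain handling and output file name are my own assumptions), a minimal Python sketch:

import numpy as np
from scipy.signal import lfilter
from scipy.io import wavfile

fs = 16000
f0 = 120.0                      # one of the F0 levels used in the study (80, 120, 240 Hz)
dur = 0.875                     # masker duration used in the SM procedure

# Glottal source approximated by an impulse train at F0.
n = int(dur * fs)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Cascade of second-order resonators with the /y/ formant data from Table 1.
formants = [(255, 62.75), (1930, 146.5), (2420, 171.0), (3300, 215.0)]
y = source
for f, bw in formants:
    r = np.exp(-np.pi * bw / fs)           # pole radius from the bandwidth
    theta = 2 * np.pi * f / fs             # pole angle from the formant frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    y = lfilter([sum(a)], a, y)            # numerator chosen for unity gain at DC

y = 0.9 * y / np.max(np.abs(y))
wavfile.write("y_vowel_approx.wav", fs, (y * 32767).astype(np.int16))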

The procedures of the two tests, simultaneous-masking (SM) and pulsation-threshold (PT), are the same as in the Tyler & Lindblom (1982) study, although in this case with only one vowel instead of three. In the SM procedure, the vowel will be presented for 875 ms (between 50% points), repeated with a 125-ms silent interval. Three pulses of the pure tone will appear within each masker presentation. These pulses start 125 ms after the vowel onset, continue for 125 ms and are separated by 125 ms. Rise/fall times (between 10% and 90% points) are 7 ms for both signal and masker. In the PT procedure the masking vowel and the pulsating signal alternate, with a duration of 125 ms each. In order to assist the subject in the task, every fourth signal (125 ms) is omitted. Rise/fall times are 7 ms and the signal and the vowel are separated by 0 ms. F0 levels vary between 80 Hz, 120 Hz and 240 Hz. The vowel intensity changes parametrically over a range of 45 dB in three steps of 15 dB. Intensity levels alternate between 55.5, 70.5 and 85.5 dB SPL, representing low-voiced, medium and strong speech. The presentation order of the vowel’s fundamental frequency and intensity will be randomized. The testing with PT and SM will alternate every 10th minute. Each condition will be presented until five estimates are registered. The stimuli will be presented monaurally, to the right ear. The subject will be instructed to adjust the level of the signal to a just noticeable pulsation threshold level; all answers are automatically registered in the LabView programme. All data will be analyzed in SPSS 14.0 and MS Excel 2002.

3 Results
The technical solution of the test in LabView is currently under construction. The results of the tests are expected to correspond with the results from the Tyler & Lindblom (1982) study. Some variation may occur due to individual variations between subjects.

4 Discussion
The expectation of this study is to obtain data that concur with the results from the Tyler & Lindblom (1982) study. The results from the Tyler & Lindblom (1982) study show that the masking pattern obtained with the PT method delineates the vowel’s formant frequencies better than the pattern obtained with the SM method. Suppression only occurs when two sounds are presented simultaneously, as in the SM procedure, which seems to result in the signal needing a higher intensity to be detected. The difference between the SM and PT measurements was very small at low signal frequencies and quite large at high signal frequencies (Tyler & Lindblom, 1982). One of the explanations offered was that the high-intensity F1 suppressed the activity caused by the higher formants, resulting in lower PT in the high-frequency regions. Tyler & Lindblom (1982) also propose that the suggested suppression effects for steady-state vowels could occur for all speech sounds, although in natural speech, the duration for which the vowel achieves its target is typically very short. Depending on the outcome of the technology used in the PT and SM procedures, the program used in the test can be extended to further investigations of the effects of the two masking procedures on representations of dynamic stimuli, like CV-transitions and diphthongs.

Acknowledgements The LabView test is being constructed with the invaluable help of Ellen Marklund, Francisco Lacerda and Peter Branderud, SU.

References
Bladon, A. & G. Fant, 1978. A two-formant model and the cardinal vowels. Speech Transmission Laboratory Quarterly Progress Status Report (STL-QPSR 1), 1-8.
Carlson, R., G. Fant & B. Granström, 1975. Two-formant Models, Pitch and Vowel Perception. In G. Fant & M.A.A. Tatham (eds.), Auditory Analysis and Perception of Speech. London: Academic Press.
Carlson, R., B. Granström & G. Fant, 1970. Some studies concerning perception of isolated vowels. Speech Transmission Laboratory Quarterly Progress Status Report (STL-QPSR 2/3), 19-35.
Delgutte, B., 1990. Physiological mechanisms of psychophysical masking: Observations from auditory-nerve fibers. Journal of the Acoustical Society of America 87(2), 791-809.
Fant, G., 1960. Acoustic Theory of Speech Production. The Hague: Mouton.
Moore, B.C.J., 1978. Psychophysical tuning curves measured in simultaneous and forward masking. Journal of the Acoustical Society of America 63(2), 524-532.
Oxenham, A.J. & C.J. Plack, 1998. Suppression and the upward spread of masking. Journal of the Acoustical Society of America 104(6), 3500-3510.
Tyler, R.S., J.F. Curtis & J.P. Preece, 1978. Formant preservation revealed by vowel masking patterns. The Canadian Speech and Hearing Convention, Saskatoon, Saskatchewan.


Tyler, R.S. & B. Lindblom, 1982. Preliminary Study of Simultaneous-Masking and Pulsation-Threshold Patterns of Vowels. Journal of the Acoustical Society of America 71(1), 220-224.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics
Working Papers 52 (2006), 17–20

Youth Language in Multilingual Göteborg

Petra Bodén¹ and Julia Grosse²
¹Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
[email protected]
²Institute of Swedish as a Second Language, Dept. of Swedish, Göteborg University
[email protected]

Abstract In this paper, the results from a perception experiment about youth language in multilingual Göteborg are presented and discussed.

1 Introduction

1.1 Language and language use among young people in multilingual urban settings
The overall goal of the research project ‘Language and language use among young people in multilingual urban settings’ is to describe and analyze a Swedish variety (or set of varieties) hereafter called SMG (Lindberg, 2006). SMG stands for ‘Swedish on Multilingual Ground’ and refers to youth varieties like “Rinkeby Swedish” and “Rosengård Swedish”. In the present paper, we address two of the project’s research questions: SMG’s relation to foreign accent and how SMG is perceived by the adolescents themselves.

1.2 Purpose of the perception experiment
In the perception experiment, Göteborg students are asked to listen for examples of “gårdstenska” (the SMG spoken in Göteborg) in recordings from secondary schools. The purpose is to identify speakers of SMG for future studies and to test the hypotheses that 1) monolingual speakers of Swedish can speak SMG and 2) speakers of SMG can code-switch to a more standardized form of Swedish. Foreign accent, defined here as the result of negative interference from the speaker’s L1 (first language), cannot occur in the Swedish that is spoken by persons who have Swedish as their (only) L1, nor can foreign accent be switched off in certain situations.

2 Method
Stimuli were extracted from the research project’s speech database and played once (over loudspeakers) to a total of 81 listeners. The listeners were asked to answer two questions about each stimulus: Does the speaker speak what is generally called gårdstenska? (yes or no), and How confident are you about that? (confident, rather confident, rather uncertain or uncertain). The listeners were also asked to answer a few questions about who they believed typically speaks gårdstenska. The 19 stimuli used in the experiment were approximately 30-second-long sections that had been extracted from spontaneous (unscripted) recordings made at secondary schools in Göteborg. The listeners in the experiment were students from the same two schools as the speakers. After having collected the answer sheets, a general discussion on SMG was held in each class.

3 Results and discussion

3.1 Listeners’ views on gårdstenska
80 of the 81 listeners answered the questions about who typically speaks gårdstenska. All 80 answered that adolescents are potential speakers of gårdstenska. 54% (43) answered that also children can speak gårdstenska and 15% (12) that adults are potential speakers. Almost half of the listeners (37) claimed that only adolescents speak gårdstenska. Only a third of the listeners (25) believed that gårdstenska can be spoken by persons without an immigrant background. Most of them, 23, answered that gårdstenska is also spoken by first and second generation immigrants. One listener, however, answered that only persons without immigrant background and first generation immigrants speak gårdstenska. The listener was herself a second generation immigrant. One listener in a similar experiment undertaken in Malmö (Hansson & Svensson, 2004) answered in the same fashion, i.e. excluding persons with the same background as the listener herself. Finally, 69% (55) of the listeners answered that only persons with an immigrant background speak gårdstenska. The majority, 30, regards both first and second generation immigrants as potential speakers of gårdstenska, whereas 15 only include first generation immigrants and 10 only second generation immigrants; see Figure 1.

Figure 1. The listeners’ answers to the question: Who speaks what is generally called gårdstenska? (Number of answers per answer category, broken down by the listeners’ own backgrounds: first generation immigrants, second generation immigrants, listeners with one Swedish-born parent, and listeners without immigrant background.)

3.2 Listeners’ classification of the stimuli
A statistically significant majority of the listeners regarded the stimuli P05a, P11a, P19, P10a, P08, P10b, P05b, P35, P11b and P47 as examples of gårdstenska (p<.05). Between 36% and 80% of those listeners felt confident in their classification of the stimuli as gårdstenska, whereas the corresponding percentages for the listeners classifying the same stimuli as ‘not gårdstenska’ varied from 27 to 57. Stimuli P38, P25, S40, S08, S15 and S30b were judged as examples of something else than gårdstenska by a majority of the listeners (p<.05). Between 39% and 73% of these listeners felt confident in their classification of the stimuli as something else than gårdstenska, whereas the corresponding percentages for the listeners classifying the stimuli as gårdstenska varied from only 10 to 50. Stimuli P49, S43, S30a and S23 were perceived as gårdstenska by about half of the listeners (p>.05). 27% to 45% of the listeners that classified these stimuli as gårdstenska reported feeling confident, as did 23% to 42% of the listeners that classified the stimuli as not gårdstenska. Both the listeners’ classifications of these four stimuli and their reported uncertainty indicate that the stimuli in question contain speech that cannot unambiguously be classified as either SMG or something else than SMG.
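The paper does not name the significance test behind the p<.05 majorities; one plausible reading (an assumption on my part, not stated by the authors) is a two-sided binomial test of the classification counts against chance (50%), which could be sketched as:

from scipy.stats import binomtest

n_listeners = 81
n_gardstenska = 60       # hypothetical count of 'gårdstenska' answers for one stimulus

# Two-sided test of the observed proportion against chance level 0.5.
result = binomtest(n_gardstenska, n_listeners, p=0.5, alternative="two-sided")
print(result.pvalue)     # < .05 would count as a significant majority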

3.3 Foreign accent or language variety?
Two hypotheses were tested in the experiment: 1) that monolingual speakers of Swedish can speak gårdstenska and 2) that speakers of gårdstenska can code-switch to a more standardized form of Swedish. Table 1 shows the relationship between the speakers’ background (whether they have an immigrant background or not) and language use (SMG or not). An immigrant background is neither necessary nor sufficient for a speaker to be classified as a speaker of SMG. Three monolingual speakers of SMG were identified. Speech produced by speakers P11, P05 and P10 was used in two different types of stimuli: a) talking to friends and b) talking to a project member/researcher. Unlike in the Malmö experiment (Hansson & Svensson, 2004), no speaker was classified as a speaker of SMG in one stimulus but not in the other. P11 was perceived as a speaker of gårdstenska by a statistically significant majority of the listeners in situation a (93%, p<.05) but not unambiguously classified in situation b (69% SMG classifications, p>.05). P05 and P10 also got larger proportions of SMG classifications in the a stimuli than in the b stimuli, but in both types of stimuli they were classified as speakers of SMG (p<.05).

Table 1. Speakers’ background and classification by the listeners in the experiment (classification according to the listeners, p<.05).
Born in Sweden, at least one parent born in Sweden:        SMG: P47, P35, P08   Not SMG: S15
Born in Sweden, parents not born in Sweden:                SMG: P10             Not SMG: S30, P25, P38
Not born in Sweden, to Sweden before 6 years of age:       SMG: P19             Not SMG: S08
Not born in Sweden, to Sweden at 6 years of age or later:  SMG: P11, P05        Not SMG: S40

3.4 Differences in awareness of and attitude towards gårdstenska One thing that should be mentioned is that the term gårdstenska used in the survey (and in this paper) did not seem to be as widely accepted as we thought. Initially there was some uncertainty among the listeners what kind of language use we were referring to. However, when we described it as a “Göteborg version of Rinkeby Swedish”, the listeners seemed to understand what they were asked to listen for. Since we were interested in the listeners’ attitudes towards, and their awareness of gårdstenska, we tried to initiate a discussion about the subject matter after having completed the experiment. The observations described below are based on field notes and recollections and are not to be seen as results of the experiment but rather as overall impressions. When asked on what grounds they had categorized the speakers in the experiment most listeners seemed to agree that the use of certain words was crucial for their decision. Some students mentioned pronunciation, prosody and word order as typical features. In two of the five classes most of the time was spent listing words and phrases typical for gårdstenska. In the other three classes the discussion topic varied from typical linguistic features to more socio-linguistic aspects of multi-ethnic youth language. Several students in different classes made an explicit distinction between gårdstenska and foreign accent, and in one class a discussion developed about the function of multi-ethnic youth language as an identity marker used by adolescences who aim to underline their non Swedish identity. The student who was the most active in this part of the discussion also emphasized the difference between multi- ethnic youth varieties and foreign accents and drew parallels to regional varieties of Swedish. Concerning students’ attitudes towards gårdstenska there appeared to be some considerable differences between some of the classes. From this angle the discussion was particularly interesting in one of the classes. Only one male student in this class seemed to identify with speakers of gårdstenska (or “invandriska” as he himself called it). This student said that he

would not use what we referred to as gårdstenska in class because his classmates would laugh at him. He refused to name any typical words or features of gårdstenska in class but volunteered to hand in a word list, which only we as researchers were allowed to look at. This student made it clear that “invandriska” was a language he used with his friends outside his class and never in the classroom. Interestingly, this was the same class we mentioned above, where students talked about gårdstenska as an identity marker, whereas some students were quite determined in their opinion that this kind of language use was due to a low proficiency in Swedish. Within the other classes the subject seemed less controversial. We can, of course, only speculate about the cause for these differences between the classes. One impression was that there was less controversy about the issue in those classes where more students seemed to identify with speakers of gårdstenska, which were also the more heterogeneous regarding the students’ linguistic and cultural background.

3.5 Listeners’ awareness of sociolinguistic variation

After visiting the five different school classes in two of Göteborg’s multilingual areas, the overall impression was that a lot of the students showed at least some awareness of sociolinguistic aspects in language use. Some students, as mentioned above, explicitly discussed aspects of language and identity, showing great insight and strong opinions on the issue. Overall most students seemed to acknowledge that gårdstenska is spoken in certain groups (i.e. among friends but not with teachers or parents) and in certain situations and not in others. Thus the listeners showed some awareness of register variation, even though there were different opinions on the question of to what extent speakers make a conscious linguistic choice or unconsciously adapt their language when code-switching between gårdstenska and other varieties of Swedish. There was, however, a minority of listeners who categorized what they heard in some of the stimuli as interlanguage of individuals lacking proficiency in Swedish.

4 Future work

The monolingual speakers of SMG support the hypothesis of SMG being a variety of Swedish rather than a foreign accent. From discussions with adolescents we have learnt that SMG is primarily used among friends and not with e.g. teachers and parents. Therefore it is interesting that some speakers in the experiment were perceived as speaking SMG (albeit to a lesser degree) even in dialogues with adults. Future work includes investigating whether some features of SMG (e.g. the foreign-sounding pronunciation) are kept even in situations where other features (e.g. the SMG vocabulary) are not used, and whether these features are possibly kept also later in life when the speakers no longer use a youth language.

Acknowledgements

The research reported in this paper has been financed by the Bank of Sweden Tercentenary Foundation. The authors would like to thank Roger Källström for fruitful discussions on the experiment’s design and much appreciated help!

References

Hansson, P. & G. Svensson, 2004. Listening for “Rosengård Swedish”. Proceedings FONETIK 2004, 24-27.
Lindberg, I., 2006. Språk och språkbruk bland ungdomar i flerspråkiga storstadsmiljöer 2000–2006. Institute of Swedish as a Second Language, Göteborg University. http://hum.gu.se/institutioner/svenska-spraket/isa/verk/projekt/pag/pg_forsk2

Prosodic Cues for Hesitation

Rolf Carlson 1, Kjell Gustafson 1,2, and Eva Strangert 3*
1 Department of Speech, Music and Hearing, KTH, {rolf|kjellg}@speech.kth.se
2 Acapela Group Sweden AB, Solna, [email protected]
3 Department of Comparative Literature and Scandinavian Languages, Umeå University, [email protected]
* names in alphabetical order

Abstract

In our efforts to model spontaneous speech for use in, for example, spoken dialogue systems, a series of experiments have been conducted in order to investigate correlates to perceived hesitation. Previous work has shown that it is the total duration increase that is the valid cue rather than the contribution by either of the two factors pause duration and final lengthening. In the present experiment we explored the effects of F0 slope variation and the presence vs. absence of creaky voice in addition to durational cues, using synthetic stimuli. The results showed that variation of both F0 slope and creaky voice did have perceptual effects, but to a much lesser degree than the durational increase.

1 Introduction

Disfluencies of various types are a characteristic feature of human spontaneous speech. They can occur for reasons such as problems in lexical access or in the structuring of utterances, or in seeking feedback from a listener. The aim of the current work is to gain a better understanding of what features contribute to the impression of hesitant speech on a surface level. One of our long-term research goals is to build a synthesis model which is able to produce spontaneous speech including disfluencies. Apart from increasing our understanding of the features of spontaneous speech, such a model can be explored in spoken dialogue systems, both to increase the naturalness of the synthesized speech (Callaway, 2003) and as paralinguistic signalling of, for example, uncertainty in a dialogue. The current work deals with the modelling of one type of disfluency, hesitations. The work has been carried out through a sequence of experiments using Swedish speech synthesis. If we are to model hesitations in a realistic way in dialogue systems, we need to know more about what phonetic features contribute to the impression that a speaker is being hesitant. A few studies have shown that hesitations (and other types of disfluencies) very often go unnoticed in normal conversation, even during very careful listening, but scientific studies have in the past concentrated much more on the production than on the perception of hesitant speech. Pauses and retardations have been shown to be among the acoustic correlates of hesitations (Eklund, 2004). Significant patterns of retardation in function words before hesitations have been reported (Horne et al., 2003). A recent perception study (Lövgren & van Doorn, 2005) confirms that pause insertion is a salient cue to the impression of hesitation, and the longer the pause, the more certain the impression of hesitance. With a few exceptions, relatively little effort has so far been spent on research on spontaneous speech synthesis with a focus on disfluencies. In recent work (Sundaram &

Narayanan, 2003) new steps are taken to predict and realize disfluencies as part of the unit selection in a synthesis system. In Strangert & Carlson (2006) an attempt to synthesize hesitation using parametric synthesis was presented. The current work is a continuation of this effort.

2 Experiment

Synthetic versions of a Swedish utterance were presented to listeners who had to evaluate if and where they perceived a hesitation. The subjects, regarded as naive users of speech synthesis, were 14 students of linguistics or literature from Umeå University, Sweden. The synthetic stimuli were manipulated with respect to duration features, F0 slope and presence vs. absence of creaky voice to evoke the impression of a hesitation. A previous study (Carlson et al., 2006) showed the total increase in duration at the point of hesitation to be the most important cue rather than each of the factors pause and final lengthening separately. Therefore, pause and final lengthening were now combined in one “total duration increase” feature. The parameter manipulation was done in two different sentence positions as in the previous study. However, in the current experiment the manipulations were similar in the two positions, whereas different parameter settings were used in the previous one. The stimuli were synthesized using the KTH formant-based synthesis system (Carlson & Granström, 1997), giving full flexibility for prosodic adjustments. 160 versions of the utterance were created covering all feature combinations in two positions: a hesitation was placed either in the first part (F) or in the middle (M) of the utterance “I sin F trädgård har Bettan M tagetes och rosor.” (English word-by-word translation: “In her F garden has Bettan M tagetes and roses.”) In addition, there were stimuli without inserted hesitations. The two positions were chosen to be either inside a phrase (F) or between two phrases (M). The hesitation points F and M were placed in the unvoiced stop consonant occlusion and were modelled using three parameters: a) total duration increase combining retardation before the hesitation point and pause, b) F0 slope variation and c) presence/absence of creak.

2.1 Retardation and pause

The segment durations in our test stimuli were set according to the default duration rules in the TTS system. The retardation adjustment was applied to the VC sequence /in/ in “sin” and /an/ in “Bettan” before the hesitation points F and M, respectively, and the pausing was a simple lengthening of the occlusion in the unvoiced stop. All adjustments were done with an equal retardation and pause contribution following our earlier results in Carlson et al. (2006).

Figure 1. a) F0 shapes for the two possible hesitation positions F and M. D=Retardation + Pause. b) Illustration of intonation contours for the two extreme cases in position F.

2.2 F0 slope variation

The intonation was modelled by the default rules in the TTS system. At the hesitation point the F0 was adjusted to model slope variation in 5 shapes: rising contours (+20, +40 Hz), a flat contour (0 Hz) and falling contours (-20, -40 Hz). The pivot point before the hesitation was placed at the beginning of the last vowel before the hesitation, see Figure 1a. Figure 1b shows spectrograms with intonation curves for the two extreme cases in the F position.

2.3 Creak

Creaky voice was set to start three quarters into the last vowel before the hesitation and to reach full effect at the end of the vowel. The creak was modelled by changing every other glottal pulse in time and amplitude (Klatt & Klatt, 1990).
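The alternate-pulse idea can be pictured with a small sketch. This is not the synthesiser code used in the experiment (the manipulation was done inside the KTH formant synthesis system); it is a hedged numpy illustration, and the pulse shift of 1 ms and the amplitude factor of 0.5 are invented values chosen only to show the principle.

```python
import numpy as np

def creak_pulses(pulse_times, pulse_amps, vowel_start, vowel_end,
                 time_shift=0.001, amp_factor=0.5, onset_fraction=0.75):
    """Alter every other glottal pulse in time and amplitude over the last
    quarter of a vowel, in the spirit of the diplophonia-style creak of
    Klatt & Klatt (1990).  All parameter values here are illustrative only."""
    t = np.asarray(pulse_times, dtype=float).copy()
    a = np.asarray(pulse_amps, dtype=float).copy()
    creak_start = vowel_start + onset_fraction * (vowel_end - vowel_start)
    for i in range(len(t)):
        if t[i] >= creak_start and i % 2 == 1:
            # Weight grows from 0 at creak onset to 1 at the vowel end,
            # so the effect "reaches full effect at the end of the vowel".
            w = (t[i] - creak_start) / (vowel_end - creak_start)
            t[i] += w * time_shift                  # delay every other pulse
            a[i] *= 1.0 - w * (1.0 - amp_factor)    # and attenuate it
    return t, a
```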

Figure 2. a) Distribution of hesitation responses (hesitation perception, %) as a function of duration increase (ms). b) Distribution of hesitation perception increase (%) due to addition of creak, as a function of duration increase (ms). Data separated according to position of hesitation (F = first, M = mid).

3 Results and discussion

The results of the experiment are summarized in Figure 2. In Figure 2a, hesitation perception is plotted as a function of total duration increase. The strong effect is similar to and confirms the previous result that the combined effect of pause and retardation is a very strong cue to hesitation. In Figure 2b, the increase in hesitation perception due to the addition of creak is plotted against total duration increase. Here, a compensatory pattern is revealed, in particular in the first position; when the duration adjustment is at the categorical border (at a total duration increase of about 100 ms, cf. Figure 2a), creak has a strengthening effect, favouring the perception of a hesitation. In a similar way, falling F0 contours made perception of hesitation easier at the categorical border for duration, compensating for weak duration cues. These results support the conclusion that duration increase, achieved by the combined effects of retardation and pause, is an extremely powerful cue to perceived hesitation. F0 slope variation and creak play a role, too, but both are far less powerful, functioning as supporting rather than as primary cues. Their greatest effects apparently occur at the categorical border, when the decision hesitation/no hesitation is the most difficult. The results further indicate that subjects are less sensitive to modifications in the middle position (M) than in the first position (F). We relate this to the difference in syntactic structure: in the F position the hesitation occurs in the middle of a noun phrase (“I sin F trädgård”), whereas in the M position it occurs between two noun phrases, functioning as subject and object respectively. A reasonable assumption is that the subjects expected some kind of prosodic marking in the latter position and that therefore a greater lengthening was required in order to produce the percept of hesitation.


This assumption is strengthened by the subjects’ reaction to the other two features investigated. Both intonation and creaky voice have the capacity to signal an upcoming boundary, so that they are more likely to facilitate the detection of a hesitation in a phrase-internal position, where a boundary is unexpected, than between two grammatical phrases. This dependence on syntax is not unexpected: vast numbers of production studies have shown the strength of prosodic signalling to depend on the strength of the syntactic boundary. In conclusion, our results indicate that the perception of hesitation is strongly influenced by deviations from an expected temporal pattern. In addition, different syntactic conditions have an effect on how much changes in prosodic features like the F0 contour and retardation and the presence of creaky voice contribute to the perception of hesitation. In view of this, the modelling of hesitation in speech technology applications should take account of the supporting roles that F0 and creak can play in achieving a realistic impression of hesitation. An important step in the modelling of spontaneous speech would be to include predictions of different degrees of hesitation depending on the utterance structure. To do this, data are required on the distribution of hesitations, see e.g. Strangert (2004). Our long-term goal is to build a synthesis model which is able to produce spontaneous speech on the basis of such data. An even more long-term goal is to include other kinds of disfluencies as well, and to integrate the model in a conversational dialogue system, cf. Callaway (2003).

Acknowledgements

We thank Jens Edlund, CTT, for designing the test environment, and Thierry Deschamps, Umeå University, for technical support in performing the experiments. This work was supported by The Swedish Research Council (VR) and The Swedish Agency for Innovation Systems (VINNOVA).

References

Callaway, C., 2003. Do we need deep generation of disfluent dialogue? In AAAI Spring Symposium on Natural Language Generation in Spoken and Written Dialogue, Tech. Rep. SS-03-07. Menlo Park, CA: AAAI Press.
Carlson, R. & B. Granström, 1997. Speech synthesis. In W.J. Hardcastle & J. Laver (eds.), The Handbook of Phonetic Sciences. Oxford: Blackwell Publ., 768-788.
Carlson, R., K. Gustafson & E. Strangert, 2006. Modelling Hesitation for Synthesis of Spontaneous Speech. Proc. Speech Prosody 2006, Dresden.
Eklund, R., 2004. Disfluency in Swedish human-human and human-machine travel booking dialogues. Dissertation 882, Linköping Studies in Science and Technology.
Horne, M., J. Frid, B. Lastow, G. Bruce & A. Svensson, 2003. Hesitation disfluencies in Swedish: Prosodic and segmental correlates. Proc. 15th ICPhS, Barcelona, 2429-2432.
Klatt, D. & L. Klatt, 1990. Analysis, synthesis and perception of voice quality variations among female and male talkers. JASA 87, 820-857.
Lövgren, T. & J. van Doorn, 2005. Influence of manipulation of short silent pause duration on speech fluency. Proc. DISS2005, 123-126.
Strangert, E., 2004. Speech chunks in conversation: Syntactic and prosodic aspects. Proc. Speech Prosody 2004, Nara, 305-308.
Strangert, E. & R. Carlson, 2006. On modelling and synthesis of conversational speech. Proc. Nordic Prosody IX, 2004, Lund, 255-264.
Sundaram, S. & S. Narayanan, 2003. An empirical text transformation method for spontaneous speech synthesizers. Proc. Interspeech 2003, Geneva.

F-pattern Analysis of Professional Imitations of “hallå” in three Swedish Dialects

Frantz Clermont and Elisabeth Zetterholm
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
{frantz.clermont|elisabeth.zetterholm}@ling.lu.se

Abstract

We describe preliminary results of an acoustic-phonetic study of voice imitations, which is ultimately aimed towards developing an explanatory approach to similar-sounding voices. Such voices are readily obtained by way of imitations, which were elicited by asking an adult-male, professional imitator to utter two tokens of the Swedish word “hallå” in a telephone-answering situation and in three Swedish dialects (Gothenburg, Stockholm, Skania). Formant-frequency (F1, F2, F3, F4) patterns were measured at several landmarks of the main phonetic segments (‘a’, ‘l’, ‘å’), and cross-examined using the imitator’s token-averaged F-pattern and those obtained by imitation. The final ‘å’-segment seems to carry the bulk of differences across imitations, and between the imitator’s patterns and those of his imitations. There is however a notable constancy in F1 and F2 from the ‘a’-segment nearly to the end of the ‘l’-segment, where the imitator seems to have had fewer degrees of articulatory freedom.

1 Introduction

It is an interesting fact, but all the same a challenging one in forensic voice identification, that certain voices should sound similar (Rose & Duncan, 1995), even though they originate from different persons with differing vocal-tract structures and speaking habits. It is also a familiar observation (Zetterholm, 2003) that human listeners can associate an imitated voice with the imitated person. However, there are no definite explanations for similar-sounding voices, and thus there is still no definite approach for understanding their confusability. Nor are there any systematic insights into the degree of success that is achievable in trying to identify an imitator’s voice from his/her imitations. Some valiant attempts have been made in the past to characterise the effects of disguise on voice identification by human listeners. More recently, there have been some useful efforts to evaluate the robustness of speaker identification systems (Zetterholm et al., 2005). The results are however consistent in that “it is possible to trick both human listeners and a speaker verification system” (Zetterholm et al., 2005: p. 254), and that there are still no clear explanations. Overall, the knowledge landscape around the issue of similarity of voices appears to be quite sparse, yet this issue is at the core of the problem of voice identification, which has grown pressing in dealing with forensic-phonetic evaluation of legal and security cases. Our ultimate objective, therefore, is to use acoustic, articulatory and perceptual manifestations of imitated voices as pathways for developing a more explanatory approach to similar-sounding voices than available to date. The present study describes a preliminary step in the acoustic-phonetic analysis of imitations of the word “hallå” in three dialects of Swedish. The formant-frequency patterns obtained are enlightening from a phenomenological and a methodological point of view.

2 Imitations of the Swedish word “hallå” – the speech material

The material gathered thus far consists of auditorily-validated imitations of the Swedish word “hallå”. An adult-male, professional imitator was asked to first produce the word in his own usual way. The imitator is a long-term resident of an area close to Gothenburg and, therefore, his speaking habits are presumed to carry some characteristics of the Gothenburg dialect. He was asked to also produce imitations of “hallå” in situations such as: (i) answering the telephone, (ii) signalling arrival at home, and (iii) greeting a long-lost friend, all in 5 Swedish dialects (Gothenburg, Stockholm, Skania, Småland, Norrland). The 2 tokens obtained for the first 3 dialects in situation (i) were retained for this preliminary study. The recordings took place in the anechoic chamber recently built at Lund University. The analogue signals were sampled at 44 kHz, and then down-sampled by a factor of 4 for formant-frequency analyses.

3 Formant-frequency parameterisation

3.1 Formant-tracking procedure

The voiced region of every waveform was isolated using a spectrographic representation, concurrently with auditory validation. Formants were estimated using Linear-Prediction (LP) analyses through Hanning-windowed frames of 30-msec duration, by steps of 10 msecs, and a pre-emphasis of 0.98. For 25% of the data used for this study, the LP-order had to be increased to 18 from a default value of 14. For each voiced interval, the LP-analyses yielded a set of frame-by-frame poles, among which F1, F2, F3 and F4 were estimated using a method (Clermont, 1992) based on cepstral analysis-by-synthesis and dynamic programming.
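To make the frame-based LP analysis concrete, the sketch below estimates raw formant candidates from the roots of the LP polynomial. It is only an illustration of this general technique: the window, step, pre-emphasis and LP-order values restate those given in the text, the sampling rate of 11025 Hz is an assumption (44 kHz down-sampled by 4), and the cepstral analysis-by-synthesis and dynamic-programming selection of F1–F4 (Clermont, 1992) is not reproduced here.

```python
import numpy as np
import librosa

def lp_pole_frequencies(signal, sr=11025, frame_ms=30, step_ms=10,
                        preemph=0.98, lpc_order=14):
    """Frame-by-frame LP analysis: candidate formant frequencies (Hz) and
    bandwidths from the roots of the LP polynomial for each frame."""
    signal = np.asarray(signal, dtype=float)
    y = np.append(signal[0], signal[1:] - preemph * signal[:-1])  # pre-emphasis
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * step_ms / 1000)
    window = np.hanning(frame_len)
    results = []
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * window
        a = librosa.lpc(frame, order=lpc_order)
        roots = np.array([r for r in np.roots(a) if np.imag(r) > 0])
        freqs = np.angle(roots) * sr / (2 * np.pi)        # pole frequencies
        bws = -np.log(np.abs(roots)) * sr / np.pi          # pole bandwidths
        order = np.argsort(freqs)
        results.append((freqs[order], bws[order]))
    return results
```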

3.2 Landmark selection along the time axis

The expectedly-varying durations amongst the “hallå” tokens raise the non-trivial problem of mapping their F-patterns onto a common time base. We sought a solution to this problem by looking at the relative durations of the main phonetic segments (‘a’, ‘l’, ‘å’), which were demarcated manually. The token-averaged durations for imitated and imitator’s segments are superimposed in Fig. 1, together with the overall mean per segment.

Figure 1. Segmental durations: Mean ratio of ~3 to 1 for ‘a’, ~5 to 1 for ‘å’, relative to ‘l’.

Interestingly, the durations for the imitator’s ‘a’- and ‘å’-segments are closer to those measured for his Gothenburg imitations, and smaller than those measured for his Skanian and Stockholm imitations. Fig. 1 also indicates that the medial ‘l’-segment has a duration that is tightly clustered around 50 msecs and, therefore, it is a suitable reference to which the other segments can be related. On the average, the duration ratio relative to the ‘l’-segment is about 3 to 1 for ‘a’, and 5 to 1 for ‘å’. A total of 45 landmarks were thus selected such that, if 5 are arbitrarily allocated for the ‘l’-segment, there are 3 times as many for the ‘a’-segment and 5 times as many for the ‘å’-segment. The method of cubic-spline interpolation was employed to generate the 45-landmark F-patterns that are displayed in Fig. 2 and subsequently examined.

4 F-pattern analysis

4.1 Inter-token consistency

It is known that F-patterns exhibit some variability because of the measurement method used, and of one’s inability to replicate sounds in exactly the same way. Consequently, the spread magnitude about a token-averaged F-pattern should be useful for gauging measurement consistency, and intrinsic variability to some degree. Table 1 lists spread values that mostly lie within difference-limens for human perception, and are therefore deemed to be tolerable. The spread in F3 for the imitator’s “hallå” is relatively large, especially by comparison with his other formants. However, the top left-hand panel of Fig. 2 does show that there is simply greater variability in the F3 of his initial ‘a’-segment. Overall, there appear to be no gross measurement errors that prevent a deeper examination of our F-patterns.

Table 1. Inter-token spreads (= standard deviations in Hz) averaged across all 45 landmarks.

                                  F1       F2      F3       F4
IMITATOR (SELF)                   33       68      136      72
STOCKHOLM (STK)                   42       68      28       79
GOTHENBURG (GTB)                  23       55      71       75
SKANIA (SKN)                      34       58      36       50
Mean (spread) with IMITATOR:      32 (8)   62 (7)  68 (49)  69 (13)
Mean (spread) without IMITATOR:   33 (10)  60 (7)  45 (23)  68 (16)

4.2 Overview of F-pattern behaviours

For both the imitator’s “hallå” and his imitations, there is less curvilinearity in the formant trajectories for the ‘a’- and ‘l’-segments than in those for the final ‘å’-segment, which behaves consistently like a diphthong. The concavity of the F2-trajectory for the Skanian-like ‘å’-segment seems to set this dialect apart from the other dialects. Quite noticeably for the ‘a’- and ‘l’-segments, F1- and F2-trajectories are relatively flatter, and numerically closer to one another than the higher formants. Interestingly again, the F-patterns for the Gothenburg-like “hallå” seem to be more aligned with those corresponding to the imitator’s own “hallå”.

Figure 2. Landmark-normalised F-patterns: Imitator & his imitations of 3 Swedish dialects.


4.3 Imitator versus imitations – a quantitative comparison

The ‘a’- and ‘l’-segments examined above seem to retain the strongest signature of the imitator’s F1- and F2-patterns. To obtain a quantitative verification of this behaviour, we calculated landmark-by-landmark spreads (Fig. 3) of the F-patterns with all data pooled together (left panel), and without the Skania-like data (right panel). The left-panel data highlight a large increase of the spread in F1 and F2 for the final ‘å’-segment, thus confirming a major contrast with the other dialectal imitations. The persistently smaller spread in F1 and F2 for the two initial segments raises the hope of being able to detect some invariance in professional imitations of “hallå”. The relatively larger spreads in F3 and F4 cast some doubt on these formants’ potency for de-coupling our imitator’s “hallå” from his imitations.
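The landmark-by-landmark spread is simply a standard deviation taken across the pooled tokens at each of the 45 landmarks; a minimal numpy sketch, in which the array layout is an assumption made for illustration:

```python
import numpy as np

def landmark_spreads(f_patterns):
    """f_patterns: array of shape (n_tokens, 45, 4) holding F1..F4 in Hz at
    each of the 45 landmarks for all pooled tokens.  Returns the spread
    (sample standard deviation across tokens) per landmark and formant."""
    return np.std(f_patterns, axis=0, ddof=1)   # shape (45, 4)
```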

Figure 3. Landmark-by-landmark spreads: (left) all data pooled; (right) Skania-like excluded.

5 Summary and ways ahead

The results of this study are prima facie encouraging, at least for the imitations obtained from our professional imitator. It is not yet known whether the near-constancy observed through F1 and F2 of the initial segments of “hallå” will be manifest in other situational tokens, and whether a similar behaviour should be expected with different imitators and phonetic contexts. We have looked at formant-frequencies one at a time but, as shown by Clermont (2004) for “hello”, there are deeper insights to be gained by re-examining these frequencies systemically. The ways ahead will involve exploring all these possibilities.

Acknowledgements

We express our appreciation to Prof. G. Bruce for his auditory evaluation of the imitations. We thank Prof. Bruce and Dr D.J. Broad for their support, and the imitator for his efforts.

References

Clermont, F., 1992. Formant-contour parameterisation of vocalic sounds by temporally-constrained spectral matching. Proc. 4th Australian Int. Conf. Speech Sci. & Tech., 48-53.
Clermont, F., 2004. Inter-speaker scaling of poly-segmental ensembles. Proc. 10th Australian Int. Conf. Speech Sci. & Tech., 522-527.
Rose, P. & S. Duncan, 1995. Naïve auditory identification and discrimination of similar sounding voices by familiar listeners. Forensic Linguistics 2, 1-17.
Zetterholm, E., 2003. Voice imitation: A phonetic study of perceptual illusions and acoustic successes. Dissertation, Lund University.
Zetterholm, E., D. Elenius & M. Blomberg, 2005. A comparison between human perception and a speaker verification system score of a voice imitation. Proc. 10th Australian Int. Conf. Speech Sci. & Tech., 393-397.

Describing Swedish-accented English

Una Cunningham
Department of Arts and Languages, Högskolan Dalarna
[email protected]

Abstract

This paper is a presentation of the project Swedish accents of English, which is in its initial stages. The project attempts to make a phonetic and phonological description of some varieties of Swedish English, or English spoken in Sweden, depending on the status attributed to English in Sweden. Here I show some curious results from a study of acoustic correlates of vowel quality in the English and Swedish of young L1 Swedish speakers.

1 Introduction

1.1 Background

The aim of the proposed project is to document the phonetic features of an emerging variety of English, i.e. the English spoken by young L1 speakers of Swedish. At a time when the relative positions of Swedish and English in Sweden are the stuff of Government bills (Regeringen, 2005), the developing awareness of the role English has as an international language in Sweden is leading to a rejection of native speaker targets for Swedish speakers of English. Throughout what Kachru (1992) called the expanding circle, learners of English are no longer primarily preparing for communication with native speakers of English but with other non-native speakers. In a recent article, Seidlhofer (2005) called for the systematic study of the features of English as a lingua franca (ELF), that is communication that does not involve any native speakers, in order to free ELF from native-speaker norms imposed upon it. She would prefer to see ELF alongside native speaker varieties rather than constantly being monitored and compared to them. The point is that there are features of the pronunciation of native speaker varieties which impede communication, and features of non-native pronunciation which do not disturb communication. Rather than teaching learners to be as native-like as possible, communication would be optimised by concentrating on the non-native listener rather than the native listener. Some young people are British-oriented in their pronunciation, either from RP/BBC English or another accent; others have general American as a clear influence, while another group is not clearly influenced by any native speaker norm. A full phonetic description of these accents of English does not as yet exist, and is of interest as a documentation of an emerging variety of English, at a time when previously upheld targets for the pronunciation of English by Swedish learners have been abandoned and English is growing in importance (Phillipson, 1992; Skolverket, 2006).

1.2 Previous studies

The distinction between English as a Foreign Language (EFL) and English as an International Language (EIL) or English as a Lingua Franca (ELF) is important here. The number of non-native speakers of English increasingly exceeds the number of native speakers, and the native speaker norm as the “given and standard measure” (Jenkins, 2004) for English learners must be questioned. There is a clear distinction between those learners who aspire to sound as native-like as possible and those who wish to be as widely understood as possible. McArthur (2003) makes a distinction between English in its own right and English in its global role and argues that the distinction between English as a second language and English as a foreign language is becoming less useful, as people in a range of countries, including those in Scandinavia, routinely use the language. Seidlhofer (2005) called for the description of English as a Lingua Franca (a term rejected by some writers because of the associations of lingua franca with pidgins and mixed forms of language). Those who use this term usually want to indicate the same as those who use English as an international language, i.e. a “core” of English stripped of the less useful features of native speaker varieties, such as weak forms of function words, typologically unusual sounds such as the interdental fricatives etc. (e.g. Jenkins, 2004). As corpora of non-native English (both non-native to native and non-native to non-native) are being developed, such as the English as a lingua franca in academic settings (ELFA) corpus (Mauranen, 2003) and the general Vienna-Oxford International Corpus of English (VOICE) (Seidlhofer, 2005), this is now possible, although few studies have been made of pronunciation.

2 Methodological thoughts

The project aim is to make a thorough study of some phonetic and phonological features of Swedish accents of English in two groups of informants. The first group is young adults (those who are currently at upper secondary school or have left upper secondary education in the past 5 years, and are thus 16-24 years of age). These speakers have not usually received any pronunciation teaching. The second group of Swedish speakers of English is university teachers who are over 40 years of age and who have not spent long periods in native English-speaking environments or studied English at university level, but who do use English regularly. Although there is a difference in the stability of a learner variety compared to an established variety of a language (Tarone, 1983), there is certainly a set of features characteristic of a Swedish accent of English. It should be possible to make interesting generalisations. A first step will be to establish the phoneme inventories of the English of each informant. The acoustic quality of vowels produced in elicited careful speech (reading words in citation form and texts) as well as in spontaneous speech (dialogues between non-native informants) will be investigated. Within-speaker variation is of interest here to capture variable production, as well as between-speaker variation. The realisations of the vowel phonemes of Swedish-English will be charted and examined. There are hypothesised to be some consonant phonemes of native varieties of English that are missing from Swedish English (as in any ELF variety, cf. Seidlhofer, 2005) – voiced alveolar and palatoalveolar fricatives and affricates are candidates. The realisation of the English alveolar consonants will also be closely studied, as will various kinds of allophonic variation such as dark and light /l/, rhoticism, phonotactic effects, assimilation, vowel reduction, rapid speech phenomena etc. Flege, Schirru & MacKay (2003) established that the two phonetic systems of Italian-speaking learners of English interact. Their study showed that L2 vowels are either assimilated to L1 vowels or dissimilated from them (i.e. are made more different than the corresponding vowels produced by monolinguals speaking English or Italian), depending on the usage patterns of the individual learners. A similar phenomenon was seen in bilingual speakers of Swedish and English (Cunningham, 2004), where there was a dissimilation in timing. An attempt will be made to detect any similar patterns (i.e. instances of the Swedish accent having greater dissimilation between categories than native varieties in the speech of the Swedish-English speakers being studied). The differences between Swedish and English timing and the way bilingual individuals do not usually maintain two separate systems for organising the temporal relationship of vowels and consonants have been the subject of earlier research (Cunningham, 2003). The timing of Swedish-accented English will be studied in the data collected in this project. The way the learners deal with post-vocalic voicing and the relationship between vowel quality and vowel and consonant quantity are particularly interesting as regards their consequences for comprehensibility, as the perceptive weight of quantity appears to be different in Swedish and English (cf. e.g. McAllister, Flege & Piske, 2002).
The consequences of the timing solutions adopted by Swedish speakers of English for their comprehensibility to native speakers of English, Swedish speakers of English and other non-native speakers of English could be investigated at a later stage.

3 Early results

Recordings of sixteen young Swedish speakers (12 female, 4 male, with Swedish as their only home and heritage language) have been made. They were all in their first year of upper secondary education at the time of recording (around 16 years old). Figures 1 and 2 show the relationships between the first and second formant frequencies for high front vowels in elicited citation form words for one of these speakers (known as Sara for the purpose of this study). Sara’s English high front vowels appear to be qualitatively dissimilated while her Swedish high front vowels are not clearly qualitatively distinguished using the first two formants. Sara’s English high vowels are apparently generally higher than these Swedish high vowels. Notice the fronting found for Sara’s English in two instances of /u:/ in the word choose. This particular word has been pronounced with fronting for other speakers too. Might this be a case of a feature of Estuary English making its way into the English spoken in Sweden?

Figure 1. A 16-year-old female Swedish speaker’s (“Sara”) high vowels from a read list of English words, plotted as F1 against F2 (Hz), with reference values for English /i/ and /u/ from Ladefoged’s material: http://hctv.humnet.ucla.edu/departments/linguistics/VowelsandConsonants/vowels/chapter3/english.aiff


Figure 2. Some of the same speaker’s high vowels from a read list of Swedish words, plotted as F1 against F2 (Hz), with reference values for Swedish /i/ and /u/ from Eklund & Traunmüller (1997).

References

Cunningham, U., 2003. Temporal indicators of language dominance in bilingual children. Proceedings from Fonetik 2003, Phonum 9, Umeå University, 77-80.
Cunningham, U., 2004. Language Dominance in Early and Late Bilinguals. ASLA, Södertörn.
Eklund, I. & H. Traunmüller, 1997. Comparative study of male and female whispered and phonated versions of the long vowels of Swedish. Phonetica 54, 1-21.
Flege, J.E., C. Schirru & I.R.A. MacKay, 2003. Interaction between the native and second language phonetic subsystems. Speech Communication 40, 467-491.
Jenkins, J., 2004. Research in teaching pronunciation and intonation. Annual Review of Applied Linguistics 24, 109-125.
Kachru, B. (ed.), 1992. The Other Tongue (2nd edition). Urbana and Chicago: University of Illinois Press.
Mauranen, A., 2003. Academic English as lingua franca—a corpus approach. TESOL Quarterly 37, 513-27.
McAllister, R., J.E. Flege & T. Piske, 2002. The influence of L1 on the acquisition of Swedish quantity by native speakers of Spanish, English and Estonian. Journal of Phonetics 30(2), 229-258.
McArthur, T., 2003. World English, Euro-English, Nordic English? English Today 73(19), 54-58.
Phillipson, R., 1992. Linguistic Imperialism. Oxford: Oxford Univ. Press.
Regeringen, 2005. Bästa språket – en samlad svensk språkpolitik. Prop. 2005/06:2.
Seidlhofer, B., 2005. English as a lingua franca. ELT Journal 59(4), 339-341.
Skolverket, 2006. Förslag till kursplan.
Tarone, E., 1983. On the variability of interlanguage systems. Applied Linguistics 4, 142-163.

Quantification of Speech Rhythm in Norwegian as a Second Language

Wim A. van Dommelen
Department of Language and Communication Studies, NTNU
[email protected]

Abstract

This paper looks into the question of how to quantify rhythm in Norwegian spoken as a second language by speakers from different language backgrounds. The speech material for this study was taken from existing recordings from the Language Encounters project and consisted of sentences read by natives and speakers from six different L1s. Measurements of syllable durations and speech rate were made. Seven different metrics were calculated and used in a discriminant analysis. For the five utterances investigated, statistical classification was to a large degree in congruence with L1 group membership. The results therefore suggest that L2 productions differed rhythmically from Norwegian spoken as L1.

1 Introduction

During the last few years a number of attempts have been made to classify languages according to rhythmical categories using various metrics. To investigate rhythm characteristics of eight languages, Ramus, Nespor & Mehler (1999) calculated the average proportion of vocalic intervals and standard deviation of vocalic and consonantal intervals over sentences. Though their metrics appeared to reflect aspects of rhythmic structure, considerable overlap was also found. Grabe's Pairwise Variability Index (PVI; see section 2.2) is a measure of differences in vowel duration between successive syllables and has been used by, e.g., Grabe & Low (2002), Ramus (2002) and Stockmal, Markus & Bond (2005). In order to achieve more reliable results, Barry, Andreeva, Russo, Dimitrova & Kostadinova (2003) proposed to extend existing PVI measures by taking consonant and vowel intervals together. The present paper takes an exploratory look into the question of how to quantify speech rhythm in Norwegian spoken by second language users. Seven metrics will be used, five of which are based on syllable durations. Two metrics are related to speech rate, and the last one is Grabe's normalized Pairwise Variability Index with syllable duration as a measure.

2 Method

2.1 Speech material

The speech material used for this study was chosen from existing recordings made for the Language Encounters project. These recordings were made in the department's sound-insulated studio and stored with a sampling frequency of 44.1 kHz. Five different sentences were selected consisting of 8, 10, 11, 11, and 15 syllables, respectively. There were six second language speaker groups with the following L1s (number of speakers in parentheses): Chinese (7), English (4), French (6), German (4), Persian (6) and Russian (4). Six native speakers of Norwegian served as a control group. The total number of sentences investigated was thus 37 × 5 = 185.

2.2 Segmentation and definition of metrics

The 185 utterances were segmented into syllables and labeled using Praat (Boersma & Weenink, 2006). Syllabification of an acoustic signal is not a trivial task. It was guided primarily by the aim of achieving consistent results across speakers and utterances. In words containing a sequence of a long vowel and a short consonant in a context like V:CV (e.g., fine [nice]), the boundary was placed before the consonant (achieving fi-ne); after a short vowel plus long consonant, as in minne (memory), it was placed after the consonant (minn-e). Only when the intervocalic consonant was a voiceless plosive was the boundary always placed after the consonant (e.g. in mat-et [fed]). To compare temporal structure of the L2 with the L1 utterances, seven different types of metrics were defined. In all cases calculations were related to each of the seven groups of speakers as a whole. The first metric was syllable duration averaged over all syllables of each utterance, yielding one mean syllable duration for each sentence and each speaker group. Second, the standard deviation for the syllable durations pooled over the speakers of each group was calculated for each of the single utterances' syllables. The mean standard deviation was then taken as the second metric, thus expressing mean variation of syllable durations across each utterance. For the definition of the third and fourth metric, let us look at Figure 1. In this figure, closed symbols depict mean syllable durations in the sentence To barn matet de tamme dyrene (Two children fed the tame animals) produced by six native speakers. The syllables are ranked according to their increasing durations. Similarly, the open symbols give the durations for the same syllables produced by the group of seven Chinese speakers. Note that the order of the syllables is the same as for the Norwegian natives. Indicated are regression lines fitted to the two groups of data points. The correlation coefficient for the relation between syllable duration and the rank number of the syllables as defined by the Norwegian reference is the third metric in this study (for the Chinese speaker group presented in the figure r = 0.541). Further, the slope of the regression line was taken as the fourth metric (here: 18.7). The vertical bars in Figure 1 indicate ± 1 standard deviation. The mean of the ten standard deviation values represents the second metric defined above (for Norwegian 27 ms; for Chinese 63 ms).
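As an illustration, the sketch below computes metrics 1–4 for one utterance and one speaker group from a matrix of syllable durations; the native group's mean durations define the syllable ranking, exactly as in Figure 1. The data layout (speakers × syllables, in ms) is an assumption made for the example.

```python
import numpy as np
from scipy import stats

def group_metrics(group_durs, native_mean_durs):
    """group_durs: (n_speakers, n_syllables) syllable durations in ms for one
    utterance and one L1 group; native_mean_durs: mean durations of the same
    syllables for the Norwegian reference group."""
    mean_per_syll = group_durs.mean(axis=0)
    metric1 = mean_per_syll.mean()                        # mean syllable duration
    metric2 = group_durs.std(axis=0, ddof=1).mean()       # mean per-syllable SD
    rank = np.argsort(np.argsort(native_mean_durs)) + 1   # syllable rank defined by the natives
    slope, intercept, r, p, se = stats.linregress(rank, mean_per_syll)
    metric3, metric4 = r, slope                           # correlation coefficient and slope
    return metric1, metric2, metric3, metric4
```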

Figure 1. Mean duration of syllables (ms) plotted against syllable rank in a Norwegian utterance, ranked according to increasing duration for six native speakers (closed symbols with regression line). Open symbols indicate mean durations for a group of seven Chinese subjects with syllable rank as for the L1 speakers. Vertical bars indicate ± 1 standard deviation.

As metric number five, speech rate was chosen, calculated as the number of (actually produced) phonemes per second. The standard deviation belonging to this mean number of phonemes per second served as the sixth metric. In both cases, there was one single value per utterance and speaker group. Finally, the seventh metric was the normalized Pairwise Variability Index (nPVI) as used by Grabe & Low (2002):

nPVI = \frac{100}{m-1} \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{(d_k + d_{k+1})/2} \right|    (1)

In this calculation the absolute difference between the durations (d) of two successive syllables is divided by the mean duration of the two syllables. This is done for all (m−1) successive syllable pairs in an utterance (m = the number of syllables). Finally, by dividing the sum of the (m−1) amounts by (m−1), a mean normalized difference is calculated and expressed as a percentage.
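A direct transcription of equation (1), assuming nothing more than a list of syllable durations:

```python
def npvi(durations):
    """Normalized Pairwise Variability Index over successive syllable
    durations, as in equation (1); the result is expressed in percent."""
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(d1 - d2) / ((d1 + d2) / 2.0) for d1, d2 in pairs]
    return 100.0 * sum(terms) / len(terms)

# Perfectly even durations give 0; alternating long/short syllables give a high value.
print(npvi([200, 200, 200, 200]))   # 0.0
print(npvi([250, 100, 250, 100]))   # ~85.7
```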

3 Results

3.1 Mean syllable duration

Since the main temporal unit under scrutiny is the syllable, let us first see whether and to what extent the various speaker groups produced different syllable durations. As can be seen from Table 1, mean syllable durations vary substantially. Shortest durations were found for the natives (178 ms), while the subjects with a Chinese L1 produced the longest syllables (285 ms). The other groups have values that are more native-like, in particular the German speakers with a mean of 200 ms. For all speaker groups the standard deviations are quite large, which is due to both inter-speaker variation and the inclusion of all the different types of syllables. (Note that the standard deviation described here is different from the second metric; see 2.2.) According to a one-way analysis of variance, the overall effect of speaker group on syllable duration is statistically significant (F(6, 2029) = 40.322; p < .0001). Calculation of a Games-Howell post-hoc analysis resulted in the following homogeneous subsets (level of significance p = 0.05): (Chinese); (English, French, German, Russian); (French, English, Persian, Russian); (German, Norwegian, English, Russian); (Persian, French); (Russian, English, French, German); (Norwegian, German). It is thus obvious that syllable durations overlap considerably and do not really distinguish the speaker groups.

Table 1. Mean syllable durations and standard deviations in ms for six groups of L2 speakers and a Norwegian control group. Means are across five utterances and all speakers in the respective speaker groups (see 2.2).

        Chinese  English  French  German  Persian  Russian  Norwegian
mean    285      227      238     200     255      224      178
sd      115      107      98      91      102      111      84
n       387      220      330     220     329      220      330

3.2 Discriminant analysis

In order to investigate whether rhythmical differences between utterances from the different speaker groups can be captured by the seven metrics defined above, a discriminant analysis was performed. It appears from the results that in the majority of cases the L2-produced utterances were correctly classified (Table 2). The overall correct classification rate amounts to 94.3%. Only one utterance produced by the Chinese speaker group was classified as Persian and one utterance from the French group was confused with the category Russian.


Table 2. Predicted L1 group membership (percent correct) of five utterances according to a discriminant analysis using seven metrics (see section 2.2).

             Predicted L1 group membership
L1 group     Chinese  English  French  German  Persian  Russian  Norwegian
Chinese      80       0        0       0       20       0        0
English      0        100      0       0       0        0        0
French       0        0        80      0       0        20       0
German       0        0        0       100     0        0        0
Persian      0        0        0       0       100      0        0
Russian      0        0        0       0       0        100      0
Norwegian    0        0        0       0       0        0        100

The results of the discriminant analysis further showed that three of the six discriminant functions reached statistical significance, cumulatively explaining 96.4% of the variance. For the first function, the metrics with most discriminatory power were slope (metric 4), speech rate (metric 5) and mean syllable duration (metric 1). The second discriminant function also had slope and speech rate as important variables, but additionally the standard deviations for speech rate (metric 6) and for syllable duration (metric 2), and nPVI (metric 7). Finally, of highest importance for the third function were metrics 5, 3 (correlation coefficient), 4, and 7, in that order.
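The procedure can be sketched in outline with a standard linear discriminant analysis over the 35 observations (7 speaker groups × 5 utterances) and 7 metrics. The sketch below uses scikit-learn purely as an illustration of the technique; it is not the statistics package actually used, and the data files are hypothetical placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical inputs: X is a (35, 7) matrix of the seven metrics, one row per
# speaker group and utterance; y holds the corresponding L1 group labels.
X = np.load("metrics.npy")
y = np.load("groups.npy")

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
predicted = lda.predict(X)
print("correct classification rate:", np.mean(predicted == y))
print("variance explained per discriminant function:", lda.explained_variance_ratio_)
```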

4 Conclusion

The present results suggest that the utterances spoken by the second language users differed in rhythmical structure from those produced by the native speakers. It was shown that it is possible to quantify rhythm using direct and indirect measures. Though the statistical analysis yielded promising results, it should be kept in mind that the number of utterances investigated was relatively small. Therefore, more research will be needed to confirm the preliminary results and to refine the present approach.

Acknowledgements

This research is supported by the Research Council of Norway (NFR) through grant 158458/530 to the project Språkmøter. I would like to thank Rein Ove Sikveland for the segmentation of the speech material.

References

Barry, W.J., B. Andreeva, M. Russo, S. Dimitrova & T. Kostadinova, 2003. Do rhythm measures tell us anything about language type? Proceedings 15th ICPhS, Barcelona, 2693-2696.
Boersma, P. & D. Weenink, 2006. Praat: doing phonetics by computer (Version 4.4.11) [Computer program]. Retrieved February 23, 2006, from http://www.praat.org/.
Grabe, E. & E.L. Low, 2002. Durational variability in speech and the rhythm class hypothesis. In C. Gussenhoven & N. Warner (eds.), Laboratory Phonology 7. Berlin: Mouton, 515-546.
Ramus, F., 2002. Acoustic correlates of linguistic rhythm: Perspectives. Proceedings Speech Prosody 2002, Aix-en-Provence, 115-120.
Ramus, F., M. Nespor & J. Mehler, 1999. Correlates of linguistic rhythm in the speech signal. Cognition 73, 265-292.
Stockmal, V., D. Markus & D. Bond, 2005. Measures of native and non-native rhythm in a quantity language. Language and Speech 48, 55-63.

/nailon/ – Online Analysis of Prosody

Jens Edlund and Mattias Heldner
Department of Speech, Music and Hearing, KTH, Stockholm
{edlund|mattias}@speech.kth.se

Abstract

This paper presents /nailon/ – a software package for online real-time prosodic analysis that captures a number of prosodic features relevant for interaction control in spoken dialogue systems. The current implementation captures silence durations; voicing, intensity, and pitch; pseudo-syllable durations; and intonation patterns. The paper provides detailed information on how this is achieved.

1 Introduction

All spoken dialogue systems, no matter what flavour they come in, need some kind of interaction control capabilities in order to identify places where it is legitimate to begin to talk to a human interlocutor, as well as to avoid interrupting the user. Most current systems rely exclusively on silence duration thresholds for making such interaction control decisions, with thresholds typically ranging from 500 to 2000 ms (e.g. Ferrer, Shriberg & Stolcke, 2002). Such an approach has obvious drawbacks. Users generally have to wait longer for responses than in human-human interactions, but at the same time they run the risk of being interrupted by the system. This is where /nailon/ – our software for online analysis of prosody and the main focus of this paper – enters the picture.

2 Design criteria for practical applications

In order to use prosody in practical applications, the information needs to be available to the system, which places special requirements on the analyses. First of all, in order to be useful in live situations, all processing must be performed automatically, in real time, and deliver its results with minimal latency (cf. Shriberg & Stolcke, 2004). Furthermore, the analyses must be online in the sense of relying on past and present information only, and cannot depend on any right context or look-ahead. There are other technical requirements: the analyses should be sufficiently general to work for many speakers and many domains, and should be predictable and constant in terms of memory use, processor use, and latency. Finally, although neither a strict theoretical nor a technical requirement, it is highly desirable to use concepts that are relevant to humans. In the case of prosody, measurements should be made on psychoacoustic or perceptually relevant scales.

3 /nailon/

The prosodic analysis software /nailon/ was built to meet the requirements and to capture silence durations; voicing, intensity, and pitch; pseudo-syllable durations; and intonation patterns. It implements high-level methods accessible through Tcl/Tk, and the low-level audio processing is handled by the Snack sound toolkit, with pitch-tracking based on the ESPS tool get_f0. /nailon/ differs from Snack in that its analyses are incremental with relatively small footprints and can be used for online analyses. The implementation is real-time in the sense that it performs in real time, with small and constant latency, on a standard PC. It is a key feature that the processing is online – in fact, /nailon/ is a phonetic anagram of online. On the acoustic level, this goes well with human circumstances as humans rarely need acoustic right context to make decisions about segmentation. The requirements on memory and processor usage are met by using incremental algorithms, resulting in a system with a small and constant footprint and flexible processor usage. The generality requirements are met by using online normalisation and by avoiding algorithms relying on ASR. The analysis is in some ways similar to that used by Ward & Tsukahara (2000), and is performed in several consecutive steps. Each step is described in detail below.

3.1 Audio acquisition

The audio signal is acquired through standard Snack object methods from any audio device. Each new frame is pushed onto a fixed-length buffer of predetermined size, henceforth the current buffer. The buffer size is a factor of the processing unit size. Note that the processing unit size is not the inverse of the sampling frequency, which defaults to 1/16 kHz. Rather, it should be larger by an order of magnitude to ensure smooth processing. The default processing unit size is 10 ms, and the default current buffer size is 40 such units, or 400 ms. The current buffer, then, is in effect a moving window with a length of less than half a second. As far as the processing goes, sound that is pushed out on the left side of the buffer is lost, as the Snack object used for acquisition is continuously truncated. The current buffer is updated every time sufficient sound to fill another processing unit has been acquired – 100 times per second given the default settings.
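The buffering scheme can be pictured with a small sketch. The actual implementation is Tcl/Tk on top of Snack; the Python class below is only an illustration of the moving window of 10 ms processing units, and the class and method names are invented.

```python
from collections import deque

class CurrentBuffer:
    """Moving window over the most recent audio, organised in processing units."""
    def __init__(self, unit_ms=10, n_units=40, sample_rate=16000):
        self.samples_per_unit = sample_rate * unit_ms // 1000   # 160 samples per 10 ms unit
        self.units = deque(maxlen=n_units)                      # 40 units = 400 ms window
        self._pending = []

    def push_samples(self, samples):
        """Append raw samples; whenever a full processing unit has accumulated,
        push it onto the buffer (older units fall off the left edge and are lost)."""
        self._pending.extend(samples)
        new_units = 0
        while len(self._pending) >= self.samples_per_unit:
            unit = self._pending[:self.samples_per_unit]
            del self._pending[:self.samples_per_unit]
            self.units.append(unit)
            new_units += 1
        return new_units    # number of new units acquired since the last push
```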

3.2 Preprocessing

In many cases, the online requirement makes it impractical or impossible to use filters directly on the Snack sound object used for acquisition. Instead, /nailon/ provides access to the raw current audio buffer, so that filters can be applied to it before any other processing takes place. Filters are applied immediately before each get_f0 extraction (see the next section). Using filters in this manner causes /nailon/ to duplicate the current audio buffer in order to have a raw, unfiltered copy of the buffer available at all times.

3.3 Voicing, pitch, and intensity extraction

Voicing, pitch, and intensity are extracted from the current buffer using the Snack/ESPS get_f0 function. This process is made incremental by repeating it over the current buffer as the buffer is updated. The rate at which extraction takes place is managed externally, which facilitates robust handling of varying processor load caused by other processes. In an ideal situation, the update takes place every time a new processing unit has been pushed onto the current buffer, in which case only the get_f0 results for the very last processing unit of the buffer are used. If this is not possible due to processor load, then a variable number N of processing units will have been added to the buffer since the last F0 extraction took place, and the last N results from get_f0 will be used, where N is a number smaller than the length of the current buffer in processing units. In this case, we introduce a latency of N processing units to the processing at this stage. The /nailon/ configuration permits a maximum update rate to be given in order to put a cap on the processing requirements of the analysis. The default setting is to process every time a single processing unit has been added, which provides smooth processing on a regular PC at a negligible latency. Each time a get_f0 extraction is performed, /nailon/ raises an event for each of the new get_f0 results produced by the extraction, in sequence. These events, called ticks, trigger each of the following processing steps.

3.4 Filtering
Each tick triggers a series of event-driven processing steps. These steps are generally optional and can be disabled to save processing time. The steps described here are the ones used by default. The first step is a filter containing a number of reality checks. Pitch and intensity are checked against preset minimum and maximum thresholds, and set to an undefined value if they fail to meet these. Similarly, if voicing was detected, it is removed if the pitch is out of bounds. Correction for octave errors is planned to go here as well, but is not currently implemented. Note that removing values at this stage does not put an end to further processing – subsequent processes may continue by extrapolation or other means. Median filtering can be applied to the pitch and intensity data. If this is done, a delay of half the filter length (in processing units) is introduced at this point. By default, a median filter of seven processing units is used. In effect, this causes all subsequent processes to focus on events that took place 3 processing units back, a delay of less than 40 ms. On the other hand, the filter makes the analysis more robust. Finally, the resulting pitch and intensity values are transformed into semitones and dB, respectively.
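The reality checks, median filtering and unit conversions could look roughly like this. The thresholds, the reference values for the semitone and dB scales, and the treatment of intensity as a power-like measure are all assumptions made for the sketch; the paper does not specify them.

```python
import math
from collections import deque

F0_REF_HZ = 100.0        # assumed reference for the semitone scale
INTENSITY_REF = 1.0      # assumed reference for the dB scale (power-like measure)

def reality_check(f0, intensity, voiced, f0_min=60.0, f0_max=400.0, int_max=1e6):
    """Out-of-bounds values become undefined; voicing is removed if pitch is out of bounds."""
    if f0 is not None and not (f0_min <= f0 <= f0_max):
        f0, voiced = None, False
    if intensity is not None and not (0.0 <= intensity <= int_max):
        intensity = None
    return f0, intensity, voiced

class MedianFilter:
    """Running median over the last `width` units (7 by default), which makes the
    analysis focus on events about width // 2 units back (< 40 ms with 10 ms units)."""
    def __init__(self, width=7):
        self.window = deque(maxlen=width)
    def __call__(self, value):
        if value is not None:
            self.window.append(value)
        return sorted(self.window)[len(self.window) // 2] if self.window else None

def to_semitones(f0_hz):
    return 12.0 * math.log2(f0_hz / F0_REF_HZ) if f0_hz else None

def to_db(intensity):
    return 10.0 * math.log10(intensity / INTENSITY_REF) if intensity else None
```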

3.5 Range normalisation of F0 and intensity
/nailon/ implements algorithms for the calculation of incremental means and standard deviations. Each new processing unit causes an update of the mean and standard deviation of both pitch and intensity, provided that it was judged to contain voiced speech by the previous processing stage. The dynamic mean and standard deviation values are used as a model for normalising and categorising new values. The stability of the model is tracked by determining whether the standard deviation is generally decreasing. Informal studies show that the model stabilises after less than 20 seconds of speech has been processed, given a single speaker in a stable sound environment. Currently, /nailon/ may either cease updating means and standard deviations when stability is reached, or continue updating them indefinitely, with ever-decreasing likelihood that they will change. A possibility to reset the model is also available. A decaying algorithm, which will permit us to fine-tune how stable or dynamic the normalisation should be, has been designed but has yet to be implemented. The mean and standard deviation are used to normalise the pitch and intensity values with regard to the preceding speech, by expressing them as the distance from the mean in standard deviations.
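One standard way to maintain incremental means and standard deviations is Welford's online algorithm, sketched below; the paper does not say which algorithm /nailon/ uses, so this is an illustration rather than a description of the implementation.

```python
import math

class RunningStats:
    """Incremental mean and standard deviation, updated once per voiced processing unit."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
    def normalise(self, x):
        """Express x as its distance from the running mean in standard deviations."""
        return (x - self.mean) / self.std if self.std > 0 else 0.0
    def reset(self):
        self.__init__()

# One tracker per feature, e.g. pitch in semitones and intensity in dB:
pitch_stats, intensity_stats = RunningStats(), RunningStats()
```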

3.6 Silence detection
Many of the analyses we have used /nailon/ for to date are refinements of, or additions to, speech/silence decisions. For this reason a simplistic speech activity detection (SAD) is implemented. Note, however, that /nailon/ would work equally well or better together with an external SAD. /nailon/ uses a simple intensity threshold which is recalculated continuously and is defined as the valley following the first peak in an intensity histogram. /nailon/ signals a change from silence to speech whenever the threshold is exceeded for a configurable number of consecutive processing units, and vice versa. The default number is 30, resulting in a latency of 300 ms for speech/silence decisions. Informal tests show no decrease in performance if this number is lowered to 20, but it should be noted that the system has only been used on sound with a very good signal-to-noise ratio.
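A possible reading of this speech activity detection is sketched below. The histogram resolution and the exact peak/valley search are not described in the paper, so they are assumptions; only the "valley after the first peak" idea and the 30-unit (300 ms) hold time come from the text.

```python
def intensity_threshold(intensities, n_bins=50):
    """Estimate a speech/silence threshold as the valley following the first
    peak in a histogram of recent intensity values."""
    lo, hi = min(intensities), max(intensities)
    if hi <= lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in intensities:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    peak = next((i for i in range(1, n_bins - 1)
                 if counts[i] >= counts[i - 1] and counts[i] > counts[i + 1]), 0)
    valley = next((i for i in range(peak + 1, n_bins - 1)
                   if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]),
                  peak + 1)
    return lo + (valley + 0.5) * width

class SpeechActivityDetector:
    """Flip between silence and speech only after `hold` consecutive units
    on the other side of the threshold (30 units = 300 ms by default)."""
    def __init__(self, threshold, hold=30):
        self.threshold, self.hold = threshold, hold
        self.in_speech, self.run = False, 0
    def update(self, intensity):
        if (intensity > self.threshold) != self.in_speech:
            self.run += 1
            if self.run >= self.hold:
                self.in_speech, self.run = not self.in_speech, 0
        else:
            self.run = 0
        return self.in_speech
```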

3.7 Psyllabification
/nailon/ keeps a copy of pitch, intensity and voicing information for the last seen consecutive stretch of speech at all times. Whenever silence is encountered, the intensity values of this record are searched backwards (last processing unit first) for a convex hull (loosely based on Mermelstein, 1975) contained in it. A hull in the intensity values of speech is assumed to correspond roughly to a syllable, thus providing a pseudo-syllabification, or psyllabification. By searching backwards, the hull that occurred last is found first. Currently, processing ceases at this point, since only the hulls directly preceding silence have been of interest to us so far. A convex hull in /nailon/ is defined as a stretch of consecutive value triplets ordered chronologically, where the centre value is always above or on a line drawn between the first and the last value. As this definition is very sensitive to noisy data, it is relaxed by allowing a limited number of values to drop below the line between the first and last value, as long as the area between that line and the actual values is less than a preset threshold.
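The relaxed hull criterion and the backwards search can be illustrated as follows. The limits on the number of dips and on the area, the minimum hull length, and the greedy leftwards growth of the candidate are all illustrative choices; the paper only states that the limits are preset and that the search runs backwards from the silence.

```python
def is_hull(values, max_below=2, max_area=1.0):
    """Check the relaxed hull criterion: in every chronological triplet the centre
    value should lie on or above the line between its neighbours, tolerating a few
    dips as long as their total area stays below a threshold."""
    below, area = 0, 0.0
    for a, b, c in zip(values, values[1:], values[2:]):
        line = (a + c) / 2.0          # the straight line between a and c, at the centre point
        if b < line:
            below += 1
            area += line - b
    return below <= max_below and area <= max_area

def last_hull(intensities, min_len=3):
    """Starting from the end of the speech stretch (the units just before the
    silence), grow a candidate hull leftwards for as long as it still qualifies."""
    end = len(intensities)
    start, best = end - min_len, None
    while start >= 0 and is_hull(intensities[start:end]):
        best = (start, end)
        start -= 1
    return best   # indices of the psyllable-like hull, or None if none was found
```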

3.8 Classification
The normalised pitch, intensity, and voicing data extracted by /nailon/ over a psyllable are intended for classification of intonation patterns. Each silence-preceding hull is classified as HIGH, MID, or LOW depending on whether the pitch value is in the upper, mid or lower third of the speaker's F0 range as described by the mean and standard deviation, and as RISE, FALL, or LEVEL depending on the shape of the intonation pattern. Previous work has shown that the prosodic information provided by /nailon/ can be used to improve interaction control in spoken human-computer dialogue compared to systems relying exclusively on silence duration thresholds (Edlund & Heldner, 2005).
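The classification could be expressed as below, operating on pitch values that have been normalised as in section 3.5. The paper does not give the exact boundaries of the thirds or the criterion for the rise/fall/level distinction; treating the speaker's range as the mean plus/minus two standard deviations and comparing the first and last voiced values of the hull are assumptions made for the illustration.

```python
def classify_level(z, range_sd=2.0):
    """HIGH / MID / LOW from a normalised pitch value (in standard deviations)."""
    third = 2.0 * range_sd / 3.0
    if z > range_sd - third:        # upper third of the assumed range
        return "HIGH"
    if z < third - range_sd:        # lower third
        return "LOW"
    return "MID"

def classify_shape(z_values, level_band=0.2):
    """RISE / FALL / LEVEL from the pitch change over the hull."""
    voiced = [z for z in z_values if z is not None]
    if len(voiced) < 2:
        return "LEVEL"
    delta = voiced[-1] - voiced[0]
    if delta > level_band:
        return "RISE"
    if delta < -level_band:
        return "FALL"
    return "LEVEL"
```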

4 Discussion
In this paper, we have presented /nailon/, an online, real-time software package for prosodic analysis capturing a number of prosodic features liable to be relevant for interaction control. Future work will include further development of /nailon/ in terms of improving existing algorithms – in particular the intonation pattern classification – as well as adding new prosodic features. For example, we are considering evaluating the duration of psyllables as an estimate of final lengthening or speaking rate effects, and using intensity measures to capture the different qualities of silent pauses resulting from different vocal tract configurations (Local & Kelly, 1986).

Acknowledgements
This work was carried out within the CHIL project. CHIL is an Integrated Project under the European Commission's Sixth Framework Program (IP-506909).

References
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62, 215-226.
Ferrer, L., E. Shriberg & A. Stolcke, 2002. Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody in human-computer dialog. Proceedings ICSLP 2002, Denver, 2061-2064.
Local, J.K. & J. Kelly, 1986. Projection and ‘silences’: Notes on phonetic and conversational structure. Human Studies 9, 185-204.
Mermelstein, P., 1975. Automatic segmentation of speech into syllabic units. Journal of the Acoustical Society of America 58, 880-883.
Shriberg, E. & A. Stolcke, 2004. Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing. Proceedings Speech Prosody 2004, Nara, 575-582.
Ward, N. & W. Tsukahara, 2000. Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics 32, 1177-1207.

Feedback from Real & Virtual Language Teachers

Olov Engwall Centre for Speech Technology, KTH [email protected]

Abstract
Virtual tutors, animated talking heads giving the student computerized training of a foreign language, may be a very important tool in language learning, provided that the feedback given to the student is pedagogically sound and effective. In order to set up criteria for good feedback from a virtual tutor, human language teacher feedback has been explored through interviews with teachers and students, and classroom observations. The criteria are presented together with an implementation of some of them in the articulation tutor ARTUR.

1 Introduction
Computer assisted pronunciation training (CAPT) may contribute significantly to second language learning, as it gives the students access to private training sessions, without time constraints or the embarrassment of making errors in front of others. The success of CAPT is nevertheless still limited. One reason is that the detection of mispronunciations is error-prone and that this leads to confusing feedback, but Neri et al. (2002) argue that successful CAPT is already possible, as the main flaw lies in the lack of pedagogy in existing CAPT software rather than in technological shortcomings. They conclude that if only the learners’ needs, rather than technological possibilities, are put into focus during system development, pedagogically sound CAPT could be created with available technology. One attempt to answer this pedagogical need is to create virtual tutors, computer programs where talking head models interact as human language teachers. An example of this is ARTUR – the ARticulation TUtoR (Bälter et al., 2005), who gives detailed audiovisual instructions and articulatory feedback. Refer to www.speech.kth.se/multimodal/ARTUR for a video presenting the project. In such a virtual tutor system it becomes important not only to improve the pedagogy of the given feedback, but to do it in such a way that it resembles human feedback, in order to benefit from the social process of learning. To test the usability of the system at an early stage, we are conducting Wizard of Oz studies, in which a human judge detects the mispronunciations, diagnoses the cause and chooses what feedback ARTUR should give from a set of pre-generated audiovisual instructions (Bälter et al., 2005). The children practicing with ARTUR did indeed like it, but the feedback was sometimes inadequate, e.g. when the child repeated the same error several times; when the error was of the same type as before, but the pronunciation had been improved; or when the student started to lose motivation because the virtual tutor's feedback was too detailed. One conclusion was hence that more varied feedback was needed in order to be efficient. The aim of this study is to investigate how the feedback of the virtual tutor could be improved by studying feedback strategies of human language teachers in pronunciation training and to assess which of them could be used in ARTUR. Interviews with language teachers and students, and classroom observations, were used to explore when feedback should be given, how to indicate an error, which errors should be corrected, and how to promote student motivation.

2 Feedback in pronunciation training
Lyster & Ranta (1997) classified feedback given by language teachers as:
1. Explicit correction: the teacher clearly states that what the student said was incorrect and gives the correct form, e.g. as “You should say: ...”
2. Recasts: the teacher reformulates the student's utterance, removing the error.
3. Repetition: the teacher repeats the student utterance with the error, using the intonation to indicate the error. Repetitions may also be used as positive feedback on a correct utterance.
4. Clarification requests: urging the student to reformulate the utterance.
5. Metalinguistic feedback: information or questions about an error used to make the students reflect upon and find the error themselves using the provided information.
6. Elicitation: encouraging students to provide the correct pronunciation, by open-ended questions or fill-in-the-gap utterances.
Recasts were by far the most common type, but learners often perceive recasts as another way to say the same thing rather than as a correction (Mackey & Philip, 1998). Carroll & Swain (1993) found that all groups receiving feedback, explicit or implicit, improved significantly more than the control group, but the group given explicit feedback outperformed the others. As explicit feedback may be intrusive and affect student self-confidence if given too frequently, it is however not evident that it should always be used.

3 Data collection
Six language teachers participated in the study, four in a focus group and two in individual interviews using a semi-structured protocol (Rubin, 1994) with open-ended questions. Five students were interviewed, three of them in a focus group and two individually. The teacher and student groups were intentionally heterogeneous with respect to target language and student level, in order to capture general pedagogical strategies. Classroom observations were made in three beginner level courses, where the languages taught were close to, moderately different from, and very different from Swedish, respectively.

4 Results
4.1 When should errors be corrected?
There was a large consensus among teachers and students about the importance of never interrupting the students' utterances, reading or discussions with feedback, even if it means that errors are left uncorrected. This strategy was also observed in the classrooms.

4.2 How should errors be corrected?
This section summarizes how the teachers (T) or students (S) described how feedback should be given, and feedback observed during classes (O).
1. Recasts were the most common feedback in the classroom and were also advocated by the students, as they considered that it was often enough to hear the correct pronunciation. Contrary to the finding by Mackey & Philip (1998) that recasts were not perceived as corrections, the students tried to repair after recasts (T, S, O).
2. Implicit (e.g. “Sorry?”) and explicit (e.g. “Could you repeat that?”) elicitation for the student to self-correct was used frequently (O).
3. Increasing feedback. One teacher described a strategy going from minimal implicit feedback towards more explicit feedback, when required. In the most minimal form, the teacher indicates that an error was produced by a questioning look or an attention-catching sound, giving the students the opportunity to identify and self-correct the error. If the student is unable to repair, a recast would be used. If needed, the recast would be repeated again (turning it into an explicit correction). The last step would be an explicit explanation of the difference between the correct and erroneous pronunciation (T).
4. Articulatory instructions. Several teachers thought that formal descriptions and sketches of place of articulation are of little use, since the students are unaccustomed to thinking about how to produce different sounds. Some teachers did, however, use articulatory instructions, and one student specifically requested this type of feedback (T, S, O).
5. Sensory feedback, e.g. letting the students place their hands on their neck to feel the vibration of voiced sounds or in front of the mouth to feel aspiration (T, O).
6. Comparisons to Swedish phonemes, as an approximation or reminder (T, S, O).
7. Metalinguistic explanations, used to enforce the feedback or to motivate why it is important to get a particular pronunciation right (T).
8. General recommendations rather than feedback on particular errors, e.g. “You should try reading aloud by yourself at home”, to encourage additional training (T, O).
9. Contrasting repeat-recast, to illustrate the difference between the student utterance and the correct one, or between minimal pairs (T, S).

4.3 Which errors should be corrected?
The teachers ventured several criteria for which errors should be corrected:
1. Comprehensibility: if the utterance could not be correctly understood.
2. Intelligibility: if the utterance could not be understood without effort.
3. Frequency: if the student repeats the same (type of) error several times.
4. Social impact: if the listener gets a negative impression of a speaker making the error.
5. Proficiency: a student with a better overall pronunciation may get corrective feedback on an error for which a student with a less good pronunciation does not get one.
6. Generality: if the error is one that is often made in the L2 by foreign speakers.
7. Personality: a student who appreciates corrections receives more than one who does not.
8. Commonality: an error that is common among native speakers of the L2 language is regarded as less grave than such errors that a native speaker would never make.
9. Exercise focus: feedback is primarily given on the feature targeted by the exercise.

None of the students thought that all errors should be corrected, only the “worst”. When probed further, the general opinion was that this signified mispronunciations that lead to misunderstandings or deteriorated communication. Other criteria stated were if the error affected the listener’s view of the speaker negatively, or if it was a repeated error. Apart from this, the students thought that it should depend on the student’s ambition. These opinions hence correspond to the first five criteria given by the teachers. In the classes, the amount and type of feedback given depended on the type of exercise (practicing one word, reading texts, speaking freely), the L2 language (for the L2 language that was most different from Swedish, significantly more detailed feedback was given), generality (errors that several students made were given more emphasis) and proficiency.

4.4 Motivation
To avoid negative feelings about feedback, the teachers or students suggested:
1. Adapt the feedback to the students' self-confidence (criteria 5 & 7 in section 4.3).
2. Make explicit corrections impersonal, by expanding to a general error and using “When one says...” rather than “When you say...”.
3. Insert non-problematic pronunciations among the more difficult ones.
4. Acknowledge difficulties (e.g. “Yes, this is a tricky pronunciation”).
5. Never get stuck on the same pronunciation too long.
6. Promote the students' willingness to speak, by making the student feel that the teacher is interested in what the student has to say and not only in how it is said.
7. Provide positive feedback when the student has made an effort or when progress is made.
8. Adapt to the exercise. Use explicit feedback sparingly if implicit feedback is enough.
9. Give feedback only on the focus of the session. If other pronunciation problems are discovered, these should be left uncorrected, but noted and addressed in another session.

5 Feedback management in ARTUR
Some aspects of the feedback strategies proposed above have been implemented in a Wizard-of-Oz version of ARTUR that will be demonstrated at the conference. The focus of the exercise is to teach speakers of English the pronunciation of the Swedish sound “sj”, using the tongue twister “Sju själviska sjuksköterskor stjäl schyst champagne”. The feedback consisted of instructions and animations on how to position the tongue, showing and explaining the difference between the user's pronunciation and the correct one. The user could further listen to his/her previous attempt to compare it with the target. One new feature is that each user can individually control the amount of feedback given. The first reason for this is affective: students should be able to choose a level that they are comfortable with. The second is that this puts the responsibility and initiative with the student, who can decide how much advice he or she requires from the tutor. A second new feature is that several feedback categories have been added to the standard positive (for a correct pronunciation) and corrective (incorrect): minimal (correct pronunciation, only implicit positive feedback given, in order not to interrupt the flow of the training), satisfactory (the pronunciation is not entirely correct, but it is pedagogically sounder to accept it and move ahead), augmented (for a repeated error, more detailed feedback is given), vague (a general hint is given, rather than explicit feedback) and encouragement (encouraging the student and asking for a new try). The two latter categories may be used when the system is uncertain of the error, when the error does not fit the predefined mispronunciation categories, or when more explicit feedback is pedagogically unsound.
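As a rough illustration of how these categories relate to the state of the exercise, the sketch below maps a handful of judgements onto the category names introduced above. In the current system this choice is made by a human wizard, so the flags and their priority order are assumptions, not a description of ARTUR's implementation.

```python
def choose_feedback(correct, improved, repeated_error,
                    confident_diagnosis, explicit_ok, keep_flow=True):
    """Map a (hypothetical) judgement of the student's attempt to a feedback category."""
    if correct:
        # implicit positive feedback only, so as not to interrupt the flow of the training
        return "minimal" if keep_flow else "positive"
    if improved:
        return "satisfactory"      # accept a not-quite-correct attempt and move ahead
    if not confident_diagnosis:
        return "encouragement"     # uncertain error: ask for a new try
    if not explicit_ok:
        return "vague"             # explicit feedback currently pedagogically unsound
    return "augmented" if repeated_error else "corrective"
```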

Acknowledgements
This research is carried out within the ARTUR project, funded by the Swedish research council. The Centre for Speech Technology is supported by VINNOVA (The Swedish Agency for Innovation Systems), KTH and participating Swedish companies and organizations. The author would like to thank the participating teachers and students.

References
Bälter, O., O. Engwall, A-M. Öster & H. Kjellström, 2005. Wizard-of-Oz test of ARTUR – a computer-based speech training system with articulation correction. Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility, 36–43.
Carroll, S. & M. Swain, 1993. Explicit and implicit negative feedback: An empirical study of the learning of linguistic generalizations. Studies in Second Lang. Acquisition 15, 357–386.
Lyster, R. & L. Ranta, 1997. Corrective feedback and learner uptake. Studies in Second Lang. Acquisition 20, 37–66.
Mackey, A. & J. Philip, 1998. Conversational interaction and second language development: Recasts, responses, and red herrings? Modern Language Journal 82, 338–356.
Neri, A., C. Cucchiarini & H. Strik, 2002. Feedback in computer assisted pronunciation training: When technology meets pedagogy. Proceedings of CALL professionals and the future of CALL research, 179–188.
Rubin, J. (ed.), 1994. Handbook of Usability Testing. New York: John Wiley & Sons Inc.

Directional Hearing in a Humanoid Robot
Evaluation of Microphones Regarding HRTF and Azimuthal Dependence

Lisa Gustavsson, Ellen Marklund, Eeva Klintfors, and Francisco Lacerda Department of Linguistics/Phonetics, Stockholm University {lisag|ellen|eevak|frasse}@ling.su.se

Abstract
As a first step of implementing directional hearing in a humanoid robot, two types of microphones were evaluated regarding HRTF (head related transfer function) and azimuthal dependence. The sound level difference between a signal from the right ear and the left ear is one of the cues humans use to localize a sound source. In the same way, this process could be applied in robotics, where the sound level difference between a signal from the right microphone and the left microphone is calculated for orienting towards a sound source. The microphones were attached as ears on the robot-head and tested regarding frequency response with logarithmic sweep-tones at azimuth angles in 45º increments around the head. The directional type of microphone was more sensitive to azimuth and head shadow and is probably more suitable for directional hearing in the robot.

1 Introduction
As part of the CONTACT project¹ a microphone evaluation regarding head related transfer function (HRTF) and azimuthal² dependence was carried out as a first step in implementing directional hearing in a humanoid robot (see Figure 1). Sound pressure level at the robot ears (microphones) as a function of frequency and azimuth in the horizontal plane was studied. The hearing system in humans has many features that together enable fairly good spatial perception of sound, such as timing differences between the left and right ear in the arrival of a signal (interaural time difference), the cavities of the pinnae that enhance certain frequencies depending on direction, and the neural processing of these two perceived signals (Pickles, 1988). The shape of the outer ears is indeed of great importance in localization of a sound source, but as a first step of implementing directional hearing in a robot, we want to start by investigating the effect of a spherical head shape between the two microphones and the angle in relation to the sound source. This study was therefore done with reference to the interaural level difference (ILD)³ between two ears (microphones, no outer ears) in the sound signal that is caused by the distance between the ears and HRTF or head shadowing effects (Gelfand, 1998). This means that the ear furthest away from the sound source will to some extent be blocked by the head, in such a way that the shorter wavelengths (higher frequencies) are reflected by the head (Feddersen et al., 1957). Such frequency-dependent differences in intensity associated with different sound source locations will be used as an indication to the robot to turn its head in the horizontal plane. The principle here is to make the robot look in the direction that minimizes the ILD⁴. Two types of microphones, mounted on the robot head, were tested regarding frequency response at azimuth angles in 45º increments from the sound source (Shaw & Vaillancourt, 1985; Shaw, 1974). The study reported in this paper was carried out by the CONTACT vision group (Computer Vision and Robotics Lab, IST Lisbon) and the CONTACT speech group (Phonetics Lab, Stockholm University), assisted by Hassan Djamshidpey and Peter Branderud. The tests were performed in the anechoic chamber at Stockholm University in December 2005.

¹ "Learning and development of Contextual Action", European Union NEST project 5010.
² Azimuth = angles around the head.
³ The abbreviation IID can also be found in the literature and stands for Interaural Intensity Difference.
⁴ This is done using a perturbation technique. The robot's head orientation is incrementally changed in order to detect the direction associated with a minimum of ILD.
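The ILD principle and the perturbation technique mentioned in footnote 4 amount to very little code. The sketch below is only meant to make the arithmetic concrete: measure_ild stands in for turning the head and measuring, and the step size is an arbitrary illustrative choice.

```python
import math

def ild_db(rms_left, rms_right):
    """Interaural level difference in dB between the two microphone signals
    (positive when the left ear is louder)."""
    return 20.0 * math.log10(rms_left / rms_right)

def perturbation_step(measure_ild, angle_deg, step_deg=5.0):
    """Nudge the head orientation in both directions and keep the angle that
    gives the smallest |ILD|; repeating this turns the robot towards the source."""
    candidates = [angle_deg - step_deg, angle_deg, angle_deg + step_deg]
    return min(candidates, key=lambda a: abs(measure_ild(a)))
```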

2 Method
The microphones evaluated in this study were wired Lavalier microphones of the Microflex MX100 model by Shure. These microphones were chosen because they are small electret condenser microphones designed for speech and vocal pickup. The two types tested were omnidirectional (360º) and directional (cardioid 130º). The frequency response is 50 to 17000 Hz, and the max SPL is 116 dB (omnidirectional) and 124 dB (directional), with an s/n ratio of 73 dB (omnidirectional) and 66 dB (directional). The robotic head was developed at the Computer Vision and Robotics Lab, IST Lisbon (Beira et al., 2006).

Figure 1. Robot head.

2.1 Setup
The experimental setup is illustrated in Figure 2a. The robot-head is attached to a couple of ball bearing arms (imagined to correspond to a neck) on a box (containing the motor for driving head and neck movements). The microphones were attached, and tilted by about 30 degrees towards the front, with play-dough in the holes made in the skull for the future ears of the robot. The wires run through the head and out to the external amplifier. The sound source was a Brüel&Kjær 4215 Artificial Voice Loud Speaker, located 90 cm away from the head in the horizontal plane (Figures 2a and 2b). A reference microphone was connected to the loudspeaker for audio level compression (300 dB/sec).

Figures 2a and 2b. a) Wiring diagram of the experimental setup (left). b) Azimuth angles in relation to robot head and loudspeaker (right).

2.2 Test
Sweep-tones were presented through the loudspeaker at azimuth angles in 45º increments, obtained by turning the robot-head (Figure 2b). The frequency range of the tone was between 20 Hz⁵ and 20 kHz, with a logarithmic sweep control and a writing speed of 160 mm/sec (approximate averaging time 0.03 sec). The signal response of the microphones was registered and printed in dB/Hz diagrams (using a Brüel&Kjær 2307 Printer/Level Recorder), and a back-up recording was made on a hard drive. The dB values as a function of frequency were also plotted in Excel diagrams for a better overview of superimposed curves of different azimuths (and for presentation in this paper).

⁵ Because the compression was unstable up until about 200 Hz, the data below 200 Hz will not be reported here. Furthermore, the lower frequencies are not affected that much in terms of ILD.

3 Results
The best overall frequency response of both microphones was at angles 0º, -45º and -90º, that is, when the (right) microphone is to some extent directed towards the sound source. The sound level decreases as the microphone is turned away from the direction of the sound source. The omnidirectional microphones have an overall more robust frequency response than the directional microphones. As expected, the difference in sound level between the azimuth angles is most significant at higher frequencies, since the head as a blockage will have a larger impact on shorter wavelengths than on longer wavelengths. An example of ILD for the directional microphones is shown in Figure 3, where sound level as a function of frequency is plotted for the ear near the sound source and the ear furthest away from the sound source at azimuths 45º, 90º and 135º. While the difference ideally should be zero⁶ at azimuth 0º, it is well above 15 dB at many higher frequencies at azimuths 45º, 90º and 135º.


Figure 3. Signal response of the directional microphones. Sound level as a function of frequency and azimuth (45º, 90º and 135º) for the ear near the sound source and the ear furthest away from the sound source.

⁶ A zero difference in sound level at all frequencies between the two ears requires that the physical surroundings on both sides of the head are absolutely equal.

4 Discussion
In line with the technical description of the microphones, our results show that the directional microphones are more sensitive to azimuth than the omnidirectional microphones and will probably make the implementation of sound source localization easier. Also, disturbing sound from motors and fans inside the robot's head might be picked up more easily by an omnidirectional microphone. A directional type of microphone would therefore be our choice of ears for the robot. However, decisions like this are not made without some hesitation, since we do not want to manipulate the signal response in the robot hearing mechanism beyond what we find motivated in terms of the human physiology of hearing. Deciding upon what kind of pickup angle the microphones should have forces us to consider what implications a narrow versus a wide pickup angle will have in further implementations of the robotic hearing. At this moment we see no problems with a narrow angle, but if problems arise we can of course switch to wide-angle cartridges. The reasoning in this study holds for locating a sound source only to a certain extent. By calculating the ILD the robot will be able to orient towards a sound source in the frontal horizontal plane. But if the sound source is located straight behind the robot, the ILD would also equal zero, and according to the robot's calculations it would then be facing the sound source. Such front-back errors are in fact seen also in humans, since there are no physiological attributes of the ear that in a straightforward manner differentiate signals from the front and rear. Many animals have the ability to localize a sound source by wiggling their ears; humans can instead move themselves or their head to explore the sound source direction (Wightman & Kistler, 1999). As mentioned earlier, the outer ear is however of great importance for locating a sound source; the shape of the pinnae does enhance sound from the front in certain ways, but it takes practice to make use of such cues. In the same way, the shape of the pinnae can be of importance for locating sound sources in the medial plane (Gardner & Gardner, 1973; Musicant & Butler, 1984). Subtle movements of the head, experience of sound reflections in different acoustic settings and learning how to use pinna-related cues are some solutions to the front-back-up-down ambiguity that could be adopted also by the robot. We should not forget, though, that humans always use multiple sources of information for on-line problem solving, and this is most probably the case also when locating sound sources. When we hear a sound there is usually an event or an object that caused that sound, a sound source that we easily could spot with our eyes. So the next question we need to ask is how important vision is in localizing sound sources, or in the process of learning how to trace sound sources with our ears, and how vision can be used in the implementation of directional hearing in the robot.

5 Concluding remarks
Directional hearing is only one of the many aspects of human information processing that we have to consider when mimicking human behaviour in an embodied robot system. In this paper we have discussed how the head has an impact on the intensity of signals at different frequencies and how this principle can be used also for sound source localization in robotics. The signal responses of two types of microphones were tested regarding HRTF at different azimuths as a first step of implementing directional hearing in a humanoid robot. The next steps are designing outer ears and formalizing the processes of directional hearing for implementation and on-line evaluations (Hörnstein et al., 2006).

References
Beira, R., M. Lopes, C. Miguel, J. Santos-Victor, A. Bernardino, G. Metta et al., in press. Design of the robot-cub (icub) head. IEEE ICRA.
Feddersen, W.E., T.T. Sandel, D.C. Teas & L.A. Jeffress, 1957. Localization of High-Frequency Tones. Journal of the Acoustical Society of America 29, 988-991.
Gardner, M.B. & R.S. Gardner, 1973. Problem of localization in the median plane: effect of pinnae cavity occlusion. Journal of the Acoustical Society of America 53, 400-408.
Gelfand, S., 1998. An introduction to psychological and physiological acoustics. New York: Marcel Dekker, Inc.
Hörnstein, J., M. Lopes & J. Santos-Victor, 2006. Sound localization for humanoid robots – building audio-motor maps based on the HRTF. CONTACT project report.
Musicant, A.D. & R.A. Butler, 1984. The influence of pinnae-based spectral cues on sound localization. Journal of the Acoustical Society of America 75, 1195-1200.
Pickles, J.O., 1988. An Introduction to the Physiology of Hearing. (Second ed.) London: Academic Press.
Shaw, E.A.G., 1974. Transformation of sound pressure level from the free field to the eardrum in the horizontal plane. Journal of the Acoustical Society of America 56.
Shaw, E.A.G. & M.M. Vaillancourt, 1985. Transformation of sound-pressure level from the free field to the eardrum presented in numerical form. Journal of the Acoustical Society of America 78, 1120-1123.
Wightman, F.L. & D.J. Kistler, 1999. Resolution of front—back ambiguity in spatial hearing by listener and source movement. Journal of the Acoustical Society of America 105, 2841-2853.

Microphones and Measurements

Gert Foget Hansen¹ and Nicolai Pharao²
¹ Department of Dialectology, University of Copenhagen, [email protected]
² Centre for Language Change in Real Time, University of Copenhagen, [email protected]

Abstract
This paper presents the current status of an ongoing investigation of differences in formant estimates of vowels that may come about solely due to the circumstances of the recording of the speech material. The impact of the interplay between type and placement of microphone and room acoustics is to be examined for adult males and females across a number of vowel qualities. Furthermore, two estimation methods will be compared (LPC vs. manual). We present the pilot experiment that initiated the project along with a brief discussion of some relevant articles. The pilot experiment as well as the available results from other related experiments seem to indicate that different recording circumstances could induce apparent formant differences of a magnitude comparable to differences reported in some investigations of sound change.

1 Introduction
1.1 Purpose
The study reported here arose from a request to evaluate different types of recording equipment for the LANCHART Project, a longitudinal study of language change with Danish as an example. One aim of the assignment was to ensure that the LANCHART corpus would be suitable for certain acoustic phonetic investigations.

1.2 Pilot experiments – choosing suitable microphones for on-location recordings
Head-mounted microphones were compared to the performance of a lapel-worn microphone and a full-size directional microphone placed in a microphone stand in front of the speaker (hereafter referred to as a studio microphone). The following four factors were considered in the evaluation of the suitability of the recordings provided by the microphones: 1) ease of transcription and 2) segmentation of the recordings, as well as estimation of 3) fundamental frequency and 4) formants using LPC analysis. Simultaneous recordings of one speaker using all three types of microphones formed the basis for a pilot investigation. Primarily, the results indicated that the lapel-worn microphone was clearly inferior to the other two types with regard to the first 3 criteria, since it is more prone to pick up background noise. The head-mounted and studio microphones also showed some differences with regard to these 3 criteria; in particular, the recordings made with the head-mounted microphone provided clearer spectrograms. Furthermore, apparent differences emerged in the LPC analysis of the vowels in the three recordings. To explore this further we recorded 6 different pairs of microphone and distance combinations using a two-channel hard disk recorder. The microphones compared were a Sennheiser ME64, a Sennheiser MKE2 lavalier and a VT600 headset microphone, positioned either as indicated by type, or as typical for the ME64 (i.e. at a distance of about 30 cm).

One speaker producing various sustained vowels was recorded, and subsequently we measured the formant values at the same 3 randomly chosen points in each vowel in the two channels and compared the values. Of course we expected some random variation, but our naïve intuition was that if a formant value would for some reason bounce upwards in one recording, it should also do so in a synchronous recording made with a different microphone set-up. We were wrong. In fact, vowels of all heights and tongue positions seemed to exhibit quite dramatic differences, but the differences appeared to be more prominent for some vowels. Some of these differences are likely to be the result of mistracings of formant values in one or both channels, but some of the large differences were found for high front non-rounded vowels like [i] and [e], where first and second formants are not often confused. Furthermore, when we compared average values of the three points in each vowel, 37 out of 252 values differed by between 5 and 10%, while 31 differed by more than 10%. Closer inspection of the two channels revealed that a substantial number of the differences could not simply be attributed to spurious values, but were indeed the result of the LPC algorithm producing consistently different results, although the average differences were of a smaller magnitude. Since all other factors were held constant in these pairwise comparisons, the apparent differences could only be an effect of the type and placement of the microphone. The question remains which recording to trust.

2 Previous investigations
In previous investigations of the usability of portable recording equipment for phonetic investigations, and of the reliability of LPC-based formant measurements made on such recordings, the main focus seems to have been on the recording devices – notably the consequences of using digital recorders that employ some sort of psychoacoustic encoding, such as MiniDisc and mp3 recorders – rather than on the role of the microphone used. Below is a brief summary of the articles we have found which deal with the influence the microphone exerts. Though the goal of van Son (2002) is also to investigate the consequences of using audio compression, interestingly, van Son uses the difference in estimation values that results from switching from one particular microphone to another as a yardstick against which the errors introduced by the compression algorithms are compared. Comparing a Sennheiser MKH105 condenser microphone against a Shure SM10A dynamic headset microphone, he finds differences between the two recordings larger than 9 semitones (considered “jumps”) in slightly less than 4% of the estimates of F1 and about 2% for F2. When these jumps are excluded, the remaining measurements show an increased RMS error of about 1.2 to 1.7 semitones as a result of switching microphones. Unfortunately it is not possible to see the values for the individual vowel qualities. Plichta (2004) also examines formant estimates of vowels from three simultaneous recordings. Comparing three combinations of microphone and recording equipment (thereby not separating characteristics of the microphones and the recording equipment), he shows significant differences in F1 values and bandwidths between all three recording conditions. His material is limited to non-high non-rounded front vowels, plus the diphthong [ai]. Thus there is evidence that recordings made with different microphones (and/or recording equipment differing in other respects) can lead to significantly different formant estimates. Apart from these investigations, there are two studies of the spectral consequences of differences in microphone placement by the acoustician Eddy Bøgh Brixen (1996; 1998) which are of particular relevance to our investigation. He provides evidence that the placement of the microphone relative to the speaker can in and of itself lead to substantial differences in the recorded power spectrum, notably when microphones are placed very close to (or on) the body or head of the speaker, as is the case with lavalier and headband microphones.

3 The experiment
3.1 Research questions
As we have seen, recordings will be affected by a number of factors which interact in complex ways, making for a source of error of unknown impact on formant estimates. Now the interesting question is: how big is the problem? Is it large enough to have practical consequences for the use of LPC-based formant estimation as an analysis tool? This overall question led us to these research questions:
a) How accurate can LPC-based formant estimates be expected to be?
b) How much does the microphone and its placement contribute to the inaccuracy?
c) How much does the room contribute to the inaccuracy?
d) Is this only a concern for LPC-based formant estimates, or are estimates made by hand also affected?

3.2 Experimental design
As an attempt to answer these questions, a more comprehensive experiment was designed. It seems to us that what we need is some sort of neutral reference recording and knowledge about the consequences for formant estimation as we deviate from this ideal. Thus we planned to compare formant estimates of recordings made in four locations with very different acoustic characteristics, using four different microphones simultaneously. In total the recorded material covers 4 microphones (see Table 1 below), 4 locations (anechoic chamber, recording studio, and two private rooms), and 2 male and 2 female adult speakers. The subjects read short sentences producing 6-18 renditions of 8 vowel qualities at each location. In addition, 6 repetitions of sustained vowels with f0-sweeps of 6 vowel qualities have been recorded in the recording studio and in the anechoic chamber by four speakers. These were meant to facilitate a more accurate manual estimation of the formant values. All material was recorded using synchronized Sound Device SD722 hard disk recorders at 24 bit/48 kHz.

Table 1. Microphones compared and their position relative to the subjects
Microphone           Position relative to speaker                   Directional sensitivity
Brüel & Kjær 4179    80 cm directly in front of speaker’s mouth     omnidirectional
Sennheiser MKH40     40 cm at a 45 degree angle                     cardioid
DPA 4066             2 cm from corner of mouth, head worn           omnidirectional
VT 700               2 cm from corner of mouth, head worn           omnidirectional

We would suggest using the B&K 4179 with a (certified) flat frequency response in the anechoic chamber at a distance of 80 cm as the reference. The distance is perhaps somewhat arbitrary, but it appears from Brixen (1998) that the effect of changing the distance decreases rapidly as the distance increases. On-axis, the spectrum at 80 cm deviates less than ±2 dB from the spectrum at 1 m.

4 Current status and preliminary results
All planned recordings have been made, and the analysis phase has commenced. We have started with the sustained vowels, as they should be the simplest to analyse (since there are no transitions to be aware of) and as they are also the most suitable for manual inspection. Two PRAAT scripts have been produced for the analysis. One is a formant analysis tool that enables simultaneous analysis of the four recordings, to ensure that measurements are made at points that – as far as possible – provide trustworthy formant values for all recordings. The other is a script which, by tracing the intensity variation in each partial as the f0 changes, can be used to determine when a given partial crosses a formant. By measuring f0 at this point and counting the number of the partial, we are able to estimate the formant frequency. We assume that this approach will be more accurate than judging the formant frequencies by visual inspection alone. It is obvious that the “f0-sweep” approach we use to determine formant values manually is not without flaws, as we are relying heavily on a number of assumptions: First, we expect our speakers to be able to produce the same vowel quality independent of pitch. As vowel quality and pitch are known to be interrelated in real speech, it may both be difficult for our speakers to live up to this expectation, and difficult for us to verify auditorily whether they do. Even if the speakers succeed in ‘freezing’ the oral cavities during the sweep, differences may arise due to movement of the larynx as the pitch is changed, as well as due to changes in voice quality associated with the pitch. Notably, the voice often seemed to get more breathy and hypofunctional towards the lower end of the pitch range. The method of determining the time of the maximum energy for the partial may also be affected by overall changes in intensity that have nothing to do with the interaction between the partial and the formant. This would mostly affect estimates of F1, as the transition of partials through higher formants happens much faster, and since there are often more partials crossing through the formant, thus giving more estimates. Finally, the accuracy of course depends on the accuracy of the f0 tracing, and more so the higher the partial. Despite the potential shortcomings of the method, it does seem to provide reliable results, and it is particularly helpful in determining the formant frequencies in the lower region of the spectrum. Our ongoing analyses of the data have so far only confirmed the usefulness of carrying out the larger investigation. We hope to be able to ensure that our colleagues at the LANCHART Project need not end up reporting as sound changes what might merely be the results of microphone changes...
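The core arithmetic of the f0-sweep method is simple, as the sketch below shows: the formant estimate is the partial number times the f0 at the moment the partial's intensity peaks. Alignment, smoothing and the caveats discussed above are deliberately left out; the function name and inputs are ours, not those of the PRAAT scripts.

```python
def formant_from_partial_sweep(f0_track, partial_intensity, partial_number):
    """Estimate a formant frequency from a sustained vowel with an f0 sweep.

    `f0_track` and `partial_intensity` are equally long, time-aligned sequences;
    the time at which the partial is loudest is taken as the moment it crosses
    the formant, so the formant is approximately partial_number * f0 at that time.
    """
    peak_index = max(range(len(partial_intensity)), key=lambda i: partial_intensity[i])
    return partial_number * f0_track[peak_index]
```

For example, if the 4th partial reaches its intensity maximum while f0 is 130 Hz, the formant estimate is 4 × 130 = 520 Hz.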

References
Brixen, E.B., 1996. Spectral Degradation of Speech Captured by Miniature Microphones Caused by the Microphone Placing on Persons’ Head and Chest. Proceedings AES 100th Convention.
Brixen, E.B., 1998. Near Field Registration of the Human Voice: Spectral Changes due to Positions. Proceedings AES 104th Convention.
Plichta, B., 2004. Data acquisition problems. In B. Plichta, Signal acquisition and acoustic analysis of speech. Available at: http://bartus.org/akustyk/signal_aquisition.pdf.
van Son, R.J.J., 2002. Can standard analysis tools be used on decompressed speech? Available at: http://www.fon.hum.uva.nl/Service/IFAcorpus/SLcorpus/AdditionalDocuments/CoCOSDA2002.pdf.

Prosodic Cues for Interaction Control in Spoken Dialogue Systems

Mattias Heldner and Jens Edlund Department of Speech, Music and Hearing, KTH, Stockholm {mattias|edlund}@speech.kth.se

Abstract
This paper discusses the feasibility of using prosodic features for interaction control in spoken dialogue systems, and points to experimental evidence that automatically extracted prosodic features can be used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system response times.

1 Introduction
All spoken dialogue systems, no matter what flavour they come in, need some kind of interaction control capabilities in order to identify places where it is legitimate to begin to talk to a human interlocutor, as well as to avoid interrupting the user. Most current systems rely exclusively on silence duration thresholds for making such interaction control decisions, with thresholds typically ranging from 500 to 2000 ms (Ferrer, Shriberg & Stolcke, 2002; Shriberg & Stolcke, 2004). Such an approach has several drawbacks, both from the point of view of the user and from that of the system. Users generally have to wait longer for responses than in human-human interactions; at the same time they run the risk of being interrupted by the system, since people frequently pause mid-speech, for example when hesitating or before semantically heavy words (Edlund & Heldner, 2005; Shriberg & Stolcke, 2004); and using silent pauses as the sole information for segmentation of user input is likely to impair the system's speech understanding, as unfinished or badly segmented utterances often are more difficult to interpret (Bell, Boye & Gustafson, 2001). Humans are very good at discriminating the places where their conversational partners have finished talking from those where they have not – accidental interruptions are rare in conversations. Apparently, we use a variety of information to do so, including numerous prosodic and gestural features, as well as higher levels of understanding, for example related to (in)completeness on a structural level (e.g. Duncan, 1972; Ford & Thompson, 1996; Local, Kelly & Wells, 1986). In light of this, the interaction control capabilities of spoken dialogue systems would likely benefit from access to more of this variety of information – more than just the duration of silent pauses. Ultimately, spoken dialogue systems should of course be able to combine all relevant and available sources of information for making interaction control decisions. Attempts have been made at using semantic information (Bell, Boye & Gustafson, 2001; Skantze & Edlund, 2004), prosodic information and in particular intonation patterns (Edlund & Heldner, 2005; Ferrer, Shriberg & Stolcke, 2002; Thórisson, 2002), and visual information (Thórisson, 2002) to deal with (among other things) the problems that occur as a result of interaction control decisions based on silence only.

2 Prosodic cues for interaction control
Previous work suggests that a number of prosodic or phonetic cues are liable to be relevant for interaction control in human-human dialogue. Ultimately, software for improving interaction control in practical applications should capture all relevant cues. The phenomena associated with turn-yielding include silent pauses, falling and rising intonation patterns, and certain vocal tract configurations such as exhalations (e.g. Duncan, 1972; Ford & Thompson, 1996; Local, Kelly & Wells, 1986). Turn-yielding cues are typically located somewhere towards the end of the contribution, although not necessarily on the final syllable. Granted that human turn-taking involves decisions above a reflex level, evidence suggests that turn-yielding cues must occur at least 200-300 ms before the onset of the next contribution (Ward, 2006; Wesseling & van Son, 2005).
The phenomena associated with turn-keeping include level intonation patterns, vocal tract configurations such as glottal or vocal tract stops without audible release, as well as a different quality of silent pauses as a result of these vocal tract closures (e.g. Caspers, 2003; Duncan, 1972; Local & Kelly, 1986). Turn-keeping cues are also located near the end of the contribution. As these cues are not intended to trigger a response, but rather to inhibit one, they may conceivably occur later than turn-yielding cues. There are also a number of cues (in addition to the silent pauses mentioned above) that have been observed to occur with turn-yielding as well as with turn-keeping. Examples of such cues include decreasing speaking rate and other lengthening patterns towards the end of contributions. The mere presence (or absence) of such cues cannot be used for making a turn-yielding vs. turn-keeping distinction, although the amount of final lengthening, for example, might provide valuable guidance for such a task (cf. Heldner & Megyesi, 2003).

3 Prosodic cues applied to interaction control
In previous work (Edlund & Heldner, 2005), we explored to what extent the prosodic features extracted with /nailon/ (Edlund & Heldner, forthcoming) could be used to mimic the interaction control behaviour in conversations among humans. Specifically, we analysed one of the interlocutors in order to predict the interaction control decisions made by the other person taking part in the conversation. These predictions were evaluated with respect to whether there was a speaker change or not at that point in the conversation, that is, with respect to what the interlocutors actually did. Each unit ending in a silent pause in the speech of the interlocutor being analysed was classified into one of three categories: turn-keeping, turn-yielding, and don't know. Units with low patterns were classified as suitable places for turn-taking (i.e. turn-yielding); mid and level patterns were classified as unsuitable places (i.e. turn-keeping); all other patterns, including high or rising, ended up in the garbage category don't know. This tentative classification scheme was based on observations reported in the literature (e.g. Caspers, 2003; Thórisson, 2002; Ward & Tsukahara, 2000), but it was in no way optimised or adapted to suit the speech material used. This experiment showed that interaction control based on extracted features avoided 84% of the places where a system using silence duration thresholds only would have interrupted its users, while still recognizing 40% of the places where it was suitable to say something (cf. Edlund & Heldner, 2005). Interaction control decisions using prosodic information can furthermore be made considerably faster than in silence-only systems. The decisions reported here were made after a 300-ms silence, to be compared with silences ranging from 500 to 2000 ms in typical silence-only systems (Ferrer, Shriberg & Stolcke, 2002).

4 Discussion
In this paper, we have discussed a number of prosodic features liable to be relevant for interaction control. We have shown that automatically extracted prosodic information can be used to improve the interaction control in spoken human-computer dialogue compared to systems relying exclusively on silence duration thresholds. Future work will include further development of the automatic extraction in terms of improving existing algorithms as well as adding new prosodic features. In a long-term perspective, we would want to combine prosodic information with other sources of information, such as semantic completeness and visual interaction control cues, as well as to relate interaction control to other conversation phenomena such as grounding, error handling, and initiative.
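For reference, the decision rule evaluated in section 3 above can be written out in a few lines; the pattern labels follow the classification reported in Edlund & Heldner (2005), while the explicit "wait" branch for silences shorter than 300 ms is our framing of the set-up rather than something stated in the text.

```python
def turn_taking_decision(pattern, silence_ms, min_silence_ms=300):
    """Classify a pause-ending unit as a place to take the turn or not."""
    if silence_ms < min_silence_ms:
        return "wait"                       # decision made only after a 300 ms silence
    if pattern == "LOW":
        return "turn-yielding"              # suitable place to start speaking
    if pattern in ("MID", "LEVEL"):
        return "turn-keeping"               # unsuitable place; do not interrupt
    return "don't know"                     # e.g. HIGH or RISE: no decision
```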

Acknowledgements
This work was carried out within the CHIL project. CHIL (Computers in the Human Interaction Loop) is an Integrated Project under the European Commission's Sixth Framework Program (IP-506909).

References
Bell, L., J. Boye & J. Gustafson, 2001. Real-time handling of fragmented utterances. Proceedings NAACL Workshop on Adaptation in Dialogue Systems, Carnegie Mellon University, Pittsburgh, PA, 2-8.
Caspers, J., 2003. Local speech melody as a limiting factor in the turn-taking system in Dutch. Journal of Phonetics 31, 251-276.
Duncan, S., Jr., 1972. Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology 23 (2), 283-292.
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62 (2-4), 215-226.
Edlund, J. & M. Heldner, forthcoming. /nailon/ – a tool for online analysis of prosody. Proceedings 9th International Conference on Spoken Language Processing (Interspeech 2006), Pittsburgh, PA.
Ferrer, L., E. Shriberg & A. Stolcke, 2002. Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody in human-computer dialog. Proceedings ICSLP 2002, Denver, Vol. 3, 2061-2064.
Ford, C.E. & S.A. Thompson, 1996. Interactional units in conversation: syntactic, intonational, and pragmatic resources for the management of turns. In E. Ochs, E.A. Schegloff & S.A. Thompson (eds.), Interaction and grammar. Cambridge: Cambridge University Press, 134-184.
Heldner, M. & B. Megyesi, 2003. Exploring the prosody-syntax interface in conversations. Proceedings ICPhS 2003, Barcelona, 2501-2504.
Local, J.K. & J. Kelly, 1986. Projection and 'silences': Notes on phonetic and conversational structure. Human Studies 9, 185-204.
Local, J.K., Kelly, J. & W.H.G. Wells, 1986. Towards a phonology of conversation: turn-taking in Tyneside English. Journal of Linguistics 22 (2), 411-437.
Shriberg, E. & A. Stolcke, 2004. Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing. Proceedings Speech Prosody 2004, Nara, 575-582.
Skantze, G. & J. Edlund, 2004. Robust interpretation in the Higgins spoken dialogue system. COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, Norwich.
Thórisson, K.R., 2002. Natural turn-taking needs no manual: Computational theory and model, from perception to action. In B. Granström, D. House & I. Karlsson (eds.), Multimodality in language and speech systems. Dordrecht: Kluwer Academic Publishers, 173-207.
Ward, N., 2006. Methods for discovering prosodic cues to turn-taking. Proceedings Speech Prosody 2006, Dresden.
Ward, N. & W. Tsukahara, 2000. Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics 32, 1177-1207.
Wesseling, W. & R.J.J.H. van Son, 2005. Early preparation of experimentally elicited minimal responses. Proceedings Sixth SIGdial Workshop on Discourse and Dialogue, ISCA, Lisbon, 11-18.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 57 Working Papers 52 (2006), 57–60

SMTC – A Swedish Map Task Corpus

Pétur Helgason Department of Linguistics and Philology, Uppsala University [email protected]

Abstract
A small database of high-quality recordings of 4 speakers of Central Standard Swedish is being made available to the speech research community under the heading Swedish Map Task Corpus (SMTC). The speech is unscripted and consists mostly of conversations elicited through map tasks. In total, the database contains approximately 50 minutes of word-labelled conversations, comprising nearly 8000 words. The material was recorded at the Stockholm University Phonetics Lab. This paper describes the recording method, the data elicitation procedures and the speakers recruited for the recordings. The data will be made available online to researchers who put in a request with the author (cf. section 7 below).

1 Introduction
The data being made available under the heading Swedish Map Task Corpus (SMTC) were originally recorded as part of the author's doctoral dissertation project (Helgason, 2002). The data have already proved useful for several other research projects, e.g. Megyesi (2002), Megyesi & Gustafson-Čapková (2002) and Edlund & Heldner (2005). As it seems likely that future projects will want to make use of the data, and the data are not described in much detail elsewhere, an account of the recording procedure and elicitation method is called for. At the same time, the data are made available for download to researchers.

2 Recording set-up
The data were recorded in the anechoic room at the Stockholm University Phonetics Lab. The subjects were placed facing away from one another at opposite corners of the room (see Figure 1). The "head-to-head" distance between the subjects was approximately two meters. The reason for this placement of the subjects was partly to minimize cross-channel interference, and partly to prevent them from consulting one another's maps (see the following section). The recording set-up was therefore in accordance with the nature of the data elicitation method.
The data were recorded using a Technics SV 260 A DAT recorder and two Sennheiser MKE2 microphones. Each microphone was mounted on a headset and placed in such a way that it extended approximately 2.5 cm out and to the side of the corner of the subject's mouth. The recording device and an experimenter were placed in between the subjects, within the anechoic room.

Figure 1. The placement of subjects and experimenter during the recording.

The subjects were recorded on separate channels. This was done in order to avoid an overlap between the subjects' channels when they were speaking simultaneously. The absorption of sound energy in the anechoic room proved to be quite effective. The difference in average RMS between speakers on a channel was approximately 40 dB. Thus, for example, the average RMS for the intended right-channel speaker was, on average, 40 dB higher than for the interfering (left-channel) speaker. This means that at normal listening levels (and provided the intended speaker is silent), the interfering speaker can be detected only as a faint background murmur.
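The reported 40 dB channel separation can be checked with a few lines of signal processing; the sketch below is not from the original study, the file name and time ranges are hypothetical, and any library that reads stereo WAV files would do in place of soundfile.

import numpy as np
import soundfile as sf

def rms_db(samples):
    # Average RMS level of a stretch of samples, in dB relative to full scale.
    return 20 * np.log10(np.sqrt(np.mean(samples ** 2)) + 1e-12)

audio, fs = sf.read("smtc_session.wav")       # one speaker per channel
right = audio[:, 1]

# One stretch where the right-channel (intended) speaker talks, and one where only
# the left-channel (interfering) speaker talks; the paper reports a difference of
# approximately 40 dB between such stretches.
intended = right[int(10.0 * fs):int(12.0 * fs)]
interfering = right[int(20.0 * fs):int(22.0 * fs)]
print(rms_db(intended) - rms_db(interfering))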

3 Data elicitation – the map tasks
Most of the data were elicited by having the subjects perform map tasks. Map tasks have previously been used successfully for eliciting unscripted spoken data, perhaps most notably in the HCRC Map Task Corpus (Anderson et al., 1991). A map task involves two participants, an instruction giver and an instruction follower (henceforth Giver and Follower). For each map task, the experimenter prepares two maps with a set of landmarks (symbols or drawings), and to a large extent the landmarks on the two maps are the same. However, some differences in landmarks are incorporated by design, so that the maps are not quite identical. The Giver's map has a predetermined route drawn on it; the Follower's map does not. Their task is to cooperate through dialogue, so that the route on the Giver's map is reproduced on the Follower's map. The Giver and Follower are not allowed to consult one another's maps. In the SMTC recordings, the subjects were told at the beginning of the task that the maps differed, but it was left up to them to discover the ways in which they differed.

Figure 2. One of the "treasure hunt" map pairs used for data elicitation. On the left is a Giver's map, and on the right is a Follower's map. The path on the Giver's map is reproduced in grey here, but when the subjects performed the tasks it was marked in green.

For the SMTC recordings, four map tasks were prepared, each consisting of a Giver and Follower map pair. An example of a map task is given in Figure 2. All the maps had a set of basic common features. They all depicted the same basic island shape, the contour of which had several easily recognizable bays and peninsulas. The island also had mountains and hills, as well as a lake and a river. Finally, each map had a simple compass rose in the bottom left corner.
Two of the map tasks had a "treasure hunt" theme. The landmarks on these maps included an anchor (the starting point), a key (an intermediate goal), and a treasure chest (the final goal). However, most of the symbols depicted animals (some of which were prehistoric) and vegetation.
The remaining two map tasks had a "tourist" theme. On these maps the landmarks consisted entirely of various symbols typically used in tourist brochures and maps. In order to familiarize the subjects with these symbols, they were asked to go through a list of such symbols with a view to deciding how to refer to them if they occurred in a map task. This interaction was recorded, and is included in the SMTC database under the heading "Symbol task". This symbol list is reproduced here in Figure 3. The subjects' goal in the tourist maps was to trace a predetermined route around the island from an airport and back to the same airport.

Figure 3. The symbol list used to familiarize the subjects with the symbols on the "tourist theme" maps.

4 The subjects
The subjects, one male and three females, were recruited from the staff at the Stockholm University Linguistics Department. They are referred to as FK, FT, FS (all female) and MP (male). FK and MP were in their thirties and FT and FS in their forties. All speakers had normal hearing. As regards dialect, all speakers identified themselves as speakers of Central Standard Swedish and had lived for most or all of their lives in or around Stockholm. They were paid a moderate fee for their participation.
The subjects were arranged in two pairs: FS and MP as one pair and FK and FT as the other. Each pair began the session by navigating through a "treasure hunt" map task. This was followed by a discussion of the symbol list. The pair then continued with a "tourist map" followed by another "treasure map", and finally, if time allowed, one more "tourist map".

5 The extent of the database
The data from both subject pairs comprise a total of approximately 50 minutes of conversation. (There exist additional map-task recordings of these as well as other subjects which await word-labelling, but these are not included in the present database.) This represents a total of 35 minutes of uninterrupted speech from the four subjects. (What is referred to here as uninterrupted speech is the total speaking time for a subject, excluding any and all pauses.) For FT, approximately 4.3 minutes of uninterrupted speech are available, comprising a total of 870 words; for FK 9.5 minutes comprising 2045 words; for MP 10.8 minutes comprising 2554 words; and for FS 10.3 minutes comprising 2401 words.
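For orientation, the figures above translate into roughly 200-240 words per minute of uninterrupted speech per subject; a trivial check, not part of the corpus documentation:

# Words per minute of uninterrupted speech, from the figures given in Section 5.
corpus = {"FT": (870, 4.3), "FK": (2045, 9.5), "MP": (2554, 10.8), "FS": (2401, 10.3)}
for speaker, (words, minutes) in corpus.items():
    print(speaker, round(words / minutes), "words per minute")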

6 Some remarks on the transliteration provided with the recordings
The data are provided with a word-level transliteration (word labelling). The transliteration was performed by the author, a non-native (albeit competent) speaker of Swedish. Researchers who wish to make use of the data may use this transliteration, for example as the basis for searches or as input to automatic text processing. Therefore, the rationale behind the transliteration conventions will be outlined here.
The aim of the transliteration was to facilitate lexical look-ups rather than to indicate or reflect the segmental content. For instance, the function word det is always indicated simply as "det" in the transliteration, without regard for any variability in its production (e.g. [det], [de], [d], [e] or [de]). This approach was also applied in the labelling of minimal responses and lexical fillers. For example, lexical fillers of the "eh" or "er" type are indicated with a semicolon (;) in the transliteration, irrespective of their segmental content (schwa-like, [e]-like, [œ]-like, creaky, nasalized, etc.).
A prominent feature of the transliteration is that contiguous pieces of speech (i.e. stretches of speech which contain no silence pauses) are demarcated at the onset and offset with a period (full stop) symbol. Thus the transliteration does not attempt to reflect the syntactic structure of an utterance, but instead only the presence of silence pauses. Note, also, that the transliteration provides no evaluation of or amendment to the grammaticality of an utterance.

7 Format and availability
The sound files are 16-bit stereo (with one speaker on each channel) and have a sampling rate of 16 kHz. The files are provided in the uncompressed Wave PCM format (i.e. *.wav). The word label files are provided as text files in WaveSurfer format. The data are made available as is, with no guarantee of groundbreaking research results. To obtain the data, please e-mail a request to the author, who will provide a web address from which the data can be downloaded.
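For researchers who obtain the data, a minimal label reader can be sketched as below. It assumes the common WaveSurfer-style three-column layout (start time, end time, label), whitespace-separated; the actual time unit and column order should be checked against the downloaded files, and the example file name is hypothetical.

def read_labels(path):
    # Read a WaveSurfer-style word label file: one unit per line, assumed to be
    # "start end label". Returns a list of (start, end, label) tuples.
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                words.append((float(parts[0]), float(parts[1]), " ".join(parts[2:])))
    return words

# Example usage:
# for start, end, word in read_labels("FK_treasure1.lab"):
#     print(start, end, word)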

Acknowledgements
The author would like to thank Sven Björsten, Peter Branderud and Hassan Djamshidpey for their assistance with the recording of the data.

References
Anderson, A.H., M. Bader, E.G. Bard, E. Boyle, G. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. Thompson & R. Weinert, 1991. The HCRC Map Task Corpus. Language and Speech 34(4), 351–366.
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62(2–4), 215–226.
Helgason, P., 2002. Preaspiration in the Nordic Languages: Synchronic and Diachronic Aspects. Ph.D. Thesis, Stockholm University.
Megyesi, B., 2002. Data-Driven Syntactic Analysis – Methods and Applications for Swedish. Ph.D. Thesis, Department of Speech, Music and Hearing, KTH, Stockholm.
Megyesi, B. & S. Gustafson-Čapková, 2002. Production and Perception of Pauses and their Linguistic Context in Read and Spontaneous Speech in Swedish. Proceedings of the 7th ICSLP, Denver.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 61 Working Papers 52 (2006), 61–64

The Relative Contributions of Intonation and Duration to Degree of Foreign Accent in Norwegian as a Second Language

Snefrid Holm Department of Language and Communication Studies, Norwegian University of Science and Technology (NTNU) [email protected]

Abstract
This study investigates the relative contributions of global intonation and global segment durations to degree of foreign accent in Norwegian. Speakers of Norwegian as a second language (N2) with different native languages (L1s) plus one native Norwegian (N1) speaker are recorded reading the same sentence. The N2 utterances' global intonation and global segment durations are manipulated to match the N1 pronunciation. In this way every N2 speaker provides four utterance versions: the original, a duration corrected version, an intonation corrected version and a version with both features corrected. N1 listeners judge the relative degree of foreign accent among each speaker's four utterance versions. The results show that a) the combined correction of both features reduces the degree of foreign accent for all speakers, b) each correction by itself reduces the degree of foreign accent for all but two of the investigated L1 groups and c) some L1 groups benefit more from intonation correction whereas others benefit more from duration correction.

1 Introduction
When learning a second language after early childhood, the resulting speech will normally be foreign accented (e.g. Flege, Munro & MacKay, 1995). The phenomenon of foreign accent is complex and comprises issues regarding the nature of the foreign accent itself as well as the foreign accent's various effects on listeners, for instance regarding social acceptance or the ability to make oneself understood.
A foreign accent may not in itself hinder communication. Although degree of foreign accent is often confounded with degree of intelligibility, a growing body of evidence supports the view that even heavily accented speech may sometimes be perfectly intelligible (Derwing & Munro, 1997; Munro & Derwing, 1995). The relationship between a deviating pronunciation on the one hand and its effect on listener dimensions like intelligibility or perceived degree of foreign accent on the other hand is not clear. There is, however, a general belief that prosodic deviations are more important than segmental ones, at least for intelligibility, although there are rather few studies to support this view (Munro & Derwing, 2005).
This study aims to establish which of the two pronunciation features, global intonation and global segment durations, contributes most to perceived degree of foreign accent in Norwegian as spoken by second language learners. The present paper reports on a study which is part of a larger work where the next step will be to investigate the effect of the same two pronunciation features upon intelligibility. In this way I hope to shed some light upon the relationship between non-native pronunciation, degree of accent, and intelligibility.

2 Experimental procedure
2.1 Recordings
The speakers were 14 adult learners of Norwegian as a second language with British English, French, Russian, Chinese, Tamil, Persian and German as their L1s. There were two speakers from each of the L1s. The speakers were of both sexes. In addition, one male native Norwegian speaker from the South East region was recorded in order to provide an N1 template.
The speech material was recorded in a sound-treated studio using a Milab LSR 1000 microphone and a Fostex D-10 digital recorder. Files were digitized with a sampling rate of 44.1 kHz and later high-pass filtered with a cut-off at 75 Hz. Speech analyses and manipulations were carried out with the Praat program (Boersma & Weenink, 2006). The N2 speakers and the N1 speaker all read the same Norwegian sentence: Bilen kjørte forbi huset vårt ('The car drove past our house').
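The 75 Hz high-pass filtering step could look roughly as follows; this is a sketch rather than the procedure actually used, and it assumes a fourth-order zero-phase Butterworth design, which the paper does not specify. File names are hypothetical.

import soundfile as sf
from scipy.signal import butter, sosfiltfilt

audio, fs = sf.read("recording_44k1.wav")                   # 44.1 kHz recording
sos = butter(4, 75, btype="highpass", fs=fs, output="sos")  # 75 Hz cut-off
filtered = sosfiltfilt(sos, audio, axis=0)                  # zero-phase filtering
sf.write("recording_44k1_hp75.wav", filtered, fs)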

2.2 Stimuli
Global intonation and global segment durations in the N2 utterances were computer-manipulated to match the N1 production of the same sentence. The intonation was manipulated by replacing the intonation contour of each N2 utterance with the stylized intonation contour of the N1 utterance. Because of durational differences between the N1 utterance and the various N2 utterances, the intonation contour had to be manually corrected in the time domain. Because of pitch level differences between the speakers, especially between the sexes, the intonation contour also had to be manually shifted in frequency so as to suit the individual N2 speaker's voice. Manipulation of segment durations required a phonemic segmentation of both the N1 and the N2 utterances. All segment durations were measured and the N2 phonemes were lengthened or shortened to match the segment durations of the N1 utterance.
Three manipulated versions of each speaker's original utterance were generated: one intonation corrected utterance version, one duration corrected utterance version and one utterance version with both features corrected. The stimuli thus consisted of four utterance versions for each speaker: the original utterance and three manipulated versions. These four versions were put together as pairs. Each pair was put in a separate sound file with a two-second pause in between. These stimulus pairs enabled the direct comparison of each speaker's four utterance versions. Note that each stimulus pair consists of two utterance versions from the same speaker so that utterance versions are always compared within speaker.
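The intonation replacement can be sketched with Praat's Manipulation objects, here driven from Python via the parselmouth library; this is an illustration under assumptions, not the authors' actual script, and the manual time-domain correction and speaker-specific frequency shift described above are not shown. File names are hypothetical.

import parselmouth
from parselmouth.praat import call

n1 = parselmouth.Sound("n1_template.wav")
n2 = parselmouth.Sound("n2_original.wav")

# Build Manipulation objects (time step 10 ms, pitch range 75-600 Hz).
n1_manip = call(n1, "To Manipulation", 0.01, 75, 600)
n2_manip = call(n2, "To Manipulation", 0.01, 75, 600)

# Take the N1 pitch tier, impose it on the N2 utterance, and resynthesize.
n1_pitch_tier = call(n1_manip, "Extract pitch tier")
call([n1_pitch_tier, n2_manip], "Replace pitch tier")
corrected = call(n2_manip, "Get resynthesis (overlap-add)")
call(corrected, "Save as WAV file", "n2_intonation_corrected.wav")

# The analogous duration correction would work on a duration tier
# ("Extract duration tier" / "Replace duration tier"), scaled segment by
# segment to match the measured N1 durations.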

2.3 Experiment
Thirteen native Norwegian listeners evaluated the stimulus pairs. None reported out-of-the-ordinary experience with N2 speech, and none reported poor hearing. There were 8 listeners from low-tone dialects and 5 listeners from high-tone dialects. The listeners were paid for their participation.
The listener was seated in a sound-treated studio and the sound was presented through loudspeakers. The listener's task was to judge which of the two utterance versions in each stimulus pair sounded less foreign accented than the other. They also had the option to rate the two utterances as equally foreign accented. All stimulus pairs were presented in random order, and they were presented 10 times each.

The listeners were not told that some of the utterances they would hear were altered through computer manipulation. The participants seemed to find the test design comprehensible.

3 Results
The listeners' responses were subjected to statistical testing. However, no statistics will be presented in this paper. The main findings will be presented and discussed in the following.

3.1 Intonation vs. duration
The results show that when both global intonation and global segment durations are manipulated, this correction reduces the amount of perceived foreign accent in the N2 speech. This effect is statistically significant for all N2 speakers across the different L1s.
When the listeners are exposed to speech where only one pronunciation feature is corrected, it is shown that each correction separately contributes to the reduction of foreign accent. This effect is statistically significant for all L1 groups with two exceptions. For the N2 speakers with British English as their L1, the native listeners judge the degree of foreign accent as unaltered despite global intonation correction. For the N2 speakers with German as their L1, the correction of global segment durations does not affect the perceived degree of foreign accent.
It is thus clear that, in general, both global intonation and segment durations are significant contributors to perceived degree of foreign accent. The interesting question is which of these two corrections reduces the degree of foreign accent most effectively. This is shown to vary between the L1s, as presented in Table 1 below. The table also shows the relative size of the accent reduction brought about by the corrections.

Table 1. The middle column shows the correction that contributes the most to degree of foreign accent for each of the L1s. The rightmost column shows the relative size of the accent reductions. T = Tamil, C = Chinese, E = English, F = French, G = German. > = larger effect.

L1                        Main contribution     Effect size
Tamil, Chinese, English   Segment durations     T > C > E
French, German            Global intonation     F > G
Russian, Persian          Equal effect          -

The table shows that the N2 speech produced by speakers with the native languages Tamil, Chinese and English is affected more by the correction of global segment durations than by the correction of global intonation for the purpose of foreign accent reduction. Conversely, the N2 produced by speakers with the native languages French and German is perceived as having less foreign accent when the global intonation is corrected than when the global durations are corrected. For the Russian and Persian participants there was no difference between the two pronunciation features; correcting global intonation reduces the amount of perceived foreign accent to the same degree as correcting global segment durations.
The L1s can thus be categorized according to which of the two investigated pronunciation features reduces the foreign accent more than the other. There are, however, differences within each of these two categories, as the degree to which the foreign accent is reduced by a correction differs between the L1s. Native speakers of Tamil, Chinese and English all benefit most from duration correction, but the effect of the correction has greater impact on the foreign accent for some L1 groups than for others: the Tamil speakers' N2 is more reduced in foreign accent by the correction than the Chinese speakers' N2, and the Chinese N2 speech is more reduced than the English speakers' N2. Likewise, speakers with French and German as their native languages benefit most from intonation correction, but the foreign accent reduction effect is larger for the French L1 group than for the German L1 group.
The native Norwegian listeners that participated in the experiment represented both low-tone and high-tone dialects. No correlation was found between listener dialect and responses in the perception experiment.

References
Boersma, P. & D. Weenink, 2006. Praat: doing phonetics by computer (Version 4.4.17) [Computer program]. Retrieved April 19, 2006, from http://www.praat.org/.
Derwing, T. & M.J. Munro, 1997. Accent, intelligibility and comprehensibility: Evidence from four L1s. Studies in Second Language Acquisition 19, 1-16.
Flege, J.E., M.J. Munro & I.R.A. MacKay, 1995. Factors affecting strength of perceived foreign accent in a second language. Journal of the Acoustical Society of America 97, 3125-3134.
Munro, M.J. & T. Derwing, 1995. Processing time, accent and comprehensibility in the perception of native and foreign-accented speech. Language and Speech 38, 289-306.
Munro, M.J. & T. Derwing, 2005. Second language accent and pronunciation teaching: A research-based approach. TESOL Quarterly 39(3), 379-397.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 65 Working Papers 52 (2006), 65–68

The Filler EH in Swedish

Merle Horne Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University [email protected]

Abstract
Findings from a pilot study on the distribution, function and phonetic realization of the filler EH in interviews from the SweDia 2000 corpus are presented. The results show that EH occurs almost exclusively after function words at the beginning of constituents. The phonetic realization of EH was seen to be of three basic forms: a middle-high vowel (e.g. [ ], [e], [ ]), a vowel+nasal (e.g. [ m], [ m], [ n]), and a vowel with creaky phonation (e.g. [ ]). The vowel+nasal realization occurs, as has been shown for English, before other delays and is associated with the planning of complex utterances. Since creaky phonation is associated with terminality, the creaky vowel realization of EH could be interpreted as signalling the juncture between the filler and an upcoming disfluency.

1 Introduction
The filler, or 'filled pause', EH has often been termed a 'disfluency' (e.g. Shriberg, 2001), since it constitutes a delay in the flow of referentially meaningful speech. However, since it can often be assigned pragmatic functions, such as signalling an upcoming focussed word (Bruce, 1998), or a need on the part of the speaker to plan or code his/her speech and thus a desire to hold the floor, EH can also be considered to be an integral part of the linguistic system (see e.g. Allwood (1994), and Clark & Fox Tree (2002), who refer to it as a 'word'). In a study on English, Clark & Fox Tree (2002) found that its realization as Uh signals a minor delay in speaking, whereas Um announces a major delay in speaking.
A number of studies on Swedish have reported some characteristics of EH in different speaking styles, but none have focussed on the variation in the phonetic realization of EH as far as I know. Hansson (1998), in a study on the relationship between pausing and syntactic structure in a spontaneous narrative, found that the filled pauses (n=22) in her material occurred at clause boundaries after conjunctions and discourse markers and before focussed words. Lundholm (2000), in a study on pause duration in human-human dialogues, found that the filler EH (n=55) in authentic travel-bureau dialogues occurred in non-turn-initial position and had a duration similar to silent planning pauses (mean = 340 ms). Eklund (2004), in a number of studies on simulated human-human and human-machine dialogues, found that the most common position of EH (n=2601) was utterance-initial before another disfluency and that it was most often followed by jag 'I' and det/den 'it'. The filled pauses were found to have a mean duration of about 500 ms, thus considerably longer than those found by Lundholm (2000) in authentic task dialogues.

2 Current study
The present study has been carried out to pursue the investigation of EH in spontaneous data, to get a better idea of its distribution, function, and phonetic realization in authentic interviews where the speech is basically of a monologue style. Spontaneous speech from the SweDia 2000 interview material was used for the study (http://www.swedia.nu/). The speech of two female speakers from Götaland (a young woman from Orust and an older woman from Torsö) was transcribed and all EH fillers were labeled.

3 Results
3.1 Distribution of EH
The spontaneous SweDia data showed that EH occurs principally in non-utterance-initial position. There were only two cases of EH in utterance-initial position in the data studied, and their mean duration was 899 ms. EH occurs almost exclusively as a clitic to a preceding function word: 127 of the 137 instances of EH were cliticized to a preceding function word. The most frequent function words preceding EH were the coordinate conjunctions och 'and' and men 'but', which often have discourse functions, e.g. introducing topic continuations, new topics, etc. 52 cases of EH occurred after these two function words. Och EH 'and UH' was the most common function word+filler construction and was often (in 30 of 38 cases) preceded or followed by an inhalation break, a clear indicator of a speech chunk boundary (see Horne et al., 2006). The second most frequent function word category preceding EH was the subordinate conjunction att 'that', which also sometimes functions as a discourse marker introducing a non-subordinate clause. 24 instances of EH occurred after att. The other instances of EH were found after the following function word categories: prepositions (n=13), articles (n=9), subject pronouns (n=9), basic or auxiliary verbs (n=9), demonstrative articles (n=5), indefinite adjectives (n=3), subordinate conjunctions other than att 'that' (n=2), and negation (n=1). Content words preceded EH in only 7 cases. Finally, there was one case where EH was a repetition.

3.2 Phonetic realization of EH in non-initial position
Three basic realizations of the filler EH have been observed: (1) a middle-high front or central vowel, e.g. [ ], [e], [ ] (see Fig. 1); (2) a nasalized vowel or vowel+nasal consonant, e.g. [ n], [ m], [ n] (see Fig. 2); (3) a glottalized or creaky vowel, e.g. [ ] (see Fig. 3).

Figure 1. Example of the realization of EH as the middle high vowel [ ].

The vowel realizations of EH were the most frequent (n=61) and had a mean duration of 268 ms and a SD of 136 ms. The nasalized or vowel+nasal realizations were second in frequency (n=43), and had a mean duration of 436 ms and a SD of 185 ms. These showed a distribution like the vowel+nasal fillers in English that Clark & Fox Tree (2002) analysed, i.e. they were always followed by other kinds of 'delays', sometimes several in sequence, as in Figure 2 with SWALLOW, SMACK and INHALE following EH.

Figure 2. Example of the realization of EH as a vowel+nasal [ m]. Notice the other delays (SWALLOW, SMACK, INHALE) following [ m].

Figure 3. Example of the realization of EH as the creaky vowel [ ].

The creaky vowel realizations of EH were the fewest (n=31) and had a mean duration of 310 ms and a SD of 150 ms. Their duration thus overlaps with the durations of the vowel and vowel+nasal realizations. Unlike the vowel+nasal realizations, the only other delay that was observed to follow the creaky filler was a silent pause. Creaky fillers are in some sense unexpected, since EH is often assumed to be a signal that the speaker wants to hold the floor, whereas creak, on the other hand, is assumed to be a signal of finality (Ladefoged, 1982). Nakatani & Hirschberg (1994), however, have observed that glottalization is not uncommon before a speech repair, and thus the creaky EH could be interpreted as a juncture signal for an upcoming disfluency. Observation of the SweDia data shows in fact that the creaky realizations have a tendency to occur before disfluencies, as in the following examples:

a) men den såg ju inte ut [ ] det var någon 'but it did not look [ ] it was somebody', b) å då var det en [ ] en kar som heter Hans Nilsson som blev ordförande 'and then there was a [ ] a guy named Hans Nilsson who became chairman'. Creaky fillers also occur in environments where the speaker seems to be uncertain or to have problems in formulating an utterance, e.g. för att då blev det ju så att PAUSE [ ] PAUSE Johannesberg det skulle ju läggas ner 'since it happened that PAUSE [ ] PAUSE Johannesberg it was going to be shut down'.

4 Summary and conclusion
This study on the distribution, function and phonetic realization of the filler EH has shown that the occurrence of EH in the SweDia spontaneous speech studied here is mostly restricted to a position following a function word at the beginning of an utterance. This supports and generalizes the findings of Hansson (1998) and Lundholm (2000), who found the filler EH most often in utterance-internal position after conjunctions/discourse markers in spontaneous speech, both of a monologue and dialogue type. This differs from the findings for simulated task-related dialogues in Eklund (2004), where the filler EH occurred almost exclusively in utterance-initial position. This difference is most likely due to the simulated nature of the speech situation, where the planning and coding of speech is more cognitively demanding.
As regards the phonetic realization of the filler EH, the patterning in Swedish is seen to be partially similar to the findings of Clark & Fox Tree (2002) for English: a vocalic realization of EH occurs before shorter delays in speech, whereas a vowel+nasal realization correlates with relatively longer delays. The duration of the vocalic realizations in the present data (mean = 268 ms) corresponds rather well with the median duration for EH found by Lundholm (240 ms) in spontaneous dialogues; thus, we would expect that the fillers in her data were realized mainly as a vowel such as [ ], [e] or [ ]. A third realization, with creaky phonation, whose distribution overlaps with the other two realizations, would appear to be associated with relatively more difficulty in speech coding; the creaky phonation, associated with termination, perhaps signals that the speaker has problems in formulating or coding his/her speech, and was observed to sometimes occur before repairs and repetitions. More data are needed, however, in order to draw firmer conclusions.

Acknowledgements
This research was supported by a grant from the Swedish Research Council (VR).

References
Allwood, J., 1994. Om dialogreglering. In N. Jörgenson, C. Platzack & J. Svensson (eds.), Språkbruk, grammatik och språkförändring. Dept. of Nordic Lang., Lund U., 3-13.
Bruce, G., 1998. Allmän och svensk prosodi. Dept. of Linguistics & Phonetics, Lund U.
Clark, H. & J. Fox Tree, 2002. Using uh and um in spontaneous speech. Cognition 84, 73-111.
Eklund, R., 2004. Disfluency in Swedish: Human-human and human-machine travel booking dialogues. Ph.D. Dissertation, Linköping University.
Hansson, P., 1998. Pausering i spontantal. B.A. essay, Dept. of Ling. & Phonetics, Lund U.
Horne, M., J. Frid & M. Roll, 2006. Timing restrictions on prosodic phrasing. Proceedings Nordic Prosody IX, Frankfurt am Main: P. Lang, 117-126.
Ladefoged, P., 1982. The linguistic use of different phonation types. University of California Working Papers in Phonetics 54, 28-39.
Lundholm, K., 2000. Pausering i spontana dialoger: En undersökning av olika paustypers längd. B.A. essay, Dept. of Ling. & Phonetics, Lund U.
Nakatani, C. & J. Hirschberg, 1994. A corpus-based study of repair cues in spontaneous speech. Journal of the Acoustical Society of America 95, 1603-1616.
Shriberg, E., 2001. To 'errrr' is human: ecology and acoustics of speech disfluencies. Journal of the International Phonetic Association 31, 153-169.
SweDia 2000 Database: http://www.swedia.nu/.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 69 Working Papers 52 (2006), 69–72

Modelling Pronunciation in Discourse Context

Per-Anders Jande Dept. of Speech, Music and Hearing/CTT, KTH [email protected]

Abstract
This paper describes a method for modelling phone-level pronunciation in discourse context. Spoken language is annotated with linguistic and related information in several layers. The annotation serves as a description of the discourse context and is used as training data for decision tree model induction. In a cross validation experiment, the decision tree pronunciation models are shown to produce a phone error rate of 8.1% when trained on all available data. This is an improvement by 60.2% compared to using a phoneme string compiled from lexicon transcriptions for estimating phone-level pronunciation and an improvement by 42.6% compared to using decision tree models trained on phoneme layer attributes only.

1 Introduction and background
The pronunciation of a word is dependent on the discourse context in which the word is uttered. The dimension of pronunciation variation under study in this paper is the phone dimension, and only variation such as the presence or absence of phones and differences in phone identity is considered. The focus is on variation that can be seen as a property of the language variety rather than as individual variation or variation due to chance.
Creating models of phone-level pronunciation in discourse context requires a detailed description of the context of a phoneme. Since the discourse context is the entire linguistic and pragmatic context in which the word occurs, the description must include everything from high-level variables such as speaking style and over-all speech rate to low-level variables such as articulatory feature context.
Work on pronunciation variation in Swedish has been reported by several authors, e.g. Gårding (1974), Bruce (1986), Bannert & Czigler (1999) and Jande (2003; 2005). There is an extensive corpus of research on the influence of various context variables on the pronunciation of words. Variables that have been found to influence the segmental realisation of words in context are foremost speech rate, word predictability (or word frequency) and speaking style, cf. e.g. Fosler-Lussier & Morgan (1999), Finke & Waibel (1997), Jurafsky et al. (2001) and Van Bael et al. (2004).

2 Method
In addition to the variables mentioned above, the influence of various other variables on the pronunciation of words has been studied, but these have mostly been studied in isolation or together with a small number of other variables. A general discourse context description for recorded speech data, including a large variety of linguistic and related variables, will enable data-driven studies of the interplay between various information sources on e.g. phone-level pronunciation. Machine learning methods can be used for such studies. A model of pronunciation variation created through machine learning can be useful in speech technology applications, e.g. for creating more dynamic and natural-sounding speech synthesis. It is possible to create models which can predict the pronunciation of words in context and which are simultaneously descriptive and to some degree explain the interplay between different types of variables involved in the predictions.
The decision tree induction paradigm is a machine learning method that is suitable for training on variables of diverse types, such as those that may be included in a general description of discourse context. The paradigm also creates transparent models. This paper describes the creation of pronunciation models using the decision tree paradigm.

2.1 Discourse context description
The speech databases annotated comprise ~170 minutes of elicited and scripted speech. Canonical phonemic word representations are collected from a pronunciation lexicon and the phoneme is used as the central unit in the pronunciation models. The annotation is aimed at giving a general description of the discourse context of a phoneme and is organised in six layers: 1) a discourse layer, 2) an utterance layer, 3) a phrase layer, 4) a word layer, 5) a syllable layer and 6) a phoneme layer. Each layer is segmented into a linguistically meaningful type of unit which can be aligned to the speech signal, and the information included in the annotation is associated with a particular unit in a particular layer. For example, in the word layer, information about part of speech, word frequency, word length etc. is included. The information associated with the units in the phoneme layer is instead phoneme identity, articulatory features etc. For a more detailed description of the annotation, cf. Jande (2006).
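One way to picture the layered annotation is as time-aligned units that each carry layer-specific features; the sketch below only illustrates the organisation described above (the actual format is documented in Jande, 2006), and all class and field names are invented here.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Unit:
    start: float                      # alignment to the speech signal, in seconds
    end: float
    features: Dict[str, object] = field(default_factory=dict)  # e.g. part of speech, word frequency

@dataclass
class Layer:
    name: str                         # "discourse", "utterance", "phrase", "word", "syllable", "phoneme"
    units: List[Unit] = field(default_factory=list)

@dataclass
class DiscourseAnnotation:
    layers: List[Layer] = field(default_factory=list)

    def layer(self, name: str) -> Optional[Layer]:
        # Look up a layer by name, e.g. annotation.layer("phoneme").
        return next((l for l in self.layers if l.name == name), None)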

2.2 Training data
Decision trees are induced from a set of training instances compiled from the annotation. The training instances are phoneme-sized and can be seen as a set of context-sensitive phonemes. Each training instance includes a set of 516 attribute values and a phone realisation, which is used as the classification key. The features of the current unit at each layer of annotation are included as attributes in the training examples. Where applicable, information from the neighbouring units at each annotation layer is also included in the attribute sets.
The key phone realisations are generated by a hybrid automatic transcription system using statistical decoding and a posteriori correction rules. This means that there is a certain degree of error in the keys. When compared to a small gold standard transcription, the automatic transcription system was shown to produce a phone error rate (PER) of 15.5%. Classification is not always obvious at manual transcription, e.g. in the many cases of choosing between a full vowel symbol and a schwa. Defaulting to the system decision whenever a human transcriber is forced to make ad hoc decisions would increase the speed of manual checking and correction of automatically generated phonetic transcripts without lowering the transcript quality. If this strategy had been used at gold standard compilation, the estimation of the system accuracy would have been somewhat higher. The 15.5% PER is thus a pessimistic estimate of the transcription system performance.
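Phone error rate is not defined in the paper; the sketch below assumes the usual Levenshtein-based definition (substitutions, deletions and insertions divided by the number of reference phones), with toy strings standing in for phone sequences.

def phone_error_rate(ref, hyp):
    # Levenshtein distance between two phone sequences divided by the reference length.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(phone_error_rate("bilen", "bien"))  # one deletion against five reference phones -> 0.2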

2.3 Decision tree model induction
Decision tree induction is non-iterative and trees are built level by level, which makes the learning procedure fast. However, the optimal tree is not guaranteed. At each new level created during the tree induction procedure, the set of training instances is split into subsets according to the values of one of the attributes. The attribute selected is the attribute that best meets a given criterion, generally based on entropy minimisation. Since training data mostly contain some degree of noise, a decision tree may be biased toward the noise in the training data (over-trained). However, a tree can be pruned to make it more generally applicable. The idea behind pruning is that the most common patterns are kept in the model, while less common patterns, with high probability of being due to noise in the training data, are deleted.

3 Model performance
A tenfold cross validation procedure was used for model evaluation. Under this procedure, the data is divided into ten equally sized partitions using random sampling. Ten different decision trees are induced, each with one of the partitions left out during training. The partition not used for training is then used for evaluation. A pruned and an unpruned version of each tree were created and the version with the highest prediction accuracy on the evaluation data was used for calculating the average prediction accuracy.
The annotation contains some prosodic information (variables based on pitch and duration measures calculated from the signal), which cannot be fully exploited in e.g. a speech synthesis context. Thus, it was interesting to investigate the influence of the prosodic information on model performance. For this purpose, a tenfold cross validation experiment where the decision tree inducer did not have access to the prosodic information was performed. As a baseline, an experiment with trees induced from phoneme layer information only was also performed.
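The evaluation procedure can be sketched with an off-the-shelf decision tree learner; this differs from the tool actually used, and it assumes the 516 attributes have been numerically encoded. File names and the pruning parameter values are hypothetical.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X = np.load("attributes.npy")     # one row per context-sensitive phoneme
y = np.load("keys.npy")           # phone realisation used as classification key

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    best = 0.0
    # Train an unpruned and a (cost-complexity) pruned tree and keep whichever
    # predicts the held-out partition better, mirroring the procedure above.
    for ccp_alpha in (0.0, 1e-4):
        tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=ccp_alpha)
        tree.fit(X[train_idx], y[train_idx])
        best = max(best, tree.score(X[test_idx], y[test_idx]))
    accuracies.append(best)

print(np.mean(accuracies))        # average prediction accuracy over the ten folds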

3.1 Results
Attributes from all layers of annotation were used in the models with the highest prediction accuracy. The topmost node of all trees was phoneme identity and other high-ranking attributes were phoneme context, mean phoneme duration measured over the word and over the phrase, and function word, a variable separating between a generic content word representation and the closed set of function words.
The trees produced an average phone error rate of 8.1%, which is an improvement by 60.2% compared to using a phoneme string compiled from a pronunciation lexicon for estimating the phone-level realisation. The average PER of models trained on phoneme layer attributes only was 14.2%, which means that the prediction accuracy was improved by 42.6% by adding attributes for units above the phoneme layer to the training instances. A comparison between the models trained on all attributes and the models trained without access to prosodic information showed that the prosodic information gave a decrease in PER from 13.1 to 8.1% and thus increased model performance by 37.8%.
The phonetic transcript generated by the models trained on all attributes was also evaluated against actual target transcripts, i.e., the gold standard used to evaluate the automatic transcription system. In this evaluation, the models produced a PER of 16.9%, which means that the deterioration in performance when using an average decision tree model instead of the automatic transcription system is only 8.5% and that the improvement using the model instead of a phoneme string is 34.9%.

4 Model transparency
Figure 1 shows a pruned decision tree trained on all available data. The tree uses 58 of the 516 available attributes in 423 nodes on 12 levels. The transparency of the decision tree representation becomes apparent from the magnification of the leftmost sub-tree under the top node, shown in the lower part of Figure 1.
The top node of the tree is phoneme identity and the magnified branch is the branch representing phoneme identity /v/. It can be seen that there are two possible realisations of the phoneme /v/, [v] and null (no realisation), and it is easy to understand the conditions under which the respective realisations are used. If the mean phoneme duration over the word is less than 35.1 ms, the /v/ is never realised. If the mean phoneme duration is between 35.1 and 38.2 ms, the current word is decisive. If the word is one of the function words vad, vi, vara, vid, or av, the /v/ is not realised. If the word is any content word or the function word blev, the /v/ is realised as [v]. Finally, if the mean phoneme duration over the word is more than 38.2 ms, the /v/ is realised (as [v]) unless the phoneme to the right is also a /v/.
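Read off as rules, the magnified /v/ branch amounts to the following; the function is only a paraphrase of the conditions stated above, with invented argument names.

def realise_v(mean_phoneme_duration_ms, word, next_phoneme):
    # Returns the predicted realisation of /v/: "v" or None (no realisation).
    dropping_function_words = {"vad", "vi", "vara", "vid", "av"}
    if mean_phoneme_duration_ms < 35.1:
        return None                                   # never realised in very short words
    if mean_phoneme_duration_ms <= 38.2:
        return None if word in dropping_function_words else "v"
    return None if next_phoneme == "v" else "v"       # dropped only before another /v/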


[Figure 1 is not reproducible in this text version; only its caption is retained.]
Figure 1. The upper part of the figure shows the pruned version of a decision tree and the lower part of the figure shows a magnification of a part of the tree.

Acknowledgements The research reported in this paper is carried out at the Centre for Speech Technology, a competence centre at KTH, supported by VINNOVA (the Swedish Agency for Innovation Systems), KTH and participating companies and organisations. The work was supported by the Swedish Graduate School of Language Technology, GSLT.


Are Verbs Less Prominent?

Christian Jensen Department of English, Copenhagen Business School [email protected]

Abstract The perceived prominence of three parts of speech (POS), nouns, verbs and adjectives, in three utterance positions, initial, intermediate and final, was examined in a perceptual experiment to see whether previously observed reductions in prominence of intermediate items were the result of positional effects or because words in this position belonged to the same POS, namely verbs. It was found that the perceived prominence of all three POS was reduced in intermediate position, and that the effect of POS membership was marginal, although adjectives tended to be slightly more prominent than nouns and verbs.

1 Introduction In a previous study of the perceived prominence of accented words in Standard Southern British English (SSBE) (Jensen, 2003; 2004) it was found that, in short sentences, accented words in utterance initial and utterance final position are generally perceived as more prominent than accented words in an intermediate position. This is in accordance with traditional descriptions of intonation in SSBE and has also been observed in German (Widera, Portele & Wolters, 1997) and, at least with regard to utterance initial position, Dutch (Streefkerk, 2001). In utterances with multiple intermediate accented lexical items these seemed to form an alternating strong – weak pattern, and the complete pattern of the entire utterance was explained (in part) as reflecting the intermediate accent rule, which states that “any accented syllables between onset and nucleus are liable to lose their accent” (Knowles, 1987: 124). However, it was suggested to me that the observed pattern might not be a general property of the prosodic structure of utterances (or phrases), but rather a reflection of lexical/semantic properties of the sentences employed in the study. Most of these were of the type Bill struck Ann and Sheila examined the patient carefully , i.e. SVO structure with a verb as the second lexical item. Some studies have noted a tendency for verbs to be perceived as less prominent than other lexical items in various languages: Danish (Jensen & Tøndering, 2005), Dutch (Streefkerk, 2001) and German (Widera, Portele & Wolters, 1997), so the reduction in perceived prominence, which was particularly noticeable immediately following the first accent of the utterance, could be the result of an inherent property of verbs. The present study examines whether the tendency towards intermediate accent reduction can be reproduced in utterances with varying lexico-syntactic structure and addresses the following question: does the perceived prominence of a lexical item vary as a function of its part of speech (POS) membership independently of the position of this item in an utterance? Specifically, are verbs, in their function as main verbs in a clause, inherently less prominent than (some) other parts of speech?

2 Method Since the perceived prominence of words in utterances depends on factors other than the ones studied here, most importantly information structure, it is necessary to find an experimental design which limits the influence of these factors as far as possible. This effectively rules out studies of spontaneous speech, since the influences of information structure and the lexical content of the accented words are likely to mask the effects of location within an utterance. A relatively large corpus of spontaneous speech would be required to bring out these effects, which is not practical when measurements of perceived prominence are elicited through the ratings of multiple listeners (see below). Instead, the research question outlined above is addressed through a simple design involving read speech. Verbs are compared with two other POS categories, namely nouns and adjectives. While verbs are often found to be less prominent than other lexical items, nouns and adjectives are consistently found to be among the most prominent words. The inclusion of these word classes should therefore maximise any potential difference between verbs and "other lexical items". A number of sentences were constructed, each of which contained three lexical items, one verb, one noun and one adjective, which were all expected to be accented. All six possible combinations were used, with two examples of each type, giving a total of 12 different sentences. Some examples of sentences from the material: The children claimed they were innocent (noun – verb – adj); The little girl was crying (adj – noun – verb); He admitted she was a beautiful woman (verb – adj – noun). The decision to include all logical possibilities means that some of the sentence types are more common, or "natural", than others and also poses certain restrictions on verb forms, for example when they occur in final position. However, this should not have any negative influence on the research question as it is formulated above. Using this design, each POS occurs four times in each of the three positions in the sentence. The 12 sentences were recorded onto a computer by three speakers of Southern British English, giving a total of 36 utterances, which were presented to the raters via a web page, one utterance per page. The raters could hear the utterance as many times as they wanted by pressing a "play" button, and indicated their judgment by selecting the appropriate scale point for each lexical item and then clicking a "submit" button. A four-level scale was used, from 1 to 4, with 1 representing "low degree of emphasis" and 4 representing "high degree of emphasis". A four-level scale has been demonstrated to be preferable to commonly used alternatives such as a binary scale or a 31-level scale (Jensen & Tøndering, 2005). The lower end of the scale was represented by 1 rather than 0 here to signal that all words were expected to have some degree of emphasis, since function words were excluded. Note also that the word emphasis was used in the written instructions to the untrained, linguistically relatively naive listeners, but refers to the phenomenon which elsewhere I call perceptual prominence and not (just) to higher levels of prominence, such as contrastive emphasis. The notion of "emphasis" (i.e. perceptual prominence) was both explained and exemplified in the online instructions. 23 raters participated in the experiment, all students of English at the Copenhagen Business School.

3 Results The reliability of the data as a whole is good, with a Cronbach's alpha coefficient of 0.922. However, reliability coefficients for any group of three or five raters were relatively low, which indicates some uncertainty on the part of individual raters. The overall ratings averaged over POS membership and position in the utterance are displayed in Figure 1.
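As an illustration of how such a reliability figure can be computed, the sketch below gives Cronbach's alpha over a tokens-by-raters matrix in Python with NumPy; the paper does not state which software was used, and the toy data are invented, so this is only a sketch of the calculation, not the study's actual analysis.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a 2-D array of shape (tokens, raters).

    Each column (one rater) is treated as a 'test item', each row as one
    rated word token, which is the usual set-up for inter-rater consistency.
    """
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1)       # variance of each rater's scores
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (n_raters / (n_raters - 1.0)) * (1.0 - item_variances.sum() / total_variance)

# Invented example: 5 word tokens rated by 4 raters on the 1-4 scale.
toy = np.array([[1, 2, 1, 2],
                [3, 3, 4, 3],
                [2, 2, 2, 3],
                [4, 4, 3, 4],
                [2, 3, 2, 2]])
print(round(cronbach_alpha(toy), 3))
```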

Figure 1. Prominence ratings based on 36 utterances (12 sentences × three speakers) averaged over POS and position in utterance. Each bar represents average scores of 12 utterances as perceived by all 23 raters on a scale from 1 to 4.

Average ratings for the three utterance positions and three parts of speech are all between 2.42 and 2.97 on the scale from 1 to 4. With regard to the effect of POS membership, verbs were not found to be less prominent than nouns, but they were rated slightly lower (by 0.16 on the scale from 1 to 4) than adjectives. The difference is significant (one-tailed t-test, p < 0.05). Adjectives were also in general found to be significantly more prominent than nouns (two-tailed t-test, p < 0.001), which was not predicted (hence the use of two-tailed probability). As expected, words in second position are perceived as less prominent than words in initial position by approximately 0.5 on the scale from 1 to 4. The difference is significant (one-tailed t-test, p < 0.001). Somewhat surprisingly, words in final position are only slightly more prominent (by 0.08) than words in second position, and the difference is only just significant (one-tailed t-test, p < 0.05). The difference between initial and final position is highly significant (two-tailed t-test, p < 0.001). This pattern, and in particular the low prominence ratings of words in final position, was not expected, but it is partly caused by the fact that so far only utterance position has been taken into consideration. In some cases the speaker (particularly one) divided these short utterances into two phrases, which may obviously have an effect on the expected prominence relations (as produced by the speakers and perceived by the listeners). Therefore, phrase boundaries were evaluated by three trained phoneticians (including the author) and assigned to the material in those cases where at least two out of three had perceived a boundary. This process divides all accented words into three categories in accordance with traditional British descriptions of English intonation: nucleus, which is the last accented word of a phrase; onset, which is the first accented word in a phrase with more than one accent; and intermediate (my terminology), which is any accented word between onset and nucleus. Figure 2 displays prominence ratings for these three positions both across all three parts of speech and for each POS separately. The overall pattern of prominence ratings according to phrase position is similar to the patterning according to utterance position in Figure 1 (24 out of 36 cases are identical), but words in intermediate position are more clearly less prominent than words in phrase-final (nucleus) position. All differences between onset, nucleus and intermediate position are highly significant (p < 0.001). If we examine the results for the three parts of speech separately we can see that verbs and adjectives behave similarly: onset and nucleus position are (roughly) equally prominent (p > 0.1) but intermediate accents are less prominent (p < 0.01). The difference is larger for verbs than for adjectives. For nouns, however, the onset position is significantly more prominent than both intermediate and nucleus accents (p < 0.001) while the latter are equally prominent (p > 0.1).
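The significance tests reported above can be reproduced in outline as follows; the sketch uses SciPy's independent-samples t-test with a directional alternative (requires SciPy 1.6 or later), but the paper does not specify whether paired or independent tests were used, and the numbers below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Invented per-word mean prominence ratings (scale 1-4), NOT the study's data.
initial_words = np.array([3.1, 2.9, 3.0, 2.8, 3.2, 2.7, 3.0, 2.9])
second_words  = np.array([2.5, 2.4, 2.6, 2.3, 2.5, 2.2, 2.6, 2.4])

# Directional hypothesis: initial position is rated higher than second position.
t, p_one_tailed = stats.ttest_ind(initial_words, second_words, alternative='greater')
print(f"t = {t:.2f}, one-tailed p = {p_one_tailed:.4f}")
```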


Figure 2. Prominence ratings for the three parts of speech in three different positions in the intonation phrase.

4 Conclusion There is a clear effect of phrase position on the perceived prominence of lexical items for all three POS, nouns, verbs and adjectives, such that words in intermediate position are less prominent than words in onset (initial) or nucleus (final) position (nouns in nucleus position excepted). The effect noted in Jensen (2004) – reduction of perceived prominence of intermediate accents – is therefore replicated here and is not likely to have been the result of a certain syntactic structure with verbs in intermediate position. With regard to the effect of POS membership, it seems that adjectives are generally slightly more prominent than verbs or nouns. This may be the result of a certain affective content of (some or all of) the adjectives. Although care had been taken to avoid overly affective adjectives, it is difficult to control for minor variations of this parameter. The interpretation of the results is complicated by the fact that nouns were rated as very prominent in onset position but markedly less so in nucleus position. Such a difference was not found in similar sentences in Jensen (2004), and I have no immediate explanation for this observation. The question raised in the title and introduction of this paper must therefore be answered somewhat tentatively: while verbs were found to be slightly less prominent than adjectives, the difference was rather small. And while verbs were found to be as prominent as nouns overall, they were less prominent in onset position but more prominent in nucleus position. The implications of this surprising result await further investigation.

References Jensen, C., 2003. Perception of prominence in standard British English. Proceedings of the 15th ICPhS 2003, Barcelona, 1815-1818. Jensen, C., 2004. Stress and accent. Prominence relations in Southern Standard British English. PhD thesis, University of Copenhagen. Jensen, C. & J. Tøndering, 2005. Choosing a scale for measuring perceived prominence. Proceedings of Interspeech 2005, Lisbon, 2385-2388. Knowles, G., 1987. Patterns of spoken English. London and New York: Longman. Streefkerk, B., 2001. Acoustical and lexical/syntactic features to predict prominence. Proceedings 24, Institute of Phonetic Sciences, University of Amsterdam, 155-166. Widera, C., T. Portele & M. Wolters, 1997. Prediction of word prominence. Proceedings of Eurospeech '97, Rhodes, 999-1002.


Variation and Finnish Influence in Finland Swedish Dialect Intonation

Yuni Kim Department of Linguistics, University of California at Berkeley [email protected]

Abstract Standard Finland Swedish is often described as having Finnish-like intonation, with characteristic falling pitch accents. In this study, it is found that the falling pitch accent occurs with varying degrees of frequency in different Finland Swedish dialects, being most frequent in the dialects that have had the greatest amount of contact with Finnish, and less frequent (though in many cases still part of the intonational system) elsewhere.

1 Introduction It is generally known that the Swedish dialects of Finland, with the exception of western Nyland (Selenius, 1972; Berg, 2002), have lost the historical word accent contrast between Accent 1 and Accent 2. What is less clear is what kinds of intonational systems the dialects have developed, and how these relate to the previous word-accent system on the one hand, and contact with Finnish (often via Finnish-influenced prestige Swedish varieties) on the other. In their prosodic typology of Swedish dialects, Bruce & Gårding (1978) classified Helsinki Swedish as type 0 (Far East), with falling pitch throughout the word, and western Nyland as type 2A (Central). As for other rural Finland Swedish dialects, subsequent research (Selenius, 1978; Svärd, 2001; Bruce, 2005; Aho, ms.) has suggested that many fit neither category straightforwardly. The purpose of the present study is to gauge how widespread the falling pitch accent is in Finland Swedish. It may be taken as a sign of Finnish influence, since it is the basic pitch accent in Finnish (see e.g. Mixdorff et al., 2002) but generally not attested in Sweden. Since the investigated dialects appeared to have intonational inventories with multiple pitch accents, unlike the lexical-accent dialects of Sweden, a quantitative component was undertaken to assess the frequency of falling pitch accents intradialectally. The results should be seen as preliminary due to the limited size of the corpus, but they point to some interesting questions for future research.

2 Materials and methods The materials used here were archaic dialect recordings, consisting of interviews and spontaneous narratives, from the CD accompanying Harling-Kranck (1998). The southern dialects included in the study were, from east to west, Lappträsk (eastern Nyland; fi. Lapinjärvi), Esbo (central Nyland; fi. Espoo), Kimito and Pargas (eastern Åboland; fi. Kemiö and Parainen, respectively). The northern dialects, south to north, were Lappfjärd (southern Österbotten; fi. Lapväärtti), Vörå (central Österbotten; fi. Vöyri), and Nedervetil (northern Österbotten; fi. Alaveteli). There was one speaker per dialect. The speakers, all female, were born between 1880 and 1914 and were elderly at the time of recording (1960s–1980s).

Between 1 and 3 minutes of speech from each dialect was analyzed using PRAAT. Accented words of two or more syllables were identified and given descriptive labels according to the shape of F0 in the tonic and post-tonic syllables. Monosyllables were not counted due to difficulties in determining which pitch accents they had. Non-finally, falling pitch accents were defined as those where the tonic syllable had a level or rising-falling F0, followed by a lower F0 in the post-tonic syllable. In phrase-final position, only words with falling F0 throughout were counted as having a falling pitch accent (as opposed to words with rising F0 through the tonic syllable followed by a boundary L).
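The labelling criteria can be stated as a small decision function. The sketch below is my own operationalisation of the description above (the 5 Hz tolerance and the input format are assumptions, not values from the study):

```python
import numpy as np

def is_falling_accent(tonic_f0, posttonic_f0=None, phrase_final=False, margin_hz=5.0):
    """Sketch of the labelling criteria described above (thresholds are my own).

    tonic_f0 / posttonic_f0: F0 samples in Hz for the tonic and post-tonic
    syllables of an accented word; for phrase-final words, pass the F0 of the
    whole word as tonic_f0.
    """
    tonic = np.asarray(tonic_f0, dtype=float)
    if phrase_final:
        # Phrase-finally, only words with falling F0 throughout count as falling.
        return bool(np.all(np.diff(tonic) <= margin_hz))
    post = np.asarray(posttonic_f0, dtype=float)
    # Non-finally: a level or rising-falling tonic (its peak is not at the very
    # end) followed by clearly lower F0 in the post-tonic syllable.
    peak_not_final = tonic.argmax() < len(tonic) - 1 or np.ptp(tonic) <= margin_hz
    return bool(peak_not_final and post.mean() < tonic.mean() - margin_hz)
```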

3 Results 3.1 Eastern and central Nyland Lappträsk (eastern Nyland) and Esbo (central Nyland) overwhelmingly used falling pitch accents: 27 out of 30 total pitch accents and 33 out of 34, respectively. This result agrees with Aho’s (forthcoming) study of the Liljendal dialect of eastern Nyland. Just over a minute of speech was analyzed for each of these two speakers, which attests to the high density of accented words, especially given that monosyllabic accented words were not counted. This density is also characteristic of Finnish, which accents nearly every content word (Mixdorff et al., 2002; see also Kuronen & Leinonen, 2001).

Figure 1. Falling pitch accents in the Esbo dialect: (ja tjö:rd) ti:e hö:lass ti sta:n men vå:ga int gå opp å å:ka '(I drove) ten hay bales to town but didn't dare go up and ride'.

For the Lappträsk speaker, the non-falling pitch accents consisted mainly of a low pitch on the tonic syllable followed by a high post-tonic (hereafter Low-High), which was used on 2 tokens and additionally on 5 or 6 monosyllables that were not included in the main count. The Esbo sample did not contain the Low-High accent, though a longer sample might have revealed some tokens. Interestingly, the Lappträsk and Esbo speech samples each contained two examples of an emphasis intonation that is reminiscent of Central or Western Swedish Accent 2. This pitch accent involves a sharp fall in the stressed syllable, followed by a rise culminating in a peak that may lie several syllables after the tonic (cf. Bruce, 2003).

3.2 Eastern Åboland Eastern Åboland presents a different picture, where the falling pitch accent is infrequent. In the Kimito data (about 1 minute 40 seconds), 6 out of 40 pitch accents were counted as falling, of which five were the last accented word in the phrase. In the Pargas data (about two and a half minutes), none of the 35 pitch accents were classified as falling. Seven of the 8 phrase-final tokens had an F0 rise in the tonic syllable with a L% boundary tone, however, making at least the disyllables auditorily somewhat similar to falling pitch accents.

In both Kimito and Pargas, the majority of the pitch accents were either Low-High, or had a rise in the tonic syllable plateauing to a level high pitch over the next few syllables (hereafter Rise-Plateau). The Kimito and Pargas samples also contained one instance each of an accent with a sharp fall in the tonic syllable, like in Esbo and Lappträsk, but no subsequent rise.

Figure 2. Kimito dialect: pengar å allting annat mie: 'money and everything else too'. Pengar has a Low-High pitch accent (the initial peak is due to consonant perturbation) and annat has a falling pitch accent. The peak in allting is due to background noise.

3.3 Österbotten The results for Lappfjärd (southern Österbotten) were similar to those for eastern Åboland in that only one of the 34 pitch accents (in about 2 minutes of material) was classified as falling. The remaining accents had Low-High and Rise-Plateau shapes, along with 3 instances of sharp falls. In Vörå (central Österbotten), 3 of the 34 pitch accents (in about 2 minutes of material) were falling, while the rest were Low-High or Rise-Plateau (plus one instance of sharp falling). Both Lappfjärd and Vörå had boundary L% tones, as in Pargas. These results are consistent with Aho’s (ms.) findings on intonation in the central Österbotten dialect of Solv. Lastly, Nedervetil (northern Österbotten) had 13 falling pitch accents, in a variety of sentence positions, out of 34 total accents (in about 2 minutes). This was a higher proportion than any of the other Österbotten or Åboland dialects investigated. Of the remaining 21 accents, 20 were labeled as Rise-Plateau and only one was Low-High.

4 Discussion The preliminary result of this study is that falling pitch accents of the Finnish type are very frequent in Swedish dialects of eastern and central Nyland, common in northern Österbotten, and less frequent or marginal elsewhere in Österbotten and in eastern Åboland. A natural explanation for this is that the eastern and northern outposts of Swedish Finland – eastern Nyland and northern Österbotten, respectively – have, as border regions, had the heaviest contact with Finnish. For example, central and eastern Nyland were the first regions to lose the word accent contrast (Vendell, 1897), and anecdotal evidence suggests that northern Österbotten dialects have some Finnish-like phonetic/phonological features that are not found elsewhere in Österbotten. The attestation of dialects where the falling pitch accent exists but has a limited role suggests that intonational variation in Finland Swedish might be fruitfully studied to form a sociolinguistic or diachronic picture of how various dialects have made, or are in the process of making, a gradual transition from Swedish-like to Finnish-like intonational systems. A number of topics would need to be addressed that have been outside the scope of the present study, such as the phonetics, pragmatics, and the positional distributions of the various pitch accents. For instance, the intonational phonologies of eastern Åboland and Österbotten are

probably quite different, despite the fact that their pitch accents have phonetic similarities which in this paper have been subsumed under the same descriptive labels.

Acknowledgements I wish to thank audience members at ICLaVE 3 for helpful discussion on an earlier version of this paper, and Eija Aho for sending copies of her unpublished work. This research was funded by a Fulbright Grant and a Jacob K. Javits Graduate Fellowship.

References Aho, E., forthcoming. Om prosodin i liljendaldialekten. In H. Palmén, C. Sandström & J-O. Östman (eds.), volume in series Dialektforskning. Helsinki: Nordica. Aho, E., ms. Sulvan prosodiasta. Department of Nordic Languages, University of Helsinki. Berg, A-C., 2002. Ordaccenten – finns den? En undersökning av Snappertunamålets ordaccent: produktionen, perceptionen och den akustiska länken. Pro gradu thesis, Åbo Akademi University. Bruce, G., 2003. Late pitch peaks in West Swedish. Proceedings of the 15th ICPhS, Barcelona, vol. 1, 245-248. Bruce, G., 2005. Word intonation and utterance intonation in varieties of Swedish. Paper presented at the Between Stress and Tone conference, Leiden. Bruce, G. & E. Gårding, 1978. A prosodic typology for Swedish dialects. In E. Gårding, G. Bruce & R. Bannerts (eds.), Nordic Prosody: Papers from a symposium. Lund University, Department of Linguistics, 219-228. Harling-Kranck, G., 1998. Från Pyttis till Nedervetil: tjugonio dialektprov från Nyland, Åboland, Åland och Österbotten. Helsinki: Svenska Litteratursällskapet i Finland. Kuronen, M. & K. Leinonen, 2001. Fonetiska skillnader mellan finlandssvenska och rikssvenska. In L. Jönsson, V. Adelswärd, A. Cederberg, P. Pettersson & C. Kelly (eds.), Förhandlingar vid Tjugofjärde sammankomsten för svenskans beskrivning. Linköping. Mixdorff, H., M. Väiniö, S. Werner & J. Järvikivi, 2002. The manifestation of linguistic information in prosodic features of Finnish. Proceedings of Speech Prosody 2002. Selenius, E., 1972. Västnyländsk ordaccent. Studier i nordisk filologi 59. Helsinki: Svenska Litteratursällskapet i Finland. Selenius, E., 1978. Studies in the development of the 2-accent system in Finland-Swedish. In E. Gårding, G. Bruce & R. Bannerts (eds.), Nordic Prosody: Papers from a symposium. Lund University, Department of Linguistics, 229-236. Svärd, N., 2001. Word accents in the Närpes dialect: Is there really only one accent? Working Papers 49. Lund University, Department of Linguistics, 160-163. Vendell, H., 1897. Ordaksenten i Raseborgs härads svenska folkmål. Öfversigt av finska vetenskapssocietetens förhandlingar, B. XXXIX. Helsinki.

Local Speaking Rate and Perceived Quantity: An Experiment with Italian Listeners

Diana Krull¹, Hartmut Traunmüller¹, and Pier Marco Bertinetto² ¹Department of Linguistics, Stockholm University, {diana.krull|hartmut}@ling.su.se ²Scuola Normale Superiore, Pisa, [email protected]

Abstract We have shown in earlier studies that the local speaking rate influences the perception of quantity in Estonian, Finnish and Norwegian listeners. In the present study, Italian listeners were presented the same stimuli. The results show that the languages differ not only in the relative position – preceding or following – of the units that have the strongest influence on the perception of the target segment, but seemingly also in the width of the reference frame.

1 Introduction Earlier investigations using Estonian, Finnish and Norwegian listeners have shown that local speaking rate affects listeners' perception of quantity (Krull, Traunmüller & van Dommelen, 2003; Traunmüller & Krull, 2003). The results were compatible with a model of speech perception where an "inner clock" handles variations in the speaking rate (Traunmüller, 1994). However, there were language dependent differences. The most substantial of these was the narrower reference frame of the Norwegians when compared to the Estonians and Finns. The Estonian quantity system is the most complex one. In a disyllabic word of the form C1V1C2V2 (such as the one used as stimulus) V1 and C2 are the carriers of the quantity distinction: V1 as well as C2, both singly and as a VC unit, can have three degrees of quantity: short, long and overlong. Seven of the nine possible combinations are actually used in Estonian phonology. C1 and V2 act as preceding and following context and serve as cues to the local speaking rate. Finnish has two quantity degrees, similar to Estonian short and overlong. In a C1V1C2V2 word all four possible V1C2 combinations are used. In Finnish, as in Estonian, the duration of V2 is inversely dependent on the quantity degree of the preceding units. In Norwegian, on the other hand, only V1 carries the quantity degree, while C2 is inversely dependent on the quantity of V1. There are only two phonologically different possibilities: short or long V1. In all three languages, it is a following unit of context that exerts the strongest secondary influence on the perception of the quantity degree. The question arises: is this generally valid also for other languages? Are there any other contextual factors that make a segment important for quantity perception, apart from relative position? The answer to these questions can perhaps be found by investigating Italian listeners' reaction to the same stimuli. In Italian, it is the duration of C2 that is considered the most decisive for the distinction between C1V1:C2V2 and C1V1C2:V2 – e.g. papa and pappa – while the duration of V1 is considered to be inversely related to the duration of C2 when the vowel is stressed (Bertinetto & Vivalda, 1978).

This paper addresses the question of whether and how the reaction of Italian listeners to the same stimuli differs from that of Estonians, Finns and Norwegians. Where will Italian listeners place the boundary between [t] and [t:]? Which part(s) of the context will influence the perception of C2?

2 Method The stimuli were obtained by manipulating the duration of selected segments of the Estonian word saate [sa:te] (‘you get’), read by a female Estonian speaker. The word was read both in isolation and preceded by ja (‘and’) or followed by ka (‘also’). The [a:] and the [t] were shortened or prolonged in proportionally equal steps as shown in Figure 1 (for the segment durations of the original utterance and other details, see Traunmüller & Krull, 2003). The durations of the [s] and the [e] were also manipulated up or down together with ja or ka when present. The arrangement of stimuli in series is shown in Figure 1. The selection of stimulus series was made according to which combinations could be possible in Italian. 20 students at the University of Pisa listened to the stimuli.

Figure 1. Duration of the segments [a] and [t] in the stimuli without [ja] and [ka]. There were three series of stimuli that differed in the duration of the [a], while the stimuli within each series differed in the duration of the [t].
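The paper does not say how the durational manipulation was carried out technically; one way to produce such stimuli is PSOLA-style resynthesis through Praat's Manipulation object, sketched here via the parselmouth Python interface. The file name, segment boundaries and scaling factor are placeholders only, not values from the study.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("saate.wav")                     # placeholder file name
manip = call(snd, "To Manipulation", 0.01, 75, 400)      # time step, pitch floor, pitch ceiling

# Scale the duration of one segment (a hypothetical [a:] from 0.12 s to 0.38 s)
# by a constant factor, leaving the rest of the utterance untouched.
factor, seg_start, seg_end, eps = 0.8, 0.12, 0.38, 0.001
dur = call("Create DurationTier", "scale", 0.0, snd.duration)
call(dur, "Add point", seg_start - eps, 1.0)
call(dur, "Add point", seg_start, factor)
call(dur, "Add point", seg_end, factor)
call(dur, "Add point", seg_end + eps, 1.0)

call([manip, dur], "Replace duration tier")
stimulus = call(manip, "Get resynthesis (overlap-add)")
stimulus.save("saate_a_shortened.wav", "WAV")
```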

3 Results and discussion Figure 2 shows the effect of changes in segment duration on the perception of quantity for Italian listeners. For comparison, results from earlier investigations with Estonian, Finnish and Norwegian listeners have been added. As could be expected, increasing the duration of the [t] had a strong positive effect on the perception of the sate-satte distinction, while increasing the durations of neighboring units practically always had an opposite effect. The strongest negative effect on the perception of the [t] duration for Italian listeners resulted from lengthening the immediately preceding [a]. The role of ja and ka was less obvious. Changing the duration of [jas] had a certain effect on the perception of [t] in ja saate, while the duration of [s] alone in saate and saate ka had no importance. Similarly, change in the duration of [eka] had an effect, but not that of [e] alone in ja saate and saate. This can be explained by the durational variability of an utterance-final vowel in Italian.

[Eight panels: Norwegian [a], Italian [t]; Finnish [a], Finnish [t]; Estonian [a], Estonian [t]; Estonian [a:], Estonian [t:]. Each panel plots probit regression coefficients (y-axis) for the segments s, a, t, e (x-axis).]
Figure 2. Weights of the durations of segments as contributors to the perceived quantity of [a] or [t], expressed in probit regression coefficients. Context: ja saate (black columns), saate (white columns), and saate ka (grey columns).
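Regression weights of the kind plotted in Figure 2 can be estimated by fitting a probit model of the binary quantity judgments on the segment durations. The sketch below uses statsmodels and invented data; it is meant only to make the analysis concrete, not to reproduce the study's actual model specification.

```python
import numpy as np
import statsmodels.api as sm

# Invented toy data: per-response durations (s) of the [a] and [t] in the stimulus,
# and the listener's judgment (1 = heard geminate "satte", 0 = "sate").
a_dur = np.array([0.12, 0.12, 0.12, 0.18, 0.18, 0.18, 0.24, 0.24, 0.24, 0.12, 0.18, 0.24])
t_dur = np.array([0.10, 0.16, 0.22, 0.10, 0.16, 0.22, 0.10, 0.16, 0.22, 0.16, 0.22, 0.10])
y     = np.array([0,    0,    1,    0,    1,    1,    0,    0,    1,    1,    0,    0])

X = sm.add_constant(np.column_stack([a_dur, t_dur]))
result = sm.Probit(y, X).fit(disp=False)
print(result.params)   # intercept, weight for [a], weight for [t]
# In the study, one weight per segment ([s], [a], [t], [e]) is estimated and plotted.
```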

The relatively strong effect of [jas] as compared with [s] alone may be difficult to explain, but the same tendency appeared among Finnish and Estonian subjects. The [ja] could not be interpreted as a separate word by Italian listeners. Therefore [jasa:te] was more likely to be interpreted as one word, stressed on the second syllable. In this case the first [a] stands out as longer than expected in an Italian word. In spite of that, changes in the duration of [jas] influenced the perception of [t]. That the [s] of saate ka had no negative effect at all may be due to its distance to the end of the utterance in relation to the length of the reference frame. The word-final [e] in ja saate and saate had practically no effect on the perception of [t], probably due to the fact that the duration of an utterance final vowel is highly variable in Italian. However, changes in the duration of [e] had a substantial effect when followed by [ka], in saate ka , which supports this assumption since in this case, there would not be so much free variation in the duration of the [e].


A comparison with the results of Estonian, Finnish and Norwegian listeners revealed several differences. For Finnish listeners, the negative effect of changes in the duration of [a] on the perceived quantity of [t] was not statistically significant. In Finnish, the [a] is itself a possible carrier of quantity distinction and is therefore not treated as 'neighboring context'. (A similar effect of [t] can be seen when [a] is the target). This is true also for Estonian. In the case of the distinction between short and long [t], Estonian listeners behaved very much like the Italians. However, when distinguishing between long and overlong, the lengthening of the preceding vowel had a positive effect on the perceived quantity of [t:]. The reason for this is the unacceptability of the combination of long vowel and overlong consonant in Estonian: [t:] can be perceived as overlong only when the preceding vowel is either short or overlong. The same effect is seen in the case where [a:] was the target. Comparing the results of Italian listeners' perception of [at] with that of the Norwegians' revealed symmetry in the response patterns. In Italian, the consonant is the target which carries the quantity distinction while the duration of the preceding vowel is inversely related to it. This durational compensation can only be observed under sentence stress (Bertinetto & Loporcaro, 2005). In Norwegian, it is the other way round: the vowel is the target and the duration of the following consonant inversely related to it. In the present case, the negative effect of [a] for the Italians and that of [t] for the Norwegians were of similar size. A comparable effect of an inverse duration relation can be noted in the responses of Estonian and – to a slightly weaker degree – Finnish listeners. Here it is the duration of the vowel in the following syllable that is inversely related to the duration of V1, C2 or V1C2. As a result, changes in the duration of [e] had a strong negative effect on the perceived quantity of [a] and/or [t]. The data clearly show that segments whose duration can vary due to linguistic or paralinguistic factors carry a lower weight (cf. the influence of [a] on [t] or vice versa in the two Fennic languages and utterance final [e] in Italian). To conclude, Italian listeners reacted generally in the same way as did Estonians, Finns and Norwegians: changing the duration of the target segment itself had a strong positive effect while changes in the durations of some neighboring segments had a weaker, negative effect. If segment durations are to be measured by an "inner clock" whose pace depends on the speech listened to, it is necessary to assume language specific reference windows. That of Norwegian listeners must, clearly, be assumed to be shorter than that of the Fennic listeners (Traunmüller & Krull, 2003). The length of the reference frame of Italian listeners is also shorter than that of the Fennic listeners, but the data seem to indicate that it is longer than that of the Norwegians. While the Italians' location of their reference frame is clearly different from that of the Norwegians if considered with respect to the target segment, the center of the reference frame appears to be located close to the [a]/[t]-boundary in representatives of all four languages.

References Bertinetto, P.M. & M. Loporcaro, 2005. The sound pattern of standard Italian, as compared with the varieties spoken in Florence, Milan and Rome. Journal of the IPA 35, 131-151. Bertinetto, P.M. & E. Vivalda, 1978. Recherches sur la perception des oppositions de quantité en italien. Journal of Italian Linguistics 3, 97-116. Krull, D., H. Traunmüller & W.A. van Dommelen, 2003. The effect of local speaking rate on perceived quantity: a comparison between three languages. Proceedings XVth ICPhS, Barcelona, 833-836. Traunmüller, H., 1994. Conventional, biological, and environmental factors in speech communication: A modulation theory. Phonetica 51, 170-183. Traunmüller, H. & D. Krull, 2003. The Effect of Local Speaking Rate on the Perception of Quantity in Estonian. Phonetica 60, 187-207.

A Case Study of /r/ in the Västgöta Dialect

Jonas Lindh Department of Linguistics, Göteborg University [email protected]

Abstract This paper concentrates on the study of five young male speakers of the Swedish Västgöta dialect. First, the classic phonological /r/ distribution between back and front /r/ was tested to see whether the old descriptions of the dialect were valid for this group. Second, the individual variation between the phonetic realizations was studied to see if it was possible to distinguish between the five speakers solely on the basis of their /r/ distribution. This was done by aural and spectrographic comparisons of /r/ in stressed and unstressed positions for each speaker. Three /r/ categories were identified. Two speakers seem to have a classical distribution of uvular /r/, two others use only the front version. The last speaker used the front variant except in one focused instance. These results lead to some speculations on changes occurring in the dialect. The speakers’ individual variation was studied by describing their /r/ realizations with phonological rules. This was done successfully and the five speakers were rather easily distinguishable solely on the basis of their /r/ productions.

1 Background and introduction 1.1 Hypotheses This pilot case study has two main hypotheses to investigate: 1. The classic descriptions or rules are not valid for this group of five young male speakers. 2. It is possible to separate five speakers phonologically based solely on their production of /r/ in stressed and unstressed positions. The first hypothesis concerns a possible dialectal change and is investigated by using diachronic recordings and comparing the use of /r/. The second hypothesis is addressed in a pilot case study investigating whether between-speaker variation for /r/, whether it reflects sociophonetic variation or dialectal change, is enough to separate or individualize five speakers of the same sex and age and with a similar dialectal background.

1.2 The phoneme /r/ The phoneme /r/ was chosen because of its reported intra- and interspeaker variance (Vieregge & Broeders, 1993). The phoneme has been subject to several studies for English, both concerning its phonology (Lindau, 1985) and acoustic properties (see Espy-Wilson & Boyce, 1993; 1999). The Swedish studies are mostly concentrated on dialectal area descriptions, such as Sjöstedt's (1936) early dissertation on the /r/-sounds in south Scandinavia and Elert's (1981) description of the geographical frontier of the back uvular [ʀ]. In a recent study, Muminovic & Engstrand (2001) found that approximant variants outnumbered fricatives and taps while trills were uncommon. Aurally, they identified four place categories and these were also separated acoustically except for back and retroflex /r/.


1.3 /r/ in the Västgöta dialect What is the Västgöta (or Göta) dialect? There are several variants. A quite common, but still rough, description is that the dialect contains four major variants: the Vadsbo, Skaraborg, Älvsborg (except for the Mark – Marbo and Kind – Kindbo) and Göta-Älv variants. In one major study, Götlind (1918) suggests around 450 different variants. However, there are several different features that connect them all. One of these dialect features is the distribution of the two /r/ allophones [r] and [ʀ], which both appear in different positions. The allophones [r] and [ʀ] are combinatory variants of the phoneme /r/ in the dialect. The general classic rules can be described using SPE notation (Chomsky & Halle, 1968), choosing [r] as the underlying representation:

Rule 1. /r/ → [R] / #_
The phoneme /r/ is pronounced uvular in morpheme-initial position.

Rule 2. /r/ → [R] / V _ V(ː)[+stress]
The phoneme /r/ is pronounced uvular in medial position, i.e. after an unstressed syllable and preceding a stressed vowel.

Rule 3. /r/ → [R] / V[+stress] _ {#, V}
The phoneme /r/ is pronounced uvular in final position after a short stressed vowel, or medially when followed by an unstressed vowel.
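Read procedurally, the three rules amount to a context check on each /r/. The toy function below is my own illustration of that logic; it operates on a deliberately simplified transcription (vowel/consonant status and stress only, ignoring vowel length in Rule 3) rather than on any format used in the paper.

```python
def classic_r_allophone(segments, i):
    """Return 'R' (uvular) or 'r' (alveolar) for the /r/ at position i.

    segments: a list of dicts with keys 'phoneme', 'is_vowel' and 'stressed' --
    a simplified stand-in for a real transcription.
    """
    prev = segments[i - 1] if i > 0 else None
    nxt = segments[i + 1] if i + 1 < len(segments) else None

    if prev is None:
        return 'R'                       # Rule 1: morpheme-initial
    if (prev['is_vowel'] and not prev['stressed']
            and nxt is not None and nxt['is_vowel'] and nxt['stressed']):
        return 'R'                       # Rule 2: between an unstressed and a stressed vowel
    if (prev['is_vowel'] and prev['stressed']
            and (nxt is None or (nxt['is_vowel'] and not nxt['stressed']))):
        return 'R'                       # Rule 3: after a stressed vowel, finally or before an unstressed vowel
    return 'r'

# Toy example: unstressed vowel + /r/ + stressed vowel -> uvular by Rule 2.
word = [{'phoneme': 'e', 'is_vowel': True, 'stressed': False},
        {'phoneme': 'r', 'is_vowel': False, 'stressed': False},
        {'phoneme': 'a', 'is_vowel': True, 'stressed': True}]
print(classic_r_allophone(word, 1))      # -> R
```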

Teleman (2005) hypothesizes about the development of the allophonic use being related to the geographical border for the use of ‘thick’ (retroflex flap) versus ‘normal’ /l/. However, Malmberg (1974) reports a similar allophonic use of /r/ in Puerto Rican Spanish and it is also used in Brazilian Portuguese which might give other indications (Torp, 2001). Other common features in the dialect, both grammatical and phonological, are not considered in this paper, but there are several (for examples see Norén et al., 1998).

2 Method First, older recordings (from 1950–1970) from the Swedish Institute for Dialectology, Onomastics and Folklore Research (the CD Västgötadialekter) were used as references to confirm the general/classic descriptions of /r/ distribution in the Västgöta variants. Five young male speakers (aged 20-30) from the Swedia dialect database were then analyzed. The recordings for the Swedia database were made with a portable DAT recorder and small portable microphones carried on the subjects' collars or similar. The situation was adjusted as far as possible towards an informal talk where the subjects told a story or memory. The mean length of each recording was approximately one minute. All instances of /r/ were extracted using the software Praat (Boersma & Weenink, 2005).


Table 1. Number of /r/ and variants per young male speaker and older reference recording used for diachronic comparison.

Speaker      N /r/ instances   N [r]   N [ʀ]   Older Reference Recording
Öxabäck      38                32      6       Torestorp
Östad        10                10      0       Humla & Lena
Torsö        17                15      2       Skara
Floby        29                29      0       Floby
Korsberga    35                34      1       Korsberga

Young males aged 20-30 were chosen as a group because they exist as such in the Swedia database and because they accounted for 62% of the convicted criminals in Sweden last year, and since the second investigation has forensic implications (Rose, 2002), this group was preferred as a pilot case group.

3 Results and discussion 3.1 Diachronic dialectal comparison for /r/ As can be seen in Table 1 above, the speakers from Östad and Floby consistently use the alveolar allophone, as no instances of uvular [ʀ] were found. The speakers from Öxabäck and Torsö follow the classical rule, using uvular [ʀ] word- (possibly morpheme-) initially. No instances of [ʀ] were found in other positions, though. For the speaker from Korsberga, only

one instance of [ʀ] was found. The instance was observed word-initially in a focused word. First of all, [ʀ] does not exist at all after short stressed vowels in the material. Secondly, only two speakers frequently use it word/morpheme-initially. That the uvular is disappearing is only a speculation because of the sparse data, and maybe this is an effect of the formal recording situation leading to sociophonetic variation. However, the distribution of /r/ is as follows, using broad phonological categories: Category 1. /r/ → [ʀ] / #_ : /r/ is pronounced with the uvular variant [ʀ] morpheme- (or word-) initially by the Öxabäck and Torsö speakers. Category 2. /r/ → [r]: /r/ is always pronounced with an alveolar variant [r] by the two speakers from Östad and Floby. Category 3. /r/ → [r], or possibly [ʀ] / #_ (+focus): the Korsberga speaker uses an alveolar variant, but has a uvular variant [ʀ] word-initially when /r/ occurs in a focused syllable.

3.2 The individual variation between the speakers The [ʀ] instances for the two speakers in category 1 above contain the same word, which makes for a natural starting point for comparison. The two speakers can then be separated, as the speaker from Öxabäck uses a fricative phone with a velar articulation while the speaker from Torsö uses a uvular trill [ʀ]. Comparing the two speakers in category 2, the alveolar version was naturally compared since there was no use of a uvular variant. By closer aural examination of the two speakers it was obvious that the speaker from Östad in 7 out of 10 cases used an alveolar trill [r]. In the three other cases the severely reduced sounds, in unstressed positions, were pronounced as

approximants. The speaker from Floby never produced a trill, but shifted between a tap [ɾ]

(in stressed position) and an approximant. As the Korsberga speaker was alone in his use of a uvular in focused position, there is no need to separate him further. His uvular variant is pronounced as a trill, though, while the alveolar variants are either tapped or approximantic.

4 Conclusions and future work The uvular [ʀ] is less used in the Västgöta dialect, at least in the sparse data used for this study. This might mean that it has merged with the already existing alveolar variant after short stressed vowels and is slowly disappearing word- (or morpheme-) initially as well. By aural and spectrographic examination leading to a narrow transcription and phonological rules, it was easy to separate the speakers. More research on how well a larger group can be separated using this method is recommended. Several aspects of interspeaker variation were left out because only a small amount of data was used. More acoustic measurements, such as spectral studies of /r/ for different speakers, should also be included in future work.

References Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (Version 4.3.27) [Computer program]. Retrieved October 7, 2005, from http://www.praat.org/. Brottsförebyggande Rådet. [www] Retrieved November 26, 2005, from http://www.bra.se/. Chomsky, N. & M. Halle, 1968. The sound pattern of English. New York: Harper & Row. Elert, C-C., 1981. Gränsen för det sydsvenska bakre r. Ljud och ord i svenskan 2. Stockholm: Almqvist & Wiksell International. Espy-Wilson, C.Y. & S. Boyce, 1993. Context independence of F3 trajectories in American English /r/'s. JASA 93, 2296 (A). Espy-Wilson, C.Y. & S. Boyce, 1999. A simple tube model for American English /r/. Proc. XIVth Int. Conf. Phon. Sci., San Francisco, 2137–2140. Götlind, J., 1918. Studier i västsvensk ordbildning. De produktiva avledningssuffixen och deras funktion hos substantiven i Göteve-målet. Stockholm. Lindau, M., 1985. The story of /r/. In V. Fromkin (ed.), Phonetic linguistics. Orlando: Academic Press. Malmberg, B., 1974. Spansk fonetik. Lund: Liber Läromedel. Muminovic, D. & O. Engstrand, 2001. /r/ in some Swedish dialects: preliminary observations. Working Papers 49. Dept. of Linguistics, Lund University. Norén, K., R. Gustafsson, B. Nilsson & L. Holgersson, 1998. Ord om orden i Västergötland. Axvall: Aron-förlaget. Rose, P., 2002. Forensic Speaker Identification. New York: Taylor & Francis. Sjöstedt, G., 1936. Studier över r-ljuden i sydskandinaviska mål. Dissertation, Lund University. Swedia Dialect Database. [www] Retrieved during September, 2005, from http://www.swedia.nu/. Swedish Institute for Dialectology, Onomastics and Folklore Research. Västgötadialekter [CD]. http://www.sofi.se. Teleman, U., 2005. Om r-fonemets historia i svenskan. Nordlund 25. Småskrifter från Institutionen för Nordiska språk, Lund. Torp, A., 2001. Retroflex consonants and dorsal /r/: mutually excluding innovations? On the diffusion of dorsal /r/ in Scandinavian. In van de Velde & van Hout, 75-90. Vieregge, W.H. & A.P.A. Broeders, 1993. Intra- and interspeaker variation of /r/ in Dutch. Proc. Eurospeech '93, vol. 1, 267–270.

Preliminary Descriptive F0-statistics for Young Male Speakers

Jonas Lindh Department of Linguistics, Göteborg University [email protected]

Abstract This paper presents preliminary descriptive statistics for the fundamental frequency of 109 young male speakers. The recordings were taken from the Swedia dialect database, with speakers from different geographical areas of Sweden. The material consisted of spontaneous speech ranging between seventeen seconds and approximately two minutes. F0 mean, median, baseline and standard deviation distributions in Hertz are described using histograms. It is suggested to use the median instead of the mean when measuring F0 in, for example, forensic cases, since it is more robust and not as affected by octave jumps.

1 Background and introduction 1.1 Why young male speakers? Young males aged 20-30 were chosen as a group because they exist as such in the Swedia database and because they accounted for 62% of the convicted criminals in Sweden last year, which was important due to the forensic implications of the descriptive statistics.

1.2 F0 and forensic phonetics The within-speaker variation in F0 is affected by an enormous number of factors. Braun (1995) categorizes them as technical, physiological and psychological factors. Tape speed, which surprisingly still is an issue for forensic samples, and sample size are examples of technical factors. Smoking and age are examples of physiological factors, while emotional state and background noise are examples of psychological factors. However, fundamental frequency has been shown to be a successful forensic phonetic parameter (Nolan, 1983). To be able to study differences it is suggested to use long-term distribution measures such as the arithmetical mean and standard deviation (Rose, 2002). The duration of the samples should be more than 60 seconds according to Nolan (1983), but Rose (1991) reports that F0 measurements for seven Chinese speakers stabilised much earlier, implying that the values may be language specific (Rose, 2002). Positive skewing of the F0 distribution is typical (Jassem et al., 1973) and an argument for considering a base value (Fb) for F0 (Traunmüller, 1994). This base value is also described here, together with the mean, median and standard deviation for the whole group. No more recent Swedish F0 statistics have been found than those of Kitzing (1979), who reports a mean of 110.3 Hz and a standard deviation of 3 semitones (in Traunmüller & Eriksson, 1995a) for 51 male speakers ranging between 21 and 70 years of age.

2 Method The software Praat (Boersma & Weenink, 2005) was used to collect F0 data from 109 young male speakers (20-30 years old). The recordings were taken from the Swedia database

and the durations of the recordings range from 17.4 to 116.8 seconds, with a mean duration of 52.3 seconds and a standard deviation of 15.2 seconds. The parameters extracted from the recordings were F0 mean, median, average baseline value (Fb), standard deviation, maximum and minimum, in Hz. The range for the F0 tracker was set to 75-350 Hz to be able to cover all possible frequency excursions but at the same time avoid octave jumps.
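The same measures can be extracted programmatically; the sketch below uses the parselmouth interface to Praat's pitch tracker with the 75-350 Hz range mentioned above. The file name is a placeholder, and the exact tracker settings of the original analysis are not known, so this is only an approximation of the procedure.

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("speaker_001.wav")                   # placeholder file name
pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=350.0)  # F0 range used in the paper

f0 = pitch.selected_array['frequency']
f0 = f0[f0 > 0]                                              # drop unvoiced frames

stats = {
    'mean_hz':   float(np.mean(f0)),
    'median_hz': float(np.median(f0)),
    'sd_hz':     float(np.std(f0, ddof=1)),
    'max_hz':    float(np.max(f0)),
    'min_hz':    float(np.min(f0)),
}
print(stats)
```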

3 Results and discussion 3.1 F0 means, medians and average baselines This section contains five histograms showing F0 distributions using mean, median, and baseline in Hz.


Figure 1. Histogram showing the distribution of F0 means for 109 young male speakers.

Approximately 65% of the speakers have a mean fundamental frequency between 100 and 130 Hz. The mean of the means is 120.8 Hz. There is a positive skewing (0.6) with five extreme outliers between 150 and 170 Hz. Since the automatic analysis had a tendency to make positive octave jumps, it is suggested to use the median, as it is more robust (see Figure 2 below).


Figure 2. Histogram showing the distribution of F0 medians for 109 young male speakers.

The median distribution still has a positive skewing (still 0.6), but the mean (of the medians) has moved down to 115.8 Hz. Approximately 68% of the speakers now have a median between 100 and 130 Hz.

For comparison, the average baselines according to Traunmüller (1994) were calculated (see Figure 3 below).


Figure 3. Histogram showing the average F0 baseline distribution for 109 young male speakers.

The baseline (F b) is seen as a carrier frequency in the modulation theory (Traunmüller, 1994). As there are no major changes in vocal effort, voice register, or emotions involved in this material, F b can be expected to be approximately 1.43 standard deviations below the average (Traunmüller & Eriksson, 1995b). The mean average baseline is 86.3 Hz, which corresponds quite well to Traunmüller & Eriksson’s (1995a) average per balanced speaker of European languages (93.4 Hz for male speakers). The values show a slight negative skewing (-0.35) and approximately 68% of the values range between 70-100 Hz.
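Applied per speaker, and assuming the relation is taken on the Hz scale as the text suggests, the baseline estimate is a one-line calculation; the numbers below are purely illustrative and not taken from the data.

```python
# Illustrative numbers only: a speaker with a mean F0 of 120 Hz and an SD of 24 Hz
# gets an estimated baseline of about 120 - 1.43 * 24 = 85.7 Hz.
f0_mean, f0_sd = 120.0, 24.0
f0_baseline = f0_mean - 1.43 * f0_sd   # Traunmüller & Eriksson (1995b) relation
print(round(f0_baseline, 1))           # 85.7
```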

3.2 F0 standard deviation Finally, the standard deviation distributions can be studied in Figures 4 and 5 below.

Figure 4. Histogram showing the F0 standard deviation distribution for 109 young male speakers.

A perceptually motivated measure of liveliness is the standard deviation expressed in semitones (Traunmüller & Eriksson, 1995b).


[Figure: histogram "Standard deviations in semitones for YM"; x-axis: semitones (1-6), y-axis: NSpeakers.]

Figure 5. Histogram showing the F0 standard deviation distribution in semitones for 109 young male speakers.

4 Conclusions and future work
The preliminary statistics in this paper give an overview of the distribution of fundamental frequency mean and standard deviation for young Swedish male speakers. The results suggest using the more robust median instead of the mean, since octave jumps influence the arithmetical mean. To better study between-speaker differences, distributions for individual speakers should be compared and studied using different measures.

References
Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (Version 4.3.27) [Computer program]. Retrieved October 7, 2005, from http://www.praat.org/.
Braun, A., 1995. Fundamental frequency – how speaker-specific is it? In Braun & Köster (eds.), 9-23.
Brottsförebyggande Rådet. [www] Retrieved November 26, 2005.
Jassem, W., S. Steffen-Batog & M. Czajka, 1973. Statistical characteristics of short-term average F0 distributions as personal voice features. In W. Jassem (ed.), Speech Analysis and Synthesis vol. 3. Warsaw: Polish Academy of Science, 209-25.
Kitzing, P., 1979. Glottografisk frekvensindikering: En undersökningsmetod för mätning av röstläge och röstomfång samt framställning av röstfrekvensdistributionen. Malmö: Lund University.
Nolan, F., 1983. The Phonetic Bases of Speaker Recognition. Cambridge: Cambridge University Press.
Rose, P., 1991. How effective are long term mean and standard deviation as normalisation parameters for tonal fundamental frequency? Speech Communication 10, 229-247.
Rose, P., 2002. Forensic Speaker Identification. New York: Taylor & Francis.
Swedia Dialect Database. [www] Retrieved during September 2005, from http://www.swedia.nu/.
Traunmüller, H., 1994. Conventional, biological, and environmental factors in speech communication: A modulation theory. Phonetica 51, 170-183.
Traunmüller, H. & A. Eriksson, 1995a. The frequency range of the voice fundamental in the speech of male and female adults. Unpublished manuscript (can be retrieved from http://www.ling.su.se/staff/hartmut/aktupub.htm).
Traunmüller, H. & A. Eriksson, 1995b. The perceptual evaluation of F0-excursions in speech as evidenced in liveliness estimations. J. Acoust. Soc. Am. 97, 1905-1915.

L1 Residue in L2 Use: A Preliminary Study of Quantity and Tense-lax

Robert McAllister, Miyoko Inoue, and Sofie Dahl Department of Linguistics, Stockholm University [email protected], [email protected], [email protected]

Abstract The main question addressed in this preliminary study is what traces of L1 have been transferred to L2 use. The focus is on the durational aspects of the tense-lax and quantity contrasts in English and Japanese. The results could be interpreted as support for the hypothesis that an L1 durational pattern rather than a specific phonetic feature is the object of transfer.

1 Introduction
As a rule, adults who learn a second language are not completely successful in learning to produce and perceive L2 speech. Much of the recent research on the acquisition of second language phonology and phonetics has been concerned with the question of the source of foreign accent. A primary issue in both past and current studies of second language (L2) speech acquisition is how and to what extent the first language (L1) influences the learning of the L2. The existence of common terms such as "French accent" has supported the importance of what has become known as "L1 transfer" as a major contributor to foreign accent, and numerous studies have been done to support the importance of this phenomenon. The aim of the present study is to contribute to the understanding of the role of native language (L1) phonetic and phonological features in L2 speech acquisition. While considerable research with this aim has contributed significantly to the understanding of the nature of the phenomenon, there are still some important unanswered questions to be addressed. Central among these is the question of what aspects of the perception and production of the L1 are actually transferred. One suggestion has been made by McAllister, Flege & Piske (2003). In the discussion of their results, the question was raised as to whether a specific phonetic feature such as duration, or an L1 durational pattern typical for the phonology of a particular L1, could be what is actually transferred. If this were the case, a durational pattern similar to that in the L1 may be recognized in the use of the L2 contrast.

1.1 The pattern of durational relationships in Swedish and Japanese quantity and the abstract feature of tense-lax in English
Traditionally, the primary phonetic difference underlying phonological quantity distinctions has been attributed to durational differences in the vowels and/or consonants, hence the "long-short" or "quantity" terminology. In Swedish there is a relatively complex interplay between temporal dimensions (i.e., the duration of a vowel and that of the following consonant) and spectral dimensions (i.e., formant values in the vowel). English is considered to have no quantity distinction. The tense-lax feature is considered to be a property of English phonology and is phonetically similar to some aspects of Swedish quantity. The phonetic characteristics of the Japanese quantity distinction appear to be in some respects similar to the Swedish distinction. The contrast is based on duration, and there are stable relationships between the long and short vowels and consonants in Japanese syllables. We are not able, in this short paper, to give even a partial view of the scholarly discussion of tense-lax and its relation to quantity. For an excellent review and discussion, see Schaeffler (2005). In this preliminary study we have taken the liberty to focus on the obvious, if somewhat unclear, relation between quantity and tense-lax. Our intent is to discover whether a residue of the Swedish quantity contrast might be found in the use of an L2 by native Swedes. Our hypothesis is that evidence of patterns characteristic of Swedish quantity can be seen in native Swedes' L2 use of the tense-lax feature in English and the quantity contrast in Japanese.

2 Method
2.1 Experimental subjects
For the English part of the study, 20 native speakers of standard Swedish were recruited. These were speech pathology students at Stockholm University who were asked to read a list of English sentences, each containing a sentence-final word with a tense or a lax vowel. As a control group, 8 native speakers of standard American English read sentences with the same tense and lax vowels as the native speakers of Swedish. The subjects for the Japanese part of the study consisted of 11 Swedish speakers (3 females and 8 males), ranging from beginner to advanced levels of Japanese, including 2 speakers who each had one native Japanese parent.

2.2 Speech material
For the English part of the study, the vowels in the tense-lax pairs /i:/ – /i/, /u:/ – /u/, and /e:/ – /e/ each occurred in three different monosyllabic words read by both the native speakers of American English and the native Swedes. All three occurrences of each vowel were placed in an identical or very similar phonetic environment (a voiceless stop). For the Japanese part of the study, the speech materials were two-syllable non-words which followed Japanese phonotactics. The stimulus words, written in Hiragana, were read 5 times. The words were also placed in the carrier sentence "Kinou _____ o kaimasita" (I bought ______ yesterday) and read three times each by the same informants. In this study we present only the results for the Japanese vowels /i:/, /i/, /u:/, /u/, /e:/ and /e/, to compare with the English part of the study.

3 Results and discussion
It should be pointed out at the outset that the results presented here are a preliminary version of this study. There are a number of additional measures that could be relevant to the question of what aspects of the L1 are transferred in L2 use. Previous research has shown that the V/C duration ratio is a robust and typical aspect of the Swedish quantity contrast, so we have decided to start with a presentation of this measure and to present more data at Fonetik 2006 in Lund.
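For concreteness, the V/C measure reported in Figures 1 and 2 below is simply the duration of the vowel divided by the duration of the following consonant, averaged per vowel category. The helper below is a hypothetical sketch (segment labels and durations are assumed to come from manual annotation, e.g. Praat TextGrids); it is not the authors' own analysis code.

```python
# Hypothetical sketch: mean V/C duration ratio per vowel category.
from collections import defaultdict

def mean_vc_ratios(tokens):
    """tokens: iterable of (vowel_label, vowel_duration_s, consonant_duration_s)."""
    sums = defaultdict(lambda: [0.0, 0])
    for vowel, v_dur, c_dur in tokens:
        sums[vowel][0] += v_dur / c_dur
        sums[vowel][1] += 1
    return {vowel: total / n for vowel, (total, n) in sums.items()}

# Example with invented durations (seconds):
print(mean_vc_ratios([("i:", 0.180, 0.120), ("i", 0.090, 0.150)]))
# {'i:': 1.5, 'i': 0.6}
```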

[Figure: bar chart "Swedes speaking English"; y-axis: V/C ratio (0-1.6), x-axis: vowels i:, i, u:, u, e:, e; series: swedL1, swedL2, engL1.]
Figure 1 shows the calculated V/C duration ratios for all the tense-lax vowel pairs in English. The three bars above the vowel symbols in the graph represent the V/C ratios for native Swedish (unfilled bar), Swedes speaking English (black bar) and the native English speakers.

The native Swedish data in Figure 1 are taken from Elert (1964). Although the group data may mask some of the potentially interesting individual behavior in the L2 users, they could reveal some broad tendencies that are relevant to our question as to whether or not a durational pattern is preserved in the English of the Swedish natives. Figure 1 indicates that while the native speakers of Swedish were not able to produce the durational aspects of English authentically, they were not using the patterns familiar from their L1 according to the Swedish norm either. In terms of the V/C ratio, the L2 users as a group appear to approach the English pattern, but their values are somewhere in between the Swedish and the English norms for all the vowels. This result is reminiscent of a VOT study by Flege and Eefting (1986), where the VOT values of native Spanish speakers speaking English were between those of native English and native Spanish. Those authors interpreted this result as equivalence classification, although this may imply a stricter adherence to the L1 pattern than can be seen in the results. The L2 ratios in Figure 2 are compared to the Swedish native standard and the L2 (Japanese) native standard as in Figure 1. The Swedish L2 users' realization of the Japanese V:C syllables appears to be similar to the results for the realization of the English tense vowels in VC sequences seen in Figure 1, although the realization of /i:/ is better, i.e. closer to the native Japanese values, than that of the other long vowels /u:/ and /e:/. In these cases the L2 users have not been able to produce authentic Japanese syllables. The long /u:/ shows a result similar to those for English in Figure 1: the native Swedes produce a syllable with a V/C ratio in between the standard Swedish and the standard Japanese values. The V:C sequence with /e:/, however, was produced in a way similar to the Swedish standard. An interesting aspect of Figure 2 is the realization of the short vowels in VC sequences: the native Japanese syllables and the native Swedish syllables are quite similar. The native Swedes' version of a VC syllable with a short vowel is, with respect to the durational relationships, similar to the

authentic Japanese syllables. In this case it would seem that the application of the duration rules for Swedish quantity could have yielded a rather good rendering of the Japanese contrast. These results indicate that, in the case of the realization of Japanese quantity, the transfer of at least some aspects of the Swedish quantity contrast pattern is part of the Swedes' strategy in learning Japanese quantity. The durational aspects of the English tense-lax contrast present a somewhat less clear picture of the transfer phenomenon. It looks like the Swedish natives are attempting to render the contrast but could be unsuccessful because of their tendency to continue to apply the L1 pattern in their L2 use.

[Figure: bar chart "Swedes speaking Japanese"; y-axis: V/C ratio (0-1.6), x-axis: vowels i:, i, u:, u, e:, e; series: swedL1, swedL2, jpnL1.]

Figure 2 shows the calculated V/C duration ratios for the short and long vowels in Japanese, averaged over both isolated words and words which occurred in a sentence.

Further work on this material can give us a clearer idea of what residue from the L1 there might be in the phonetic realization of an L2 contrast.

References
Elert, C-C., 1964. Phonologic Studies of Quantity in Swedish. Uppsala: Monografier utgivna av Stockholms kommunalförvaltning 27.
Flege, J. & W. Eefting, 1986. The production and perception of English stops by Spanish speakers of English. Journal of Phonetics 15, 67-83.
McAllister, R., J.L. Flege & T. Piske, 2003. The influence of L1 on the acquisition of Swedish quantity by native speakers of Spanish, English and Estonian. Journal of Phonetics 30, 229-258.
Schaeffler, F., 2005. Phonological Quantity in Swedish Dialects. PHONUM 10.

Cross-speaker Variations in Producing Attitudinally Varied Utterances in Japanese

Yasuko Nagano-Madsen 1 and Takako Ayusawa 2 1Department of Oriental and African Languages, Göteborg University [email protected] 2Department of Japan Studies, Akita International University [email protected]

Abstract Several acoustic phonetic parameters were analysed for six professional speakers of Japanese who produced attitudinally-varied utterances. The results showed both agreement and discrepancies among the speakers, implying that pragmatic information can be expressed in at least a few alternative ways in Japanese and that this line of research needs more attention.

1 Introduction
It is well known that pragmatic information can be encoded in a set of tunes (or pitch-accents, in more recent terminology) in a language like English, which has traditionally been called an intonational language. How such pragmatic information is conveyed in a tone or pitch-accent language, in which pitch shape is lexically determined, is much less clear. For Japanese, Maekawa & Kitagawa (2002) conducted pioneering research on the production and perception of paralinguistic phenomena. We have earlier reported the F0 shape characteristics to show how speakers choose pitch shapes and phrasing to convey pragmatic meanings in Japanese (Nagano-Madsen & Ayusawa, 2005). In this paper, we report other phonetic cues used by the same speakers. The attitudes tested are NEU(tral), DIS(appointment), SUS(picious), JOY, and Q(uestion). Three phonologically balanced short utterances were produced as a reply by six speakers – three male and three female. For details on data, speakers, and procedure, see Nagano-Madsen & Ayusawa (2005).

2 F0 characteristics
2.1 Pitch range
In order to make the cross-speaker comparison more meaningful, F0 features are calculated on a semitone scale rather than in absolute Hz values. The average pitch ranges for the female and male speakers were 13.9 and 14.3 semitones respectively. Table 1 shows the average pitch range in semitones for the six speakers and the five attitude types; the overall average pitch range increases in the ascending order DIS < NEU < SUS < Q < JOY.

Table 1. Cross-speaker variation in F0 range for attitude (F0 maxima minus (final) F0 minima, in semitones). Speakers U, V and W are female; X, Y and Z are male. The last column gives the value for all the speakers.
Attitude    U      V      W      X      Y      Z      All speakers
Q           11.4   14.7   14.5   16.0   15.8   17.2   15.0 (2.68)
SUS         12.0   19.2   17.7   12.0   13.4   14.7   14.9 (3.27)
JOY         15.5   16.3   20.1   13.2   17.1   15.5   16.3 (3.73)
DIS         10.8   10.8   10.4   11.3    9.5   12.3   10.9 (1.66)
NEU         12.5   13.1   12.4   14.5   15.6   16.4   14.1 (1.94)
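The semitone scale used in Table 1 is a logarithmic transform of the Hz values, so the pitch range is the log-ratio of the F0 maximum to the (final) F0 minimum. A minimal sketch of the conversion, with invented example values:

```python
# Pitch range in semitones between an F0 maximum and minimum (values in Hz).
import numpy as np

def pitch_range_semitones(f0_max_hz, f0_min_hz):
    return 12 * np.log2(f0_max_hz / f0_min_hz)

print(round(pitch_range_semitones(280, 120), 1))   # a 120->280 Hz excursion = 14.7 st
```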

2.2 Pitch range for initial rise and final fall
The pitch range was calculated for the initial F0 rise and final F0 fall (cf. Figure 1 below). A typical manifestation of the pitch range of initial F0 rise is in ascending order DIS

Figure 1 . Pitch range in semitones for the initial F0 rise (left) and for the F0 fall (right).

2.3 Pitch range for final rise (Q and SUS)
Question utterances in Japanese are typically accompanied by a terminal rising contour. In the present data, even SUS utterances had a regular terminal rise. However, the final F0 rises for Q and SUS were consistently differentiated in magnitude (cf. Figure 2). The average F0 rise for Q was 9.1 semitones (SD=3.04) while that for SUS was 12.9 semitones (SD=4.72). The magnitude clusters around 2-4 semitones for most speakers, but speakers U and W had more extreme values.

Figure 2 (left). Magnitude of utterance final F0 rise in semitone for Q and SUS. Figure 3 (right). F0 peak values for the six speakers. U, V, W are female speakers.

2.4 F0 peak value
Figure 3 shows the F0 peak values for different attitude types. Four out of six speakers had the same order in the F0 peak value, which in ascending order is DIS

2.5 F0 peak delay
The relevance of F0 peak delay, i.e. when the F0 peak does not occur on the syllable to which the phonological accent is affiliated, has been discussed for some time in relation to pragmatics. In the present data, F0 peak delays were common even for NEU (cf. Figure 4). All except one case (speaker X for NEU) had a peak delay, varying from one to six morae. All the speakers had the least peak delay for NEU, while the peak delay for the other attitude types varied considerably across speakers, with SUS showing the most agreement in delay. Since the diversity among the speakers is great, it seems that the F0 peak delay per se is not a reliable correlate of attitude type.

Figure 4 (left). The timing of F0 peak with the mora. When it is 0, the F0 peak is in the syllable (mora) to which the accent is phonologically affiliated and there is no delay. Figure 5 (right). Intensity peak measurements in dB.

3 Intensity peak
There was good cross-speaker agreement in how the intensity peak value varied with attitude type. The highest intensity peak value (average 75 dB) was found for JOY, while the lowest (average 68 dB) was found for DIS (cf. Figure 5 above). Intensity peaks in relation to the attitude types varied less across speakers. In contrast to the difference between JOY and DIS, the variation in the intensity peak value for the other types of attitudes is small (71-72 dB on average). However, speakers differ considerably in the magnitude of intensity peaks. Some speakers vary the intensity greatly across attitude types (speakers W and Z), while speaker U hardly varied it.

Intensity peaks and F0 peaks correlate to some extent, yet it is clear that the two parameters should be treated separately. Note that speakers W and U have very similar F0 peak values but different intensity peak values.

4 Duration (speaking rate) and pause
The average total utterance duration for the three utterances for each speaker is presented in Figure 6. Of the three utterances, the utterance /a-soo-desuka/ permits the insertion of a pause after the initial interjection /a/. When pause duration is included, this utterance shows the same durational pattern in reflecting the attitude types as the other two utterances, which have no pause. Therefore, we interpreted the pause as part of the durational manifestation and included it in the total utterance duration. The smallest cross-speaker variation was found for NEU, for which all except one speaker used the shortest duration, clustering around 600-800 ms. In absolute duration, speakers were also uniform for SUS, which falls in the range between 1000 and 1200 ms. The greatest cross-speaker variation was found for DIS, for which the duration of the utterance varied from 800 ms to 1250 ms.

Figure 6 (left). Average utterance duration for each attitude type. Figure 7 (right). Plotting of F1 and F2 for the vowel /a/ (speaker Z).

5 Vowel quality
Auditory impressions suggested considerable intra- and cross-speaker variation in the use of voice quality, both in general and in the specifically tested attitude types. Since the acoustic cues for voice quality are less straightforward than other acoustic cues, we only present the differences in vowel quality in this paper. Figure 7 above shows the manifestation of vowel quality by speaker Z. This speaker differentiated the vowel quality of /a/ in such a way that SUS and JOY had a more front quality than NEU, Q, and DIS. The figure also shows the formant values of /a/ in the nonsense word /mamamama/ spoken neutrally by the same speaker.

6 Summary and discussion
Together with our earlier report on F0 shape and phrasing (Nagano-Madsen & Ayusawa 2005), both agreement and discrepancies were observable among the six speakers in their manifestation of attitudes. It seems that pragmatic information can be expressed in at least a few alternative ways in Japanese and that this line of research needs more attention.

References
Maekawa, K. & N. Kitagawa, 2002. How does speech transmit paralinguistic information? (in Japanese). Cognitive Studies 9 (1), 46-66.
Nagano-Madsen, Y. & T. Ayusawa, 2005. Prosodic correlates of attitudinally-varied back channels in Japanese. Proceedings of FONETIK 2005, Department of Linguistics, Göteborg University, 103-106.

Emotion Recognition in Spontaneous Speech

Daniel Neiberg 1, Kjell Elenius 1, Inger Karlsson 1, and Kornel Laskowski 2 1Department of Speech, Music and Hearing, KTH, Stockholm {neiberg|kjell|inger}@speech.kth.se 2 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA [email protected]

Abstract
Automatic detection of emotions has been evaluated using standard Mel-frequency Cepstral Coefficients, MFCCs, and a variant, MFCC-low, that is calculated between 20 and 300 Hz in order to model pitch. Plain pitch features have been used as well. These acoustic features have all been modeled by Gaussian mixture models, GMMs, on the frame level. The method has been tested on two different corpora and languages: Swedish voice-controlled telephone services and English meetings. The results indicate that using GMMs on the frame level is a feasible technique for emotion classification. The two MFCC methods have similar performance, and MFCC-low outperforms the pitch features. Combining the three classifiers significantly improves performance.

1 Introduction
Recognition of emotions in speech is a complex task that is further complicated by the fact that there is no unambiguous answer to what the "correct" emotion is for a given speech sample (Scherer, 2003; Batliner et al., 2003). Emotion research can roughly be viewed as moving from the analysis of acted speech (Dellaert et al., 1996) to more "real" speech, e.g. from automated telephone services (Blouin & Maffiolo, 2005). The motivation for the latter is often to try to enhance the performance of such systems by identifying frustrated users. A difficulty with spontaneous emotions is their labeling, since the actual emotion of the speaker is almost impossible to know with certainty. Also, emotions occurring in spontaneous speech seem to be more difficult to recognize than those in acted speech (Batliner et al., 2003). In Oudeyer (2002), a set of 6 features selected from 200 is claimed to achieve good accuracy in a 2-person corpus of acted speech. This approach is adopted by several authors: they experiment with large numbers of features, usually at the utterance level, and then rank each feature in order to find a small golden set, optimal for the task at hand (Batliner et al., 1999). Classification results reported on spontaneous data are sparse in the literature. In Blouin & Maffiolo (2005), the corpus consists of recordings of interactions between users and an automatic voice service. The performance is reported to flatten out when 10 out of 60 features are used in a linear discriminant analysis (LDA) cross-validation test. In Chul & Narayanan (2005), data from a commercial call centre was used. As is frequently the case, the results for various acoustic features were only slightly better than a system classifying all exemplars as neutral. Often authors use hundreds of features per utterance, meaning that most spectral properties are covered. Thus, using spectral features, such as MFCCs, possibly with additional pitch measures, may be seen as an alternative. Delta MFCC measures on the utterance level have been used earlier, e.g. in Oudeyer (2002). However, we have chosen to model the distribution of the MFCC parameters on the frame level in order to obtain a more detailed description of the speech signal.

In spontaneous speech the occurrence of canonical emotions such as happiness and anger is typically low. The distribution of classes is highly unbalanced, making it difficult to measure and compare performance reported by different authors. The difference between knowing and not knowing the class distribution will significantly affect the results. Therefore we will include results from both types of classifiers.

2 Material
The first material used was recorded at 8 kHz at the Swedish company Voice Provider (VP), which runs more than 50 different voice-controlled telephone services. Most utterances are neutral (non-expressive), but some percent are frustrated, most often due to misrecognitions by the speech recognizer (Table 1). The utterances are labeled by an experienced, senior voice researcher into neutral, emphasized or negative (frustrated) speech. A subset of the material was labeled by 5 different persons, and the pair-wise inter-labeler kappa was 0.75-0.80. In addition to the VP data, we apply our approach to meeting recordings. The ISL Meeting Corpus consists of 18 meetings, with an average number of 5.1 participants per meeting and an average duration of 35 minutes. The audio is of 16 bit, 16 kHz quality, recorded with lapel microphones. It is accompanied by orthographic transcription and annotation of emotional valence (negative, neutral, positive) at the speaker contribution level (Laskowski & Burger, 2006). The emotion labels were constructed by majority voting (2 of 3) for each segment. Split decisions (one vote for each class) were removed. Finally, the development set was split into two subsets that were used for cross-wise training and testing. Both corpora were split into a development and an evaluation set, as shown in Table 1.

Table 1. Materials used.
VP development set:   Neutral 3865 (94%), Emphatic 94 (2%), Negative 171 (4%); Total 4130
VP evaluation set:    Neutral 3259 (93%), Emphatic 66 (2%), Negative 164 (5%); Total 3489
ISL development set:  Neutral 6312 (80%), Negative 273 (3%), Positive 1229 (16%); Total 7813
ISL evaluation set:   Neutral 3259 (70%), Negative 151 (3%), Positive 844 (19%); Total 4666

3 Features
Thirteen standard MFCC parameters were extracted from 24 Mel-scaled logarithmic filters spanning 300 to 3400 Hz, after which RASTA processing was applied (Hermansky & Morgan, 1994). Delta and delta-delta features were added, resulting in a 39-dimensional vector. For the ISL material we used 26 filters from 300 to 8000 Hz; otherwise the processing was identical. MFCC-low features were computed similarly to the standard MFCCs, but with the filters ranging from 20 to 300 Hz. We expected these MFCCs to model F0 variations. Pitch was extracted using the Average Magnitude Difference Function (Ross et al., 1974), as reported by Langlais (1995). We used a logarithmic scale, subtracting the utterance mean, and delta features were added.
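The two spectral streams can be approximated with standard tools. The sketch below uses librosa and follows the filter ranges given above for the telephone material, but it omits the RASTA filtering and the exact filterbank implementation, so it is an approximation rather than the authors' feature extractor.

```python
# Sketch: 13 MFCCs + deltas + delta-deltas (39 dims) for the standard and
# low-frequency (20-300 Hz) variants; RASTA processing is omitted here.
import numpy as np
import librosa

def mfcc_stack(y, sr, fmin, fmax, n_mels=24):
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=n_mels,
                             fmin=fmin, fmax=fmax)
    return np.vstack([m, librosa.feature.delta(m), librosa.feature.delta(m, order=2)])

def extract_features(y, sr=8000):
    mfcc_standard = mfcc_stack(y, sr, fmin=300, fmax=3400)   # telephone band
    mfcc_low = mfcc_stack(y, sr, fmin=20, fmax=300)          # intended to capture F0
    return mfcc_standard, mfcc_low
```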

4 Classifiers
All acoustic features are modeled using Gaussian mixture models (GMMs) with diagonal covariance matrices, measured over all frames of an utterance. First, using all the training data, a root GMM is trained with the Expectation Maximization (EM) algorithm with a maximum likelihood criterion, and then one GMM per class is adapted from the root model using the maximum a posteriori criterion (Gauvin & Lee, 1994). We use 512 Gaussians for the MFCCs and 64 Gaussians for the pitch features. These numbers were empirically optimized. This way of using GMMs has proved successful for speaker verification (Reynolds et al., 2000). The outputs from the three classifiers were combined using multiple linear regression, with the final class selected as the argmax over the per-class least square estimators. The transform matrix was estimated from the training data.
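A simplified sketch of this training scheme is given below, using scikit-learn for the root GMM and adapting only the component means (a common simplification of the full MAP adaptation in Gauvin & Lee, 1994, and Reynolds et al., 2000). The relevance factor, component counts and scoring are illustrative choices rather than the authors' settings, and the regression-based fusion step is omitted.

```python
# Sketch: root GMM + MAP-adapted (means-only) class models, frame-level scoring.
import numpy as np
from copy import deepcopy
from sklearn.mixture import GaussianMixture

def train_root(frames, n_components=512):
    # frames: (N, 39) array of MFCC(+delta, +delta-delta) vectors pooled over all data
    root = GaussianMixture(n_components=n_components, covariance_type="diag",
                           max_iter=50, random_state=0)
    return root.fit(frames)

def adapt_class_model(root, class_frames, relevance=16.0):
    gamma = root.predict_proba(class_frames)        # (N, K) component responsibilities
    n_k = gamma.sum(axis=0) + 1e-10                 # soft frame counts per component
    e_k = gamma.T @ class_frames / n_k[:, None]     # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]
    model = deepcopy(root)                          # weights/covariances kept from root
    model.means_ = alpha * e_k + (1.0 - alpha) * root.means_
    return model

def classify(utt_frames, class_models):
    # Decision = class whose model gives the highest average frame log-likelihood
    return max(class_models, key=lambda label: class_models[label].score(utt_frames))
```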

5 Experiments
We ran our experiments with the features and classifiers described above. An acoustic combination was composed of the GMMs for MFCC, MFCC-low, and pitch. The combination matrix was estimated by first testing the respective GMMs with their training data.

6 Results
Performance is measured as absolute accuracy, average recall (for all classes) and f1, computed from the average precision and recall for each classifier. The results are compared to two naïve classifiers: a random classifier that classifies everything with equal class priors, random with equal priors, and a random classifier knowing the true prior distribution over classes in the training data, random using priors. The combination matrix accounts for the prior distribution in the training data, heavily favoring the neutral class. Therefore a weight vector which forces the matrix to normalize to equal prior distribution was also used. Thus we report two more results: acoustic combination with equal priors, that is optimized for the accuracy measure, and acoustic combination using priors, which optimizes the average recall rate. Thus, classifiers under the random equal priors heading do not know the a priori class distribution and should only be compared to each other. The same holds for the classifiers under random using priors. Note that the performance difference in percentages is higher for a classifier not knowing the prior distribution, compared to its random classifier, than for the same classifier knowing the prior distribution, compared to its random classifier. This is due to the skewed prior distributions.

Table 2. Results: Accuracy (Acc.), Average Recall (A.Rec.), f1.
VP: Neutral vs. Emphasis vs. Negative
Classifier                     Acc.   A.Rec.  f1
Random with equal priors       0.33   0.33    0.33
MFCC                           0.80   0.43    0.40
MFCC-low                       0.78   0.39    0.37
Pitch                          0.56   0.40    0.38
Acoustic combination           0.90   0.37    0.39
Random using priors            0.88   0.33    0.33
Acoustic comb. using priors    0.93   0.34    0.38
ISL: Negative vs. Neutral vs. Positive
Classifier                     Acc.   A.Rec.  f1
Random with equal priors       0.33   0.33    0.33
MFCC                           0.66   0.49    0.47
MFCC-low                       0.66   0.46    0.44
Pitch                          0.41   0.38    0.37
Acoustic combination           0.79   0.50    0.47
Random using priors            0.67   0.33    0.33
Acoustic comb. using priors    0.82   0.42    0.48

From Table 2 we note that all classifiers with equal priors perform substantially better than the random classifier. The MFCC-low classifier is almost as good as the standard MFCC and considerably better than the pitch classifier. Regarding the ISL results in Table 2, we again notice that the pitch feature does not perform on the same level as the MFCC features. When the distribution of errors for the individual classes was examined, it revealed that most classifiers were good at recognizing the neutral and positive class, but not the negative one, most probably due to its low frequency resulting in poor training statistics.
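For reference, the three measures can be reproduced from a set of predictions as follows. This is a generic scikit-learn sketch of the definitions given above (accuracy, macro-averaged recall, and an f1 formed from the macro-averaged precision and recall), not the authors' evaluation code.

```python
# Sketch: accuracy, average (macro) recall, and f1 from macro precision/recall.
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
    rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
    return acc, rec, f1
```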

7 Conclusion
Automatic detection of emotions has been evaluated using spectral and pitch features, all modeled by GMMs on the frame level. Two corpora were used: telephone services and meetings. The results show that frame-level GMMs are useful for emotion classification. The two MFCC methods show similar performance, and MFCC-low outperforms the pitch features. A reason may be that MFCC-low gives a more stable pitch measure. It may also be due to its ability to capture voice source characteristics, see Syrdal (1996), where the level difference between the first and the second harmonic is shown to distinguish between phonations, which in turn may vary across emotions. The diverse results for the two corpora are not surprising considering their discrepancies. A possible way to improve performance for the VP corpus would be to perform emotion detection on the dialogue level rather than the utterance level, and also take the lexical content into account. This would mimic the behavior of the human labeler. Above we have indicated the difficulty of comparing emotion recognition results. However, it seems that our results are at least on par with those in Blouin & Maffiolo (2005).

Acknowledgements This work was performed within CHIL, Computers in the Human Interaction Loop, an EU 6th Framework IP (506909). We thank Voice Provider for providing speech material.

References
Batliner, A., J. Buckow, R. Huber, V. Warnke, E. Nöth & H. Niemann, 1999. Prosodic Feature Evaluation: Brute Force or Well Designed? Proc. 14th ICPhS, 2315-2318.
Batliner, A., K. Fischer, R. Huber, J. Spilkera & E. Nöth, 2003. How to find trouble in communication. Speech Communication 40, 117-143.
Blouin, C. & V. Maffiolo, 2005. A study on the automatic detection and characterization of emotion in a voice service context. Proc. Interspeech, Lisbon, 469-472.
Chul, M.L. & S. Narayanan, 2005. Toward Detecting Emotions in Spoken Dialogs. IEEE Transactions on Speech and Audio Processing 13 (2), 293-303.
Dellaert, F., T.S. Polzin & A. Waibel, 1996. Recognizing emotion in speech. Proc. ICSLP, Philadelphia, 3:1970-1973.
Gauvin, J-L. & C.H. Lee, 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. SAP 2, 291-298.
Hermansky, H. & N. Morgan, 1994. RASTA processing of speech. IEEE Trans. SAP 4, 578-589.
Langlais, P., 1995. Traitement de la prosodie en reconnaissance automatique de la parole. PhD thesis, University of Avignon.
Laskowski, K. & S. Burger, 2006. Annotation and Analysis of Emotionally Relevant Behavior in the ISL Meeting Corpus. LREC, Genoa.
Oudeyer, P., 2002. Novel Useful Features and Algorithms for the Recognition of Emotions in Human Speech. Proc. of the 1st Int. Conf. on Speech Prosody.
Reynolds, D., T. Quatieri & R. Dunn, 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10, 19-41.
Ross, M., H. Shafer, A. Cohen, R. Freudberg & H. Manley, 1974. Average magnitude difference function pitch extraction. IEEE Trans. ASSP-22, 353-362.
Scherer, K.R., 2003. Vocal communication of emotion: A review of research paradigms. Speech Communication 40, 227-256.
Syrdal, A.K., 1996. Acoustic variability in spontaneous conversational speech of American English talkers. Proc. ICSLP, Philadelphia.

Data-driven Formant Synthesis of Speaker Age

Susanne Schötz Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University [email protected]

Abstract
This paper briefly describes the development of a research tool for analysis of speaker age using data-driven formant synthesis. A prototype system was developed to automatically extract 23 acoustic parameters from the Swedish word 'själen' (the soul), spoken by four differently aged female speakers of the same dialect and family, and to generate synthetic copies. Functions for parameter adjustment as well as audio-visual comparison of the natural and synthesised words using waveforms and spectrograms were added to improve the synthesised words. Age-weighted linear parameter interpolation was then used to synthesise a target age anywhere between the ages of two source speakers. After an initial evaluation, the system was further improved and extended. A second evaluation indicated that speaker age may be successfully synthesised using data-driven formant synthesis and weighted linear interpolation.

1 Introduction
In speech synthesis applications like spoken dialogue systems and voice prostheses, the need for voice variation in terms of age, emotion and other speaker-specific qualities is growing. To contribute to the research in this area, as part of a larger study aiming at identifying phonetic age cues, a system for analysis by synthesis of speaker age was developed using data-driven formant synthesis. This paper briefly describes the development process and results. Research has shown that acoustic cues to speaker age can be found in almost every phonetic dimension, i.e. in F0, duration, intensity, resonance, and voice quality (Hollien, 1987; Jacques & Rastatter, 1990; Linville, 2001; Xue & Deliyski, 2001). However, the relative importance of the different cues has still not been fully explored. One reason for this may be the lack of an adequate analysis tool in which a large number of potential age parameters can be varied systematically and studied in detail. Formant synthesis generates speech from a set of rules and acoustic parameters, and is considered both robust and flexible. Still, the more natural-sounding concatenation synthesis is generally preferred over formant synthesis (Narayanan & Alwan, 2004). Lately, formant synthesis has made a comeback in speech research, e.g. in data-driven and hybrid synthesis with improved naturalness (Carlson et al., 2002; Öhlin & Carlson, 2004).

2 Material
Four female non-smoking native Swedish speakers of the same family and dialect were selected to represent different ages and were recorded twice over a period of 3 years: Speaker 1: girl (aged 6 and 9), Speaker 2: mother (aged 36 and 39), Speaker 3: grandmother (aged 66 and 69), and Speaker 4: great grandmother (aged 91 and 94). The isolated word 'själen' (the soul) was selected as a first test word, and the recordings were segmented into phonemes, resampled to 16 kHz, and normalized for intensity.

3 Method and procedure
The prototype system was developed in several steps (see Figure 1). First, a Praat (Boersma & Weenink, 2005) script extracted 23 acoustic parameters every 10 ms. These were then used as input to the formant synthesiser GLOVE, which is an extension of OVE III (Liljencrants, 1968) with an expanded LF voice source model (Fant et al., 1985). GLOVE was used by kind permission of CTT, KTH. For a more detailed description, see Carlson et al. (1991).

[Figure: system diagram - input natural speech, automatic parameter extraction, formant synthesiser, output synthetic speech; with parameter adjustment and audio-visual comparison with the natural speech and the previously synthesised version.]
Figure 1. Schematic overview of the prototype system.

Next, the parameters were adjusted to generate more natural-sounding synthesis. To be able to compare the natural speech to the synthetic versions, another Praat script was developed, which first called the parameter extraction script and then displayed waveforms and spectrograms of the original word, the resulting synthetic word, as well as the previous synthetic version. By auditive and visual comparison of the three files, the user could easily determine whether a newly added parameter or adjustment had improved the synthesis. If an adjustment improved the synthesis, it was added to the adjustment rules. Formants, amplitudes and voice source parameters (except F0) caused the most serious problems, which were first solved using fixed values, then by parameter smoothing.

[Figure: interpolation diagram - the input target age selects two source speaker parameter files; age weights are calculated, durations and parameters are interpolated for each segment and parameter, and the resulting target-age parameter file is sent to the formant synthesiser.]
Figure 2. Schematic overview of the age interpolation method.

An attempt to synthesise speaker age was carried out using the system. The basic idea was to use the synthetic versions of the words to generate new words of other ages by age-weighted linear interpolation between two source parameter files. A Java program was developed to calculate the weights and to perform the interpolations. For each target age provided as input by the user, the program selects the parameter files of two source speakers (the older and younger speakers closest in age to the target age), and generates a new parameter file from the interpolations between the two source parameter files. For instance, for the target age of 51, i.e. exactly half-way between the ages of Speaker 2 (aged 36) and Speaker 3 (aged 66), the program selects these two speakers as source speakers, and then calculates the age weights to 0.5 for both source speakers. Next, the program calculates the target duration for each phoneme segment using the age weights and the source speaker durations. If the duration of a particular segment is 100 ms for source Speaker 1, and 200 ms for source Speaker 2, the target duration for the interpolation is 200 x 0.5 + 100 x 0.5 = 150 ms. All parameter values are then interpolated in the same way. Finally, the target parameter file is synthesised using GLOVE, and displayed (waveform and spectrogram) in Praat along with the two input synthetic words for comparison. A schematic overview of the procedure is shown in Figure 2.

4 Results
To evaluate the system's performance, two perception tests were carried out to estimate direct age and naturalness (on a 7-point scale, where 1 is very unnatural and 7 is very natural). Stimuli in the first evaluation consisted of natural and synthetic versions of the 6, 36, 66 and 91 year old speakers. The second evaluation was carried out at a later stage, when the 9, 39, 69 and 94 year olds had been included, and when parameter smoothing and pre-emphasis filtering (to avoid muffled quality) had improved the synthesis. 31 students participated in the first evaluation test, which also included interpolations for 8 decades (10 to 80 years), while 21 students took part in the second, which comprised interpolations for 7 decades (10 to 70 years).
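A Python sketch of the age-weighted interpolation described above (cf. Figure 2) is given below; the original tool was written in Java. It assumes that the two source parameter tracks for a segment have already been time-aligned to the same number of frames, and the names and dictionary layout are illustrative.

```python
# Sketch: age-weighted linear interpolation of durations and parameter tracks.
import numpy as np

def age_weights(target_age, younger_age, older_age):
    w_older = (target_age - younger_age) / (older_age - younger_age)
    return 1.0 - w_older, w_older                     # (weight_younger, weight_older)

def interpolate_segment(dur_young, dur_old, params_young, params_old,
                        target_age, younger_age, older_age):
    w_y, w_o = age_weights(target_age, younger_age, older_age)
    target_dur = w_y * dur_young + w_o * dur_old
    target_params = {name: w_y * np.asarray(params_young[name])
                           + w_o * np.asarray(params_old[name])
                     for name in params_young}
    return target_dur, target_params

# Target age 51 between source speakers aged 36 and 66 gives weights (0.5, 0.5),
# so segment durations of 100 ms and 200 ms interpolate to 150 ms, as in the text.
```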

4.1 First evaluation
In the first evaluation, the correlation curves between chronological age (CA, or simulated "CA" for the synthetic words) and perceived age (PA) displayed some similarity for the natural and synthetic words, though the synthetic ones were judged older in most cases, as seen in Figure 3. The interpolations were mostly judged as much older than both the natural and synthetic words. As for naturalness, the natural words were always judged more natural than the synthetic ones. Both the natural and synthetic 6 year old versions were judged least natural.

[Figure: left panel - perceived age versus "CA" (years, 0-100) for natural (Nat), synthetic (Syn) and interpolated (Int) words; right panel - median naturalness (1-7) for natural and synthetic words at stimulus ages 6, 36, 66 and 91.]
Figure 3. Correlation between PA and CA for natural, synthetic and interpolated words (left), and median perceived naturalness for natural and synthetic words in the first evaluation.

4.2 Second evaluation
Figure 4 shows that the correlation curves for the natural and synthetic words, and also those for the interpolations, improved in similarity in the second evaluation compared to the first one. However, the natural and synthetic versions of the 39, 66 and 69 year olds were quite underestimated. All natural words were judged as more natural than the synthetic ones, and all synthetic words except the 6 and 94 year old versions achieved a median naturalness value of 6.

[Figure: left panel - perceived age versus "CA" (years, 0-100) for natural (Nat), synthetic (Syn) and interpolated (Int) words; right panel - median naturalness (1-7) for natural and synthetic words at stimulus ages 6, 9, 36, 39, 66, 69, 91 and 94.]
Figure 4. Correlation between PA and CA for natural, synthetic and interpolated words (left), and median perceived naturalness for natural and synthetic words in the second evaluation.

5 Discussion and future work
The synthetic words obtained a reasonable resemblance to the natural words in most cases, and the similarity in age was improved in the second evaluation. The interpolated versions were often judged as older than the intended age in the first evaluation, but in the second evaluation they had become more similar in age to the natural and synthetic versions, indicating that speaker age may be synthesised using data-driven formant synthesis. Still, some of the age estimations were quite unexpected. For instance, the 39, 66 and 69 year olds were judged as much younger than their CA. This may be explained by these voices being atypical for their age. One very important point in this study is that synthesis of age by linear interpolation is indeed a crude simplification of the human aging process, which is far from linear. Moreover, while some parameters may change considerably during a certain period of aging (e.g. F0 and formant frequencies during puberty), others remain constant. Better interpolation techniques will have to be tested. One should also bear in mind that the system is likely to interpolate not only between two ages, but also between a number of individual characteristics, even when the speakers are closely related. Future work involves (1) improved parameter extraction for formants, (2) better interpolation algorithms, and (3) expansion of the system to handle more speakers (of both sexes), as well as a larger and more varied speech material. Further research with a larger material is needed to identify and rank the most important age-related parameters. If further developed, the prototype system may well be used in future studies for analysis, modelling and synthesis of speaker age and other speaker-specific qualities, including dialect and attitude. The phonetic knowledge gained from such experiments may then be used in future speech synthesis applications to generate more natural-sounding synthetic speech.

References
Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (version 4.3.04) [computer program]. Retrieved March 8, 2005, from http://www.praat.org/.
Carlson, R., B. Granström & I. Karlsson, 1991. Experiments with voice modelling in speech synthesis. Speech Communication 10, 481-489.
Carlson, R., T. Sigvardson & A. Sjölander, 2002. Data-driven formant synthesis. Proceedings of Fonetik 2002, TMH-QPSR, 121-124.
Fant, G., J. Liljencrants & Q. Lin, 1985. A four-parameter model of glottal flow. STL-QPSR 4, 1-13.
Hollien, H., 1987. Old voices: What do we really know about them? Journal of Voice 1, 2-13.
Jacques, R. & M. Rastatter, 1990. Recognition of speaker age from selected acoustic features as perceived by normal young and older listeners. Folia Phoniatrica (Basel) 42, 118-124.
Liljencrants, J., 1968. The OVE III speech synthesizer. IEEE Trans AU-16 (1), 137-140.
Linville, S.E., 2001. Vocal Aging. San Diego: Singular Thomson Learning.
Narayanan, S. & A. Alwan (eds.), 2004. Text to Speech Synthesis: New Paradigms and Advances. Prentice Hall PTR, IMSC Press Multimedia Series.
Öhlin, D. & R. Carlson, 2004. Data-driven formant synthesis. Proceedings of Fonetik 2004, Dept. of Linguistics, Stockholm University, 160-163.
Xue, S.A. & D. Deliyski, 2001. Effects of aging on selected acoustic voice parameters: Preliminary normative data and educational implications. Educational Gerontology 21, 159-168.

How do we Speak to Foreigners? – Phonetic Analyses of Speech Communication between L1 and L2 Speakers of Norwegian

Rein Ove Sikveland Department of Language and Communication Studies, The Norwegian University of Science and Technology (NTNU), Trondheim [email protected]

Abstract
The major goal of this study was to investigate which phonetic strategies we may actually use when speaking to L2 speakers of our mother tongue (L1). The results showed that speech rate in general was slower and that the vowel formants were closer to target values, in L2 directed speech compared to L1 directed speech in Norwegian. These properties of L2 directed speech correspond to previous findings for clear speech (e.g. Picheny et al., 1986; Krause & Braida, 2004). The results also suggest that level of experience may influence L2 directed speech; teachers of Norwegian as a second language slowed down the speech rate more than the non-teachers did, in L2 directed speech compared to L1 directed speech.

1 Introduction
When speaking to foreigners in our mother tongue (L1), it might be natural to speak more clearly than normal to make ourselves understood, which implies the use of certain phonetic strategies when speaking to these second language learners (L2 speakers). Previous findings by Picheny et al. (1986) and Krause & Braida (2004) have shown that clear speech can be characterized by a decrease in speech rate, more pauses, relatively more energy in the frequency region of 1-3 kHz, fewer phonological reductions (e.g. fewer burst eliminations), vowel formants closer to target values, longer VOT and a greater F0 span, compared to conversational speech. What characterizes L2 directed speech has not been the subject of any previous investigations, but one might assume that the strategies in L2 directed speech correspond to the findings for clear speech. This has been investigated in the present study, and the results for speech rate and vowel formants are presented here.

2 Method
To be able to compare speech in L1 and L2 contexts directly, the experiment was carried out by recording native speakers of Norwegian 1) in dialogue with L2 speakers, and 2) in dialogue with other L1 speakers. The dialogue setting was based on a keyword manuscript, in order to facilitate natural speech while still allowing comparison of phonetic parameters in identical words and phonological contexts.

2.1 Subjects
Six native speakers of Norwegian (with eastern Norwegian dialect background) participated as informants. Three of them were teachers in Norwegian as a second language, called P informants (P for "professional"), and three of them were non-teachers, called NP informants (NP for "non-professional"). Six other L1 speakers and six L2 speakers of Norwegian participated as opponents to match each informant in the L1 and L2 contexts. Thus there were 18 subjects participating in the experiment, distributed across twelve recordings.

2.2 Procedure
Recordings were made by placing each informant in a studio while the dialogue opponents were placed in the control room. They communicated through microphones and headphones. The dialogue setting, though not the sound quality, was meant to represent a phone conversation between two former roommates/partners, and the role of the informants was to suggest to the opponent how to distribute their former possessions, written on a list in the manuscript. There were no lines written in the manuscript, only suggestions of how questions might be asked. The participants were told to carry out the dialogue naturally, but they were not told to speak in any specific manner (e.g. "clearly" or "conversationally"). The speech analyses of the recordings were made using spectrograms, spectra and waveforms in the software Praat. Only words from the list of possessions were used for the analyses, and the corresponding words/syllables/phonemes were measured for each informant in the L1 and L2 contexts.

3 Results
3.1 Speech rate
Speech rate was investigated by measuring syllable duration and the number of phonemes per second in ten words for each informant in the L1 and L2 contexts (altogether 120 words). The measured words contained four syllables or more. The results showed that syllable duration was longer, and that the number of phonemes per second was lower, in L2 context compared to L1 context. Pooled across informants, the average duration of syllables is 221 ms in L1 context and 239 ms in L2 context. This difference is highly significant (t(298) = -4.790; p < 0.0001), and gives a strong general impression that the speech rate is slower in L2 context compared to L1 context.
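The rate measures reported here and in Figure 1 below can be computed in a straightforward way from word-level annotations. The sketch is hypothetical (the annotation format is assumed) and uses a paired t-test over matched measurements, analogous to the comparisons reported in this section.

```python
# Sketch: phonemes per second per word, and a paired comparison of contexts.
import numpy as np
from scipy.stats import ttest_rel

def phonemes_per_second(words):
    """words: iterable of (start_s, end_s, n_phonemes) for one informant/context."""
    return float(np.mean([n / (end - start) for start, end, n in words]))

def compare_contexts(rate_l1_per_informant, rate_l2_per_informant):
    # Paired over informants: element i of both lists belongs to the same speaker
    return ttest_rel(rate_l1_per_informant, rate_l2_per_informant)
```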

[Figure: bar chart of the number of phonemes per second (0-14) in L1 and L2 contexts for informants P1, P2, P3, NP1, NP2 and NP3.]

Figure 1. Average number of phonemes per second, for all six informants in L1 and L2 contexts. Error bars are standard deviations.


With the purpose of investigating and describing speech rate more directly, the number of phonemes per second was found to be significantly lower in L2 context compared to L1 context when pooled across informants (t(59) = 3.303; p < 0.002). There was an average difference of 0.7 phonemes per second between contexts. In Figure 1 above, the number of phonemes per second is shown for all informants in L1 and L2 contexts. Figure 1 also suggests that the speech rate effect is larger for the "professional" (P) informants than for the "non-professional" (NP) informants. The interaction between level of experience and L1/L2 context on speech rate is significant (F(1, 58) = 7.337; p < 0.009). Considering these results, there is reason to suggest that speech rate is slower in L2 context compared to L1 context, and that the effect of context on speech rate depends on the level of experience of the speaker.

3.2 Vowel formants
The formants F1, F2 and F3, in addition to F0, were measured for the long and short vowels /a/, /i/ and /u/, representing the three most peripheral vowels in articulation. Since male and female speakers have vocal tracts of different sizes and shapes, the results in Table 1 below are presented for both genders separately. Bold type represents significant differences between contexts, and the results suggest that F1 in /a:/ is generally higher in L2 directed speech than in L1 directed speech, for female (t(25) = -3.686; p < 0.001) and male (t(24) = -3.806; p < 0.001) speakers. F1 is also significantly higher in L2 context than in L1 context in /a/ for male speakers (t(20) = -4.668; p < 0.0001), and in /i/ (t(19) = -2.113; p < 0.048) and /u:/ (t(35) = -2.831; p < 0.008) for female speakers. A difference in F2 between contexts seems to be evident only for the /i/ vowels, significantly so for male speakers, in /i:/ (t(23) = -3.079; p < 0.005) and /i/ (t(23) = -5.520; p < 0.0001). F3 values are quite variable within vowels and informants, but significantly higher values in L2 context than in L1 context were found in /i/ (t(23) = -2.152; p < 0.042) and /u:/ (t(35) = -3.313; p < 0.004) for male speakers.

Table 1. Average values for F1, F2 and F3 in Hz for female (F) and male (M) informants in short and long /a/, /i/ and /u/ vowels. Standard deviations are in parentheses. Bold typing represents statistical significance of differences between L1 and L2 context.
Vowel          F1 L1      F1 L2      F2 L1       F2 L2       F3 L1       F3 L2
/a:/ F (n=26)  663 (97)   719 (60)   1165 (89)   1192 (107)  2751 (224)  2724 (165)
/a:/ M (n=25)  578 (66)   632 (46)   1014 (106)  1050 (93)   2562 (209)  2652 (252)
/a/  F (n=20)  729 (121)  738 (67)   1257 (168)  1273 (139)  2728 (226)  2685 (174)
/a/  M (n=21)  552 (77)   626 (59)   1076 (89)   1088 (123)  2408 (282)  2475 (288)
/i:/ F (n=24)  403 (77)   391 (76)   2362 (269)  2422 (194)  3044 (311)  3035 (299)
/i:/ M (n=24)  317 (41)   326 (45)   2029 (124)  2093 (120)  2947 (261)  3026 (269)
/i/  F (n=20)  391 (56)   418 (51)   2287 (229)  2291 (197)  2933 (191)  2951 (154)
/i/  M (n=24)  361 (36)   362 (39)   1933 (109)  2036 (121)  2722 (155)  2795 (229)
/u:/ F (n=36)  379 (44)   404 (54)   861 (191)   854 (136)   2728 (209)  2790 (262)
/u:/ M (n=36)  351 (32)   354 (38)   738 (135)   735 (138)   2476 (212)  2572 (199)
/u/  F (n=18)  394 (56)   418 (58)   1028 (164)  1012 (163)  2690 (216)  2638 (225)
/u/  M (n=19)  370 (40)   371 (35)   882 (143)   861 (136)   2387 (188)  2388 (146)

If F1 values correlate positively with degree of opening in vowel articulation, the general rise in F1, especially for the /a/ vowels, might be interpreted as a result of a more open mouth/jaw position in L2 context than in L1 context. As suggested by Ferguson & Kewley-Port (2002), a rise in F1 might also be a result of increased vocal effort, which might give an additional explanation to the higher F1 values for /i/ and /u:/. Letting F2 represent the front-back

dimension of the vocal tract (high F2 values for front vowels), one might suggest that the /i/ vowels (mostly for male speakers) are produced further front in the mouth in L2 context than in L1 context. The tendencies toward higher F2 and F3 frequencies in L2 context compared to L1 context might indicate that the informants do not use more lip rounding when producing /u/ vowels in L2 context. Rather, this point might support our suggestion above that informants in general use a more open mouth position in L2 context than in L1 context. According to Syrdal & Gopal (1986), one might expect the relative differences F3-F2 and F1-F0 to describe the front-back and open-closed dimensions (respectively) more precisely than the absolute formant values. In the present investigation, the F1-F0 relations led to the same interpretations as F1 alone regarding degree of mouth opening. The F3-F2 relation gave additional information about the vowel /u:/, in that the F3-F2 difference was significantly larger in L2 context than in L1 context (t(40) = -2.302; p < 0.024). This might be interpreted as /u:/ being produced further back in the mouth in L2 context than in L1 context. Effects of level of experience on formant values or formant relations were not found, which indicates that the differences in vowel formants between L1 and L2 contexts are general among speakers.
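The relational measures can be derived directly from the measured values. The sketch below is a hypothetical helper that computes F1-F0 and F3-F2 in Hz for paired tokens and compares the two contexts with a paired t-test; it stays in Hz for simplicity, whereas Syrdal & Gopal's original model is formulated on an auditory scale.

```python
# Sketch: F1-F0 (open-closed) and F3-F2 (front-back) differences in Hz,
# compared between L1 and L2 contexts with a paired t-test.
import numpy as np
from scipy.stats import ttest_rel

def formant_relations(f0, f1, f2, f3):
    f0, f1, f2, f3 = (np.asarray(x, dtype=float) for x in (f0, f1, f2, f3))
    return f1 - f0, f3 - f2

def compare_contexts(values_l1, values_l2):
    # Token-by-token pairing across contexts; a positive t means larger values in L1
    return ttest_rel(values_l1, values_l2)
```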

4 Conclusions

The results show that L1 speakers modify their pronunciation when speaking to L2 speakers compared to when speaking to other L1 speakers. We have seen that this was so for speech rate, in that the informants had longer syllable durations and fewer phonemes per second in L2 context than in L1 context. The formant values and formant relations indicated that the articulation of the peripheral vowels /a/, /i/ and /u/ was closer to target in L2 context than in L1 context, in both the degree of opening and the front-back dimension. The results for L2-directed speech correspond to those found for clear speech (e.g. Picheny et al., 1986; Krause & Braida, 2004; Bond & Moore, 1994). Level of experience seemed to play a role in speech rate, in that "professional" L1-L2 speakers differentiated more between L1 and L2 context than "non-professional" L1-L2 speakers did.

References

Bond, Z.S. & T.J. Moore, 1994. A note on the acoustic-phonetic characteristics of inadvertently clear speech. Speech Communication 14, 325-337.
Ferguson, S.H. & D. Kewley-Port, 2002. Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 112, 259-271.
Krause, J.C. & L.D. Braida, 2004. Acoustic properties of naturally produced clear speech at normal speaking rates. J. Acoust. Soc. Am. 115, 362-378.
Picheny, M.A., N.I. Durlach & L.D. Braida, 1986. Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research 29, 434-446.
Syrdal, A.K. & H.S. Gopal, 1986. A perceptual model of vowel recognition based on the auditory representation of American English vowels. J. Acoust. Soc. Am. 79, 1066-1100.

A Switch of Dialect as Disguise

Maria Sjöström1, Erik J. Eriksson1, Elisabeth Zetterholm2, and Kirk P. H. Sullivan1
1 Department of Philosophy and Linguistics, Umeå University
2 Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University

Abstract

Criminals may purposely try to hide their identity by using a voice disguise such as imitating another dialect. This paper empirically investigates the power of dialect as an attribute that listeners use when identifying voices and how a switch of dialect affects voice identification. In order to delimit the magnitude of the perceptual significance of dialect and the possible impact of dialect imitation, a native bidialectal speaker was the target speaker in a set of four voice line-up experiments, two of which involved a dialect switch. Regardless of which dialect the bidialectal speaker spoke, he was readily recognized. When the familiarization and target voices were of different dialects, it was found that the bidialectal speaker was significantly less well recognized. Dialect is thus a key feature for speaker identification that overrides many other features of the voice. Whether imitated dialect can be used for voice disguise to the same degree as native dialect switching demands further research.

1 Introduction

In the process of recognizing a voice, humans attend to particular features of the individual's speech being heard. Some of the identifiable features that we listen to when recognizing a voice have been listed by, among others, Gibbons (2003) and Hollien (2002). The listed features include fundamental frequency (f0), articulation, voice quality, prosody, vocal intensity, dialect/sociolect, speech impediments and idiosyncratic pronunciation. The listener may use all, more, or only a few of these features when trying to identify a person, depending on what information is available. Which of these features serve as the most important ones when recognizing a voice is unclear. Of note, however, is that, according to Hollien (2002), one of the first things forensic practitioners look at when trying to establish a speaker's identity is dialect. During a crime, however, criminals may purposely try to hide their identity by disguising their voices. Künzel (2000) reported that statistics from the German Federal Police Office show that 15-25% of the cases involving speaker identification annually include at least one type of voice disguise; some of the perpetrators' 'favourites' are falsetto, pertinent creaky voice, whispering, faking a foreign accent and pinching one's nose. Markham (1999) investigated another possible method of voice disguise, dialect imitation. He had native Swedish speakers attempt to produce readings in various Swedish dialects that were not their native dialects. Both the speakers' ability to consistently maintain a natural impression and their ability to mask their native dialect were investigated. Markham found that some speakers are able to successfully mimic a dialect and hide their own identity. Markham also pointed out that, to avoid suspicion, it is as important to create an impression of naturalness as it is to hide one's identity when using voice disguise. In order to baseline and delimit the potential impact on speaker identification by voice alone due to dialect imitation, a suite of experiments was constructed that used a native bidialectal speaker as the speaker to be identified. The use of a native bidialectal speaker facilitates natural and dialect-consistent stimuli. The four perception tests presented here are excerpted from Sjöström (2005). The baselining of the potential problem is of central importance for forensic phonetics since, if listeners can be easily fooled, it undermines earwitness identification of dialect and suggests that forensic practitioners who currently use dialect as a primary feature during analysis would need to reduce their reliance on this feature.

2 Method

Four perception tests were constructed. The first two tests investigated whether the bidialectal speaker was equally recognizable in both his dialects. The second two tests addressed whether listeners were distracted by a dialect shift between familiarization and the recognition task.

2.1 Speech material

The target bidialectal speaker is a male Swede who reports that he speaks Scanian and a variety of the Stockholm dialect on a daily basis. He was born near Stockholm but moved to Scania as a five-year-old. An acoustic analysis of the speaker's dialect voices was performed, which confirmed that his two varieties of Swedish carry the typical characteristics of the two dialects and that he is consistent in his use of them. Two recordings of The Princess and the Pea were made by the bidialectal speaker. In one of them he read the story using the Stockholm dialect, and in the other he read it using his Scanian dialect. Four more recordings of The Princess and the Pea were made: two by two male mono-dialectal speakers of the Stockholm dialect (ST) and two by two male mono-dialectal speakers of the Scanian dialect (SC). These speakers (hereafter referred to as foils) were chosen with regard to their similarities with the target voice in dialect, age, and other voice features such as creakiness. For further details, see Sjöström (2005).

2.2 The identification tests

Four different earwitness identification tests were constructed for participants to listen to. Each test began with the entire recording of The Princess and the Pea as the familiarization voice, and was followed by a voice line-up of 45 stimuli. The 45 stimuli consisted of three phrases selected from each recording, presented three times for each speaker (3 x 3 x 5 = 45). Each voice line-up contained the four foil voices and one of the target's two dialect voices (see Table 1). For example, the test 'SC-ST' uses the target's Scanian voice as the familiarization voice and the target's Stockholm dialect voice in the line-up. Test SC-SC and Test ST-ST were created as control tests. They afford investigation of whether the target's Stockholm and Scanian dialects can be recognized among the voices of the line-up, and of whether the two dialects are recognized to the same degree. Tests ST-SC and SC-ST investigate whether the target can be recognized even when a dialect shift occurs between familiarization and recognition. 80 participants, ten in each listener test, took part in this study. All were native speakers of Swedish and reported no known hearing impairment. Most of the listeners were students at either Lund University or Umeå University, and all spoke a dialect from the southern or northern part of Sweden.


Table 1. The composition of the voice identification tests, showing which of the target's voices was used as familiarization voice and which voices were included in the voice line-up for each of the four tests.

Test    Familiarization voice   Line-up voices
SC-SC   TargetSC                Foil 1-4 + TargetSC
ST-ST   TargetST                Foil 1-4 + TargetST
ST-SC   TargetST                Foil 1-4 + TargetSC
SC-ST   TargetSC                Foil 1-4 + TargetST
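As an illustration of how a 45-item line-up of this kind could be assembled (three phrases x three repetitions x five speakers, presented in random order), the following sketch uses placeholder labels rather than the actual recordings:

    import random

    speakers = ["Foil1", "Foil2", "Foil3", "Foil4", "TargetSC"]  # or TargetST
    phrases = ["phrase1", "phrase2", "phrase3"]  # three excerpts per recording

    lineup = [(spk, phr) for spk in speakers for phr in phrases for _ in range(3)]
    random.shuffle(lineup)
    assert len(lineup) == 45  # 3 phrases x 3 repetitions x 5 speakers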

2.3 Data analysis

In this yes-no experimental design, responses can be grouped into four categories: hit (the listener correctly responds 'yes' to a target stimulus), miss (the listener responds 'no' to a target stimulus), false alarm (the listener responds 'yes' to a non-target stimulus) and correct rejection (the listener correctly responds 'no' to a non-target stimulus). By calculating the hit and false-alarm rates as proportions of the maximum possible numbers of hits and false alarms, the listeners' discrimination sensitivity can be determined, measured as d'. This measure is the difference between the hit rate (H) and the false-alarm rate (F) after both have been transformed into z-values: d' = z(H) - z(F) (see Green & Swets, 1966).
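A minimal sketch of this computation with hypothetical counts; the paper does not say how hit or false-alarm rates of 0 or 1 were handled, so a standard 1/(2N) correction is assumed here:

    from scipy.stats import norm

    def d_prime(hits, n_targets, false_alarms, n_nontargets):
        """d' = z(H) - z(F), with a 1/(2N) correction for rates of 0 or 1."""
        h = hits / n_targets
        f = false_alarms / n_nontargets
        h = min(max(h, 1 / (2 * n_targets)), 1 - 1 / (2 * n_targets))
        f = min(max(f, 1 / (2 * n_nontargets)), 1 - 1 / (2 * n_nontargets))
        return norm.ppf(h) - norm.ppf(f)

    # Each 45-item line-up contains 9 target stimuli and 36 foil stimuli.
    print(d_prime(hits=7, n_targets=9, false_alarms=4, n_nontargets=36))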

3 Results and discussion

Participants in the control tests, SC-SC and ST-ST, show positive mean d'-values (1.87 and 1.93). A two-tailed Student's t-test showed no significant difference in identification of the two dialects, and they can therefore be considered equally recognizable (t(38) = -0.28, p > 0.05). A one-sample t-test showed that the d'-values for both tests are highly distinct from 0 (t(39) = 18.45, p < 0.001), so a high degree of identification of both dialects can be concluded. The responses for the dialect-shifting tests, ST-SC (mean d' = 0.44) and SC-ST (mean d' = -0.07), did not differ significantly (t(38) = 1.93, p > 0.05); the target voice in these two tests can be considered equally difficult to identify. A one-sample t-test showed that the mean d'-value of the two tests was not significantly separated from 0 (t(39) = 1.36, p > 0.05), indicating random responding. Combining the responses for the 'control tests' (ST-ST; SC-SC) and the 'dialect shifting tests' (ST-SC; SC-ST) and comparing the two groups revealed a significant difference between them (t(78) = 5.97, p < 0.001) (see Fig. 1). Thus, a dialect shift has a detrimental effect on speaker identification.

4 Conclusions

The results indicate that the attribute dialect is of high importance in the identification process. It is clear that listeners find it much more difficult to identify the target voice when a shift of dialect takes place. One possible reason for the results is that, when making judgments about a person's identity, dialect is a strong attribute that has a higher priority than other features. The baselining of the potential problem we have conducted here shows that a switch of dialect can easily fool listeners. This undermines earwitness identification of dialect and suggests that forensic practitioners who currently use dialect as a primary feature during analysis need to reduce their reliance on this feature and be aware that they can easily be misled.


Figure 1. Mean discrimination sensitivity (d’) and standard error for Control tests (SC-SC and ST-ST combined) and Dialect shifting tests (ST-SC and SC-ST combined).

If used as a method of voice disguise, a perpetrator could use one native dialect at the time of an offence and the other in the event of being forced to participate in a voice line-up as a suspect. Needless to say, this method of voice disguise could have devastating effects on witness accuracy: the witness would not be able to recognize the perpetrator's voice when a different dialect is used or, worse still, might make an incorrect identification and choose another person whose dialect is more similar to the voice heard in the crime setting. In order to assess whether voice disguise using imitated dialect can have as drastic an impact on speaker identification as voice disguise by switching between native dialects, research using imitated dialect as a means of disguise is required.

Acknowledgements

Funded by a grant from the Bank of Sweden Tercentenary Foundation, Dnr K2002-1121:1-4, to Umeå University for the project 'Imitated voices: A research project with applications for security and the law'.

References

Gibbons, J., 2003. Forensic Linguistics. Oxford: Blackwell Publishing.
Green, D.M. & J.A. Swets, 1966. Signal detection theory and psychophysics. New York: John Wiley and Sons, Inc.
Hollien, H., 2002. Forensic voice identification. San Diego: Academic Press.
Künzel, H.J., 2000. Effects of voice disguise on speaking fundamental frequency. Forensic Linguistics 7, 1350-1771.
Markham, D., 1999. Listeners and disguised voices: the imitation and perception of dialectal accent. Forensic Linguistics 6, 289-299.
Sjöström, M., 2005. Earwitness identification – Can a switch of dialect fool us? Master's thesis in Cognitive Science. Unpublished. Department of Philosophy and Linguistics, Umeå University.


Prosody and Grounding in Dialog

Gabriel Skantze, David House, and Jens Edlund
Department of Speech, Music and Hearing, KTH, Stockholm
{gabriel|davidh|edlund}@speech.kth.se

Abstract

In a previous study we demonstrated that subjects could use prosodic features (primarily peak height and alignment) to make different interpretations of synthesized fragmentary grounding utterances. In the present study we test the hypothesis that subjects also change their behavior accordingly in a human-computer dialog setting. We report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog in Swedish. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.

1 Introduction

Detecting and recovering from errors is an important issue for spoken dialog systems, and a common technique for this is verification. However, verifications are often perceived as tedious and unnatural when they are constructed as full propositions verifying the complete user utterance. In contrast, humans often use fragmentary, elliptical constructions such as in the following example: "Further ahead on the right I see a red building." "Red?" (see e.g. Clark, 1996). In a previous experiment, the effects of prosodic features on the interpretation of such fragmentary grounding utterances were investigated (Edlund et al., 2005). Using a listener test paradigm, subjects were asked to listen to short dialog fragments in Swedish where the computer replies after a user turn with a one-word verification, and to judge what was actually intended by the computer by choosing between the paraphrases shown in Table 1.

Table 1. Prototype stimuli found in the previous experiment.

Position   Height   Paraphrase                Class
Early      Low      Ok, red                   ACCEPT
Mid        High     Do you really mean red?   CLARIFY UNDERSTANDING
Late       High     Did you say red?          CLARIFY PERCEIVE

The results showed that an early, low F0 peak signals acceptance (display of understanding), that a late, high peak is perceived as a request for clarification of what was said, and that a mid, high peak is perceived as a request for clarification of the meaning of what was said. The results are summarized in Table 1 and demonstrate the relationship between prosodic realization and the three different readings. In the present study, we want to test the hypothesis that users of spoken dialog systems not only perceive the differences in prosody of synthesized fragmentary grounding utterances, and their associated pragmatic meaning, but that they also change their behavior accordingly in a human-computer dialog setting.

2 Method

To test our hypothesis, an experiment was designed in which 10 subjects were given the task of classifying colors in a dialog with a computer. They were told that the computer needed the subject's assistance to build a coherent model of the subject's perception of colors, and that this was done by having the subject choose among pairs of the colors green, red, blue and yellow when shown various nuances of colors in-between (e.g. purple, turquoise, orange and chartreuse). They were also told that the computer may sometimes be confused by the chosen color or disagree. The experiment used a Wizard-of-Oz set-up: a person sitting in another room – the Wizard – listened to the audio from a close-talking microphone. The Wizard fed the system the colors spoken by the subjects, as well as giving a go-ahead signal to the system whenever a system response was appropriate. The subjects were informed about the Wizard set-up immediately after the experiment, but not before. A typical dialog is shown in Table 2.

Table 2. A typical dialog fragment from the experiment (translated from Swedish).

S1-1a  [presents purple flanked by red and blue]
S1-1b  what color is this
U1-1   red
S1-2   red (ACCEPT / CLARIFY UND / CLARIFY PERC) or mm (ACKNOWLEDGE)
U1-2   mm
S1-3   okay
S2-1a  [presents orange flanked by red and yellow]
S2-1b  and this
U2-1   yellow perhaps
[…]

The Wizard had no control over what utterance the system would present next. Instead, this was chosen by the system depending on the context, just as it would be in a system without a Wizard. The grounding fragments (S1-2 in Table 2) came in four flavors: a repetition of the color with one of the three intonations described in Table 1 (ACCEPT , CLARIFY UND or CLARIFY PERC ) or a simple acknowledgement consisting of a synthesized /m/ or /a/ (ACKNOWLEDGE ) (Wallers et al., 2006). The system picked these at random so that for every eight colors, each grounding fragment appeared twice. All system utterances were synthesized using the same voice as the experiment stimuli (Filipsson & Bruce, 1997). Their prosody was hand-tuned before synthesis in order to raise the subjects’ expectations of the computer’s conversational capabilities as much as possible. Each of the non-stimuli responses was available in a number of varieties, and the system picked from these at random. In general, the system was very responsive, with virtually no delays caused by processing.
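A sketch of the kind of balanced randomization described above, where every block of eight trials contains each of the four grounding-fragment types exactly twice; the names are illustrative only:

    import random

    FRAGMENTS = ["ACCEPT", "CLARIFY_UND", "CLARIFY_PERC", "ACKNOWLEDGE"]

    def fragment_schedule(n_blocks):
        """Yield fragment types so that each block of eight colors contains
        every type exactly twice, in random order within the block."""
        for _ in range(n_blocks):
            block = FRAGMENTS * 2
            random.shuffle(block)
            yield from block

    print(list(fragment_schedule(2)))  # 16 trials, each type four times in total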

3 Results

The recorded conversations were automatically segmented into utterances based on the logged timings of the system utterances. User utterances were then defined as the gaps in-between these. Out of ten subjects, two did not respond at all to any of the grounding utterances. For the other eight, responses were given in 243 out of 294 possible places. Since the object of our analysis was the subjects' responses, two subjects in their entirety and 51 silent responses distributed over the remaining eight subjects were automatically excluded from analysis.

User responses to fragmentary grounding utterances from the system were annotated with one of the labels ACKNOWLEDGE , ACCEPT , CLARIFY UND or CLARIFY PERC , reflecting the preceding utterance type. In almost all cases subjects simply acknowledged the system utterance with a brief “yes” or “mm” as the example U1-2 in Table 2. However, we felt that there were some differences in the way these responses were realized. To find out whether these differences were dependent on the preceding system utterance type, the user responses were cut out and labeled by two annotators. To aid the annotation, three full paraphrases of the preceding system utterance, according to Table 1, were recorded. The annotators could listen to each of the user responses concatenated with the paraphrases, and select the resulting dialog fragment that sounded most plausible, or decide that it was impossible to choose one of them. The result is a categorization showing what system utterance the annotators found to be the most plausible to precede the annotated subject response. The task is inherently difficult – sometimes the necessary information simply is not present in the subjects’ responses – and the annotators only agreed on a most plausible response in about 50% of the cases. The percentage of preceding system utterance types for the classifications on which the annotators agreed is shown in Figure 1.

Table 3. Average of subjects' mean response times after grounding fragments.

Grounding fragment   Response time
ACCEPT               591 ms
CLARIFY UND          976 ms
CLARIFY PERC         634 ms

(Figure 1: stacked bars showing, for each of the annotators' selected paraphrases (Accept, ClarifyUnd, ClarifyPerc), the percentage of stimuli whose preceding system utterance was of each type.)
Figure 1. The percentage of preceding system utterance types for the classifications on which the annotators agreed.

Figure 1 shows that responses to ACCEPT fragments are significantly more common in the group of stimuli for which the annotators had agreed on the ACCEPT paraphrase. In the same way, CLARIFY UND and CLARIFY PERC responses are significantly overrepresented in their respective classification groups (χ² = 19.51; df = 4; p < 0.001). This shows that the users' responses are somehow affected by the prosody of the preceding fragmentary grounding utterance, in line with our hypothesis. The annotators felt that the most important cue for their classifications was the user response time after the paraphrase. For example, a long pause after the question "did you say red?" sounds implausible, but not after "do you really mean red?". To test whether the response times were in fact affected by the type of preceding fragment, the time between the end of each system grounding fragment and the user response (in the cases where there was a user response) was automatically determined using /nailon/ (Edlund & Heldner, 2005), a software package for extraction of prosodic and other features from speech. Silence/speech detection in /nailon/ is based on a fairly simplistic threshold algorithm, and for our purposes, a preset threshold based on the average background noise in the room where the experiment took place was deemed sufficient. The results are shown in Table 3. The table shows that, just in line with the annotators' intuitions, ACCEPT fragments are followed by the shortest response times, CLARIFY UND the longest, and CLARIFY PERC between these. The differences are statistically significant (one-way within-subjects ANOVA; F = 7.558; df = 2; p < 0.05).
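The following is only a rough sketch of the kind of threshold-based detection described, not /nailon/'s actual implementation: it locates the first frame after the end of a system utterance whose RMS energy exceeds a preset threshold and treats that point as the response onset (the signal is assumed to be a mono numpy array).

    import numpy as np

    def response_time(signal, sr, system_end_s, threshold, frame_s=0.01):
        """Delay (s) from the end of the system utterance to the first frame
        whose RMS energy exceeds the preset threshold; None if none does."""
        start = int(system_end_s * sr)
        frame = int(frame_s * sr)
        for i in range(start, len(signal) - frame, frame):
            rms = np.sqrt(np.mean(signal[i:i + frame] ** 2))
            if rms > threshold:
                return i / sr - system_end_s
        return None

    # Synthetic example: silence until 1.6 s, "speech" (noise) afterwards;
    # the system utterance ends at 1.0 s, so the response time is about 0.6 s.
    sr = 16000
    sig = np.zeros(3 * sr)
    sig[int(1.6 * sr):] = 0.1 * np.random.randn(len(sig) - int(1.6 * sr))
    print(response_time(sig, sr, system_end_s=1.0, threshold=0.02))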

4 Conclusions and discussion

In the present study, we have shown that users of spoken dialog systems not only perceive the differences in prosody of synthesized fragmentary grounding utterances, and their associated pragmatic meaning, but that they also change their behavior accordingly in a human-computer dialog setting. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system. The response time differences found in the data are consistent with a cognitive load perspective that could be applied to the fragment meanings ACCEPT, CLARIFY PERC and CLARIFY UND. To simply acknowledge an acceptance should be the easiest, and it should be nearly as easy, but not quite, for users to confirm what they have actually said. It should take more time to reevaluate a decision and insist on the truth value of the utterance after CLARIFY UND. This relationship is nicely reflected in the data. Although we have not quantified other prosodic differences in the users' responses, the annotators felt that there were subtle differences in e.g. pitch range and intensity which may function as signals of certainty following CLARIFY PERC and signals of insistence or uncertainty following CLARIFY UND. More neutral, unmarked prosody seemed to follow ACCEPT. When listening to the resulting dialogs as a whole, the impression is that of a natural dialog flow with appropriate timing of responses, feedback and turn-taking. To be able to create spoken dialog systems capable of this kind of dialog flow, we must be able to both produce and recognize fragmentary grounding utterances and their responses. Further work using more complex fragments and more work on analyzing the prosody of user responses is needed.

Acknowledgements

This research was supported by VINNOVA and the EU project CHIL (IP506909).

References

Clark, H.H., 1996. Using language. Cambridge: Cambridge University Press.
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62(2-4), 215-226.
Edlund, J., D. House & G. Skantze, 2005. The effects of prosodic features on the interpretation of clarification ellipses. Proceedings of Interspeech 2005, Lisbon, 2389-2392.
Filipsson, M. & G. Bruce, 1997. LUKAS – a preliminary report on a new Swedish speech synthesis. Working Papers 46, Department of Linguistics and Phonetics, Lund University.
Wallers, Å., J. Edlund & G. Skantze, 2006. The effect of prosodic features on the interpretation of synthesised backchannels. Proceedings of Perception and Interactive Technologies, Kloster Irsee, Germany.

The Prosody of Public Speech – A Description of a Project

Eva Strangert1 and Thierry Deschamps2
1 Department of Comparative Literature and Scandinavian Languages, Umeå University
2 Department of Philosophy and Linguistics, Umeå University

Abstract

The project concerns prosodic aspects of public speech. A specific goal is to characterize skilled speakers. To that end, acoustic analyses will be combined with subjective ratings of speaker characteristics. The project has a bearing on how speech, and prosody in particular, can be adjusted to the communicative situation, especially by speakers in possession of a rich expressive repertoire.

1 Introduction

This paper presents a new project, the purpose of which is to identify prosodic features which characterize public speech, both read and spontaneous. A further purpose is to reveal how skilled public speakers use prosody to catch and keep the attention of their listeners, whether it be to inform or to argue with them. Combined with acoustic analyses of prosody, subjective ratings of speakers will contribute to our knowledge of what characterizes a "good" or "skilled" speaker. Thus the project, though basically in the area of phonetics, has an interdisciplinary character, as it also addresses rhetorical issues. The idea of approaching public speech has grown out of previous work in the field of prosody, including the recently completed project "Boundaries and groupings – the structuring of speech in different communicative situations" (see Carlson et al., 2002) as well as studies dealing specifically with the prosody of public speech, see below. Additional motivation for the new project is the growing interest today in public speech, and rhetoric in particular. The project should also be seen in the perspective of the significance given to the areas of speaking style variation and expressive speech during the last decades. This research is theoretically important, as it increases our knowledge of how human speech can be optimally adjusted to the specific situation, and it contributes to learning about the limits of human communicative capacity. Public speech offers a possibility to study speech that can be seen as extreme in this respect. In politics and elsewhere, when burning issues are at stake and seriously committed individuals are often involved, a rich expressive repertoire is made use of. In this domain, prosody has a major role.

2 Background

Common to textbooks in rhetoric is their focus on those aspects which do not concern the manner of speaking, although this is included in the concept of "rhetoric". The emphasis is rather on argumentation and the planning of the speech act, the rhetorical process, as well as on linguistic form; correctness, refinement, and clarity are demanded. The descriptions of how to speak are considerably less detailed and very often even vague. The recommendations of today are mostly similar to those given two thousand years ago; the voice of a skilled speaker should be "smooth", "flexible", "firm", "soft", "clear" and "clean" (Johannesson, 1990/1998, citing Quintilianus' (ca. AD 35-96) "Institutes of Oratory"). As far as phonetically based investigations are concerned, Touati (1991) analyzed tonal and temporal characteristics in the speech of French politicians. The analyses were undertaken against the background of earlier studies of political rhetoric and, in addition, other types of speech in public media in Sweden, see Bruce & Touati (1992). Other studies of public speech based on Swedish include Strangert (1991; 1993), both dealing with professional news reading. A study by Horne et al. (1995) concerned pausing and final lengthening in broadcast stock market reports on Swedish Radio. Analyses of interview speech made within the "Boundaries and groupings" project also have relevance here. The purpose in this case was not to study public speech per se. However, the results, in particular as concerns fluency and pausing (see e.g. Heldner & Megyesi, 2003; Strangert, 2004; Strangert & Carlson, 2006), may be assumed to reflect the fact that the speech was produced by a very experienced speaker. A recent study with focus on "the skilled professional speaker" (Strangert, 2005) approaches problems sketched for the current project. Braga & Marques (2004) focused on how prosodic features contribute to the listeners' attention and interpretation of the message in political debate. The conception of a speaker as "convincing", "powerful" and "dedicated" is assumed to be reflected in (combinations of) prosodic features, or "maximes". The study builds on the idea put forward by Gussenhoven (2002) and developed further by Hirschberg (2002) of universal codes for how prosodic information is produced by the speaker and perceived by the listener. Wichmann (2002) and Mozziconacci (2002) are among those dealing with the relations between prosody (f0 features in particular) and what can be described as "affective functions"; a comprehensive survey of expressive speech research can be found in Mozziconacci (2002). Wichmann (2002) makes a distinction between "ways of saying" (properties or states relating to the speaker) and "ways of behaving" (the speaker's attitude to the listener). "Ways of saying" includes, first, how the speaker uses prosody in itself – stress and emphasis, tonal features, speech rate, pausing etc. – and, second, the emotional coloring of speech (e.g. "happy", "sad", "angry") as well as states such as "excited", "powerful" etc. Examples of "ways of behaving" are attitudes such as "arrogant" and "pleading". In addition, the speaker may use other argumentative and rhetorical means. All these functions of prosody make it a complex, nuanced and powerful communicative tool. To study the affective functions of prosody, auditory analyses must be combined with acoustic measurements (see e.g. Mozziconacci, 2002).
Also, listeners’ impressions have to be categorized appropriately. A standard procedure is to have listeners judge samples of speech. However, human speech very often conveys several states, attitudes and emotions at the same time and this without doubt is true for the often quite elaborated speech produced in the public domain. This complexity is examined in a study by Liscombe et al. (2003) through the use of multiple and continuous scales for rating emotions. In their study, the subjective ratings are also combined with acoustic analyses of prosodic features.

3 Work in progress

As a first step, we made a survey asking 22 students of logopedics at Umeå University what kind of qualities they regarded as important in a person considered a "good speaker". The students wrote down as many characteristics (in Swedish) as they could, guided only by the definition of a good speaker as "a person who easily attracts listeners' attention through her/his way of speaking". Seven characteristics were given on average, with a range between 4 and 11. In addition to personality/emotional and attitudinal/interaction features, the labels given also reflected opinions about speech per se (articulation, voice characteristics and prosody), cf. Wichmann (2002). Thus, even if both the personality and the attitudinal features are transferred to the listeners through speech, the subjects did not refrain from having opinions about the speech itself. Table 1 shows the distribution of labels after grouping into the three categories. With this as a background, we will proceed by having subjects judge short passages of speech (spontaneous and prepared) for multiple speaker characteristics. These will include not only positively valued qualities like those listed here; other qualities, including more negatively colored ones, also need to be covered in an effort to characterize speaker behavior. We are currently in the process of developing a test environment for this experiment. In this work we lean on previous efforts (see Liscombe et al., 2003). Combined with acoustic analyses, we expect the multiple ratings to give insight into how different acoustic/prosodic features contribute to the impression of skilled – and less skilled – speaking behavior.

Table 1. Characteristics of “a good speaker” grouped into three categories based on 22 subjects’ written responses (see text). Labels in Swedish with English translations.

Speaker characteristics (Swedish – English)                              Number of responses

Speech features
tydlig artikulation – clear articulation                                 7
god röststyrka, röstläge – sufficient volume, voice level                6
icke-monoton röst – non-monotonous voice                                 4
variation i röststyrka, röstläge – variation of volume, voice level      3
rätt betoning, fokusering – adequate prominence and focus                2
väl avvägd pausering – well-adjusted pausing                             2
bra taltempo, ej för snabbt – well-adjusted tempo, not too fast          2
varierat taltempo – varied speech tempo                                  1
talflyt – fluency                                                        1
varierad prosodi, uttrycksfullhet – varied prosody, expressiveness       3

Personality features
inlevelse, entusiasm, engagemang – involvement, enthusiasm, commitment   16
humor, lättsamhet – sense of humour                                      12
karisma, utstrålning – charisma, appeal                                  6
lugn, avslappnad stil – calm, relaxed style of speaking                  5
personlighet – personality, individuality                                4
positiv inställning – positive attitude                                  3
ödmjukhet, självinsikt – sense of humility                               2
tydlighet – distinctness, authority                                      2
självförtroende – self-confidence                                        1
övertygelse – conviction                                                 1

Interaction features
förmåga att knyta an till lyssnarna – ability to relate to audience      8
nivåanpassning relativt lyssnarna – choosing the right communicative level  6
lyhördhet – sensitivity                                                  3
vilja till interaktion – ability to interact with audience               3
utan överlägsenhet – respectful, non-arrogant style                      2


As the project also aims to characterize other aspects of public speaking, a variety of representative speech samples will be collected. In the analyses of this material, fluency, pausing, prominence, emphasis and voice characteristics will be central. Among the questions we seek answers to are: What types of strategies are used for holding the floor? How does speech perceived as fluent and as disfluent, respectively, differ acoustically? How are prominence and emphasis used in speech in the media? What are the prosodic characteristics of agitation? Answers to these questions, we believe, will add to our understanding of human communicative capability and will also be useful in modeling speaking style variation. Knowledge gained within the project may further be expected to be practically applicable.

Acknowledgements

This work was supported by The Swedish Research Council (VR).

References

Braga, D. & M.A. Marques, 2004. The pragmatics of prosodic features in the political debate. Proc. Speech Prosody 2004, Nara, 321-324.
Bruce, G. & P. Touati, 1992. On the analysis of prosody in spontaneous speech with exemplification from Swedish and French. Speech Communication 11, 453-458.
Carlson, R., B. Granström, M. Heldner, D. House, B. Megyesi, E. Strangert & M. Swerts, 2002. Boundaries and groupings – the structuring of speech in different communicative situations: a description of the GROG project. TMH-QPSR 44, 65-68.
Gussenhoven, C., 2002. Intonation and interpretation: Phonetics and phonology. Proc. Speech Prosody 2002, Aix-en-Provence, 11-13.
Heldner, M. & B. Megyesi, 2003. Exploring the prosody-syntax interface in conversations. Proc. ICPhS 2003, Barcelona, 2501-2504.
Hirschberg, J., 2002. The pragmatics of intonational meaning. Proc. Speech Prosody 2002, Aix-en-Provence, 65-68.
Horne, M., E. Strangert & M. Heldner, 1995. Prosodic boundary strength in Swedish: final lengthening and silent interval duration. Proc. ICPhS 1995, Stockholm, 170-173.
Johannesson, K., 1998/1990. Retorik eller konsten att övertyga. Stockholm: Norstedts Förlag.
Liscombe, J., J. Venditti & J. Hirschberg, 2003. Classifying subject ratings of emotional speech using acoustic features. Proc. Eurospeech 2003, Geneva, 725-728.
Mozziconacci, S., 2002. Prosody and emotions. Proc. Speech Prosody 2002, Aix-en-Provence, 1-9.
Strangert, E., 1991. Phonetic characteristics of professional news reading. PERILUS XII. Institute of Linguistics, University of Stockholm, 39-42.
Strangert, E., 1993. Speaking style and pausing. PHONUM 2. Reports from the Department of Phonetics, University of Umeå, 121-137.
Strangert, E., 2004. Speech chunks in conversation: Syntactic and prosodic aspects. Proc. Speech Prosody 2004, Nara, 305-308.
Strangert, E., 2005. Prosody in public speech: analyses of a news announcement and a political interview. Proc. Interspeech 2005, Lisbon, 3401-3404.
Strangert, E. & R. Carlson, 2006. On modeling and synthesis of conversational speech. Proc. Nordic Prosody IX, 2004, Lund, 255-264.
Touati, P., 1991. Temporal profiles and tonal configurations in French political speech. Working Papers 38. Department of Linguistics, Lund University, 205-219.
Wichmann, A., 2002. Attitudinal intonation and the inferential process. Proc. Speech Prosody 2002, Aix-en-Provence, 11-22.

Effects of Age on VOT: Categorical Perception of Swedish Stops by Near-native L2 Speakers

Katrin Stölten
Centre for Research on Bilingualism, Stockholm University

Abstract

This study is concerned with effects of age of onset of L2 acquisition on categorical perception of the voicing contrast in Swedish word-initial stops. 41 L1 Spanish early and late learners of L2 Swedish, who had carefully been screened for their 'nativelike' L2 proficiency, as well as 15 native speakers of Swedish, participated in the study. Three voicing continua were created on the basis of naturally generated word pairs with /p t k b d g/ in initial position. Identification tests revealed an overall age effect on category boundary placement in the nativelike L2 speakers' perception of the three voicing continua. Only a small minority of the late L2 learners perceived the voicing contrast in a way comparable to native-speaker categorization. Findings concerning the early learners suggest that most, but far from all, early L2 speakers show nativelike behavior when their perception of the L2 is analyzed in detail.

1 Introduction

From extensive research on infant perception it has become a well-known fact that children during their first year of life tune in on the first language (L1) phonetic categories, leaving them insensitive to contrasts not existing in their native language (e.g. Werker & Tees, 1984). In a study by Ruben (1997) it was found that children who had suffered from otitis media during their first year of life showed significantly less capacity for phonetic discrimination, compared to children with normal hearing during infancy, when tested at the age of nine years. Such findings do not only demonstrate the importance of early linguistic exposure; they have also been interpreted as an indication of the existence of a critical period for phonetic/phonological acquisition which may be over at the age of one year (Ruben, 1997). In research on age effects on language acquisition, one classical issue is whether theories of a critical period can be applied to second language (L2) acquisition. The question is whether the capacity to acquire phonetic detail in L2 learning is weakened or lost due to lack of verbal input during a limited time frame for phonetic sensitivity, or whether nativelike perception and accent-free pronunciation are possible for any adult L2 learner. The present study is part of an extensive project on early and late L2 learners of Swedish with Spanish as their L1. The subjects have been selected on the criterion that they are perceived by native listeners as mother-tongue speakers of Swedish in everyday oral communication. Thereafter, the candidates' nativelike L2 proficiency has been tested for various linguistic skills. The present study focuses on the analysis of the nativelike subjects' categorical perception of the voicing contrast in Swedish word-initial stops. Both Swedish and Spanish recognize a phonological distinction between voiced and voiceless stops in terms of voice onset time (VOT), but they differ as to where on the VOT continuum the stop categories separate. In contrast to languages like Swedish and English, which treat short-lag stops as voiced and long-lag stops as voiceless, Spanish recognizes short-lag stops as voiceless, while stops with voicing lead are categorized as voiced (e.g. Zampini & Green, 2001). Consequently, Spanish phoneme boundaries are perceptually located at lower VOT values than in, for example, English (Abramson & Lisker, 1973). Since language-specific category boundaries are established at a very early stage in language development, a great amount of perceptual sensitivity is needed by a second language learner in order to detect the categories present in the target language. The fact that L2 learners generally show difficulties in correctly perceiving and producing these language-specific categories (e.g. Flege & Eefting, 1987) suggests that categorical perception may be considered a good device for the analysis of nativelike subjects' L2 proficiency. For the present study the following research questions have been formulated:
(1) Is there a general age effect on categorical perception among apparently nativelike L2 speakers of Swedish?
(2) Are there late L2 learners who show category boundaries within the range of native-speaker categorization?
(3) Do all (or most) early L2 learners show category boundaries within the range of native-speaker categorization?

2 Method

2.1 Subjects

A total of 41 native speakers of Spanish (age 21-52 years), who had previously been screened for their 'nativelike' proficiency in Swedish in three screening experiments (see Abrahamsson & Hyltenstam, 2006), were chosen as subjects for the study. The participants' age of onset (AO) of L2 acquisition varied between 1 and 19 years, and their mean length of residence in Sweden was 24 years. The subjects had an educational level of no less than senior high school, and they had all acquired the variety of Swedish spoken in the greater Stockholm area. The control subjects, consisting of 15 native speakers of Swedish, were carefully matched with the experimental group regarding present age, sex, educational level and variety of Swedish. All participants went through a hearing test (OSCILLA SM910 screening audiometer) in order to ensure that none of the subjects suffered from any hearing impairment.

2.2 Stimuli

The speech stimuli were prepared on the basis of naturally generated productions of three Swedish minimal word pairs: par /pɑːr/ 'pair, couple' vs. bar /bɑːr/ 'bar, bare, carried', tal /tɑːl/ 'number, speech' vs. dal /dɑːl/ 'valley', kal /kɑːl/ 'naked, bald' vs. gal /gɑːl/ 'crow(s), call(s)'. A female speaker of Stockholm Swedish with knowledge of phonetics was recorded in an anechoic chamber while reading aloud the words in isolation. The speaker was instructed to articulate the voiceless stops with an extended aspiration interval and the voiced counterparts with a clear period of voicing prior to stop release. All readings were digitized at 22 kHz with a 16-bit resolution. For all stop productions, VOT was determined by measuring the time interval between the release burst and the onset of voicing in the following vowel. Thereafter, the release bursts of the voiceless stops were equalized to 5 ms. The aspiration phase was then extended to +100 ms VOT by generating multiple copies of centre portions of the voicing lag interval. The stimuli for the perception test were created by shortening the aspiration interval in 5 ms steps. Voicing lead was simulated by copying the prevoicing interval from the original production of the corresponding voiced stop and placing it prior to the burst. The prevoicing maximum was first put at -100 ms and then varied in 5 ms steps. Finally, a set of 30 speech stimuli ranging from +90 ms to -60 ms VOT for each stop continuum was considered appropriate for the study.
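A sketch of the kind of waveform splicing described above, assuming the recording has already been segmented into numpy arrays for the prevoicing, the release burst, the aspiration interval and the remainder of the word; this illustrates the general procedure, not the actual scripts used in the study.

    import numpy as np

    def make_vot_step(vot_ms, prevoicing, burst, aspiration, rest, sr=22050):
        """Build one continuum step. Positive VOT: burst + aspiration trimmed or
        extended to the target duration + rest; negative VOT: prevoicing trimmed
        to |VOT| + burst + rest."""
        n = int(round(abs(vot_ms) / 1000 * sr))
        if vot_ms >= 0:
            if n <= len(aspiration):
                asp = aspiration[:n]
            else:
                # extend by repeating a centre portion of the aspiration interval
                centre = aspiration[len(aspiration) // 4: 3 * len(aspiration) // 4]
                reps = int(np.ceil((n - len(aspiration)) / len(centre)))
                asp = np.concatenate([aspiration, np.tile(centre, reps)])[:n]
            return np.concatenate([burst, asp, rest])
        return np.concatenate([prevoicing[-n:], burst, rest])

    # Dummy segments (noise/silence) just to make the sketch runnable:
    sr = 22050
    prevoicing = 0.05 * np.random.randn(int(0.100 * sr))
    burst = 0.2 * np.random.randn(int(0.005 * sr))
    aspiration = 0.02 * np.random.randn(int(0.060 * sr))
    rest = 0.1 * np.random.randn(int(0.300 * sr))
    continuum = [make_vot_step(v, prevoicing, burst, aspiration, rest, sr)
                 for v in range(90, -65, -5)]  # from +90 ms down to -60 ms in 5 ms steps
    print(len(continuum))  # number of 5 ms steps generated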

2.3 Testing procedure

A forced-choice identification task, designed and run in E-Prime v1.0 (Schneider, Eschman & Zuccolotto, 2002), was performed by each subject individually in a sound-treated room. The three voicing continua were tested separately. The speech stimuli were preceded by the carrier phrase Nu hör du 'Now you will hear' and randomly presented through headphones (KOSS KTX/PRO) one at a time. For each stimulus, the listeners were told to decide whether they heard the word containing a voiced or a voiceless stop and to confirm their answer by pressing a corresponding button on the keyboard. The experimenter was a male native speaker of Stockholm Swedish.

3 Results

Stop category boundaries (in ms VOT) were calculated for each subject and plotted against their age of onset. Since category boundary locations vary with place of articulation (see, e.g., Abramson & Lisker, 1973), the stop pairs were analyzed separately. Due to extreme VOT values, the results from one subject (AO 5) had to be discarded from further analysis.
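The paper does not spell out how the boundary values were estimated; one common approach is to fit a logistic (psychometric) function to each listener's identification proportions and take the 50% crossover as the category boundary. A sketch with made-up response proportions:

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(vot, boundary, slope):
        """Proportion of 'voiceless' responses as a function of VOT (ms)."""
        return 1.0 / (1.0 + np.exp(-slope * (vot - boundary)))

    # Made-up data for one listener: VOT steps (ms) and the proportion of
    # 'voiceless' responses at each step.
    vot_steps = np.arange(-60, 95, 5)
    rng = np.random.default_rng(1)
    prop_voiceless = np.clip(logistic(vot_steps, 12.0, 0.3)
                             + 0.05 * rng.normal(size=len(vot_steps)), 0, 1)

    (boundary, slope), _ = curve_fit(logistic, vot_steps, prop_voiceless, p0=[0.0, 0.2])
    print(f"estimated category boundary: {boundary:.1f} ms VOT")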

(Figure 1: three scatter panels, /p/-/b/, /t/-/d/ and /k/-/g/; x-axis: age of onset (AO) of L2 acquisition, 0-20 years; y-axis: category boundary in ms VOT, -60 to +80.)

Figure 1. Age of onset (AO) in relation to mean category boundaries (in ms VOT) for the bilabial, dental and velar stop continua; the values at AO 0 represent the category boundaries of the 15 native speakers.

As can be seen in Figure 1, negative correlations between AO and perceived category boundary exist for both the /p/-/b/ (r = -.468, p < .01) and the /t/-/d/ (r = -.340, p < .05) contrasts, whereas the correlation for the /k/-/g/ contrast did not reach significance (r = -.291, p = .069). In order to compare the nativelike candidates in a more systematic way, the subjects were divided into a group of early (AO 1-11) and a group of late learners (AO 13-19). Group comparisons revealed that the control subjects change phoneme categories at the longest mean VOTs (+7.23 ms for /p/-/b/; +15.34 ms for /t/-/d/; +24.62 ms for /k/-/g/), while the late L2 listeners show the shortest category crossover points (-17.57 ms for /p/-/b/; +1.28 ms for /t/-/d/; +14.86 ms for /k/-/g/). The group of early learners changes category boundaries at VOTs somewhere in between the late learners and the controls (-2.93 ms for /p/-/b/; +10.74 ms for /t/-/d/; +20.17 ms for /k/-/g/). An ANOVA confirmed that these group differences were highly significant for the bilabial (F(2,52) = 11.807, p < .001), the dental (F(2,52) = 7.847, p < .001) and the velar stop contrast (F(2,52) = 8.815, p < .001). Post-hoc comparisons (Fisher's LSD) showed that, except for the comparison between the native speakers and the early learners in the case of the dental stop contrast, all remaining group differences were significant. However, as can be seen in Figure 1, most of the nativelike candidates change categories at estimated VOTs within the range of native-speaker categorization. This holds for the bilabial (30 subjects), the dental (32 subjects) and the velar (29 subjects) voicing contrast.


Whereas most of the early learners (21 out of 30) perceive category boundaries within the range of native-speaker categorization for all three places of articulation, this applies to only two of the ten late learners (AO 13 and 16). In the group of early learners, nine subjects show category boundaries within the range of native-speaker categorization for either one or two of the Swedish minimal pairs. At the same time, no early learner was found who exhibits non-nativelike category crossover points for all three places of stop articulation. Finally, the analysis of the group of late L2 learners shows that seven individuals change phoneme category within the range of native-speaker categorization for either one or two of the three places of articulation. In contrast, only one subject (AO 14) does not exhibit category boundaries within the range of native-speaker categorization for any of the stops.

4 Summary and conclusions

The present study has shown that age of onset has an effect on apparently nativelike L2 speakers' categorical perception of the voicing contrast in Swedish word-initial stops. In addition to negative correlations between AO and perceived category boundaries, significant group differences were found. The late L2 learners change phoneme category at the shortest crossover points, thereby deviating the most from the Swedish controls. In short, the data confirm that there is a general age effect on categorical perception even among L2 speakers who seem to have attained nativelike L2 proficiency (Research Question 1). Among the late L2 learners, only two subjects (AO 13 and 16) change stop category within the range of native-speaker categorization for all three places of articulation. Thus, only a small minority of late, apparently nativelike L2 speakers show actual nativelike behavior concerning the categorical perception of the voicing contrast (Research Question 2). Most of the early L2 learners change category for the three stop continua at VOTs within the range of native-speaker categorization, and no subject with an early AO was identified who showed non-nativelike category boundaries for all three stop continua. Thus, most, but far from all, early learners show nativelike behavior when their perception of the L2 is analyzed in detail (Research Question 3).

References

Abrahamsson, N. & K. Hyltenstam, 2006. Inlärningsålder och uppfattad inföddhet i andraspråket – lyssnarexperiment med avancerade L2-talare av svenska. Nordisk tidskrift for andrespråksforskning 1:1, 9-36.
Abramson, A. & L. Lisker, 1973. Voice-timing perception in Spanish word-initial stops. Journal of Phonetics 1, 1-8.
Flege, J.E. & W. Eefting, 1987. Production and perception of English stops by native Spanish speakers. Journal of Phonetics 15, 67-83.
Ruben, R.J., 1997. A time frame of critical/sensitive periods of language development. Acta Otolaryngologica 117, 202-205.
Schneider, W., A. Eschman & A. Zuccolotto, 2002. E-Prime Reference Guide. Pittsburgh: Psychology Software Tools, Inc.
Werker, J.F. & R.C. Tees, 1984. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behaviour and Development 7, 49-63.
Zampini, M.L. & K.P. Green, 2001. The voicing contrast in English and Spanish: the relationship between perception and production. In J.L. Nicol (ed.), One Mind, Two Languages. Oxford: Blackwell.

Stress, Accent and Vowel Durations in Finnish

Kari Suomi
Department of Finnish, Information Studies and Logopedics, Oulu University

Abstract

The paper summarises recent research on the interaction of prominence and vowel durations in Finnish, a language with fixed initial stress and a quantity opposition in both vowels and consonants; to be more accurate, the research has been conducted on Northern Finnish. It is shown that, in one-foot words, there are four statistically distinct, non-contrastive duration degrees for phonologically single vowels, and three such degrees for phonologically double vowels. It is shown that the distributions of these duration degrees are crucially determined by moraic structure. Also sentence accent has a moraic alignment, with a tonal rise occurring on the word's first mora and a fall on the second mora. It is argued that the durational alternations are motivated by the particular way in which accent is realised.

1 Introduction

In Finnish, word stress is invariably associated with the initial syllable, and there is a binary quantity opposition in both vowels and consonants, independent of stress, effectively signalled only by durational differences. There are very good grounds for interpreting the quantity oppositions syntagmatically, as distinctions between a single phoneme and a double phoneme, i.e. a sequence of two identical phonemes (Karlsson, 1969). This interpretation is also reflected in the orthography, and thus there are written words like taka, taaka, takka, taakka, takaa, taakaa, takkaa, taakkaa. However, the orthography only indicates the contrastive, phonemic quantity distinctions and, beyond this, it does not in any way reflect the actual durations of phonetic segments. Thus, for example, neither the orthography nor a phonemic transcription expresses the fact that, in e.g. the dialect discussed in this paper, the second-syllable single vowel in taka has a duration that is almost twice as long as that in taaka, takka and taakka. This paper summarises recent research on such non-contrastive vowel duration alternations, and suggests their motivations. The paper only looks at vowel durations, and only in words that consist of just one, primary-stressed foot; the effect of secondary stress on vowel durations, which has not been systematically examined, is thus excluded. As will be seen below, the mora is an important unit in Finnish prosody. The morae of a syllable are counted as follows: the first vowel phoneme – the syllable nucleus – is the first mora, and every phoneme segment following it in the same syllable counts as an additional mora. Below, reference will be made to a word's morae; e.g. the words taka, taaka and taakka have the moraic structures CM1.CM2, CM1M2.CM3 and CM1M2M3.CM4, respectively (where Mn refers to the word's nth mora, and C is a non-moraic consonant).
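A small sketch of this counting procedure, assuming the word has already been syllabified and that double vowels and geminates are written out as two identical letters; the function is illustrative, not from the paper:

    VOWELS = set("aeiouyäö")

    def moraic_structure(syllables):
        """Return a string such as 'CM1M2.CM3' for a syllabified word.
        Onset consonants before the first vowel of a syllable are non-moraic (C);
        the first vowel is a mora, and every following segment in the same
        syllable adds one more mora."""
        mora = 0
        parts = []
        for syll in syllables:
            labels, nucleus_seen = [], False
            for seg in syll:
                if not nucleus_seen and seg not in VOWELS:
                    labels.append("C")           # non-moraic onset consonant
                else:
                    nucleus_seen = True
                    mora += 1
                    labels.append(f"M{mora}")    # nucleus and everything after it
            parts.append("".join(labels))
        return ".".join(parts)

    # The examples from the text:
    print(moraic_structure(["ta", "ka"]))    # CM1.CM2
    print(moraic_structure(["taa", "ka"]))   # CM1M2.CM3
    print(moraic_structure(["taak", "ka"]))  # CM1M2M3.CM4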

2 Vowel duration patterns

Suomi & Ylitalo (2004) investigated segment durations in unaccented, trisyllabic nonsense words consisting of one foot each. The segmental composition of the nonsense words was fully counterbalanced. The word structures investigated were CV.CV.CV, CV.CVC.CV, CV.CVV.CV, CV.CVV.CVV, CVC.CV.CV, CVC.CVC.CV, CVV.CV.CVV and CVV.CVV.CVV, each represented by 18 different words. The words were spoken in the frame sentence xyz, MINUN mielestäni xyz kirjoitetaan NÄIN ('xyz, in MY opinion xyz is written like THIS'), where xyz represents the target word, the second occurrence of which was measured. The five speakers were instructed to emphasise the capitalised words. Suomi & Ylitalo only compared segment durations within the domain of the word's first two morae with those outside that domain, but the data have now been reanalysed in more detail. It turned out that there are four statistically distinct, non-contrastive and complementary duration degrees for single vowels, denoted V(1)–V(4) in Table 1. The table also shows the results for three classes of double vowels (VV) with different moraic affiliations. The duration labels given to the duration degrees are ad hoc.

Table 1. The mean durations (in ms) of the four duration degrees (DD) of phonologically single vowels (V) and of three types of double vowels (VV) as observed in Suomi & Ylitalo (2004) and in Suomi (in preparation); columns S & Y and S, respectively. In the column Moraic status, “M3+” means that the V is the word’s third or later mora, “M1.” that the V is M1 followed by a syllable boundary, “M1C” that the V is M1 followed by a consonant in the same syllable, “CM2” that the V is M2 preceded by a consonant in the same syllable, “M1M2” that the VV constitutes the sequence M1M2, “M2M3” that the VV constitutes the sequence M2M3, and “M3+M” that the first segment in the VV sequence is M3 or a later mora. For further explanations see the text.

DD      Duration label              S & Y     S    Moraic status   Example structures
V(1)    “extra short”                  48    75    M3+             CV.CV.CV, CVC.CV
V(2)    “short”                        58   104    M1.             CV.CV(X)
V(3)    “longish”                      73   126    M1C             CVC.CV(X)
V(4)    “long”                         84   158    CM2             CV.CV(C)
VV(1)   “longish” + “longish”         149     -    M1M2            CVV(X)
VV(2)   “long” + “extra short”        142     -    M2M3            CV.CVV
VV(3)   “very long”                   135     -    M3+M            CVC.CVV, CVV.CVV

Suomi (in preparation) measured durations in segmentally fully controlled, accented CV.CV and CVC.CV nonsense words embedded in the frame sentence Sanonko ___ uudelleen? (Shall I say ___ again?) and spoken by seven speakers. Suomi found the same four statistically distinct duration degrees for phonologically single vowels, as reported in Table 1. Three of the four single vowel duration degrees have been well documented earlier, e.g. by Lehtonen (1970), but the existence of degree V(3) (“longish”) has not been previously reported. Below are the distributional rules of the observed duration degrees. The rules are to be applied in the following manner: if a word contains a VV sequence, then an attempt to apply the rule for VV duration should be made first. If this rule is not applicable, then the rule for V should be applied to both members of the VV sequence (and of course to singleton V’s).

VV: [very long]   if the first V in the sequence constitutes M3+

V:  [extra short] if it constitutes M3+
    [short]       if it constitutes M1 that is not next to M2
    [longish]     if it occurs in the sequence M1M2
    [long]        if it constitutes M2 that is not next to M1

As the rule for VV duration is formulated, it is only applicable to VV(3), not to VV(1) or VV(2). In these latter two cases, then, the rule for V duration has to be separately applied to both segments in the sequence, and the correct durations are assigned. Thus VV is “very long” in e.g. CVV.CVV.CVV and CVC.CVV, V is “extra short” in e.g. CVV.CV, CVC.CV and CV.CV.CV, “short” in CV.CV(X), “longish” in CVC.CV and CVV.CV (both segments in VV are “longish”), and “long” in CV.CV. In the structure CV.CVV, the first segment in the second-syllable VV sequence (M2) is analysable as “long” and the second one (M3) as “extra short”; the sum of these duration degrees is (84 ms + 48 ms =) 132 ms, which is 10 ms less than the observed duration for VV(2) (142 ms), but the difference was not significant. The durational alternations under discussion of course entail complications to the realisation of the phonemic quantity opposition, and in particular the durational difference in the second-syllable vocalic segments in CV.CV and CV.CVV word structures is less than optimal. Notice that the above rules explicitly refer to moraic structure only, and not e.g. to the syllable. Notice further that M3+ is only referred to when the vowel is either “very long” or “extra short”. These degrees represent the durations of double and single vowels in those unstressed syllables in which nothing interferes with the realisation of the quantity opposition; in these positions, the mean duration of double vowels is (135/48 =) 2.8 times that of single vowels. But when a vowel constitutes M1, it can be either “short” or “longish”, and when it constitutes M2, it can be either “longish” or “long”. This is because the durations of these segments also signal prominence.
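The distributional rules above can be stated compactly in code. The following Python sketch is an illustration only (not from the original study); it assumes that each segment of a word has already been assigned its mora index (or None for a non-moraic onset consonant), e.g. by the mora-counting sketch given earlier, and it ignores complications such as hiatus between identical vowels.

```python
VOWELS = set("aeiouyäö")

def label_vowels(segments):
    """Assign a duration degree to each vowel of one word.

    `segments` is a list of (symbol, mora) pairs, where mora is the mora index
    (1, 2, 3, ...) or None for non-moraic onset consonants.  The VV rule is
    tried first; otherwise the single-vowel rule applies, also to each member
    of VV(1) and VV(2).  Illustrative sketch only.
    """
    labels = {}
    i = 0
    while i < len(segments):
        sym, mora = segments[i]
        if sym not in VOWELS:
            i += 1
            continue
        if i + 1 < len(segments) and segments[i + 1][0] == sym and mora >= 3:
            labels[i] = labels[i + 1] = "very long"        # VV rule
            i += 2
            continue
        prv = segments[i - 1][1] if i > 0 else None
        nxt = segments[i + 1][1] if i + 1 < len(segments) else None
        if mora >= 3:
            labels[i] = "extra short"
        elif mora == 1:
            labels[i] = "longish" if nxt == 2 else "short"
        else:                                              # mora == 2
            labels[i] = "longish" if prv == 1 else "long"
        i += 1
    return [(segments[j][0], labels[j]) for j in sorted(labels)]

# CV.CVV ('takaa'): M1 is "short", M2 "long" and M3 "extra short", as in the text.
takaa = [("t", None), ("a", 1), ("k", None), ("a", 2), ("a", 3)]
print(label_vowels(takaa))
```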

3 On the phonetic realisation of prominence
The distinction drawn by Ladd (1996, and elsewhere) between the association and the alignment of prominence is very useful in Finnish. Primary word stress is unquestionably phonologically associated with the word’s initial syllable, but its phonetic alignment with the segmental material is more variable. Stress is signalled by greater segment duration, but not necessarily on the stressed syllable only. Broadly speaking, stress is manifested as greater duration of the segments that constitute M1 and M2, but exactly how the greater duration is distributed depends on the structure of the initial syllable. If the initial syllable is light, i.e. in (C)V.CV(C) words, the first-syllable vowel is “short” and the second-syllable vowel (M2) is “long” (but both are longer than the third-syllable “extra short” vowel in (C)V.CV.CV(C) words). But if the initial syllable is heavy, i.e. contains both M1 and M2, then both of these segments are “longish”, as in CVV.CV(C) words (and the second-syllable V is “extra short”). As concerns sentence accent, it is normally realised as a tonal rise-fall that is also moraically aligned: the rise is realised during the first mora, and (most of) the fall during the second mora. Thus in (C)V.CV(C) words, the rise is realised during the first syllable and the fall during the second one, whereas in words with a heavy initial syllable both the rise and the fall are realised during the initial syllable. Strong (e.g. contrastive) accent involves a wider f0 movement than moderate accent, and it is also realised durationally, as an increase in the durations of especially M1 and M2. But moderate accent is not realised durationally, i.e. the unaccented and moderately accented versions of a word have equal durations. In many languages, details of the tonal realisation of accent depend on the structure of the accented syllable. Thus e.g. Arvaniti, Ladd & Mennen (1998) report that, in Greek, the slope and duration of the (prenuclear) accentual tonal movement vary as a function of the structure of the accented syllable. This is not so in Finnish. Instead, what has been observed repeatedly is that, given a constant speech tempo and a given degree of accentuation, the rise-fall tune is temporally and tonally uniform across different word and syllable structures (Suomi, Toivanen & Ylitalo, 2003; Suomi, 2005; in press).

4 Motivating the durational alternations
Why are there so many non-contrastive vowel duration degrees in Finnish, alternations that partly interfere with the optimal realisation of the quantity opposition? The answer seems to be provided by the particular combination of prosodic properties in the language. Given the uniformity of the accentual tune across different word structures, and given the moraic alignment of the accentual tune, the durational alternations discussed above are necessary. If the durational alternations did not exist but accent nevertheless had the moraic alignment that it has, the uniformity of the accentual tune would not be possible. Why the tonal uniformity exists is not clear, but there it is. It is somewhat paradoxical that, in a full-fledged quantity language in which segment durations signal phonemic distinctions, segment durations nevertheless also vary extensively to serve tonal purposes, while in non-quantity languages like Greek the segmental composition of the accented syllable determines the tonal realisation. The durational alternations are also observable in unaccented words. But this does not undermine the motivation just suggested, because unaccented and moderately accented words do not differ from each other durationally, and the alternations are directly motivated in moderately accented words. Thus unaccented words are as if prepared for being accented. A conceivable alternative would be that unaccented words would lack the alternations present in accented words, but this state of affairs would further complicate the durational system. To summarise, beyond the loci in which stress and accent are realised, i.e. when vowels do not constitute M1 or M2, single vowels are “extra short” and double vowels “very long”, which results in their clear separation. In (C)V.CV(X) words, the tonal rise is realised during the initial syllable and it is sufficient that the vowel is “short”. The long fall is realised during the second syllable, and therefore the vowel must be “long”. In (C)VV.CV(X) words, both the rise and most of the fall is realised during the initial syllable, and therefore both segments in the VV sequence must be “longish”. This paper is not about consonant durations, but in (C)VC.CV(X) words, in which M2 is a consonant, it too has to be “longish”; if the consonant has relatively short intrinsic duration elsewhere, it is lengthened in this position. As a consequence of these alternations, the accentual rise-fall can be uniform across different word structures, and at the same time, the quantity oppositions are not jeopardised.

References
Arvaniti, A., D.R. Ladd & I. Mennen, 1998. Stability of tonal alignment: the case of Greek prenuclear accents. Journal of Phonetics 26, 3-25.
Karlsson, F., 1969. Suomen yleiskielen segmentaalifoneemien paradigma. Virittäjä 73, 351-362.
Ladd, D.R., 1996. Intonational phonology. Cambridge: Cambridge University Press.
Lehtonen, J., 1970. Aspects of quantity in standard Finnish. Jyväskylä: Jyväskylä University Press.
Suomi, K., 2005. Temporal conspiracies for a tonal end: segmental durations and accentual f0 movement in a quantity language. Journal of Phonetics 33, 291-309.
Suomi, K., in press. On the tonal and temporal domains of accent in Finnish. Journal of Phonetics.
Suomi, K., in preparation. Durational elasticity for accentual purposes (working title).
Suomi, K., J. Toivanen & R. Ylitalo, 2003. Durational and tonal correlates of accent in Finnish. Journal of Phonetics 31, 113-138.
Suomi, K. & R. Ylitalo, 2004. On durational correlates of word stress in Finnish. Journal of Phonetics 32, 35-63.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 133 Working Papers 52 (2006), 133–136

Phonological Demands vs. System Constraints in an L2 Setting

Bosse Thorén Dept. of Linguistics, Stockholm University [email protected]

Abstract
How can system constraints and phonological output demands influence articulation in an L2 speaker? When measuring durations and articulator movements for some Swedish /VːC/ and /VCː/ words, pronounced by a Swedish and a Polish speaker, it appeared that phonological vowel length was realized very similarly by both speakers, while complementary consonant length was applied only by the native Swedish speaker. Furthermore, the tendency for increased openness in short (lax) vowel allophones was manifested in analogous jaw and lip movements in the Swedish speaker, but followed a different pattern in the Polish speaker.

1 Introduction
How is articulation influenced by system-based constraints and output-based constraints when a person uses a second language? According to the Hyper & Hypo speech theory (Lindblom 1990), the degree of articulatory effort in human speech is determined mainly by two factors: 1) the limitations that inertia in the articulators poses upon speech, including the tendency towards economy of effort; 2) the demands of the listener, e.g. sufficient phonological contrast. The former is assumed to result in unclear speech, or “undershoot”, and the latter in “overshoot” or “perfect shoot” (clear speech). According to the H&H theory, the output demands vary depending on e.g. contextual predictability, the acoustic channel being used, the presence of noise etc. From a cross-linguistic point of view, the demands of a listener are to a high degree determined by the phonological system of the language in question. These demands are supposed to be intuitively inherent in the native speaker of the language, i.e. the speaker has a clear but probably unconscious picture of the articulatory goal. What happens to an L2 speaker in this perspective? We can assume that the L2 speaker is influenced both by L1 and L2 demands on the output, as well as by system-based constraints.

Swedish has a quantity distinction in stressed syllables, manifested in most varieties as either /Vː(C)/ or /VCː/. Elert (1964) has shown that the Swedish long-short phonological distinction is accompanied by analogous differences in duration for the segments involved. His study also shows that the differences in duration between long and short Swedish vowel allophones are significantly greater (mean 35%) than durational differences between closed and open vowels (5-15%). This predicts that output constraints for Swedish segment durations would override the system constraints, i.e. the inherent differences in duration between open and closed vowels. Polish, on the other hand, is a language without phonological quantity, and is not expected to involve any output constraints on the duration of segments. Duration differences in Polish are assumed to result mainly from vowel openness, in accordance with the “Extent of Movement Hypothesis” (Fischer-Jörgensen, 1964). A native Polish speaker, who speaks Swedish as a second language, is therefore expected to show more influence from the system constraints in his/her Swedish production than a native Swedish speaker. In addition to the longer inherent duration in open vowels, there is a clear connection between long/tense and short/lax vowels, resulting in Swedish short vowel allophones being pronounced more open than their long counterparts (cf. Fant 1959).

The present study examines what happens when a native speaker of Swedish and a native speaker of Polish pronounce test words containing the following combinations in trisyllabic nonsense words: long open vowel /ɑːp/, short open vowel /apː/, long closed vowel /iːp/ and short closed vowel /ipː/. In this study, the movements of mandible and lips are measured in addition to segment durations, in order to compare the two speakers with regard to patterns of articulatory gestures as resulting from output demands and system constraints, respectively. The question is: Will the duration of segments produced in Swedish /VːC/ and /VCː/ contexts differ significantly when pronounced by a native Swedish speaker and a native Polish speaker? And will the timing and magnitude of lip and jaw movements differ in a significant way between the two speakers, indicating more influence from output demands or system constraints?

2 Method

Two adult male subjects, one native Swede and one native Pole who had lived in Sweden for 22 years, were recorded when pronouncing nonsense words containing the four /Vːp/ and /Vpː/ combinations listed above, all of which are possible Swedish words according to Swedish phonotactics and prosody. The Swedish speaker read the sequence of test words five times, and the Polish speaker read it three times. Measurements of lip and mandible movement as well as the speech signal were carried out by means of Movetrack, a magnetometer system with sender coils attached to the speaker's lips and lower incisors, and receiver coils placed in a light helmet on the speaker's head. The device measures variation in a magnetic field that can be directly related to the distance between coils. The system produces data files with articulator movements synchronized with the speech signal.

3 Results
3.1 Segment durations
The two speakers realized phonological vowel length in a similar way, making clear temporal differences between long and short vowel allophones, as shown in Figure 1. The complementary long consonant after a short vowel in stressed syllables in Swedish is very clear in the native Swedish speaker, but non-existent, and even shorter, in the native Polish speaker. Differences in vowel duration ratios are illustrated in Figure 1b, where the differences in vowel length are seen as functions of phonological demands and system constraints respectively. Both speakers realize phonologically long vowels with more than double the duration of short vowels, the Polish speaker showing an even greater difference than the Swedish speaker. The Polish speaker made a greater duration difference between /a/ and /i/ than the Swedish speaker did. This latter difference in duration ratios between speakers is significant (p < 0.05, ANOVA), whereas the inter-speaker difference in Vː/V ratios is not.

Figure 1a and 1b. (a) Durations of long and short allophones produced by the Swedish (black columns) and the Polish speaker (gray columns). Mean values from 10 realizations by the Swedish speaker and 6 realizations by the Polish speaker. (b) Inter-speaker differences for long/short vowel ratios and open/closed vowel ratios (mean values).

3.2 Vowel durations and articulator movements
Two principal measures of articulator movements were taken to show possible differences between the speakers: 1) vertical mandible displacement in relation to vowel openness and phonological length, 2) vertical lower lip depression in relation to vowel openness and phonological length. The pattern of jaw opening in the two speakers is shown in Figure 2a. The Swedish speaker follows an expected pattern, where the jaw movement seems to reflect vowel openness, with greater openness for /a/ than for /i/, but also more open articulation for short allophones than for long allophones. The Polish speaker also shows greater jaw lowering for /a/ than for /i/, but the smaller opening for short allophones compared to long allophones does not reflect the spectral vowel quality, i.e. the fact that, at least for /a/, the Polish speaker produces a higher F1 for [a] than for [ɑː]. Inspection via listening and spectral analysis shows that both speakers produce very similar F1 and F2 values. The pattern of lip aperture, as shown in Figure 2b, follows roughly the pattern of jaw lowering gestures, except for the Swedish speaker's smaller lip aperture for short /i/ compared to long /iː/.

Figure 2a and 2b. Mandible (2a) and lower lip (2b) depression for long-short and open-closed vowels, produced by the Swedish and the Polish speaker.

The timing pattern, in terms of lip aperture duration related to vowel duration and the time lapse from vowel end to maximal lip closure, did not show any systematic differences between speakers or vowel types.

4 Discussion
The segment duration patterns produced by the two speakers are not surprising. Starting with vowel duration, phonological vowel length is a well established and well known property of Swedish, both as an important feature of Swedish pronunciation and as a way of accounting for the double consonant spelling. As seen in Figure 1, both speakers realize long and short vowel allophones quite similarly. The Swedish speaker, as shown in Figure 1, demonstrates in addition a substantial prolonging of the /p/ segment after a short vowel, which the Polish speaker does not. The Polish speaker reports having encountered rules for vowel length as well as consonant length while studying Swedish, implying that mere ignorance does not account for his lack of complementary long consonants. The phonetics literature, e.g. Ladefoged & Maddieson (1996), gives the impression that phonological vowel length is utilized by a greater number of the world's languages than is consonant length. This suggests that phonological consonant length is a universally more marked feature than vowel length, and hence more difficult to acquire. The somewhat greater difference between long and short vowel allophones demonstrated by the Polish speaker can be interpreted as compensation for the lack of complementary consonant length, which has been shown to serve as a complementary cue for the listener when segment durations are in the borderland between /VːC/ and /VCː/ (Thorén 2005). The between-speaker difference is not surprising, since phonological quantity in Swedish is a predominant phonetic feature, and can be expected to influence the temporal organization of the native Swede's speech from an early age. The Polish speaker came to Sweden as an adult and has acquired one important temporal feature, but his overall temporal organization may still bear strong traces of the system constraints concerning the duration of segments.

The differences in lip and mandible movements between the speakers could be interpreted as follows: Both speakers produce a higher F1 for short [a] than for long [ɑː] (e.g. Fant 1959), which typically correlates with a lower tongue and mandible. The Polish speaker, however, shows a clearly greater jaw and lip opening for long [ɑː] than for short [a], which suggests that the Polish speaker uses a compensatory tongue height in [ɑː] to maintain correct spectral quality. The greater mandible excursion in [ɑː] cannot be the result of an articulatory goal for this vowel, but could possibly be interpreted as an inverse “Extent of Movement Hypothesis” (Fischer-Jörgensen 1964), letting the mandible make a greater excursion owing to the opportunity offered by the long duration of the [ɑː].

References
Elert, C-C., 1964. Phonologic studies of Swedish Quantity. Uppsala: Almqvist & Wiksell.
Fant, G., 1959. Acoustic analysis and synthesis of speech with application to Swedish. Ericsson Technics No. 15, 3-108.
Fischer-Jörgensen, E., 1964. Sound Duration and Place of Articulation. Zeitschrift für Sprachwissenschaft und Kommunikationsforschung 17, 175-207.
Ladefoged, P. & I. Maddieson, 1996. The Sounds of the World's Languages. Oxford: Blackwell Publishers.
Lindblom, B., 1990. Explaining phonetic variation: a sketch of the H&H theory. In Hardcastle & Marchal (eds.), Speech Production and Speech Modeling. Dordrecht: Kluwer, 403-439.
Thorén, B., 2005. The postvocalic consonant as a complementary cue to the quantity distinction in Swedish – a revisit. Proceedings from FONETIK 2005, Göteborg University, 115-118.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 137 Working Papers 52 (2006), 137–140

Cross-modal Interactions in Visual as Opposed to Auditory Perception of Vowels

Hartmut Traunmüller Department of Linguistics, Stockholm University [email protected]

Abstract
This paper describes two perception experiments with vowels in monosyllabic utterances presented auditorily, visually and bimodally with incongruent cues to openness and/or roundedness. In the first, the subjects had to tell what they heard; in the second, what they saw. The results show that the same stimuli evoke a visual percept that may be influenced by audition and may be different from the auditory percept that may be influenced by vision. In both cases, the strength of the influence of the unattended modality showed between-feature variation reflecting the reliability of the information.

1 Introduction
Nearly all research on cross-modal interactions in speech perception has been focused on the influence an optic signal may have on auditory perception. In modeling audiovisual integration, it is common to assume three functional components: (1) auditory analysis, (2) visual analysis and (3) audiovisual integration that is assumed to produce an ‘amodal’ phonetic output. Although details differ (Massaro, 1996; Robert-Ribes et al., 1996; Massaro & Stork, 1998), the output was commonly identified with what the subjects heard, not having been asked about what they saw. This experimenter behavior suggests the amodal representations of phonetic units (concepts), which can be assumed to exist in the minds of people, to be closely associated with auditory perception. The seen remains outside the scope of these models unless it agrees with the heard. The present experiments were done in order to answer the question of whether a visual percept that may be influenced by audition can be distinguished from the auditory percept that may be influenced by vision, and whether the strength of such an influence is feature-specific. Previous investigations (Robert-Ribes et al., 1998; Traunmüller & Öhrström, in press) demonstrated such feature-specificity in the influence of optic information on the auditory perception of vowels: the influence was strongest for roundedness, for which the non-attended visual modality offered more reliable cues than the attended auditory modality. In analogy, we could expect a much stronger influence of non-attended acoustic information on the visual perception of vowel height or “openness” as compared with roundedness.

2 Method
2.1 Speakers and speech material
For the two experiments performed, a subset of the video recordings made for a previous experiment (Traunmüller & Öhrström, in press) was used. It consisted of the 6 incongruent auditory-visual combinations of the nonsense syllables containing /i/, /y/ and /e/, produced by each one of 2 male and 2 female speakers of Swedish. Synchronization had been based on the release burst of the first consonant. In Exp. 1, each auditory stimulus was also presented alone, and in Exp. 2 each visual stimulus instead.

2.2 Perceivers
14 subjects believed to have normal hearing and vision (6 male, aged 20 to 60 years, and 8 female, aged 20 to 59 years) served as perceivers. All were native speakers of Swedish pursuing studies or research at the Department of Linguistics.

2.3 Procedure
The subjects wore AKG K25 headphones and were seated with their faces at an arm's length from a computer screen. Each one of the 9 stimuli per speaker was presented twice in random order, using Windows Media Player. The height of the faces, shown in the left half of the screen, was roughly 12 cm. In Exp. 1, the subjects were instructed to look at the speaker when shown, but to tell which vowel they heard. In Exp. 2, they were instructed to keep the headphones on, but to tell which vowel they saw. Stimulus presentation was controlled individually by the subjects, who were allowed to repeat each stimulus as often as they wished. They gave their responses by clicking on orthographic symbols of the 9 Swedish vowels arranged in the right half of the screen in the manner of an IPA chart. They were told to expect a rather skewed vowel distribution. Prior to each experiment proper, three stimuli were presented for familiarization. The two experiments together lasted roughly 30 minutes.

3 Results
The pooled results of each one of the two experiments are shown in Tables 1 and 2. It can be noticed that in Exp. 1, where subjects had to report the vowels they heard, openness was almost always perceived in agreement with the speaker's intention (99.2%), even when conflicting optic information was presented. Roundedness was perceived much less reliably: 14.7% errors when no face was presented. Many of these errors were evoked by one particular speaker. In cases of incongruent information, roundedness was predominantly perceived in agreement with the optic rather than the acoustic stimulus.

The picture that emerged from Exp. 2, where subjects had to report the vowels they saw by lipreading, was the reverse: presence vs. absence of roundedness was perceived correctly in 98.4% of the cases, and the rounded vowels were mistaken for inrounded (labialized) ones in only 5.4% of the cases, while openness was perceived quite unreliably, with 28.3% errors when no acoustic signal was presented. (One of the speakers elicited substantially fewer errors.) In cases of incongruent information, openness was often perceived in agreement with the acoustic rather than the optic stimulus, but the cross-modal influence was not as strong in lipreading (Exp. 2) as in listening (Exp. 1). This can be seen more immediately in Table 3, which shows the result of linear regression analyses in which the numerical means of the perceived openness and roundedness of each stimulus were taken as the dependent variables.

While the overall close to perfect performance in auditory perception of openness and in visual perception of roundedness by all speakers does not allow any possible between-perceiver differences to show up, such differences emerged, not unexpectedly, in auditory roundedness perception and in visual openness perception (see Table 4). In auditory roundedness perception, the case-wise reliance on vision varied between 31% and 100%. In visual openness perception, the case-wise reliance on audition varied between 28% and 97%. Despite the similarity in range, there was no significant correlation between these two variables (r² = 0.04, p = 0.5), nor was there any significant gender difference in visual perception (p > 0.4). This means that the between-subject variation cannot be explained as due to a subject-specific (or gender-specific) general disposition towards cross-modal integration. In auditory perception, women showed a greater influence of vision, but the gender difference failed to attain significance (two-tailed t-test, equal variance not assumed: p = 0.15), and age was never a significant factor.

Table 1. Confusion matrix for auditory perception. Stimuli: intended vowels presented acoustically and optically. Responses: perceived vowels (letters); shading marks incorrect openness, and roundedness that is incorrect but agrees with the optic stimulus. Boldface: majority response.
Stimuli Responses Sound Face i y u o e ö å ä a * i 9 1 y 17 1 e 23 1 i e 3 i y 85 y i 85 1 y e 84 6 e i 8 e y 1 73

Table 2. Confusion matrix for visual perception. As in Table 1, but shading marks incorrect roundedness, and openness that is incorrect but agrees with the acoustic stimulus.
Stimuli Responses Sound Face i y u o e ö å ä a * i 54 2 y 1 7 3 28 e 9 2 e i 91 4 1 y i 2 31 1 1 i y 1 5 11 e y 3 1 46 i e 65 1 y e 52 2 1

Table 3. Weights of auditory and visual cues in the perception of openness and roundedness.
                     Auditory cues   Visual cues
Heard openness            1.00           0.03
Heard roundedness         0.28           0.68
Seen openness             0.45           0.52
Seen roundedness          0.00           0.97

Table 4. Auditory roundedness and visual openness perception by subject (age in years, sex). Percentage of responses in agreement with the acoustic (Aud) or optic (Vis) stimulus in cases of incongruent information (n=32); these responses are incorrect but agree with the unattended modality.
Age                                  34  27  30  20  41  34  21  23  60  27  20  23  25  59
Sex                                   m   m   f   m   m   f   f   f   m   f   m   f   f   f
Roundedness of vowels heard (Vis)    31  41  50  59  63  66  75  78  91  94  94  97  97 100
Openness of vowels seen (Aud)        59  47  28  34  72  69  97  69  63  72  50  75  28  59


The patterns of confusions can be modelled by weighted summation of the response probabilities for each vowel in the attended modality [listening (A), lipreading (V)] and a Bayesian auditory-visual integration (AV). For the pooled data, linear regression on this basis gives response probabilities P and determination coefficients r² as follows:
Pheard = 0.01 + 0.26 A + 0.71 AV (r² = 0.98) and Pseen = −0.00 + 0.57 V + 0.45 AV (r² = 0.94).
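As an illustration of this model (a sketch added here, not the authors' code), the following Python snippet computes the Bayesian audiovisual term as a renormalised product of the unimodal response probabilities and combines it with the attended unimodal probabilities using the regression coefficients quoted above; the toy probability vectors at the end are invented.

```python
import numpy as np

def bayesian_av(p_auditory, p_visual):
    """Multiplicative (Bayesian) integration of unimodal response probabilities.

    Both inputs are vectors of response probabilities over the same set of
    response alternatives; their product is renormalised to sum to 1.
    """
    p_auditory = np.asarray(p_auditory, dtype=float)
    p_visual = np.asarray(p_visual, dtype=float)
    product = p_auditory * p_visual
    return product / product.sum()

def predicted_heard(p_a, p_v, intercept=0.01, w_a=0.26, w_av=0.71):
    """Weighted-summation model for the 'heard' responses, using the pooled
    regression coefficients reported in the text."""
    return intercept + w_a * np.asarray(p_a) + w_av * bayesian_av(p_a, p_v)

def predicted_seen(p_a, p_v, intercept=0.0, w_v=0.57, w_av=0.45):
    """Weighted-summation model for the 'seen' responses (intercept reported as -0.00)."""
    return intercept + w_v * np.asarray(p_v) + w_av * bayesian_av(p_a, p_v)

# Toy example over three response alternatives (hypothetical unimodal data):
p_a = [0.80, 0.15, 0.05]   # auditory-alone response probabilities
p_v = [0.10, 0.70, 0.20]   # visual-alone response probabilities
print(predicted_heard(p_a, p_v))
print(predicted_seen(p_a, p_v))
```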

4 Discussion
As for auditory perception with and without conflicting visual cues, and for visual perception alone (lipreading), the patterns of confusion observed here agree closely with those obtained previously (Traunmüller & Öhrström). Now, the novel results obtained in visual perception with conflicting auditory cues demonstrate that a visual percept that may be influenced by audition has to be distinguished from the auditory percept that may be influenced by vision, and that the strength of the cross-modal influence is feature-specific in each case.

Based on confusion patterns in consonant perception, it has been claimed that humans behave in accordance with Bayes' theorem (Massaro & Stork, 1998), which allows predicting bimodal response probabilities by multiplicative integration of the unimodal probabilities. Although some of our subjects behaved in agreement with this hypothesis in reporting what they heard, the behaviour of most subjects refutes the general validity of this claim, since it shows a substantial additive influence of the auditory sensation. When reporting what they saw, all subjects except one showed a substantial additive influence of the visual sensation.

Given the unimodal data included in Tables 1 and 2, Bayesian integration lends prominence to audition in the perception of openness and to vision in roundedness. The data make it clear that an ideal perceiver should rely on audition in the perception of openness, as all subjects did in their auditory judgments, and combine this with the roundedness sensed by vision, since this is more reliable when the speaker's face is clearly visible. Four female and two male subjects behaved in this way to more than 90% in reporting what they heard, but only one other female subject in reporting what she saw. The results can be understood as reflecting a weighted summation of sensory cues for features such as openness and roundedness, whereby the weight attached reflects the feature-specific reliability of the information received by each sensory modality (cf. Table 3). The between-perceiver variation then reflects differences in the estimation of this reliability.

Acknowledgements
This investigation has been supported by grant 2004-2345 from the Swedish Research Council. I am grateful to Niklas Öhrström for the recordings and for discussion of the text.

References
Massaro, D., 1996. Bimodal speech perception: a progress report. In D.G. Stork & M.E. Hennecke (eds.), Speechreading by Humans and Machines. Berlin: Springer, 80-101.
Massaro, D.W. & D.G. Stork, 1998. Speech recognition and sensory integration. American Scientist 86, 236-244.
Robert-Ribes, J., M. Piquemal, J-L. Schwartz & P. Escudier, 1996. Exploiting sensor fusion architectures and stimuli complementarity in AV speech recognition. In D.G. Stork & M.E. Hennecke (eds.), Speechreading by Humans and Machines. Berlin: Springer, 193-210.
Robert-Ribes, J., J-L. Schwartz, T. Lallouache & P. Escudier, 1998. Complementarity and synergy in bimodal speech: Auditory, visual and audio-visual identification of French oral vowels in noise. Journal of the Acoustical Society of America 103, 3677-3689.
Traunmüller, H. & N. Öhrström, in press. Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 141 Working Papers 52 (2006), 141–144

Knowledge-light Letter-to-Sound Conversion for Swedish with FST and TBL

Marcus Uneson Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University [email protected]

Abstract
This paper describes some exploratory attempts to apply a combination of finite state transducers (FST) and transformation-based learning (TBL, Brill 1992) to the problem of letter-to-sound (LTS) conversion for Swedish. Following Bouma (2000) for Dutch, we employ FST for segmentation of the textual input into groups of letters and a first transcription stage; we feed the output of this step into a TBL system. With this setup, we reach 96.2% correctly transcribed segments with rather restricted means (a small set of hand-crafted rules for the FST stage; a set of 12 templates and a training set of 30kw for the TBL stage). Observing that quantity is the major error source and that compound morpheme boundaries can be useful for inferring quantity, we exploratively add high-precision, low-recall compound splitting based on graphotactic constraints. With this simple-minded method, targeting only a subset of the compounds, performance improves to 96.9%.

1 Introduction
A text-to-speech (TTS) system which takes unrestricted text as input will need some strategy for assigning pronunciations to unknown words, typically achieved by a set of letter-to-sound (LTS) rules. Such rules may also help in reducing lexicon size, permitting the deletion of entries whose pronunciation can be correctly predicted from rules alone. Outside the TTS domain, LTS rules may be employed for instance in spelling correction, and automatically induced rules may be interesting for reading research. Building LTS rules by hand from scratch is easy for some languages (e.g., Finnish, Turkish), but turns out prohibitively laborious in most cases. Data-driven methods include artificial neural networks, decision trees, finite-state methods, hidden Markov models, transformation-based learning and analogy-based reasoning (sometimes in combination). Attempts at fully automatic, data-driven LTS for Swedish include Frid (2003), who reaches 96.9% correct transcriptions on segment level with a 42000-node decision tree.

2 The present study
The present study tries a knowledge-light approach to LTS conversion, first applied by Bouma (2000) to Dutch, which combines a manually specified segmentation step (by finite-state transducers, FST) and an error-driven machine learning technique (transformation-based learning, TBL). One might think of the first step as redefining the alphabet size, by introducing new, combined letters, and the second as automatic induction of reading rules on that (redefined) alphabet, ordered in sequence of relevance. For training and evaluation, we used disjoint subsets of a fully morphologically expanded form of Hedelin et al. (1987). The expanded lexicon holds about 770k words (including proper nouns; these and other words containing characters outside the lowercase Swedish alphabet were discarded).

2.1 Finite-state transduction (FST)
Many NLP tasks can be cast as string transformation problems, often conveniently attacked with context-sensitive rewrite rules (which can be compiled directly into FST). Here, we first use an FST to segment input into segments or letter groups, rather than individual letters. A segment typically corresponds to a single sound (and may have one member only). Treating a sequence of letters as a group is in principle meaningful whenever doing so leads to more predictable behaviour. Clearly, however, there is an upper limit on the number of groups, if the method should justifiably be called ‘knowledge-light’. For Swedish, some segments close at hand are {[s,c,h], [s,s], [s,j], [s,h], [c,k], [k], [k,j]…}; the set used in the experiments described here has about 75 members. Segmentation is performed on a leftmost, longest basis, i.e., that rule is chosen which results in as early a match as possible, the longest possible one if there are several candidates. All following processing now deals with segments rather than individual letters. After segmentation, markers for begin- and end-of-word are added, and the (currently around 30) hand-written replace rules are applied, again expressed as transducers or compositions of transducers. These context-sensitive replace rules may encode well-known reading rules of Swedish (for instance, rules conditioning the pronunciation of a grapheme on a following letter sequence in morpheme-initial position), or try to capture other partial regularities (Olsson 1998). Most rules deal with vowel quantity and/or particular graphemes, reflecting typical difficulties in Swedish orthography. The replacement transducer is implemented such that each segment can be transduced at most once. A set (currently around 60) of context-less, catch-all rules provide default mappings. To illustrate the FST steps, consider the word skärning ‘cut’ after each transduction:
input: skärning
segment: sk-ä-r-n-i-ng
marker: #-sk-ä-r-n-i-ng-#
transduce: #-S+<:+r-n-I-N+#
remove marker: S<:rnIN
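To illustrate the leftmost-longest segmentation step, here is a minimal Python sketch. The group inventory below mixes groups mentioned in the text with a few assumed additions (sk, ng) so that the skärning example goes through; the real system uses about 75 groups and compiles the segmentation into a transducer rather than running a Python loop.

```python
# Toy letter-group inventory (partly invented); the real set has ~75 members.
GROUPS = {"sch", "ss", "sj", "sh", "ck", "kj", "sk", "ng"}

def segment(word, groups=GROUPS, max_len=3):
    """Leftmost-longest segmentation of a word into letter groups.

    At each position the longest matching group is chosen; a single letter
    is the fallback, so a match always starts as early as possible.
    """
    out, i = [], 0
    while i < len(word):
        for length in range(max_len, 0, -1):
            chunk = word[i:i + length]
            if length == 1 or chunk in groups:
                out.append(chunk)
                i += length
                break
    return out

print(segment("skärning"))   # ['sk', 'ä', 'r', 'n', 'i', 'ng']
```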

2.2 Transformation-based learning (TBL)
TBL was first proposed for part-of-speech tagging by Eric Brill (1992). TBL is, generally speaking, a technique for automatic learning of human-readable classification rules. It is especially suited for tasks where the classification of one element depends on properties or features of a small number of other elements in the data, typically the few closest neighbours in a sequence. In contrast to the opaque problem representation in stochastic approaches, such as HMMs, the result of TBL training is a human-readable, ordered list of rules. Application of the rules to new material can again be implemented as FSTs and thus be very fast. For the present task, we employed the µ-TBL system (Lager 1999). It provides an interface for scripting as well as an interactive environment, and Brill's original algorithm is supplemented by much faster Monte Carlo rule sampling. The templates were taken from Brill (1992), omitting disjunctive contexts (e.g., “A goes to B when C is either 1 or 2 before”), which are less relevant to LTS conversion than to POS tagging.
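For readers unfamiliar with TBL, the following Python sketch shows the core error-driven learning loop with a single invented rule template ("change tag OLD to NEW when the previous segment is LEFT"). It is a toy illustration only; it bears no relation to the actual µ-TBL implementation, to Monte Carlo rule sampling, or to the 12 templates used in the experiments. The score threshold mirrors the stopping criterion of 2 mentioned in Table 1 below.

```python
def tbl_train(segments, initial, gold, min_score=2):
    """Minimal greedy TBL loop: repeatedly pick the rule with the best net
    error reduction, apply it, and stop when no rule scores >= min_score."""
    current = list(initial)
    rules = []
    while True:
        # Candidate rules are proposed at current error sites.
        candidates = {(current[i], gold[i], segments[i - 1])
                      for i in range(1, len(segments)) if current[i] != gold[i]}

        def net_score(rule):
            old, new, left = rule
            score = 0
            for i in range(1, len(segments)):
                if current[i] == old and segments[i - 1] == left:
                    score += (new == gold[i]) - (current[i] == gold[i])
            return score

        if not candidates:
            break
        best = max(candidates, key=net_score)
        if net_score(best) < min_score:
            break
        rules.append(best)
        old, new, left = best
        for i in range(1, len(segments)):
            if current[i] == old and segments[i - 1] == left:
                current[i] = new
    return rules, current

# Toy data (invented): naive letter-identity guesses vs. a pretend gold transcription.
letters = list("backen")
print(tbl_train(letters, list("backen"), list("bakken"), min_score=1))
```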

2.3 Compound segmentation (CS)
The most important error source by far is incorrectly inferred quantity. In contrast to Dutch, for which Bouma reports 99% with the two steps above (and a generally larger setup, with 500 TBL templates), quantity is not explicitly marked in Swedish orthography. One might suspect that this kind of error might be remedied if compounds and their morpheme boundaries could be identified in a preprocessing step. Many rules are applicable at the beginning or end of morphemes rather than words; we could provide context for more rules if only we knew where the morpheme boundaries are. Compound segmentation (CS) could also help in many difficult cases where the suffix of one component happens to form a letter group when combined with the prefix of the following component. Ideally, segments should not span morpheme boundaries: a letter sequence should be treated as a segment only where it does not straddle such a boundary. In order to explore this idea while still minimizing dependencies on lexical properties, we implemented a simple compound splitter based on graphotactic constraints. An elaborate variant of such a non-lexicalized method for Swedish was suggested by Brodda (1979). He describes a six-level hierarchy for consonant clusters according to how much information they provide about a possible segmentation point, from certainty (as -rkkl- in kyrkklocka ‘church bell’) to none at all (as -gr- in vägren ‘verge (road)’). For the purposes of this study, we targeted the safe cases only (on the order of 30-40% of all compounds). Thus, recall is poor but precision good, which at least should be enough to test the hypothesis.
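A toy sketch of such high-precision, low-recall splitting is given below (an illustration, not the splitter used in the study); only the -rkkl- cluster is taken from the text, and the placement of the split point is a crude placeholder for the real graphotactic analysis.

```python
import re

# 'Safe' medial clusters that, following Brodda's top certainty level, are taken
# to arise only across a compound boundary; a fuller inventory would have to be
# derived from Brodda (1979).
SAFE_CLUSTERS = ["rkkl"]

def split_compound(word, clusters=SAFE_CLUSTERS):
    """Insert a '+' boundary inside a safe cluster (high precision, low recall).

    The split point is assumed to fall before the last two letters of the
    cluster, i.e. before the onset of the second member in the toy example.
    """
    for cluster in clusters:
        match = re.search(cluster, word)
        if match:
            cut = match.end() - 2
            return word[:cut] + "+" + word[cut:]
    return word   # most compounds are left untouched

print(split_compound("kyrkklocka"))   # -> kyrk+klocka
print(split_compound("vägren"))       # -> vägren (no safe cluster, no split)
```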

3 Results
3.1 Evaluation measure
The most common LTS evaluation measure is the Levenshtein distance between output string and target. For the practical reason of convenient error analysis and comparability with Frid (2003) we follow this, but we note that the measure has severe deficiencies. Thus, all errors are equally important – exchanging [e] for a perceptually similar vowel is considered just as bad as exchanging [t] for [a]. Furthermore, different lexica have different levels of granularity in their transcriptions, leading to rather arbitrary ideas about what ‘right’ is supposed to mean. For future work, some phonetically motivated distance measure, such as the one suggested by Kondrak (2000), seems a necessary supplement.
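For reference, the measure itself is the standard edit distance; the snippet below is a generic dynamic-programming implementation (an illustration, not the evaluation code used in the study), together with one plausible way of turning it into a per-segment score.

```python
def levenshtein(hyp, ref):
    """Minimum number of insertions, deletions and substitutions (all cost 1)
    needed to turn the segment sequence `hyp` into `ref`; standard DP solution."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # delete h
                            curr[j - 1] + 1,          # insert r
                            prev[j - 1] + (h != r)))  # substitute h -> r
        prev = curr
    return prev[-1]

def segment_score(hyp, ref):
    """One common way of turning edit distance into a per-segment score."""
    return 1.0 - levenshtein(hyp, ref) / len(ref)

print(levenshtein(list("takka"), list("taka")))            # 1 (one extra 'k')
print(round(segment_score(list("takka"), list("taka")), 2))  # 0.75
```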

Table 1. Results and number of rules for combinations of CS, FST, and TBL. 5-fold cross-validation. Monte Carlo rule sampling. Score threshold (stopping criterion) = 2. The baselines (omitting TBL) are 80.1% (default mappings); 86.6% (FST step only); 88.3% (CS + FST).

Training data        TBL                  FST + TBL            CS + FST + TBL
segments  words      results %  #rules    results %  #rules    results %  #rules
49k       5k         93.8       820       94.9       503       95.5       513
98k       10k        94.1       1131      95.0       761       95.7       809
198k      20k        95.2       1690      95.7       1275      96.5       1250
300k      30k        95.7       2225      96.2       1862      96.9       1756

3.2 Discussion
Some results are given in Table 1. In short, both with and without the TBL step, adding handwritten rules to the baseline improves system performance (and TBL training time) significantly, as does adding the crude CS algorithm. The number of learnt rules is sometimes high. However, although space constraints do not allow the inclusion of a graph here, rule efficiency declines quickly (as is typical for TBL), and the first few hundred rules are by far the most important. We note that the major error source is still incorrectly inferred quantity. We have stayed at the segmental level of lexical transcription, with no aim of modelling contextual processes. Although this approach would need (at the very least) postprocessing for many applications, it might be enough for others, such as spelling correction. Result-wise, it seems that the current approach can challenge Frid's (2003) results (96.9% on a much larger (70kw) training corpus), while still retaining the advantage of the more interpretable rule representation. Frid goes on to predict lexical prosody; we hope to return to this topic.

4 Future directions
Beyond incorporating more sophisticated compound splitting, there are several interesting directions. The template set is currently small. Likewise, the feature set for each corpus position may be extended in other ways, for instance by providing classes of graphemes – C and V are a good place to start, but place or manner of articulation for consonants and frontness for vowels might also be considered. Such classes might help in finding generalizing rules over, say, front vowels or nasals, and might help where data is sparse; the extracted rules are also likely to be more linguistically relevant. If so, segments should preferably be chosen such that they fall cleanly into classes. Another, orthogonal approach is “multidimensional” TBL (Florian & Ngai 2001), i.e., TBL with more than one variable. For instance, the establishment of the stress pattern may determine the phoneme transcription, or the other way round. For most TBL systems, rules can change one prespecified attribute only (although many attributes may provide context). This is true for µ-TBL as well; however, we are currently considering an extension. Also interesting is the idea of trying to predict quantity and stress reductively, with Constraint Grammar-style reduction rules (i.e., “if Y, remove tag X from the set of possible tags”). Each syllable is assigned an initial set of all possible stress levels, a set which is reduced by positive rules (‘the ending -ör# has main stress; thus its predecessor does not’) as well as negative rules (a given ending never takes stress). µ-TBL conveniently supports reduction rules.

References
Bouma, G., 2000. A finite state and data oriented method for grapheme to phoneme conversion. Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics, Seattle, WA.
Brill, E., 1992. A simple rule-based part of speech tagger. Third Conference on Applied Natural Language Processing, ACL.
Brodda, B., 1979. Något om de svenska ordens fonotax och morfotax: Iakttagelse med utgångspunkt från automatisk morfologisk analys. PILUS 38. Institutionen för lingvistik, Stockholms universitet.
Florian, R. & G. Ngai, 2001. Multidimensional Transformation-Based Learning. Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), Toulouse.
Frid, J., 2003. Lexical and Acoustic Modelling of Swedish Prosody. PhD Thesis. Travaux de l'institut de linguistique de Lund 45. Dept. of Linguistics, Lund University.
Hedelin, P., A. Jonsson & P. Lindblad, 1987. Svenskt uttalslexikon (3rd ed.). Technical report, Chalmers University of Technology.
Kondrak, G., 2000. A new algorithm for the alignment of phonetic sequences. Proceedings of the First Conference of the North American Chapter of the ACL, Morgan Kaufmann Publishers Inc, 288-295.
Lager, T., 1999. The µ-TBL System: Logic Programming Tools for Transformation-Based Learning. Third International Workshop on Computational Natural Language Learning (CoNLL-1999), Bergen.
Olsson, L-J., 1998. Specification of phonemic representation, Swedish. DEL 4.1.3 of EC project “SCARRIE Scandinavian proof-reading tools” (LE3-4239).

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 145 Working Papers 52 (2006), 145–148

The Articulation of Uvular Consonants: Swedish

Sidney Wood Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University [email protected]

Abstract
The articulation of uvular consonants is studied with particular reference to quantal aspects of speech production. Data from X-ray motion films are presented. Two speakers of Southern Swedish give examples of [ʁ]. The traditional view, that uvular consonants are produced by articulating the tongue dorsum towards the uvula, is questioned, and theoretical considerations point instead to the same upper pharyngeal place of articulation as for [o]-like vowels. The X-ray films disclose that these subjects did indeed constrict the upper pharynx for [ʁ].

1 Introduction
1.1 The theory of uvular articulations
This study begins by questioning the classical account of uvular consonant production (e.g. Jones, 1964), according to which the tongue dorsum is raised towards the uvula, and the uvula vibrates for a rolled [ʀ]. Firstly, it is not clear how a vibrating uvula would produce the acoustic energy of a typical rolled [ʀ]. A likely process exploits a Bernoulli force in the constricted passage to chop the voiced sound into pulses when air pressure and tissue elasticity are suitably balanced, which requires that intermittent occlusion is possible between pulses. Unfortunately, there are free air passages on either side of the uvula that should prevent this from happening. Secondly, these same passages should likewise prevent complete occlusion for a uvular stop, and they should also prevent the Reynolds number from becoming large enough for the turbulence of uvular fricatives. If the uvula is not a good place for producing consonants known as “uvular”, how else might they be produced? Wood (1974) observed that the spectra of vowel-to-consonant transitions immediately adjacent to uvular consonants were very similar to the spectra of [o]-like vowels, or to their respective counterparts, and concluded that they shared the same place of articulation, i.e. the upper pharynx, confirmed for [o]-like vowels by Wood (1979). Mrayati et al. (1988) studied the spectral consequences of systematic deformations along an acoustic tube, and also concluded that the upper pharynx was a suitable location for these same consonants and vowels. Observations like this are obviously relevant for discussions of the quantal nature of speech (Stevens 1972, 1989). Clarifying the production of uvular consonants is not just a matter of correcting a possible misconception about a place of articulation. It concerns fundamental issues of phonetic theory.

1.2 This investigation
The uvular articulations were analysed from cinefluorographic films, a method that enables simultaneous articulatory activity to be observed in the entire vocal tract, and is therefore suitable for studying the tongue manoeuvres associated with uvular consonants. Two undisputed sources of uvular consonants are [ʁ] in southern Swedish, and [q, χ] in West Greenlandic Inuit. The subjects of the films are native speakers of these languages. Examples from the Greenlandic subject have been published in e.g. Wood (1996a-b, 1997). Examples from one Swedish subject are reported in this paper. Examples from a second Swedish subject will be presented at the conference.

2 Procedures
Wood (1979) gives details of how the films were made. One reel of 35mm film was exposed per subject at an image rate of 75 frames/second (i.e. 1 frame every 13.3ms), allowing about 40 seconds per subject. Each frame received a 3ms exposure.

Figure 1. (a) Example of profile tracing and identification of prominent features. Note the difference between the tongue midline and the edge contours. (b) Examples of tongue body and tongue blade manoeuvres (five successive film frames in this instance).

In the film by SweA, Swedish sibilants were commuted through different vowel environments. The uvular variant of Swedish /r/ occurs in the present indicative verb ending {/ar/} followed by the preposition {/i:/}, yielding several tokens of the sequence [aʁi].
In the film by SweB, the long vowels of Swedish (diphthongized in this dialect) were placed in a /bVd/ environment. The uvular variant of Swedish /r/ occurs where the subject recited the date and location of the film session. The word “four”, fyra (/fy:ra/), is reported here, yielding the sequence [yʁa].

3 Examples from the subject SweB

The frame-by-frame tongue body movement by SweB in [yʁa] is summarised in Figure 2. The sequence of profiles from [y] through [ʁ] to [a] is shown in Figures 3-5 (every other film frame, i.e. about 27ms between each illustration).

Figure 2. Subject SweB. Frame-by-frame tongue body movement (13.3ms for each step) in the transition from [y] to [ʁ] (left) and from [ʁ] to [a] (right). The numbers refer to frames on the film.

Figure 3. Every other profile from the sequence [yʁa], starting from the most complete /y:/ profile (left, frame 2298): tongue body raised towards the hard palate and lips rounded. The tongue body was then retracted for the transition to [ʁ], and the lip rounding withdrawn (2300 centre, 2302 right). Continued in Figure 4.

Figure 4. Profiles from the sequence [yʁa], continued from Figure 3. The transition to [ʁ] continued to frame 2304 (left), concluding with a narrow pharyngeal constriction (circled). This retraction was accompanied by slight depression, so that the tongue dorsum passed below the uvula and was directed into the upper pharynx. The lip rounding of /y:/ is still being withdrawn. Activity for /a/ was then commenced, continuing through frames 2306 (centre) and 2308 (right). The tongue body gesture of /a/ is directed towards the lower pharynx, accompanied by mandibular depression. The velar port opened slightly in frame 2308 (right) (this sequence is phrase final and was followed by a breathing pause). Continued in Figure 5.

Figure 5. Profiles from the sequence [yʁa], continued from Figure 4. The transition from [ʁ] to [a] continued through frame 2310 (left) to frame 2312 (right), concluding with a narrow low pharyngeal constriction (circled), as expected from Wood (1979).

4 Discussion and conclusions
The retracting tongue body manoeuvre from [y] to [ʁ], seen in Figure 1 (left) and in profiles 2300 to 2304 in Figures 2 and 3, was depressed slightly. Consequently it passed below the uvula and continued into the pharynx. For this instance of [ʁ], the subject did not elevate the tongue dorsum towards the uvula. Similar behaviour was exhibited by the West Greenlandic subject for [q, χ], and by the second Swedish subject whose results will be presented at the conference.

The target of the tongue body gesture of [ʁ] was the upper pharynx, as hypothesized. This was also the case in the other data to be reported at the conference. The upper pharynx is also the region that is constricted for [o]-like vowels, which means that this one place of articulation is shared by all these consonants and vowels. The upper pharynx is a more suitable place than the uvula for producing “uvular” stops, fricatives and trills. The soft, smooth, elastic surfaces of the posterior part of the tongue and the opposing posterior pharyngeal wall allow perfect occlusion, or the creation of apertures narrow enough for the generation of turbulence.

References
Jones, D., 1964. An Outline of English Phonetics. Cambridge: W. Heffer & Sons Ltd. (9th amended edition).
Mrayati, M., R. Carré & B. Guérin, 1988. Distinctive regions and modes: a new theory of speech production. Speech Communication 7, 257-286.
Stevens, K.N., 1972. The quantal nature of speech: Evidence from articulatory-acoustic data. In P.B. Denes & E.E. David, Jr. (eds.), Human Communication: A Unified View. New York: McGraw Hill, 243-255.
Stevens, K.N., 1989. On the quantal nature of speech. In J. Ohala (ed.), On the Quantal Nature of Speech. Journal of Phonetics 17, 3-45.
Wood, S.A.J., 1974. A spectrographic study of allophonic variation and vowel reduction in West Greenlandic Eskimo. Working Papers 4, Dept. of Linguistics and Phonetics, University of Lund, 58-94.
Wood, S.A.J., 1979. A radiographic analysis of constriction locations for vowels. Journal of Phonetics 7, 25-43.
Wood, S.A.J., 1996a. Temporal coordination of articulator gestures: an example from Greenlandic. Journal of the Acoustical Society of America (A). Poster presented at the 131st meeting of the Acoustical Society of America, Indianapolis.
Wood, S.A.J., 1996b. The gestural organization of vowels: a cinefluorographic study of articulator gestures in Greenlandic. Journal of the Acoustical Society of America 100, 2689 (A). Poster presented at the Third Joint Meeting of the Acoustical Societies of America and Japan, Honolulu.
Wood, S.A.J., 1997. A cinefluorographic study of the temporal organization of articulator gestures: Examples from Greenlandic. In P. Perrier, R. Laboissière, C. Abry & S. Maeda (eds.), Speech Production: Models and Data (Papers from the First ESCA Workshop on Speech Modeling and Fourth Speech Production Seminar, Grenoble 1996). Speech Communication 22, 207-225.

Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 149 Working Papers 52 (2006), 149–152

Acoustical Prerequisites for Visual Hearing

Niklas Öhrström and Hartmut Traunmüller
Department of Phonetics, Stockholm University
{niklas.ohrstrom|hartmut}@ling.su.se

Abstract
The McGurk effect shows clearly that visual information from a speaker's articulatory movements influences auditory perception. The present study concerns the robustness of such speech-specific audiovisual integration: what are the acoustical prerequisites for audiovisual integration to occur in speech perception? Auditory, visual and audiovisual syllables (phonated and whispered) were presented to 23 perceivers. In some of the stimuli, the auditory signal was replaced by a schwa syllable, a dynamic source signal or a constant source signal. The results show that dynamic spectral information from a source signal suffices as auditory input for speech-specific audiovisual integration to occur. The results also confirm that the type (and absence) of lip rounding are strong visual cues.

1 Introduction
Visual contribution to speech comprehension was for a long time ignored by theorists and only accounted for when the auditory speech signal was degraded (Sumby & Pollack, 1954). However, McGurk and MacDonald (1976) showed that auditory speech perception could be altered by vision even when the auditory stimulus lacked ambiguity. They used [baba] and [gaga] syllables dubbed onto visual stimuli with different consonants. A visual [gaga] dubbed onto an auditory [baba] evoked the percept of /dada/. A visual [baba] dubbed onto an auditory [gaga] was often perceived as /gabga/ or /bagba/. This demonstrated ordinary speech perception to be a bimodal process in which optic information about a speaker's articulatory movements is integrated into auditory perception. Traunmüller and Öhrström (in press) have demonstrated that this also holds for vowels. It has been shown experimentally that the perception of features such as labiality and lip rounding is dominated by the visual signal. In addition, it is worth mentioning that crossmodal illusions are not necessarily restricted to speech perception: Shams et al. (2000) demonstrated that the visual perception of the numerosity of flashes can be altered by simultaneous auditory presentation of clicks.

Bimodal speech perception normally involves synchrony between the auditory and the visual information from a speaker's articulatory movements. Visual information can, therefore, be expected to have a substantial influence on auditory speech perception, but visual hearing might require the presence of a more or less authentic acoustic speech signal. This study aims at exploring the acoustical prerequisites for visually influenced auditory perception to occur. How much information from the original acoustic signal can we remove and still evoke visual hearing? In this study the four long Swedish vowels / i/, / u/, / / and / / will be tested (appearing both phonated and whispered in a [ b_d ] frame). In the first condition the formant frequencies of the vowel will be changed (in this case a [ə] will be used). In the second condition the formant peaks will be flattened out, whereby an approximate source signal will be obtained. In the third condition, the formant peaks will be flattened out and the acoustic signal will be kept in a steady state. It can be expected that at least the visible type of lip rounding will have an influence on auditory perception.

2 Method
2.1 Speakers and speech material
One male and one female lecturer from the Department of Linguistics served as speakers. The recordings took place in an anechoic chamber using a Panasonic NVDS11 video camera and a Brüel & Kjær 4215 microphone. The speakers produced the more or less meaningful Swedish syllables /bid/, /bud/, /bd/ and /bd/ in both phonated and whispered fashion. They were also asked to produce [bəd]. The video recordings were captured in DV format and the acoustic signal was recorded separately (sf = 44.1 kHz, 16 bits/sample, mono). The acoustic recordings were subsequently manipulated in different ways using Praat (Boersma & Weenink, 2006). Firstly, all acoustic syllables were resynthesized (sf = 11 kHz). The resynthesis was carried out using the Praat algorithm "LPC-burg". The [bəd] syllable was also resynthesized with the formant bandwidths expanded to Bn = 2Fn. The spectrally flattened schwa in this syllable is most similar to a source signal.
Finally, to obtain a constant spectrally flattened signal, one glottal period of this schwa was selected and iterated. To obtain a constant whispered spectrally flattened signal, a 25 ms window of the spectrally flattened whispered schwa was subjected to LPC analysis and resynthesized with increased duration. The final audiovisual stimuli were obtained by synchronizing the video signals with the manipulated audio signals. The synchronization was based on the release burst of the first consonant and performed in Premiere 6.5. The constant spectrally flattened signals were made equal in duration to the whole visual stimuli (approximately 2 s). Each optic stimulus (except [bəd]) was presented together with its corresponding auditory one, the acoustic schwa vowel [ə], the spectrally flattened signal (SF) and the constant spectrally flattened signal (CSF). Each visual syllable (except [bəd]) and each auditory stimulus was also presented alone. In this way, 54 stimuli were obtained for each speaker. The total perception task consisted of two blocks. Block one consisted of 92 audio (A) and audiovisual (AV) stimuli, in which each stimulus was presented once in random order. Block two consisted of 16 visual (V) stimuli, each presented twice in random order.
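The signal manipulations described above are straightforward to prototype outside Praat as well. The sketch below, in Python, is only a rough single-frame approximation of the procedure, not the authors' actual scripts: the input file name, LPC order and assumed fundamental frequency are illustrative placeholders, and Praat's frame-by-frame "LPC-burg" resynthesis is replaced here by a single whole-signal Burg analysis.

import numpy as np
import librosa                    # librosa.lpc implements Burg's method
import soundfile
from scipy.signal import lfilter

# Hypothetical input file: any mono recording of the schwa syllable would do.
y, fs = librosa.load("bed_schwa.wav", sr=11000)   # resample to 11 kHz

order = 12                        # LPC order is an assumption, not stated in the paper
a = librosa.lpc(y, order=order)   # all-pole model; a[0] == 1

# Inverse-filter with A(z) to obtain the excitation (residual) signal.
residual = lfilter(a, [1.0], y)

# Widen the formant bandwidths by pulling the LPC poles towards the origin.
# For a pole r*exp(i*theta): F = theta*fs/(2*pi) and B = -(fs/pi)*ln(r),
# so the target B_n = 2*F_n corresponds to a new radius r' = exp(-theta).
poles = np.roots(a)
theta = np.abs(np.angle(poles))
r_new = np.minimum(np.exp(-theta), 0.99)          # clamp just below 1 for stability
a_flat = np.real(np.poly(r_new * np.exp(1j * np.angle(poles))))

# Resynthesize: drive the widened all-pole filter with the residual ("SF" stimulus).
sf_signal = lfilter([1.0], a_flat, residual)

# "CSF" stimulus: select one glottal period of the flattened schwa and iterate it
# until it matches the ~2 s duration of the visual stimuli.
f0 = 120.0                                        # assumed fundamental frequency
period = int(round(fs / f0))
start = len(sf_signal) // 2
one_cycle = sf_signal[start:start + period]
csf_signal = np.tile(one_cycle, int(2.0 * fs / period))

soundfile.write("sf_stimulus.wav", sf_signal, fs)
soundfile.write("csf_stimulus.wav", csf_signal, fs)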

2.2 Perceivers
Twenty-three subjects who reported normal hearing and normal or corrected-to-normal vision (11 male, aged 17 to 52 years, and 12 female, aged 20 to 50 years) served as perceivers. All were phonetically naïve native listeners of Swedish.

2.3 Perception task
The perceivers wore AKG 135 headphones and were seated with their faces approximately 50 cm from a computer screen. All the stimuli were presented using Windows Media Player. During block 1 (which contained AV and A stimuli), the perceivers were instructed to report what they had heard. They were also instructed to always look at the speaker whenever one was shown. The perceivers were allowed to repeat each stimulus as many times as they wished. If they had heard a [bVd] syllable (which could appear in a very distinct or vague manner), they were asked to report which one of the nine long Swedish vowels it resembled the most. They gave their responses by clicking on orthographic symbols of the Swedish vowels (a //, e /e/, i /i/, o /u/, u //, y /y/, å /o/, ä //, ö /ø/), arranged in the right half of the screen in the manner of an IPA chart. There was a response alternative "hör ingen vokal" ('hear no vowel') right under the chart. This was to be used when no syllable was heard or when the sound was not heard as a human vowel. During block 2 (which contained optic stimuli only) the perceivers were instructed to report the vowel perceived through lipreading. As before, they were allowed to repeat each stimulus as many times as they wished. The whole experiment lasted approximately 30 minutes.

3 Results
The responses of all subjects to each stimulus combination, whispered and phonated versions pooled, are shown in Table 1. It can be seen that the responses to congruent AV stimuli and to auditorily presented vowels are in accord with the speaker's intention. With vowels presented visually only, there were many confusions. The unrounded /i/ and // were mostly confused with other unrounded vowels. The in-rounded /u/ was predominantly confused with other in-rounded vowels (// and /o/). The out-rounded // was mostly confused with other out-rounded vowels (in this case with /ø/) and, to some extent, with in-rounded vowels. The auditory [ə] was almost exclusively categorized as an out-rounded vowel (// or /ø/), and incongruent visual cues, such as absence of lip rounding, contributed only marginally to the auditory perception.

Table 1. Responses from all subjects (in %) to all stimuli, whispered and phonated versions pooled. Boldface: most frequent response. A: acoustic cues, V: optic cues, SF: spectrally flattened [bəd], CSF: constant spectrally flattened [ə]. "*": no audible vowel.

A      V      Responses (/i/ /y/ // /u/ /e/ /ø/ /o/ // // *)
/i/    /i/    99 1
/u/    /u/    89 11
//     //     11 1 86 2
//     //     100
/i/    -      99 1
/u/    -      86 14
//     -      10 2 88
//     -      100
-      /i/    70 24 1 4 2
-      /u/    1 1 7 87 4
-      //     6 18 1 66 10
-      //     1 1 19 5 74
[ə]    -      48 52
[ə]    /i/    5 38 3 53
[ə]    /u/    49 2 49
[ə]    //     1 35 3 60 1
[ə]    //     50 50
SF     -      7 1 1 9 1 22 1 2 48 9
SF     /i/    29 4 2 18 3 39 3
SF     /u/    2 48 9 4 1 27 9
SF     //     10 4 4 13 13 48 8
SF     //     2 9 21 2 1 61 4
CSF    -      2 9 1 1 7 80
CSF    /i/    5 2 5 1 3 83
CSF    /u/    1 7 4 2 4 82
CSF    //     2 5 4 2 86
CSF    //     1 1 9 10 79
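The pooling behind Table 1 can be reproduced from a raw response log with a few lines of Python; the sketch below is illustrative only, and its file name and column names are assumptions rather than anything taken from the paper.

import pandas as pd

# Hypothetical long-format response log (one row per trial).
# Expected columns: subject, A (auditory stimulus), V (visual stimulus),
# phonation ('phonated' or 'whispered'), response (vowel label or '*').
responses = pd.read_csv("responses.csv")

# Pool whispered and phonated versions and express counts as row percentages,
# which is how Table 1 is laid out.
counts = pd.crosstab(index=[responses["A"], responses["V"]],
                     columns=responses["response"])
percent = counts.div(counts.sum(axis=1), axis=0).mul(100).round()

print(percent)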


When presented in auditory mode alone, the spectrally flattened vowel (SF) was mostly categorized as out-rounded (in 70% of the cases as // or /ø/) but also, to some extent, as an unrounded or in-rounded vowel. When the auditory source signal was presented with different visual cues, type of rounding was very often perceived in accord with the visual stimulus. The constant source signal (CSF) was not very often identified as a human vowel or syllable, but there were traces of influence from the visual signal.

4 Discussion
These experiments have shown that the auditory perception of an approximate acoustic source signal (SF) is sensitive to visual input. In this case, the type of rounding was often perceived in accord with the visual signal. Interestingly, there was a perceptual bias towards /ø/ and // for the stimuli containing acoustic [bəd] syllables, SF signals and CSF signals, although [ə] is undefined with respect to its roundedness. It is obvious that the SF and CSF signals still contain some acoustic traces of the [bəd]. In this study the [ə]s produced by the two speakers were categorized as rounded vowels. A possible explanation is that the Swedish phonological system does not offer any unrounded vowels, except possibly //, in this region of the vowel space. Therefore, it cannot be excluded that subjects actually heard an unrounded vowel for which they lacked a response alternative, but coarticulation effects from the initial labial consonant might also have caused a bias in favor of rounded vowels.

The auditory perception of the acoustic schwa vowel was not much influenced by the visual signal. This could be due to the fact that a naturally articulated schwa has a definite place in the auditory vowel space, since its formant peaks are distinct. This makes the acoustic cues quite salient and leaves only a small space for visual influence. The approximate acoustic source signal with preserved dynamics (SF), on the other hand, evoked the perception of a human vocalization, although the place of the vowel in the auditory vowel space was only vaguely suggested by its acoustic cues, which were much less salient. This gives the visual information about a speaker's articulatory movements an opportunity to be integrated into the auditory percept. The constant source signal (CSF) lacked both dynamic properties and a distinct place in the vowel space. It also lacked the temporal alignment with the visual signals that was present in the other stimuli. It was, therefore, perceived as a sound separate from the human utterance that the visible articulatory movements suggested. Thus, it appears that visual hearing of speech requires the presence of an acoustic signal that can easily be interpreted as belonging together with the visual signal.

Acknowledgements
This investigation has been supported by grant 2004-2345 from the Swedish Research Council.

References
Boersma, P. & D. Weenink, 2006. Praat – a system for doing phonetics by computer. http://www.fon.hum.uva.nl/praat/.
McGurk, H. & J. MacDonald, 1976. Hearing lips and seeing voices. Nature 264, 746-748.
Shams, L., Y. Kamitani & S. Shimojo, 2000. What you see is what you hear. Nature 408, 788.
Sumby, W.H. & I. Pollack, 1954. Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 26, 212-215.
Traunmüller, H. & N. Öhrström, in press. Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics.