
Proceedings from FONETIK 2014 Stockholm, June 9-11, 2014

PERILUS XXIV, June 2014

Edited by Mattias Heldner

© The Authors and the Department of Linguistics, Stockholm University, 2014

ISSN 0282-6690
ISBN 978-91-637-5662-7 (printed version)
ISBN 978-91-637-5663-4 (web version)

Printed by US-AB, Stockholm 2014
Distributor: Department of Linguistics, Stockholm University

Dedication

This conference is jointly dedicated to Professor Björn Lindblom and Professor Anders Eriksson on the occasion of their 150th birthday June 19, 2014. The dedication is weighted with 53.3% to the former, and 46.7% to the latter of these. You phoneticians do the math.

Previous Swedish Phonetic Conferences

I 1986 Uppsala University
II 1988 Lund University
III 1989 KTH Stockholm
IV 1990 Umeå University (Lövånger)
V 1991 Stockholm University
VI 1992 Chalmers and Göteborg University
VII 1993 Uppsala University
VIII 1994 Lund University (Höör)
- 1995 (XIIIth ICPhS in Stockholm)
IX 1996 KTH Stockholm (Nässlingen)
X 1997 Umeå University
XI 1998 Stockholm University
XII 1999 Göteborg University
XIII 2000 Skövde University College
XIV 2001 Lund University
XV 2002 KTH Stockholm
XVI 2003 Umeå University (Lövånger)
XVII 2004 Stockholm University
XVIII 2005 Göteborg University
XIX 2006 Lund University
XX 2007 KTH Stockholm
XXI 2008 Göteborg University
XXII 2009 Stockholm University
XXIII 2010 Lund University
XXIV 2011 KTH Stockholm
XXV 2012 University of Gothenburg
XXVI 2013 Linköping University

Preface

This volume contains the contributions to FONETIK 2014, the XXVIIth Swedish Phonetics Conference, organized by the Department of Linguistics at Stockholm University, June 9-11, 2014. The papers appear in the order in which they were given at the conference.

Only a limited number of copies of this publication have been printed for distribution among the authors and those attending the conference. For access to electronic versions of the contributions, please visit http://www.ling.su.se/fonetik2014.

We would like to thank all contributors to the Proceedings. We would also like to thank Språkstudion at Stockholm University for hosting the social event on the first evening of the conference. Finally, we would like to thank Fonetikstiftelsen for financial support.

Stockholm in May 2014

On behalf of the Department of Linguistics

Mattias Heldner, Francisco Lacerda, Marcin Włodarczak, Iris-Corinna Schwarz, Elisabet Eir Cortes, Hatice Zora, Lena Renner, Gláucia Laís Salomão, Ulla Bjursäter


Contents

Session 1: Voice quality and emotion

Voices after midnight – How a night out affects voice quality (p. 1)
Alexandra Berger, Rosanna Hedström Lindenhäll, Mattias Heldner, Sofia Karlsson, Sarah Nyberg Pergament, Ivan Vojnovic

Emotional Finnish Speech: Evidence from Automatic Classification Experiments (p. 5)
Juhani Toivanen

The intonation’s effects on speech intelligibility and attitudes (p. 9)
Sara Marklund, Jesper Zackariasson

Session 2: Segmental features

Instability in simple speech motor sequences - an overview of measures and what they really quantify (p. 15)
Fredrik Karlsson

Tongue articulation dynamics of /iː, yː, ʉ̟ː/ in Stockholm, Gothenburg and Malmöhus Swedish (p. 17)
Susanne Schötz, Johan Frid, Lars Gustafsson, Anders Löfqvist

An acoustic study of the Estonian Swedish lateral [ɬ] (p. 23)
Susanne Schötz, Francis Nolan, Eva Liina Asu

Posters

A data-driven approach to detection of interruptions in human–human conversations (p. 29)
Raveesh Meena, Saeed Dabbaghchian, Kalin Stefanov

The WaveSurfer Automatic Speech Recognition Plugin (p. 33)
Giampiero Salvi, Niklas Vanhainen


Towards a contingent anticipatory infant hearing test using eye-tracking (p. 35)
Iris-Corinna Schwarz, Atena Nazem, Sofia Olsson, Ellen Marklund, Inger Uhlén

Session 3: Dialogue

Duration and pitch in perception of turn transition by Swedish and English listeners (p. 41)
Margaret Zellers

Backchannels and breathing (p. 47)
Kätlin Aare, Marcin Włodarczak, Mattias Heldner

Pauses and resumptions in human and in computer speech (p. 53)
Jens Edlund, Fredrik Edelstam, Joakim Gustafson

Session 4: Tone and accent

Initiality accent deaccenting (p. 59)
Sara Myrberg

Syllable structure and tonal representation: revisiting focal Accent II in Swedish (p. 65)
Antonis Botinis, Gilbert Ambrazaitis, Johan Frid

Prosodic boundaries and discourse structure in Kammu (p. 71)
Anastasia M Karlsson, Jan-Olof Svantesson, David House

Tonal production and syllabification in Greek (p. 77)
Antonis Botinis, Elina Nirgianaki

Session 5: Animal sounds

Sound initiation and source types in human imitations of sounds (p. 83)
Pétur Helgason

Human perception of intonation in domestic cat meows (p. 89)
Susanne Schötz, Joost van de Weijer

A pilot study of human perception of emotions from domestic cat vocalizations (p. 95)
Susanne Schötz


Session 6: L2

Aspects of second language speech prosody: data from research in progress (p. 101)
Juhani Toivanen

Perception and production of Swedish word accents by Somali L1 speakers (p. 105)
Anna Hed

The confusing final stops in L2 acquisition (p. 111)
Elisabeth Zetterholm

Observed pronunciation features in Swedish L2 produced by two L1-speakers of Vietnamese (p. 117)
Mechtild Tronnier, Elisabeth Zetterholm

Session 7: Child speech

Consonant inventory of Swedish speaking 24-month-olds: A cross-sectional study (p. 123)
Emilie Gardin, Maria Henriksson, Emilia Wikstedt, Marie Markelius, Lena Renner

Real-time registration of listener reactions to unintelligibility in misarticulated child speech (p. 127)
Ivonne Contardo, Anita McAllister, Sofia Strömbergsson

Session 8: Brain imaging and phonetics

SUBIC - Stockholm University Brain Imaging Center and its significance for humanistic and interdisciplinary research (p. 133)
Francisco Lacerda, Björn Lindblom

Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University

Voices after midnight – How a night out affects voice quality

Alexandra Berger 1,2, Rosanna Hedström Lindenhäll 1,2, Mattias Heldner 2, Sofia Karlsson 1,2, Sarah Nyberg Pergament 1,2, Ivan Vojnovic 1,2

1 Department of Clinical Science, Intervention and Technology, Division of Speech and Language Pathology, Karolinska Institutet
2 Department of Linguistics, Stockholm University, Sweden

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

This study aimed to investigate how different parameters of the voice (jitter, shimmer, LTAS and mean pitch) are affected by a late night out. Three recordings were made: one early evening before the night out, one after midnight, and one on the next day. Each recording consisted of a one-minute reading and prolonged vowels. Five students took part in the experiment. Results varied among the participants, but some patterns were noticeable in all parameters. A trend towards increased mean pitch during the second recording was observed among four of the subjects. Somewhat unexpectedly, jitter and shimmer decreased between the first and second recordings and increased in the third one. Due to the lack of ethical testing, only a small number of participants were included. A larger sample is suggested for future research in order to generalize results.

Introduction

It is well known that the general volume at pubs, discotheques and similar venues is very loud and that as a guest you have to raise your voice significantly in order to make yourself heard. Speakers tend to raise their voices in loud conditions. This is known as the Lombard effect (Lane & Tranel, 1971). This type of voice behavior can result in vocal fatigue, temporary hoarseness and may in the long run cause vocal disorders (Vilkman, 2000).

This study aimed at examining how different acoustic voice quality parameters were affected by the voice strain induced by a night out talking in a noisy environment, and what effects can be observed the following day. The parameters examined in this study were jitter (cycle-to-cycle variations in frequency), shimmer (cycle-to-cycle variations in amplitude) (Titze, 1995), LTAS (long time average spectrum) and mean pitch.

Following Södersten, Ternström, & Bohman (2005) we expected the mean pitch to increase and that LTAS would indicate a decrease in vocal fry in the second recording. Furthermore, as previous results imply that female speakers tend to increase glottal closure after speaking in loud conditions (Linville, 1995), we hypothesized that jitter and shimmer would decrease continuously from the first to the third recording.

Method

To test our hypotheses we made three recordings and compared a number of voice quality measures in these. The first recording (R1) occurred at 7 pm on a Friday evening. The subjects each read a text of approximately one minute and then pronounced a prolonged [a]. The second recording (R2) took place at half past midnight, after four hours in a bar, where the background noise level was measured. The third recording (R3) was done at noon the next day. The subjects reported differences in sleep duration


(from 2 hours of sleep to 7 hours) as well as differences in alcohol intake.

Equipment

The recordings were done in 16-bit, 44.1 kHz with the application Røde Rec LE (version 2.8.1) for iPhone 4 (version 7.0.3) and a Røde smartLav, tie clip, with a mouth-to-mic distance of 20 cm. Data was later analyzed in Praat (Boersma & Weenink, 2014). Noise level was measured with the application Buller (version 1.5) running on an iPhone 4S.

Subjects

The five participants consisted of four women and one man. Mean age was 26 years with a standard deviation of 2.9 years. All of the subjects were speech and language pathology students from Karolinska Institutet. None of the five reported any voice problems. One of them, henceforth referred to as S1, smokes on a daily basis. All were informed of the potential health risks and participated voluntarily in the experiment.

Analysis

All voice quality analyses were performed in Praat (Boersma & Weenink, 2014). Mean pitch was measured for each one-minute text reading using the To Pitch… and Get Mean… functions in Praat. LTAS was calculated from the complete audio recordings (text reading plus vowels) in each session. The LTAS analyses were based on downsampled (10 or 11 kHz) and inverse filtered versions of the original audio recordings. The To LPC (burg) function in Praat was used for the inverse filtering. Perturbation measures of local jitter and shimmer were taken in the prolonged vowels, using the voice report function in Praat.

Differences in voice quality across the three recordings from each participant were tested using repeated measures ANOVAs. We used one-way repeated-measures ANOVAs to compare the effect of the recording session (R1, R2, R3) on three different voice quality measures: mean pitch, jitter, and shimmer. We used repeated contrasts to compare R1 vs. R2, and R2 vs. R3, respectively. Mauchly’s test indicated that the assumption of sphericity was met in all three ANOVAs; therefore we report the tests assuming sphericity below.

Results

Environmental noise

Measurements of the environmental noise were done repeatedly during one hour. These measurements showed that the background noise level varied between 80 and 92 dB(A), which is a normal noise level at these types of venues, but is indeed a strenuous environment for dialogue.

Mean pitch

Figure 1 shows the mean pitch in the different recording sessions for the individual subjects.

Figure 1. Mean pitch (in semitones relative to 100 Hz) in the three recording sessions for the individual subjects.

Evidently, four out of the five subjects had about 0.5 to 1 semitones higher pitch after midnight, and all subjects had a lower pitch on the day after, although the amounts differed.

A one-way repeated-measures ANOVA showed that there was a significant effect of recording session on mean pitch (averaged across subjects), F(2,8) = 10.41, p = .006. Contrasts revealed that mean pitch was significantly lower in R3 than in R2, F(1,4) = 13.36, p = .022, and furthermore that R2 and R1 were not significantly different, F(1,4) = 2.09, p = .222.

LTAS

There was a lot of individual variation in the LTAS results for the five participants. Figure 2 shows an example from one subject (S5). For some of the subjects, there were clear differences between recordings while two of the participants showed little variation. Some subjects showed a more rapid decline within the first 1000 Hertz in R3 compared to the previous recordings, indicating a steeper spectral slope.

Figure 2. Example of LTAS curves from participant S5. The x-axis shows frequency (Hz) whilst the y-axis shows sound pressure level (dB/Hz). The different lines represent the recordings: R1=middle line, R2=upper line, R3=bottom line.

Jitter

Figure 3 shows the average jitter values in the different recording sessions for the individual subjects. All individual values were clearly below the Multi-Dimensional Voice Program (MDVP) jitter threshold of pathology of 1.040% (Kay Elemetrics, 2008). Somewhat unexpectedly, four out of five participants had the highest jitter values in R1 and lower jitter values in R2 than in R1 and R3.

Figure 3. Jitter (in %) in the three recording sessions for the individual subjects. The grey horizontal line indicates the MDVP threshold of pathology for jitter.

A one-way repeated-measures ANOVA showed that there was a significant effect of recording session also on jitter, F(2,8) = 5.33, p = .034. Contrasts revealed that jitter was significantly lower in R2 than in R1, F(1,4) = 8.29, p = .045, and furthermore that R2 also was significantly lower than R3, F(1,4) = 15.30, p = .017.

Shimmer

Figure 4 shows the average shimmer values in the different recording sessions for the individual subjects. All individual values but two were below the MDVP shimmer threshold of pathology of 3.810% (Kay Elemetrics, 2008). Again, unexpectedly, four out of five participants had the highest shimmer values in R1 and lower values in R2.

Figure 4. Shimmer (in %) in the three recording sessions for the individual subjects. The grey horizontal line shows the MDVP threshold of pathology for shimmer.

A one-way repeated-measures ANOVA showed that recording session did not have a significant effect on shimmer,


F(2,8) = 0.78, p = .49. However, if the participant who behaved qualitatively different from the others was excluded, there was a significant effect, F(2,6) = 5.60, p = .042.

Discussion and conclusions

This study investigated the effects of speaking in a loud and noisy environment. Although the results varied across subjects, certain recurring patterns were observed. As expected, the mean pitch increased from R1 to R2 and decreased to R3. Surprisingly, all subjects except S1 decreased in both jitter and shimmer from R1 to R2 and increased to R3, although not to the same level as R1. Our theory is that the subjects were more vocally warmed up at R2, which might explain these results.

S1 differed from the others and increased in both jitter and shimmer during R2. This participant had results which did not correlate with the others, even in pitch measures. We speculate that the individual differences can be explained by external factors such as alcohol consumption, cigarette smoking and amount of sleep. S1 had the largest intake of alcohol and cigarettes as well as only three hours of sleep.

Concerning LTAS there were not any strong differences between recordings, which may be due to environmental conditions during the recordings.

Since this is a pilot study with only a small number of participants it is difficult to get significant results. Also, because this study is explorative, it might be more interesting to look at the main effects of the experiment, rather than focusing on significance.

Because of the obvious problems in generalizing our results to a larger population we suggest a larger sample for future research. However, using a larger randomized sample might be hard to motivate ethically due to the possible health effects of this study.

We also suggest monitoring how the individual voices behave in loud environments in order to identify possible differences in voice behavior. Such differences might have an effect on the voice quality of the voices after a given occasion.

References

Boersma, P., & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program] (Version 5.3.75). Retrieved from http://www.praat.org/
Kay Elemetrics. (2008). Multi-Dimensional Voice Program, Model 5105 [Computer program]. Lincoln Park, NJ, USA: Kay Elemetrics Corporation.
Lane, H., & Tranel, B. (1971). The Lombard sign and the role of hearing in speech. Journal of Speech, Language and Hearing Research, 14, 677–709.
Linville, S. E. (1995). Changes in glottal configuration in women after loud talking. Journal of Voice, 9(1), 57–65.
Södersten, M., Ternström, S., & Bohman, M. (2005). Loud speech in realistic environmental noise: phonetogram data, perceptual voice quality, subjective ratings and gender differences in healthy speakers. Journal of Voice, 19(1), 29–46.
Titze, I. R. (1995). Workshop on acoustic voice analysis: Summary statement. Retrieved from http://www.ncvs.org/freebooks/summary-statement.pdf
Vilkman, E. (2000). Voice problems at work: A challenge for occupational safety and health arrangement. Folia Phoniatrica et Logopaedica, 52(1-3), 120-125.
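As a concrete illustration of the measures and statistics used in this study, the sketch below computes local jitter and shimmer from their textbook cycle-to-cycle definitions, converts pitch to semitones relative to 100 Hz (the scale of Figure 1), and reproduces the one-way repeated-measures ANOVA design (5 subjects × 3 sessions, hence the reported df of (2, 8)). All numeric values are invented, not the study's data, and Praat's actual jitter/shimmer algorithms include period-detection and outlier-handling details omitted here.

```python
import math

def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    as a percentage of the mean period (cf. Titze, 1995)."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return 100 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak amplitudes,
    as a percentage of the mean amplitude."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return 100 * (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

def semitones_re_100hz(f_hz):
    """Convert a frequency in Hz to semitones relative to 100 Hz."""
    return 12 * math.log2(f_hz / 100)

def rm_anova_f(data):
    """F statistic and df for a one-way repeated-measures ANOVA.
    data: one row per subject, one value per condition."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[c] for row in data) / n for c in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_cond - ss_subj
    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df_cond) / (ss_err / df_err), (df_cond, df_err)

# Invented pitch periods (s) and peak amplitudes from a sustained [a]
periods = [0.00501, 0.00498, 0.00503, 0.00499, 0.00502]
amplitudes = [0.81, 0.79, 0.82, 0.80, 0.81]
print(round(local_jitter(periods), 3), round(local_shimmer(amplitudes), 3))

# Invented mean pitch (semitones re 100 Hz) for 5 subjects in R1-R3
pitch = [[3.1, 3.9, 2.6], [5.0, 5.6, 4.1], [2.2, 3.0, 1.8],
         [4.4, 5.2, 3.5], [6.0, 5.8, 5.1]]
f_stat, df = rm_anova_f(pitch)
print(f"F({df[0]},{df[1]}) = {f_stat:.2f}")
```

The jitter and shimmer values produced this way can be compared against the MDVP pathology thresholds (1.040% and 3.810%) in the same manner as in Figures 3 and 4.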


Emotional Finnish Speech: Evidence from Automatic Classification Experiments

Juhani Toivanen

Diaconia University of Applied Sciences, Finland
[email protected]

Abstract

Emotional Finnish speech: a contradiction in terms? Stereotypical views aside, emotional expression does exist in spoken Finnish, as in all languages. The vocal repertoire may be somewhat more limited than in some other languages, Finnish hardly being an intonation language par excellence, but recent evidence shows that, at least at the non-intonational level, there are systematic vocal features of emotion in spoken Finnish. In this paper, research on this area is reviewed.

Introduction

There is the persistent stereotype that Finns do not use prosodic signals in speech as freely and intensively as speakers of some other languages (e.g. Italians). Stereotypically, it has been assumed that Finns tolerate long silences in conversation and are reluctant to engage in spontaneous small talk with strangers in communicative situations (consider the portrait of a Finn in a Kaurismäki movie). While there is some literature on the emotional aspects of some intonation contours in spoken Finnish, the vocal expression of emotion in continuous Finnish speech is understood very poorly. For example, Laukkanen et al. (1996) present relevant data on the vocal expression of emotion in Finnish, but the speech segments (syllables) are very limited durationally and communicatively.

In this paper, results on the correlation between vocal parameters and emotion in spoken Finnish are presented. The research on this subject was carried out utilizing the MediaTeam Emotional Speech Corpus. The speech material was produced by fourteen professional actors (eight men, six women) from Oulu City Theatre in Finland. The subjects were aged between 26 and 50, and were all speakers of the same northern variety of Finnish. The speakers simulated the following basic emotions while reading out a phonetically rich text of 120 words adapted from a newspaper article: neutral, sadness, anger, and happiness.

The audio recordings were made in an anechoic chamber using high quality equipment, and the acoustic analysis was carried out with f0Tool (developed by the MediaTeam Language and Audio Technology Group). Currently, f0Tool is capable of analyzing over 40 acoustic/prosodic parameters fully automatically from a speech sample of any duration (Toivanen et al., 2004). The parameters are f0-related, intensity-related, temporal and spectral features (Suomi et al., 2008).

The general f0-based parameters were: mean f0, median f0, maximum f0, minimum f0, f0 range, 5th fractile of f0, and 95th fractile of f0. The parameters describing the dynamics of f0 were: average f0 fall/rise during a continuous voiced segment, average steepness of f0 fall/rise, maximum f0 fall/rise during a continuous voiced segment, and maximum steepness of f0 fall/rise. The intensity-related parameters were e.g. the following: mean RMS intensity, median RMS intensity, intensity range, 5th fractile of intensity, 95th fractile of intensity, and the average range between the fractiles. The temporal parameters were e.g. the following: average duration of voiced segments, average duration of unvoiced segments shorter than 300 ms, maximum duration of voiced segments, and

maximum duration of silence segments. Ratio parameters were e.g. the following: ratio of speech to long unvoiced segments and ratio of silence/speech segments. The spectral features concerned the proportion of low-frequency energy (below 500/1000 Hz). Additional parameters were jitter and shimmer. Jitter is defined as the amount of random cycle-to-cycle variation between adjacent pitch periods in vocal fold vibration; it is thus a measure of f0 perturbation. Shimmer is the amount of cycle-to-cycle variation in amplitude between adjacent pitch periods.

The highest f0 value and the lowest f0 value are absolute values, and are not often very useful parameters as they may actually be “accidental” values, representing shifts into the falsetto register and the creak register, respectively.

Evidence from classification experiments

Speaker-independent classification was performed using the K-Nearest-Neighbour (kNN) classifier, which is applied as a standard non-parametric method in statistical pattern recognition; leave-one-out was used for evaluating classifier performance. The level of automatic classification of emotions reached a level of just below 70% with the prosodic patterns given in Table 1, which represent seven dimensions in the classification procedure, intensity range being the single most important cue. Note that the intensity range alone produced a classification capacity of over 50%, and that intensity range and maximum f0 rise during a voiced segment together yielded a classification rate exceeding 54%, and so on. Note also that with three parameters, the classification accuracy already exceeds 60%.

Table 1. Emotional cues in spoken Finnish for the computer.

Acoustic feature                            Cumulative classification accuracy
Intensity range                             51.1%
Maximum f0 rise during a voiced segment     54.6%
Ratio of silence-to-speech                  63.2%
5%-95% f0 range                             65.0%
Shimmer                                     66.1%
Jitter                                      68.6%
Intensity variation                         69.6%

Discussion

The existing literature suggests that the computer achieved quite a good discrimination rate. It has been argued that, in a speaker-independent task, as in this experiment, the performance level can reach 60-70% for three basic emotions (ten Bosch, 2003). Looking at the best feature vector in the classification task, it was observed that, to express emotion vocally, the speakers used cues largely similar to those reported for other languages, i.e. variations in energy, speech rate and pitch. The optimal set of parameters in the classification procedure consisted of intensity range, maximum f0 range during a continuous voiced segment, ratio of silence-to-speech, 5%-95% f0 range, shimmer, jitter and intensity variation.

This set clearly reflects the “liveliness” of the speech: intensity range, f0 range, the dynamics of f0 change as well as the amount of speech within a speaking turn obviously correlate with the activity level of the speech situation and the speaker. It can thus be argued that Finns use prosody to express affect in speech in a way that must be essentially similar to the vocal expression of emotion reported for major languages such as English and French. Showing that the same prosodic parameters are utilized in the emotion portrayals through voice, and demonstrating that emotional spoken Finnish is not qualitatively different from other languages, these research findings hopefully serve to dispel some myths about the characteristics of Finnish speech.
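The evaluation scheme described above (a k-nearest-neighbour classifier scored with leave-one-out cross-validation) can be sketched in a few lines. The feature vectors below are invented stand-ins for two of the Table 1 features, scaled to comparable ranges, and k = 3 is an arbitrary choice; the actual k, feature scaling and corpus of the experiment are not reproduced here.

```python
def knn_predict(train, query, k=3):
    """Majority label among the k nearest training points (squared
    Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in train
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

def leave_one_out_accuracy(samples, k=3):
    """Classify each sample with all the remaining samples as training
    data, and report the proportion classified correctly."""
    hits = 0
    for i, (x, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(samples)

# Invented (intensity range, f0 range) pairs per emotion portrayal
samples = [
    ((0.2, 0.3), "neutral"), ((0.25, 0.35), "neutral"), ((0.3, 0.3), "neutral"),
    ((0.8, 0.9), "anger"), ((0.85, 0.8), "anger"), ((0.9, 0.95), "anger"),
    ((0.3, 0.8), "sadness"), ((0.35, 0.75), "sadness"), ((0.25, 0.85), "sadness"),
]
print(leave_one_out_accuracy(samples, k=3))
```

Leave-one-out is attractive for a corpus of this size because every portrayal serves as test material exactly once while the classifier still trains on nearly all of the data.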


An interesting product of the experiment is the 7% difference between the performance levels for the computer and the human listeners, demonstrating that the human listeners utilized acoustic/prosodic parameters unavailable to the computer. The computer can utilize only automatically computable prosodic primitives, while the human listener also pays attention to the linguistically relevant prosodic phenomena. What might these phenomena be?

In spoken Finnish, the basic non-affective utterance contains a descending f0 curve with rising-falling peaks in the syllables of the accentuated words. The point is that accents, which are signaled tonally, probably tend to co-occur with special emotional content in speech. The human listener will hear these accents as discrete phonological phenomena, but the classifier (i.e. the computer) is not, as yet, capable of this. Thus the human listener has access to more information than the computer in evaluating the affective dimensions of speech – as the observed performance level in the data indeed suggests.

To elaborate: in spoken Finnish, a thematic accent is realized as a gentle rise-fall, typically occurring on lexical items, while a rhematic accent is a more prominent rise-fall on the accented word (Suomi et al., 2003; Suomi et al., 2008). These two accents are not realized durationally. A contrastive accent, on the other hand, is realized as an even more prominent rise-fall with increased segmental duration. Finally, the emphatic accent is not a phonological phenomenon as such, as it reflects the degree of emotion rather than the degree of contrast in a speech situation. With the emphatic accent, all prosodic features (f0, intensity, duration) can increase “unlimitedly” (relatively speaking) in unison with the speaker’s affective state. In spoken emotional Finnish, the dynamic aspects of f0 variation (e.g. maximum f0 rise) probably have an important role from the perceptual viewpoint. In addition to signaling the beginning of accent (Suomi et al., 2008), f0 rises in all likelihood also occur in utterances which are “globally emotional”: they do not just mark off single accentuated words but they represent speaking turns which are emotional throughout. As is well known, a rising intonation is relatively rare in standard spoken Finnish – unless an emotional (or in some other way strong) dimension is intended. Utterances with high rising tones can be assumed to convey strong emotional meanings (annoyance, incredulity, etc.) in spoken Finnish. Again, it must be noted that the current classifier does not “hear” these syntactic features of rising f0 movements (in final position) in an utterance. By contrast, the human listeners can be expected to be fully aware of this kind of “marked” prosody in a speaking turn. It should also be noted that these phonological (emotion-related) f0 features certainly exist in spoken Finnish regardless of the possibility that Finnish is not, phonologically, as tonal as some other languages. The degree of tonality may be “small” only in comparison with other languages: the language-specific tonal features are sufficiently distinct in Finnish to separate contrastive accents from thematic ones, and emotional speech from non-emotional speech.

The results of these classification experiments offer (indirect) support for the hypothesis that discrete non-gradable phonological features – accents and utterance-level intonation contours – also convey affective content in Finnish. This has implications for the development of classification methods. It will not be enough to concentrate on the automatically measurable phonetic variables; at some point, the classifier must tackle the more abstract prosodic patterns if the aim is to ultimately improve the emotion discrimination performance level. An important future direction in the development of classification methods would be to model the abstract f0 phenomena in a computable way. There is no reason to assume that this would be an impossible task in the long run. Essentially, what is needed is the gradual development of language-specific models of legitimate phonological f0 contours, which the classifier must be trained to recognize. Eventually, in the classification procedure, the constantly varying prosodic features and the more abstract features must be combined.

Conclusions

Some conclusions about the cues for emotion in spoken Finnish seem possible. Firstly, features of f0 and intensity have been found to accompany emotional Finnish speech – this is probably a universal phenomenon in the expression of emotion. Secondly, the performance level of the human classification of emotion exceeds that of the automatic classification. Although this is not surprising in itself, it can be argued that phonological features of f0 variation, especially rising f0, are emotion-carrying features in spoken Finnish, in addition to the global constantly varying average features of f0, intensity, duration, etc. Also in this respect, it can be argued that Finnish, a small language in a small language group, is not qualitatively different from major Indo-European languages. This finding contradicts stereotypical notions of (the lack of) emotionality in the Finnish language. In languages in general, prosodic parameters are hierarchically organized as concrete (“phonetic” or “paralinguistic”) and as more abstract (“phonological” or “linguistic”) phenomena, and there is no reason to assume that some of these levels would be irrelevant from the viewpoint of the vocal communication of emotion. Finally, the results suggest that contrastive research on human vs. computer categorization of emotions is promising, and that, in the near future, computer recognition of human vocal emotions may approach a natural state.

References

Laukkanen, A.M., Vilkman, E., Alku, P. & Oksanen, H. (1996). Physical variations related to stress and emotional state: a preliminary study. Journal of Phonetics, 24, 313-335.
ten Bosch, L. (2003). Emotions, speech and the ASR framework. Speech Communication, 40, 213-225.
Suomi, K., Toivanen, J. & Ylitalo, R. (2003). Durational and tonal correlates of accent in Finnish. Journal of Phonetics, 31, 113-138.
Suomi, K., Toivanen, J. & Ylitalo, R. (2008). Finnish sound structure: phonetics, phonology, phonotactics and prosody. Oulu University Press.
Toivanen, J., Seppänen, T. & Väyrynen, E. (2004). Automatic discrimination of emotion from spoken Finnish. Language and Speech, 47, 383-412.


The intonation’s effects on speech intelligibility and attitudes

Sara Marklund 1,2, Jesper Zackariasson 1,2

1 Department of Philosophy, Linguistics and Theory of Science
2 Department of Clinical Neuroscience and Rehabilitation, Division of Speech and Language Pathology
University of Gothenburg, Sweden

[email protected], [email protected]

Abstract

Intonation is a phenomenon that constitutes a large part of human communication. Despite this fact it is an area that is relatively unexplored, particularly in Swedish. This study examines whether the pattern of intonation affects the intelligibility of spoken language, and how the listener perceives the speaker, with five different attitudes to choose from. This is a quantitative cross-section study where mail-based questionnaires were used. 46 participants were included in the study; half of them listened to a recitation of a short text with lively intonation and the remaining half listened to the same text but recited with a monotone intonation. This study shows that there is a significant difference in speech intelligibility between the two different intonation patterns. Speech intelligibility is larger in the group who listened to the recording with lively intonation. There is also a significant difference between the participant groups regarding the evaluation of the attitudes sympathetic, dedicated and irritated. The results of this study are applicable in areas such as teaching, newscasts and lectures.

Background

The idea of this study emerged when reading a chapter by Vaissière (2006) which was focused around the subject of intonation. It became clear that intonation is an area that has not been fully explored. Particularly few studies have been done on the subject in Swedish, even though it is a language with rich patterns of intonation which play important parts as communicative markers (Vaissière, 2006). Amongst other topics, not much research has been done exploring any eventual connection between intonation and intelligibility. However, a study made by Francuz (2010) touches on the subject. Francuz’s conclusion was that intonation affects how well listeners comprehend newsreaders’ messages. Furthermore, a study made by Braun, Dainora & Ernestus (2011) showed that an unfamiliar intonation pattern slows down the processing of speech and thereby reduces intelligibility, which indicates the importance of intonation. Since intonation constitutes a large part of human communication, this was an area that awoke our interest. Intonation gives the listener information about an utterance, such as whether it is a question or a statement (Vaissière, 2006). Other functions of intonation are emphasis and the display of attitudes, for example if the speaker is ironic. It also functions as a marker for turn-taking in conversations. Rebecca Hincks has done several studies where she puts emphasis on the importance of a varied intonation when giving lectures and doing presentations (Hincks, 2005; Hincks & Edlund, 2009). She has done research on how a varied intonation can be obtained through feedback in oral presentations held in one’s second language. However, we wanted to explore whether there in fact is a significant difference in speech intelligibility depending on which type of intonation pattern one uses, a monotone or a lively one. We also wanted to explore how a listener perceives a speaker

based on the pattern of intonation that is used.

Our hypothesis for this study was that speech with a monotone intonation would be less intelligible than speech with lively intonation. This belief arose from personal experiences where speech with a monotone intonation has been considered strenuous to listen to, resulting in a decrease of interest in the speech. Furthermore, without varied intonation, important communicative markers such as emphasis and focus are lost, which makes it more difficult to grasp the message. We also hypothesized that speech with monotone intonation would be perceived as irritated, unsympathetic and uncommitted, whereas speech with a lively intonation would be perceived as dedicated, sympathetic and possibly naïve. Speech with a very lively intonation was also believed to be perceived as unnatural or artificial (Vaissière, 2006).

Method
To be able to answer these questions, a male speaker reading a short text about chameleons was recorded in two settings, one with a lively intonation and the other with a monotone intonation. The following data were collected using the program Praat. The mean values of pitch, amplitude and duration were kept at approximately the same level in both recordings. The duration of the recordings was 64 seconds each. The pitch in the recording with lively intonation varied between 76 Hz and 210.6 Hz, with a mean pitch of 114.7 Hz. The pitch in the recording with monotone intonation varied between 75 Hz and 143.4 Hz, with a mean pitch of 104.7 Hz. The mean amplitude was 72.5 dB in the recording with lively intonation and 74 dB in the one with monotone intonation.

The number of participants in this study was 46. Without knowing there was any other type of recording than the one they listened to, half of the participants listened to the recording with lively intonation and the other half to the one with monotone intonation. The study was randomized with a restriction on the distribution in terms of gender. In the group who listened to lively intonation, 9 participants were men and 14 were women; in the group that listened to monotone intonation, 7 participants were men and 16 were women. An equal number of participants in each group (8 people) already knew the person who read the text on the recording. The age of the participants varied between 20 and 35 years; however, two of the participants, one in each group, were older than 40 years.

The participants of both groups were asked to listen to the recording just once and then answer five questions concerning the content of the recorded text; three multiple-choice answers were given for each question. The participants were also asked to fill in a form where they rated from 0 (not at all) to 5 (very much) how they perceived the speaker. The attitudes the participants were asked to rate were: sympathetic, dedicated, irritated, artificial and naïve. We e-mailed the instructions to the participants and attached the sound file and questionnaires separately. Their answers were e-mailed back to the authors.

Two very different types of intonation patterns were used, a monotone and a lively one, so that a clear result could be extracted. Two separate listening groups were chosen so that the participants would not hear the text twice and thereby learn it. Furthermore, since the participants did not know there was a second recording, the purpose of the study remained hidden so that it would not affect the results.

The text used in this study was chosen because it seemed to keep an appropriate level of difficulty, neither too easy nor too difficult. The text was about chameleons, a subject considered not completely unfamiliar to people. Multiple-choice answers to the questions were chosen so that it would not be too difficult for the participants to answer the questions, since they were only allowed to hear the recording once. Furthermore, multiple-choice answers made it easier to compare the results, whereas open questions would have been more difficult to analyze. When deciding the rating scale for the attitudes, an even number of steps was chosen so that the participants had to take a stand and not just choose the middle alternative. Attitudes were chosen that we associate with a monotone and a lively intonation, respectively (Vaissière, 2006). The results of the test were analyzed in the program SPSS using the t-test for independent samples. The alpha level was set at 0.05.

Materials
The recited text used in the study was in Swedish; the following is a translation of that text:

The Chameleon
Surely chameleons can change their colour, but they do not do it depending on their environment. From the beginning they have a primary colour which makes them difficult to discover, like other gray, brown and green lizards. Moreover, they can change their colour and pattern depending on the temperature and the light conditions, which gives them a further camouflaging effect. But above all they change their colour depending on their mood – the main point is therefore communicative. For example, dark colours signal that the chameleon is stressed and/or aggressive, which every chameleon owner knows. Many of the changes are incidentally very colourful and therefore anything but camouflaging. Furthermore, they cannot assume just any colour, although very many.

There are other lizards that can change their colour, but far from as spectacularly as the chameleons. Some octopuses can camouflage themselves effectively by assuming the appearance of their environment, and they do it very fast, just like many people think chameleons do.

The extract was retrieved from Faktoider by Peter Olausson (2010).

The questionnaire the participants were asked to fill out after listening to the recitation contained the following questions, here translated from Swedish:

What are the chameleon's primary colours?
a) The answer was not given in the text
b) Black and brown
c) Gray, brown and green

Why does the chameleon change its colour?
a) Camouflage
b) Depending on mood
c) To scare off threatening enemies

Which is the most common colour that chameleons assume?
a) A sand-coloured nuance
b) The answer was not given in the text
c) Dark green

What does every chameleon owner usually know?
a) That the chameleon assumes a dark nuance when it is aggressive
b) That the chameleon can assume many different colours
c) That they are very sociable

In what context does the chameleon get stressed?
a) When they are surprised
b) The answer was not given in the text
c) When their nest is threatened

Results
Results of intelligibility for lively intonation
Questionnaire (maximal score 5):
Max: 5
Min: 2
Mean: 4.04
Median: 4

Results of intelligibility for monotone intonation
Questionnaire (maximal score 5):
Max: 5
Min: 1
Mean: 3.17
Median: 3


Table 1. Results of the attitude form (scale 0-5) for lively intonation

         Sympathetic  Dedicated  Irritated  Artificial  Naïve
Max      5            5          2          5           4
Min      2            1          0          0           0
Mean     3.61         3.35       0.13       3           1.74
Median   4            3          0          3           1

Table 2. Results of the attitude form (scale 0-5) for monotone intonation

         Sympathetic  Dedicated  Irritated  Artificial  Naïve
Max      5            4          3          5           4
Min      0            0          0          0           0
Mean     1.87         1.22       0.87       2.43        1.22
Median   4            3          0          3           1
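The summary rows of Tables 1 and 2 (max, min, mean, median) can be reproduced from raw ratings with a few lines of Python. The ratings below are invented for illustration, not the study's data.

```python
import statistics

def summarise(ratings):
    """Max/min/mean/median summary of 0-5 attitude ratings,
    matching the rows of Tables 1 and 2."""
    return {
        "max": max(ratings),
        "min": min(ratings),
        "mean": round(statistics.mean(ratings), 2),
        "median": statistics.median(ratings),
    }

# Invented ratings from one listener group for one attitude.
summary = summarise([4, 5, 3, 4, 2, 4, 5, 3])
```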

There is a significant difference between the two listening groups. The group that listened to the recitation with lively intonation scored a significantly higher mean sum on the questionnaire than the group that listened to the recitation with monotone intonation: 4.04 versus 3.17, p=0.010. In terms of the attitudes sympathetic, dedicated and irritated, the difference between the groups is also significant. The group that listened to the lively intonation rated the attitudes sympathetic (p=0.000) and dedicated (p=0.000) higher, and the attitude irritated (p=0.008) lower, than the group that listened to the monotone intonation. There was no significant difference between the groups regarding the ratings of the attitudes naïve (p=0.252) and artificial (p=0.205).

Discussion and conclusion
The results of this study support our hypothesis. The participants who listened to the monotone intonation scored a significantly lower mean sum on the questionnaire than the participants who listened to the lively intonation. This study shows that there is a connection between intonation and intelligibility: a monotone intonation reduces intelligibility.

A few of the participants (8 in each group) were familiar with the person who recited the text. This could of course have affected their judgments of the attitudes. However, this was compensated for by having an equal number of familiar listeners in each group, so that it would not influence the final results.

This study investigated recited text material, and it is therefore difficult to tell how generalizable the results are to other modalities such as spontaneous speech.

The sound quality in the material was not optimal, since there was some disturbing noise. On the other hand, this noise was equally loud for each listener in both groups and should therefore not bias the results.

Since the test material was sent by e-mail, we have not been able to ensure that the participants followed the instructions and listened to the recording only once. Nor have we been able to control the listening environment or ensure that the participants used equivalent equipment.

The results show that a significant difference in comprehension can be seen after only 64 seconds of speech. It is not difficult to imagine that listening to a long speech segment with monotone intonation would affect the listener's ability to maintain concentration and attention, which in turn would affect intelligibility. But as we have seen in this study, intelligibility is affected and decreases after listening to only 64 seconds of speech with a monotone intonation – a time frame during which a healthy young adult should be able to keep his or her concentration.

A study by Braun et al. (2011) shows that unfamiliar intonation patterns slow down the processing of speech and thus reduce intelligibility. These findings correspond with our results.

Our ambition was to choose questions that were neither too easy nor too hard, but of varied difficulty. It is difficult to tell whether the questions were well balanced, since we did not use a neutral intonation pattern to compare with. If the questions were too unbalanced in difficulty, then we assume that the intonation pattern is of less importance for intelligibility.

The results of the attitude form support our initial hypothesis: the voice with lively intonation was perceived as more sympathetic and dedicated. However, there was no significant difference in the attitude naïve between the two groups. One explanation for this result could be that naïve is hard to rate, because it is a word that is not as frequently used as the other attitudes in the study, and people might have different connotations of the word.

The attitude irritated was rated low in both groups, but significantly higher in the group who listened to the monotone intonation. Before the study was performed, we expected the attitude irritated to be rated generally higher. An explanation for the current results could be that a monotone intonation is not associated with any form of engagement, and therefore not with irritation either.

There was no significant difference in the attitude artificial between the two groups, but it was rated relatively high in both. One explanation for this could be that an exaggerated intonation pattern, whether monotone or lively, is deviant in Swedish and therefore sounds artificial (Vaissière, 2006).

It would be desirable to do further research in this area and investigate how intelligibility is affected in other modalities. The study by Francuz (2010) showed that intonation affects how well listeners comprehend newsreaders' messages. Results of future studies could therefore be applied in many areas, for example lectures, education and newsreading, to maximize intelligibility.

Acknowledgements
We would like to give special thanks to our supervisor Åsa Abelin, Associate Professor in General Linguistics at the University of Gothenburg, who guided us and proofread this text. We would also like to thank all of the participants who contributed to this study.

References
Vaissière, J. (2006). Perception of intonation. In D. B. Pisoni & R. E. Remez (eds.), The Handbook of Speech Perception (pp. 236-263). Blackwell Publishing Ltd.
Boersma, P. & Weenink, D. (2013). Praat: doing phonetics by computer [Computer program]. Version 5.3.43, retrieved April 2013 from http://www.praat.org.
Braun, B., Dainora, A. & Ernestus, M. (2011). An unfamiliar intonation contour slows down online speech comprehension. Language and Cognitive Processes, 26(3), 350-375.
Francuz, P. (2010). The impact of audio information intonation on understanding television news content. Psychology of Language and Communication, 14(1), 71-86.
Hincks, R. (2005). Measuring liveliness in presentation speech. In Proceedings of Interspeech 2005, Lisbon (pp. 765-768).


Hincks, R. & Edlund, J. (2009). Transient visual feedback on pitch variation for Chinese speakers of English. In Proceedings of Fonetik 2009 (pp. 102-107). Stockholm: Department of Linguistics, Stockholm University.
Olausson, P. (2010). Faktoider: försanthållna osanningar, halvsanningar och missuppfattningar (pp. 152-153). Falun: Månpocket.


Instability in simple speech motor sequences – an overview of measures and what they really quantify
Fredrik Karlsson
Department of Clinical Sciences, Umeå University
[email protected]

Abstract
The sequencing of speech motor gestures may be impaired in patients with conditions that affect either the functioning of the active articulators or regions of the brain involved in speech motor control. Oral diadochokinesis is an established tool for the assessment of speech motor function, and has primarily been studied in terms of rate and stability of the syllable productions. Syllable rate has achieved a coherent quantification across reports due to the simple nature of what is being quantified. Attempted quantifications of the concept of syllable production instability, however, are much more diverse, with most measures being incomparable to the others due to different underlying definitions of the concept of instability. In this talk, I will present an overview of recently used quantifications of instability or regularity of speech production, and illustrate which aspect of instability each could claim to quantify. Specific cases of measures that may either lead to erroneous conclusions or be of reduced scientific value, due to high levels of uncertainty in their interpretation, will be highlighted.
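As one concrete illustration of the diversity the abstract describes, a frequently seen pair of quantifications is the mean syllable rate together with the coefficient of variation (CV) of inter-onset intervals. Both the choice of metric and the onset times below are illustrative assumptions on my part, not measures endorsed or used in the talk.

```python
import statistics

def ddk_measures(onsets):
    """Syllable rate (syllables/s) and CV of inter-onset intervals,
    computed from a list of syllable onset times in seconds. The CV is
    only one of many possible operationalisations of 'instability'."""
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    rate = len(intervals) / (onsets[-1] - onsets[0])
    cv = statistics.stdev(intervals) / statistics.mean(intervals)
    return rate, cv

# Invented onset times for a /pa-ta-ka/-style repetition task.
onsets = [0.00, 0.21, 0.40, 0.62, 0.81, 1.03]
rate, cv = ddk_measures(onsets)
```

Note how the rate falls out of the data almost by definition, whereas the CV presupposes a particular definition of instability (variability of interval durations) — exactly the kind of divergence between measures the talk addresses.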



Tongue articulation dynamics of /iː, yː, ʉ̟ː/ in Stockholm, Gothenburg and Malmöhus Swedish
Susanne Schötz 1, Johan Frid 1, Lars Gustafsson 1, Anders Löfqvist 2
1 Lund University Humanities Lab, Centre for Languages & Literature, Sweden
2 Dept. of Logopedics, Phoniatrics and Audiology, Lund University, Sweden
[email protected], [email protected], [email protected], [email protected]

Abstract
Articulatory data were collected for the Swedish vowels /iː, yː, ʉ̟ː/ from nine speakers each of Stockholm, Gothenburg, and Malmöhus Swedish, and the tongue positions and their dynamics were analysed using Functional Data Analysis (FDA). Results showed that the general tongue positions for /iː/ and /yː/ are similar and clearly different from /ʉ̟ː/ in all three dialects. Variation within the Stockholm and Gothenburg groups led to a subdivision into two types, where the tongue positions of type 1 resembled Malmöhus Swedish more. Several differences in tongue articulation between types 1 and 2 were observed, possibly explained by the presence of Viby-coloured /iː/ and /yː/ in type 2.

Introduction
In the Swedish vowel system, there are three contrastive long front close vowels, /iː, yː, ʉ̟ː/, characterised by a relatively small acoustic and perceptual distance. The magnitude of the lip opening is regarded as the major distinctive feature: unrounded /iː/, outrounded /yː/, and inrounded /ʉ̟ː/ (Fant, 1959; Ladefoged & Maddieson, 1996). Specifically, the contrast between /yː/ and /ʉ̟ː/ is considered highly unusual among the world's languages. The tongue articulation is assumed to be basically identical, but the documentation of this is incomplete, especially for the articulatory dynamics (Ladefoged & Maddieson, 1996: 295–6). To maintain the distinctions between these vowels, they are often characterised by a slight diphthongisation or consonantal off-glide at the end. In many dialects, including Stockholm and Gothenburg, the gesture for /iː/ and /yː/ is achieved by the tongue dorsum, as [ij] and [yj], while the lips are used for /ʉ̟ː/, as [ʉ̟β] (McAllister et al., 1974; Hadding et al., 1976). A different tongue gesture is used in Malmöhus Swedish, where /iː, yː, ʉ̟ː/ are realised as [e͡i, ø͡y, ø͡ʉ̟] (Bruce, 2010).

Another fairly common realisation of /iː/ and /yː/ in Swedish is as [ɨː] and [ʉ̟ː], i.e. with a "damped" quality often referred to as Viby-colouring (Bruce, 2010; Ladefoged & Maddieson, 1996). There is disagreement in the Swedish phonetics literature as to whether the major constriction for the damped /iː/ and /yː/ is further front compared to their regular counterparts, and basically alveolar, or instead further back and rather central (Björsten et al., 1999; Engstrand et al., 2000). However, as adequate articulatory data seem to be lacking, these views are at best intelligent speculations.

In Schötz et al. (2013) we investigated the articulatory dynamics of /iː, yː, ʉ̟ː/ in Gothenburg Swedish (GS) and Malmöhus Swedish (MS), spoken in and near Gothenburg and Malmö, respectively, using Functional Data Analysis. In MS, we found that the position of the tongue body was significantly lower for /ʉ̟ː/ than for /iː/ and /yː/. In GS, the speakers could be subdivided into two different types according to their articulation patterns; type GS1 resembled MS, while type GS2 had a higher tongue body for /ʉ̟ː/.

The purpose of this study was to extend our findings by including Stockholm Swedish (SS), spoken in and near Stockholm, and compare the tongue

articulation of /iː, yː, ʉ̟ː/ of this dialect to those of GS and MS. Our aim was to find out how SS relates to our findings for MS and GS. Based on the results of Schötz et al. (2013), we expected the tongue positions in the dimensions open–close and front–back to be different for /ʉ̟ː/ than for /iː/ and /yː/ in all three dialects. Furthermore, we expected to find regional differences in the articulation of /iː/ and /yː/, as Viby-colouring is more common in SS and GS than in MS (Bruce, 2010). We also expected to find a subdivision into two types in both GS and SS.

Material and method
Nine speakers each of SS (3 females, 6 males, age: 21–63, mean = 42, sd = 15.2), GS (5 females, 4 males, age: 20–47, mean = 29, sd = 10.0), and MS (4 females, 5 males, age: 23–62, mean = 43, sd = 11.7) were recorded by means of electromagnetic articulography, along with a microphone signal, using an AG 500 (Carstens Medizinelektronik). Twelve sensors were placed on the lips, jaw and tongue, and also on the nose ridge and behind the ear to correct for head movements. Figure 1 shows the sensor positions and one subject with the sensors attached.

Figure 1. The twelve sensor positions and a speaker with the sensors attached.

In this study we focused on the tongue tip (sensor 1) and body (sensor 2). The speech material consisted of 15–20 repetitions by each speaker of /iː, yː, ʉ̟ː/ in carrier sentences of the type "De va inte hVt utan hVt ja sa" (It was not hVt, but hVt I said), where the target words containing the vowels were stressed and produced with contrastive focus. The sentences were displayed in random order on a computer screen, and the speakers were instructed to read each sentence in their own dialect at a comfortable speech rate. A contour of the palate was obtained as the speakers moved their tongue tips back and forth along the midline of the palate.

Error detection and speaker normalisation
Noise and measurement errors in articulatory data are fairly common, due to quick head movements, sensors moving too close to each other, sensors breaking or falling off, or calculation errors. In order to detect and exclude such errors, we used the same two-step process described in Schötz et al. (2013). The vowels were segmented manually in Praat (Boersma & Weenink, 2014) and used as acoustic landmarks to trim the data set. Plots of sensor traces 1–3 were used to visually identify and exclude vowels with errors. The remaining errors and outliers were removed with the package 'robustbase' (Rousseeuw et al., 2012) in the R statistical environment (R Development Core Team, 2013). In order to compensate for differences in oral anatomy between speakers, the data were normalised using a z-score transformation.

FDA smoothing and aligning
Functional Data Analysis (FDA) is a technique for time-warping and aligning a set of signals to examine differences between them. FDA techniques and applications to speech analysis were first introduced by Ramsay et al. (1996), and further developed by Lucero et al. (1997), Lucero and Löfqvist (2005) and Gubian et al. (2011). In FDA, a function or function system is fitted to the data, and the fitting coefficients are examined instead of the original data. A commonly used function form is B-spline functions (Ramsay et al., 2009), which are flexible building blocks for fitting curves to approximate a large number of different shapes. By selecting weights for each spline, the overall shape becomes similar to the actual sensor trace. The details
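The z-score transformation used for speaker normalisation can be sketched as follows. This is a simplified per-trace version (in the study, normalisation compensates for each speaker's oral anatomy across their whole data set), and the sample values are invented.

```python
import numpy as np

def z_normalise(trace):
    """Speaker normalisation: centre a sensor trace on its mean and
    scale by its standard deviation (z-score transformation)."""
    trace = np.asarray(trace, dtype=float)
    return (trace - trace.mean()) / trace.std()

# Invented tongue-body height samples (mm) for one speaker.
z = z_normalise([8.0, 8.5, 9.1, 9.6, 9.9, 9.7])
```

After the transformation, every speaker's data share a mean of 0 and a standard deviation of 1, which is why the axes of Figures 2–5 are labelled in z-scores.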

are described in Schötz et al. (2013). In this study, FDA was used to smooth the sensor traces and to standardise the time axis, to facilitate comparisons between repetitions. All FDA processing was done using the R package 'fda' (see Schötz et al., 2013 for details).

Analysis of tongue articulation
Sensors 1 and 2 were selected to represent the tongue tip and body (see Figure 1). FDA-processed contours were plotted for the tongue body and tip dynamics in height and frontness, and the positions and dynamics were compared within as well as across the regional varieties. Statistical analysis was done with functional t-tests (see Ramsay et al., 2009 for details), where the t-statistic is a function of time, using the function tperm.fd in the 'fda' package.

Results
Generally, the vowel /ʉ̟ː/ displays distinct patterns from /iː/ and /yː/, and /ʉ̟ː/ also varies the most between regions. Among the SS and GS speakers we found a subdivision between speakers (5 type SS1, 4 type GS1) who articulate the three vowels with similar tongue positions as the MS speakers, and speakers (4 type SS2, 5 type GS2) who generally have different tongue positions compared to the MS speakers.

Tongue body height
Tongue body height is shown in Figure 2. In MS, GS1 and SS1 the position of the tongue body is lower for /ʉ̟ː/ than for /iː/ and /yː/, while in GS2 and SS2 the position is higher for /ʉ̟ː/. We found significant differences between varieties (pairwise functional t-tests, p<0.05) throughout the vowel in /ʉ̟ː/ for MS-GS2, MS-SS2, GS1-GS2 and SS1-SS2. For MS-GS1 and MS-SS1 the difference is not significant throughout the whole vowel. The main difference between SS2 and GS2 is that /iː/ has the lowest tongue body in SS2, while /yː/ is lower in GS2. SS1 displays slightly more arched contours for all vowels compared to the other varieties, suggesting a higher degree of diphthongisation or coarticulation.

Tongue tip height
Figure 3 shows that the tongue tip height for /yː/ is higher than for /iː/ and /ʉ̟ː/ in all varieties except MS, where /ʉ̟ː/ has the highest contour. Between varieties there are significant differences (pairwise functional t-tests, p<0.05) in the central part of /ʉ̟ː/ between MS and all other varieties. For GS1-GS2 and SS1-SS2 the difference is not significant. The dynamics for all the vowels in all the varieties is represented by slightly rising contours, suggesting closing diphthongisations, although some individual variation can be observed.

Tongue body frontness
As shown in Figure 4, the tongue body is more protruded in /iː/ and /yː/ than in /ʉ̟ː/ in all varieties except GS2, which displays the opposite pattern except in the final part of the vowel. /iː/ and /yː/ have similar contours in all varieties, with the clearest overlap in SS2. The vowel contours are either slightly rising (GS1, MS), arch-shaped (e.g. SS1, SS2) or slightly falling (GS2), suggesting different diphthongisation strategies.

Tongue tip frontness
Tongue tip frontness is shown in Figure 5. In MS the tongue tip is further back in /iː/ and /yː/ compared to /ʉ̟ː/, while the opposite pattern is found for all the other varieties. Between varieties, we found significant differences (pairwise functional t-tests, p<0.05) in the middle of /ʉ̟ː/ for MS vs. all the others. We also note somewhat different vowel dynamics in the different vowels and varieties, suggesting different types of diphthongisation gestures. In SS1 all vowels show slight forward-backward movements, but with an earlier timing for /ʉ̟ː/ than for /iː/ and /yː/. All vowels in GS1 move slightly forward, while they move backward in GS2. In MS, /iː/ and /yː/ show a forward motion, while the arch-shaped contour for /ʉ̟ː/ suggests a forward-backward movement.
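The authors did all FDA processing in the R package 'fda'. Purely as an illustrative sketch of the B-spline idea (not their pipeline), a noisy sensor trace can be smoothed and resampled onto a standardised time base with SciPy; the trace below is synthetic.

```python
import numpy as np
from scipy.interpolate import splrep, splev

def smooth_trace(t, y, s=0.5, n_points=50):
    """Fit a smoothing cubic B-spline to a sensor trace, then resample
    it on a common time base so repetitions can be compared point by
    point (a rough analogue of the smoothing/time-standardisation step)."""
    tck = splrep(t, y, s=s)                      # B-spline knots + weights
    t_norm = np.linspace(t[0], t[-1], n_points)  # standardised time axis
    return t_norm, splev(t_norm, tck)

# Synthetic noisy tongue-tip trace over one ~150 ms vowel.
t = np.linspace(0.0, 0.15, 30)
y = np.sin(2 * np.pi * t / 0.3) + 0.05 * np.cos(40 * t)
t_norm, y_smooth = smooth_trace(t, y)
```

Resampling every repetition onto the same 50-point axis is what makes pointwise comparisons — and hence a t-statistic that is itself a function of time, as in the functional t-tests above — possible.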


Figure 2. Mean tongue body height (z-score) as a function of normalised time for /iː, yː, ʉ̟ː/ in Malmö (MS) and two types of Gothenburg (GS1, GS2) and Stockholm (SS1, SS2) Swedish.

Figure 3. Mean tongue tip height (z-score) as a function of normalised time for /iː, yː, ʉ̟ː/ in Malmö (MS) and two types of Gothenburg (GS1, GS2) and Stockholm (SS1, SS2) Swedish.

Discussion
The results of this study indicate that the tongue articulation for /ʉ̟ː/ is significantly different from that for /iː/ and /yː/ in Stockholm, Gothenburg and Malmöhus Swedish alike. Our hypothesis of a different tongue articulation for /ʉ̟ː/ than for /iː/ and /yː/ was thus confirmed. Considerable regional variation was observed in this study, not only for each vowel in the front–back and open–close dimensions, but also in the vowel dynamics (diphthongisation). MS often displayed different patterns than SS and GS, supporting, at least in part, our hypothesis of different articulation strategies in different regional varieties.

The intra-regional variation found in SS and GS led to a subdivision into the four types SS1, SS2, GS1 and GS2. A closer look showed that the SS1 and GS1 speakers were more often from the outskirts of the Stockholm and Gothenburg areas than the SS2 and GS2 speakers. Furthermore, most SS2 and GS2 speakers had clearly Viby-coloured /iː/ and /yː/, which was not the case for most of the SS1 and GS1 speakers. No MS speakers used Viby-colouring. The Viby-colouring may offer one explanation for the differences in tongue articulation. In future studies, we will investigate this further by comparing articulatory and acoustic data, e.g. formant frequencies.


Figure 4. Mean tongue body frontness (z-score) as a function of normalised time for /iː, yː, ʉ̟ː/ in Malmö (MS) and two types of Gothenburg (GS1, GS2) and Stockholm (SS1, SS2) Swedish.

Figure 5. Mean tongue tip frontness (z-score) as a function of normalised time for /iː, yː, ʉ̟ː/ in Malmö (MS) and two types of Gothenburg (GS1, GS2) and Stockholm (SS1, SS2) Swedish.

In this study we analysed only two discrete points and two dimensions of the tongue — tongue tip and body height and frontness — and we used a standard z-score transformation for speaker normalisation. Although we did not look at lip rounding, traditionally regarded as the main difference between /iː/, /yː/ and /ʉ̟ː/, our results clearly show differences between these vowels in tongue body height as well. In future studies, we will compare tongue articulation to lip rounding, and we will also include a larger number of vowels, e.g. /eː/ and /øː/.

Acknowledgements
This study was carried out within the project Exotic Vowels in Swedish: an articulographic study of palatal vowels [VOKART] (Swedish Research Council, grant no. 2010-1599).

References
Björsten, S., Bruce, G., Elert, C.-C., Engstrand, O., Eriksson, A., Strangert, E. & Wretling, P. (1999). Svensk dialektologi och fonetik – tjänster och gentjänster. Svenska landsmål och svenskt folkliv: 7–23.
Boersma, P. & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program]. Version 5.3.76, retrieved from http://www.praat.org/.
Bruce, G. (2010). Vår fonetiska geografi. Lund: Studentlitteratur.
Fant, G. (1959). Acoustic description and classification of phonetic units.

21 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University

Ericsson Technics 1. (Reprinted McAllister, R., Lubker, J. & Lindblom, 1983 in Fant, G., Speech Sounds B. (1974). An EMG study of some and Features. 32–83. Cambridge, characteristics of the Swedish MA: MIT Press). rounded vowels. J. of Phon. 2: Engstrand, O., Björsten, S., Lindblom, 267–278. B., Bruce, G. & Eriksson, A. R Development Core Team. (2013). R: (2000). Hur udda är Viby-i? A Language and Environment for Experimentella och typologiska Statistical Computing, URL: observationer. Folkmålsstudier 39: http://www.R-project.org. 83–95. Helsingfors. Ramsay, J., Hooker, G. & Graves, S. Gubian, G., Cangemi, F., & Boves, L. (2009). Functional Data Analysis (2011). Joint analysis of F0 and with R and MATLAB. Springer. speech rate with functional data Ramsay, J. O., Munhall, K. G., Gracco, analysis. ICASSP, 4972–4975, V. L. & Ostry, D. J. (1996). Prague. Functional data analysis of lip Hadding, K., Hirose, H. & Harris, K. S. motion, J. Acoust. Soc. Am., 99, (1976). Facial muscle activity in 3718– 3727. the production of Swedish vowels: Rousseeuw, P., Croux, C., Todorov, An electromyographic study. J. of V., Ruckstuhl, A., Salibian-Barrera, Phonetics 4: 233–245. M., Verbeke, T, Koller, M. & Ladefoged, P. & Maddieson, I. (1996). Maechler, M. (2012). robustbase: The Sounds of the World’s Basic Robust Statistics. R package Languages. Oxford: Blackwells. version 0.9-7. URL: Lucero, J., Munhall, K., Gracco, V. & http://CRAN.R- Ramsay, J. 1997. On the regis- project.org/package=robustbase. tration of time and the patterning of Schötz, S., Frid, J., Gustafsson, L, speech movements, J. Speech Lang. Löfqvist, A. (2013). Functional Hear Res. 4ː1111–1117. Data Analysis of Tongue Lucero, J. & Löfqvist, A. (2005). Articulation in Palatal Vowels: Measures of articulatory variability Gothenburg and Malmöhus in VCV sequence, Acoust. Res. Swedish /i:, y:, ʉ̟ :/. In Proceedings Lett. Online 6, 80–84. of Interspeech 2013, Lyon.


An acoustic study of the Estonian Swedish lateral [ɬ] Susanne Schötz1, Francis Nolan2, Eva Liina Asu3 1Centre for Languages and Literature, Lund University, Sweden 2 Department of Theoretical and Applied Linguistics, University of Cambridge, UK 3 Institute of Estonian and General Linguistics, University of Tartu, [email protected], [email protected], [email protected]

Abstract
This pilot study investigates the Estonian Swedish (ES) voiceless lateral [ɬ], which is rare among the Scandinavian dialects and is a development mainly of historic /sl/ clusters. Six elderly ES speakers were recorded, and the phonological and phonetic (duration, relative intensity) properties of [ɬ] studied and compared to other ES consonants and [ɬ] in Icelandic. The results suggest that ES [ɬ] is a single consonant rather than a consonant cluster. It behaves much like initial [s] in duration, although a tendency to anticipatory voicing in its latter part may point to its 'approximant' status. Furthermore, ES [ɬ] is similar in intensity to the Icelandic [ɬ]. It has a phonemic status and it can be both short and long. Laterals as well as other phonetic aspects of ES are in urgent need of further research.

Introduction
Until the Soviet Russian occupation at the end of WWII there was a substantial Swedish-speaking population in Estonia, which had settled there in medieval times from the 13th century onwards. Swedish was concentrated mainly on islands off the west coast of Estonia and on the north-west corner of the Estonian mainland. Nowadays, Estonian Swedish (ES) survives only in a small community of elderly emigrants to Sweden and a tiny handful of equally elderly speakers in Estonia.
The sound system of ES has so far mainly been studied in the descriptive framework of dialect research (e.g. E. Lagman, 1979). The only two existing acoustic phonetic studies are an investigation of ES close vowels by Asu, Schötz and Kügler (2009), and a study of the prosody of compounds by Schötz and Asu (2013). Yet, the ES sound system offers plenty of interesting and unique material for acoustic phonetic study. One such is the lateral system that will be tackled here.
According to H. Lagman (1971: 175–188), ES has a fairly large number of laterals, consisting of the following:
(1) [l]: a voiced dental or alveolar common to all Scandinavian languages, in most contexts, e.g. lag (team) and vall (ley, mound),
(2) [ɭ]: a voiced supradental (post-alveolar) in the context /r/ + /l/, as in many Swedish and Norwegian dialects (except most with [ʁ]), e.g. farligheter [fɑːɭɪhaɪtɵ(ɹ)] (hazards),
(3) [ɽ]: a voiced retroflex flap in some contexts, as in some Swedish and Norwegian dialects, e.g. blå [bɽoː] (blue) and nagel [nɑːɽ] (nail),
(4) [ɬ]: a voiceless alveolar fricative, used sometimes in the (historic) contexts /sl-/ and /tl-/, which is more rare among the Scandinavian dialects, e.g. slag [ɬoː] (stroke, blow) and vassle [vaɬː] (whey).
The latter two constitute major developments in the lateral system, one common to many dialects of Swedish and Norwegian, and the other idiosyncratic, or at least very rare. The retroflex flap [ɽ], often referred to as 'cacuminal' or 'thick' /l/, is found in initial clusters (e.g. flytta [fɽɛte] or [fɽɪtː] 'to move', and blöt [bɽɑʊt] 'wet'), medially in some words (e.g. mala [mɔɽɐ] 'to grind'), and finally after certain consonants (e.g. fågel [føːɽ] 'bird', and körhjul [keːɾjøːɽ] 'bicycle'). Swedish

and Norwegian [ɽ] is thought to have first arisen as a realisation of [rð] sequences in words such as garðr 'court(yard)' and borð 'board', and to have spread to /l/ in certain contexts. Tiberg (1962), though, claims that [ɽ] in Estonian Swedish did not originate from [rð] sequences.
The second development, the lateral voiceless fricative [ɬ] from historic sequences of /s/+/l/, occurs in words such as slipp [ɬɪpː] 'cloth/rag' and vassle [vaɬː] 'whey'. This sound is used consistently across our speakers, though with realisational variation. It is very rare in Swedish varieties, being (to our knowledge) found elsewhere only in the Älvdalska and Orsamål dialects of Dalarna in the west of Sweden. It may also have been present in the Gammalsvenskby dialect found in Ukraine before coming under pressure from more standard forms.
The purpose of the present paper is to quantify the phonetic properties of ES [ɬ]. Specifically, we wanted to answer the following questions: (a) Is [ɬ] still a complex onset, i.e. a nostalgic [sl-] consonant cluster? An initial auditory analysis revealed that many tokens, especially in Rickul ES, have [ɬl], and some tokens showed evidence of a voiceless vocalic portion before the lateral friction, i.e. [hɬl]. Does this suggest a phonological sequence, assimilated for manner but not voicing? (b) If [ɬ(l)] is not a cluster, has the original [sl-] become a lateralised [s], or has it become a devoiced [l]? And (c) is it an approximant [l̥] or a fricative [ɬ]? An additional aim was to hypothesise the historical origin of [ɬ].

Figure 1. The main dialect areas of Estonian Swedish in the 1930s (from E. Lagman 1979: 2).

Material and method
Speakers for the present study represent the Nuckö-Rickul-Ormsö variety of Estonian Swedish, which forms the largest dialect area of Estonian Swedish. As can be seen from Figure 1, Nuckö and Rickul lie in the north-west of the Estonian mainland and Ormsö is a small island close to Nuckö. Three elderly speakers of each of Rickul and Ormsö ES were recorded in Stockholm in 2012. All speakers have been resident in Stockholm since the 1940s. The main speech material was compiled with the help of the members of the Estonian Swedish community Svenska Odlingens Vänner (SOV) in Stockholm and contained 90 words, most of which included liquid consonants in various positions. The words were read in the carrier sentence 'jag sa ___ igen' (I said ___ again). The speakers were asked to produce three repetitions of each sentence. In this study we focussed on five words with initial [ɬ(l)]: slag, slita, slipp (='rag'), slå and slak; eleven words with initial [f]: fågel, falla, fel, föl, föll, fall, fil, fall, fylla, fåll and fittla (='tickle'); and six words with initial [fɽ]: flytta, flaska, flat, fläta, flämta and flicka.
As additional materials, an elicited word list adapted from the word list used in the Swedish dialect project SweDia 2000 (Bruce et al., 1999) was used. These materials were recorded in Stockholm in 2009 with four Rickul speakers (of whom three were the same as the ones recorded in 2012). From this corpus, we selected the following words: [l]: läs, lus, lös, lat, lott, lass, ludd, lett; and [s]: särk, såll, saker, sur ([søːɹ], used instead of söt for /øː/).


The target words were manually segmented in Praat (Boersma & Weenink, 2014), and durations were obtained for the initial singleton consonant or consonant cluster. In the case of [ɬ] we also measured the duration of its potential components: [(h)ɬ(l)]. Figure 2 shows an example of an [ɬl] by one Rickul speaker.

Figure 2. Waveform and spectrogram of [ɬl] for one Rickul speaker.

Results

Phonetic properties of [ɬ]
Mean durations of [(h)ɬ(l)] for Rickul and Ormsö ES are shown in Figure 3.

Figure 3. Mean duration of the components of [(h)ɬ(l)] for three Rickul (top) and three Ormsö (bottom) speakers.

The Rickul speakers on average produce a largely voiceless lateral, which is often voiced towards the end. Two of the Ormsö speakers produce a completely voiceless [ɬ], while one speaker has shorter and less frequent voicing at the end of the lateral. There is some tendency towards an [h] onset to [ɬ] in both dialects, which may be interpreted as pre-aspiration or anticipatory devoicing of the vowel.

Phonological and phonetic status of [ɬ]

Is [ɬ] a singleton or a consonant cluster?
Several of the [ɬ] tokens are realised as [(h)ɬl], which is unlike [ɬ] in other languages such as Zulu and Welsh (Ladefoged & Maddieson, 1996; Mark Jones, pers. communication). The duration is also rather long for a singleton (150–250 ms, see Figure 3). This would suggest that [ɬl] is phonologically a consonant cluster. However, two of the Ormsö speakers lack voicing, and only one Rickul speaker has no fully voiceless tokens, which would indicate that [ɬl] is phonologically a singleton. Figure 4 shows mean durations of the consonant cluster [fɽ], the two singleton consonants [s] and [f], and the possible consonant cluster [(h)ɬl].

Figure 4. Mean duration of initial [ɬ(l)], [fɽ], [s] and [f] for three Rickul (top) and three Ormsö (bottom) speakers.

The results for [s] should, however, be seen as tentative, since they are based on only three tokens per speaker, and as two of the Ormsö speakers did not produce any words with initial [s]. In both Rickul and Ormsö ES, [s] can be as long as [ɬ(l)], suggesting that [ɬ(l)] need not be a cluster. However, durational compensation means that clusters need not be longer than singletons, so the question is still open, although the evidence tends against the consonant cluster structure of /sl-/.

Is [ɬ(l)] a lateral [s] or a voiceless [l̥]?
If [ɬ(l)] is not a nostalgic reflex of the [sl-] consonant cluster but a singleton, should we regard it as a lateral [s], or rather as a voiceless [l]? We compared the duration of [ɬ(l)] to that of [l] and [s] in initial positions of words taken from the word list materials read by the same speakers of Rickul ES. The durations for four Rickul speakers are shown in Figure 5.

Figure 5. Mean duration of initial [ɬ(l)], [l] and [s] for four Rickul speakers.

The duration of [ɬ(l)] is clearly more similar to the duration of [s] than to that of [l], indicating that, durationally, [ɬ(l)] may be some kind of lateralised [s].

Is it a [ɬ] or a [l̥]?
The difference between [ɬ] and [l̥] is arguably that devoicing an approximant (e.g. [l̥]) needs only enough glottal opening to stop voicing, and gives minimal local friction, whereas a voiceless fricative (e.g. [ɬ]) needs wide opening, and a narrower oral constriction, to produce significant local friction. Ladefoged and Maddieson (1996: 199) imply that anticipatory voicing is characteristic of [l̥] (and some voiceless nasals) rather than [ɬ].
On this basis the ES voiceless lateral shows signs of anticipatory voicing, i.e. of being an approximant, at least in Rickul ES. However, impressionistically the strength of the friction suggests 'fricative'.
We compared intensity ratios between the lateral and the following vowel in ES and Icelandic, which has an approximant /l̥/, often realised as [ɬ], as well as a fricative [s]. Figure 6 shows that the mean intensity of the following vowel relative to [ɬ] is very similar in Ormsö ES and Icelandic, suggesting that [ɬ] may have approximant status.

Figure 6. Mean intensity (dB) of the following vowel relative to [ɬ] for three Ormsö speakers and one speaker of Icelandic.

Is [ɬ(l)] a phoneme?
Figure 7 shows the waveform and spectrogram of an example minimal pair in ES: /valː/ – /vaɬː/.

Figure 7. Waveform and spectrogram of the Estonian Swedish minimal pair /valː/ – /vaɬː/.


The existence of minimal pairs in our material, e.g. /lɑː/ 'lag' (team) – /ɬɑː/ 'slå' (hit, blow) and /valː/ 'vall' (mound) – /vaɬː/ 'vassle' (whey), suggests that [ɬ] has phonemic status. Moreover, it can be short [ɬ], as in [ɬɑː], as well as long [ɬː], as in [vaɬː].

Historical origin of [ɬ]
ES has undoubtedly historically been influenced by its closest contact language, Estonian (e.g. Danell, 1905: 34, ref. in H. Lagman, 1971: 13). As Estonian prefers simple onsets, it is tempting to hypothesize that the emergence of [ɬ] for [sl-] might somehow have to do with Estonian influence. On the other hand, the sound is not present in the Estonian phonemic inventory, and the hypothesis is less persuasive than it might be, given that we are not aware of any other processes of cluster simplification in ES like those evident in borrowings into Estonian from … . Also, for this solution to generalize, dialects like Älvdalska – one of the few other dialects with [ɬ] – would have had to be in contact with e.g. Sami, which is not self-evidently likely in the relevant timescale.
It is therefore more likely that the seeds of [ɬ] may be inherent in variation in [sl-] more generally. To test this, a search was made of the DyViS database (see Nolan et al., 2009). Perhaps surprisingly, in approximately 25 hours of interviews the interviewees provided only 16 examples of word-initial /sl-/. Of these, all had identifiable voiceless lateral events following the [s], and nearly half (7/16) lacked a subsequent identifiable voiced lateral episode. This suggests that a voiceless lateral is latent in any [sl-] sequence.

Discussion and future work
The results of this pilot study suggest that [ɬ] is a phoneme in ES, which can be realised as [ɬ], [ɬl], or [hɬl] in the Rickul and Ormsö sub-varieties. Moreover, the duration and intensity measurements suggest that [ɬ] is a singleton consonant rather than a consonant cluster [sl-], which behaves like an [s] in duration, but like a [l̥] in intensity, suggesting that it has approximant status. However, three speakers of each sub-variety are too few for any broader generalisations. Remaining questions may be answered by including a larger number of speakers and other sub-varieties of ES. In the project Estonian Swedish Language Structure we intend to record additional speakers of different ES dialects and also use previously recorded materials to further investigate the phonetic and phonological properties of ES.

Conclusions
Estonian Swedish has a rich and unusual set of liquids, as shown in Figure 8.

Figure 8. Estonian Swedish liquids.

The most unusual liquid among the Scandinavian languages is the voiceless lateral [ɬ]. Although some questions remain, in ES [ɬ] appears to be:
a) a singleton consonant rather than a nostalgic reflex of a cluster,
b) a voiceless fricative in duration, behaving much like initial [s], although the partial final voicing may point to 'approximant' status, similar to Icelandic voiceless laterals,
c) optionally anticipatorily voiced,
d) a phoneme, as it appears both short and long in minimal pairs like /lɑː/ – /ɬɑː/ and /valː/ – /vaɬː/.

Acknowledgements
The authors are very grateful to Júlía Baldursdóttir for her help with data analysis and for recording the Icelandic words. We would also like to thank our Estonian Swedish informants and Svenska Odlingens Vänner in Stockholm. The work on this paper was supported by the project Estonian Swedish Language Structure (ESST) (Swedish Research Council, grant no. 2012-907).
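Both core measurements behind these conclusions — segment duration and the intensity of the following vowel relative to the lateral — come down to simple arithmetic once boundaries and RMS amplitudes have been read off the segmented recordings. A sketch with made-up token values (the actual measurements were made in Praat; variable names and numbers here are hypothetical):

```python
import math

def duration_ms(start_s, end_s):
    """Segment duration in milliseconds from boundary times in seconds."""
    return (end_s - start_s) * 1000.0

def relative_intensity_db(rms_vowel, rms_lateral):
    """Mean intensity of the vowel relative to the lateral, in dB
    (20*log10 of the amplitude ratio)."""
    return 20.0 * math.log10(rms_vowel / rms_lateral)

# One hypothetical token of 'vassle' [vaɬː]: boundaries in s, RMS amplitudes
lateral = {"start": 0.312, "end": 0.498, "rms": 0.021}
vowel = {"start": 0.498, "end": 0.641, "rms": 0.084}

dur = duration_ms(lateral["start"], lateral["end"])       # 186 ms
rel = relative_intensity_db(vowel["rms"], lateral["rms"]) # ~12 dB: vowel louder
```

A positive dB value means the vowel is louder than the lateral; comparing such values across varieties is what Figure 6 summarises.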


References
Asu, E.L., Schötz, S. & Kügler, F. (2009). The acoustics of Estonian Swedish long close vowels as compared to Central Swedish and Finland Swedish. Proceedings of Fonetik 2009, Dept. of Linguistics, Stockholm University.
Boersma, P. & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program]. Ver. 5.3.76, retrieved from www.praat.org/.
Bruce, G., Elert, C.-C., Engstrand, O. & Eriksson, A. (1999). Phonetics and phonology of the Swedish dialects – a project presentation and a database demonstrator. Proceedings of ICPhS 99 (San Francisco), 321–324.
Ladefoged, P. & Maddieson, I. (1996). The Sounds of the World's Languages. Oxford: Blackwell.
Lagman, E. (1979). En bok om Estlands svenskar. Estlandssvenskarnas språkförhållanden. 3A. Stockholm: Kulturföreningen Svenska Odlingens Vänner.
Lagman, H. (1971). Svensk-estnisk språkkontakt. Studier över estniskans inflytande på de estlandssvenska dialekterna. Stockholm.
Nolan, F., McDougall, K., de Jong, G. & Hudson, T. (2009). The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. International Journal of Speech, Language and the Law 16(1), 31–57.
Schötz, S. & Asu, E.L. (2013). An acoustic study of accentuation in Estonian Swedish compounds. In Asu & Lippus (Eds.), Nordic Prosody: Proceedings of the XIth Conference, Tartu 2012, 343–352. Frankfurt am Main: Peter Lang.
Tiberg, N. (1962). Estlandssvenska språkdrag. Lund: Carl Bloms Boktryckeri A.-B.


A data-driven approach to detection of interruptions in human–human conversations Raveesh Meena, Saeed Dabbaghchian, Kalin Stefanov Department of Speech, Music and Hearing (TMH) KTH, Stockholm [email protected], [email protected], [email protected]

Abstract
We report the results of our initial efforts towards automatic detection of a user's interruptions in spoken human–machine dialogue. In a first step, we explored the use of automatically extractable acoustic features, frequency and intensity, in discriminating a listener's interruptions in human–human conversations. A preliminary analysis of interaction snippets from the HCRC Map Task corpus suggests that for the task at hand, intensity is a stronger feature than frequency, and using intensity in combination with the feature loudness offers the best results for a k-means clustering algorithm.

Introduction
Interruptions are important elements of conversations. They contribute to the mediation of content and the redirection of a conversational exchange. Human-like conversational dialogue systems should not only be able to 1) use interruptions as a means to regulate the direction of a conversation, but also 2) discriminate a user's interruptions from backchannels and turn-taking attempts, and select an appropriate response. A system's insensitivity to user interruptions could render a dialogue inefficient and have an adverse effect on user experience. In this work, we aim at building a computational model for automatic detection of interruptions in human–human conversations.

Background
Various works have analyzed the acoustic and prosodic characteristics of conversational elements such as interruptions, backchannels and turn-changes. Yang (2001) analyzed the maximum pitch and intensity in speaker turns, and described the function of interruptions in managing local and global coherence in conversation that is brought about through the systematic phrase-to-phrase prosodic patterns of discourse. For example, a speaker's attempts at taking the conversational floor while the main speaker is speaking (referred to as competitive interruptions) are characterized by high pitch and amplitude. In contrast, speaker statements supporting the main speaker's contentions, with no intention to take the conversational floor (referred to as cooperative interruptions), often occur at low or medium pitch levels.
In related work, Gravano & Hirschberg (2012) examine interruptions in a corpus of spontaneous task-oriented dialogue and report a number of significant differences between interrupting and non-interrupting turns, based on features such as speaking rate, mean intensity, mean pitch, and duration of speaker speech. Lee & Narayanan (2010) analyzed the differences between competitive and cooperative interruptions with the features change and activeness, employing audio, visual, and dis-fluency data. They have shown that using these features in combination offers better results in discriminating between the two types than using any single feature modality.
Our work is motivated by the observation made in this literature regarding the distinct acoustic characteristics of backchannels, interruptions, and turn-changes. However, unlike the supervised methods for classification used

in Lee et al. (2008) and Lee & Narayanan (2010), we 1) take an unsupervised approach to automatically cluster speaker utterances into interruption, backchannel, and turn-change categories; and 2) use a fully automatic scheme for extraction of frequency and intensity features for training a model for online use.

Method

Corpus
To get a feeling for the task at hand, we started with a relatively small subset of the HCRC Map Task corpus (Anderson et al., 1991). In the Map Task interaction, one of the dialogue participants (giver) provides instructions to the other human participant (follower) about finding her way to a destination on a map. In one setting, participants have no visual contact with each other, and as the respective maps are not completely identical (absence or presence of landmarks or differences in landmark names), the conversations inevitably involve clarifications, acknowledgements, backchannels, interruptions, turn-changes, etc. Three Map Task interactions (average duration 15 min) were randomly picked and only the first 3 min of each interaction was analyzed for this initial work. The resulting dataset contains 5 unique participants (3 male and 2 female).

Feature extraction
We explored the use of two acoustic features, frequency and intensity, for the task at hand. We used inter-pausal units (IPUs) – speech units separated by 200 milliseconds of silence – as the basic unit of processing, i.e. deciding whether a speaker IPU is an interruption, a backchannel or a turn-change. Towards this, we first used a voice activity detector to automatically segment speakers' speech into IPUs. The automatic segmentation method is not perfect and doesn't always produce segmentation around turns with simultaneous speech and speech regions with low energy. This results in some user speech units getting lost in the next processing stage, which indicates the issues and limitations of a fully automatic system for the task at hand. Next, for each speaker IPU we extracted the maximum f0 and intensity values using the WaveSurfer toolkit (Sjölander & Beskow, 2000). The feature values were z-normalized (z = (x − µ)/σ) for building a speaker-invariant model. In addition to this, we used the perception-level features maximum pitch and loudness (the log semitone equivalents of frequency and intensity, respectively).

Figure 1. Spread of training instances with z-normalized max frequency and intensity.

The spread of training instances in our dataset using the features zNormMaxFrequency and zNormMaxIntensity is illustrated in Figure 1. In order to obtain the ground truth of the category of speaker IPUs in our data, the authors of this paper labeled the IPUs with one of the three categories: Intrp (the IPU is an interruption), BckFb (the IPU is a backchannel), and TurnChange (the IPU marks a turn-change). Since this is still an exploratory work, the judges sat together and labeled the data unanimously (of course, in a larger study this would have been done more formally, along with Kappa scores for inter-annotator agreement). In our training set we have 30 instances of interruption, 25 instances of backchannel, and 58 instances of turn-change.
A cursory look at Figure 1 suggests that interruptions (Intrp) indeed tend to have higher maximum intensity and frequency. Backchannels (BckFb), in contrast, are in the lower end of the spectrum. The instances of turn-change lie somewhere in the middle, but are spread largely over interruptions, suggesting that it would be hard to discriminate interruptions and turn-changes. A univariate analysis of variance of the means of zNormMaxFrequency and zNormMaxIntensity suggests that only the backchannel category differs significantly from the interruption and turn-change categories. This suggested that using these two features only, we would not have much success in discriminating between the three categories. Therefore, for the remaining part of this paper we focus only on the task of discriminating interruptions from backchannels. Figure 2 illustrates the distribution of interruptions and backchannels in our training dataset.

Figure 2. Spread of interruptions and backchannels in the training set (55 instances).

Result

Clustering and classification
We used the centroid-based k-means clustering algorithm (with k=2) to automatically cluster our training dataset consisting of only backchannels and interruptions. In Figure 3, the two black circles indicate the two cluster centroids, (1.42, 1.19) and (1.99, 4.14), with 39 and 16 instances in the respective clusters. Based on the observations made in the literature that interruptions are characterized by higher frequency and intensity, we label the cluster with centroid (1.99, 4.14) as the cluster representing interruptions, and the other cluster as representing backchannels. If we treat the cluster label of instances in these clusters as their learned category, we would correctly label 78.1% of the training dataset. The training instances with erroneous classification are indicated with red color in Figure 3.

Figure 3. K-means clustering and classification results. Misclassified instances are indicated in red color.

Since the majority class in our dataset is interruptions (54%), the accuracy of 78.1% is a huge improvement over the majority class baseline. Table 1 summarizes the performances of the various (additive) features explored in this work. Both z-normalized max intensity and loudness are stronger features in comparison to frequency and pitch. Using the combination of z-normalized maximum intensity and loudness, we obtained the best clustering performance of 80.0%. Table 2 presents the precision, recall and F-measure corresponding to this feature combination. The model achieves a high F-measure for interruptions.

Table 1. Feature performances (where + indicates additive feature combinations)

Feature(s)                            Accuracy
zNormMaxFrequency                     70.9%
  + zNormMaxIntensity                 78.1%
zNormMaxLoudness                      76.3%
  + zNormMaxPitch                     61.8%
zNormMaxIntensity + zNormMaxLoudness  80.0%


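The pipeline just described — z-normalise the features, cluster with k-means (k=2), label the cluster with the higher centroid values as interruptions, and score against the annotation — can be sketched end-to-end. A self-contained toy version (pure NumPy with a deterministic initialisation; the paper does not specify which k-means implementation was used):

```python
import numpy as np

def kmeans2(X, n_iter=20):
    """Two-cluster k-means; centroids start at the two most extreme points."""
    c = np.stack([X[X.sum(1).argmin()], X[X.sum(1).argmax()]])
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        labels = dists.argmin(1)
        c = np.stack([X[labels == k].mean(0) for k in (0, 1)])
    return labels, c

def f_measure(precision, recall):
    """Harmonic mean of precision and recall, as reported in Table 2."""
    return 2 * precision * recall / (precision + recall)

# Toy IPU features: (max intensity, max loudness) before normalisation
X = np.array([[0.2, 0.1], [0.3, 0.2], [0.1, 0.3],    # backchannel-like
              [2.1, 1.9], [2.4, 2.2], [2.0, 2.3]])   # interruption-like
X = (X - X.mean(0)) / X.std(0)                       # z-normalise each feature
labels, centroids = kmeans2(X)
intr_cluster = centroids.sum(1).argmax()  # higher-feature cluster = interruptions
```

With the precision and recall reported for the best feature pair, f_measure(0.77, 0.90) and f_measure(0.85, 0.68) reproduce the F-values 0.83 and 0.76 in Table 2 (to two decimals).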
Table 2. Precision, Recall and F-measure of clustering using z-normalized max intensity and loudness

        Precision  Recall  F-measure
Intrp     0.77      0.90     0.83
BckFb     0.85      0.68     0.76

Discussion
We have presented preliminary results from our efforts towards automatic detection of user interruptions in spoken human–human conversation. We formulated the task as that of clustering speaker utterances into three categories: interruption, backchannel, and turn-change. We explored two acoustic features, z-normalized maximum frequency (and pitch) and intensity (as well as loudness), in speaker utterances. A preliminary analysis of interaction snippets from the HCRC Map Task corpus suggested that the task of discriminating between backchannel and interruption is the more feasible one on the dataset at hand. Using our fully automated approach to extracting feature values, we have observed that intensity is a stronger feature in comparison to frequency, and that using intensity in combination with loudness offers the best performance for discriminating interruptions from backchannels.
The results obtained in this work are encouraging. In a next step it would be interesting to see whether scaling up the training set would provide similar or better results. More data should help us return to our original task of automatic discrimination between the three categories.
It would also be interesting to see if the performance of the models presented here could be improved by using additional features, such as duration, as suggested in Gravano & Hirschberg (2012), or measures of activity (how the values fluctuate in the overlapping speech regions) and change (shift in peak values), as in Lee & Narayanan (2010). The dataset contains both overlap and non-overlap speech segments. A similar analysis on the data with a clear separation of the two cases would be an interesting investigation.
A major limitation of this work is that we have excluded turn-changes from the current dataset. As observed, with regard to their acoustic properties (max frequency and intensity), turn-changes overlap largely with interruptions. This suggests that one may want to explore contextual (dialogue act) and lexico-syntactic features for telling them apart from interruptions and backchannels.

References
Anderson, A., Bader, M., Bard, E., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H. & Weinert, R. (1991). The HCRC Map Task corpus. Language and Speech, 34(4), 351–366.
Gravano, A. & Hirschberg, J. (2012). A Corpus-Based Study of Interruptions in Spoken Dialogue. In INTERSPEECH. ISCA.
Lee, C-C., Lee, S. & Narayanan, S. S. (2008). An analysis of multimodal cues of interruption in dyadic spoken interactions. In INTERSPEECH (pp. 1678–1681). ISCA.
Lee, C-C. & Narayanan, S. (2010). Predicting interruptions in dyadic spoken interactions. In ICASSP (pp. 5250–5253). IEEE.
Sjölander, K. & Beskow, J. (2000). WaveSurfer – an open source speech tool. In Yuan, B., Huang, T. & Tang, X. (Eds.), Proceedings of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 464–467). Beijing.
Yang, L-C. (2001). Visualizing Spoken Discourse: Prosodic Form and Discourse Functions of Interruptions. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue – Volume 16 (pp. 1–10). Stroudsburg, PA, USA: Association for Computational Linguistics.

32 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University

The WaveSurfer Automatic Speech Recognition Plugin

Giampiero Salvi, Niklas Vanhainen
KTH, School of Computer Science and Communication, Department for Speech Music and Hearing, Stockholm, Sweden
[email protected], [email protected]

Abstract
This paper presents a plugin for automatic speech recognition (ASR) in the WaveSurfer sound manipulation and visualization program. The plugin allows the user to run continuous speech recognition on spoken utterances, or to align an already available orthographic transcription to the spoken material. The plugin is distributed as free software and is based on free resources, namely the Julius speech recognition engine and a number of freely available ASR resources for different languages.

More information is available in Salvi and Vanhainen (2014).

References
Salvi, G., & Vanhainen, N. (2014). The WaveSurfer Automatic Speech Recognition Plugin. In Language Resources and Evaluation Conference (LREC). Reykjavik, Iceland.



Towards a contingent anticipatory infant hearing test using eye-tracking

Iris-Corinna Schwarz1, Atena Nazem1, Sofia Olsson1, Ellen Marklund1, Inger Uhlén2
1 Department of Linguistics, Stockholm University, Sweden
2 Department of Hearing and Balance, Karolinska Universitetssjukhuset, Sweden
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
Early identification of infant hearing impairment is imperative to prevent developmental language difficulties. The current diagnostic method is Visual Reinforcement Audiometry (VRA), in which infant response to sound is observed to establish hearing thresholds. Together with the Karolinska Institute, we are developing an observer-independent contingent anticipatory infant hearing test using eye-tracking to increase reliability and significance levels of the current clinical practice. The present pilot study addresses in particular the first phase of the test, in which the eye response is conditioned to occur at sound detection. The aim is to establish how well 6.5-month-olds associate the presence of sound with a certain location via a visual reward.

Introduction
Worldwide, more than 665 000 infants are estimated to be born with a significant hearing loss, and during the first year, the figures increase due to the occurrence of acquired hearing impairment (Olusanya, 2005). The earlier in infancy hearing impairment is discovered, the earlier measures can be undertaken to prevent communication difficulties and language delay (Lidén & Kankkunen, 1969).

In Sweden, detection of hearing difficulties has been included nationwide in the general screening procedures for newborns. Newborn hearing screening today commonly includes two steps: first, the otoacoustic emissions (OAE), and second, the auditory brainstem response (ABR). The otoacoustic emissions are the response of the outer hair cells of the cochlea to acoustic stimuli (Vohr & Maxon, 1996). The test probes the physical cochlear components of perceiving an auditory event and gives a first indication of the working state of the hearing system (Kemp, 1978). If the hearing thresholds lie above 30 dB, no response will be indicated by OAE.

Infants who fail the newborn hearing screening are sometimes referred to the ABR. This procedure tests whether auditory information arrives at the cortex by measuring electroencephalography (EEG) at specific skull locations, and requires sedation. A short latency is expected in the EEG waveforms at the skull electrodes in response to a click stimulus (Davis, 1976).

OAE and ABR test only the functionality of the hearing system. If all infants were tested with the 2-stage method, about 23 % of the infants with permanent hearing loss at 9 months would still have passed ABR. When considering mild hearing loss, over 70 % of infants would pass the screening procedure, yet be in need of help (Johnson et al., 2005). This demonstrates the necessity of a hearing test that monitors behavioural responses to auditory stimulation and can establish hearing thresholds. In current clinical practice, Visual Reinforcement Audiometry (VRA) fulfils this purpose.

Visual Reinforcement Audiometry
Originating in Conditioned Orientation Reflex Audiometry (Suzuki & Ogiba, 1961), the VRA paradigm resembles the Conditioned Head-Turn Procedure commonly used in infant speech perception studies (Werker, Polka, & Pegg, 1997), but does not incorporate the same level of observer objectivity. The child is seated in front of a panel with two loudspeakers, one on each side, combined with two screens to display reward pictures. The audiologist presents a sound stimulus on one of the loudspeakers, and a correct turn to that side by the child is reinforced with picture presentation on the corresponding screen. The method is based on the assumption that, after a number of trials, the child becomes conditioned to associate sound presentation with picture display. Therefore, the head-turn towards the loudspeaker and/or screen at sound presentation can be interpreted as an indication of sound perception.

The problems associated with VRA are obvious: The audiologist is not blind to when a sound is presented. The parent, also present in the test room, may give subconscious indication of sound presence. The infant reaction is not always the defined head-turn an older child would present; instead, it often leaves room for interpretation, ranging from slight body shivers to eye movement or even eye widening. There is no defined criterion as to when conditioning has been passed. Significance levels are low, since the same threshold level per frequency is tested at most twice in total. And yet this is the current clinical diagnostic tool to determine infant hearing thresholds.

Inspired by the improved experimental control of the Conditioned Head-Turn Procedure, and yet incorporating the need for a behavioural measurement of infant hearing, we are developing a contingent anticipatory infant hearing test using eye-tracking.

The contingent anticipatory infant hearing test using eye-tracking
The first prototype was developed and coded in Tobii's Software Development Kit, based on Visual Basic (Mattson, 2009). A completed hearing test outputs a standard audiogram with hearing thresholds for both ears at the frequencies 500 Hz, 1000 Hz, 2000 Hz and 4000 Hz, providing a significance level for each threshold. This prototype required further refinement.

Palmgren & Sundberg (2012) developed a revised version of the program, called eye-tracker-based VRA, in which they managed to achieve anticipatory eye movement, however, unfortunately contingent on a constant time interval between fixation and sound presentation. Once this time interval was randomised, only few infants became conditioned.

The current complete version has been programmed in C# with all the features that the first prototype contained, but it proved too tedious to re-program just for the sake of testing different set-ups of the conditioning phase. Therefore we decided to implement parts of the complete hearing threshold test in E-Prime and Tobii Studio to log and record infant eye response to sound. For the current study, we wanted to improve the conditioning phase only, as too few infant participants managed to associate sound detection with looking at a particular place on the screen.

As more minimum response levels are obtained with sound field testing compared to insert-earphone stimulus presentation, especially in the youngest infants (Day, Bamford, Parry, Sheperd, & Quigley, 2000), we decided on sound field testing using loudspeakers rather than headphones. Additionally, since compliance to wear headphones or insert-earphones is low in infants, this reduces infant fussiness during testing.

Method
Participants
The participating families, randomly selected from the Swedish tax registry, lived in Greater Stockholm and had an infant at 6 months. They received an information letter and returned a consent form, upon which their lab visit was scheduled. Response rate was 9 %. In total, 12 infants (age 6 months +/- 2 weeks; 7 girls, 5 boys) participated in the study. All participants reported normal hearing screening results at birth and no known history of hearing difficulties.

Apparatus
The study was conducted in a test booth with an adjacent control room. The study set-up contained three directly linked stations, that is, one eye-tracker and two PCs. Sound and picture stimuli were presented via a PC (Windows XP) using the software E-Prime 2.0 (Psychology Software Tools, 2010), programmed in E-Studio version 2.0.8.90 using extensions for Tobii (Clearview). Eye movement was calibrated, recorded and analysed via another PC, using Tobii Studio software version 3.2.0. The infants were seated in their parent's lap in front of a Tobii Eye-tracker T120 with a 17-inch screen (1280 x 1024 pixels). The Tobii eye-tracking system does not require head-mounted equipment, as near-infrared light is projected into both eyes and the position of the head is calculated by an algorithm. This makes Tobii eye-trackers particularly suitable for infant studies, as eyes can be tracked while the head is free to move.

The sound stimulus was presented via two loudspeakers (NuForce S-1), located left and right of the eye-tracking monitor, about 70 cm away from the infant head.

Stimuli
The sound stimulus that was used as a placeholder for the pool of frequency- and SPL-exemplars was a warble tone with a 1000 Hz base frequency, a modulation depth of 100 Hz and a rate of 1 phase per second. The warble tone was generated for 6 s (stereo, 44100 Hz, 16 bit) with a left and a right channel, to be played at only one of the two loudspeakers at a time. Sound pressure level was measured as 83 dB at the source and 55 dB at infant position, with about 70 cm distance to the loudspeakers.

The picture rewards were created in Adobe Flash Professional CC (version 13.1.0.226) and consist of smiling flowers in two different contrastive and bright colours (see Figure 1). The pictures are animated to move up and down on screen in phase with the warble tone within their target box (360 x 500 pixels).

Figure 1. The reward pictures.

The fixation picture consisted of the same round window picture of a baby that was used for calibration. During Demonstration phases 1 and 2, the round frame was growing and shrinking to draw infant attention to the screen centre, whereas during test phase, it was reduced to a still picture.

Procedure
The infants sat in their parent's lap in the sound-attenuated test room at the Phonetic Lab at the Department of Linguistics at Stockholm University. Distance between infant head and eye-tracker was about 55-60 cm.

To calibrate the eye-tracker to the individual infant, 5-point manual infant calibration was used to capture the infant eye just when the infant was looking at the screen.
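A warble tone of the kind described above, a 1000 Hz sine whose frequency sweeps +/- 100 Hz once per second, can be synthesized by accumulating instantaneous phase sample by sample. The Python sketch below is illustrative only; the study generated its stimuli separately, and all function and parameter names here are assumptions:

```python
import math

def warble_tone(sr=44100, dur=6.0, base=1000.0, depth=100.0, rate=1.0):
    """Sine tone whose instantaneous frequency sweeps base +/- depth Hz
    at `rate` Hz, built by accumulating phase one sample at a time."""
    samples = []
    phase = 0.0
    for n in range(int(sr * dur)):
        t = n / sr
        freq = base + depth * math.sin(2 * math.pi * rate * t)
        phase += 2 * math.pi * freq / sr
        samples.append(math.sin(phase))
    return samples

tone = warble_tone()  # 6 s of mono samples at 44.1 kHz
```

This sketch produces a single mono channel; the study's stimulus was written as a stereo file and routed to only one loudspeaker at a time.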


Figure 2. Overview of the test procedure: If the infant passes criterion phase, it will count towards establishing the hearing thresholds, i.e., adding four additional trials to the test phase.

Figure 3. A typical trial with fixation phase, one-sided sound display during the decision window, and reward picture presentation if the correct side was fixated.

Instead of a growing and shrinking dot on the screen, a growing and shrinking round window with a baby face was used for calibration. Parents were not wearing headphones, since no proper hearing test was conducted, the infant sound detection was not time-sensitive, and parents were debriefed about the infant's actual task only after the test.

The test session contained several phases (see Figure 2). During demonstration 1, the infant is introduced to the task in 2 trials. After the fixation picture is displayed in the centre, the sound is presented while the reward is shown for 6 s, first on one side, then on the other (see Figure 3). Sides are counterbalanced between participants. During demonstration 2, the infant follows 6 typical trials, 3 for each side, in randomised order. The criterion phase consists of 4 trials, of which the infant has to pass 3. It is the first phase in which contingent feedback to infant fixation is active. A sound is only presented if the fixation image is watched for 200 ms. The reward is only activated if the correct box is fixated for at least 200 ms. The time window for this is 5.5 s; for the remaining 500 ms of the sound presentation, the reward is shown in any case, to remind of and reinforce the formed association between sound and picture. The test phase contains 20 trials, 10 for each side. If the infant does not pass criterion, the test loops back to demonstration 2 and then to the criterion phase again.

Each test session took about 10 min, depending on criterion phase outcome and potential demonstration phase repetitions.

Results
Of the 12 infants who participated, data of 6 had to be excluded from further processing due to inadequate calibration. Of the remaining 6, only 2 participants passed the criterion phase to move on to the test phase (see Table 1).

Table 1: Test performance of the 2 participants who passed criterion. Number of trials performed, number of correct side identifications and percentage correct is presented per participant.

No.   Trials   Hits   Correct
3     17       9      53 %
6     24       5      21 %

None of the participants performed significantly above chance level in identifying the correct side, although they passed the criterion for conditioning.

Table 2: First fixation duration in seconds towards target (T) and non-target (NT) on either side during criterion and test trials (R=right; L=left).

No.   TR     NTR    TL     NTL
3     1.63   3.43   0.56   1.84
6     0.64   1.4    1.44   0

When considering fixation behaviour, there is indication of the common right side bias. No matter whether target or non-target was situated on the right hand side, the duration of the initial fixation was much greater than for the left hand side (Table 2).

Discussion
As none of the participants who passed the criterion phase completed the test phase with hits considerably above chance level, it can be assumed that the current set-up of the conditioning phase is not satisfactory. The criterion of 3 out of 4 correct could be set too low, or the infants could need reminder demonstration trials later during the test phase, as they may forget. Keeping infant attention was also an issue, especially after a repeated demonstration phase 2 and therefore a prolonged test time. A variety of sound stimuli, as a real hearing threshold test would contain, and several different reward pictures could offer a solution to that. It was decided to be as rigorous as possible with the current set-up and not to vary either sounds or pictures, in order to build and strengthen the association as much as possible. Of course there is a fine line between a strengthened relationship and boredom, and this set-up's internal consistency may have crossed it.

Conclusions
The current study aimed to further develop the conditioning phase of a contingent anticipatory eye-tracker-based infant hearing test. While it is undisputed that such an automated and objective hearing threshold test for infants is bitterly needed, we cannot yet claim to have developed a fully functioning test. The present results on the improvement of the conditioning phase look promising and give directions for further development, but more steps need to be taken until we have arrived at a successful contingent anticipatory eye-tracker-based hearing test for infants.
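The 200 ms fixation criterion used in the procedure above amounts to a dwell-time check over successive gaze samples. A minimal sketch, assuming the T120's 120 Hz sampling rate and a hypothetical target box; none of these names come from the actual E-Prime/Tobii implementation:

```python
def dwell_reached(samples, box, min_ms=200.0, sample_ms=1000.0 / 120):
    """Return True once gaze stays inside `box` continuously for min_ms.

    `samples` is a sequence of (x, y) gaze points in screen pixels;
    `box` is (left, top, right, bottom)."""
    left, top, right, bottom = box
    run = 0.0
    for x, y in samples:
        if left <= x <= right and top <= y <= bottom:
            run += sample_ms
            if run >= min_ms:
                return True
        else:
            run = 0.0  # the dwell must be continuous, so reset on exit
    return False

# Hypothetical 360 x 500 px target box on a 1280 x 1024 screen
TARGET = (100, 262, 460, 762)
```

With these numbers, roughly 24 consecutive in-box samples (about 200 ms at 120 Hz) trigger the reward, while a gaze that leaves the box resets the count.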


Acknowledgements
We would like to thank all participating families for their contribution to the project. No research is possible without you! We also gratefully acknowledge the funding that kept this project alive over the years (VINNOVA 2011-03329, VR 421.2007-6400, and several internal grants from Karolinska University Hospital and Stockholm University).

References
Davis, H. (1976). Brainstem and other responses in electric response audiometry. Annals of Otology, Rhinology, and Laryngology, 85(1/1), 3-14.
Day, J., Bamford, J., Parry, G., Sheperd, M., & Quigley, A. (2000). Evidence on the efficacy of insert earphone and sound field VRA with young infants. British Journal of Audiology, 34(6), 329-334. doi: 10.3109/03005364000000148
Johnson, J. L., White, K. R., Widen, J. E., Gravel, J. S., James, M., Kennalley, T., . . . Holstrum, J. (2005). A Multicenter Evaluation of How Many Infants With Permanent Hearing Loss Pass a Two-Stage Otoacoustic Emissions/Automated Auditory Brainstem Response Newborn Hearing Screening Protocol. Pediatrics, 116(3), 663-672. doi: 10.1542/peds.2004-1688
Kemp, D. T. (1978). Stimulated acoustic emissions from within the human auditory system. Journal of the Acoustical Society of America, 64(5), 1386-1391.
Lidén, G., & Kankkunen, A. (1969). Visual reinforcement audiometry. Acta Oto-Laryngologica, 67(2-6), 281-292.
Mattson, L. (2009). Prototype of infant hearing test using eye tracking. Master's thesis, KTH Royal Institute of Technology, Stockholm, Sweden.
Olusanya, O. B. (2005). Can the world's infants with hearing loss wait? International Journal of Pediatric Otorhinolaryngology, 69, 735-738.
Palmgren, S., & Sundberg, J. (2012). Att se men inte höra - ett eye-trackerbaserat hörseltest för spädbarn [To see but not hear - an eye-tracker-based hearing test for infants]. Master's thesis, KTH Royal Institute of Technology, Stockholm, Sweden.
Suzuki, T., & Ogiba, Y. (1961). Conditioned orientation reflex audiometry: A new technique for pure tone audiometry for children under 3 years of age. Archives of Otolaryngology, 74(2), 192-198. doi: 10.1001/archotol.1961.00740030197013
Vohr, B. R., & Maxon, A. B. (1996). Screening infants for hearing impairment. The Journal of Pediatrics, 128(5), 710-713.
Werker, J. F., Polka, L., & Pegg, J. E. (1997). The conditioned head turn procedure as a method for testing infant speech perception. Early Development and Parenting, 6(3-4), 171-178.


Duration and pitch in perception of turn transition by Swedish and English listeners

Margaret Zellers
Department of Speech, Music & Hearing, KTH, Stockholm, Sweden
[email protected]

Abstract
Turn transition is often predictable, as evidenced by the relative ease with which speakers follow one another without large silences between turns. The current study investigates prosodic turn-taking cues in perception in Swedish and English. While Swedish listeners prefer duration as a cue for turn transition, English listeners prefer pitch cues. This difference between languages may be due to the relatively heavier meaning-bearing role of pitch in Swedish.

Introduction
The apparent ease with which transition from one speaker to another occurs in conversation has been widely acknowledged ever since Sacks, Schegloff & Jefferson's (1974) seminal paper, although some more recent work has indicated that these transitions are not as smooth as those authors originally proposed. Heldner & Edlund (2010) found a large degree of variation in the timing of the onsets of new speakers' turns, with the greatest number of turn transitions across three corpora having a "just-noticeable" silence with a duration of around 200 ms. Heldner (2011) showed further that overlaps and gaps at turn transitions must be at least 120 ms long in order to be perceptible to listeners as overlaps or gaps. Heldner & Edlund (2010) indicate that in their data, at least 41% of cases had a long enough gap that the next speaker could potentially use phonetic information from even the very end of the prior turn as a cue to turn transition.

Prosodic turn transition cues
In English, intonational patterns at prosodic boundaries have been associated with cuing turn hold or transition. Local, Kelly & Wells (1986) note that high pitch rises or low falls generally occur at speaker transition in Tyneside English, alongside other phonetic features such as slowing and vowel centralization. This intonational finding has recently been replicated by Gravano & Hirschberg (2009, 2011) in a corpus of American English. Ford, Fox & Thompson (1996) and Schegloff (1998) also point out that intonation can influence whether complete syntactic structures are interpreted as turn ends in English conversation, countering arguments from e.g. de Ruiter et al. (2006) that intonation does not play a role in the predictability of turn ends.

The case of prosodic cues at turn boundaries in Swedish appears to be somewhat more complicated. Hjalmarsson (2011) found that Swedish listeners associated "flat" intonation with turn-holding, while "falling" intonation was associated with turn-yielding. This is consistent with findings by Edlund & Heldner (2005), who also found that rising intonation patterns were not consistently associated with either turn-holding or turn-yielding. Although Hjalmarsson (2011) did not find that listeners made consistent predictions of turn transition on the basis of final lengthening, Hjalmarsson & Laskowski (2011) showed that including final lengthening in an automatic model improved prediction of speaker change at pauses, with increased final lengthening before the pause being associated with turn hold.


Perception experiments
A set of perception experiments was conducted in which listeners in either Swedish or English made judgments about prosodic features related to turn transition.

Methodology
Spoken sentences were taken from corpora of conversational Swedish (DEAL corpus, Hjalmarsson et al., 2007) and British English (unpublished corpus, Cambridge University). The conversational turns used for the experiment were chosen on the basis of several criteria. First, they were syntactically complete, but with a declarative sentence form (i.e. not interrogative or imperative). This meant that the turns were syntactically/semantically ambiguous as to turn transition. The final word of each turn was a content word, with stress on the penultimate or antepenultimate syllable. In the Swedish turns, the final word was focally accented (i.e. an additional high pitch peak followed the word-accent LH tones; cf. Bruce, 1998; Gårding, 1989). In the English turns, the final word was pitch-accented, always with an H* tone. The pitch contours of all turns in both languages ended with a final fall to low (i.e. L%). There were four base turns each for the Swedish and the English experiments.

Resynthesis of turn prosody
Modifications of the duration and pitch characteristics of the turns were carried out using PSOLA resynthesis in Praat (Boersma & Weenink, 2013). The final unstressed syllable(s) of each turn had their duration modified, so that segments in the final rhyme had a duration of either 0.1 sec (Short condition) or 0.15 sec (Long condition).

The pitch contours of the turns were then modified in two ways, illustrated in Figure 1. First, the height of the final pitch peak (i.e. the focus tone peak in Swedish and the pitch accent peak in English) was modified to a level of either 3, 5, or (in Long stimuli) 8 semitones (st) above the speaker's baseline pitch. These modifications will be referred to hereafter as the Peak modifications.

For each peak height, a set of modifications of the final pitch fall was made, with the fall ending at the speaker's baseline, or else 2, 4, or 6 st above the speaker's baseline. In all cases the basic shape of a pitch fall was retained, so items with Peak height 3 st only had falls to baseline and 2 st above the baseline, while items with Peak height 8 st had falls to 0, 2, 4, and 6 st. The modifications of the ends of the pitch contours will be referred to hereafter as the Truncation modifications. A schematic of the pitch manipulations is given in Figure 1.

After the resynthesis of the turns was completed, five native speakers of Swedish and four native speakers of English gave naturalness ratings for each of the stimuli. Stimuli which did not attain a majority rating as reasonably natural were not included in the final experiment. 50 of the 56 original Swedish stimuli remained, while 55 of the 56 English stimuli were rated as acceptable. This allowed for 101 usable pairs in the Swedish stimuli. The English stimuli used were matched to the Swedish stimuli, with sentences paired on the basis of highest average acceptability.

Figure 1. Schematic of pitch manipulations made in the experiment stimuli. Left: Peak variations. Right: Trunc variations (for a stimulus with Peak height 5 st).

Experiment procedure
In each trial, participants heard two versions of one of the base sentences. The two versions differed in one, two, or all three of the prosodic manipulations described earlier. Half of the participants were asked to choose which version of the sentence the speaker would say if s/he had more to say and was going to continue speaking (Hold condition), and the other half to choose which version the speaker would say if s/he was done speaking and ready for someone else to talk (Change condition). Participants had the option to re-listen to the pairs, but most did not do so after the first few trials. Participants also reported for each response whether they were relatively sure or relatively unsure about their selection.

Thirty-two native speakers of Swedish, all resident in the Stockholm area, participated in the Swedish version of the experiment. Twenty-four native speakers of British English, all resident in Cambridge, UK, participated in the English version of the experiment.

Results
Swedish
The Swedish listeners preferred to use duration cues when they were available. In turns where there was a difference in the Length condition between the two stimuli, listeners in the Hold condition preferred the Long stimuli, while listeners in the Change condition preferred the Short stimuli, as shown in Figure 2.

Figure 2. Length of chosen stimuli in Change (C) and Hold (H) conditions for Swedish listeners. (χ2(1, N=1978)=63.6423, p<0.001).

When there was no duration difference between the two stimuli presented, the Swedish listeners tended to prefer stimuli with high Peak values in the Hold condition and low Peak values in the Change condition. Although the most commonly chosen stimuli in the Hold condition also had high Truncation values, and the most commonly chosen stimuli in the Change condition had low Truncation values, these differences did not attain statistical significance when Peak height was included in the statistical model. It is also important to note that Peak height did not override Length in listeners' judgments about appropriateness for turn transition.

English
The English listeners, in contrast to the Swedish listeners, responded very strongly to both Peak and Truncation variations, with high Peaks and high Truncation levels being associated with Hold, and low Peaks and Truncation with Change. Length variations did not appear to play a role in the English listeners' judgments about which turn versions were more appropriate for Hold or Change. The Peak and Truncation levels preferred by the English listeners are shown in Figure 3 overleaf.

Discussion
The evidence from these two perception experiments demonstrates that listeners can judge whether a turn is ending or not at least in part on the basis of the prosodic form of that turn. Since not all syntactically/semantically complete productions necessarily lead to turn transition, the prosodic form of these turns can be a valuable tool for the conversational participant who is determining whether and when to begin a new turn.
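The Peak and Truncation levels above are defined in semitones over the speaker's baseline pitch, which follow from the usual logarithmic conversion between Hz and semitones. A small illustrative sketch in Python; the 100 Hz baseline is an assumption for the example, not a value from the paper:

```python
import math

def st_to_hz(baseline_hz, st):
    """Frequency `st` semitones above a baseline frequency."""
    return baseline_hz * 2 ** (st / 12)

def hz_to_st(baseline_hz, f_hz):
    """How many semitones f_hz lies above the baseline."""
    return 12 * math.log2(f_hz / baseline_hz)

# Peak levels of 3, 5 and 8 st over an assumed 100 Hz baseline
peak_hz = {st: st_to_hz(100.0, st) for st in (3, 5, 8)}
```

Because the scale is logarithmic, a 12 st step always doubles the frequency, so equal semitone steps correspond to equal perceived pitch intervals regardless of the speaker's baseline.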


Figure 3. Height of pitch Peaks (left) and Truncation (right) for English listeners in Change (C) and Hold (H) conditions. Length of stimuli did not have a statistically significant effect on listeners' judgments. Peaks: F(1, 2422)=33.16, p<0.001. Trunc: F(1, 2422)=36, p<0.001.

The prosodic cues reported here should be seen as probabilistic rather than definitive, for several reasons. First, although listeners were sensitive to the cues, they were not forced into particular interpretations; that is, a Short turn (in Swedish) or a turn with a low pitch Peak (in English) did not guarantee that the listener would consider that turn as over; instead, the results represent preferences over many repetitions of similar stimuli. Second, the cues occurred only at the very ends of turns – within the last 500 ms or so. In the current experiment, the turns were always followed by silence, so there was no question that the participants had time to hear the prosodic cues and respond to them. However, turns beginning less than about 150 ms after the offset of the previous turn are likely not to reflect responses to turn-final prosodic detail. Average reaction time to auditory stimuli in a variety of experiments tends to be around 140-160 ms (see review in Kosinski, 2013), and turns which begin in overlap do not take into account the previous speaker's intention, whether or not that speaker then chooses to produce a prosodic cue indicating their intention about what may follow their turn (i.e. as a result of earlier planning that is not modified on the basis of the overlapping speech).

Cross-linguistic differences
While Swedish listeners preferred duration variation as a cue to turn transition, English listeners preferred pitch variation. Pitch has regularly been reported as being strongly linked with turn transition in English (Local et al., 1986; Gravano & Hirschberg, 2009, 2011, inter alia), while its association with turn transition in Swedish has been more ambiguous (Edlund & Heldner, 2005; Hjalmarsson, 2011). Similarly, duration variations, with longer duration being associated with turn hold, have been reported for Swedish production data (Hjalmarsson & Laskowski, 2011;


Zellers, submitted) as well as for English (Gravano & Hirschberg, 2009, 2011). If these kinds of variations are available in both languages’ productions, why are listeners with different native languages so different in their preference for turn transition cues?

One possible explanation for this phenomenon depends on the intonational phonology of the two languages in question. Central Swedish has a complex word-accent system (Bruce, 1998; Gårding, 1989) in which most or all content words bear a potentially contrastive word accent. Focus is additionally marked with an additional high (H) tone following the focused content word. English, in comparison, has a relatively sparse tonal specification. Content words may bear a pitch accent but need not, especially if they refer to already-given information, and focus is typically marked by the placement of the nuclear pitch accent, and possibly a following phrase boundary, as well as phonetic modification of that accent to increase its prominence (cf. e.g. Gussenhoven, 2004).

If we assume a cross-linguistic preference for pitch as a signaling tool – perhaps because it may be more salient than other types of prosodic variation – then we can argue that English, with its relatively sparse use of pitch variation in the intonational structure, has “space” left for pitch variation to be used as a turn-transition cue. Swedish, on the other hand, may already be saturated when it comes to pitch as a meaning-bearing tool. Thus, the default turn transition cue must be something else – e.g. duration, as found by the current perception experiment.

A short further investigation of pitch variation in Swedish production data lends further support for this proposal. Turns or turn segments in 10 minutes of spontaneous Swedish conversation from the DEAL corpus were investigated with regard to their prosodic characteristics. In turns with similar form to the Hold and Change turns used in the perception study above (i.e. syntactically complete with indicative structure, and with the final word being a focused content word), duration was found to vary consistently with the listeners' preferences in the perception experiment: that is, potential boundaries followed by a short silence and then continuing talk by the same speaker had relatively long final syllables, while potential boundaries followed by a short silence and then talk from another speaker had relatively short final syllables. However, the pitch characteristics of these turns (height of final pitch peak, height of final L%, or height of low pitch preceding final pitch peak) did not vary significantly in relation to the turn transition structure. The only source of statistically significant variation in these pitch data was whether the content word bore Accent 1 (in which case the first L tone in the final LHL sequence is considered to be associated with the stressed syllable) or Accent 2 (in which case the first L tone is considered to be a trailing tone after the starred H tone). In other words, the only statistically significant pitch variation is directly tied to the intonational phonology.

Conclusions

The current perception study has demonstrated that prosodic cues are relevant to listeners as part of their decision about the turn-taking process in conversation. The study gives evidence for language-specific variation, potentially influenced by the phonological system of the language in question. Further research in other, more varied languages will give further insight into the degree of overlapping influence of the phonological system on the turn-taking system or vice versa.

Acknowledgements

I am very grateful to David House, Anna Hjalmarsson, Jana Götze, Mattias Heldner, Myra Öberg, Niklas Vanhainen, Brechtje Post, Francis Nolan, Calbert Graham, and Elaine Schmidt for assistance with various

aspects, and in particular to Sarah Hawkins and Richard Ogden for use of their unpublished corpus of English conversational data. This research was supported by the postdoctoral grant “Perception of prosody in linguistic contexts” (VR-435-2011-6871) from the Swedish Research Council (Vetenskapsrådet).

References

Boersma, P. & Weenink, D. (2013) Praat: doing phonetics by computer [Computer program]. Available http://www.praat.org/
Bruce, G. (1977) Swedish word accents in sentence perspective. Lund: Gleerup.
De Ruiter, J.P., Mitterer, H. & Enfield, N.J. (2006) Projecting the end of a speaker’s turn: a cognitive cornerstone of conversation. Language 82(3): 515-535.
Edlund, J. & Heldner, M. (2005) Exploring prosody in interaction control. Phonetica 62(2-4): 215-226.
Ford, C.E., Fox, B.A. & Thompson, S.A. (1996) Practices in the construction of turns: the "TCU" revisited. Pragmatics 6(3): 427-454.
Gravano, A. & Hirschberg, J. (2009) Turn-yielding cues in task-oriented dialogue. Proceedings of SIGDIAL 2009, Queen Mary University of London, UK, 253-261.
Gravano, A. & Hirschberg, J. (2011) Turn-taking cues in task-oriented dialogue. Computer Speech and Language 25: 601-634.
Gussenhoven, C. (2004) The phonology of tone and intonation. Cambridge, UK: Cambridge University Press.
Gårding, E. (1989) Intonation in Swedish. Lund University Department of Linguistics Working Papers 35: 63-88.
Heldner, M. (2011) Detection thresholds for gaps, overlaps, and no-gap-no-overlaps. Journal of the Acoustical Society of America 130(1): 508-513.
Heldner, M. & Edlund, J. (2010) Pauses, gaps and overlaps in conversation. Journal of Phonetics 38: 555-568.
Hjalmarsson, A. (2011) The additive effect of turn-taking cues in human and synthetic voice. Speech Communication 53: 23-35.
Hjalmarsson, A. & Laskowski, K. (2011) Measuring final lengthening for speaker-change prediction. Proceedings of 12th Interspeech, Florence, Italy.
Hjalmarsson, A., Wik, P. & Brusk, J. (2007) Dealing with DEAL: a dialogue system for conversation training. Proceedings of SIGDIAL, Antwerp, Belgium, 132-135.
Kosinski, R.J. (2013) A literature review on reaction time. [online] Available http://biae.clemson.edu/bpc/bp/Lab/110/reaction.htm
Local, J.K., Kelly, J. & Wells, W.H.G. (1986) Towards a phonology for conversation: turn-taking in Tyneside English. Journal of Linguistics 22: 411-437.
Sacks, H., Schegloff, E.A. & Jefferson, G. (1974) A simplest systematics for the organisation of turn-taking for conversation. Language 50(4): 696-735.
Zellers, M. (2013) Pitch and lengthening as cues to turn transition in Swedish. Proceedings of 14th Interspeech, Lyon, France.
Zellers, M. (submitted) Prosodic variation for turn transition in Swedish.


Backchannels and breathing

Kätlin Aare, Marcin Włodarczak, Mattias Heldner
Department of Linguistics, Stockholm University, Sweden
[email protected], [email protected], [email protected]

Abstract

The present study investigated the timing of backchannel onsets within the speaker’s own and the dialogue partner’s breathing cycle in two spontaneous conversations in Estonian. Results indicate that backchannels are mainly produced near the beginning, but also in the second half, of the speaker’s exhalation phase. A similar tendency was observed in short non-backchannel utterances, indicating that the timing of backchannels might be determined by their duration rather than their pragmatic function. By contrast, longer non-backchannel utterances were initiated almost exclusively right at the beginning of the exhalation. As expected, backchannels in the conversation partner’s breathing cycle occurred predominantly towards the end of the exhalation or at the beginning of the inhalation.

Introduction

Conversational turn-taking involves coordination between participants exchanging the roles of speakers and listeners, and backchannel communication is part of this system. Backchannels (Yngve, 1970) are short, typically mono- or disyllabic (Gardner, 2001) listener responses in dialogues or conversation. The term backchannel was coined to refer to the background channel through which the listener can give feedback to the speaker without claiming the conversational floor. Backchannels indicate that the listener is following and understanding the speaker (e.g. Heldner, Hjalmarsson, & Edlund, 2013). In face-to-face dialogues participants make use of visible as well as audible means of communication; backchannels can therefore be both verbal and non-verbal. Verbal backchannels can be more generic, like uh-huh or m-hm, or more specific, to signal what the addressee has understood, like oh or other markers for surprise, for example. Research has shown that listeners show a great variety of behaviors to contribute specific responses (Bavelas & Gerwing, 2011).

Respiration during speech can be both audible and visible, and breathing patterns in speech have been claimed to be relevant for conversational organization. For instance, an audible inhalation before an utterance has been suggested to be a “pre-beginning” element in turn-taking mechanisms (Schegloff, 1996). The respiratory pattern changes during spontaneous conversations. It has been noted that the quiet breathing cycle is repeated about 12 times per minute, and exhalation is slightly longer than inhalation. The frequency of breathing changes for speech breathing, with the inhalation phase being considerably shorter than the exhalation phase to minimize interruption to the flow of speech (Hixon, 1987). It has also been shown that most speakers take a deeper breath before longer or more complicated sentences (Fuchs et al., 2008; Winkworth, Davis, Adams, & Ellis, 1995). Prephonatory movements of the rib cage and abdomen have been reported to be adaptive to different speech tasks, indicating that there may indeed be preparatory respiratory processes occurring during listening and preparation of turn onset (McFarland, 2001).

To summarize, next speakers prepare turn onset among other things by inhaling, and this is potentially an important turn-taking signal. By contrast, it remains unclear if and how listeners prepare the onset of backchannels.


Backchannels are typically short, brief and quiet, and these characteristics do not require as much exhaled air and effort as longer utterances. Furthermore, backchannels carry relatively little propositional content and they are not supposed to claim the conversational floor. All of this taken into account, it is conceivable that backchannels are not planned the same way longer utterances are, and furthermore that they do not necessarily have to be initiated at the beginning of the (listener’s) exhalation phase. In this study, we will explore our intuition that backchannels may occur more freely in the respiratory cycle than longer utterances. We will also explore whether this is related to their non-floor-claiming properties, or just to their relative shortness. Finally, we will explore how backchannels are timed relative to the other speaker’s breathing cycle.

Method

For the purpose of this exploratory study, we recorded respiratory activity synchronized with audio in two spontaneous two-party dialogues of approximately 20 minutes each. The subjects were two females and two males, aged 18-25, all native speakers of Estonian. The subjects all knew each other. The first dialogue was between two sisters, and the other one between two young men who had known each other for one and a half years. They had no knowledge of the aim of the experiment before the recording. They were free to talk about any topic throughout the recording session. None of the subjects reported any speech or hearing disorders. One speaker had suffered from a breathing disorder caused by low blood pressure, and two were smokers. All subjects were of slim body type and wore tight-fitting clothes.

The recordings took place in a quiet, sound-treated room in the Phonetics Laboratory at Stockholm University. To minimize noise in the respiratory signals caused by body movement, the subjects were recorded standing facing each other at a bar table, keeping their hands on the table.

Respiratory activity was measured using Respiratory Inductance Plethysmography (Watson, 1980), which quantifies changes in rib cage and abdominal cross-sectional area by means of two elastic transducer belts (Ambu RIPmate) placed at the level of the armpits and the navel, respectively. The belts were connected to dedicated respiratory belt processors (RespTrack) designed and built in the Phonetics Laboratory at Stockholm University. The RespTrack processor was designed for ease of use, and optimized for low-noise and low-interference recordings of respiratory movements in speech and singing. In particular, DC offset can be corrected simultaneously for the rib cage and abdomen belts using a ”zero” button. Unlike the processors supplied with the belts, there is no high-pass filter, thus the amplitude will not decay during periods of breath-holding. A potentiometer allows the signals from the rib cage and abdomen belts to be weighted so that they give the same output for a given volume of air, as well as for a sum signal allowing a direct estimation of lung volume change. The calibration of the belts for the estimated volume change between the two chest walls was achieved by performing the isovolume maneuvre (Konno & Mead, 1967).

Audio was captured using head-worn microphones with a cardioid polar pattern (Sennheiser HSP 4). The audio and belt processor signals were recorded synchronously using an integrated physiological data acquisition system consisting of LabChart software and PowerLab hardware (ADInstruments, 2014), which also allows connecting other measuring instruments, such as air-flow masks or electroglottographs. Figure 1 shows an example of synchronized audio and respiratory measurements from one speaker. The setup is described in greater detail in Edlund, Heldner, & Włodarczak (2014).
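The logic behind the belt weighting can be illustrated with a small numerical sketch (our own illustration, not the RespTrack implementation; the signals are invented). During an isovolume manoeuvre the lung volume is constant, so rib cage and abdomen movements must cancel each other in the weighted sum; the weight can therefore be estimated by least squares:

```python
# Isovolume calibration sketch: find the weight a such that
# rc + a * ab stays (nearly) constant during the manoeuvre.
rc = [1.0, 0.5, 0.0, -0.5, -1.0]    # rib-cage belt signal (made up)
ab = [-0.5, -0.25, 0.0, 0.25, 0.5]  # abdomen belt signal (made up)

n = len(rc)
mean_rc = sum(rc) / n
mean_ab = sum(ab) / n
# slope of rc regressed on ab; the cancelling weight is its negation
slope = sum((a - mean_ab) * (r - mean_rc) for a, r in zip(ab, rc)) \
        / sum((a - mean_ab) ** 2 for a in ab)
a_weight = -slope
# with this weight, the sum signal is proportional to lung volume change
volume = [r + a_weight * b for r, b in zip(rc, ab)]  # ~constant here
```

With the invented signals above (where the two belts move in a fixed 2:1 ratio), the estimated weight is 2 and the weighted sum is flat, which is the criterion the potentiometer adjustment aims for.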


Figure 1. An example of synchronized audio and respiratory measurements from one speaker. The channels (from top to bottom) show the audio signal, the rib-cage signal, the abdomen signal, and the weighted sum of the two belts.

The audio and breathing signals were subsequently manually annotated using Praat (Boersma & Weenink, 2014). The rib cage and abdomen movements were used to segment the breathing signals into periods of inhalations and exhalations. The speech signal was segmented into intervals of pauses, utterances or backchannels, the latter delimited by pauses of at least 500 ms. A Praat script was used to extract the timing of speech and breathing events.

Speech onsets were normalized with respect to their relative position within the breathing phase they coincided with: exhalation within the speaker’s own breathing cycle, inhalation or exhalation within the interlocutor’s breathing cycle.

Results

Backchannels vs. utterances

A total of 277 backchannels and 732 (non-backchannel) utterances were included in the analyses. A small number of backchannels was excluded from the analysis, either because they were produced in the inhalation phase (N=1), or because they erroneously spanned more than one breathing cycle (N=4). The remaining backchannels were mostly short markers of agreement (m-hm, ahah, jajah ‘yes-yes’, okei), but also of surprise (tegelt ‘really’). Figures 2 and 3 show the distribution of normalized onset times for utterances and backchannels, respectively.

As expected, there was a strong tendency for non-backchannel utterances to start early in the exhalation phase. About 44% of all utterances started within the first tenth of the exhalation (i.e. the first two bins). This tendency was considerably weaker in the backchannels, where only about 27% started within the first tenth of the exhalation, and where another mode in the distribution was discernible in the second half of the exhalatory phase. Thus, the backchannels were more evenly distributed across the exhalation phase than the non-backchannel utterances. This is in line with previous findings on German (Fuchs, personal communication), where the tendency was even more marked and backchannels were equally likely throughout the breathing cycle.

Longer vs. shorter utterances

To explore whether the observed difference between backchannels and utterances was related to the relative shortness of the backchannels rather than their non-floor-claiming properties, the utterance data was split into two groups based on duration. As 99% of the backchannels were shorter than 0.8 s, this duration was used as the criterion for separating short utterances (<0.8 s) from longer utterances (>0.8 s). Manual inspection of the former revealed that these utterances consisted mainly of short answers, pause-delimited discourse markers, and stretches of disfluent or otherwise incomplete turns.
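The onset normalisation described above can be sketched as follows (a minimal illustration in Python; the function name and interval representation are our own, not the authors' Praat script):

```python
def normalized_onset(onset, phase_start, phase_end):
    """Position of a speech onset within the breathing phase
    (inhalation or exhalation) it falls in, scaled to 0..1."""
    if not (phase_start <= onset < phase_end):
        raise ValueError("onset outside the breathing phase")
    return (onset - phase_start) / (phase_end - phase_start)

# Example: an utterance starting 0.3 s into a 3 s exhalation
print(normalized_onset(10.3, 10.0, 13.0))  # ~0.1, i.e. the first tenth
```

The same normalisation is applied whether the reference phase comes from the speaker's own or the interlocutor's breathing cycle; only the reference interval changes.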


Figure 2. Distribution of normalized onset time for non-backchannel utterances.

Figure 3. Distribution of normalized onset time for backchannels.

Figure 4. Distribution of normalized onset time for longer utterances (>0.8 s).

Figure 5. Distribution of normalized onset time for shorter utterances (<0.8 s).

A total of 216 shorter utterances and 516 longer utterances were identified. Figures 4 and 5 show the distribution of normalized onset times for longer and shorter utterances, respectively.

The longer utterances displayed a pattern similar to that observed for all utterances (cf. Figure 2), although the tendency was stronger. About 50% of all longer utterances started in the first tenth of the exhalation. The shorter utterances showed a pattern markedly different from the longer ones. Here, only about 32% of the shorter utterances started in the first tenth of the exhalation and there was a second mode in the distribution around 0.7. Thus, shorter utterances were more evenly distributed in the exhalation phase, and behaved similarly to backchannels (cf. Figure 3).

Backchannels in the other speaker’s breathing pattern

Finally, we wanted to explore if there is a pattern in how backchannels are timed relative to the other speaker’s breathing cycle. Therefore, we calculated normalized onset times relative to the other speaker’s inhalations and exhalations. All backchannel occurrences (N=282) were included in this analysis. Figures 6 and 7 show the distribution of onset time for backchannels normalized relative to exhalations and inhalations in the other speaker’s speech, respectively.
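The duration-based split and the first-tenth counts can be illustrated with a short sketch (our own code on invented utterance data; only the 0.8 s criterion and the first-tenth bin come from the text):

```python
# Each utterance: (duration in s, onset position normalised to 0..1
# within the exhalation). The data points are invented for illustration.
utterances = [(0.4, 0.05), (1.6, 0.02), (0.6, 0.72), (2.1, 0.08), (0.3, 0.68)]

shorter = [u for u in utterances if u[0] < 0.8]   # comparable to backchannels
longer = [u for u in utterances if u[0] >= 0.8]

def first_tenth_share(group):
    """Fraction of a group starting within the first tenth of the exhalation."""
    return sum(1 for _, pos in group if pos < 0.1) / len(group)

print(first_tenth_share(shorter))  # 1 of the 3 short utterances
print(first_tenth_share(longer))   # 2 of the 2 long utterances
```

In the study this share was about 32% for shorter and about 50% for longer utterances, which is the contrast the sketch computes.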


Figure 6. Distribution of normalized onset time for backchannels in the other speaker’s exhalations.

Figure 7. Distribution of normalized onset time for backchannels in the other speaker’s inhalations.

The majority of the backchannels (67.5%) were produced during the other speaker’s exhalations. The shape of the distribution for exhalations (Figure 6) shows that backchannels were increasingly more frequent towards the end of the other speaker’s exhalation. For the remaining backchannels, produced during the other speaker’s inhalations, the pattern was the reverse, with fewer and fewer backchannels towards the end of the other speaker’s inhalation (Figure 7).

Discussion

The comparison of backchannels and non-backchannel utterances (Figures 2 and 3) indicates a clear distinction in their temporal organization with respect to the speaker’s own respiratory cycle: non-backchannels are initiated predominantly towards the beginning of the exhalation, a tendency which is less pronounced in backchannels, where another, somewhat smaller, peak is present towards the end of the exhalatory phase. While this observation suggests a functionally motivated difference, the results in Figures 4 and 5, in which non-backchannel utterances were further split depending on their duration, contradict this hypothesis. Specifically, backchannels and comparably short non-backchannels behave very similarly. They are distributed more uniformly than longer utterances, with two local maxima: one near the beginning of the exhalation and another between 70 and 80% of its duration. Consequently, this suggests that duration rather than pragmatic function is the decisive factor determining turn initiation patterns. Simply put, if an upcoming turn is short enough, it is produced immediately, without the need for the deep inhalation characteristic of longer stretches of speech.

Not surprisingly, backchannel onsets did not always coincide with exhalation in the interlocutor’s breathing cycle. Instead, they were most common around the transition between exhalation and inhalation. Insofar as this location corresponds to the partner’s turn or phrase boundaries, the observed pattern is most likely brought about by the underlying grounding mechanism, whereby feedback acknowledges the new piece of information produced in the previous turn constituent.

Conclusions

The present study revealed that backchannels and non-backchannel utterances of corresponding length are timed in a similar way within the speaker’s breathing cycle. They are most likely to be initiated towards the beginning of the exhalation or roughly around 70% of its duration. By contrast, longer non-backchannel utterances are extremely rare anywhere but at the very onset of

the exhalatory phase. The observed similarity indicates that the timing of speech with respect to the respiratory phase is motivated by turn length, and not its pragmatic function. Consequently, backchannels cannot be distinguished from non-backchannels on the basis of position within the respiratory cycle alone. At the same time, backchannels were found to occur most frequently in the vicinity of the interlocutor’s exhalation offset, which is likely to reflect processes related to grounding of new information.

Acknowledgements

The work was funded in part by the Swedish Research Council (VR) project Samtalets rytm (2009-1766).

References

ADInstruments. (2014). LabChart software and PowerLab hardware (Version 8). New South Wales, Australia: ADInstruments.
Bavelas, J. B., & Gerwing, J. (2011). The Listener as Addressee in Face-to-Face Dialogue. International Journal of Listening, 25(3), 178-198.
Boersma, P., & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program] (Version 5.3.75). Retrieved from http://www.praat.org/
Edlund, J., Heldner, M., & Włodarczak, M. (2014). Catching wind of multiparty conversation. In J. Edlund, D. Heylen & P. Paggio (Eds.), Proceedings of MMC 2014. Reykjavik, Iceland.
Fuchs, S., Hoole, P., Vornwald, D., Gwinner, A., Velkov, H., & Krivokapić, J. (2008). The Control of Speech Breathing in Relation to the Upcoming Sentence. In Proceedings of the 8th International Seminar on Speech Production (ISSP 2008) (pp. 77-80). Strasbourg, France.
Gardner, R. (2001). When Listeners Talk: Response Tokens and Listener Stance. Amsterdam: J. Benjamins Publishing.
Heldner, M., Hjalmarsson, A., & Edlund, J. (2013). Backchannel Relevance Spaces. In E. L. Asu & P. Lippus (Eds.), Nordic Prosody: Proceedings of the XIth Conference, Tartu 2012 (pp. 137-146). Frankfurt am Main, Germany: Peter Lang.
Hixon, T. J. (1987). Respiratory Function in Speech. In T. J. Hixon (Ed.), Respiratory Function in Speech and Song (pp. 1-54). Boston, MA, USA: Little Brown.
Konno, K., & Mead, J. (1967). Measurement of the Separate Volume Changes in the Rib Cage and Abdomen During Breathing. Journal of Applied Physiology, 22(3), 407-422.
McFarland, D. H. (2001). Respiratory markers of conversational interaction. Journal of Speech, Language and Hearing Research, 44(1), 128-143.
Schegloff, E. A. (1996). Turn organization: One intersection of grammar and interaction. In E. Ochs, E. A. Schegloff & S. A. Thompson (Eds.), Interaction and Grammar (pp. 52-133). Cambridge: Cambridge University Press.
Watson, H. (1980). The technology of respiratory inductive plethysmography. In F. D. Stott, E. B. Raftery & L. Goulding (Eds.), Proceedings of the Second International Symposium on Ambulatory Monitoring (ISAM 1979). London: Academic Press.
Winkworth, A. L., Davis, P. J., Adams, R. D., & Ellis, E. (1995). Breathing patterns during spontaneous speech. Journal of Speech and Hearing Research, 38(1), 124-144.
Yngve, V. H. (1970). On getting a word in edgewise. In Papers from the sixth regional meeting of the Chicago Linguistic Society (pp. 567-578). Chicago, IL, USA: Chicago Linguistic Society.


Pauses and resumptions in human and in computer speech

Jens Edlund, Fredrik Edelstam, Joakim Gustafson
KTH Speech, Music and Hearing, Sweden
[email protected], [email protected], [email protected]

Abstract

We present a study in which 16 subjects were recorded while interacting with a human narrator acting the part of a spoken dialogue system (SDS). Interruptions to the narrator’s speech were added systematically. The recordings were analysed to find pause and resume behaviours that may be suitable for implementation in SDSs. The first results show that resumptions are initiated, on average, with a higher pitch than other utterances.

Introduction

We present a study in which 16 subjects were recorded while interacting with a human narrator acting the part of a spoken dialogue system (SDS). Interruptions to the narrator’s speech were added systematically. The recordings were analysed to find pause and resume behaviours that may be suitable for implementation in SDSs.

Background

Pauses and resumptions in spoken dialogue systems

There are a number of good reasons to equip SDSs – be they robots, machines or computers that communicate using speech – with the ability to cut themselves short: to stop speaking before they have finished saying what they planned to say.

The reason most frequently discussed in SDS design is to handle so-called user barge-ins – the user appearing in the middle of the system's speech. In current SDSs, barge-ins are often disallowed (e.g. ignored) by simply turning the microphone off when the system is speaking. SDSs that do handle barge-in listen (keep the microphone on) and, if needed, are trained to disregard their own voice while they are speaking. The reaction when user speech is detected is generally the most basic imaginable: simply cancel the current speech output. There are exceptions. A particularly interesting approach by Ström & Seneff (2000) takes inspiration from human-human dialogue to design a system which increases its voice intensity when barge-ins occur at dialogue states where interruptions are undesirable, signalling that barge-ins are disallowed at this stage. When a barge-in occurs at a less critical point in the dialogue, they propose that the system reduce its intensity, but continue to speak, which allows the system to verify that the detected barge-in was indeed speech from the user before cutting itself short.

A less frequently implemented, but equally important and practical, reason is to handle temporary fluctuations in the ambient noise environment. If a lorry drives by, or someone suddenly starts drilling in a near-by wall, people respond by either raising their voices or by simply pausing until the noise recedes before finishing. An SDS that copies this behaviour will be easier to understand in adverse conditions. To our knowledge, this simple functionality has not been implemented in any published system, but there is a surge of research into a related area: the use of Lombard speech from SDSs to overcome adverse noise conditions.

A third example concerns in particular humanlike SDSs that strive to achieve spoken communication in a manner that is similar to how humans use speech to communicate (Cassell, 2007; Edlund et al., 2008). Human conversations are emergent, and humans often reconsider their plan while speaking, and may pause briefly to consider before finishing. A humanlike SDS that treats the conversation as an emergent phenomenon and senses the environment incrementally and continuously should also be able to halt and to change its mind mid-utterance. Although this type of pause behaviour can be used in a pre-planned, simulated manner to achieve focus, a more interesting challenge is for the system to pause thoughtfully when it actually needs processing time. Skantze & Hjalmarsson (2013) successfully use utterance-initial filled pauses to this end.

Finally, situated SDSs – systems designed to monitor and model the context and the environment as well as the emergent dialogue – may pause if they notice an unreceptive listener. This may happen if the listener becomes disturbed by another person, or if some external task of greater importance gets in the way. Such systems may also implement pausing as a proactive behaviour and pause when they anticipate that the user will become otherwise occupied presently. Kousidis et al. (2014) show that an in-car SDS that pauses when complicated driving situations occur leads to improvements both in the driving and in the driver’s recall of what the SDS said.

Where it fails

Although it can be shown that well-positioned pauses can improve the usability of an SDS – as it affords user barge-ins, or the efficiency and safety, as when for example an in-car SDS adapts to changes in the driving situation – there are drawbacks. A system implementing user barge-in is likely to halt in the wrong places, as it will misinterpret non-speech sounds such as coughs or external noises as barge-ins. If these mistakes occur frequently, user trust in the system will diminish, and the pausing behaviour is likely to do more harm than good. It is worth noting that current machines are at a great disadvantage here. In the most primary and common form of spoken communication amongst humans – face-to-face conversation – a speaker has access to a whole slew of indications when an interlocutor intends to take the floor. Commonly discussed speech-preparatory events include in-breaths, smacks and mouth openings, posture and head pose shifts, and gaze patterns. A barge-in supporting SDS generally senses nothing but sound.

There is evidence that users can be dissatisfied with pausing systems even if these are objectively better. The adaptive (and objectively safer and more efficient) system presented by Kousidis et al. (2014) received poorer subjective judgements from its users than a non-adaptive (non-pausing) counterpart, with comments suggesting that users thought the system might have paused because of programme errors.

In order to allow our SDSs to pause when need be without being second-guessed by their users, it seems important to clearly signal that the pause is intentional and planned. This way, users will be confident that they can tell intended pauses – features – from bugs. We think it likely that if the subjects in the Kousidis et al. (2014) study had felt confident that the adaptive system knew what it was doing, and that it was all for their benefit, they would have graded the system’s performance higher.

The choice of signal, however, is important and not trivial. A signal that is clear and easy to perceive may not be sufficient. Edlund & Nordstrand (2002) compare an SDS which signals that it is thinking with a spinning hourglass (as in a popular computer operating system) with one where the SDS’s avatar (an animated talking head) simply looks away. The system with the more obvious hour-glass results in slightly more efficient dialogues, but is strongly disliked by the users, who suspected the computer was having problems running the SDS. The system that looked away

while thinking was almost as efficient (and outperformed a system with no indicators at all), as well as liked by the users. The baseline system without indicators resulted in quite poor dialogue efficiency, but was slightly better liked by users than the hour-glass version.

The way forward

We believe that we would benefit from finding out how people behave when they pause and when they resume speaking, and we attempt to implement these behaviours in humanlike SDSs. In other types of SDSs, mimicking human behaviours may not be a good option (Edlund et al., 2008). In this paper, we present a first step towards this goal.

Method

The target of our experiment can be formulated in three questions: How does a human speaker stop speaking when faced with a (possible) interruption? How does a human speaker resume speaking after such an event? Which of these behaviours are plausible candidates for inclusion in a spoken dialogue system?

Data Collection

Setting

We recorded the dialogues in our dialogue recording studio – a recording environment consisting of several physically distinct locations that are interconnected with low and constant latency audio and video. Pairs of interlocutors were placed in different rooms, and communicated through pairs of wireless close-range microphones and loudspeakers. Video was not used here, since we are interested in behaviours that are triggered by the acoustic cues that are available to most SDSs.

Subjects

The end-goal of this data collection is not to train a recognizer or a recognition or categorisation device, but the generation of a consistent set of candidate behaviours for implementation in a spoken dialogue system – one that contains behaviours that could all plausibly be used by the same speaker. To achieve this, we consistently used the same single male speaker in the role of the system (“speaker”, hereafter) for all recordings. For the user role (“listener”, hereafter), a balanced variety of speakers was used: two sets of 8 listeners, both balanced for gender. None of the listeners had any previous knowledge of this research. All listeners were rewarded with one cinema ticket. They were told that those who performed the task best would earn a second ticket, and the top performers from each setup received a second ticket after the recordings were completed.

Task

The data collection was designed as a dual-task experiment. The main task for the speaker was to read three short informative texts about each of three cities (Paris, Stockholm, and Tokyo), arranged so that the first is quite general, the second more specific, and the third deals with a quite narrow detail with some connection to the city. This task is equivalent to what one might expect from a tourist information system. For the listener, the main task is to listen to the city information. The listener is motivated by the knowledge that the reading of each segment – that is, each of the nine informative texts – is followed by three questions on the content of the text. Their performance in answering these questions and in completing the secondary task counted towards the extra movie ticket. The secondary task was designed as follows. At irregular, random intervals, a clearly visible coloured circle would appear, either in front of the speaker or the listener. When this happened, the speaker was under obligation to stop the narration and instead read a sequence of eight digits from a list. The listener must then repeat the digit sequence back to the speaker, after which the speaker could resume the narration.

55 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University
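The interruption protocol described under Task (together with the yellow/red circle variants introduced under Conditions below) can be sketched as a small state machine. The state and event names here are invented for illustration; they are not from the paper:

```python
from enum import Enum, auto

class State(Enum):
    NARRATING = auto()        # speaker reads the city text
    INTERRUPTED = auto()      # circle visible, narration halted
    READING_DIGITS = auto()   # speaker reads the eight-digit sequence
    AWAITING_REPEAT = auto()  # listener repeats the digits back

def step(state, event):
    # Transition table for the dual-task protocol described above.
    transitions = {
        (State.NARRATING, "circle_appears"): State.INTERRUPTED,
        (State.INTERRUPTED, "circle_disappears"): State.NARRATING,   # false alarm
        (State.INTERRUPTED, "circle_turns_red"): State.READING_DIGITS,
        (State.READING_DIGITS, "digits_done"): State.AWAITING_REPEAT,
        (State.AWAITING_REPEAT, "repeat_done"): State.NARRATING,     # resumption
    }
    # Unknown events leave the state unchanged.
    return transitions.get((state, event), state)
```

A resumption, in the paper's terms, corresponds to the transition back into NARRATING after an interruption.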

Conditions

We considered two characteristics of interruptions that we assumed would have an effect on how humans react to the interruption and on how they resume speaking after it: the source of an interruption can be either internal or external in a dialogue; and the duration and content of an interruption varies: they can be brief or even the result of a mistake, or they can be long and contentful. The condition mapping to the first of these characteristics was designed such that the coloured circle signalling an interruption was presented randomly to either the speaker, mapping to an external event visible to the system but not the driver, or to the listener, mapping to an interruption from the driver to the system (the listener had to speak up to inform the speaker that the circle was present). The second condition was designed such that in one set of eight dialogues, the coloured circle would start out yellow, and as soon as the speaker became silent, it would randomly either disappear (causing only a short interruption with light or no content, corresponding to e.g. a false alarm) or turn red, in which case the sequence of digits would be read and repeated (a contentful interruption). In the other set of eight recordings, the circle always went straight to red, and always caused digits to be read and repeated.

Analysis

Each channel of each recording was automatically segmented into silence-delimited speech segments, and these were transcribed using Nuance Dragon Dictate. The transcriptions were then corrected by a human annotator, and labelled for interruptions – either from the listener (who was prompted by a light indicator) or from the reader being interrupted by a similar light indicator. Resumptions from the pauses caused by these interruptions were coded as well.

Figure 1. The average pitch (in semitones, Y-axis) for the first through tenth (X-axis) voiced 100 ms segment in the original readings, and in the resumptions following interruptions into these readings, over 59 pairs.

We take the resumption to be the point where the reader returns to reading the script. Any material between the interruption (and, in applicable cases, the completion of the embedded secondary task) and the resumption was coded as pre-resumption material.

In this initial analysis, we looked at the pitch of resumptions. For each resumption in the material, we also took the first sentence from the script preceding the interruption to get a pair. These pairs are matched in time, at least within a minute, so the voice characteristics of the reader should be similar. We extracted the first ten 100 ms frames containing voiced speech from each of these pairs and analysed them for pitch. 59 pairs were found where there were at least ten frames of voiced speech. Only these were used in the analysis.

Results

We had anticipated results showing that the end of speech following an interruption is different, but so far our analyses have come up empty. Furthermore, the pauses follow near-instantly in most cases, with a delay that is just slightly above the minimum reaction time.

Prosodically, we see a clear difference between the beginning of the resumptions and the non-resumption utterances. The pairwise difference between the average pitch over the first ten voiced 100 ms segments of each part of the original–resumption pairs is plotted in Figure 1.

Figure 1 shows that the resumptions start on average about 1.5 semitones higher. They then drop, and after about 0.5 seconds, approximately 2 syllables, they are on a level comparable to the original readings.

Discussion

We think that the pitch difference we found is a good candidate for implementation in current systems. The finding is consistent with the impressionistic observation that the resumptions are often characterized by a stronger initial stress, and suggests that increasing initial stress in resumptions is a candidate behaviour for humanlike resumption.

Future work

The pitch finding is not straightforward to implement in current systems, as they normally do not grant control over pitch or initial stress. The finding, however, can be implemented and tested in research systems.

The data recorded will be annotated and analysed further. In particular, the short interruptions that originated on the reader side are interesting, as it seems that the listener in many cases never even noticed the interruption, as the reader masked it using several strategies such as coughing or drawing breath slowly.

References

Cassell, J. (2007). Body language: lessons from the near-human. In Riskin, J. (Ed.), Genesis Redux: Essays on the history and philosophy of artificial life (pp. 346-374). University of Chicago Press.
Edlund, J., & Nordstrand, M. (2002). Turn-taking gestures and hourglasses in a multi-modal dialogue system. In Proceedings of the ISCA Workshop on Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany.
Edlund, J., Gustafson, J., Heldner, M., & Hjalmarsson, A. (2008). Towards human-like spoken dialogue systems. Speech Communication, 50(8-9), 630-645.
Kousidis, S., Kennington, C., Baumann, T., Buschmeier, H., Kopp, S., & Schlangen, D. (2014). Situationally aware in-car information presentation using incremental speech generation: safer, and more effective. In Proceedings of the EACL 2014 Workshop Dialogue In Motion (pp. 68-72). Gothenburg, Sweden.
Skantze, G., & Hjalmarsson, A. (2013). Towards incremental speech generation in conversational systems. Computer Speech & Language, 27(1), 243-262.
Ström, N., & Seneff, S. (2000). Intelligent barge-in in conversational systems. In Proceedings of ICSLP-00.

Acknowledgments

This work was funded by GetHomeSafe (EU 7th Framework STREP project 288667).
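The pairwise pitch comparison used in the Results above – average pitch, in semitones, over the first ten voiced 100 ms frames of an original reading versus its resumption – can be sketched roughly as follows. The frame values and the 100 Hz semitone reference below are illustrative assumptions, not the study's data or tooling:

```python
import math

def to_semitones(hz, ref_hz=100.0):
    # Convert a frequency to semitones relative to a reference frequency
    # (the 100 Hz reference is an arbitrary choice for illustration).
    return 12.0 * math.log2(hz / ref_hz)

def mean_pitch_st(frames_hz):
    # Average pitch, in semitones, over the first ten voiced 100 ms frames.
    assert len(frames_hz) >= 10, "a pair requires at least ten voiced frames"
    return sum(to_semitones(f) for f in frames_hz[:10]) / 10.0

def pairwise_difference(original_hz, resumption_hz):
    # Positive result: the resumption starts higher than the original reading.
    return mean_pitch_st(resumption_hz) - mean_pitch_st(original_hz)

# Hypothetical pair of frame-wise pitch tracks (Hz), one value per 100 ms frame.
original = [110, 112, 111, 109, 108, 110, 111, 110, 109, 108]
resumption = [120, 121, 119, 118, 117, 116, 115, 114, 113, 112]
delta = pairwise_difference(original, resumption)  # positive: resumption higher
```

Averaging such deltas over all 59 pairs would give the kind of figure reported above (about 1.5 semitones).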



Initiality accent deaccenting

Sara Myrberg
Dept. of Swedish Language and Multilingualism, Stockholm University, Sweden
[email protected]

Abstract

The paper presents the results of a production experiment designed to study when initiality accents appear in a sentence. Based on a previous observation (Myrberg, 2010) that focal accents in the clause initial constituent can deaccent initiality accents, the present experiment examines whether the length and information structural status (focused, given) of a clause initial subject affects the rate of initiality accent deaccenting. Results show that initiality accents are more often deaccented in focused and long subjects than in given and short ones.

Introduction: initiality accents

Initiality accents are tonal markers of the left edges of Intonation Phrases in (at least) Stockholm Swedish (Roll et al., 2009; Myrberg, 2010, 2013). They share their shape and much of their phonological behavior with focal accents (Bruce, 1977, 1998). Thus, like focal accents in Stockholm Swedish, initiality accents have the tonal representation H*LH (accent 2) or L*H (accent 1). This makes initiality accents look much like prominences.

Functionally, however, initiality accents are similar to boundary tones. They do not serve as markers of any information structural category. They rather have the function of marking the beginning of a new Intonation Phrase. Initiality accents appear on the first lexically stressed word in a sentence, i.e. the first Prosodic Word (PWd) (Myrberg & Riad, 2013; Riad, 2014). Thus in (1b), bruna 'brown' carries an initiality accent.

(1) a. Var bor den bruna haren?
       where lives the brown hare
    b. Den bruna haren bor [i parken]Focus.
       the brown hare lives in park-the

Initiality accent deaccenting

We know that initiality accents are sensitive to the presence of focal accents in the clause initial constituent. A focal accent in the clause initial constituent can deaccent the initiality accent, preventing it from appearing in the clause (Myrberg, 2010). This happens e.g. when there is a narrow focus in the clause initial constituent, as in (2).

(2) a. Vilket brunt djur bor i parken?
       what brown animal lives in park-the
    b. Den bruna [haren]Focus bor i parken.
       the brown hare lives in park-the

In (2b) haren must obligatorily carry a focal accent, as it is information structurally focused. The presence of this focal accent prevents the appearance of an initiality accent on bruna. Instead of an initiality accent, bruna carries a word accent. The tonal representation for word accents is H*L for lexical accent 2, and HL* for lexical accent 1. The word accent is the lowest tonal prominence level in Swedish, and appears on most words that have a lexical stress (Bruce, 1977; Myrberg & Riad, 2013).

This paper presents the results of a production experiment designed to study the interaction of the focal accent and the initiality accent in clause initial subjects.

Two types of focal accents

In the present paper, two different functions of focal accents will be distinguished. These will be shown to interact in different ways with initiality accents when appearing in the clause initial constituent.


First, there are focusing focal accents. These are the focal accents that obligatorily appear on information structurally focused constituents (and it is with this function in mind that the focal accents have been named). A focusing focal accent appears on parken in (1b), to mark the whole PP i parken as information structurally focused. The focal accent on haren in (2b) is also a focusing focal accent.

Second, there are phrasing focal accents. These focal accents (like the focusing ones) appear on the last word of a constituent. Their function is to group the words of a constituent into one prosodic phrase. The phrasing focal accents do not signal information structural focus, and can even appear on given material, as we will see below. A phrasing focal accent can, optionally, appear on haren in (1b) (note that even when a phrasing focal accent appears on haren, an additional focusing focal accent must appear on parken).

The experiment presented here was designed to answer the questions in (3). In what follows, the experiment design is presented, followed by the results.

(3) Questions:
a. Does the focal accent have a stronger deaccenting effect the closer it appears to an initiality accent? If yes, initiality accents will be deaccented more often in short subjects than in long ones.
b. Do phrasing focal accents and focusing focal accents deaccent initiality accents equally much?

Material

The production experiment was designed to study deaccenting of initiality accents in information structurally focused and given clause initial subjects of four different lengths. Five native female Stockholm Swedish speakers were asked to read target sentences as in (4a–d).

(4) Target sentences
a. subject length: 2 PWd (underlined)
   [Den bruna haren]S [bor i parken]VP
   the brown hare lives in park-the
b. subject length: 3 PWd (underlined)
   [Den bruna haren med ungar]S [bor i parken]VP
   the brown hare with kids lives in park-the
c. subject length: 4 PWd (underlined)
   [Den bruna haren med många ungar]S [bor i parken]VP
   the brown hare with many kids lives in park-the
d. subject length: 5 PWd (underlined)
   [Den bruna haren med många söta ungar]S [bor i parken]VP
   the brown hare with many cute kids lives in park-the

(5) Context questions
a. subject is given, part of VP is focused
   Var bor [subject of 4a–d]?
   where lives […]
b. part of subject is focused, VP is given
   Vilken brun hare / vilket brunt djur bor i parken?
   which brown hare / what brown animal lives in park-the

There were ten sets of items as in (4a–d). Five sets had accent 1 words in the subject, and the other five sets had accent 2 words. Each sentence was read in the context of the question in (5a) as well as the one in (5b), with three repetitions.

In total, this resulted in a corpus of 1200 read sentences (10 items * 4 length conditions * 2 focus structures * 3 repetitions * 5 speakers = 1200 sentences). The question-answer pairs were presented to the speakers on a laptop screen.
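The corpus-size arithmetic can be checked by enumerating the design cells; the labels below are illustrative, not the experiment's actual item names:

```python
from itertools import product

# Hypothetical labels for the design described above.
items = [f"item{i}" for i in range(1, 11)]       # 10 item sets
lengths = ["2PWd", "3PWd", "4PWd", "5PWd"]       # 4 subject lengths
focus = ["subject_given", "subject_focused"]     # 2 context questions (5a/5b)
repetitions = [1, 2, 3]                          # 3 repetitions
speakers = ["S1", "S2", "S3", "S4", "S5"]        # 5 speakers

# Fully crossed design: every combination is one read sentence.
corpus = list(product(items, lengths, focus, repetitions, speakers))
print(len(corpus))  # 10 * 4 * 2 * 3 * 5 = 1200
```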


Annotation procedure

The sentences were annotated semi-manually using Praat (Boersma & Weenink, 2013). Word boundaries were manually annotated, and tonal targets were automatically extracted in the first and last word of each subject (4a: bruna + haren; 4b–d: bruna + ungar).

An annotation procedure was designed that automatically placed three tonal points (points A, B, C) in each target word, as illustrated in Figure 1. Measurement errors and microprosodic effects were manually corrected.

Figure 1. The annotation procedure assigned three measure points in each word (points A, B, C). The upper panels show the distribution of the three tonal measure points in accent 1 words. The lower panels show the distribution of the three tonal points in accent 2 words.

The three points A, B, C were used to identify and distinguish between focal/initiality accents (lexical accent 1: L*H, lexical accent 2: H*LH) and word accents (lexical accent 1: HL*, lexical accent 2: H*L) (Bruce, 1998; Myrberg, 2010, 2013). A high value for point B indicates the presence of a focal/initiality accent in accent 1. A high value for point C indicates the presence of a focal/initiality accent in accent 2.

Independently of the annotation of the tonal points, the author made a subjective judgment for each target word, with respect to whether the contour on that word was a focal/initiality accent or a word accent. Together, the tonal annotation procedure and the subjective rating of tonal contours form the base of the analysis presented here.

Results and discussion

The effect of constituent length on initiality accent deaccenting (question 3a)

The length of the focused constituent does affect the shape and distribution of initiality accents.

The length of the subject has a statistically significant effect on the height of the Hfocus tone (point B in the accent 1 annotation and point C in the accent 2 annotation). This is shown in Figure 2. The difference between the subject lengths 2 PWd and 5 PWd is highly significant for all speakers (accent 1 and 2 data taken together, 2-sided t-tests, p < 0.001).

Figure 2. Height of the Hfocus tone in the four different length conditions (2–5 PWd; y-axis in St re 1 Hz). The accent 1 data corresponds to point B in the annotation (L*H). The accent 2 data corresponds to point C in the annotation (H*LH). Each box contains data points from all 5 speakers, and accent 1 as well as accent 2 data.

Between 3 PWd subjects and 5 PWd subjects, speaker 5 has a significant effect in both the accent 1 and 2 conditions, and speaker 1 has a significant effect in the accent 2 condition (p < 0.01). Speakers 2, 3, and 4, however, have no significant difference between the 3 PWd and the 5 PWd subjects.
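The point-based decision rule described above – a high point B value signals a focal/initiality accent in accent 1 words, a high point C value in accent 2 words – could be sketched as follows. The fixed threshold is a made-up cutoff for illustration; the paper combines the tonal points with subjective ratings rather than a hard threshold:

```python
def classify(lexical_accent, point_a, point_b, point_c, threshold_st=95.0):
    """Label a target word as initiality/focal accented ('IA') or word
    accented ('WA') from its three tonal points (values in St re 1 Hz).
    The diagnostic point is B for lexical accent 1 (L*H) and C for
    lexical accent 2 (H*LH); the threshold is an invented cutoff."""
    peak = point_b if lexical_accent == 1 else point_c
    return "IA" if peak >= threshold_st else "WA"

# Hypothetical accent 1 word with a high point B: rated initiality accented.
print(classify(1, point_a=90.0, point_b=98.0, point_c=91.0))  # IA
```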


In addition, subject length has a significant effect on the subjective judgments of whether the initial word in each subject carries an initiality accent or a word accent. This is shown in Figure 3 (χ² = 54.8168, df = 6, p < 0.001).

Figure 3. Number of initiality accents (IA), word accents (WA) and unclear cases (IA?) in the four length conditions (2–5 PWd).

It is worth noting that words rated as initiality accented have higher values for the Hfocus targets (annotation point B for accent 1, point C for accent 2) than words rated as word accented. This is illustrated in Figure 4.

Figure 4. The height of the Hfocus tone (point B in the accent 1 annotation, point C in the accent 2 annotation), for the five speakers (1–5), in words rated as initiality accented (ia) and word accented (wa) respectively. Accent 1 and 2 data is plotted together.

Initiality accent deaccenting with focusing vs. phrasing focal accents (question 3b)

In addition to the effect of subject length, the information structural status of the subject has an effect on the deaccenting of the initiality accent.

Figure 5 illustrates the difference in frequency of initiality accents in the subject when the subject is given as in the answer to (5a), versus focused as in the answer to (5b) (χ² = 159.5346, df = 2, p < 0.001).

Figure 5. Number of initiality accents (IA), word accents (WA) and unclear cases (IA?) on the first word of information structurally focused versus given subjects.

The effect in Figure 5 is unsurprising, given that deaccenting happens only in subjects that have a focal accent on their last word. When there is no focal accent in the subject, the initiality accent is obligatory. (This seems safe to conclude, based on the fact that among the 1198 sentences in the dataset, deaccenting happens only when there is a focal accent in the subject. Subjects without a focal accent on their last word carry initiality accents on their first word.)

In the subject focus condition, all subjects carry a focal accent on the last word. In the given subject condition, however, the focal accent is not obligatory on the last word of the subject. The result in Figure 5, then, could merely be due to the higher frequency of focal accents on the last words of focused subjects than given ones.

Interestingly, however, this does not seem to be the case. The effect remains when all subjects that do not have focal accents are excluded.

Figure 6 shows the distribution of initiality accents in subjects that have focal accents. We see that approximately half of the given subjects carry focal accents. As expected, almost all focused subjects carry focal accents. Among the given subjects with focal
accents on their last word, the vast majority also contain an initiality accent on their first word. Among the focused subjects, however, less than half have an initiality accent in addition to the focal accent.

Figure 6. Number of initiality accents (IA), word accents (WA) and unclear cases (IA?) on the first word of subjects that also have a focal accent on their last word. Among given subjects, approximately 50% have a (phrasing) focal accent on their last word. Among focused subjects, (almost) all have a (focusing) focal accent on their last word.

Figure 6 shows that, proportionally, focal accents that appear in information structurally focused subjects are less likely to cooccur with an initiality accent in the subject, compared to focal accents in given subjects.

Put differently, the (obligatory and focusing) focal accents that appear on information structurally focused subjects have a stronger deaccenting effect than the (optional and phrasing) focal accents that appear on given subjects. We may extend this observation to a claim that in the present dataset, the focusing focal accents are nuclear accents, whereas the phrasing focal accents are prenuclear accents.

The term nuclear accent has often been used to refer to the rightmost accent of an Intonation Phrase in the literature on Germanic intonation (Pierrehumbert, 1980; Ladd, 2008). In sentences with a single focus, the nuclear accent must correlate with the focus (e.g. Truckenbrodt, 1995; see discussion in Myrberg & Riad, in press).

In the Swedish intonation research, the relation between the notions focal accent and nuclear accent has not been much discussed. The results of the present experiment, however, indicate that focal accents are of two types: the focusing ones, with a "strong" deaccenting effect, and the phrasing ones, with a "weaker" deaccenting effect.

It makes sense to analyze the focusing accents in this dataset as nuclear accents, and the phrasing accents as prenuclear accents. When the subject is focused in this dataset, the VP that follows it is given. A given constituent that follows a focus generally does not contain any focal accents (Bruce, 1977; Myrberg, 2010). The accents that appear on information structurally focused subjects are thus the rightmost focal accents in their sentences, and we can therefore refer to them as nuclear.

When the subject is given in this dataset, the VP is always focused and must, therefore, carry a focusing focal accent, independent of whether or not the subject has a phrasing focal accent. The accent on a given subject, therefore, is not rightmost in its sentence, thus prenuclear.

Accepting that one intonation phrase can contain multiple focal accents, and that the rightmost of these is the nuclear accent (in accordance with common analyses of other Germanic languages), we arrive at the generalization that nuclear accents have a more powerful deaccenting effect than prenuclear accents.

Conclusion

The results of a production experiment were presented, which show how focal accents and initiality accents interact in clause initial subjects. The results show that the closer a focal accent appears to the clause initial word, the less likely it is that an initiality accent is realized.

In addition, focal accents that appear on focused subjects have a stronger deaccenting effect than focal accents on given subjects. It was argued that the former can be analyzed as nuclear accents, as they are the rightmost focal accents in their sentences, whereas the latter are prenuclear accents. The fact that these accents behave differently in terms of how they interact with initiality accents provides additional support for their different status in the intonational phonology.

Acknowledgements

For financial support, I gratefully acknowledge Anna Ahlströms and Ellen Terserus stiftelse and The Swedish Research Council.

References

Boersma, P. & Weenink, D. (2013). Praat: doing phonetics by computer. http://www.praat.org.
Bruce, G. (1977). Swedish word accents in sentence perspective. Lund: Liber Läromedel.
Bruce, G. (1982). Developing the Swedish intonation model. Working Papers 22. Lund: Department of Linguistics and Phonetics, Lund University. 51–116.
Ladd, D. R. (2008). Intonational phonology (2nd ed.). Cambridge: Cambridge University Press.
Myrberg, S. (2010). The Intonational Phonology of Stockholm Swedish. ACTA Universitatis Stockholmiensis 53. Stockholm Studies in Scandinavian Philology New Series. Stockholm: Department of Scandinavian Languages, Stockholm University.
Myrberg, S. (2013). Sisterhood in prosodic branching. Phonology 30.1: 73–124.
Myrberg, S., & Riad, T. (2013). The prosodic word in Swedish. In E. L. Asu & P. Lippus (eds.), Nordic Prosody. Proceedings of the XIth Conference, Tartu 2012. Frankfurt am Main: Peter Lang. 255–264.
Myrberg, S., & Riad, T. (in press). On the expression of focus in the metrical grid and in the prosodic hierarchy. In C. Féry & S. Ishihara (eds.), Handbook on Information Structure. Oxford University Press.
Riad, T. (2014). The Phonology of Swedish. Oxford University Press.
Roll, M., Horne, M., & Lindgren, M. (2009). Left-edge boundary tone and main clause verb effects on syntactic processing in embedded clauses – An ERP study. Journal of Neurolinguistics 22. 55–73.
Truckenbrodt, H. (1999). On the relation between syntactic phrases and phonological phrases. Linguistic Inquiry 30. 219–255.
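The Pearson χ² statistics reported above (for Figures 3 and 5) can be computed from a contingency table of counts; the sketch below is generic, and the counts in it are invented placeholders, not the experiment's data:

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an
    r x c contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical IA / IA? / WA counts in the four length conditions:
# a 4 x 3 table gives df = 6, matching the Figure 3 test above.
counts = [[260, 20, 20],
          [230, 25, 45],
          [200, 30, 70],
          [170, 30, 100]]
stat, df = chi_square(counts)
print(df)  # 6
```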


Syllable structure and tonal representation: revisiting focal Accent II in Swedish

Antonis Botinis 1, Gilbert Ambrazaitis 2, Johan Frid 2
1 Lab of Phonetics & Computational Linguistics, University of Athens, Greece
2 Centre for Languages and Literature, Lund University, Sweden
[email protected], [email protected], [email protected]

Abstract

This is a study of tonal representation as a function of syllable structure constituency in Swedish. The results of a production experiment indicate that the onset of the focal accent rise – which we suggest to be best represented by a bitonal LH command – is associated with the consonant onset of the post-accented syllable. Furthermore, a vowel insertion is favored in certain intervocalic consonant clusters. In light of these findings, as well as a parallel study on Greek, we claim: (1) syllabification is a basic prerequisite condition in tonal analysis and intonation studies, (2) tonal targets may define syllable boundaries and hence syllabification, and (3) different tonal targets may be associated with different syllable structure constituents in different languages.

Introduction

This presentation is part of a large study on syllable structure and crosslinguistic prosody. Our general hypothesis is that different types of tonal commands and related tonal targets are associated with specific syllable constituents in languages with different prosodic structures, such as standard Athenian Greek (hereafter Greek) and standard Stockholm Swedish (hereafter Swedish).

In Greek, early results have shown that tonal rises associated with lexical stress as well as focus production initiate at syllable onsets (e.g. Botinis, 1989). In Swedish, Bruce's research (e.g. 1977) has shown how a lexical accent distinction is associated with the timing of a HL tonal command in relation to accented syllables, i.e. an early HL fall for accent I (acute) and a late HL fall for accent II (grave). On the other hand, sentence (or focal) accent is associated with a tonal rise (represented by a H), following the accent II fall, but no direct association with a specific syllable constituent has been suggested for simplex Accent II words. Thus, tonal analysis assumes some type of association between stressed syllables and specific tonal commands one way or another, albeit with a variety of different functions among languages.

Despite the general appeal to the syllable in tonal analysis, the notion of the syllable itself and related syllabifications remain a controversial issue. Theoretical approaches, such as the Maximum Onset Principle (MOP) and the Sonority Sequence Principle (SSP), may predict diverse syllabifications. Thus, the fairly internationalized word "pasta" is syllabified as /pa.sta/ according to MOP, as the consonant cluster /st/ is canonical at the onset of words (cf. "studio"), but as /pas.ta/ according to SSP, as there is no sonority rise between sibilant and stop sequences. On the other hand, experimental approaches have hardly provided reliable phonetic evidence. Maddieson (1985), e.g., suggests Closed Syllable Vowel Shortening (CSVS) as a phonetic correlate of syllabification, according to which vowels are shorter in closed syllables than in open ones.

Botinis and Nirgianaki (2014, this volume) suggest tonal turning points as a tonal correlate of syllabification. In Greek, specifically, the L tonal target of LH commands in lexical as well as focus contexts is associated with the syllable onset. Furthermore, a vowel segment
may be inserted between intervocalic consonant clusters. In /av'ɣo/ ('egg'), e.g., in accordance with the tonal turning point, the intervocalic consonant cluster is heterosyllabified, whereas a vowel is as a rule inserted between the consonants. This syllabification supports the SSP predictions, as there is no sonority rise between fricatives, but does not support the MOP predictions, as /vɣ/ is a canonical syllable onset in the lexical domain (cf. /'vɣeno/ 'go out').

In Swedish, two tonal commands and respective L targets may be assumed to correlate with syllabification: the L target of the accent II HL command can be expected to be reached in the vicinity of the syllable boundary; according to Bruce (1977), this L target likewise constitutes the onset of the rise resulting from the focal accent H command. The latter, as we will argue, might be better represented as a bitonal LH in accent II, instead of the established monotonal H. However, associations of tonal commands and related targets as a function of syllable structure variability have hardly been investigated.

Swedish prosodic typology has a two-way binary distinction: first, a complementary quantity distinction, according to which long and short vowels are in principle followed by short and long consonants, respectively (e.g. "glass" /VCː/ 'ice cream' vs. "glas" /VːC/ 'glass'); and, second, a lexical accent distinction, according to which stressed syllables carry either accent I or accent II (e.g. "tánken" 'the tank' vs. "tànken" 'the thought'). Interestingly, the lexical accent distinction may take place in either type of quantity distinction. On the other hand, Swedish, much like other Germanic languages, is a fairly closed syllable structure language with a variety of branching codas, and thus any type of syllabification does not in principle violate canonical syllable phonotactics, with either preceding codas or following onsets. Thus, unlike Greek, which is a fairly open syllable language, Swedish hardly has any optimal context for vowel insertions.

In this presentation, in accordance with the above description and especially the syllabification issue in Swedish, we test the following hypotheses. Hypothesis 1: The L target of the lexical accent II HL tonal command correlates with the right edge of the consonant coda. Hypothesis 2: This L target can also be regarded as the onset of the focal accent rise to the following H, which correlates with the left edge of the consonant onset. Hypothesis 3: No vowel insertion between intervocalic consonant clusters as a function of syllabification is favored.

Experimental methodology

In order to test the above hypotheses, a production experiment was designed. The speech material consists of eight accent II test words (Table 1) in the carrier sentence "vi säger ___ igen" ('we say ___ again'), produced at normal tempo by six female speakers, grown up and educated in the wider Stockholm area. One speaker pronounced one of the test words idiolectically and was excluded. Each speaker produced the speech material five times, and the corpus thus counts 200 tokens (8 test words x 5 speakers x 5 repetitions = 200).

Table 1. Intervocalic consonant context and test words with respective glosses.

   Context   Test words   Glosses
1. /Vːl/     vila         to rest
2. /Vlː/     villa        villa
3. /Vmn/     nämna        to name
4. /Vlv/     halva        half
5. /Vːvl/    tavla        board
6. /Vvl/     kravla       to crawl
7. /Vːl#/    bil arv      car heritage
8. /Vː#l/    bi larv      bee larva

The speech material was recorded at a sound-treated studio at the Humanities Laboratory, Lund University, and the speech analysis was carried out with Praat (Boersma & Weenink, 2013). Acoustic analysis and measurements were carried out by the authors.
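The MOP vs. SSP contrast for "pasta" discussed in the introduction can be illustrated with a toy implementation. The onset inventory and sonority scale below are simplified assumptions for illustration only:

```python
# Toy syllabification of a consonant cluster between two vowels,
# contrasting the Maximum Onset Principle (MOP) with the Sonority
# Sequence Principle (SSP). Inventories are simplified assumptions.
CANONICAL_ONSETS = {"st", "s", "t", "p", "pl", "v", "l"}  # assumed word onsets
SONORITY = {"p": 1, "t": 1, "s": 2, "v": 3, "m": 4, "n": 4, "l": 5}

def syllabify_mop(v1, cluster, v2):
    # MOP: assign the longest cluster that is a canonical word onset
    # to the second syllable.
    for split in range(len(cluster) + 1):
        if cluster[split:] in CANONICAL_ONSETS or cluster[split:] == "":
            return f"{v1}{cluster[:split]}.{cluster[split:]}{v2}"

def syllabify_ssp(v1, cluster, v2):
    # SSP: keep only a sonority-rising sequence in the second onset.
    split = len(cluster) - 1
    while split > 0 and SONORITY[cluster[split - 1]] < SONORITY[cluster[split]]:
        split -= 1
    return f"{v1}{cluster[:split]}.{cluster[split:]}{v2}"

print(syllabify_mop("pa", "st", "a"))  # pa.sta
print(syllabify_ssp("pa", "st", "a"))  # pas.ta
```

With /st/ in "pasta", MOP maximizes the onset (/pa.sta/), while SSP rejects the non-rising /st/ onset and heterosyllabifies (/pas.ta/), as described in the introduction.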


Results

This section presents qualitative analysis examples, followed by quantitative analysis of vowel insertions.

Qualitative analysis

In figure 1.1, the accent II HL fall in the word "vi:la" spans the first part of the nucleus vowel. This tonal structure could be accounted for by assuming that long vowels in Swedish consist of two moras, which is also apparent in the waveform of the figure: the accent II fall in the first mora of the nucleus vowel is followed by a low tonal plateau throughout its second mora. The focal accent rise, on the other hand, spans between the left edge of the postvowel consonant and the succeeding nucleus vowel. This suggests that the onset of the focal accent rise is correlated with the syllable boundary.

In figure 1.2, the accent II HL fall in the word "vil:a" spans the nucleus vowel, which is short, and is followed by a low tonal plateau up to the middle of the postvowel consonant, whereas the focal accent rise follows thereafter up to the following vowel. Thus, long consonants in Swedish, much like long vowels, seem to behave like bimoraic syllable constituents and hence heterosyllabification is evident, with the two moras attached to different syllables. This analysis indicates (1) complementary tonal structure distribution in accordance with quantity functional distribution and (2) two L targets as a function of a low tonal plateau between accent II fall and focal accent rise, which indicates a bitonal LH representation, rather than a monotonal H.

In figure 1.3, the accent II and focal accent complementary tonal structure distribution fall-plateau-rise in the test word "nämna" is evident, which correlates with the nucleus short vowel, the first and the second consonant of the intervocalic cluster, respectively. A vowel insertion is however also evident, which may be a means to reinforce syllable boundaries of intervocalic consonants.

[Figure 1, panels 1-4: a female speaker's productions of "vi säger vi:la / vil:a / nämVna / halVva igen", annotated with H L L H tonal targets.]
Figure 1. A female speaker's examples of tonal representations as a function of syllable structure variability (cont. next page).

In figure 1.4, the same tonal structure as that of figure 1.3, as well as a vowel insertion, are apparent in the test word "halva". Thus, the accent II and

the focal fall-level-rise tonal sequence correlates with the nucleus short vowel, the first and the second consonant of the cluster, respectively.

In figure 1.5, the accent II and the focal fall-plateau-rise sequence is also evident in the word "ta:vla". However, the vowel nucleus is long, despite the heterosyllabification of the intervocalic consonant cluster, which indicates that vowels in Swedish may be long even in closed syllable contexts. Furthermore, the low tonal plateau correlates with the first consonant of the cluster.

In figure 1.6, the accent II and the focal fall-plateau-rise sequence correlates with the short vowel nucleus and the heterosyllabic consonant cluster in "kravla". The focal L target at the right edge of the first consonant indicates that the words "ta:vla" and "kravla" undergo the same syllabification, despite their respective long vs. short vowel nuclei.

In figure 1.7, the accent II and focal fall-plateau-rise sequence correlates with the first mora of the vowel nucleus, the second mora, and the intervocalic consonant, respectively. Thus, in analogy with the tonal associations observed for "vila", this tonal pattern indicates the expected syllabification "bi.larv". For "bil arv" (fig. 1.8) creaky voice at the onset of the second vowel is apparent, indicating a glottal stop and thus the syllabification "bil.arv". The tonal pattern is somewhat inconclusive: the tonal rise at the cluster boundary may constitute the onset of the focal rise, thus indicating the same syllabification as in figure 1.7.

The qualitative analysis above revealed four key aspects of Swedish prosody. First, tonal commands and related tonal targets may be associated with specific syllable constituents. Second, a low tonal plateau intervenes between accent II and focal tonal targets. Third, the L target of the focal tonal rise is a constant correlate of syllabification. Fourth, several intervocalic clusters favor vowel insertion whereas other clusters disfavor it.

[Figure 1, panels 5-8: the same speaker's productions of "vi säger ta:vla / kravla / bi: # larv / bi:l # arv igen", annotated with H L L H tonal targets.]
Figure 1. A female speaker's examples of tonal representations as a function of syllable structure variability (see text).

Quantitative results

In this paper, the quantitative results are confined to vowel insertions between consonant clusters (the total results will be presented at the conference).
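The tallying procedure reported in the next section (insertions of vowel-like segments longer than 20 ms counted as true vowel insertions, shorter ones treated as production artifacts) can be sketched as follows; the function name and the example durations are our own illustrations, not the authors' code:

```python
THRESHOLD_MS = 20.0  # below this, a vowel-like segment counts as a production artifact

def insertion_rate(durations_ms, n_tokens=25):
    """Count true vowel insertions among measured vowel-like segment
    durations and return (count, percent of the word's tokens).
    n_tokens defaults to 25 (5 speakers x 5 repetitions per test word)."""
    count = sum(1 for d in durations_ms if d > THRESHOLD_MS)
    return count, 100.0 * count / n_tokens

# Hypothetical durations for one test word: 16 of 25 tokens exceed 20 ms.
durations = [35.0] * 16 + [8.0] * 9
print(insertion_rate(durations))  # → (16, 64.0)
```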


Table 2 shows vowel insertions as a function of phonotactic variability across intervocalic consonant sequences. In accordance with our experimental methodology, vowel-like segment insertions longer than 20 ms were considered true vowel insertions, whereas insertions shorter than 20 ms were considered production artifacts and are thus not included in the table. It should be noted that vowel insertion took place only in test words 3-6, which are included in table 2. Thus, no vowel insertion takes place in the context of long vowels, assuming that their internal composition consists of two moras.

Table 2. Vowel insertion in intervocalic consonant context as a function of syllable structure and phonotactic variability.

                            Vowel insertion
Context      Test words     Count    %
3. /Vmn/     nämna          16       64
4. /Vlv/     halva          23       92
5. /Vːvl/    tavla          3        12
6. /Vvl/     kravla         6        24

It is evident that vowel insertion takes place between all these intervocalic consonant clusters, albeit at different percentages. Thus, the nasal-nasal as well as the liquid-fricative sequences seem to favor vowel insertion whereas the fricative-liquid ones disfavor it. It should be noted that MOP predicts heterosyllabification for the intervocalic cluster consonants of all test words 3-6, whereas SSP predicts heterosyllabification for words 3-4 but tautosyllabification for words 5-6. Interestingly, the SSP heterosyllabification prediction seems to favor vowel insertion in words 3-4.

Discussion

In accordance with the hypotheses posited in the introduction and the experimental methodology, the results support Hypothesis 2, i.e. the onset of the focal rise appears to correlate with the left edge of the post-stress syllable onset. This is also evident with reference to the heterosyllabification of the moraic elements of long consonants, according to which the focal rise is correlated with the left edge of the second mora. At the same time, the L target of the focal rise functions as a phonetic correlate of syllabification in Swedish, which was the main aim of this study in the first place. This finding leads the way to the reconsideration and a revised tonal representation of the focal command as a bitonal LH, instead of a monotonal H. On the other hand, neither Hypothesis 1 nor Hypothesis 3 is supported, as the L target of the accent II HL command is correlated with the right edge of the first mora of the nucleus vowel, whereas a vowel may be inserted between intervocalic consonants.

Our results thus further enlighten Bruce's (1977) tonal analysis of Swedish, according to which the lexical accent II tonal fall and the focal accent tonal rise are distinct realizations of respective prosodic functions. This was a unique approach in prosodic analysis at the time, as the accent distinction had traditionally been described as a "double-peaked" accent II versus a "single-peaked" accent I. Bruce's approach was widely adopted in tonal analysis of Swedish, suggesting a succession of accent II fall and focal accent rise: "For a non-compound focal accent II (H*L H) the word accent II fall (tied to the stressed syllable) and the focal accent rise will typically occur in immediate succession." (Bruce & Granström 1989, p. 18). Thus, in practice, Bruce suggested a tonal interpolation between the L target of the accent II fall and the H target of the following focal accent rise, which critically disregards the L target of the focal accent rise. In accordance with our analysis, however, this latter L target shows constant stability and we assume that its correlation with the onset syllable constituent is essential in the tonal representation of Swedish.

Bruce's analysis of Swedish had a major impact on tonal analysis and the development of prosodic theory. Thus, following Bruce (1977), Pierrehumbert (1980) suggests two tonal categories, i.e. "pitch accent" and "phrase accent",

which roughly (phonologically, but not functionally) correspond to the respective lexical accent and focal accent in Swedish. In accordance with Pierrehumbert (1980) and mainstream Autosegmental-Metrical theory (AM theory) thereafter, pitch accents may be either monotonal (L* or H*) or bitonal (e.g. L*H or H*L), whereas phrase accents are in principle monotonal (i.e. either L- or H-). However, our results in this study contradict AM theory's premises about the monotonal representation of the phrase accent (in our terms, the focal accent), at least with reference to Bruce's analysis of Swedish and its adoption in the context of AM theory.

Another shortcoming of AM theory is the pitch accent representation itself. The H*+L pitch accent, i.e. the accent II phonological representation in Swedish, assumes an H tonal target in the domain of the stressed syllable, i.e. a starred tone, whereas the L tonal target is basically unspecified. In principle, the tonal target of the L tone may thus be anywhere to the right of the H tone, even outside the stressed syllable itself. In practice, AM theory notation and the assumptions behind it seem thus to be too underspecified for underlying phonological representations and likewise too broad for surface phonetic representations. Instead, the association of tonal commands and respective tonal targets with specific syllable constituents, in accordance with the results of the present study, is closer to phonetic reality and matches in a natural way the phonetics and phonology of prosody. Interestingly, that is what AM theory and its basic premises advocate in practice, i.e. the relation of phonetics and phonology in the first place.

Approaches within the framework of AM theory define the targets of tonal associations in alternative ways. Atterer and Ladd (2004), e.g., suggest associations of pitch accents and respective tonal targets with segmental landmarks, i.e. specific "segmental anchorings". Although an insightful remark per se, no further elaboration whatsoever is attempted with regard to interactions of tonal representations and syllable structure constituents. Thus, to the best of our knowledge, the role of syllable structure constituency in intonation studies has practically been ignored in current prosodic research.

In our view (see also Botinis & Nirgianaki 2014, this volume), segmental strings of any length are in principle organized into syllable sequences whereas, at the same time, segments associate with syllable constituency. Underlying representations and related tonal commands, on the other hand, optimally surface as specific tonal targets at specific syllable constituents. The interface between different tonal commands and different syllable domains may thus vary across languages, which is a challenging line of crosslinguistic prosody research and prosodic theory development in general.

References
Atterer, M. & Ladd, D.R. (2004). On the phonetics and phonology of "segmental anchoring" of F0: evidence from German. Journal of Phonetics 32, 177-197.
Boersma, P. & Weenink, D. (2013). Praat: Doing phonetics by computer (http://www.praat.org).
Botinis, A. (1989). Stress and Prosodic Structure in Greek.
Botinis, A. & Nirgianaki, E. (2014). Tonal production and syllabification in Greek (this volume).
Bruce, G. (1977). Swedish Word Accents in Sentence Perspective. Lund: Gleerup.
Bruce, G. & Granström, B. (1989). Modelling Swedish intonation in a text-to-speech system. STL-QPSR 30, 17-21.
Maddieson, I. (1985). Phonetic cues to syllabification. In Fromkin, V.A. (ed.), Phonetic Linguistics, 203-221. New York: Academic Press.
Pierrehumbert, J.B. (1980). The Phonetics and Phonology of English Intonation. PhD thesis, MIT.


Prosodic boundaries and discourse structure in Kammu

Anastasia M Karlsson1, Jan-Olof Svantesson1, David House2
1 Department of Linguistics and Phonetics, Lund University, Sweden
2 Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
[email protected], [email protected], [email protected]

Abstract

The main function of sentence intonation in Kammu is to mark prosodic boundaries. There is no additional tonal marking of focus. It is of particular interest that the underlying intonation system is the same for both tonal (Northern Kammu) and non-tonal (Eastern Kammu) dialects. Prosodic boundaries in Kammu have three functions: they mark prosodic phrases, focus and speaker engagement. In this study we show that relationships between boundaries in terms of upstepping or its absence interact with information and discourse structure. This relationship has the same pattern in both tonal and non-tonal Kammu.

Introduction

Kammu is a Mon-Khmer language. It is spoken by some 600,000 people mainly in Northern Laos, but also in adjacent areas of Vietnam and Thailand. One of the main dialects of this language is a tone language of the 'East Asian' type with (high or low) tone on each syllable, while the other main dialect lacks lexical tones.

The origin of the tones of the tonal dialect is due to the development of high pitch in vowels following a voiceless consonant and low pitch in vowels following a voiced consonant, and the subsequent merger of voiceless and voiced consonants into the unmarked member of the pair, voiceless for stops and voiced for sonorants. Thus, puuc 'to undress' became púuc (high tone) in the tonal dialect and buuc 'wine' became pùuc (low tone). The non-tonal dialect kept the original forms unchanged. Other differences, phonological, morphological or syntactic, between the dialects are marginal, and speakers of different dialects understand each other without difficulty (Svantesson 1983; Svantesson & House 2006).

The main function of sentence intonation in Kammu is to mark prosodic boundaries. Phrase boundaries occur at the right edge of each prosodic phrase and are realised by a high (or high falling) pitch. The focused word is by default placed at the end of an utterance, coinciding with the place of the boundary tone, and the pitch of the phrase boundary tone is raised. There is thus no additional tonal gesture for focal accent. In the tonal dialect lexical tones do not change the phrase pattern, and we still find the high boundary tone at the right edge of prosodic groups unless it jeopardises the identity of the lexical tones (Karlsson et al. 2012).

Research questions and method

In Kammu, phrase boundaries between utterances said in isolation tend to be up-stepped. Informal observations of spontaneous narratives indicate that besides upstepping of phrase boundaries within an utterance there is also upstepping between boundaries of utterances. The upstepping occurs up to a certain point and then the same pattern repeats again. These turning points seem to occur at thematically similar places in narratives for all our speakers. Our goal is to find out whether these turning points are related to discourse structure. The main assumption is that tonal phrase boundaries in Kammu are multifunctional. They reflect prosodic phrasing, information structure and

discourse structure. First, we assume that information structure is reflected by upstepping of phrase boundaries. The utterance final boundary is the highest one as a reflection of default placement of the focused word (or 'new' information) at the end of an utterance. Second, we assume that the long-term relations between utterance boundaries reflect discourse structure.

In our analysis we distinguish between information structure and discourse structure. Narratives are divided into [given + new] units, called major phrases. As the information becomes given a new major phrase starts. Each major phrase consists of at least one minor phrase. Minor phrases are defined on prosodic grounds. We recognise a group as a minor phrase if it has a prosodic boundary (high or high falling pitch) at its right edge. Discourse topics are recognised on semantic grounds; this is described below. The F0 contour of a part of a narrative with its division into minor and major phrases and topics is shown in Figure 1.

Recordings of four speakers (all men) of the non-tonal dialect and six speakers (four women and two men) of the tonal dialect of Kammu were used for this investigation. They recorded spontaneous accounts of rice growing, from the beginning of the work in the field until the rice is cooked and eaten. All speakers are well acquainted with this process and their accounts are very similar. Thus, we got fairly homogeneous spoken texts lasting about 2–5 minutes each. The narratives were transcribed and glossed by a native speaker of Kammu.

Analysis

Information structure

Structuring of new and old information is achieved in the same way by all speakers: new information is placed at the end of the utterance; it is then repeated in the next utterance and is followed by new information. The informational structuring is [anchor + new1] [old new1 + new2] [old new2 + new3]… The new information becomes an anchor point (old information) in the next utterance. There are thus a lot of repeated words in the speakers' monologues, and anaphoric reference is seldom used. An example is (only key events are included):

Before there is rice we have to clear the field… After clearing the field we burn the field… After burning we sow.1

The text can thus be seen as a list of successive events. Some speakers use only one utterance per event while some add a lot of additional information.

In order to test our hypothesis that each informational unit [given + new], i.e. major phrase, is reflected in the tonal structure by upstepping of boundary tones, we made two kinds of analysis. First, we divided narratives into phrases on prosodic grounds. This was done by perceptual and visual analysis of the F0 contours using Praat. Each unit ending with a prosodic boundary tone was labelled as a minor phrase and the F0 maximum on the last word was measured. Second, we performed an analysis of the informational flow in narratives in terms of 'new' and 'given'. Each unit consisting of 'given + new' is labelled as a major phrase. Thus, the division is:

[[minor phrase]boundary1 [minor phrase]boundary2 [minor phrase]boundary3]major phrase]boundary4.

Each major phrase consists of at least one minor phrase. The boundary of the last minor phrase is also the boundary of the whole major phrase. We expect the F0 maxima of boundary tones to be upstepped with the highest F0 at the boundary of the major phrase (boundary 4 in the example above).

1 Kammu people practice slash-and-burn agriculture.


Division of narrative into topics

A discourse topic is seen as an informatively coherent part of discourse with a clear beginning and end (see e.g. Chafe 2001). As we are dealing with narratives about a traditional activity we used the Kammu agricultural calendar compiled by Damrong Tayanin: http://digaaa.humlab.lu.se/digaaa/web/kammu/KamRaw/kammu1.html.

The agricultural periods are: 1) Clearing, 2) First burning, 3) Second burning, 4) Sowing, 5) First weeding, 6) Second weeding, 7) Third weeding, 8) Harvest, 9) Finishing off the year, 10) Cold season.

Having these as our reference frame for division into thematic topics we found that all speakers have the following topics: 1) Clearing, 2) First and second burning, 3) Sowing, 4) Weeding, 5) Ripe rice, 6) Harvest, 7) Putting in barns, 8) Pounding rice, 9) Soaking rice, 10) Cooking rice, 11) Eating rice.

Some speakers have additional topics, such as making field houses, protecting crops from animals or different ways to cook rice. We chose only topics that were found for most speakers: all topics except (5) and (7) occur for all speakers.

As phrase boundary tones in Kammu convey several functions, we assume that they also interplay with discourse structure. As we observe upstepping of boundary tones within major phrases as a cue for their boundaries, we assume that also the end of topics will be marked by a higher boundary.

Results

Final boundaries of major phrases

In order to find out whether the general pattern is that F0 is rising in major phrases, we measured the F0 maximum of the last word of each major phrase and of each minor phrase within the major phrases. As a measure of the F0 rise within a major phrase we took the difference between the F0 of the major phrase and the mean of the F0 values of the (non-final) minor phrases that constitute the major phrase. For each speaker we thus obtained a number of differences which should be positive if the hypothesis that F0 increases in a major phrase is true.

To test this hypothesis we used an exact binomial test for each speaker based on the number of positive and negative differences. For tonal speakers, the influence of the tones was compensated for by adding the mean F0 difference between the high and low tone in the measured words for that speaker to the F0 value of the maximum measured in each word with low lexical tone.
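The two computations just described, the per-phrase F0 rise and the lexical-tone compensation for tonal speakers, can be sketched as below; the function names and the example values are ours, purely illustrative:

```python
def f0_rise(boundary_maxima_hz):
    """F0 rise within a major phrase: the F0 maximum at the final (major
    phrase) boundary minus the mean of the F0 maxima at the non-final
    minor phrase boundaries. Assumes at least two minor phrases."""
    final, nonfinal = boundary_maxima_hz[-1], boundary_maxima_hz[:-1]
    return final - sum(nonfinal) / len(nonfinal)

def compensate_low_tone(f0_max_hz, mean_hl_diff_hz, word_has_low_tone):
    """For tonal speakers: add the speaker's mean F0 difference between
    high and low tone to maxima measured on words with low lexical tone."""
    return f0_max_hz + mean_hl_diff_hz if word_has_low_tone else f0_max_hz

# A major phrase of three minor phrases with upstepped boundary tones:
print(f0_rise([200.0, 210.0, 230.0]))          # → 25.0 (positive: upstepping)
print(compensate_low_tone(180.0, 15.0, True))  # → 195.0
```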

[Figure: F0 contour (Pitch, Hz; range 50-275) over about 5.4 s of speech, annotated with minor phrase, major phrase and topic-end boundaries.]
Figure 1. Part of a narrative and its division into minor phrases, major phrases and topics. Non-tonal speaker. Glossing is [[go mark]minor phrase [we finish then clear]minor phrase]major phrase]topic ends [[clear]minor phrase [finish then dry]minor phrase]major phrase [[cut tree]minor phrase]major phrase [[that one month]minor phrase [two months]minor phrase [finish then burn]minor phrase]major phrase]topic ends.


The results are shown in Table 1. The tests show significant results (on the 5% level) for all speakers except Speaker 7 (non-tonal) and Speaker 18 (tonal), thus supporting our hypotheses for most speakers.

Table 1. The number of major phrases for which the difference between the F0 maximum of the major phrase and the mean F0 maxima of the constituent minor phrases is equal to, greater than or less than zero.

Non-tonal speakers:
Speaker    #Diff=0    #Diff>0    #Diff<0    p-value
1          0          15         0          <0.001
6          1          29         1          <0.001
7          0          9          3          0.07
8          0          20         4          <0.001

Tonal speakers:
Speaker    #Diff=0    #Diff>0    #Diff<0    p-value
17         0          11         1          0.003
18         0          10         4          0.09
19         0          9          2          0.033
20         0          12         1          0.0017
21         0          9          1          0.01
26         1          5          0          0.03

Boundaries of discourse topics

We tried to correlate the phrasing of the discourse with the local F0 maxima of the major phrases. In general there seems to be a tendency that local F0 maxima also serve as boundaries between discourse topics (in about 58% of the cases), but this is not always the case. The general trend of F0 maxima of boundary tones coinciding with the end of each topic is shown in Figure 2.

All speakers mark 'pounding' with the highest F0. After this point the general upstepping trend becomes opposite and we find downstepping between topics and also between boundaries of major phrases.
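The reported p-values are consistent with a one-sided exact binomial test on the non-zero differences (our reading; the paper does not spell out the sidedness). A minimal stdlib sketch:

```python
import math

def binom_p_greater(k_pos, n_nonzero):
    """One-sided exact binomial p-value: the probability of observing
    k_pos or more positive differences out of n_nonzero non-zero ones
    under the 50/50 null hypothesis."""
    total = sum(math.comb(n_nonzero, i) for i in range(k_pos, n_nonzero + 1))
    return total / 2 ** n_nonzero

# Speaker 7 (non-tonal): 9 positive, 3 negative differences
print(round(binom_p_greater(9, 12), 3))  # → 0.073 (reported as 0.07)
# Speaker 21 (tonal): 9 positive, 1 negative
print(round(binom_p_greater(9, 10), 2))  # → 0.01
```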

Figure 2. Mean of F0 maxima of topics.


Discussion

Final boundaries of major phrases

Spontaneous discourse encompasses many factors that may influence tonal patterns, such as phrasing, focusing, turn-taking, speakers' attitudes and degrees of engagement, self-corrections, hesitations, etc. Having investigated only one of these factors – topic marking – in our study, we have to keep in mind that our result may be influenced by all these factors. We chose to separate information structure and discourse structure in our study, which proved to be fruitful. Material was analysed by using three principles: prosodic analysis to extract tonal boundaries, analysis of informational status to detect major phrases and semantic analysis to decide topics. The three analyses were performed independently of each other and were then matched to see if our hypotheses are correct.

The division of narratives into information units [given + new] (major phrases) is reflected in prosodic phrasing by upstepping of boundary tones. We obtained statistical significance for both tonal and non-tonal speakers. As we move to discourse structure we can only talk about trends. All speakers mark the topic about pounding with the highest tonal boundary. Kammu speakers may see activities connected to rice as divided into two main parts: field work and cooking rice. Field work ends when one can pound the harvested and dried rice. Pounding is then the end of the first part of the narratives and is also marked by the highest boundary.

The end of other topics also tends to be tonally marked by a higher boundary. This trend is, however, broken by two main factors. The part after 'pounding', in which the rice is cooked and eaten, shows the opposite trend: boundaries of all units (both of major phrases and topics) tend to decline. Thus the end of discourse is marked by a long-term downtrend of prosodic boundaries.

Due to the character of the structure of the narratives, we assumed that topics are structured as [description + name of activity when it is finished]. For example, in

We go to seek a field, seek in the forest, after finding the field we clear

the part before clear will be the description, and clear is the name of the activity and its ending, coinciding with the end of the topic. However, in some cases we found another type of structuring of topics, where the topic is introduced at the beginning and then described, e.g.:

We seek a place we will clear, yes, seek the forest, look for a place that will be good for the rice and we clear.

Here, clear is introduced in the beginning as a new topic and its development comes afterwards. This kind of topic gets the highest F0 at the beginning of the topic instead of at the end.

Typological implication

As regards prosodic typology, Kammu belongs to the phrase language type in Féry's (2010) typology. In this type of language, information structure is most often conveyed by morpho-syntactic means, and focusing is achieved by changes in the pitch level of phrasing tones, dephrasing or insertion of a new boundary tone. No new pitch accents are added to mark a focused word as is the case in intonation languages. According to this description, major Indian languages such as Hindi, Bengali, Tamil and Malayalam (Féry 2010), as well as Korean (Jun 2005), West Greenlandic (Arnhold, to appear) and Mongolian (Karlsson, to appear) are typical phrase languages. Kammu has one type of boundary tone, realised with a high (or high falling) pitch. Boundaries are multifunctional and they convey phrasing, focus, engagement, and topic structure.


The occurrence of lexical tones does not lead to any differences, and we find the same strategies in conveying discourse structuring into topics in both tonal and non-tonal speakers.

Acknowledgements

The work reported here was done within the project Integrating the structures of information and discourse: a crosslinguistic approach, financed by the Swedish Research Council. An earlier version of this paper was presented in Karlsson et al. (2013).

References
Arnhold, A. (to appear). Prosodic structure and focus realization in West Greenlandic. In S.-A. Jun (Ed.), Prosodic Typology Volume II. Oxford: Oxford University Press.
Chafe, W. (2001). The analysis of discourse flow. In D. Schiffrin, D. Tannen & H. E. Hamilton (Eds.), The Handbook of Discourse Analysis (pp. 673-687). Oxford: Blackwell.
Féry, C. (2010). Indian languages as intonational 'phrase languages'. In S. I. Hasnain & S. Chaudhury (Eds.), Problematizing language studies: cultural, theoretical and applied perspectives – essays in honor of Rama Kant Agnihotri (pp. 288-312). Delhi: Aakar Books.
Jun, S.-A. (2005). Prosodic Typology. In S.-A. Jun (Ed.), Prosodic typology: the phonology of intonation and phrasing (pp. 430-458). Oxford: Oxford University Press.
Karlsson, A. (to appear). Intonation in Halh Mongolian. In S.-A. Jun (Ed.), Prosodic Typology Volume II. Oxford: Oxford University Press.
Karlsson, A., House, D. & Svantesson, J.-O. (2012). Intonation adapts to lexical tone: the case of Kammu. Phonetica 69, 28–47.
Karlsson, A. M., Svantesson, J.-O. & House, D. (2013). Multifunctionality of prosodic boundaries in spontaneous narratives in Kammu. In P. Mertens & A. C. Simon (Eds.), Proceedings of the Prosody-Discourse Interface Conference 2013 (IDP-2013) (pp. 45-50). Leuven, Belgium.
Svantesson, J.-O. (1983). Kammu phonology and morphology. Lund: Gleerup.
Svantesson, J.-O. & House, D. (2006). Tone production, tone perception and Kammu tonogenesis. Phonology 23, 309–333.


Tonal production and syllabification in Greek

Antonis Botinis, Elina Nirgianaki
Lab of Phonetics & Computational Linguistics, University of Athens, Greece
[email protected]

Abstract

This is a study of syllabification as a function of lexical stress and sentence focus tonal production in Greek. The results of a production experiment indicate that tonal turning points are associated with syllable onset constituents in both lexical stress and focus contexts and thus indicate syllable boundaries. On the other hand, several intervocalic consonant clusters favor vowel insertion whereas others disfavor it.

Introduction

This study examines the syllabification of intervocalic consonants as a function of lexical stress and sentence focus tonal production in Greek. Syllabification usually refers to phonotactic distribution and syllable structure, whereas empirical evidence is sporadic, at least with respect to acoustic correlates, being mainly restricted to duration patterns. However, syllabification may correlate with tonal production, and a question thus concerns syllable boundaries and tonal correlations in lexical stress and focus production contexts.

Research on syllabification advocates several principles in various theoretical contexts, such as the "Maximum Onset" principle (MOP) and the "Sonority Sequencing" principle (SSP). MOP predicts syllabification of consonants on the right, if the outcome forms a legal word-edge onset cluster (Kahn 1976). SSP predicts syllabification of consonants in accordance with a sonority scale, i.e., in a fairly simple version, [V(owel) > S(emivowel), L(iquid) > N(asal) > O(bstruent)], which forms a mirror-image onset rising and coda falling pattern in relation to the nucleus syllable peak (Steriade 1982, Clements 1990).

Syllable analysts oftentimes assume that the syllable is an abstract phonological unit, which does not, however, have any robust phonetic correlates (Kohler 1966). In many languages, however, including Greek, tonal turning points and especially tonal onset rises correlate with syllable-initial segments (e.g. Botinis 1989, Atterer & Ladd 2004). Thus, our main hypothesis in the present study is that tonal turning points correlate with syllable onset constituents and thus indicate syllabification in Greek.

Experimental methodology

In accordance with one production experiment, the speech material consists of 6 disyllabic oxytone words in the carrier sentence [ˈlen __ maˈzi] ("(they) say __ together") in lexical stress and sentence focus contexts (Table 1). 4 female speakers with standard Athenian pronunciation, in their early twenties, produced the speech material 4 times each. The corpus thus counts 192 tokens (6 words x 2 prosody conditions x 4 speakers x 4 productions). The recordings took place at the Athens University Phonetics Studio and speech analysis was carried out with Praat (Boersma & Weenink 2013). Tonal normalization was carried out with the ProsodyPro Praat script (Xu 2013) and statistical processing with SPSS.

Table 1. Intervocalic consonant context and oxytone test words with respective glosses.

  Cluster        Test word   Gloss
  1. [mn]/[NN]   [amno]      lamb
  2. [vɣ]/[FF]   [avɣo]      egg
  3. [zv]/[SF]   [azvo]      badger
  4. [ɣn]/[FN]   [aɣno]      pure
  5. [vl]/[FL]   [avlo]      flute
  6. [lɣ]/[LF]   [alɣo]      ache
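The stated corpus size follows directly from the design factors; as a quick arithmetic check:

```python
# Corpus size from the design stated above:
# 6 test words x 2 prosody conditions x 4 speakers x 4 productions.
words, conditions, speakers, repetitions = 6, 2, 4, 4
tokens = words * conditions * speakers * repetitions
print(tokens)  # 192
```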

77 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University

Results

This section presents qualitative analysis examples, followed by quantitative analysis and generalization of results.

Qualitative analysis

Lexical stress context

In figure 1.1, the tonal turning point in the word "amˈno" correlates with the left edge of the second nasal, which implies heterosyllabification of the nasal-nasal cluster.

In figure 1.2, the tonal turning point in the word "avˈɣo" correlates with the left edge of the second consonant, implying heterosyllabification of the fricative-fricative cluster, whereas a vowel insertion between the consonants is evident (i.e. v), which also implies heterosyllabification.

In figure 1.3, the tonal turning point in the word "azˈvo" correlates with the left edge of the second consonant, implying heterosyllabification of the sibilant-fricative cluster whereas, in contrast to the fricative-fricative cluster in 1.2, no vowel insertion between the consonants is apparent.

In figure 1.4, the tonal turning point in the word "aɣˈno" correlates with the middle of the vowel on the right of the consonant cluster rather than any of the consonants, whereas a vowel insertion between the consonants is also apparent. Tonal displacement and vowel insertion also imply heterosyllabification of the fricative-nasal cluster.

In figure 1.5, the tonal turning point in the word "avˈlo" correlates with the middle of the vowel rather than any of the cluster consonants, much like the fricative-nasal cluster in 1.4, implying heterosyllabification of the fricative-liquid cluster, whereas a vowel insertion between the consonants is also apparent.

In figure 1.6, the tonal turning point in the word "alˈɣo" correlates with the left edge of the second consonant, implying heterosyllabification of the liquid-fricative cluster, whereas a vowel insertion between the consonants is evident.

Figure 1. A female speaker's examples of tonal representations as a function of syllable structure variability (cont. next page).
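The turning-point-to-segment association described above — locate the F0 minimum (the L turning point) and map it onto a labelled segment interval — can be sketched in a few lines. This is a toy illustration with invented contour values and boundaries, not the authors' ProsodyPro measurements:

```python
from bisect import bisect_right

def find_turning_point(times, f0):
    """Return the time of the F0 minimum, a simple proxy for the L turning point."""
    i = min(range(len(f0)), key=lambda k: f0[k])
    return times[i]

def segment_at(boundaries, labels, t):
    """Map a time point onto a labelled segment; boundaries = segment onsets plus final offset."""
    i = bisect_right(boundaries, t) - 1
    return labels[max(0, min(i, len(labels) - 1))]

# Invented contour for [amˈno]: F0 falls through [a m] and rises from the second nasal.
times = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
f0    = [180,  170,  160,  150,  145,  165,  190]
boundaries = [0.00, 0.10, 0.18, 0.30]  # onsets of [a], [m], [n], plus final offset
labels     = ["a", "m", "n"]

tp = find_turning_point(times, f0)
print(segment_at(boundaries, labels, tp))  # -> n: the turning point falls within the second nasal
```

With these made-up values the minimum lands inside the onset [n], mirroring the association reported for figure 1.1.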


Focus context

In figure 1.7, the tonal turning point in the word "amˈno" in focus correlates with the left edge of the second nasal of the cluster. In figure 1.8, the tonal turning point in the word "avˈɣo" in focus correlates with the middle of the second fricative of the cluster. In figure 1.9, the tonal turning point in the word "azˈvo" in focus correlates with the left edge of the second consonant of the cluster. In figure 1.10, apart from microtonal perturbations, hardly any turning point in the word "aɣˈno" in focus is apparent. In figure 1.11, the tonal turning point in the word "avˈlo" in focus correlates with the left edge of the second consonant of the cluster. In figure 1.12, the tonal turning point in the word "alˈɣo" in focus correlates with the second consonant of the cluster, whereas a vowel insertion is also evident.

Summary of qualitative analysis

The qualitative analysis above indicates the following regularities. First, a tonal turning point takes place between the examined intervocalic cluster consonants, which indicates heterosyllabification of the respective consonants.

Second, a vowel insertion between consonants is favored in several cluster contexts (e.g. "avˈɣo", "aɣˈno"), thus reinforcing the respective heterosyllabifications, but disfavored in others (e.g. "amˈno", "azˈvo").

Third, the turning point correlates with the onset of the syllable constituent, except for the words "aɣˈno" and "avˈlo" in out-of-focus context. Specifically, the latter words appear with both vowel insertions and rightward tonal rise displacement to about the middle of the nucleus vowel, whereas no such displacement takes place in focus context.

Thus, in general, it seems that tautosyllabification of intervocalic consonants is disfavored in Greek. On the other hand, focus production is an optimal context for tonal turning point and syllabification correlations.

Figure 1 (cont.). A female speaker's examples of tonal representations as a function of syllable structure variability (see text).


Quantitative analysis

Table 2 shows vowel insertion as a function of different consonant clusters in stress, i.e. out of focus, and focus contexts. In general, some clusters disfavor vowel insertion, i.e. nasal-nasal (NN), sibilant-fricative (SF), and fricative-liquid (FL), and some others favor it, i.e. fricative-fricative (FF), fricative-nasal (FN), and liquid-fricative (LF).

Figure 2 shows quantitative results for four key words (additional results will be presented at the conference). In general accordance with the qualitative analysis (see figure 1), the tonal turning point in all four figures correlates with the second consonant of the clusters, which indicates heterosyllabification of all clusters, whether a vowel is inserted or not. Thus, the tonal turning point correlates as a rule with the onset consonant, but hardly with its left edge as earlier studies have shown.

Table 2. Vowel insertion (no/yes) as a function of consonant cluster (cluster) and prosody context variability (stress/focus).

            Stress      Focus       Total
  Cluster   No   Yes    No   Yes    No    Yes
  [NN]      16   0      16   0      32    0
  [FF]      3    13     0    16     3     29
  [SF]      16   0      12   4      28    4
  [FN]      0    16     0    16     0     32
  [FL]      15   1      13   3      28    4
  [LF]      4    12     7    9      11    21
  Total     54   42     48   48     102   90
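Per-cluster insertion rates can be derived mechanically from Table 2's own counts; a minimal sketch (the paper's inferential statistics were run in SPSS, not reproduced here):

```python
# (no, yes) counts of vowel insertion from Table 2, per cluster and prosody context.
counts = {
    "NN": {"stress": (16, 0),  "focus": (16, 0)},
    "FF": {"stress": (3, 13),  "focus": (0, 16)},
    "SF": {"stress": (16, 0),  "focus": (12, 4)},
    "FN": {"stress": (0, 16),  "focus": (0, 16)},
    "FL": {"stress": (15, 1),  "focus": (13, 3)},
    "LF": {"stress": (4, 12),  "focus": (7, 9)},
}

def insertion_rate(cluster):
    """Proportion of tokens with vowel insertion, pooled over both contexts."""
    no = sum(counts[cluster][ctx][0] for ctx in ("stress", "focus"))
    yes = sum(counts[cluster][ctx][1] for ctx in ("stress", "focus"))
    return yes / (no + yes)

for c in counts:
    print(c, insertion_rate(c))
# NN and FN are categorical (0.0 and 1.0); FF strongly favors insertion,
# SF and FL strongly disfavor it, and LF is mixed.
```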

A Kruskal-Wallis nonparametric test showed significant differences among consonant clusters in the "stress" context (H(5) = 66.5, p < 0.0001), with a mean rank of 68.0 for NN, 30.22 for FF, 68 for SF, 21.5 for FN, 65 for FL and 33.1 for LF. Pairwise comparisons showed significant differences in 9 pairs out of 15, i.e. FN/FL, FN/NN, FN/SF, FF/FL, FF/NN, FF/SF, LF/SF (p < 0.0001) and LF/NN, LF/FL (p < 0.005).

There were also significant differences among consonant clusters in the "focus" context (H(5) = 60.8, p < 0.0001), with a mean rank of 74 for NN, 24.5 for FF, 61.6 for SF, 24.5 for FN, 64.7 for FL and 46.1 for LF. Pairwise comparisons showed that 7 out of 15 pairs differed significantly, i.e. FN/NN, FN/SF, FN/FL, FF/NN, FF/SF, FF/FL (p < 0.0001) and LF/NN (p < 0.05).

A Mann-Whitney nonparametric test, on the other hand, did not show any significant difference between the "stress" and "focus" contexts except for the SF cluster (U = 96.0, p < 0.05, two-tailed), indicating that focus production hardly has any effect on vowel insertion.

Figure 2. Average tonal contours of five female speakers and four repetitions of the key words in focus (bold letters) as a function of syllable structure variability (dark lines indicate no vowel insertion and light lines indicate vowel insertion).
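The reported U statistic for the SF cluster can be reproduced from Table 2's counts with a hand-rolled Mann-Whitney computation. The binary 0/1 coding of no/yes insertion is an assumption of this sketch, and only the statistic is computed (no p-value):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U (the smaller of U1 and U2); ties between groups count as half."""
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return min(u1, len(x) * len(y) - u1)

# SF cluster from Table 2: stress context 16 no / 0 yes; focus context 12 no / 4 yes.
stress = [0] * 16
focus = [0] * 12 + [1] * 4
print(mann_whitney_u(stress, focus))  # -> 96.0, matching the reported U = 96.0
```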


Discussion

The basic hypothesis of the present study is that tonal turning points in Greek indicate syllable boundaries. Early research on Greek prosody (e.g. Botinis 1989) showed that tonal rises in lexical stress and sentence focus contexts initiate at the very beginning of the syllable, i.e. the left edge of the onset consonant. However, we assume that this is an optimal context with reference to tonal turning point and syllabification correlates, i.e. a CV phonotactic syllable context, with minimum interaction from the immediate prosody context. The results of this study corroborate in principle earlier results with reference to the correlation of tonal turning points and onset syllable consonants. Considerable variability is nevertheless evident within the domain of the onset consonant, even tonal turning point displacements to the right as a function of syllable context variability and especially vowel insertions.

Our results do not support either MOP or SSP predictions. The fricative-fricative as well as sibilant-fricative clusters in [avˈɣo] and [azˈvo], respectively, are as a rule heterosyllabified, despite the tautosyllabification on the right that MOP predicts. However, a vowel is as a rule inserted in the fricative-fricative cluster but not in the sibilant-fricative one. Vowel insertion as a function of consonant cluster variability may depend on word-edge coda legality, as [s] is a legal coda but not [v]. Likewise, the fricative-nasal cluster in the word [aɣˈno] is as a rule heterosyllabified, as evidenced also by vowel insertion, despite the right tautosyllabification that both MOP and SSP predict. Thus, it seems that Greek speakers disprefer complex consonant onsets, which results in heterosyllabification of intervocalic consonant clusters as well as vowel insertions between consonants. Thus, syllabification in Greek results in a variety of syllable consonant codas, which do not whatsoever constitute legal word-edge coda phonotactics.

In accordance with a study on Swedish, similar to the present one, the results indicate that tonal rises in sentence focus contexts initiate at the left edge of onset syllable consonants and thus correlate with syllable boundaries (Botinis, Ambrazaitis & Frid, this volume). Furthermore, much like in Greek, several consonant clusters favor vowel insertion whereas others disfavor it. Vowel insertion in Swedish is most unexpected as it concerns a fairly closed syllable structure language, and any syllabification of intervocalic consonants does not in principle violate coda legality phonotactics. Thus, both Greek and Swedish seem to disfavor complex onsets and codas and consequently show fairly similar tendencies of vowel insertion between consonants. On the other hand, languages with different prosodic systems and syllable structures in particular, such as Greek and Swedish, may use similar strategies to mark syllable boundaries.

In addition to syllabification as a function of tonal turning points, the results of the present study (especially in combination with the results on Swedish, Botinis et al., this volume) have several major implications. Tonal turning points are defined as a result of tonal targets, which are associated with the segmental string (Bruce 1977). Autosegmental-metrical theory (AM theory) and especially Pierrehumbert and collaborators (e.g. Pierrehumbert 1980, Beckman & Pierrehumbert 1986), adopting in principle Bruce's analysis of Swedish, suggest several "pitch accents" for the description of different languages, such as L*+H, H*+L, etc. Thus, in lexical stress context in Greek, e.g., the stressed syllable (*) is assumed to associate with a L*+H pitch accent in accordance with AM theory premises. In practice, this means that the L tonal target may vary across the entire domain of the stressed syllable whereas the H tonal target may just be on the right with hardly any further specification, i.e. a "trailing tone".


A similar shortcoming of AM theory is the lexical accent II representation in Swedish. Assuming a H*+L pitch accent, the H tonal target is within the domain of the stressed syllable, i.e. a starred tone, whereas the L tonal target is unspecified. In principle, the L tonal target may be anywhere on the right of the H tone, even outside the stressed syllable itself. We have however provided evidence that both H and L targets of the HL tonal fall of accent II in Swedish are confined within the stressed syllable and in particular within the nucleus vowel (Botinis et al., this volume). Thus, in accordance with AM premises, any H*+L or H+L*, or even H*+L* or [HL]* sequence is as good as any other. On the other hand, the starred tone, i.e. the H* tone, is assumed to be somehow stronger and in some way more important than the unstarred one, i.e. the L tone. It seems that, with specific reference to Swedish, no reason can be found why either of the two H and L tonal targets should be starred or why either of the tonal targets might be more important than the other.

In accordance with our approach, our general hypothesis is that tonal targets "seek" specific syllable constituent associations. Thus, in Greek, the L tonal target associates with the onset syllable constituent. This is however an optimal association, as tonal displacements may take place as a result of various context pressures. On the other hand, in onsetless syllable contexts, e.g. /aˈoristos/ (aorist), the L target presumably associates with the nucleus vowel of the stressed syllable, which may be an alternative association. We may thus assume L target association with the onset syllable constituent, otherwise with the nucleus vowel. Another aspect of our approach is the "domain" of tonal target associations. In Swedish the domain of the accent II HL tonal fall is intrasyllabic whereas the domain of the LH tonal rise in Greek is intersyllabic. Thus, a major issue is the temporal window between tonal targets versus the syllable constituent association.

References

Atterer, M. & Ladd, D.R. (2004). On the phonetics and phonology of "segmental anchoring" of F0: evidence from German. Journal of Phonetics 32, 177-197.
Beckman, M.E. & Pierrehumbert, J.B. (1986). Intonational structure in Japanese and English. Phonology 3, 255-309.
Boersma, P. & Weenink, D. (2013). Praat: Doing phonetics by computer (http://www.praat.org).
Botinis, A. (1989). Stress and Prosodic Structure in Greek. Lund: Lund University Press.
Botinis, A., Ambrazaitis, G. & Frid, J. (2014). Syllable structure and tonal representation: revisiting focal Accent II in Swedish. Proc. FONETIK 2014 (this volume), Stockholm, Sweden.
Bruce, G. (1977). Swedish Word Accents in Sentence Perspective. Lund: Gleerup.
Clements, G.N. (1990). The role of the sonority cycle in core syllabification. In Kingston, J. & Beckman, M.E. (eds.), Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech, 283-333. Cambridge: Cambridge University Press.
Kahn, D. (1976). Syllable-based Generalizations in English Phonology. Ph.D. thesis, MIT.
Kohler, K.J. (1966). Is the syllable a phonological universal? Journal of Linguistics 2, 207-208.
Pierrehumbert, J.B. (1980). The Phonology and Phonetics of English Intonation. Ph.D. thesis, MIT.
Steriade, D. (1982). Greek Prosodies and the Nature of Syllabification. Ph.D. thesis, MIT.
Xu, Y. (2013). ProsodyPro. Proc. of Tools and Resources for the Analysis of Speech Prosody, 7-10. Aix-en-Provence, France.


Sound initiation and source types in human imitations of sounds

Pétur Helgason
Department of Linguistics & Philology, Uppsala University, Sweden, and Speech, Music and Hearing, KTH, Sweden
[email protected]

Abstract

There exists a rich body of research exploring the production of speech, but for non-linguistic sound production, for example imitations of environmental sounds or animals, much less data and research are available. Data from human sound imitations collected in the initial, exploratory phase of the SkAT-VG project were analyzed in terms of the articulatory and aerodynamic conditions involved in their production. These exploratory data yielded a classification of sound productions in imitations based on the intersections between sound initiation and sound source types. The source types identified are turbulent, myoelastic, whistled and percussive sources. The ways in which these source types intersect with pulmonic, glottalic and velaric sound initiation, both egressive and ingressive, are described and discussed.

Introduction

In speech, the principal way of producing sound is to drive an airstream past one or more obstacles. The organ responsible for driving the airstream is the initiator (Pike 1943: 85ff), while the source of the sound produced is located at the point of the obstacle(s).

The sound initiation mechanisms commonly acknowledged in speech production are pulmonic egressive, glottalic egressive, glottalic ingressive and velaric ingressive (ibid.; see also Catford, 1977). Although there are no attested cases of pulmonic ingressive and velaric egressive airstreams being utilized as features in phonological systems, there is no real obstacle to producing sounds using these initiation mechanisms. Pulmonic ingressive sounds, in particular, are quite common (cf. Eklund, 2008), and also occur in imitations. Sounds can also be produced without creating an airstream, e.g. by clashing the teeth together or by slapping the tongue against the floor of the mouth. Such sounds are referred to as percussives (Pike, 1943: 103). Percussives are encountered in sound imitations, but they are rarely found in (non-pathological) speech.

The source-filter model of speech production (Fant, 1960) has been successful in describing the acoustics of human speech sound production. In speech the principal sound sources are voicing, produced with a pulmonic egressive airstream entraining the vocal folds into vibration, and friction noise, produced by constricting a pulmonic egressive airstream at some point in the vocal tract, causing turbulence. However, humans can produce sounds with a number of additional source types, some of which are used in spoken languages and some of which are not.

Sound initiation and source types

Here, the focus is on cataloguing source types that seem useful for sound imitation. The approach is to categorize the source types according to the articulatory and aerodynamic conditions under which they are produced. The main categories of source types thus identified are myoelastic, turbulent, whistled and percussive. The three former source types can be produced using various initiation mechanisms, but percussives constitute an initiation mechanism on their own. In the following,

examples of these four basic types of sources will be discussed primarily in terms of the initiation mechanisms involved and their observed or potential uses in sound imitation.

The exploratory data have various sources. Many of the examples on which the analyses are based have been found on-line, but exploratory recordings have also been made, with the aid of a professional improvisational actor.

Turbulent sources

To produce fricative sounds, an airstream is made turbulent by channeling it through a constriction in the glottis or the vocal tract (cf. Stevens 1999: 37f for an overview). In the exploratory phase of the SkAT-VG project we have observed imitations using pulmonic egressive friction (which parallels fricatives in speech) as well as velaric ingressive friction (which parallels clicks or click-like sounds). We have no examples yet where imitators use pulmonic ingressive, glottalic egressive or ingressive, or velaric egressive friction.

Pulmonic egressive turbulence

Friction made with a pulmonic egressive airstream is by far the most commonly occurring turbulent source in imitations, just as it is in speech. As is the case with speech sounds, a turbulent friction noise can be made at many places in the vocal tract. This type of friction is especially common in the imitation of "basic" sound events, such as the interaction of solids (e.g. knocking, scraping and squeaking sounds) and sounds of gases in motion (e.g. blowing, puffing and hissing sounds) (cf. Lemaitre et al. 2011 for further examples of sound events). For example, the impression given by an improvisational actor of the sound of "scraping on a hard surface" is quite speech-like and can be described as a voiceless velar fricative [x].

Pulmonic ingressive turbulence

While pulmonic ingressive friction is not difficult to produce, it is difficult (or impossible) to produce sibilant fricatives with an ingressive airstream (Catford, 1988: 20ff; see also Eklund 2008 for a more comprehensive review). In other cases, although appreciably different, the acoustic result of ingressive friction is still quite similar acoustically to the egressive counterpart. These facts may contribute to its apparent scarcity in imitations. However, one should note that ingressive friction is encountered in emotive sounds, e.g. sucking in air through one's teeth to indicate pain (Cruttenden 1986: 180).

Glottalic egressive turbulence

Glottalic egressive friction is fairly common in languages, but is as yet unattested in our exploratory data of imitations. Possibly, the acoustically similar outcomes of glottalic and pulmonic egressive friction are a contributing factor – why use a glottalic airstream when a pulmonic airstream creates, more or less, the same sound?

Glottalic ingressive turbulence

According to UPSID (Maddieson and Precoda, 1990), voiceless glottalic ingressive speech sounds (i.e., voiceless implosives) are phonologically distinctive in less than 1% of the world's languages. Judging by this typological rarity one could assume that such sounds are fairly difficult to produce. The exploratory data have not yet yielded imitations that make use of a glottalic ingressive airstream as such. However, note Pike's (1943: 40) observation that English speakers sometimes use a voiceless velar implosive [ƙ] to imitate the "glug-glug" sound of pouring liquid from a bottle (the voiced counterpart can also be used). Thus, despite the typological rarity of such sounds, they still seem to be used in imitations.

Velaric egressive turbulence

A velaric egressive source has not been encountered in the exploratory data, but one can conceive of such sounds being used to imitate sputtering in liquids.
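The attestation pattern for turbulent sources laid out above can be tabulated as a simple initiation-by-direction map. This is bookkeeping only; the flags merely restate the prose:

```python
# Turbulent sources: (initiator, direction) -> attested in the SkAT-VG exploratory data?
turbulent_attested = {
    ("pulmonic", "egressive"): True,    # parallels fricatives; most common
    ("pulmonic", "ingressive"): False,  # sibilants hard or impossible ingressively
    ("glottalic", "egressive"): False,  # common in languages, unattested in imitations
    ("glottalic", "ingressive"): False, # voiceless implosives; cf. Pike's "glug-glug"
    ("velaric", "egressive"): False,    # conceivable for sputtering imitations
    ("velaric", "ingressive"): True,    # clicks and click-like sounds
}
attested = sorted(k for k, v in turbulent_attested.items() if v)
print(attested)  # [('pulmonic', 'egressive'), ('velaric', 'ingressive')]
```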


Squeezing a velaric airstream out between the teeth, for example, may faithfully replicate the sound of a spraying can (although, obviously, this depends on denture). An ingressive airstream leads to an acoustically similar result.

Velaric ingressive turbulence

Velaric ingressive turbulence is used to produce click sounds, which are typologically rare. Still, paralinguistic click sounds are encountered quite frequently in speech (cf., e.g., Jakobson, 1979: 40). In English, for example, the dental click even has a more or less standardized orthography, variably written as tut-tut or tsk-tsk.

In the SkAT-VG exploratory data set, the impression of "trickling water" made by an improvisational actor contains an example of velaric ingressive initiation (see Figure 1). To achieve this effect, the actor alternated soft post-alveolar or alveolar click sounds with sublaminal percussives (discussed below, in the section on percussives), with frequent and rapid labial modifications of the resonance characteristics.

Figure 1. A spectrogram of an actor's impression of the sound of "trickling water".

Myoelastic sources

In the myoelastic source type, muscle and elastic tissue are made to oscillate in an air stream. This can lead to (almost) periodic sounds or intermittent breaks in an otherwise turbulent airstream. Crucially, for some myoelastic source types the oscillation is frequent enough to be perceived as a tone.

Pulmonic egressive myoelastic sources

The most commonly encountered myoelastic source by far, both in speech and sound imitations, is pulmonic egressive vocal fold phonation, i.e., voicing. As a sound source in speech and singing, the vocal folds are highly versatile, allowing a great deal of precision in the control of onset and offset, timbre and oscillation frequency.

In linguistic phonetics, a distinction is made between several vocal fold phonation types. Modal voice, breathy voice and creaky voice are the principal types (stiff voice and slack voice are also recognized but are not considered here, nor is the difference between breathy voice and whispery voice; see Ladefoged & Maddieson (1996) for an overview of the linguistic uses of voice). Non-linguistic voicing types include falsetto and pressed voice.

These various voice qualities are relevant for sound imitations, perhaps most notably in the imitation of animal sounds and engine sounds. The imitation of a cow, for example, usually involves a modal voice quality with a nasal resonance. The croak of a frog may be imitated with a creaky voice quality (an ingressive creak works even better). Falsetto voice is frequently encountered in animal imitations, e.g. when imitating a cat meowing.

A much less common myoelastic source type is aryepiglottic phonation, in which the aryepiglottic folds vibrate in an air stream at frequencies ranging from approximately 40 to 100 Hz (Moisik, Esling & Crevier-Buchman, 2010). In the exploratory data we have observed impressions of animal growling in which aryepiglottic phonation is used, but usually it is used in combination with voicing. Similarly, there are examples of imitations of rumbling engines, which combine aryepiglottic vibration and voicing.

At least four types of supralaryngeal pulmonic egressive myoelastic sources

can be created. First, some people can achieve a uvular myoelastic oscillation, equivalent to uttering a voiceless uvular trill, [ʀ̥]. Second, some people can achieve an apico-alveolar oscillation, equivalent to producing a voiceless apico-alveolar trill, [r̥]. For these two source types, the rate of oscillation can exceed 30 Hz, but they are still not perceived as tones but rather as a rapid series of impacts. There are no examples of these two source types being used on their own for imitations in the exploratory data, but there are examples of the apico-alveolar source combined with whistling in bird imitations.

A third supralaryngeal source type uses a dorso-lateral configuration of the tongue and pushes out air between the tongue dorsum and a stricture that appears to be located at or anterior to the palatoglossal arch. The sound produced is periodic with an f0 range from approximately 150 to 700 Hz, judging from the examples gathered so far. The most well-known use of the dorso-lateral source type is the voice of Donald Duck, the famous cartoon character. The exploratory data set contains numerous examples of the use of this source type in the imitation of birds.

The fourth supralaryngeal source type is made with a bilabial constriction. The constriction can be made with two distinct lip configurations, which yield quite different results. First, the lips can be pressed together without much stiffness in the labial tissue while an airstream is passed through. This leads to a fairly slow periodic myoelastic vibration (25-35 Hz) that is not perceived as a tone. The exploratory data set contains an example of such a voiceless bilabial trill being used to imitate the blowing sound of a horse. The second lip configuration involves pressing the lips together quite tightly and making them much stiffer while forcing an airstream between them. This can lead to a (multiply) periodic source, which, in the exploratory data set, is found in the imitation of an elephant trumpeting.

Pulmonic ingressive myoelastic sources

When vocal fold phonation is made with a pulmonic ingressive airstream, the result is ingressive voicing. Acoustically, ingressive voicing is quite distinct from egressive voicing, sounding harsher and less sonorant (cf. Eklund, 2008). Like egressive voicing, ingressive voicing can be made both as ingressive falsetto and ingressive creak.

In imitations, an ingressive falsetto is quite common. It is used to imitate various animal sounds, such as a dog bark, a pig squeal and a crow caw, but it can also be used to imitate squeaking sounds, such as the squeaking sound of wiping a window pane.

Figure 2. A spectrogram of an actor's impression of a "squeak from a window pane".

One example in the exploratory data set, shown in Figure 2, does contain both ingressive falsetto and ingressive creak. This is the impression made by an improvisational actor of the sound of a "squeak from a window pane".

Glottalic and velaric myoelastic sources

The SkAT-VG exploratory data set contains no imitations that make use of glottalic and velaric airstreams coupled with a myoelastic source. Using glottalic and velaric airstreams, there is a very limited volume of air available to drive a myoelastic oscillation. Some configurations do yield a myoelastic effect; for example, a glottalic egressive airstream can be coupled with an apico-alveolar

source to produce the equivalent of an ejective trill, [r̥ʼ]. However, the fact that these types of sources cannot be sustained for very long, if they can be achieved at all, reduces their usefulness in imitations, which may explain their absence in the exploratory data set.

Whistled sources

Very few languages are reported to have distinctive whistled coronal sibilants (Shosted, 2006). According to Shosted (ibid.: 566), whistled sibilants are produced in a manner similar to "a form of recreational whistling referred to as 'palatal' or 'roof' whistling", which is achieved by letting the tongue tip form a constriction that directs the airflow to the edges of the teeth. Pure "palatal" whistling is seldom encountered except in the repertoire of whistling virtuosi, such as the Hungarian Hacki Tamás or the Australian Luke Janssen. Still, the exploratory data set does include an example of this type of whistling being used to imitate the American Robin (Turdus migratorius).

In languages that do not have distinctive sibilant whistling, whistling can still occur sporadically when apical sibilants are produced, and sibilants with a whistled component, similar to those found in speech, are observed when people imitate wind or weather noise.

Labial whistling does not occur in speech, but the majority of people appear to be able to produce some form of labial whistle, and this type of whistling is encountered frequently in daily life. Typically, labial whistling is pulmonic egressive, but it can almost as easily be produced ingressively. The exploratory data set contains examples of whistling being used to imitate birds, only in the form of palatal whistling and "digitally assisted" whistling (i.e. finger whistling), possibly because these generate higher oscillation frequencies.

Also, short labial whistling noises can be produced using both glottalic and velaric initiation, again both egressively and ingressively. The exploratory data contain several examples where imitators produce a short whistle with a velaric egressive airstream to imitate the impact sound of a drop of water.

Percussive initiation

Percussive initiation does not require an airstream but results instead from an impact between solids, for example when the upper and lower teeth are made to clash or scrape together (Catford 1977: 63).

Percussives occur very rarely in (non-pathological) speech and are not phonologically distinctive in any language. Sands, Maddieson & Ladefoged (1993: 183) observe that, very rarely, an allophonic variant of an alveolar click is a percussive in which "the normal click is quite quiet but the tongue tip makes a forceful contact with the bottom of the mouth after the release of the front click closure". Incidentally, they also mention that this is a "sound sometimes made by speakers of non-click languages trying to imitate the sound made by the shoes of a trotting horse" (ibid.). As we saw in connection with Figure 1, the SkAT-VG exploratory data contain an example of such a "floored", sublaminal percussive, used as part of an impression given by an improvisational actor of "trickling water".

Figure 3. A spectrogram of an actor's impression of the sound of a "whip lash".

The data set also contains an example of a lamino-dental percussive, in which the tongue is shot forward at a

87 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University high velocity creating an impact sound Catford, J. C. (1977). Fundamental as the lamina makes contact with the problems in phonetics. Edinburgh: teeth and the alveolar ridge. This oc- Edinburgh University Press. curred in an improvisational actor’s Catford, J. C. (1988). A practical intro- impression of the sound of a “whip duction to phonetics. Oxford: lash”, shown in the spectrogram in Fig- Clarendon Press. ure 3. In speech, oral stop sounds are Eklund, R. (2008). “Pulmonic ingres- made at the offset of an occlusion by sive phonation: diachronic and syn- releasing a turbulent airstream through chronic characteristics, distribution a narrow channel, giving rise to a high and function in animal and human energy release burst. By contrast, in the sound production and in human example in Figure 3, the “burst” at 0.17 speech.” Journal of the International ms in the spectrogram is created at the Phonetic Association 38: 235-324. onset of the occlusion and is in fact the Fant G. (1960). The Acoustic Theory of sound of the impact of the tongue lami- Speech Production. The Hague: na against the teeth. Moulton. Jakobson, R. and Waugh, L. (1979). Conclusion The Sound Shape of Language. The observations made during the ex- Bloomington, Ind: Indiana Univer- ploratory phase of the SkAT-VG pro- sity Press; and London: Harvester. ject have shown that in sound imitations Lemaitre, G., Dessein, A., Susini, P. & humans can utilize a far wider range of Aura, K. (2011). “Vocal Imitations articulations than are used to make and the Identification of Sound phonological distinctions in languages. Events”. Ecological Psychology, Also, imitators can utilize sound initia- 23:4, 267-307. tion mechanisms and source types that Maddieson, I. & Precoda K. (1990) Up- are not part of the repertoire of their dating UPSID. UCLA Working Pa- native language(s) and in many cases pers in Phonetics 74, pp. 104-111. 
they utilize mechanisms that are typo- Moisik, S. R., Esling, J. H., & Crevier- logically rare (and considered “diffi- Buchman, L. (2010). A high-speed cult”). laryngoscopic investigation of ary- A classification of sound produc- epiglottic trilling. Journal of the tions is proposed that is based on three Acoustical Society of America 127 basic source types, turbulent, myoelas- (3). 1548-1559. tic and whistled, intersecting with six Pike, Kenneth L. (1943). Phonetics: a basic sound initiation mechanisms, critical analysis of phonetic theory pulmonic, glottalic and velaric initia- and a technic for the practical tion, both egressive and ingressive. In description of sounds. Ann Arbor, addition, percussive sounds form a class MI: University of Michigan Press. of their own, being both an initiation Sands, B., Maddieson, I. and Ladefoged, mechanism and a source type. P. (1993). “The Phonetic Structures of Hadza”. UCLA Working Papers Acknowledgements in Phonetics, No. 84: Fieldwork I wish to thank Prof. Sten Ternström for Studies in Targeted Languages. helpful advice and comments. This re- Shosted, Ryan. 2006. “Just put your lips search has been supported by EU-FET together and blow? Whistled frica- grant SkAT-VG 618067. tives in Southern Bantu.” In H. C. Yehia, D. Demolin & R. Laboissiere References (Eds.), Proceedings of ISSP 2006, (pp. Cruttenden, A. (1986). Intonation. 565–572). Belo Horizonte: CEFALA. Cambridge: Cambridge University Stevens, K. N. (1999). Acoustic Phone- Press. tics. Cambridge: MIT Press.
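The proposed cross-classification can be illustrated with a short sketch. This is a hypothetical enumeration written for this text, not part of any SkAT-VG tooling; it simply crosses the three source types with the six initiation mechanisms and adds the percussive class:

```python
from itertools import product

# Three basic source types proposed in the conclusion.
sources = ["turbulent", "myoelastic", "whistled"]

# Six initiation mechanisms: three airstream types,
# each either egressive or ingressive.
initiations = [f"{airstream} {direction}"
               for airstream, direction in product(
                   ["pulmonic", "glottalic", "velaric"],
                   ["egressive", "ingressive"])]

# Cross the initiation mechanisms with the source types ...
categories = [f"{init} {src}" for init, src in product(initiations, sources)]

# ... and add percussives, which are both an initiation
# mechanism and a source type, as a class of their own.
categories.append("percussive")

print(len(categories))  # 3 sources x 6 initiations + 1 = 19
```

The grid makes explicit that, e.g., "glottalic egressive myoelastic" (the ejective trill discussed above) is one cell of the classification, even though it is absent from the exploratory data set.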


Human perception of intonation in domestic cat meows Susanne Schötz, Joost van de Weijer Lund University Humanities Lab, Centre for Languages & Literature, Sweden [email protected], [email protected]

Abstract

This study examined human listeners' ability to classify cat vocalisations (meows) recorded in two different contexts: during feeding time (food related meows) and while waiting at a vet clinic (vet related meows). A pitch analysis showed that food related meows tended to have rising f0 contours, while vet related meows often had more falling f0 patterns. 30 listeners judged 6 meows of each context. Classification accuracy was significantly above chance, and listeners with cat experience performed significantly better than inexperienced listeners. The food related meows with the highest classification accuracy showed clear rising f0 contours, while clear falling f0 contours characterised the vet related meows with the highest classification accuracy. Our results suggest that cats use different intonation patterns in their vocal interaction with humans, and that humans are able to identify these vocalisations.

Introduction

There is much anecdotal evidence of pets – especially cats and dogs – imitating speech when interacting with humans. This is probably a learned skill used to elicit certain responses or rewards, e.g. food, from their human caretakers. Because of the position of their larynx, nonhuman mammals are able to articulate only a limited number of the vowel and consonant sounds of human language (see e.g. Fitch, 2000). However, many animals can produce extensive vocal variation in duration, f0 and intensity (SPL), and should be able to adopt human-like prosodic patterns. Gussenhoven (2002) and Ohala (1984) describe pitch features related to biological codes, which are used in animal communication, e.g. the frequency code, where low f0 and resonances signal large size and dominance.

Phonetic studies of pet vocalisations are fairly scarce, and very little is known about the prosodic aspects of pet vocalisations in pet–human communication. To what extent do pets adopt and use human-like intonation in their vocal communication with humans? How are the prosodic patterns of pet vocalisations perceived by human listeners? This study is an attempt to shed some light on these issues by examining human perception of different intonational patterns in cat vocalisations.

Cat vocalisations and the meow

The cat (Felis catus, Linnaeus 1758) was domesticated 10,000 years ago, and is one of the most popular pets in the world, with some 600 million individuals (Turner & Bateson, 2000; Driscoll et al., 2009). Cats are social animals (Crowell-Davis et al., 2004), and their interaction with humans has, over a long time of living together, resulted in cross-species communication that includes visual as well as vocal signals. There are several descriptions of the communicative social behaviour of the domestic cat (e.g. Turner & Bateson, 2000; Bradshaw, 2013), but those concerning vocalisations are scarce and often fragmented. It is still unclear how cats combine different sounds, and how they vary intonation, duration and intensity to convey or modulate a vocal message.

Cat vocalisations are generally divided into three major categories: (1) sounds produced with the mouth closed (murmurs), such as the purr, the trill and the chirrup; (2) sounds produced with the mouth open(ing) and gradually closing, comprising a large variety of meows with similar [ɑ:ou] vowel patterns; and (3) sounds produced with the mouth held tensely open in the same position, i.e. sounds often uttered in aggressive situations, including growls, snarls, hisses, and shrieks (Moelk, 1944; Crowell-Davis et al., 2004).

In cat–human communication, the most common vocalisation is said to be the meow or miaow (Nicastro & Owren, 2003). Nicastro (2004) defines the meow as a quasi-periodic sound with at least one formant and diphthong-like formant transitions. The duration ranges from a fraction of a second to several seconds, and the f0 contour is generally arch-shaped, with the peak marking the maximum mouth opening of the opening-closing gesture. Meows can include atonal features and may be garnished with an initial or final trill or growl. McKinley (1982) divided the meow type vocalisation into four sub-patterns based on the pitch and vowels included in the sound: the mew, a high-pitched call with [i], [ɪ] or [e] quality; the squeak, a raspy nasal high-pitched mew-like call; the moan, an [o]- or [u]-like opening-closing sound; and the meow, a combination of vowels resulting in a characteristic [iau] sequence.

Cats learn to produce different meows for different purposes, e.g. to solicit feeding or to gain access to desired locations and other resources provided by humans. Each meow is believed to be "an arbitrary, learned, attention-seeking sound rather than some universal cat–human 'language'" (Bradshaw, 2013). If each cat and owner develop their own arbitrary vocal communication codes, other humans would be less able to identify meows uttered by unfamiliar cats. However, if cat vocalisations contain some kind of functional referentiality (cf. Nicastro & Owren, 2003; Macedonia & Evans, 1993), i.e. if each vocalisation strongly correlates with a certain referent and perceiver responses correlate with the vocalisation, then experienced humans should be able to classify meows produced by unfamiliar cats fairly well.

Nicastro & Owren (2003) asked naïve and experienced listeners to judge meow calls from twelve cats recorded in five different behavioural contexts (food-related, agonistic, affiliative, obstacle, and distress). Classification accuracy was modestly (but significantly) above chance, and it was suggested that meows are unspecific, negatively toned sounds that attract human attention, but that we can learn to appreciate meows as we become more experienced.

Schötz (2012, 2013) analysed duration and f0 in 795 cat vocalisations and found that within each vocalisation type (including the meow) durations were fairly similar, but the overall f0 variability was high, partly due to the large number of different intonation patterns.

Purpose, aims and hypotheses

The purpose of this study was to investigate human listeners' perception of domestic cat meows with different intonation patterns. By asking listeners to classify a number of meows as belonging to one of two contexts, food related or vet related, our aim was to find out which intonation patterns are more often associated with food related vocalisations and which are more vet related. Further goals were to learn more about human perception of prosody in cat vocalisations and to increase our understanding of cat–human communication.

Based on our own previous experience of these types of meows, as well as on pitch patterns used in human speech and related to the frequency code, we expected the meows of both contexts to be of similar duration and mean f0, but we expected a higher number of rising pitch patterns in the food related meows than in the vet related meows. We also hypothesised that experienced human listeners would judge the meows correctly more often than inexperienced listeners and also be more confident in their responses. Moreover, we hypothesised that meows with rising intonation patterns would more often be judged as food related meows than as vet related meows.


Material and method

Three domestic cats, Donna, Rocky and Turbo (D, R and T; 1 female, 2 males, 3-year-old siblings), were recorded in two different contexts: 1) in a familiar environment, in their kitchen, while waiting to be fed, and 2) in an unfamiliar environment, in the waiting room (or in a car outside) of a veterinary clinic. We used a Sony digital HD video camera HDR-CX730 with an external shotgun microphone Sony ECM-CG50. Audio files (wav, 44.1 kHz, 16 bit, mono) were extracted with Extract Movie Soundtrack, and the meows were extracted and normalised for amplitude in Praat (Boersma & Weenink, 2013). Six meows from each context produced by two of the cats (D and T) were selected as material, based on the overall recording quality and on judgements by the owner (one of the authors) of how representative the vocalisations were for each context. As one cat (R) was quiet during the recordings made in the vet context, no meows from this cat were used. An auditive analysis of the material by one of the authors revealed that the food related meows tended to have rising tonal patterns, while vet related meows had slightly arched or falling intonation. In addition, we noticed some background noise and one instance of background human speech, but this was judged not to influence the perception task.

Measures of duration and f0 were obtained with a Praat script and manually checked. One meow was significantly shorter than the other vocalisations, but we decided to keep it in order to get a first impression of how stimulus duration would influence the perception results. The other stimuli ranged between 0.58 and 1.13 seconds in duration. All stimuli contained vowels belonging to the meow type, as described by McKinley (1982), and were judged as clearly distinguishable from other common cat vocalisation types, including the purr (cf. Schötz & Eklund, 2011), the murmur (cf. Schötz, 2012) and the chirp (cf. Schötz, 2013). The longer meows were often garnished by short initial trills. Table 1 shows the duration and the mean, minimum and maximum f0 values for the twelve meow stimuli. Figure 1 displays f0 contours of the meows of the two contexts.

Table 1. Duration (sec.) and f0 (Hz) values for the 12 meows in two contexts (Food, Vet) by two cats (D, T).

meow    duration  mean f0  min f0  max f0
FoodD1  0.78      739      528     939
FoodD2  0.91      888      541     1003
FoodD3  0.27      797      782     816
FoodT1  1.06      532      418     582
FoodT2  0.85      539      423     653
FoodT3  1.03      567      433     640
VetD1   1.10      790      715     887
VetD2   0.80      838      764     924
VetD3   0.58      915      885     947
VetT1   1.13      510      451     589
VetT2   0.87      697      639     737
VetT3   1.02      540      487     570

Figure 1. Time normalised f0 contours of the food and vet related meows. The black contours show the two stimuli that received the highest proportion of correct classifications in each context in the perception test.

Procedure

The experiment was designed as a multiple forced choice identification test using the ExperimentMFC function in Praat. A group of 15 men and 15 women volunteered as participants. Their average age was 44 years (range 23 to 69 years). Of the participants, 21 reported being familiar with cats, that is, they either owned a cat at the time of testing, or they had owned a cat prior to the experiment. The time that these participants had owned a cat varied from less than one year to a maximum of 55 years (median 2.5 years). Oral and written instructions were given before the experiment, in which the task was to classify each meow as belonging to either the food context or the vet context by clicking on the appropriate box on a computer screen. The experiment ran on a MacBook Pro computer in a quiet room. Each of the twelve meow recordings was presented three times in a randomised order through HUMP NF22A speakers or AKG K270 studio headphones at a comfortable sound level. A replay option allowed the participants to listen to each stimulus up to three times. After the test, the participants were asked to make a single judgement of the degree of certainty of their responses on a 5-point scale. Each session lasted about 3-4 minutes.

Results

Of all 1080 responses in the experiment, 529 were food related and 551 veterinary related. In total, there were 699 correct responses (65%). The participants who reported familiarity with cats were more often correct (70%) than the participants who did not (54%).

Table 2 displays the proportions correct as well as the average reaction time for every meow stimulus. As shown in the table, there was one meow (FoodD3) that was classified incorrectly considerably more often than the other meows. This meow was exceptionally short compared to the other stimuli (cf. Table 1), and presumably contained too little information for the participants to make good judgements.

Table 2. Proportions of correct responses and average response time (RT) for the 12 meow stimuli in the two contexts (Food, Vet) by two cats (D, T).

meow    correct  RT (ms)
FoodD1  0.83     2342
FoodD2  0.80     2419
FoodD3  0.37     2635
FoodT1  0.54     2944
FoodT2  0.66     2673
FoodT3  0.62     2706
VetD1   0.63     3012
VetD2   0.57     2904
VetD3   0.68     2544
VetT1   0.71     2658
VetT2   0.71     3127
VetT3   0.64     3044

The f0 contours of the two stimuli of each context category that received the highest proportion of correct classifications are the ones drawn in black in Figure 1. For the food related meows, these contours show clear rising intonation patterns, while the two vet related meows that received the highest number of correct classifications display more falling contours.

We performed a multilevel logistic regression (with random stimulus and subject intercepts) on the results in two steps. In the first step we did not include any predictors of interest other than the intercept. The results indicated that the overall intercept differed significantly from zero (B = 0.7615, SE = 0.2529, z = 3.011, p = 0.0026), which suggests that the overall number of correct responses was significantly above chance. In the second step, we added the familiarity predictor to the first model. This predictor had a significant effect (B = 0.8908, SE = 0.3611, z = 2.467, p = 0.0136), and overall the second model was significantly better than the first (χ² = 5.5767, df = 1, p = 0.0182). This suggests that the participants who were familiar with cats performed significantly better than those who were not.

We also tested whether the number of years that the participants had owned a cat was a better predictor than familiarity, but this turned out not to be the case. In fact, number of years had a non-significant effect on the dependent variable, suggesting that participants who had owned a cat for a longer period of time did not score better than those who had owned a cat for a relatively short time.
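As a sanity check on the intercept-only model, the fixed-effect estimate can be mapped back to a probability with the inverse-logit function. The sketch below is plain Python written for this text, not the authors' analysis code; note that because the model has subject- and stimulus-level random intercepts, this is a conditional rather than a marginal probability, but it agrees well with the observed 65% correct:

```python
import math

def inv_logit(b: float) -> float:
    """Map a log-odds estimate to a probability."""
    return 1.0 / (1.0 + math.exp(-b))

# Intercept-only model reported above: B = 0.7615, SE = 0.2529.
p_correct = inv_logit(0.7615)
print(round(p_correct, 3))  # ≈ 0.682, i.e. above the 0.5 chance level

# The Wald z statistic reported in the text is simply B / SE.
z = 0.7615 / 0.2529
print(round(z, 3))  # ≈ 3.011
```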


The participants who were familiar with cats were not only more often correct in their answers, they were also more confident in their answers. The average confidence rating given by participants familiar with cats was 2.86, whereas that given by the other participants was 1.78. This difference was tested in a linear regression analysis, which showed that it was significant (B = 1.0794, SE = 0.4133, t = 2.612, p = 0.0143).

Finally, we examined the relation between the acoustic measurements of the stimuli shown in Table 1 and the judgements made by the participants. Given the high degree of correlation between the different f0 variables, we used only f0 standard deviation in combination with duration as predictors of the participant choices in a multilevel logistic regression analysis. The results showed that f0 standard deviation was a significant predictor (B = −0.0069, SE = 0.0008, z = −8.705, p < 0.0001), while duration was not (B = 0.3969, SE = 0.3502, z = 1.133, p = 0.2571). The relation between f0 standard deviation and the listeners' judgements is visualised in Figure 2. The lower the f0 standard deviation of the stimulus, the more often it was classified as a vet related vocalisation.

Figure 2. Relation between f0 standard deviation and participant choice.

Discussion and future work

Our results showed that listeners were able to identify domestic cat meows from two different contexts significantly better than chance, and that experienced listeners were better judges than inexperienced ones. Moreover, there was a tendency to judge meows with rising intonation as food related, and meows with falling intonation as vet related. Our acoustic analysis showed that the food related meows tended to have rising f0 contours, often in combination with a high f0 range, while the vet related meows often had slightly falling f0 patterns, often accompanied by a low f0 range. It is possible that the listeners were influenced by the different f0 ranges and interpreted them as expressions of different emotions: food related stimuli as happy (high f0 range), and vet related stimuli as sad (low f0 range).

A majority of the participants made the additional comment that some meows were quite easy to judge, while others were much more difficult. The meow with the shortest duration was often found very difficult to classify. Some listeners reported that they found some of the meows similar to those of their own cats. This may suggest that different cats use similar vocalisations in the contexts used in this study.

Our study suggests that cats can learn to manipulate prosodic patterns in their vocalisations in order to better elicit the desired response from their human companions. Similarly, many humans adapt their speech or speaking style to their pets by using some kind of "pet talk" (see e.g. Burnham et al., 2002). It is not unlikely that pets and their owners together develop a set of different prosodic patterns to improve inter-species communication. We hope to investigate this further in a future phonetic study of pet–human dialogues.

As far as we know, this is one of the first phonetic studies of intonation in human-directed cat vocalisations, and there are numerous questions yet to be answered in order to better understand how cats and other pets use prosody in their vocal interaction with humans. Although this study examined a very limited number of meows from only two cats, our hypotheses that humans can judge similar cat vocalisations that differ in intonation patterns significantly better than chance and that experienced listeners perform better than inexperienced ones were confirmed. In future studies, we intend to investigate other parameters, including f0 direction and movement, vowel quality and dynamics (diphthongisation) as well as intensity.

Acknowledgements

The authors gratefully acknowledge support from the Linnaeus environment Thinking in Time: Cognition, Communication and Learning (Swedish Research Council, grant no. 349-2007-8695). We are also very grateful to all our cat and human participants.

References

Boersma, P. & Weenink, D. (2013). Praat: doing phonetics by computer [Computer program]. Version 5.3.56, retrieved from http://www.praat.org/.
Bradshaw, J. (2013). Cat Sense: The Feline Enigma Revealed. London: Allen Lane.
Burnham, D., Kitamura, C. & Vollmer-Conna, U. (2002). What's new, pussycat? On talking to babies and animals. Science 296: 1435.
Crowell-Davis, S. L., Curtis, T. M. & Knowles, R. J. (2004). Social organization in the cat: a modern understanding. Journal of Feline Medicine and Surgery 6(1): 19–28.
Driscoll, C. A., Clutton-Brock, J., Kitchener, A. C. & O'Brien, S. J. (2009). The taming of the cat. Scientific American, June 2009, 68–75.
Fitch, W. T. (2000). The phonetic potential of nonhuman vocal tracts: Comparative cineradiographic observations of vocalizing animals. Phonetica 57: 205–218.
Gussenhoven, C. (2002). Intonation and interpretation: Phonetics and phonology. In Proceedings of Speech Prosody 2002, Aix-en-Provence.
Macedonia, J. M. & Evans, C. S. (1993). Variation among mammalian alarm call systems and the problem of meaning in animal signals. Ethology 93: 177–197.
McKinley, P. E. (1982). Cluster analysis of the domestic cat's vocal repertoire. Unpublished doctoral dissertation, University of Maryland, College Park.
Moelk, M. (1944). Vocalizing in the House-Cat: A Phonetic and Functional Study. The American Journal of Psychology 57(2): 184–205.
Nicastro, N. & Owren, M. J. (2003). Classification of domestic cat (Felis catus) vocalizations by naïve and experienced human listeners. Journal of Comparative Psychology 117: 44–52.
Nicastro, N. (2004). Perceptual and acoustic evidence for species-level differences in meow vocalizations by domestic cats (Felis catus) and African wild cats (Felis silvestris lybica). Journal of Comparative Psychology 118(3): 287–296.
Ohala, J. (1984). An ethological perspective on common cross-language utilization of F0 of voice. Phonetica 41: 1–16.
Schötz, S. & Eklund, R. (2011). A comparative acoustic analysis of purring in four cats. In Quarterly Progress and Status Report TMH-QPSR, Volume 51 (Proceedings from Fonetik 2011). Royal Institute of Technology, Stockholm, pp. 9–12.
Schötz, S. (2012). A phonetic pilot study of vocalisations in three cats. Proceedings of Fonetik 2012, Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, pp. 45–48.
Schötz, S. (2013). A phonetic pilot study of chirp, chatter, tweet and tweedle in three domestic cats. Proceedings of Fonetik 2013, Linköping University, pp. 65–68.
Turner, D. C. & Bateson, P. (Eds.) (2000). The domestic cat: the biology of its behaviour. Cambridge: Cambridge University Press.


A pilot study of human perception of emotions from domestic cat vocalisations Susanne Schötz Centre for Languages & Literature, Lund University, Sweden [email protected]

Abstract

This paper presents preliminary results from a pilot study in which 36 human listeners classified 28 cat vocalisations into seven emotion categories. Classification accuracy and between-listener agreement varied considerably between vocalisations. The vocalisations were subdivided into categories based on the emotions perceived by most listeners and compared in an acoustic analysis. Preliminary results suggest that cats vary their intonation to signal different emotions, and that humans perceive them based on cues used to signal emotion in human speech. Surprisingly, the trill vocalisation used for friendly greetings was often misjudged as anger. Future work includes a deeper analysis of the results and also a comparative study of human-directed and cat-directed cat vocalisations.

Introduction

The cat (Felis catus, Linnaeus 1758) was domesticated 10,000 years ago, and is one of our most popular pets, with some 600 million individuals (Turner & Bateson, 2000; Driscoll et al., 2009). Cats are social animals (Crowell-Davis et al., 2004), and their interaction with humans has, over a long time of living together, resulted in cross-species communication that includes visual as well as vocal signals. For instance, they have learned to produce different vocal signals for different purposes, e.g. to solicit feeding or to gain access to desired locations and other resources provided by humans. There are several descriptions of the communicative social behaviour of the domestic cat (e.g. Turner & Bateson, 2000; Bradshaw, 2013), but those concerning vocalisations are scarce and often fragmented. It is still unclear how cats combine different sounds, and how they vary intonation and voice quality to convey or modulate a vocal message.

Cat vocalisations are generally divided into three major categories: (1) sounds produced with the mouth closed (murmurs), such as the purr, the trill and the chirrup; (2) sounds produced with the mouth open(ing) and gradually closing, comprising a large variety of meows with similar [ɑ:ou] vowel patterns; and (3) sounds produced with the mouth held tensely open in the same position, i.e. sounds often uttered in aggressive situations, including growls, snarls, hisses, and shrieks (Moelk, 1944; McKinley, 1982).

Nicastro & Owren (2003) asked naïve and experienced listeners to judge meow calls from twelve cats recorded in five different behavioural contexts (food-related, agonistic, affiliative, obstacle, and distress). Classification accuracy was modestly (but significantly) above chance.

McComb (2009) found acoustic and perceptual differences between happy and food-soliciting cat purring.

Schötz and Eklund (2011) carried out an acoustic study of cat purring, and Schötz (2012, 2013) analysed 795 different cat vocalisations and found that duration varied only somewhat within each vocalisation type. However, f0 variability was high, partly due to numerous different intonation patterns.

Schötz & van de Weijer (2014) examined 30 human listeners' ability to classify cat meows recorded in two contexts: during feeding time (food related meows) and while waiting at a vet clinic (vet related meows). Classification accuracy was significantly above chance, and listeners with cat experience performed significantly better than naïve listeners. A pitch analysis showed that food related meows tended to have rising f0 contours, while vet related meows often had more falling f0 patterns, suggesting that cats use different intonation patterns in their vocal interaction, which (experienced) humans are able to identify.

The purpose of this study was to investigate human listeners' perception of emotions from cat vocalisations, and to compare a number of acoustic features, including measures of f0 and intonation, across the judged emotions, relating them also to human emotions. A larger goal was to learn more about cat–human communication.

Material and method

Vocalisations from five domestic cats were recorded. Three cats were recorded in their home and two cats in an agonistic context in the author's garden. Video was recorded with either a Sony digital HD video camera HDR-CX730 with an external shotgun microphone Sony ECM-CG50 or an Apple iPhone 3G. Audio files (wav, 44.1 kHz, 16 bit, mono) were extracted with Extract Movie Soundtrack. Based on the overall recording quality and on judgements by the author, who knew the cats well, of how representative the vocalisations were for each emotion, 28 different vocalisations were selected as material. A few vocalisations contained background noise, but this was judged to have no influence on the perception task. The vocalisations were segmented, extracted and normalised for amplitude in Praat (Boersma & Weenink, 2014).

Experiment 1: Perception test

Procedure

36 students (22 women, 13 men) of phonetics and general linguistics at Lund University volunteered as participants in a listening experiment. Their average age was 25 years (range 19 to 59 years). Using a seven-point scale, the participants were asked to rate their experience of and attitude towards cats. Their mean experience with cats (1 = none, 7 = extremely good) was 4.05, and their average attitude towards cats (1 = hate, 7 = love) was 5.5.

Oral and written instructions were given before the experiment. The task was to judge the emotion (using seven categories) of 28 cat vocalisations, which were played twice in the same random order on an Apple MacBook Pro computer through HUMP NF22A speakers at a comfortable sound level. To reduce the number of response categories, some emotions were combined into a single category. Sorrow and fear were combined into the category SorrowFear, as some vocalisations were judged to signal both emotions. Moreover, all emotions associated with questioning, begging, wanting or needing something (e.g. food or access to a desired location) were combined into the category Desire. Furthermore, the category Other could be used for any other perceived emotion, and if the listeners were unable to judge the emotion, they were instructed to select the category Don't know. The seven categories used in the test were the following:

1) Joy: happy or content
2) SorrowFear: sad or afraid
3) Anger: angry or discontent
4) Desire: questioning, begging, wanting or hungry
5) Neutral
6) Other
7) Don't know

After the test, the participants were asked to make a single judgement of the degree of difficulty of the task on a 7-point scale.

Three experiments, with 15, 10 and 10 students participating, were carried out. Each experiment lasted about 20 minutes. Some time after the experiment, the results were presented to the listeners, and they were asked to comment on them, e.g. what listening strategies and/or phonetic cues they had used to make their judgements. Many listeners reported that they had based their judgements on cues of pitch and on whether the vocalisation contained elements of noise. They had judged stimuli with a low average pitch and a high degree of noise as Anger, and stimuli with a high average pitch and a low degree of noise as Joy. In addition, listeners reported that rising intonation was judged as Desire, and falling intonation as SorrowFear. This information was used to select features for the acoustic analysis.

Results

Figure 1 shows the total number of responses for the seven categories. Of all 980 responses in the experiment, 53 were Neutral, 21 Other and 72 Don't know. The remaining 834 responses were fairly evenly distributed among the four categories SorrowFear (200), Anger (211), Joy (197), and Desire (226).

Between-listener agreement

The listener agreement of the responses varied considerably between stimuli. The two purring stimuli showed the highest agreement; over 9/10 of the listeners judged these stimuli as Joy. More than half of the listeners perceived the same emotion in nine of the stimuli: four stimuli as Anger, two as Joy, two as Desire and one as SorrowFear. Moreover, eight stimuli received the same response from more than 2/5 of the listeners, while five stimuli received the same response from over 1/3 of the listeners. Four stimuli had less than 1/3 of the listeners' responses in the same category. Figure 3 shows the distribution of responses for four example stimuli.

Experiment 2: Acoustic analysis

Based on the categories that had received the highest number of responses, the 28 stimuli were subdivided into six emotional categories. Three stimuli had received the highest number of responses for two categories, and these formed two additional categories: JoyAnger and DesireSorrow.
Measures of duration, f0 and har- monics-to-noise ratio (HNR) were ob- tained with a Praat script and manually checked. In addition, f0 contours of the Figure 1. Number of responses for the seven stimuli were plotted in six diagrams; i.e. categories in the listening experiment. one for each emotion category. Table 1 shows the number of stimuli that was Number of correct responses divided into each of the six emotion The emotions of the stimuli were categories and also the mean results of judged (by the author, who had made the acoustic analysis. Figure 4 shows the recordings and also knew the cats six diagrams with f0 contours of the and the contexts in which the vocalisa- vocalisations subdivided into each emo- tions were produced) and used as pre- tion category. The two purring stimuli liminary measures of correct emotions were categorized as Joy, but they were for the stimuli. Of the 980 responses in left out of these diagrams, as the f0 in the experiment 350 were correct (38%). purring is significantly lower than in Figure 2 shows the percentage of cor- other cat vocalisations (see Schötz & rect responses for each of the 28 stimu- Eklund, 2011; Schötz, 2012). li. The two purring stimuli (11 and 20) received the highest number of correct Results responses, while one low-pitched mur- Table 1 shows the mean values of dura- mur-meow with clear elements of noise tion, f0 (mean, standard deviation, (stimulus 12) received no correct re- range, minimum, and maximum) as sponses, and a low-pitched trill (stimu- well as mean HNR for six emotion cat- lus 23) only 2 correct responses. egories containing the 28 stimuli.
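The Praat script itself is not included in the paper. As a rough illustration (in Python, with invented example values), the per-stimulus summary measures reported in Table 1 (duration; mean, standard deviation, range, minimum and maximum of f0) can be derived from a sampled f0 track along these lines; HNR extraction is left to Praat:

```python
from statistics import mean, stdev

def f0_summary(f0_hz, frame_step=0.01):
    """Summary measures for one vocalisation from a sampled f0 track.

    f0_hz: one f0 estimate (Hz) per analysis frame; None for unvoiced frames.
    frame_step: time between frames in seconds (an assumption, not from the paper).
    """
    voiced = [f for f in f0_hz if f is not None]
    return {
        "duration": len(f0_hz) * frame_step,    # sec, including unvoiced frames
        "mean_f0": mean(voiced),                # Hz
        "f0_stdev": stdev(voiced),              # Hz
        "f0_range": max(voiced) - min(voiced),  # Hz
        "min_f0": min(voiced),
        "max_f0": max(voiced),
    }

# Invented example track: a rising meow-like contour with unvoiced edges.
track = [None, 300.0, 350.0, 400.0, 500.0, None]
stats = f0_summary(track)
```

This is only a sketch of the kind of measures involved; in practice the f0 track itself would come from a pitch analysis such as Praat's.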


Figure 2. Percentage of correct responses for the 28 stimuli of the listening experiment.

Figure 3. Proportion of listeners' responses (categories) for the example stimuli 1, 3, 10 and 11.

Table 1. Mean duration (sec.), mean f0, f0 standard deviation (stdev), range, minimum and maximum f0 (Hz), and harmonics-to-noise ratio (HNR) of the six emotion categories containing the 28 cat vocalisation stimuli.

Judged category   no. stimuli   duration   mean f0   f0 stdev   mean f0 range   min/max f0   HNR
Joy               6             2.18       788       165        501             271/1023     4.1
Anger             9             0.95       415       84         217             211/817      4.8
SorrowFear        5             0.90       694       75         218             300/1102     10.8
Desire            5             0.94       646       108        285             233/886      7.1
JoyAnger          1             0.63       240       21         91              192/283      0.4
DesireSorrow      2             0.66       730       73         244             227/950      10.0
Total (all)       28            1.14       589       96         260             211/1102     6.8

Figure 4. Time normalised f0 contours of cat vocalisations by the categories selected by the majority of the listeners in the perception test. [Six panels, f0 (Hz) against normalised time: Joy (Happy), SorrowFear (Sad–Afraid), Anger (Angry), Desire (Asking), JoyAnger (Happy–Angry) and DesireSorrow (Asking–Sad).]


Duration and HNR

The total average duration was 1.14 sec, and the stimuli judged as Joy had the longest duration (2.18 sec.). Clearly shorter durations were found in the stimuli judged as Anger, SorrowFear and Desire (0.90–0.95 sec). The categories JoyAnger and DesireSorrow had the shortest durations (0.63–0.66 sec), which may explain why these stimuli had received an equally high number of responses for two different emotions.

The mean HNR of all stimuli was 6.8, and it was lower in Joy (4.1), Anger (4.8) and JoyAnger (0.4) than in the other categories. Desire had an HNR of 7.1, and the two categories with the highest HNR were SorrowFear (10.8) and DesireSorrow (10.0), suggesting that the background noise found in these stimuli had not influenced the perception test or the HNR analysis.

f0 values and intonation contours

The vocalisations judged as Anger and JoyAnger had the lowest mean values for f0, f0 stdev and f0 range. The other categories had clearly higher f0 values, and Joy had the highest values of all.

If we exclude the contours with a typically initial low f0 followed by a steep rise ending in very high f0, i.e. the contours of murmur-meows (see Schötz, 2013), the f0 contours within each category are often similar in shape and range. Moreover, they are not unlike the pitch patterns used by humans to signal the same emotions (see Lindblad, 1992; Rodero, 2011). SorrowFear f0 contours are often level and monotonous, with a slight fall throughout the vocalisation, which resembles human intonation of sorrow more than of fear. Joy has f0 contours characterised by high f0 and a high f0 range with much variation in intonation. Anger f0 contours often contain breaks, perhaps due to irregularities or noise, and they are often lower in f0 with either very level intonation or sudden rises and falls, which is in line with the two types of intonation found in human anger.

Discussion and future work

The very preliminary results of this pilot study suggest that human listeners are not very good at judging the emotional state behind cat vocalisations, perhaps because they rely on phonetic cues used to signal emotion in human speech.

There was, however, much variation in the agreement between listeners. Some vocalisations, e.g. the purring stimuli (11 and 20), had much higher agreement than others, e.g. the greeting trill stimuli (7 and 15). One explanation may be that the listeners' reported experience of cats varied: about 1/3 of the listeners reported that they were very experienced, while 1/3 had hardly any experience. Another possible explanation is that naïve listeners based their responses on biological codes (see Gussenhoven, 2002) and cues for human emotions, as stimuli with high f0 and f0 range were often judged as Joy, and stimuli with low f0 and range as Anger or Sorrow. Naïve listeners may not know what a greeting trill is, and are likely to judge it as Anger, as it may sound similar to an agonistic growl. The greeting trill stimuli received about the same number of responses for Anger and Joy. Moreover, trills are often noisy, and several listeners reported that they had used a high amount of noise as a cue for Anger.

The results from the acoustic analysis suggest that cats use intonation to signal different emotions. However, although human listeners were fairly good at identifying some emotions, other vocalisations were often misinterpreted. Vocal signals generally co-occur with visual signals, making it easier to distinguish a growl from a trill when one is used to scare off an intruder and the other to greet you when you come home from work. It is likely that humans are much better judges of these types of calls when visual cues are available. Still, vocal signals are important (especially in darkness), and are frequently used by cats in intra- as well as inter-species communication. To be able to communicate better with our cat companions, more phonetic research is needed.

Have cats learned to adapt their vocal patterns (including intonation) to human speech in order to better elicit the desired response from their human companions, or do cats and humans use the same biological codes? The results found by McComb et al. (2009) and the tentative results of this pilot study suggest that cats are able to adapt to human listeners. Future work includes a more thorough analysis of the results found here, and also comparative studies of the phonetic properties of cat-directed and human-directed cat vocalisations.

Acknowledgements

The author is very grateful to all the cat and human participants of this pilot study.

References

Boersma, P. & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program]. Ver. 5.3.76, retrieved from www.praat.org/.
Bradshaw, J. (2013). Cat Sense: The Feline Enigma Revealed. London: Allen Lane.
Burnham, D., Kitamura, C. & Vollmer-Conna, U. (2002). What's new, pussycat? On talking to babies and animals. Science 296, 1435.
Crowell-Davis, S. L., Curtis, T. M. & Knowles, R. J. (2004). Social organization in the cat: a modern understanding. Journal of Feline Medicine and Surgery 6(1):19–28.
Driscoll, C. A., Clutton-Brock, J., Kitchener, A. C. & O'Brien, S. J. (2009). The taming of the cat. Scientific American, June 2009, 68–75.
Gussenhoven, C. (2002). Intonation and interpretation: Phonetics and phonology. In Proceedings of Speech Prosody 2002, Aix-en-Provence.
Lindblad, P. (1992). Rösten. Lund: Studentlitteratur.
McComb, K., Taylor, A. M., Wilson, C. & Charlton, B. D. (2009). The cry embedded within the purr. Current Biology 19(13), R507–R508.
McKinley, P. E. (1982). Cluster analysis of the domestic cat's vocal repertoire. Unpublished doctoral dissertation, University of Maryland, College Park.
Moelk, M. (1944). Vocalizing in the House-Cat; A Phonetic and Functional Study. The American Journal of Psychology 57(2):184–205.
Nicastro, N. & Owren, M. J. (2003). Classification of domestic cat (Felis catus) vocalizations by naïve and experienced human listeners. Journal of Comparative Psychology 117:44–52.
Rodero, E. (2011). Intonation and emotion: influence of pitch levels and contour type on creating emotions. Journal of Voice 25(1):e25–e34.
Schötz, S. & Eklund, R. (2011). A comparative acoustic analysis of purring in four cats. In Quarterly Progress and Status Report TMH-QPSR, Volume 51 (Proceedings of Fonetik 2011), Royal Institute of Technology, Stockholm, Sweden, pp 9–12.
Schötz, S. (2012). A phonetic pilot study of vocalisations in three cats. In Proceedings of Fonetik 2012, Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, pp 45–48.
Schötz, S. (2013). A phonetic pilot study of chirp, chatter, tweet and tweedle in three domestic cats. In Proceedings of Fonetik 2013, Linköping University, pp 65–68.
Schötz, S. & van de Weijer, J. (2014). A Study of Human Perception of Intonation in Domestic Cat Meows. In Proceedings of Speech Prosody 2014, Dublin.


Aspects of second language speech prosody: data from research in progress

Juhani Toivanen
Diaconia University of Applied Sciences, Finland
[email protected]

Abstract

In this paper, data from three preliminary studies concerning prosody in second language (L2) speech are described. Firstly, ways of quantifying L2 speech prosody are described. Secondly, a new way of formally describing L2 speech prosody is presented. Thirdly, an experimental situation is introduced showing that L2 speech prosody is a many-faceted phenomenon which is affected by several factors.

Introduction and ongoing research

Prosodic features of second language speech have not been extensively described in the SLA (second language acquisition) literature. Recently, however, there has been increasing interest in L2 prosody (Toivanen, 2001; Hincks, 2004), and it seems that the development of an appropriate methodological apparatus in this research area is an important prerequisite for progress. In this paper, the focus is on three separate research projects dealing with different aspects of L2 speech prosody. The data was collected at Oulu University, Finland, during 2000–2005, and some of the results were presented in Toivanen & Henrichsen (2006). In the present paper, a systematic overview is presented, along with a discussion of further implications.

The data set

L2 speech prosody variation using different scales, a framework for L2 speech prosody description, and an experimental scenario involving contextually relevant L2 speech prosody variation are presented in the following sections.

L2 speech prosody variation: some quantitative alternatives

In studies on second language acquisition, prosody, if dealt with at all, is often described in an anecdotal and impressionistic way. One finds descriptions such as "narrow voice range", "flat pitch", etc. These descriptions are often pedagogically relevant, and they may make the transcription system more accessible to non-experts, but a considerable amount of subjectivity is a corollary of this approach. However, even a more phonetically oriented descriptive system based on "more objective" labels such as high/modal/low mean, range and variability of pitch may be confusing if the analysis is not based on any concrete anchor values or baseline data.

While non-numerical, non-experimental investigations of L2 speech prosody have an important role in the study of situated language use, for example, a more quantitative approach is also needed. The first approach is to describe pitch range with the linear Hertz scale. A number of acoustic studies of pitch range, mostly dealing with L1, have utilized this strategy, but the problem is that this scale fails to make an appropriate normalization for the non-linearity of pitch perception: a larger change in frequency at the higher end of the absolute pitch range is needed to produce the same perceptual effect as a smaller change at the lower end. Thus, with the linear scale, comparisons of pitch ranges between, for example, males and females are almost pointless.
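The non-linearity can be made concrete with a small sketch (in Python, not from the paper) of the scale conversions discussed in this section. The semitone formula is standard; the ERB-rate constants (16.7, 165.4) follow one widely quoted approximation and should be checked against Hermes & van Gestel (1991) before reuse:

```python
import math

def hz_to_semitones(f_hz, ref_hz=100.0):
    # 12 semitones per doubling of frequency, relative to a reference value.
    return 12.0 * math.log2(f_hz / ref_hz)

def hz_to_erb_rate(f_hz):
    # One common ERB-rate approximation: E = 16.7 * log10(1 + f / 165.4).
    # The exact constants are an assumption here; verify against the source.
    return 16.7 * math.log10(1.0 + f_hz / 165.4)

# The same 50 Hz excursion is perceptually smaller higher in the range:
low_excursion = hz_to_semitones(150.0) - hz_to_semitones(100.0)   # about 7.0 semitones
high_excursion = hz_to_semitones(350.0) - hz_to_semitones(300.0)  # about 2.7 semitones
```

The comparison of the two excursions illustrates why identical pitch ranges in Hertz can correspond to very different perceptual ranges.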


The second option is to convert the Hertz values into semitone values; the logarithmic semitone scale has been extensively used in investigations of L1 pitch range, but even this scale is not completely appropriate from the viewpoint of perception.

The third, and evidently the best, strategy is to use ERB (Equivalent Rectangular Bandwidth) measurements. The ERB scale is based on the frequency selectivity of the human auditory system, and the scale is perceptually more relevant than either the linear Hertz scale or the logarithmic semitone scale (Hermes & van Gestel, 1991).

Toivanen (2001) investigated the prosody of Finnish English L2 speech in an experimental setting involving native English speech as baseline data. Two groups of speakers, advanced L2 English speakers and native speakers of British English (near-RP), read out a set of short standard texts (of the Rainbow Passage type), and the recorded speech data was analyzed acoustically. Pitch range was described with the semitone scale and the ERB scale; the linear scale was used in some preliminary comparisons. A number of unsystematic differences in pitch variation between the two groups were found with the linear scale, while the semitone scale produced much more consistent differences. The most systematic differences throughout the data, however, were detected using ERB measurements. The ERB scale enabled the conclusion that pitch variation in Finnish English L2 speech is indeed significantly more limited than in native English speech. Clearly, the type of scale used for pitch analysis is critical, and it seems obvious that in comparative cross-linguistic investigations of prosody and pitch variation in L2 speech, the ERB scale should be considered a first choice.

Phonological transcription of L2 speech prosody

ToBI labeling is commonly used in the prosodic transcription of (L1) English, and good inter-transcriber consistency can be achieved as long as the voice quality represents normal (modal) phonation. Certain discourse situations and varieties of English, however, probably involve voice qualities different from modal phonation, and the prosodic analysis of such speech data with traditional ToBI labeling may be problematic. Typical examples are breathy, creaky and harsh voice qualities. Pitch analysis algorithms, which are used to produce a record of the fundamental frequency (f0) contour of the utterance to assist the ToBI labeling, yield a messy or missing f0 track on non-modal voice segments. Non-modal voice qualities may represent habitual speaking styles or idiosyncrasies, or they may be characteristic of emotional discourse. Non-modal voice segments typically occur in Finnish speech, as well as in the L2 English of Finns.

In Toivanen & Henrichsen (2006), a 4-Tone Emotional Voice Transcription Framework was introduced. The framework is intended for transcribing the prosody of modal/non-modal voice in (emotional) English speech. As in the original ToBI system, intonation is described as a sequence of pitch accents and boundary pitch movements (phrase accents and boundary tones). The original ToBI break index tier (with four strengths of boundaries) is also used. The fundamental difference between the 4-tone framework and the original ToBI is that four main tones (H, L, h, l) are used instead of two (H, L). In the 4-tone framework, "H" and "L" are high and low tones, respectively, as are "h" and "l", but "h" is a high tone with non-modal phonation and "l" a low tone with non-modal phonation. Basically, "h" is "H" without a clear pitch representation in the f0 contour record, and "l" is a similar variant of "L".

To assess the usefulness of the 4-tone descriptive framework, informal interviews in English with Finnish students at a university of applied sciences were used. The speakers talked about their exchange study experiences

abroad. The discussions were recorded in a sound-treated room; the speakers' speech data was recorded directly to hard disk (44.1 kHz, 16 bit) using a high-quality microphone. The speech data consisted of 574 orthographic words (82 utterances) produced by three female students (20–27 years old). Five Finnish students of linguistics/phonetics listened to the tapes; the subjects transcribed the data prosodically using the 4-tone descriptive framework. The transcribers had been given a short training session in the 4-tone style labeling. Each subject transcribed the data material independently of the others.

As in the evaluation studies of the original ToBI, a pairwise analysis was used to evaluate the consistency of the transcribers: the label data of each transcriber was compared against the labels of every other transcriber for the particular aspect of the utterance. The 574 words were transcribed by five subjects; thus a total of 5740 (574 × 10 pairs of transcribers) transcriber-pair-words were produced. The following consistency levels were obtained: presence of pitch accent 73%, choice of pitch accent 69%, presence of phrase accent 82%, presence of boundary tone 89%, choice of phrase accent 78%, choice of boundary tone 85%, and choice of break index 68%.

The level of consistency achieved for the 4-tone descriptive framework was somewhat lower than that reported for the original ToBI system. However, the differences in the agreement levels seem quite insignificant, bearing in mind that the 4-tone system uses four tones instead of two. Importantly, it can be concluded that a descriptive system of speech prosody especially tailored for L2 speech seems feasible. In Finnish English speech, "l" typically and systematically occurs, often with a decelerating speech tempo, in the vicinity of a transition relevance place, with or without a change of speaker. A traditional ToBI-based transcription system would seem to miss an important point here.

Speech situation and the prosody of L2 speech

The third aspect of L2 speech prosody to be dealt with is the effect of the speech situation on pitch range and variation. The speech data was produced by seventeen Finnish students of business administration at a university of applied sciences (all females in their early twenties). The subjects took voluntary Spanish courses as part of their general language studies. The subjects had studied Spanish for 3–5 years on average, and they could be described as semi-fluent in ordinary L2 language use situations. The speakers read out a short emotionally charged (joyful) passage of some 50 words from a Spanish translation of a well-known Finnish novel. Each subject read out the passage nine times in two different sessions on separate days; the speakers were allowed to read out the text at their own pace, with suitable breaks between the readings. The instructions were given by two different persons. In the first session, the instructions (basically stating that the text and its repetitions should be read out in a manner "natural and comfortable" to the speaker) were given (in Spanish) by a Finnish person: a female college lecturer in her thirties teaching Spanish to the subjects at the time. In the second session, the instructions were given (in Spanish) by a native speaker of Spanish, a female in her thirties, whom the speakers had not met before. The speakers' speech data was recorded directly to hard disk (44.1 kHz, 16 bit) using a high-quality microphone.

The total set of materials consisted of 17 × 9 × 2 (306) tokens (passages). The data was analyzed acoustically with CSL (Kay Elemetrics) in terms of the following speech measures: speaking fundamental frequency (f0), f0 range and jitter. Each f0 value from the pitch analysis was converted to ERB using the formula given by Hermes & van Gestel (1991). ANOVA was used for statistical analysis of the Instructor (2) × Repetitions (9) design. Instructor effects were significant for f0, f0 range and jitter. For each measure, the "Spanish Instructor" condition produced a higher value than the "Finnish Instructor" condition with every repetition. Repetition effects were significant for f0, f0 range and jitter. For each measure, the value became progressively higher with continued repetitions. Interaction effects were significant for f0, f0 range and jitter. For each measure, the differences between the Spanish Instructor condition and the Finnish Instructor condition became greater as the repetitions progressed.

The instructor/interlocutor or the targeted audience clearly affected the L2 speech prosody in this setting: more lively prosody could be observed with the native speaker. It also seems that the non-native speakers needed some time to "get going" prosodically. Larger pitch variation was produced as the text got progressively more familiar. The most lively speech prosody occurred when the non-native speakers addressed the native speaker and had overcome the initial tension. In the research on this topic, a large amount of jitter is generally associated with natural relaxed communication (Scherer, 1995) – a trend found in the present investigation as well.

All in all, these findings support the conclusion that L2 Spanish speakers are sensitive to their interlocutors. There is some evidence elsewhere that L2 speakers become more hesitant (prosodically) when they address a listener with the same L1 background (Takahashi, 1989).

Discussion and conclusion

The points and research data presented in this paper have touched upon some aspects that are relevant when the prosody of L2 speech is discussed. On the one hand, one should be aware of the parameters with which prosody can be described. Should they be exact and rigorously defined in L2 speech prosody research, or can one do with a more impressionistic apparatus? On the other hand, should one, perhaps, develop new methods and analytic frameworks for L2 speech prosody research in general? Are the current tools entirely functional? Finally, one should realize that L2 speech prosody is highly dependent on situational factors. One could hypothesize that there are more factors involved than in most situations with native language speech prosody.

References

Hermes, D. & van Gestel, J.C. (1991). The frequency scale of speech perception. Journal of the Acoustical Society of America, 90, 97–103.
Hincks, R. (2004). Standard deviation of f0 in student monologue. In Proceedings of FONETIK 2004. Stockholm University: Department of Linguistics, 132–135.
Takahashi, T. (1989). The influence of the listener on L2 speech. Variation in Second Language Acquisition, 1, 66–80.
Toivanen, J. (2001). Perspectives on Intonation: English, Finnish and English Spoken by Finns. Frankfurt am Main: Peter Lang GmbH.
Toivanen, J. & Henrichsen, P.J. (eds.) (2006). Current Trends in Research on Spoken Language in the Nordic Countries. Oulu University Press.
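The pairwise consistency measure used in the transcription evaluation above can be sketched as follows: for each word, every pair of transcribers contributes one transcriber-pair-word (five transcribers give 10 pairs per word, so 574 words yield 5740), and the consistency level for a given tier is the proportion of pairs whose labels agree. Names and data here are invented for illustration:

```python
from itertools import combinations

def pairwise_consistency(labels_per_word):
    """labels_per_word: for each word, the label chosen by each transcriber."""
    agree = total = 0
    for labels in labels_per_word:
        for a, b in combinations(labels, 2):  # every transcriber pair once
            total += 1
            agree += (a == b)
    return agree / total

# Toy example: 2 words, 5 transcribers each, i.e. 2 x C(5,2) = 20 pair-words.
data = [["H", "H", "h", "H", "H"], ["L", "l", "L", "L", "l"]]
level = pairwise_consistency(data)
```

In the toy data, 10 of the 20 transcriber-pair-words agree, giving a consistency level of 0.5.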


Perception and production of Swedish word accents by Somali L1 speakers

Anna Hed
Centre for Languages & Literature, Lund University, Sweden
[email protected]

Abstract

According to the feature hypothesis, a phonological feature of the L2 is easier to acquire if the L1 of the speaker contains the same feature. Both Swedish and Somali are languages with word accents, and therefore it is, according to this hypothesis, assumed that L1 speakers of Somali will acquire the word accents of Swedish more easily than L1 speakers of languages without word accents. This study shows that Somali L1 speakers with Swedish L2 produce the Swedish word accents accurately but are not, as a group, better than chance in a perception test of the same accents. This study contradicts the feature hypothesis when it comes to perception but confirms it when it comes to production.

Introduction

The purpose of this study is to investigate whether a speaker with word accents in their L1 has an easier time perceiving and producing the word accents in an L2. This is all based on the feature hypothesis, which states that "L2 features not used to signal phonological contrast in L1 will be difficult to perceive for the L2 learner and this difficulty will be reflected in the learner's production of the contrast based on this feature" (McAllister et al., 2002).

The hypothesis has been tested on both segmental and suprasegmental features (see Flege, 1995), and previous studies of the L2 acquisition of tone do conclude that it is easier to perceive and produce the tones of an L2 if the L1 makes use of tone (e.g. Gottfried & Suiter, 1997; Schaefer & Darcy, 2013; Burnham et al., 1996). Schaefer & Darcy's study is particularly interesting since it connects the bias for acquiring intonational features with the tonal prominence hierarchy.

The tonal prominence hierarchy is a hierarchy of the salience of intonational features in languages. Highest in the hierarchy are tone languages (Mandarin, Thai), second are word accent languages (Swedish, Japanese, Somali), third are word stress languages (German, Farsi), and fourth are intonation-only languages (French, Korean). The results of their study on the perception of Thai tones showed that L1 speakers of languages higher up in the hierarchy were more accurate in perceiving the tones than L1 speakers of languages lower in the hierarchy. However, Tronnier & Zetterholm (2013a) tested L2 acquisition of Swedish word accents with L1 speakers of languages on different steps of this hierarchy. Their results showed that L1 speakers of languages higher than Swedish in the hierarchy (Vietnamese and Thai) did not produce the word accents, but L1 speakers of Somali, a language placed on the same level as Swedish in the hierarchy, did. The speakers of a language lower in the hierarchy, Farsi, did not produce the word accents either. This study is an attempt to develop their results with new Somali L1 informants and also with the addition of a perception study.
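The hierarchy-based prediction can be stated compactly. The ranking below follows the text; the helper function is an invented illustration of one reading of the perception prediction (an L1 at the same level as, or higher than, the L2 should ease acquisition), a reading that the production results of Tronnier & Zetterholm (2013a) partly complicate:

```python
# Tonal prominence hierarchy (rank 1 = most prominent), as given in the text.
TONAL_PROMINENCE = {
    "Mandarin": 1, "Thai": 1,                  # tone languages
    "Swedish": 2, "Japanese": 2, "Somali": 2,  # word accent languages
    "German": 3, "Farsi": 3,                   # word stress languages
    "French": 4, "Korean": 4,                  # intonation-only languages
}

def predicted_advantage(l1, l2):
    """Hypothetical helper: does the hierarchy predict an acquisition
    advantage for this L1 when learning the tonal features of this L2?"""
    return TONAL_PROMINENCE[l1] <= TONAL_PROMINENCE[l2]
```

Under this reading, Somali L1 speakers are predicted to have an advantage for Swedish word accents, while Farsi L1 speakers are not.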


There is not a lot of literature on the cross-linguistic acquisition of word accents, but some studies on the acquisition of Scandinavian word accents should be mentioned. Tronnier and Zetterholm (2013a) was already discussed above. In addition, there is a study by Kaiser (2011), who investigated perception and production of Swedish word accents by German L1 speakers and concluded that they neither perceived nor produced them. Another is Van Dommelen & Husby (2009), who compared the perception of Norwegian word accents by Mandarin and German L1 speakers and concluded that the Mandarin L1 speakers were better at perceiving the word accents than the German L1 speakers.
Looking at the literature, more studies have been done on the L2 acquisition of tone than on the L2 acquisition of other kinds of intonational features such as word accents, and more research is focused on perception than on production. This study is an attempt to study both the L2 production and the L2 perception of such a system: the Swedish word accents.

Swedish word accents
There are two different word accents in Swedish: Accent 1, below A1, and Accent 2, below A2. The tones are assigned to the syllable and the f0 pattern differs between different varieties of Swedish. The common denominator, however, is that A2 has a later tonal peak than A1. In some regional varieties A2 is realized with a second peak (Bruce, 2010). This study deals with two of the varieties, South Swedish and Central Swedish, whose f0 contours are shown in figures 1 and 2.

Figure 1. Central Swedish A1 and A2.

Figure 2. South Swedish A1 and A2.

The distribution of the word accents is fairly predictable, with different suffixes and morphological categories assigned to the different accents. A1 is more common and is usually described as the unmarked one. There are also varieties of Swedish that do not distinguish between the word accents, e.g. the Finland Swedish varieties. The word accent distinction has been shown not to be critical for the perception of Swedish, and many speakers of L2 Swedish omit the distinction (Thorén, 2005).

Somali word accents
Somali word accents are assigned to the mora, and only vowels are assigned morae. The accents only occur where there are two morae, as in long vowels and diphthongs. There can only be one high tone per word, and three accent patterns are available: high tone on the last mora, low elsewhere; high tone on the penultimate mora, low elsewhere; or low tones on all morae. The different accent patterns are related to different grammatical functions, such as gender, number and case on NPs (Saeed, 1999).
There are both differences and similarities between the systems. One difference is that the Swedish word accents are assigned to the syllable whereas the Somali word accents are assigned to the mora. Another is that Somali only allows one high tone for each word, while some of the Swedish varieties allow two. The Somali word accents are described as having grammatical function, while the Swedish ones are mostly described as lexical; however, as stated before, Swedish word accents are affected by morphological affixes, so there might not be a clear dichotomy.

Material and Method
This section contains information about the informants and descriptions of the two tests, starting with the perception test, followed by the production test.

Informants
Three informants with Somali L1 and Swedish L2 participated in this study. They were between 28 and 38 years old, one male and two females. One of the females lived and had acquired Swedish in Helsingborg, where the South Swedish variety is used, and one female and one male lived and had learnt Swedish in Sundsvall, where the Central Swedish variety is used. They had started learning Swedish at ages 17-25. None of the informants had been taught about the word accents in Swedish. The session with the informant in Helsingborg took place in a school environment, and the sessions with the informants in Sundsvall took place in a home environment.
In addition, there was a Swedish L1 control group for the perception test, described below. This group consisted of five people, three females and two males, between 20 and 25 years old.

Perception test
The perception of the Swedish word accents was tested with a discrimination test, constructed and executed in Praat (Boersma & Weenink, 2014). The task was to report whether two sentences following each other without a break were the same or different; there were two different boxes to click. Each sentence contained the carrier sentence "Det var X jag menade" 'It was X that I meant'. The target word was a word from a minimal pair: /ánden/ 'the duck' - /ànden/ 'the spirit'; /stéːɡen/ 'the (foot)steps' - /stèːɡen/ 'the ladder'; or /póːlen/ 'Poland' - /pòːlen/ 'the pole'. There were three versions of each sentence, to avoid the informants listening for other cues. In total 36 sentence pairs were played: 9 instances of A1+A1, 9 of A2+A2, 9 of A1+A2 and 9 of A2+A1. There were two tests, one with a South Swedish-speaking female and one with a Central Swedish-speaking female. The informants in Sundsvall were only tested on the Central Swedish test. The results of the L2 informants were also compared with those of the Swedish L1 control group, and these results were analyzed statistically to check for statistical significance in the difference between the L2 and the L1 group.

Production test
The production test was executed with read sentences. Each sentence contained a target word, expected to be focused. The target word was a two-syllable verb, with the infinitive carrying A2 and the present tense carrying A1. The sentences were recorded with a TASCAM DR-07 recording device and later analyzed in Praat. The fundamental frequency was analyzed with a script that normalized the curves by putting the minimum Hz

value of each word at a baseline with the value 0 on a semitone scale. The script was originally developed by Susanne Schötz and used in e.g. Schötz et al. (2011), but was modified and adapted for this particular study by the author.
The word accents were considered accurate if there was a distinction, if A2 had a later intonational peak and if they resembled the patterns displayed in figures 1 and 2.

Results

Perception test
First, an average score for all participants was calculated for both varieties. No significant difference was found (t = -0.143, p = 0.888). The results are shown in figure 3.

Figure 3. Average results for all participants on both tests.

Next, the mean scores of the two different groups, L1 and L2, were compared. The results are shown in figure 4.

Figure 4. Comparison of results of the L1 and L2 groups.

The L1 group had an average score of 34.5 and the L2 group had an average score of 23.75. The L2 group was then compared to the L1 group with a t-test and the difference was significant (t = -7.259, p < 0.001).
To score better than chance on the test, one would need at least 24 correct answers. The L2 group's results are not above chance. However, the data set is very small and the individual differences were large. The L1 group performed significantly above chance.

Production test
The results of the production test showed that the L2 group did differentiate between the two word accents, according to the pattern expected for the area where they acquired Swedish. Figures 5 and 6 show the speaker with a South Swedish-like word accent distinction.

Figure 5. L2 A1, speaker in Helsingborg.

Figure 6. L2 A2, speaker in Helsingborg.

As can be seen, the intonational peak is later in the A2 examples than in the A1 examples. Figures 7 and 8 show examples from one of the informants in Sundsvall.
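The normalization applied to the f0 curves (the minimum Hz value of each word placed at 0 on a semitone scale) amounts to converting each f0 sample to semitones relative to the word's own minimum. A minimal sketch of that conversion (our own function, not the Schötz script itself):

```python
from math import log2

def normalize_semitones(f0_hz):
    """Re-express an f0 contour in semitones relative to the contour's own
    minimum, so every word starts from a common 0 baseline:
    st = 12 * log2(f / f_min)."""
    f_min = min(f0_hz)
    return [12 * log2(f / f_min) for f in f0_hz]

# A value one octave above the word minimum comes out as 12 semitones:
print(normalize_semitones([110.0, 165.0, 220.0]))  # [0.0, ~7.02, 12.0]
```

This makes contours from speakers with different pitch ranges directly comparable, which is what allows the A1/A2 peak timing to be read off figures 5-8.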

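The chance criterion used in the results (at least 24 of 36 same/different trials correct) is what an exact one-tailed binomial test against guessing (p = 0.5, α = 0.05) yields. A small sketch of that computation (the function name is ours):

```python
from math import comb

def chance_threshold(n_trials, alpha=0.05):
    """Smallest score k such that the probability of reaching k or more
    correct answers by guessing (p = 0.5) is below alpha
    (exact one-tailed binomial tail)."""
    total = 2 ** n_trials
    for k in range(n_trials + 1):
        tail = sum(comb(n_trials, i) for i in range(k, n_trials + 1))
        if tail / total < alpha:
            return k
    return n_trials + 1  # alpha never reached (only for very small n_trials)

print(chance_threshold(36))  # 24
```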

Figure 7. L2 A1, speaker in Sundsvall.

Figure 8. L2 A2, speaker in Sundsvall.

Figure 7 shows the early peak expected in A1, and figure 8 shows the double-peaked pattern expected for A2. The results did show some variation in terms of tonal gesture, but there was always a distinction between A1 and A2.

Discussion
First, a reminder of the feature hypothesis: "L2 features not used to signal phonological contrast in L1 will be difficult to perceive for the L2 learner and this difficulty will be reflected in the learner's production of the contrast based on this feature" (McAllister et al., 2002).
The hypothesis states that the perception of the L2 feature will be reflected in the production, but in this study that was not the case. In the perception test, the L2 informants did not show better than chance results; on the contrary, the production test showed that the speakers did differentiate between the word accents, and that they did so somewhat consistently. One question to be posed, though, is how Swedish L1 speakers would evaluate these accents, as Tronnier & Zetterholm (2013b) have done in a follow-up study. The best method would be a discrimination test, but that is not possible with this material since it does not consist of minimal pairs.
There could also be other reasons why the results of the perception test turned out the way they did. One is the construction of the test: the instructions could have been too unclear or misleading, and the minimal pairs used are not that common, especially not in that kind of context. Since the informants did not know what was being tested, they may have missed the target completely. However, the L1 control group got almost all instances correct with the same instructions.
The word accents have been shown to be somewhat redundant (Thorén, 2005), and none of the informants reported having been taught anything about them when they learned Swedish in the first place. However, Van Dommelen & Husby (2009) showed that training in perceiving the Norwegian word accents did not improve the results.
This study confirms the findings of Tronnier & Zetterholm (2013a) in that Somali L1 speakers with Swedish L2 do produce the word accents. This implies that the word accents are more easily accessible to speakers of languages at the same step in the tonal prominence hierarchy. Further research might focus on speakers of different languages with word accents and study whether they learn the word accents of other word accent languages; more evidence on cross-linguistic tone production and perception is also needed. Another thing to look further into is the connection between production and perception, and the reasons why the results turned out the way they did.

Acknowledgements
I would like to thank the informants, and also Susanne Schötz, Mechtild Tronnier and Joost van de Weijer for their help in this study.


References
Boersma, P. & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program]. Version 5.3.74, retrieved 24 April 2014 from http://www.praat.org/
Bruce, G. (2010). Vår fonetiska geografi. Lund: Studentlitteratur.
Burnham, D., Francis, E., Webster, D., Luksaneeyanawin, S., Attapaiboon, C., Lacerda, F. & Keller, P. (1996). Perception of lexical tone across languages: evidence for a linguistic mode of processing. Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), vol. 4 (pp 2514-2517).
Flege, J. E. (1995). Second language speech learning: Theory, findings and problems. In: Strange, W. (ed.) Speech perception and linguistic experience: Theoretical and methodological issues (pp 233-277). Timonium, MD: York Press.
Gottfried, T. L. & Suiter, T. L. (1997). Effect of linguistic experience on the identification of Mandarin tones. Journal of Phonetics 25 (pp 207-231).
Kaiser, R. (2011). Do Germans produce and perceive the Swedish word accent contrast? A cross-language analysis. TMH-QPSR, 51(1) (pp 93-96).
McAllister, R., Flege, J. E. & Piske, T. (2002). The influence of L1 on the acquisition of Swedish quantity by native speakers of Spanish, English and Estonian. Journal of Phonetics 30 (pp 229-258).
Saeed, J. I. (1987). Somali Reference Grammar. Wheaton, MD: Dunwoody.
Schaefer, V. & Darcy, I. (2013). Cross-linguistic perception of Thai tones is shaped by the functional prominence of lexically-contrastive pitch in L1. Presentation at New Sounds 2013, Montreal.
Schötz, S., Bruce, G. & Segerup, M. (2011). Dialektal variation i svensk ordmelodi – sammansatta ord. Svenskans beskrivning 31. Umeå: Umeå universitet.
Thorén, B. (2005). Andraspråkstalares realisering av svenskans kvantitetsdistinktion. Svenskans beskrivning 28. Örebro: Örebro universitet.
Tronnier, M. & Zetterholm, E. (2013a). Tendencies of Swedish Word Accent Production by L2 Learners with Tonal and Non-Tonal L1. Nordic Prosody, Proceedings of the XIth Conference (pp 391-400). Tartu.
Tronnier, M. & Zetterholm, E. (2013b). Appropriate tone accent production in L2 Swedish by L1 speakers of Somali? Proceedings of New Sounds, Montreal: Concordia Working Papers in Applied Linguistics. (In manuscript)
Van Dommelen, W. & Husby, O. (2009). Perception of Norwegian word tones by Chinese and German listeners. In: M. A. Watkins, A. S. Rauber & B. O. Babtista (eds.), Recent research in second language phonetics/phonology: Perception and production (pp 308-321). Newcastle upon Tyne: Cambridge Scholars Publishing.


The confusing final stops in L2 acquisition
Elisabeth Zetterholm
Department of Swedish, Linnaeus University, Sweden
[email protected]

Abstract
There is often a relationship between perception and production in second language acquisition, depending on phonological transfer from the speaker's first language (L1). A production study of Karen speakers learning Swedish as their second language (L2) suggests that they have difficulties learning the pronunciation of Swedish, especially final stops. In order to find out whether Karen speakers perceive phonological contrasts in Swedish words produced by a Swedish native speaker, a perception test was constructed. The results indicate a phonological transfer from L1 to L2. Another observation concerns the role of age during second language acquisition.

Introduction
The degree of success when learning a second language depends on different factors, external as well as internal, such as linguistic characteristics, cognition, the learner's social background and language input (Abrahamsson & Bylund, 2012). Phonological rules in L1 are often transferred and have an impact on the acquisition of the phonology and the production in an L2 (Major, 2008). It is important that learners perceive phonological contrasts before further processing. Earlier research shows a relationship between perception and production (e.g. Escudero, 2005; Ioup, 2008; McAllister, 1995). Second language learners often produce segments and prosody with a characteristic accent related to their L1. Do they have the same difficulties perceiving phonetic segments and phonological distinctions in their L2? For adult learners, but not for young children, it is sometimes confusing to perceptually differentiate phonetic contrasts in L2 (Strange & Shafer, 2008).
The theory about a critical age for second language acquisition (SLA) by Lenneberg (1967), the critical period hypothesis (CPH), has been much discussed. Children are often more successful in their SLA than adults, and a native-like L2 phonology is only found with a very early age of onset (AO). Two remaining questions are why this seems to be the case and whether there is a certain critical age. For an overview of earlier research in this area, see Abrahamsson (2012) and Ioup (2008). A recent production study by Abrahamsson (2012) confirms the relationship between AO and L2 competence and shows that L2 participants with nativelikeness on both the GJT (Grammatical Judgment Test) and VOT (Voice Onset Time) had an AO between 1 and 6 years. It seems that early learners have developed both grammatical and phonetic aspects of L2; in contrast, this is not the case for late learners.
Earlier studies show that second language learners of Swedish have difficulties with the pronunciation (e.g. Bannert, 1990; Zetterholm, 2014a, 2014b). In this paper, a study of L2 speakers' ability to identify Swedish words is presented. The aim is to find out whether there is any correlation between perception and production for Karen people learning Swedish as their second language.

Karen people in Sweden
Since 2004 Sweden has admitted Karen refugees. This is an ethnic group living in South East Asia, specifically in Burma (or Myanmar), southern China and the northern part of Thailand. There are approximately 1000 Karen people living in different cities all over Sweden today. Most of them lived in refugee camps in Thailand before arriving in Sweden; the youngest were born in the camps and went to school there. None of the participants in this study have any kind of higher education from Burma, some of them went to school only for a few years, and most of them lived in very small villages or in the forest before their escape to Thailand. There is no information about their lives and what impact that might have on their language acquisition. The adults speak Burmese as well as Karen, a few of them have some knowledge of English, and the teenagers in this study learn English in school.
An earlier study (Zetterholm, 2014a) shows that many of them, chiefly those middle-aged or older, have great difficulties learning Swedish. Their children, teenagers or slightly older, seem to have few or no problems with their acquisition of L2. The adults have been studying Swedish at SFI and the younger ones at school. Still, after eight years the adults cannot speak Swedish with an intelligible pronunciation. One of the more specific problems is the lack of pronunciation of the last consonant in Swedish words. Without a clear context it is hard to know whether the speakers mean två [tvoː] (two) or tvål [tvoːl] (soap), tå [toː] (toe) or tåg [toːɡ] (train), or vi [viː] (we) or vit [viːt] (white). The learners also have problems producing aspirated voiceless initial plosives.

Karen languages
The Karen languages are a Tibeto-Burman branch of the Sino-Tibetan phylum, consisting of many different dialects. They all have influences from Burmese and other nearby languages. There are probably 20-30 different Karen languages, but the exact number is not known (Manson, 2011). Some of the dialects, especially those spoken in the mountains, are difficult to understand for other Karen speakers (Manson, 2011). The languages are classified into three main groups: Northern, Central and Southern (Bradley, 1997). The two major dialects are called Sgaw and Pwo Karen. These are mainly spoken in the southern part of Burma and there are more than 1 million speakers of each of the two dialects. Sgaw is a lingua franca among Karens (Naw, 2011). The alphabet derives from Burmese characters with some modifications. The word order is SVO, which is uncommon among Tibeto-Burman languages (mostly SOV) (Naw, 2011).
The syllable structure in the Karen languages is CV or C1(C2)VT. C1 is any consonant; C2 is a voiced velar fricative [ɣ], a voiced bilabial approximant [w], a voiced alveolar approximant [l] or a voiced alveolar trill [r]; V is a vowel and T is a tone. A clear glottal stop always precedes a single initial vowel, since no words can be produced with an initial vowel (Naw, 2011). There are two varieties of Sgaw Karen, called Moulmein and Bassein Sgaw Karen (Jones, 1961). They have 9 vowels and 27 or 23 consonants, respectively. All consonants can occur in initial position, but there are no consonants in syllable-final position. There are three aspirated and non-aspirated voiceless plosives /p pʰ t tʰ k kʰ/ as well as two voiced plosives /b d/, but no voiced velar plosive /ɡ/. One of the dialects has 6 tones, two high, two mid and two low; the other dialect has 5 tones. The tone system is based on voice quality, f0 and duration.

The study
A perception test was designed to get an idea of second language learners' ability to perceive different vowel and consonant contrasts in Swedish words. A pilot study, with only Karen participants, is presented here.

Method and material
The test consisted of words produced by a Swedish native speaker. The target

word was included in the sentence "Jag säger … igen" (I say … again). Two pictures, illustrating the target word and a word with a slight phonemic difference, were shown on the screen at the same time as the participants heard the target word. This was done to help listeners who did not know the Swedish word, even though the words were quite common Swedish words. The sentence with the target word was repeated twice. It was not possible to go back and listen again, since the test ran automatically with a short pause between each new word. The test was performed in front of a Mac computer using the built-in loudspeakers. On an answer sheet the participants saw the same two pictures as on the computer screen, and they had to choose one of the words by making a cross in a square below that picture. They did not see the written words. There were 46 pairs of words in the test. Examples of the word pairs are tå-tåg (toe-train), bur-burk (cage-jar), bo-bok (nest-book), två-tvål (two-soap) and sol-stol (sun-chair). The words were chosen following earlier research results about pronunciation difficulties among second language learners of Swedish regardless of their first language (e.g. Bannert, 1990). See figure 1 for one example of the pictures of word pairs used in the test.

Figure 1. One pair of words used in the test, tå-tåg (toe-train).

Participants
Thirteen people with Karen as their native language participated in this study, seven female and six male speakers with a mean age of 32 years. All of them speak the Sgaw Karen dialect. They all live in different small towns in the central part of Sweden and have been in Sweden for seven or eight years.
As a control group, 16 native Swedish speakers took the test, nine female and eight male, with a mean age of 32 years. They speak different Swedish dialects, but they all live in the southern part of Sweden.

Results
Every native speaker of Swedish answered correctly in all examples. This is an indication that all words in the recordings are pronounced clearly and understandably.
In general, the Karen participants performed quite well in the test. Only one participant, a 15-year-old female, answered all 46 words correctly. The teenagers and the participants younger than 23 years old, as well as the oldest male participant, had more than 40 correct answers out of 46. See figure 2. This raises questions about the significance of age of onset in second language learning.

Figure 2. Number of correct answers in correlation to age of the participants.

The age of onset (AO) for the participants is shown in Table 1. There is a minor indication of a correlation between AO and correct answers, except for one male participant. Those who had begun their acquisition of Swedish before the age of 15 performed best in the test. However, the differences are small.
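The "minor indication of a correlation" can be quantified directly from the Table 1 values. A quick sketch (pure-Python Pearson r over the AO and correct-answer columns; on these values r comes out moderately negative, roughly -0.57, weakened by the outlying oldest participant):

```python
from math import sqrt

# Values from Table 1: age of onset (AO) and correct answers per participant.
ao      = [7, 8, 10, 10, 12, 15, 18, 27, 33, 35, 37, 45, 49]
correct = [46, 45, 45, 44, 43, 41, 36, 39, 38, 39, 36, 37, 44]

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

print(round(pearson_r(ao, correct), 2))  # ≈ -0.57
```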


Table 1. Age of onset (AO) and actual age in relation to correct answers.

AO   Age   Correct answers
 7    15    46
 8    16    45
10    19    45
10    19    44
12    21    43
15    23    41
18    26    36
27    36    39
33    41    38
35    43    39
37    46    36
45    54    37
49    58    44

The most difficult words to discriminate seem to be fyra-fira [fyːra/fiːra] (four-celebrate) and fluga-flyga [flʉːɡa/flyːɡa] (fly-fly), probably owing to the front rounded vowels. This is a known general problem for second language learners of Swedish, regardless of L1 (Bannert, 1990).
However, an interesting observation for the Karen speakers is that the word pairs tå-tåg [toː/toːɡ] (toe-train) and två-tvål [tvoː/tvoːl] (two-soap) are confusing for the listeners. When the target word is tå they seem to be more uncertain than when the target word is tåg, and the same pattern is shown for två and tvål. This shows that when the last consonant is pronounced clearly there is almost no hesitation, but a word without a final consonant is confusing.
There is no voiced velar plosive /ɡ/ in the Karen languages, and the target word tagg [taɡ] (thorn) is confusing when presented in a pair with the word tack [takʰ] (thanks).
Identifying words with other vowel or consonant contrasts causes no problems for the listeners, for example the vowel contrast /e ø/ in lek-lök [leːkʰ/løːkʰ] (play-onion) or the consonant contrasts /l n/ in läsa-näsa [lɛːsa/nɛːsa] (read-nose) and /pʰ b/ in puss-buss [pʰʉs/bʉs] (kiss-bus).

Discussion
The preliminary results of this study raise questions about the importance of age of onset for perception in second language learning, the relation between perception and production, and phonological transfer from L1 to L2.
Earlier research indicates that an early age of onset, before the age of six, is better for acquiring a native-like L2 phonology (Abrahamsson, 2012; Ioup, 2008). Since there is a relationship between perception and production (e.g. Escudero, 2005; Ioup, 2008), one could expect that younger learners, who have a good command of their L2, should perceive words in L2 better than older learners. The youngest participant in this test had an AO of seven years and performed best in the test. All participants with an AO of 15 years or less had 41 (out of 46) or more correct answers. This is in accordance with earlier research showing that AO is of importance. The exception is the male speaker with an AO of 49 years. He has some knowledge of English, and that, together with a strong motivation to learn Swedish, might be one explanation for his results.
The results of this perception study confirm the hypothesis about the relation between perception and production as well as the transfer theory. The production study by Zetterholm (2014a) shows great difficulties for adult Karen speakers in producing final consonants. This perception study indicates that Karen speakers, especially middle-aged ones, have difficulties identifying and discriminating words with a final phonological contrast. When the target word is without a final consonant, it seems to be confusing for the listeners. They do not see the written word, only the pictures while hearing the voice, so they are not sure whether the word is spelled with a final consonant or not. They are not provided the spelling, which could have been a clue when identifying Swedish words. The final voiced plosive /ɡ/ in contrast to the voiceless plosive /kʰ/ also caused the listeners some identification problems. Sgaw Karen has both an aspirated and a non-aspirated velar voiceless plosive /k kʰ/, but no voiced velar plosive, and plosives never occur in final position. The /pʰ b/ contrast causes no problems: the two phonemes, as well as a voiceless non-aspirated /p/, belong to the phoneme inventory of Sgaw Karen, and the phonemic difference is familiar to the listeners. The listeners' confusion about the final stops and the /ɡ kʰ/ contrast might indicate that they are aware of the structure and the phonological contrast in Swedish words. However, they are not quite sure without clues when listening, and therefore a phonological transfer of the structure in L1 is a possible explanation.

Conclusions
This perception study corresponds with the hypothesis about transfer of the Karen speakers' L1 phonology to L2. It is obvious that many of the participants have problems identifying final stops in Swedish, which is in accordance with Karen speakers' production skills in Swedish. Another observation is that the younger learners are better in this identification test than the older learners, except for one of the male participants.

Acknowledgements
The research was partly supported by Birgit and Gad Rausings stiftelse, Åke Wibergs stiftelse and Magnus Bergvalls stiftelse, Sweden. Special thanks to all the Karen speakers I have met so far.

References
Abrahamsson, N. (2012). Age of onset and nativelike L2 ultimate attainment of morphosyntactic and phonetic intuition. Studies in Second Language Acquisition, 34, 187-214.
Abrahamsson, N. & Bylund, E. (2012). Andraspråksinlärning och förstaspråksutveckling i en andraspråkskontext. In: K. Hyltenstam, M. Axelsson & I. Lindberg (eds.) Flerspråkighet – en forskningsöversikt. Vetenskapsrådets Rapportserie 5:2012, 153-246.
Bannert, R. (1990). På väg mot svenskt uttal. Studentlitteratur.
Bradley, D. (1997). Tibeto-Burman languages and classification. In: Papers in Southeast Asian Linguistics 14. Pacific Linguistics A-86:1-71.
Escudero, P. (2005). Linguistic Perception and Second Language Acquisition. Explaining the attainment of optimal phonological categorization. Doctoral Dissertation, Utrecht University, LOT Dissertation Series 113.
Ioup, G. (2008). Exploring the role of age in the acquisition of a second language phonology. In: J.G. Hansen Edwards & M.L. Zampini (eds.). Phonology and Second Language Acquisition. John Benjamins Publishing Company.
Jarvis, S. & Pavlenko, A. (2008). Crosslinguistic Influence in Language and Cognition. Routledge.
Jones, R.B. Jr. (1961). Karen linguistic studies: Description, comparison, and text. Berkeley: University of California Press.
Lenneberg, E. (1967). Biological Foundations of Language. New York: Wiley & Sons.
Major, R.C. (2008). Transfer in second language phonology: A review. In: J.G. Hansen Edwards & M.L. Zampini (eds.). Phonology and Second Language Acquisition. John Benjamins Publishing Company.
Manson, K. (2011). The subgrouping of Karen. http://academina.edu/620539/The_subgrouping_of_Karen_languages
McAllister, R. (1995). Perceptual foreign accent and L2 production. In: K. Elenius & P. Branderud (eds.) The XIIIth International Congress of Phonetic Sciences (Vol. 4, pp 570-573). Stockholm University/KTH.


Naw, V. (2011). The Phonology of Dermuha and a Phonological and Lexical comparison between Dermuha, Sgaw Karen and Pwo Karen. Thesis. Payap University, Chiang Mai, Thailand.
Strange, W. & Shafer, V.L. (2008). Speech perception in second language learners: The re-education of selective perception. In: J.G. Hansen Edwards & M.L. Zampini (eds.). Phonology and Second Language Acquisition. John Benjamins Publishing Company.
Zetterholm, E. (2014a, in press). Final stops or not? The importance of final consonants for an intelligible accent. Proceedings of PSLLT 2013, Iowa State University.
Zetterholm, E. (2014b, in press). Vowel Length Contrast and Word Stress in Somali-Accented Swedish. Proceedings of the International Symposium on the Acquisition of Second Language Speech. Concordia Working Papers in Applied Linguistics, 5.


Observed pronunciation features in Swedish L2 produced by two L1-speakers of Vietnamese
Mechtild Tronnier (1), Elisabeth Zetterholm (2)
(1) Centre for Languages and Literature, Lund University, Lund, Sweden
(2) Department of Swedish, Linnaeus University, Sweden
[email protected], [email protected]

Abstract

Immigrants with Vietnamese as their L1 have been living in Sweden for a couple of decades. Vietnamese L1-speakers are also currently present in the SFI classroom. The aim of this contribution is to present observed pronunciation features in L2-Swedish; it is based on material produced by two L1-speakers of Vietnamese. We will also discuss those features of L2-pronunciation which lead to serious communication problems.

Introduction

Since the late 70s the migration of Vietnamese L1-speakers has occurred all over the world. According to the Swedish Migration Board (www.migrationsverket.se), about 745 immigrants from Vietnam were granted residence in 2013, most of them being relatives of other residents in Sweden.

Speakers with Vietnamese as their L1 are found in the classrooms where Swedish as a second language is taught (Tronnier & Zetterholm, 2011). It has also been reported by teachers of Swedish as a second language that learners of Swedish with an East Asian language as their L1 are those who have the greatest difficulties in acquiring Swedish pronunciation and are the most difficult to understand.

In this contribution, an investigation of observed pronunciation features in Swedish L2 is presented, and their importance for successful communication will be reflected on. The analysed material was produced by two speakers with Vietnamese as their L1. In addition, the sound inventory of the Vietnamese language is provided in a contrastive perspective, i.e. in comparison with Swedish.

The Vietnamese language and its sound system

The Vietnamese language is mainly spoken in (the Socialist Republic of) Vietnam and is a member of the Mon-Khmer branch within the Austroasiatic language family. There are three major dialects of Vietnamese: the Northern (Hanoi) variety, the Southern (Ho Chi Minh City/Saigon) variety and a (Northern) Central variety. An overview of the phonological system of the Northern variety is given in the following; dialectal variation, however, occurs.

Vowels

The vowel system of Vietnamese comprises twelve monophthongs: /i ɯ u e ɤ o ɛ ʌ ɔ ɐ a ɒ/ (Garlén, 1988). These overlap to a great extent with the monophthongs of Swedish, but the Swedish front rounded vowels /y ø/ do not occur in Vietnamese.

In addition, Vietnamese has diphthongs and triphthongs. All monophthongs can commence a diphthong which finishes with [ɪ] (e.g. [əɪ]), all monophthongs which are not back vowels can commence a diphthong which finishes with [ʊ] (e.g. [ɛʊ]), and high monophthongs can commence a diphthong which finishes with [ə] (e.g. [ɨə]). Triphthongs commence with a high vowel, include a schwa [ə] in the central part and end with a high vowel (e.g. [ɨəʊ]). In summary, Vietnamese is very rich in vocalic sounds.


Consonants

Vietnamese has the following stop consonants: /p t tʰ c k ʔ ɓ ɗ/, of which /ɓ ɗ/ are pre-glottalised and voiced, but do not always result in an implosive. The nasal consonants are /m n ɲ ŋ/ and the fricatives are /f v s z x ɣ h/. The following approximants also occur: /ɹ j w l/. There is some overlap between the Vietnamese and Swedish consonants, but some of the Swedish fricatives are missing (/ɕ ɧ/), and the voiced stops in Vietnamese have the further dimension of pre-glottalisation.

Syllable structure

Apart from the rare occurrence of reduplications and compounds, Vietnamese words are monosyllabic. The syllable must commence with a consonant or the approximant /w/. The vocalic nucleus is compulsory. The introductory consonant can be any of the Vietnamese consonantal phonemes except /p/ (exceptions are loanwords from French). In addition, the initial consonant – if not labial – may occur together with the approximant /w/, but no other initial combinations are acceptable. The nucleus may consist of either a long or a short vowel, a diphthong or a triphthong. The syllable may end with the vocalic nucleus, which may also be followed by one or two approximants, a single consonant or the combination of an approximant and a consonant. The permissible consonants in final position are /p t c k m n ɲ ŋ/. In summary, the formula of the syllable structure is: (C1)(w)V(G/C2).

In comparison, in Swedish, complex consonant clusters may occur in initial and final position in the syllable – up to three consonants for root morphemes. More consonants in final position are permitted if the cluster embodies multiple morphemes.

Prosody

As most of the words in Vietnamese are monosyllabic, lexical stress is not a prosodic feature, unlike in Swedish. As has been shown in the section on vowels, Vietnamese has some vowels which are distinguished by length, e.g. /eː e/ and /a aː/. This feature is shared with Swedish, although length distinction occurs on a greater number of vowels in Swedish.

A typical and salient prosodic feature of Vietnamese is the occurrence of tones (Nguyễn & Edmondson, 1998). Depending on the dialect, there are five or six tones. Contrast in tone is, however, not only based on melodic variation, but also on phonation type, intensity and length. The six tones of Northern Vietnamese are: ngang, mid-level; huyen, low falling and breathy; sac, mid rising and tense; nang, mid-falling, glottalised and short; hoi, mid falling-rising and harsh; nga, mid rising and glottalised. In southern dialects the nga tone is integrated into the hoi tone, which results in only five tones.

Lexical distinction based on tonal characteristics is common to both Vietnamese and Swedish. Tonal lexical features in Vietnamese are assigned to every syllable. In Swedish, however, they occur only on stressed syllables, as Swedish is a language which accommodates multisyllabic words. Swedish is classified as a tone accent language, whereas Vietnamese is classified as a tone language.

The present study

The present study is based on recordings made of two female speakers of Vietnamese living in the southern part of Sweden. Both speakers are fluent in conversational Swedish and both have an academic background. One of the speakers reported a good command of English. The speakers were recorded reading Swedish sentences and a short text, and describing a picture story. The sentences were compiled so that words containing all Swedish vowels and consonants and most of the Swedish consonant clusters were present in the material. Furthermore, minimal word pairs – that is to say, words that are contrasted by quantity characteristics, stress placement or word accents – were built

into the sentences. Many of these target words were also present in the short text and supplemented by further words, e.g. compound words.

The recorded material was then auditorily analysed, and pronunciation peculiarities which did not match the expected pronunciations for Swedish were transcribed. With regard to the adequacy of the production of tonal word accents, a separate study was carried out; the procedure and results of that study are published elsewhere (Tronnier & Zetterholm, forthcoming). In the following, the observed discrepancies in the pronunciation of L2-Swedish as produced by the two L1-speakers of Vietnamese are presented.

Observed variation: vowels

In many cases both speakers produced the rounded front vowel /y/ rather like the corresponding unrounded vowel /i/. In the case of the long vowel, some shade of an approximant is added. The first vowel in the word mycket “a lot” [mʏkə] thus results in [i], and the vowel in ny [nyː] in [iʲ]. Further variations occur for the vowel /ʉ̟/ in the words huset “the house” [hʉ̟ːsət] and ut “outside” [ʉ̟ːt], which is realised as either [uː]: [huːsət] or [yː]: [yːt].

No adequate difference is made in quality between the vowels /eː/ and /ɛː/, and sometimes the more open vowel [ɛː] is preferred, so that lekar “play” [leːkaʁ] is pronounced *[lɛːkaɹ], and vice versa, so that äta “eat” [ɛːta] becomes *[eːta].

The clear difference in quality for the phoneme /a/ between the long [ɑː] and the short [a] that is required in Swedish is not made by the L2-speakers. An example is gran “fir tree” [gʁɑːn], which results in *[gɹaːn]. More about the mismatch between the distinguishing differences in vowel length and quality will be discussed below.

Some divergence in pronunciation might have occurred due to the Swedish letters <ä å ö>.

Observed variation: consonants

Most of the consonantal divergence is based on the differences in the phonotactic structure of Vietnamese compared to Swedish. Some consonants, however, are pronounced differently regardless of the variation in phonotactic rules between the two languages. In many cases, phonemes that are not permissible in initial or final position in a syllable in Vietnamese are omitted in those positions. This is the case both when these consonants occur alone and when they occur in clusters. In other cases, replacement takes place.

The consonant [ɡ] occurs as an allophone of /ɣ/ in Vietnamese only in syllable-initial position and if the preceding syllable ended with [ŋ]. It has been observed that [ɡ] was inserted in words which have /ŋ/ in word-medial position by the L2-speakers when speaking Swedish. Thus the word pengar “money” [pʰɛŋaʁ] was pronounced as *[pʰɛŋɡaʁ]. The pronunciation of /ɡ/ in Swedish L2 was found to be deviant and took on a variety of different shapes. In initial position, and if not in a cluster, /ɡ/ was pronounced as [j ɣ w ɹ k ŋ], and in final position it was pronounced [ɰ j] or was omitted. Examples are the words gult “yellow” [ɡʉ̟lt], which was pronounced *[wuːd], ganska “quite” [ɡanska] as *[janska], flög “flew” [fløɡ] pronounced as *[flœɰ] and svag “weak” [svɑːɡ] as *[svɑː].

The lateral phoneme /l/ is also subject to variation in the production of L2-Swedish. However, it is sometimes produced correctly under all conditions: alone or in clusters, in syllable-initial and final position. Omission or replacement mainly occurs in syllable-final position, both if /l/ occurs singly or in a cluster, and also when a syllable-final lateral contributes to a word-medial cluster: en del “a part” [ɛn deːl] becomes *[ɛ̃ deː], golf “golf” [ɡɔlf] becomes *[ɡɔːf], Malmö (Swed. city) [malmœ] becomes *[mamœ]. Replacement in final position mainly took the

form of some kind of nasalization, in that either the lateral was replaced by a nasal consonant, so that segelbåt “sailing boat” [seːɡəlboːt] became *[sɛɡɛnboːt], or the preceding vowel became nasalized and the lateral was dropped: stol “chair” [stuːl] became *[stõː]. In the case of /l/ in word-medial position, nasalization of the preceding vowel has been observed to co-occur with a maintained lateral consonant, as in innehålla “contain” [ɪnːəhɔlːa], which is pronounced as *[ɪ̃nːəhɔ̃ːlã].

Nasal consonants in final position were also subject to deletion in many cases. Here, too, the preceding vowel was strongly nasalized: min man “my husband” [miːn manː] was pronounced as *[mɪ̃ː mãː] and lingon “lingonberry” [lɪŋɔn] as *[lɪ̃ŋɔ̃]. The rules of contact assimilation required in Swedish for the nasal consonant /n/, also across word boundaries, were violated in many cases, and instead a sound which is produced further away from the adequate place of articulation was used. An example is ibland måste man “sometimes you have to…”, which in very clear speech results in [ɪbland mɔstə manː] and in effortless but acceptable speech can become [ɪblam͡mɔstə manː]. The L2-speakers, however, used a non-permissible nasal consonant in the transition between the two words: [ɪblaŋ mastə].

The phoneme /r/ has many allophones in Swedish, and some of the various pronunciations of /r/ by the L2-speakers overlap with the acceptable allophones. In some cases, however, some of those allophones were inadequately placed and therefore sounded deviant. In syllable-final position, /r/ was often replaced by a vowel in L2-Swedish, which is possible in native Swedish as well, but the L2-speakers inserted an unusual vowel here, which also seemed to be too long or at least too prominent: mörk “dark” [mœʁk] was pronounced as *[mœ͡ɐk]. Any trace of /r/ was also found to be completely omitted in syllable-final position, both as a single phoneme and in a cluster: kyrkan “the church” [ɕʏɣkan] is pronounced as *[ɕʏːkan] and orm “snake” [ʊʁm] is pronounced as *[ʊːm]. One further variety which is used for /r/ by both speakers is the approximant [ɻ], a sound which is part of the sound system of English.

The fricative /s/ only occurs initially in Vietnamese and is – if pronounced at all – also correctly pronounced in L2-Swedish. In syllable-final position, /s/ is sometimes omitted, both when it occurs as a single consonant in that position and when it is part of a consonant cluster, e.g. hennes “hers” [hɛnːɛs] is realized as *[hɛ̃nɛ] and hans “his” [hanːs] as *[hãn] or also *[hãː].

Different strategies were pursued for consonant clusters in syllable-initial position when the last consonant of such a cluster was /l/ or /r/. Vowel insertion occurred occasionally, as in gräset “the grass” [ɡʁɛːsət], which results in *[ɡəɻɛːsət]. Deletion of one of the consonants in the initial cluster has also been observed, as in bråkiga “rowdy” [bʁoːkɪɡa], which was pronounced as *[poːkɪɰa], and frukt “fruits” [fʁʊkt] as [fʊː]. Contextual devoicing of /r/ in a cluster, which may partially occur in L1-Swedish, appeared to be more prominent in L2-Swedish, so that trött “tired” [tʁœtː] resulted in [tɹ̥œtː]. In many cases the initial clusters – mainly those introduced by /s/ – were correctly pronounced.

Consonant clusters in syllable-final and word-medial position were very much subject to deviation. Omission of one or more elements in the cluster occurred, e.g. for the word frukt “fruits” [fʁʊkt], which was pronounced as *[fʊː], ibland “sometimes” [ɪbland] as *[ɪblaːn] and konst “arts” [kʰɔnst] as *[kɔ̃ŋs]. Problems in medial clusters occur mainly when the cluster consists of three or more consonants; then omission of one or more elements takes place, as in riksdagen “the Swedish parliament” [ʁiːksdɑː(ɡə)n] which became *[ɹɪ͡jsdɑːɡɛn], arbetslivet “working life” [aɰbetsliːvət] which became


*[aɹbesliːvet], plötsligt “suddenly” [plœtslit] which became *[plɔsɪt], flygplatsen “airport” [flyːɡplatsən] which became *[flyːplaːsən] and konstbok “art book” [kʰɔnstbuːk] which became *[kʰɔ̃sbʊk]. The fricative /s/ seems to be strong and is maintained in medial position; instead, other consonants are omitted.

Observed variation: prosody

Placement of stress on the wrong syllable occurs occasionally, but is not very salient. What is more noticeable is that inaccurate vowel length is produced. Thus a longer vowel is preferred, and the target word löss “lice” [lœsː] is not distinguishable from the word lös “loose” [løːs], glass “ice cream” [ɡlasː] is not distinguishable from glas “glass” [ɡlɑːs] and villan “the house” [ˈvɪlːan] is not distinguishable from vilan “the recreation” [ˈviːlan]. In some cases, correct length is produced but the wrong vowel quality is chosen, as has been mentioned in the section on vowels.

It has been shown in earlier studies (Tronnier & Zetterholm, 2013; Tronnier & Zetterholm, forthcoming) that the Swedish tone accent distinction produced by Vietnamese L1-speakers was inaccurate. The results from identification tests revealed that listeners with Swedish as their L1 judged most of the stimuli as belonging to words with one particular word accent of the two which are possible – namely Accent 2 – in most cases. The interpretation is therefore that the Vietnamese L2-speakers of Swedish do not have command of the Swedish accent distribution, and that their preferred use of a tonal contour identified as representative of Accent 2 might be related to tonal patterns relevant in Vietnamese.

Discussion

The description of pronunciation variation produced by the two L1-speakers of Vietnamese when speaking L2-Swedish presented above leads to the assumption that issues concerning consonants are of greater significance. The production of individual consonants is not a serious complication in most cases; instead, it is the omission or replacement of these that leads to problems in communication.

More specifically, the occurrence of a distinct nasalization of vowels accompanied by the deletion of not only nasal consonants but also /l/ hampers following and understanding the flow of speech. The replacement of the initial consonant [ɡ] only leads to a communication dilemma when the replacing sound comprises many articulatory features which are different from those of [ɡ], such as [w]. The pronunciation of diverse varieties of sounds for /r/ is not a very intricate problem. Even mispronunciation of vowel quality is not so significant, unless it is combined with incorrect vowel length and there is a minimal pair in Swedish.

Clusters are problematic for the L2-speakers primarily in word-medial and final position. In initial position, clusters sometimes lead to difficulties in comprehension if an extra vowel gets inserted. This introduces an extra syllable and breaks up the cluster. No vowel insertion occurs in medial or final position, where one or more consonants are omitted instead. An interesting observation is that although /s/ can be omitted if it occurs as a single final consonant, it often is not one of the consonants omitted in a cluster.

With regard to prosodic features, it is vowel insertion into clusters and the deviation from expected vowel length that lead to a disruption of the expected flow of speech and can therefore trigger miscommunication.

It can also be stated that if several types of variation occur in one word, it makes it more difficult for the listener to understand the intended word.

Summary

Problems of miscommunication in L2-Swedish produced by L1-speakers of Vietnamese are based on numerous and complex factors. The major complications are related to missing consonants in word-medial and syllable-final positions. This is the case whether there is supposed to be a single consonant or the omitted consonant is part of a cluster. In addition, the alteration of the rhythm of speech – due to either vowel insertion, which results in an extra syllable, or an unexpected variation of vowel length – can lead to misunderstandings. The more unusual types of pronunciation variation are produced per word, the more incomprehensible the word becomes.

References

Garlén, C. (1988). Svenskans fonologi. Lund: Studentlitteratur.
Nguyễn, V. L. & Edmondson, J. A. (1998). Tones and voice quality in modern northern Vietnamese: Instrumental case studies. Mon-Khmer Studies, 28, 1–18.
Swedish Migration Board (2014). Retrieved from http://www.migrationsverket.se
Tronnier, M. & Zetterholm, E. (2011). New foreign accents in Swedish. Proceedings of the 17th International Congress of Phonetic Sciences (ICPhS 2011), Hong Kong, 2018–2021.
Tronnier, M. & Zetterholm, E. (2013). Tendencies of Swedish word accent production by L2-learners with tonal and non-tonal L1. In E. L. Asu & P. Lippus (eds), Proceedings of Nordic Prosody XI (15–17 August 2012), 391–400. Tartu, Estonia.
Tronnier, M. & Zetterholm, E. (forthcoming). Swedish word accent production by L2-speakers with different tonal L1s. In Proceedings of TAL 2014.
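As an editorial aside, the Vietnamese syllable template (C1)(w)V(G/C2) given in the Syllable structure section lends itself to a mechanical check. The sketch below is our illustration, not part of the paper: it encodes the onset, medial-/w/ and coda constraints stated in the Consonants and Syllable structure sections for pre-segmented syllables, under simplifying assumptions (tones are ignored, the glide slot and French loanwords with onset /p/ are left out, and phonemes are plain strings).

```python
# Illustrative validator (our sketch, not the authors') for the Vietnamese
# syllable template (C1)(w)V(G/C2) as described in the text.

ONSETS = {"t", "tʰ", "c", "k", "ʔ", "ɓ", "ɗ",      # stops (plain /p/ excluded)
          "m", "n", "ɲ", "ŋ",                       # nasals
          "f", "v", "s", "z", "x", "ɣ", "h",        # fricatives
          "ɹ", "j", "w", "l"}                       # approximants
CODAS = {"p", "t", "c", "k", "m", "n", "ɲ", "ŋ"}    # permissible final consonants
LABIALS = {"p", "ɓ", "m", "f", "v", "w"}            # may not combine with medial /w/

def is_valid_syllable(onset, medial_w, nucleus, final):
    """Check one segmented syllable against (C1)(w)V(G/C2), coda simplified
    to a single consonant (the glide slot is omitted here)."""
    if onset is None or onset not in ONSETS:
        return False      # the syllable must commence with a consonant or /w/
    if medial_w and onset in LABIALS:
        return False      # medial /w/ only after a non-labial onset
    if not nucleus:
        return False      # the vocalic nucleus is compulsory
    if final is not None and final not in CODAS:
        return False      # only /p t c k m n ɲ ŋ/ may close the syllable
    return True
```

For example, a syllable with onset /t/, nucleus /a/ and coda /n/ is accepted, whereas a labial onset combined with medial /w/, or a coda /s/, is rejected.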


Consonant inventory of Swedish speaking 24-month-olds: A cross-sectional study

Emilie Gardin1, Maria Henriksson1, Emilia Wikstedt1, Marie Markelius2, Lena Renner2
1 Department of Clinical Science, Intervention and Technology, Karolinska Institutet, Stockholm, Sweden
2 Department of Linguistics, Stockholm University, Stockholm, Sweden
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

This cross-sectional study examines the consonant inventory of Swedish-speaking twenty-four-month-olds. The results are compared with English-speaking children of the same age. 15 audio files recorded from 13 children were transcribed using independent analysis. Individual inventories were constructed for both word-initial and word-final consonants for each subject. The results are to a high degree consistent with the findings in the study compared. Anterior consonants are more frequent in the subjects’ inventories than posterior ones in both initial and final word position. Word-initial voiced plosives are more common in the inventories than voiceless ones, with the reverse situation in word-final position, i.e. voiceless plosives are more frequent than voiced.

Introduction

A number of studies have investigated the phonological patterns, phonological acquisition and consonant inventories of English-, Dutch- and African-American-speaking twenty-four-month-olds (Stoel-Gammon, 1985; McIntosh & Dodd, 2008; van Severen et al., 2012; Bland-Stewart, 2003). However, there is a gap when it comes to Swedish-speaking two-year-olds and their consonant inventory.

One way of defining a consonant inventory is as “the set of consonantal types that occur in a child’s production of meaningful words at a particular moment in development” (van Severen et al., 2012:164).

Speech development is grounded in both sociocultural and cognitive contexts (Strömqvist, 2011). There are many variables that affect the child’s speech production, such as the size of the child’s oral cavity and larynx. Individual differences are overt during speech development, but many similarities occur. An example of this is that hearing and deaf children up to about nine months have almost the same amount and type of babbling (Smith, 1982), indicating that early vocalization has a strong connection with the child’s anatomy. Stoel-Gammon (1985) suggests that there is a definite continuity in children’s phonological development, meaning that there is a high degree of overlap between the phonetic inventories of babbling and meaningful speech. In general, the phonemes that appear in many of the world’s languages are the phonemes first developed. On the contrary, language-specific phonemes are the ones last developed (Bjar, 2011).

As an introduction to Swedish phonology, Table 1 shows the development of consonants in the Swedish language. It is apparent that the language-specific phonemes /ɕ/ and /ɧ/ are the last ones to appear. When comparing the English inventory, presented in Table 2, to the Swedish equivalent, many similarities are evident. Plosives, laterals and nasals do not differ at all, whereas fricatives and trills merely show slight differences.


Table 1. Summary of the Swedish consonants in order of developmental acquisition (from early, left, to late, right). Cited from Bjar (2011:119), modified to IPA standard.

Plosives     p b t d k g
Laterals     l
Trills       r ʀ
Fricatives   j h f v s ɕ ɧ
Nasals       m n ŋ

Table 2. An overview of the English consonant inventory without specification of age of acquisition (Grunwell, 1981).

Plosives     p b t d k g
Laterals     l
Trills       r
Fricatives   j h f v ʃ ʒ tʃ dʒ θ ð
Nasals       m n ŋ
Approximant  w

Transcription is a part of the process of analyzing speech. Independent analysis focuses on the pronunciation patterns of the child regardless of the adult model (Stoel-Gammon, 1991). Another option is the use of relational analysis, which considers the accuracy of the consonant production in relation to the adult target form. More specifically, only correctly produced consonants are used in the analysis (van Severen et al., 2012).

The present study aims to investigate the consonant inventory of Swedish-speaking two-year-olds. The study by Stoel-Gammon (1985) exhibits the consonant inventory of American-English-speaking children aged twenty-four months. The consonants present in initial position were /b, t, d, k, g, m, n, h, w, f, s/ and in final position /p, t, k, n, r, s/. The hypothesis in this study is that Swedish children show a similar consonant inventory to English-speaking twenty-four-month-olds in both initial and final position. Expected findings are that the Swedish-speaking children are able to produce labial, dental and velar plosives as well as laterals and nasals, which is covariant with the results of Stoel-Gammon (1985).

Method

Results from this study are to be compared with the results from Stoel-Gammon (1985).

Data collection

The audio files analyzed in the present study were originally recorded for another purpose at the Department of Linguistics at Stockholm University. They were recorded using an H2 handy recorder. The recordings vary in length between 1.37 and 8.52 minutes and contain interactions between the child, their caregiver and the researcher in the laboratory at the department of linguistics. In the process of transcribing the audio files it became necessary to limit the length of some of the files. This measure was taken to ensure that each subject contributed approximately the same number of utterances.

Subjects

In this study, recordings of 13 children were used (7 girls and 6 boys, mean age 24;0). Guardians of children born in April 2011 were contacted, and their addresses were acquired from the Swedish IRS (Skatteverket). 600 letters of interest were sent out. 50 participants replied with consent and 35 of those participated in another study at the Department of Linguistics at Stockholm University. This study contains data from 13 of these 35 participants.

Transcriptions

The method chosen is independent analysis, with IPA as standard. Each audio file was transcribed twice, by two separate transcribers, to ensure inter-rater reliability.

Data analysis

Although the data obtained contained both meaningful speech and babbling, only the utterances which could be related to an adult target word were included in the analysis. The consonants were categorized by position in the word: initial or final. Statistical significance was

calculated using an independent one-sample t-test in IBM SPSS version 22.

Results

After analyzing the transcriptions, inter-rater reliability was 90.3 %. Tables 3 and 4 present analyses of initial and final consonants on group level. The tables include average inventory size (mean and range) and list the consonant phonemes occurring in at least 50 % of the subjects’ inventories. Both tables show that nasals are the phonemes most commonly acquired by Swedish twenty-four-month-olds, closely followed by voiced plosives. In initial position, /m, d, b/ are acquired in over 90 % of the inventories. The raw scores of the consonants in initial and final position were compared to the raw scores provided by Stoel-Gammon (1985). A one-sample t-test shows that there is no statistical significance (p=.398, α=.05) in the number of phonemes acquired in initial position between English- and Swedish-speaking two-year-olds. However, in final position there is statistical significance (p=.010, α=.05).

Table 3. Initial position: inventory size and phonemes in 50 % of subjects.

Age (n)   Inventory size: mean (range)   Phonemes in inventories of 50 % of subjects
24 (13)   8.7 (4–14)                     mᵃ dᵃ bᵃ n s ɡ l t v

ᵃ Indicates phoneme occurred in 90 % of the inventories.

Table 4. Final position: inventory size and phonemes in 50 % of subjects.

Age (n)   Inventory size: mean (range)   Phonemes in inventories of 50 % of subjects
24 (13)   3.8 (1–9)                      n l

Initial inventories are represented to a larger extent than final. Table 5 shows that anterior consonants are more frequent in the subjects’ inventories than posterior ones in both initial and final word position /b, d, t, m, v/. Word-initial voiced plosives /d, b/ are more common than voiceless /t, p/, with the reverse situation in final position, i.e. voiceless plosives occur more frequently in the final inventories than voiced ones.

Table 5. Manner of articulation for each phoneme analyzed and number of subjects with phoneme acquired in initial (I) and final (F) word position.

Manner of articulation   Phone (Initial–Final)   Subjects with phone acquired, Initial–Final
Plosives                 d (I)                   12–0
                         b (I)                   12–0
                         ɡ (I–F)                 7–2
                         t (I–F)                 8–5
                         k (I–F)                 5–1
                         p (I)                   5–0
                         ʈ (I–F)                 1–1
                         ɖ (I)                   1–0
Laterals                 l (I–F)                 7–10
Fricatives               s (I–F)                 7–4
                         v (I–F)                 9–1
                         ɕ (F)                   3–0
                         h (I)                   6–0
                         ʝ (I–F)                 4–6
                         f (I)                   3–0
                         ʂ (I)                   1–0
Nasals                   m (I–F)                 12–5
                         n (I–F)                 11–11
                         ɳ (I–F)                 1–1
Approximant              ɹ (F)                   1–0

Discussion

The results of the present study are to a high degree consistent with the findings from the study compared. The examination of the phonological acquisition of English-speaking two-year-olds by Stoel-Gammon (1985) indicates that there are many similarities to Swedish-speaking two-year-olds. According to Stoel-Gammon, the consonantal acquisition for English-speaking two-year-olds is as follows: in word-initial position /b, d, t, k, g, m, n, h, f, s, w/ and in word-final position /t, p, k, n, r, s/. Differences are overt when examining initial and final word inventories closely. Surprisingly, the nasal /m/ is the most common consonant used in initial position. An explanation for this might be that the audio files were not recorded to be used primarily for the present study and do not contain enough spontaneous speech. To a great extent, the audio files contained naming of objects, and the Swedish words /mœsa/ and /lampa/ are frequently uttered words. The subjects’ caregiver, usually their mother, was present, which made the children more inclined to utter /mama/. In spite of this, the phonemes expected in initial position, /d/ and /b/, were present in 90 % of the subjects. Worth mentioning is that there are almost no similarities in the final inventories. Speculations are that the purpose for which the audio files were recorded inhibits the subjects’ spontaneous speech and heightens the occurrence of specific target words such as /bil/. This could be an explanation for the high frequency of the anterior lateral /l/ in final position. The lack of final consonants in the target words might explain the low frequency of final consonantal phonemes overall.

One of the difficulties when transcribing is the bias towards the adult target word. Using independent analysis led to discussions about word segmentation: where does the word begin and end? Relational analysis might have been the solution to this problem. Nevertheless, independent analysis was chosen because of the age of the subjects and their inconsistent patterns between babbling and meaningful speech.

Although the results to a large extent are consistent with the findings from Stoel-Gammon (1985), the amount of data is too limited to make broader generalizations and hence additional studies are required, preferably an experimental study with audio files designed to fit the purpose, containing more spontaneous speech.

References

Bjar, L. (2011). Orden tar form – om barns uttalsutveckling. In L. Bjar (Ed.), Barn utvecklar sitt språk (pp. 101–124). Lund: Studentlitteratur AB.
Bland-Stewart, L. M. (2003). Phonetic inventories and phonological patterns of African American two-year-olds: A preliminary investigation. Communication Disorders Quarterly, 24(3), 109–120.
Grunwell, P. (1981). The development of phonology: A descriptive profile. First Language, 3, 161–191.
McIntosh, B., & Dodd, B. J. (2008). Two-year-olds’ phonological acquisition: Normative data. International Journal of Speech-Language Pathology, 10(6), 460–469.
Smith, B. L. (1982). Some observations concerning premeaningful vocalizations of hearing-impaired infants. Journal of Speech and Hearing Disorders, 47(4), 439–441.
Stoel-Gammon, C. (1985). Phonetic inventories, 15–24 months: A longitudinal study. Journal of Speech, Language and Hearing Research, 28, 505–512.
Stoel-Gammon, C. (1991). Normal and disordered phonology in two-year-olds. Topics in Language Disorders, 11(4), 21–32.
Strömqvist, S. (2011). Barns tidiga språkutveckling. In L. Bjar (Ed.), Barn utvecklar sitt språk (pp. 57–76). Lund: Studentlitteratur AB.
Van Severen, L., van den Berg, R., Molemans, I., & Gillis, S. (2012). Consonant inventories in the spontaneous speech of young children: A bootstrapping procedure. Clinical Linguistics & Phonetics, 26(2), 164–187.


Real-time registration of listener reactions to unintelligibility in misarticulated child speech

Ivonne Contardo1, Anita McAllister1, Sofia Strömbergsson2
1 Division of Speech-Language Pathology, CLINTEC, Karolinska Institutet, Sweden
2 KTH Speech, Music and Hearing, Stockholm, Sweden
[email protected], [email protected], [email protected]

Abstract
This study explores the relation between misarticulations and their impact on intelligibility. 30 listeners (13 clinicians and 17 untrained listeners) were given the task of clicking a button whenever they perceived something unintelligible during playback of misarticulated child speech samples. No differences were found between the clinicians and the untrained listeners regarding clicking frequency. The distribution of listener clicks correlated strongly with the clinical evaluations of the same samples. The distribution of clicks was also related to manually annotated speech errors, allowing examination of links between events in the speech signal and reactions evoked in listeners. Hereby, we demonstrate a viable approach to ranking speech error types with regards to their impact on intelligibility in conversational speech.

Introduction
Children with speech disorders often present with systematic speech error patterns. As a communicative consequence, intelligibility is often reduced. For these children, as well as for younger children following a typical course of speech acquisition, communication is especially affected when interacting with people they do not already know (Coplan & Gleason, 1988; Kwiatkowski & Shriberg, 1992).

Speech intelligibility is an important consideration in many clinical decisions; however, it is not trivially assessed. Standard measures of intelligibility may be questioned with respect to their functional relevance (McLeod, Harrison, & McCormack, 2012). For one, the child's ability to produce isolated words in a clinical setting may not be very representative of how well he or she is understood when communicating with unfamiliar people. Second, no currently available clinical measure of intelligibility exposes what causes reductions of intelligibility. In order to link levels of (un)intelligibility to features in speech production, intelligibility assessments need to be complemented with assessments of the type and degree of speech impairment.

The most widely used metric of severity of speech disorders is the Percentage of Consonants Correct (PCC; Shriberg & Kwiatkowski, 1982). This measure is calculated as the proportion of correctly produced consonants (as judged by a trained clinician) across all target consonants in a speech sample. Despite its established reliability and validity as a quantitative measure of severity of involvement (ibid.), the PCC metric is associated with some limitations, e.g. relating to its application to highly unintelligible speech. And although it may seem intuitive to assume a strong correlation between a child's speech production skills and the perceived intelligibility of his or her speech, the relation between the two is weak (Kwiatkowski & Shriberg, 1992). Hence, linking speech production problems to levels of intelligibility requires alternative approaches.

Of the contextual factors influencing intelligibility, the listener's familiarity with the speaker has been shown to play an important role; family members are, for instance, better at glossing a child's intended words than unfamiliar people (Kwiatkowski & Shriberg, 1992), and clinicians have been found to evaluate misarticulated speech as more intelligible compared to untrained listeners (Lundeborg & McAllister, 2007; McGarr, 1983). In order to understand how clinical evaluations reflect the child's everyday communicative challenges, there is value in calibrating clinical evaluations against evaluations of other listeners.

Audience Response Systems (ARS) have long been used in concurrent evaluations of e.g. movies and screenplays, where many subjects are asked to click a button when they like (or dislike) what they see. The method has also been used for time-efficient evaluation of speech synthesis by many subjects (Edlund, Hjalmarsson, & Tånnander, 2012). Applying the ARS-based method to recordings of misarticulated speech presents itself as an interesting opportunity. First, the method allows for fast collection of ratings from many listeners, thus strengthening the reliability of the ratings. Second, recruiting untrained listeners as raters gives an indication of the extent of the everyday problems that children experience when communicating with unfamiliar people, thus relating to the concept of functional intelligibility [8]. Third, the real-time ratings can serve as pointers to salient speech problems, indicating what speech phenomena are most disturbing to listeners. If coupled with qualitative speech analysis, this information can go far beyond standard measures of intelligibility/severity.

Strömbergsson & Tånnander (2013) describe a first exploration of applying the ARS method to the domain of misarticulated speech. However, despite demonstrating the potential of the ARS method as an instrument for identification of features in children's speech that are most detrimental to intelligibility, the study is limited in several respects. First, the instructions provided to the listeners were unclear, thus restricting the interpretation of the listeners' responses. Second, the qualitative analysis was very limited, which precluded any conclusions regarding what specific speech error types evoke the most reactions in listeners. The present study aims to address these issues.

Research questions
In the present study, the following research questions are addressed:
1. Are there any differences between SLPs and untrained listeners in their real-time reactions to unintelligibility in samples of misarticulated child speech?
2. To what extent do real-time reactions to unintelligibility reflect results of standard clinical measures?
3. To what extent do specific speech errors contribute to decreased intelligibility?

Method
Conversational speech was recorded from 7 preschool-aged Swedish children exhibiting speech production deficits. Speech production was assessed by means of LINUS (Blumenthal & Lundeborg Hammarström, 2013); speech production characteristics are summarized in Table 1. Parental evaluation of intelligibility was assessed by means of the Intelligibility in Context Scale (ICS) (McLeod et al., 2012).

In the recording situation, the children and an adult (an SLP student) talked about toys or pictures, visible to both of them. The children were recorded with a Zoom H2 recorder with a 44 kHz sampling frequency. Sequences of continuous child speech were extracted manually from the child-adult conversations, and sequentially concatenated to form one-minute long speech samples. In all, 11 such speech samples were combined into a listening script, with one sample serving as an introductory item and excluded from analysis.

For the 10 conversational speech samples, the Percentage of Consonants


Table 1. Descriptions of the recorded children. Age is given as years;months.

ID  Age  Speech errors
A   3;8  Added voicing, velar fronting, stopping, /r/-weakening, cluster reductions, omission of /ɕ/ and /h/, assimilations and metatheses, /e/ not established.
B   5;0  /r/-weakening, stopping, omission of /s/, cluster reductions, assimilations.
C   6;1  /r/-weakening, stopping, /m/ → [b], /n/ → [d], /ɧ/ → [s], /ŋ/ → [g].
D   6;0  /r/-weakening.
E   5;6  Velar fronting, cluster reductions.
F   5;8  /ʈ/ → [t], cluster reductions.
G   5;6  /r/-weakening, /l/-weakening, stopping, labialization, cluster reductions, assimilations, epentheses, /b/ → [v], /e/ not established.

Correct (PCC) was calculated along the procedures described in Shriberg & Kwiatkowski (1982), by two independent experimenters. Inter-judge reliability between the experimenters was .96 (Cronbach's alpha). For each sample, the average of the two judges' PCC measures served as the final PCC measure for that particular sample.

All 10 conversational speech samples were also subject to qualitative analysis. Here, the first author used WaveSurfer (Sjölander & Beskow, 2000) to mark and label all speech errors occurring in the samples. Stretches of unintelligible speech were assigned the label "unintelligible", and typically ranged across several words. From the resulting timestamps, the midpoints of the events were passed on to further analysis.

30 adults participated in the ARS-based listening test; their age varied between 25 and 61 (M = 35.80, SD = 9.56). The gender distribution was 10:20 (male:female). 13 of the listeners were SLPs, all with experience of working with children. There was no difference between the SLPs and the 17 untrained listeners with regards to age: t(28) = .36, p = .72. The SLPs' working experience ranged between 8 months and 23 years (M = 9.03, SD = 8.10).

The 10 conversational speech samples were randomized for each listener, and implemented in a web-based ARS listening test. The listeners were instructed to listen to the speech samples and to click any keyboard key (or mouse key) whenever they perceived something unintelligible during playback. All listener clicks were registered during playback. The average number of clicks over all listeners and all speech samples was used in the weighting of the listeners' clicks, so that clicks from listeners who do not click very often are given more weight than clicks from listeners who click more frequently.

The distribution of the weighted clicks was analyzed by means of Kernel Density Estimations (KDEs). The analysis resembles a histogram, but the produced curve is continuous and smooth. For each recording, the distribution of clicks was linked to the manually annotated speech errors; if a KDE peak was found within an interval of 500-1400 ms after an annotated speech error, this assembly of listener clicks was considered to reflect a reaction to that specific speech error.

Results
Potential differences between the two listener groups in their clicking behavior were explored by means of a one-way ANOVA, with total number of clicks as the dependent variable and listener group (SLPs vs. untrained listeners) as an independent variable. This analysis revealed no difference between the groups: F(1,28) = .13, p = .72.

A Pearson correlation analysis was used to explore the relationship between the PCC and the number of (weighted) clicks per recording, revealing a strong negative correlation between the two: r(10) = -.91, p < .001. The inverted correlation between PCC and the number of (weighted) clicks per recording is indicated by the figures in Table 2. In order to explore the relation between intelligibility ratings as assessed by the ICS and the PCC on the one hand, and between ICS and the number of (weighted) clicks per recording on the other hand, two separate Pearson correlation analyses were performed. No correlation was found between the children's ICS scores and the PCC scores [r(10) = .55, p = .10], nor between ICS scores and the number of (weighted) clicks per recording [r(10) = -.38, p = .29].

Table 2. Evaluation results for all recordings, with regards to the PCC and the (weighted) listener clicks.

Rec  Child  ICS  PCC  Clicks
1    A      3.7  66%  10.5
2    A      3.7  61%  11.7
3    B      3.6  77%  3.7
4    B      3.6  83%  2.4
5    C      4.4  89%  1.1
6    D      4.7  96%  1.1
7    D      4.7  91%  2.9
8    E      4.0  88%  3.3
9    F      3.6  99%  0.7
10   G      3.3  70%  5.4

The extent to which different speech error types evoked listener reactions is listed, for all error types, in Table 3. As revealed in the table, assimilation and added voicing often evoke reactions in listeners, whereas errors like metathesis and syllable omission do not appear as destructive to intelligibility.

Table 3. The speech error types evidenced in the recorded data, together with the number of times they are followed by a KDE peak (interpreted as a listener reaction).

Speech error                      Freq.  Evoked reactions  % of instances followed by reaction
Assimilation                      10     5                 50%
Unintelligible*                   16     8                 50%
Added voicing                     4      2                 50%
/r/-weak. + final cons. deletion  3      1                 33%
Velar fronting                    26     7                 27%
Stopping                          27     7                 26%
/r/-weakening                     79     19                24%
Final consonant deletion          19     4                 21%
Omission                          26     5                 19%
Cluster reduction                 27     5                 19%
Other                             33     6                 18%
Cluster red. + velar fronting     6      1                 17%
Vowel error                       26     4                 15%
Assimilation + devoicing          1      0                 0%
Syllable omission                 3      0                 0%
Cluster reduction + /r/-weak.     1      0                 0%
Metathesis                        9      0                 0%
/ɕ/-error                         1      0                 0%
Devoicing                         1      0                 0%
* Stretches of speech labeled as unintelligible by the annotator.
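The click-weighting and KDE-based peak-linking procedure described in the Method section can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the paper does not give the exact weighting formula, so the per-listener weight (overall mean click count divided by the listener's own click count), the Gaussian kernel and the bandwidth are all assumptions; only the 500-1400 ms linking window is taken from the text.

```python
import math

def weight_clicks(clicks_by_listener, overall_mean):
    """Give clicks from infrequent clickers more weight: one plausible
    reading of the paper's weighting (exact formula not specified)."""
    weighted = []
    for clicks in clicks_by_listener:
        w = overall_mean / len(clicks) if clicks else 0.0
        weighted.extend((t, w) for t in clicks)
    return weighted

def kde(weighted_clicks, grid, bandwidth=0.25):
    """Weighted Gaussian kernel density estimate over a time grid (seconds)."""
    return [sum(w * math.exp(-0.5 * ((g - t) / bandwidth) ** 2)
                for t, w in weighted_clicks) for g in grid]

def peaks(grid, density):
    """Times of local maxima of the KDE curve."""
    return [grid[i] for i in range(1, len(density) - 1)
            if density[i - 1] < density[i] >= density[i + 1]]

def link_reactions(error_times, peak_times, lo=0.5, hi=1.4):
    """A speech error counts as 'reacted to' if a KDE peak falls
    500-1400 ms after it (the interval used in the paper)."""
    return {e: any(e + lo <= p <= e + hi for p in peak_times)
            for e in error_times}

# Toy example: three listeners; clicks cluster ~0.8 s after an error at 2.0 s.
clicks = [[2.8], [2.9, 7.0], [2.75]]
mean_clicks = sum(len(c) for c in clicks) / len(clicks)
grid = [i * 0.05 for i in range(201)]  # 0-10 s in 50 ms steps
density = kde(weight_clicks(clicks, mean_clicks), grid)
reactions = link_reactions([2.0, 5.0], peaks(grid, density))
```

On this toy input, the error at 2.0 s is linked to the click cluster around 2.8 s, while the error at 5.0 s is not, since the nearest peak (at 7.0 s) falls outside its 500-1400 ms window.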


Discussion
This study has presented an application of an ARS-based method of evaluating intelligibility to conversational samples of misarticulated child speech. The comparison between listeners with professional experience of misarticulated child speech and untrained listeners, with regards to their reactions to unintelligibility, revealed no difference between the groups. The results of the ARS evaluation were validated against a standard clinical measure of severity (the PCC), revealing a strong correlation between the two. By linking annotated misarticulations to the distribution of listener reactions, we have demonstrated the potential of the ARS-based method to rank different types of speech errors (or, for that matter, any episodic speech phenomena) by their impact on, in this case, intelligibility.

Given observations that the correlation between the severity of the speech impairment and intelligibility is weak (e.g. Shriberg & Kwiatkowski, 1982), the lack of correlation between the ICS and the PCC measures is not surprising. However, the lack of correlation between the results of the listener evaluation and the ICS measure requires comment. This may reflect the fact that perceived intelligibility varies across situations, and that a one-minute recording of a one-to-one conversation in a quiet room risks not being representative of general everyday situations.

Just as in Strömbergsson & Tånnander (2013), no difference was found between experienced clinicians and untrained listeners in their evaluations of intelligibility. This contradiction of earlier findings (Lundeborg & McAllister, 2007; McGarr, 1983) may be due to the nature of the speech material (conversational speech vs. isolated words), or to the nature of the misarticulations (primarily phonological errors vs. the speech of a child with apraxia of speech or deaf children).

A limitation concerns the uncertainty tied to the determination of the time window where listener reactions are sought, in the process of linking speech events to listener reactions. By using a relatively long time window, and by allowing listener reactions to be interpreted as having been evoked by more than one speech event, the risk of overlooking existing connections is minimized. This, however, is at the expense of precision, which may lead to the identification of links that are not actually there. In future work, these decisions may need refinement.

Many factors contribute to variations in intelligibility. The focus in the present study has been specific segmental speech errors, whereas other aspects of the speech material have been disregarded. In order to control for the influence of other factors – e.g. lexical, syntactic, prosodic or pragmatic factors – using a more restricted speech material, and/or including more material from more speakers, should be considered.

Much work remains to arrive at firm conclusions on how specific speech errors contribute to decreased intelligibility. However, the present study constitutes an important step, in describing a viable method for collecting norms in this area. By integrating such information in the prioritizing of clinical targets, intervention may be better directed at those patterns that cause the most problems for children in their everyday lives.

Conclusions
We have demonstrated the potential of applying an ARS-based method to the domain of misarticulated child speech, to explore the relative contribution of different speech errors to perceived (un)intelligibility. Although more data – in terms of more speakers and broader coverage of speech error types – is required to allow general conclusions regarding the impact of different speech errors on intelligibility, the paucity of established norms in this area strongly motivates continued efforts in this direction.
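As a rough consistency check (ours, not the authors'), the negative PCC-versus-clicks relation can be recomputed from the rounded values in Table 2. Because the table values are rounded, this gives r ≈ -0.85 rather than the reported r(10) = -.91, which was presumably computed from the unrounded data.

```python
import math

# PCC and weighted-click figures transcribed from Table 2 (recordings 1-10)
pcc    = [66, 61, 77, 83, 89, 96, 91, 99, 88, 70]
clicks = [10.5, 11.7, 3.7, 2.4, 1.1, 1.1, 2.9, 3.3, 0.7, 5.4]

def pearson(x, y):
    """Pearson product-moment correlation, no external dependencies."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson(pcc, clicks)  # ≈ -0.85 on these rounded table values
```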


Acknowledgements
The web-based platform for the ARS test was provided by Södermalms Talteknologiservice (STTS). Jens Edlund produced the KDE curves.

References
Blumenthal, C. & Lundeborg Hammarström, I. (2013). LINUS preliminärmanual från februari 2014. Sweden: Dept. of Neuroscience and Locomotion/Speech Pathology, Linköping University.
Coplan, J. & Gleason, J. R. (1988). Unclear speech: Recognition and significance of unintelligible speech in preschool children. Pediatrics, 82(3), 447–452.
Edlund, J., Hjalmarsson, A., & Tånnander, C. (2012). Unconventional methods in perception experiments. In Proceedings of Nordic Prosody XI. Tartu, Estonia.
Kwiatkowski, J. & Shriberg, L. D. (1992). Intelligibility assessment in developmental phonological disorders: Accuracy of caregiver gloss. Journal of Speech and Hearing Research, 35(5), 1095–1104.
Lundeborg, I. & McAllister, A. (2007). Treatment with a combination of intra-oral sensory stimulation and electropalatography in a child with severe developmental dyspraxia. Logopedics Phoniatrics Vocology, 32(2), 71–79.
McGarr, N. S. (1983). The intelligibility of deaf speech to experienced and inexperienced listeners. Journal of Speech and Hearing Research, 26(3), 451–458.
McLeod, S., Harrison, L. J., & McCormack, J. (2012). The Intelligibility in Context Scale: Validity and reliability of a subjective rating measure. Journal of Speech, Language and Hearing Research, 55(2), 648–656.
Shriberg, L. D. & Kwiatkowski, J. (1982). Phonological disorders III: A procedure for assessing severity of involvement. Journal of Speech and Hearing Disorders, 47(3), 256–270.
Sjölander, K. & Beskow, J. (2000). WaveSurfer – an open source speech tool. In B. Yuan, T. Huang, & X. Tang (Eds.), Proceedings of ICSLP 2000, the 6th Intl Conf on Spoken Language Processing (pp. 464–467). Beijing, China.
Strömbergsson, S. & Tånnander, C. (2013). Correlates to intelligibility in deviant child speech – comparing clinical evaluations to audience response system-based evaluations by untrained listeners. In Proceedings of Interspeech 2013 (pp. 3717–3721). Lyon, France.


SUBIC: Stockholm University Brain Imaging Center

Francisco Lacerda, Björn Lindblom
Department of Linguistics, Stockholm University, Sweden
[email protected], [email protected]

Abstract
This contribution presents an outline of SUBIC (Stockholm University Brain Imaging Center, working name). SUBIC is conceived as an interdisciplinary infrastructure that will promote Stockholm University's participation in international cutting-edge research focused on the function and the morphologic evolution of the brain.

Introduction
SUBIC is conceived as a brain imaging infrastructure to be hosted by Stockholm University for advanced research in the Humanities, Social Sciences, Evolutionary Zoology and Law. The new infrastructure will supplement the range of resources for scientific research already available at Karolinska Institute's current brain imaging infrastructures by offering low-noise (quiet) brain imaging technologies specifically tailored to the special needs of behavioral, psychological and developmental studies involving spoken language and other acoustic stimuli, as well as advanced technology to study the morphological and functional evolution of the brain in non-human species. This combination of functional brain research in humans with the zoological perspective on the morphological and functional evolution of the brain in other species will integrate Stockholm University's multidisciplinary expertise and create a brain research center of both national and international relevance.

The geographic proximity and complementarity of the brain imaging resources that will be available both at Karolinska Institute and at Stockholm University will trigger important scientific synergies and place SU at the cutting edge when it comes to sophisticated brain imaging facilities dedicated to a wide range of psychological, social and developmental scientific research. In addition, from a scientific and a logistic perspective, Stockholm University is in a unique position to make productive use of such an advanced infrastructure with its in-house scientific expertise, as well as to provide excellent integrated laboratory space for that infrastructure. Needless to say, adequate scientific expertise and specialized technical competence are necessary conditions to ensure a high-quality output from brain imaging studies, but there is a critical mass of researchers within Stockholm University's four faculties – Humanities, Social Sciences, Natural Sciences and Law – who are already engaged in scientific research that uses brain imaging techniques. Indeed, many of these researchers are today collecting their brain imaging data at the Karolinska Institute or abroad. Also, the necessary detailed knowledge of magnetic and electric fields, as well as of the cryotechniques involved in MRI and MEG devices for brain imaging, is available from among researchers at the University's Department of Physics. The same goes for the needs of advanced computational competence, which is available within Stockholm University at the Departments of Mathematics and Statistics, for instance. In parallel with the availability of expertise to carry out the methodological and computational tasks required by the advanced brain imaging technology per se, a significant contribution will come from Stockholm University's Faculties of Humanities, Social Sciences, Natural Sciences and Law in terms of non-clinical research

areas, addressing central basic research questions concerning the fundamental nature of both non-human and human cognition, learning, information processing and interactions with other individuals. We argue that it is via this basic research that Stockholm University's contribution will be most significant, because it will provide high-quality reference information on the fundamental functions and features underlying human and non-human behavior – i.e., essential knowledge underpinning a wide range of scientific research.

In addition to the scientific expertise and breadth of potential research questions per se, the laboratory space available at Stockholm University is an optimal location for a well-planned brain imaging center, with easy access for the handicapped, parents with small children and a number of non-model animal species, as well as for the delivery and storage of lab supplies. Stockholm University has made available a 360 m2 laboratory space on a single ground-level plan that meets all the requirements of volume dimensions and energy supply imposed by the brain imaging infrastructure.

The infrastructure will be available for national and international academic researchers from all disciplines. Specialized technical staff (1.5 full-time) will maintain the infrastructure and assist researchers with technical issues and experiment design. The infrastructure may also be available, with lower priority, for non-academic research.

A multidisciplinary board will manage the infrastructure and be ultimately responsible for planning and assessing study proposals, as well as allocating technical and other staff resources to conduct the studies.

Scientific aims
The main scientific aim to be achieved by this infrastructure is to integrate the cognitive, social and evolutionary perspectives in the study of brain function. To achieve this, SUBIC will be a physical laboratory facility within which Stockholm University's multidisciplinary expertise will meet, discuss and run experimental studies addressing fundamental research issues regarding the functional evolution of the brain and brain function within central aspects of human behavior. Interdisciplinary scientific exchange within SUBIC will be promoted by requiring that scientists from all fields involved in SUBIC participate in regular high-level weekly interdisciplinary seminars for the discussion of ongoing brain imaging research experiments or fundamental scientific issues. Through the participation in the seminars and the availability of laboratory facilities, a culture of interdisciplinary scientific exchange is expected to emerge and attract scientists from other national and international institutions. Another intimately related scientific aim is to contribute to the Human Brain Project with high-quality scientific data describing brain function associated with a broad range of human behaviors.

Several lines of research within the social sciences and humanities have already integrated the benefits of the available brain imaging technology in their methodologies, but more systematic work must be done. The study of learning, the development of concepts, and memory and how it is affected by stress, by sleep, or altered with age (Ebner, Maura, Macdonald, Westberg, & Fischer, 2013; Fischer et al., 2010; Kuhl et al., 1997; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992; MacDonald, Nyberg, Sandblom, Fischer, & Backman, 2008; Marklund, Schwarz, & Lacerda, 2014; Norrelgen, Lacerda, & Forssberg, 1999; Werheid et al., 2010) are just a few examples of basic research where the new brain imaging technology already plays a central role. For instance, fMRI techniques are currently being applied to sleep research within the large ongoing collaborative project involving KI and SU, "Sleepy Brain", addressing issues like brain connectivity during different

phases of sleep, consolidation processes of pre-sleep learning, and effects of sleep reduction and REM-sleep disturbance on emotional regulation. Indeed, virtually all aspects of human behavior are linked to brain activity, and brain imaging technology is nowadays a well-established and very successful methodology in different areas of experimental psychology. However, there are still important areas of research that are not yet well acquainted with the opportunities offered by this methodology to address brain function in relation to central aspects of typical and atypical human behavior. Linguistics offers a good example of a scientific area within which the availability of SUBIC's new brain imaging technology is likely to trigger a paradigm shift. Let us look closer at this example.

Traditionally, linguistics has been focused on structural descriptions of languages. It has been concerned mainly with syntax and morphology, but has essentially lacked a biological link to language as used by interacting individuals. Noam Chomsky's revolutionary work introducing the concepts of generative grammar provided powerful descriptions of how a finite set of elements could capture an individual's syntactic competence. Nevertheless, it offered an essentially static, idealized description, which was limited in both developmental perspective and a serious pursuit of its biological bases. Indeed, in so far as the latter were considered at all, they were seen as secondary issues. As a result, most of the contributions to the study of the biological foundations of language and spoken communication have come from empirical cognitive psychology and other disciplines rather than from core linguistic research.

The social relevance of a discipline is linked to its practical use. There are numerous applications – specifically in the educational and clinical domains – that rely on solid knowledge of how the use of spoken language works and how mastery of both one's native and a second language is acquired. These practically based research needs provide strong motivation for shifting the focus from structural descriptions of languages to language use and language learning. It is in this sense we see the advent of brain imaging and a neuroscience perspective as likely to promote a paradigm shift: from abstract structural descriptions to behaviorally based functional accounts.

In this context, experimental phonetics can be said to have been the only field of linguistics attempting to contribute with independently motivated and biologically anchored explanations, but it is fair to say that, despite such developments, by and large phonetics remains trapped by the structuralist paradigm, and descriptive work is still its dominant theme. To be sure, good descriptive data is a necessary basis for further scientific work, but the most significant advances are to be expected from meta-analyses and testable hypotheses emerging from those data. However, deeper insights into the processes underlying language communication and the linguistic interaction between individuals are only possible through experimental work that reveals how linguistic capabilities emerge during early development and how they are modulated by the social processes in which the individual is involved throughout life. Experimental phonetics has contributed significantly during the last decades with answers to some of these questions. In the new intellectual framework it will continue to grow in depth and breadth.

The recent development and availability of brain imaging technology have created important new challenges and research options for the humanities, in particular regarding the study of natural language, and the neurobiological organization and pragmatic aspects of its interactive use and acquisition.

In summary, the proposed brain imaging infrastructure will allow researchers from the humanities, social sciences and natural sciences to integrate a unique evolutionary perspective on functionally induced changes in brain morphology with cutting-edge studies of brain function under advanced cognitive tasks, and of how brain function is modulated by factors like stress, sleep, age or life experience. More specifically, the following interdisciplinary lines of study have been discussed and can be initiated once the brain imaging infrastructure is available:
• Memory processes as a function of age, stress, sleep and multisensory information
• Implicit learning as a function of age, background or additional information and sensory modality (primarily visual and audio, as well as a combination of the two)
• Early language learning, the multisensory bases of the emergence of linguistic concepts early in life, the emergence of syntactic structures and morphological generalization
• The emergence of concepts and linguistic processing in sign and oral language
• Representation of faces and voices and their social significance
• Decision making and risk assessment in economic decisions and game situations
• Memory processes and witnesses' judgments
• Face and voice recall in everyday life and in forensic contexts
• Implicit learning in game situations and cultural transfer
• Cultural changes and priming influences in representation and ranking tasks
• Non-human species' evolutionary morphologic brain changes associated with changes in brain function

Some of these studies are already being conducted using available infrastructures, but the SUBIC infrastructure will allow for much wider and more integrated interdisciplinary research gains.

Significance
The brain is involved in all sensory representation and information integration, as well as in human cognitive and motor activities. The functional study of the brain is one of the most challenging scientific goals and obviously demands a broad range of coordinated interdisciplinary efforts. The functional study of the human brain also extends the scope of traditional clinical research directly associated with the physiological consequences of conditions like aphasias, epilepsy or brain injuries, by providing integrated models of how multisensory information is processed by the brain and underlies the formation of central concepts and behaviors. The range of competences necessary to address such broad functional aspects is, of course, extremely vast and cannot be strictly predicted or dictated in advance. The coordination of the partial knowledge provided by each of the involved disciplines and their current methodologies requires a culture of systematic scientific research capable of generating well-grounded and testable explanatory models, as well as contributing with high-quality and well-documented data for future research efforts and re-analyses. The functional study of the human brain raises this challenge because it attempts to account for how the human brain handles the individual representations of sensory information throughout life, as well as how those representations impact the individual's behaviors and cognitive representations, and how they are modulated through the individual's social interaction and relation to her ecologic settings. The academic environment provided by universities – with the combination of, on the one hand, knowledge from disciplines within the Social Sciences, the Humanities and Law, and on the other hand, the methodological and physical knowledge from disciplines within the Natural Sciences – is clearly the necessary and probably the optimal context to start addressing the full scale of issues raised by the functional study of the brain without losing the important individual and social perspectives.

The growth of neuroscience during the past few decades has been charac-

136 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University terized as explosive. Despite numerous gate multiple aspects of brain evolution, significant insights in many subareas of brain function, and clinical aspects of the field, neuroscientists from different the brain. The expertise in Stockholm specialties have expressed concern that University in the disciplines of Psy- recent developments have come with a chology, Linguistics, Culture evolution, snag: Radical progress in developing Game theory, Brain evolution, Neuro new treatments for brain disease is an economy, Decision processes and the obvious possibility but reaching those methodological development of brain goals will require a long-term effort. It imaging makes Stockholm University will not happen in the immediate future. the perfect host for this national center In their report to the European Com- of Brain Imaging. Together with other mission, the applicants of The Human leading experts both from Swedish and Brain Project point out: International universities, the Stock- holm University Brain Imaging Centre “We find that the major obstacle that will offer a state of the art hub for fu- hinders our understanding of the brain is the fragmentation of brain research ture endeavors into the remaining key- and the data it produces. …. Today we questions in Science focusing on brain urgently need to integrate this data – to evolution, integrated brain function show how the parts fit together in a sin- throughout life, brain development and gle multi-level system.” aging. And all this will be achieved in a center that emphasizes the importance The Human Brain Project (awarded of the welfare of the subjects through €1bn by the EU) involves a large group animal/infant/child/adult friendly tech- of scientists with backgrounds primarily nology. 
in neurobiology and computational sci- Finally, the proposed infrastructure ence the goal of the project being to will allow to unleash the power of basic develop new treatments for brain dis- scientific research by addressing strong ease and new brain-like computing long-term fundamental issues intimate- technologies. ly linked to the very human nature and We here maintain that the “integra- to the urgent need of deeper insights on tion of brain data” that the above quote human cognition and how human brain refers to should go beyond the clinical function evolved in relation to other areas by including contributions from species. the Humanities, the Social Sciences and Natural Sciences that throw light on Survey of the field how normal non-human and human Brain imaging facilities have become a behaviors work. To explain diseases common tool in most medical faculties, scientifically, to diagnose them and both in Sweden and other industrialized treat them, fundamental knowledge of countries. An increasing number of the underlying normal behavior is a non-medical Universities in Europe, the prerequisite. That is the international USA and Japan have also been invest- and intellectual context in which the ing in brain imaging facilities, which SUBIC initiative is presented. Stock- are typically used in research conducted holm University combines this type of within Psychology, with more recent fruitful academic environment with the expansions towards neuro-economics geographic proximity to the more clini- and studies of decision-making. cally oriented research driven by scien- tists at the Karolinska Institute. 
EEG We will use a holistic approach that For more than one century ago, the fun- encompasses all aspects of variation in damental Physics knowledge generated brain morphology and behavior in a by basic research in electronics and suite of organisms, ranging from Dro- electromagnetism reached a level of sophila fruit flies to humans, to investi- understanding of central electronic phe-

137 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University nomena that made it possible to start relevant cases. The technique explores building devices capable of amplifying different aspects of the magnetic reso- electrical signals. Among the extremely nance properties of hydrogen nuclei, wide range of applications for the new depending on if the goal is to obtain amplifying devices, the development of structural or functional measurements. sensitive and low-noise amplifiers For instance, in the case of fMRI the opened for the first Electroencephalo- oxygenation-dependent magnetic prop- graphic (EEG) studies of the living erties of the haemoglobin are the key to brain, both in animals and in human the investigation of brain regions in- subjects. Many of the initial studies of volved in different tasks. As the neu- EEG activity in the human brain were ronal activity level in a brain region understandably triggered by clinical increases when a subject is requested to needs, like the study of cortical pro- perform a certain task, the neurons’ cesses associated with aphasia, epilepsy heightened activity level demands more and other neurological diseases, but the oxygen and richly oxygenated haemo- technology was quickly adopted to the globin floods into the region. However, study of non-clinical basic research after a few seconds the more active questions, in particular within Psychol- neurons have all the oxygen supply they ogy and Linguistics. EEG measure- needed but the oxygenated blood con- ments provide cortical activity data tinues to flow for a short while. This with high temporal resolution, which causes a temporary excess of highly makes them suitable to register Event oxygenated blood in the region, which Related Potentials (ERP), but the meas- can be measured because of the differ- urements spatial resolution is poor. 
In ences in the magnetic behavior of oxy- addition, the registered electrical sig- genated haemoglobin (diamagnetic) and nals are affected by the skull’s thick- deoxygenated haemoglobin (paramag- ness, which may difficult the direct netic). Obviously, functional MRI comparison and interpretation of ampli- (fMRI) is an extremely important tech- tude measurements obtained from dif- nique for both basic and clinical re- ferent electrodes. search because it generates high spatial- resolution data of the whole brain from MRI which it is possible to identify regions A more recent technology for brain of heightened neuronal activity associ- imaging, introduced in the early 1990- ated with different experimental tasks. ies, is Magnetic Resonance Imaging However, because fMRI actually (MRI). As in other cases, the technique measures the excess of oxygenated itself emerged from a combination of blood resources overflooding the active basic research in neurophysiology and brain region, the temporal resolution of in Physics, like Electromagnetism, fMRI is typically in the range of a cou- Atomic Physics and the physics of ex- ple of seconds – which is much poorer tremely low temperature electrical con- than millisecond resolution available duction (superconductors). The MRI from EEG measurements – but fMRI’s technique is suitable for both structural, high spatial resolution is an excellent functional (fMRI) and diffusion (con- feature for investigation of a range of nectivity) brain imaging, which serve psychological phenomena where it is different purposes. Like with the EEG, important to identify the brain struc- most of the initial MRI studies were tures involved in specific tasks. 
Fur- concerned with structural assessments thermore, the same equipment can be of the clinical brain, but the range of used to obtain detailed structural brain applications has progressively expand- data, which is very important for the ed towards more general basic research study of long-term structural changes in questions that obviously feed back into the brain (like aging, specializations the need to understand the clinically like in high level music performance,

138 Proceedings from FONETIK 2014, Department of Linguistics, Stockholm University structural differences associated with Evolutionary Zoology is yet anoth- the development of sign or oral lan- er research area where MRI scanning is guage skills, long-term effects of envi- an important tool for the structural ronmental conditions, etc.) or to pro- study of the evolution of the brain vide individual anatomical descriptions across species. Scanners for this pur- of a subject’s brain – descriptions that pose have to meet species-specific provide valuable information for the needs of spatial resolution and volumes interpretation of EEG, MEG and NIRS of operation. For instance, for morpho- data as well as for the design of TMS logical brain scanning of animals, like experiments. The use of MRI scanners, dogs, wolves or foxes, smaller versions both in functional and structural studies, of 3 T equipment provide good enough is well established in many fields of brain images, but small rodents, small psychology involving the representation fish or insects require much stronger of all types of sensory stimuli (although magnetic fields (9.4 T) to obtain brain still less with regard to auditory stimuli) images with enough spatial resolution and in the last decades there has been though within scanning volumes as an increasing demand for the use of small as 1000 cm3. fMRI technology in the study of socio- logical, economical, cognitive and deci- NIRS sion-making processes. The two major The near-infrared spectrophotometry drawbacks of the technique until quite (NIRS) is a technique that uses differ- recently have been its poor temporal ences in the way oxygenated and deox- resolution and the disturbing levels of ygenated haemoglobin reflects infrared intermittent noise generated during the light. Its working principle is similar to scanning sequences. 
This has not been a that of MRI in that it measures the problem for the methodologies used in amount of re-emergent light when illu- the above mentioned research fields but minating cortical blood vessels with a the limited temporal resolution makes near-infrared light source. The amount the technique inadequate for the study of reflected light varies with the oxygen of rapidly time-varying phenomena, level in the haemoglobin and can there- like the cortical processing of speech fore be used to estimate the amount of stimuli or cognitive responses associat- oxygenated blood being requested to a ed with speech stimuli. Also the noise cortical region. The technique has a levels in the scanner pose serious diffi- better temporal resolution than fMRI culties to studies requiring the presenta- and better spatial resolution than EEG. tion of speech or good quality audio It can measure fast variations in the signals. Indeed, even though earplugs blood oxygenation level achieving a and specially designed headphones do temporal resolution of about 200 ms as improve much of the disturbance well as the slow varying oxygenation caused by the scanner’s noise, the con- levels caused, as in fMRI, by the local trol of the presentation levels or of the excess of oxygenated blood flooding to spectral details of the stimuli actually the active brain region. Although a delivered to the subjects is far from temporal resolution of 200 ms still is satisfactory in most of the traditional poor for the study of events linked to fMRI cameras. Fortunately, a new gen- specific speech or auditory features, eration of much less noisy scanners has NIRS offers a good compromise of appeared recently which will facilitate temporal and spatial resolution. 
In addi- basic research involving representations tion, the NIRS equipment is portable of auditory stimuli as well as minimiz- and its use is not constrained by the ing the disturbance and tension that the common requirements of electromag- MRI scanner noise can cause in young netic shielding because it operates in children. the infrared range of the electromagnet- ic spectrum.
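The hemodynamic lag described above, in which neural activity is followed only seconds later by an overshoot of oxygenated blood, is what limits the temporal resolution of both fMRI and NIRS. A minimal sketch of that response, using a canonical double-gamma haemodynamic response function; the function and its parameter values are standard modelling assumptions for illustration, not taken from this paper:

```python
import math

def hrf(t, a1=6.0, b1=1.0, a2=16.0, b2=1.0, ratio=1 / 6.0):
    """Canonical double-gamma haemodynamic response function (SPM-style
    default parameters, used here purely for illustration). The first
    gamma models the oxygenated-blood overshoot, the second the later
    undershoot."""
    if t <= 0:
        return 0.0
    g1 = (b1 ** a1) * t ** (a1 - 1) * math.exp(-b1 * t) / math.gamma(a1)
    g2 = (b2 ** a2) * t ** (a2 - 1) * math.exp(-b2 * t) / math.gamma(a2)
    return g1 - ratio * g2

# Sample the response to a brief stimulus delivered at t = 0.
ts = [i * 0.1 for i in range(0, 301)]   # 0 .. 30 s in 0.1 s steps
bold = [hrf(t) for t in ts]
peak_time = ts[bold.index(max(bold))]

print(f"BOLD response peaks ~{peak_time:.1f} s after the stimulus")
# → peaks at ~5 s
```

With these parameters the response peaks roughly five seconds after a brief stimulus, which illustrates why fMRI (and, to a lesser degree, NIRS) cannot resolve the millisecond-scale cortical events involved in speech processing discussed above.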


MEG

Magnetoencephalography (MEG) is another very important brain imaging technology. MEG was first implemented in 1968 by David Cohen, a physicist then at the University of Illinois (now at Harvard Medical School in Boston), but it was only during the last few decades that its use became more widespread. In contrast to the thousands of fMRI scanners available at hospitals and many universities around the world, the number of MEG cameras in the world is still less than one hundred, with nearly 50% of them located in Japan and the USA. MEG measures the very weak magnetic fields generated by the brain's neuronal activity, i.e. fields of about 10 fT (10 × 10⁻¹⁵ Tesla), measured against the roughly one billion times stronger magnetic field of the Earth, 25-65 µT (25 × 10⁻⁶ to 65 × 10⁻⁶ Tesla). MEG is a silent and passive technique that combines the temporal resolution of EEG with good spatial resolution. Therefore, although MEG's spatial resolution is not as good as fMRI's, the technique's excellent temporal resolution and silent environment make it an optimal compromise for the study of cortical activity associated with rapidly varying stimuli, like speech stimuli or animated visual sequences. Another important feature is that the measured magnetic fields are not affected by the skull's thickness, allowing, for instance, for more reliable measurements of infants' brain activity because MEG data are not affected by differences in fontanel development. In addition, the combination of simultaneous MEG and EEG data provides very reliable and unambiguous source localization results while keeping the high temporal resolution. MEG measurements are typically complemented with the individual subject's structural MRI data, which significantly increases the precision of the source localization estimates and strengthens the interpretation of the MEG and EEG data.

TMS

Transcranial magnetic stimulation (TMS) is a non-invasive technique for inducing momentary electrical disturbances in brain function by using a rapidly varying magnetic field. It is silent and allows localized stimulation of target cortical regions, polarizing or depolarizing the neurons in the targeted region, which momentarily impairs their normal function. Brain function is restored as soon as the stimulation ends, and there are no reported long-term effects from TMS use in scientific research. TMS is an important technical resource for testing hypotheses about the localization of different brain functions. For instance, TMS is currently being used in speech research to investigate the role of Broca's area in the perception of speech contrasts, or to study how blockage of motor cortex areas involved in certain speech articulation movements influences the perception of the speech sounds produced by those articulatory movements.

LSF microscopy

Light-sheet fluorescence microscopy (LSF) is a technique for obtaining high-resolution images of biological structures by systematically illuminating successive thin layers of tissue. This technique is highly beneficial for certain questions, particularly in very small animals, because it offers higher resolution than even a 9.4 T MRI while being very fast, allowing the collection of high-resolution images for large sample sizes. This will be particularly important for large-scale analyses of brain morphology in small vertebrates and insects. Suggested uses would target, for instance, comparative analyses of fine-scale aspects of brain morphology across multiple species of small vertebrates and insects, and artificial selection experiments on various aspects of brain morphology.
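The field strengths quoted in the MEG section above make the measurement challenge concrete; the ratio between the Earth's magnetic field and the neuromagnetic signal can be checked in a few lines (the field values are taken directly from the text):

```python
# Field strengths from the text: MEG signals ~10 fT, Earth's field 25-65 µT.
brain_field = 10e-15          # 10 fT, in Tesla
earth_field = (25e-6, 65e-6)  # 25-65 µT, in Tesla

low, high = (f / brain_field for f in earth_field)
print(f"Earth's field is {low:.1e} to {high:.1e} times stronger")
# → 2.5e+09 to 6.5e+09, i.e. on the order of a billion times, as stated above
```

That nine-orders-of-magnitude gap is what makes MEG recording so technically demanding compared with the other modalities surveyed here.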


Societal impact

• Basic research in the humanities and social sciences that contributes fundamental knowledge about the function of the human brain; research that produces good-quality data and answers to fundamental questions; understanding of the human brain.
• Spin-off effects due to the interdisciplinary character of the research and the critical mass available at Stockholm University.
• Basic research on the similarities and differences in brain morphology and function between dogs, wolves and humans, and on invertebrate and vertebrate brain morphology evolution through the tree of life. In particular, zoological studies in the center will start with large-scale analyses of the evolution of brain morphology and behavior in dog breeds and wolves, rodents, insects and fishes.
• Attracting to Stockholm international research on brain function.

As with the advent of digital computers in the early 1970s, brain imaging resources are still a very expensive and specialized technology, with most infrastructures allocated to hospitals. Investment in brain imaging infrastructures at non-medical universities will trigger a natural increase in knowledge of the functional brain and stimulate the development of methodologies that will integrate brain imaging into a broad range of academic research. The availability of the resources and the increasing volume of their use are expected to offer new research avenues to academic areas that study the complex relationship between individuals and their interaction with others, but that traditionally do not explore empirical methods. For instance, the possibility of studying changes in brain activity in connection with complex human experiences, like films, literature or music, may help propose specific accounts of human behavior in terms of explanatory models that strengthen the traditional theories within the humanities.

Acknowledgements

SUBIC is being created with support from Henrik Granholm's Stiftelse and Stockholm University.

References

The Human Brain Project, A Report to the European Commission, prepared by Henry Markram and co-authors.
Ebner, N. C. et al. (2013). Oxytocin and socioemotional aging: Current knowledge and future trends. Front Hum Neurosci, 7, 487.
Fischer, H. et al. (2010). Simulating neurocognitive aging: effects of a dopaminergic antagonist on brain activity during working memory. Biol Psychiatry, 67, 575-580.
Kuhl, P. K. et al. (1997). Cross-language analysis of phonetic units in language addressed to infants. Science, 277, 684-686.
Kuhl, P. K. et al. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606-608.
MacDonald, S. W. et al. (2008). Increased response-time variability is associated with reduced inferior parietal activation during episodic recognition in aging. J Cogn Neurosci, 20, 779-786.
Marklund, E., Schwarz, I. C., & Lacerda, F. (2014). Mismatch negativity at Fz in response to within-category changes of the vowel /i/. Neuroreport.
Norrelgen, F., Lacerda, F., & Forssberg, H. (1999). Speech discrimination and phonological working memory in children with ADHD. Dev Med Child Neurol, 41, 335-339.
Werheid, K. et al. (2010). Biased recognition of positive faces in aging and amnestic mild cognitive impairment. Psychol Aging, 25, 1-15.

Author index

Aare, Kätlin 47
Asu, Eva Liina 23
Berger, Alexandra 1
Botinis, Antonis 65, 77
Contardo, Ivonne 127
Dabbaghchian, Saeed 29
Edelstam, Fredrik 53
Edlund, Jens 53
Frid, Johan 17
Gardin, Emilie 123
Gustafson, Joakim 53
Gustafsson, Lars 17
Hed, Anna 105
Heldner, Mattias 1, 47
Helgason, Pétur 83
Henriksson, Maria 123
Karlsson, Sofia 1
Karlsson, Fredrik 15
Lacerda, Francisco 133
Lindblom, Björn 133
Hedström Lindenhäll, Rosanna 1
Löfqvist, Anders 17
Markelius, Marie 123
Marklund, Ellen 35
McAllister, Anita 127
Meena, Raveesh 29
Myrberg, Sara 59
Nazem, Atena 35
Nirgianaki, Elina 77
Nolan, Francis 23
Olsson, Sofia 35
Nyberg Pergament, Sarah 1
Renner, Lena 123
Salvi, Giampiero 33
Schötz, Susanne 17, 23, 89, 95
Schwarz, Iris-Corinna 35
Stefanov, Kalin 29
Strömbergsson, Sofia 127
Toivanen, Juhani 5, 101
Tronnier, Mechtild 117
Uhlén, Inger 35
Vanhainen, Niklas 33


Vojnovic, Ivan 1
van de Weijer, Joost 89
Wikstedt, Emilia 123
Włodarczak, Marcin 47
Zellers, Margaret 41
Zetterholm, Elisabeth 111, 117
