
The Phonetician Journal of the International Society of Phonetic Sciences

Editorial board

President: Ruth Huntley Bahr, e-mail: [email protected]
Editor-in-Chief: Mária Gósy, e-mail: [email protected]

Angelika Braun, University of Trier, Germany
Nick Campbell, Trinity College Dublin, Ireland
Jens Edlund, KTH, Sweden
Susanne Fuchs, Centre for General Linguistics, Germany
Hilmi Hamzah, Universiti Utara, Malaysia
Valerie Hazan, University College London, England
Katarzyna Klessa, Adam Mickiewicz University, Poland
Jens-Peter Köster, University of Trier, Germany
Marko Liker, University of Zagreb, Croatia
Alexandra Markó, Eötvös Loránd University, Hungary
Vesna Mildner, University of Zagreb, Croatia
Sylvia Moosmüller †, University of Vienna, Austria
Daniel Recasens, Autonomous University of Barcelona, Spain
Judith Rosenhouse, Swantech Ltd., Israel
Radek Skarnitzl, Charles University, Czech Republic
Elisabeth Shriberg, SRI International, USA
Vered Silber-Varod, The Open University of Israel, Israel
Masaki Taniguchi, Kochi University, Japan
Jürgen Trouvain, University of Saarbrücken, Germany

Editorial assistant: Judit Bóna e-mail: [email protected]

Technical editor: László Csárdás

Submissions should be sent to: e-mail: [email protected]

Guest Editors:

Nicola Klingler, M.A.
Acoustics Research Institute, Austrian Academy of Sciences
Wohllebengasse 12-14, 1040 Wien, Austria
Tel.: +43 1 51581-2541
[email protected]

Doz. Dr. Michael Pucher
Acoustics Research Institute, Austrian Academy of Sciences
Wohllebengasse 12-14, 1040 Wien, Austria
Tel.: +43 1 51581-2508
[email protected]

ISPhS International Society of Phonetic Sciences

Honorary President: Harry Hollien
President: Ruth Huntley Bahr
Secretary General: Mária Gósy

Vice Presidents:
Angelika Braun
Marie Dohalská-Zichová
Mária Gósy
Damir Horga
Heinrich Kelz
Stephen Lambacher
Asher Laufer
Judith Rosenhouse

Past Presidents:
Jens-Peter Köster
Harry Hollien
William A. Sakow †
Martin Kloster-Jensen †
Milan Romportl †
Bertil Malmberg †
Eberhard Zwirner †
Daniel Jones †

Honorary Vice Presidents:

A. Abramson, S. Agrawal, L. Bondarko, E. Emerit, G. Fant †, P. Janota †, W. Jassem †, M. Kohno, E.-M. Krech, A. Marchal †, H. Morioka, R. Nasr, T. Nikolayeva †, R. K. Potapova, M. Rossi, M. Shirt, E. Stock, M. Tatham, F. Weingartner, R. Weiss

President's Office:
Prof. Dr. Ruth Huntley Bahr
Dept. of Communication Sciences and Disorders, University of South Florida
4202 E. Fowler Ave., PCD 1017, Tampa, FL 33620-8200, USA
Tel.: ++1-813-974-3182, Fax: ++1-813-974-0822
e-mail: [email protected]

Secretary General's Office:
Prof. Dr. Mária Gósy
Dept. of Phonetics, ELTE Eötvös Loránd University / Research Institute for Linguistics
Benczúr u. 33, H-1068 Budapest, Hungary
Tel.: ++36 (1) 321-4830 ext. 172, Fax: ++36 (1) 322-9297
e-mail: [email protected]

The Phonetician A Peer-Reviewed Journal of ISPhS/International Society of Phonetic Sciences ISSN 0741-6164 Number 116 / 2019

CONTENTS

Laterals in the L2 phoneme inventories of Bosnian-German late bilinguals
by Carolin Schmid ............................................................. 6

Towards building a cross-lingual speech recognition system for Slovenian and Austrian German
by Andrej Žgank and Barbara Schuppler ........................................ 20

Revisiting nonstandard variety TTS and its evaluation in Austria
by Carina Lozo and Michael Pucher ............................................ 34

“Viennese Monophthongs”: Present – marked – given? On intra-individual variation of /aɛ̯/ and /ɑɔ̯/ diphthongs in standard German pronunciation in rural Austria
by Jan Luttenberger and Johanna Fanta-Jende .................................. 45

Distribution of VOT in conversational speech: The case of Austrian German word-initial stops
by Petra Hödl ................................................................ 59

F0 contours of ironic and literal utterances
by Hannah Leykum ............................................................. 73

ISPhS membership application form ............................................ 84

News on dues ................................................................. 85

GUEST EDITORS’ NOTE

The papers in this special issue were collected on the occasion of the ‘Phonetics and Speech Technology’ workshop at the 44th Austrian Linguistics Conference in Innsbruck, Austria, which was organized by members of the Phonetics group of the Acoustics Research Institute (Austrian Academy of Sciences) in 2018. Following the workshop, we invited the workshop participants and other members of the phonetics community in Austria to submit papers for this special issue. Sylvia Moosmüller, then head of the Phonetics group, and Carolin Schmid initiated and organized the first workshop on "Phonetics in and about Austria" in 2014. The aim was to establish an annual forum for researchers in phonetics in Austria. The idea for this recurring workshop originated in 2013 at the 40th Austrian Linguistics Conference, in a “Sociophonetics” workshop organized by Manfred B. Sellner. A collection of papers following that workshop was published under the title "Phonetics In and About Austria" by the Austrian Academy of Sciences (eds. Sylvia Moosmüller, Carolin Schmid and Manfred B. Sellner).

The "Phonetics In and About Austria" workshop showcased research associated with phonetics, primarily located in Austria. It has subsequently been adopted by academics all over Austria as a welcome forum to present their work. Sylvia Moosmüller sadly passed away in spring 2018, and we organized the workshop to continue the tradition that she started. This special issue includes contributions on L2 acquisition (Schmid), speech technology (Zgank/Schuppler, Lozo/Pucher), dialectology (Luttenberger/Fanta- Jende), and laboratory phonetics (Hödl, Leykum). The breadth of these contributions mirrors the multifaceted research interests of Sylvia Moosmüller and her seminal work in all of these areas. Furthermore, the presence of a great number of young and especially female researchers in this special issue is a testament to the precious and sustained efforts that Sylvia Moosmüller undertook to encourage young scholars to embark on a scientific career.

The guest editors are very happy to be able to publish this special issue in The Phonetician, to underline the close connection Sylvia Moosmüller always had with ISPhS and its journal. We thank the editorial team of The Phonetician for their assistance, and especially Mária Gósy for her support. We also thank the anonymous reviewers.

Nicola Klingler and Michael Pucher

LATERALS IN THE L2 PHONEME INVENTORIES OF BOSNIAN-GERMAN LATE BILINGUALS

Carolin Schmid Acoustics Research Institute, Austrian Academy of Sciences [email protected]

Abstract

This paper contributes to research on phonetic processes in language contact. The pronunciation of German lateral approximants by highly experienced late bilingual L1-Bosnian L2-German speakers living in Vienna (BoV) was analyzed and compared to the pronunciation of L1 German speakers as well as to the same speakers' laterals in L1 Bosnian. Standard Austrian German (SAG) and the Viennese dialect (VD) are the main German contact varieties in Vienna. The languages under investigation have a similar lateral phoneme, which, however, is more velarized in Bosnian than in German. Over 5,000 lateral segments produced by 14 BoV and 12 L1 German speakers were analyzed in read speech by measuring the F₂-F₁ distance in bark and considering (1) lateral position within the word, (2) syllable stress, (3) phoneme context, and (4) speaker gender. Results showed that BoV speakers realize their L2 German laterals differently than L1 SAG and VD speakers. At the same time, their L2 productions are significantly different from the laterals in their L1 Bosnian, and this difference was larger for women than for men. No gender or language differences were found for lateral position within the word, syllable stress, or phoneme context.

Keywords: second language acquisition, speech production, acoustic phonetics, laterals, Bosnian, German

1 Introduction

Learning a second language (L2) later in life is an issue for many people around the world for a variety of reasons, including voluntary or forced migration. When living in an environment in which another language is spoken, the pressures on learners to communicate in the L2 are manifold: among the most important factors are the intelligibility of their L2 speech, the strength of their L1 accent in their L2 speech (which, for instance, can be more or less stigmatized), and the building or preservation of an identity, in terms of belonging to a (language) community (e.g., Derwing & Munro, 2015; Schmid et al., 2014; Krzyzanowski & Wodak, 2007). Learning a second language later in life is a challenge, because many language-external as well as language-internal factors are involved and contribute to the acquisition

process. Among the most important and most investigated language-external factors are the age of onset of acquisition, length of residence in an L2 environment, frequency of contact with speakers of the L1 and L2, language use, and motivation (for overviews, see e.g., Colantoni et al., 2015; Riehl, 2004; Derwing & Munro, 2015). The present paper focuses on the language-internal factors of L2 speech acquisition, that is, how the characteristics of the learners' L1 and the to-be-acquired L2 interact during the acquisition process (Riehl, 2004; Sharwood Smith, 1983). Among the many aspects of language that have to be mastered during L2 acquisition, language-specific characteristics at the phonetic/phonological level are often less accessible to the speaker's awareness than other linguistic levels. The challenge in acquiring a new language therefore involves, most critically, sound perception and production, as evidenced by the retention of a foreign accent (Schmid, 2011). While influences from the learners' native language (L1) on their L2 are often persistent, it has been found that influences can also go from L2 to L1 (e.g., Flege, 1987; Major, 1992; de Leeuw et al., 2013).

This cross-language influence at the sound level has been the focus of many L2 sound learning models. Among the most discussed are the Speech Learning Model (SLM; Flege, 1995; Flege, 2007), the Perceptual Assimilation Model for L2 (PAM-L2; Best & Tylor, 2007), and the Second Language Linguistic Perception (L2LP) model (Escudero, 2009; van Leussen & Escudero, 2015). In the present study, SLM will serve as the framework for the formulation of research questions and the interpretation of the experimental results. SLM focuses on the acquisition of individual L2 segments, rather than on sound contrasts, with which both PAM-L2 and L2LP are primarily concerned. In contrast to L2LP, SLM addresses L2 learning at a given stage and for groups of learners, while L2LP aims at modeling the entire developmental process of L2 speech perception at the level of the individual learner rather than at the level of a specific learner group.

Critically for the present study, SLM assumes that L1 and L2 categories exist in a common phonological space and are therefore presumed to influence each other. On the one hand, this leads to what is termed "equivalence classification", that is, L2 sounds are initially perceived in relation to the L1 sound inventory. On the other hand, the L1 categories are also seen as flexible: neither of the two languages of a language learner (i.e., L1 or L2) is taken to match the language of monolinguals. As for the relation of L1 and L2 sounds, a given L2 category is either identified with and used in the form of an already existing sound category of the L1, or the learner builds a new category. If an L2 sound is similar to an L1 sound (only differing in fine phonetic detail), SLM predicts an assimilation of L1 and L2 sounds, so that their pronunciations would fall within the range of one common, merged category, and each of the sounds would approach the respective other sound. If an L2 sound is perceived to be different from all L1 phonetic categories, the speaker would more probably build a new sound

8 Carolin Schmid category in his/her L2 and thus be able to realize it within the range of native speakers of the L2. In this case, SLM predicts a dissimilation of the L1- and L2 sounds, in order to maintain or enhance the perceived contrast. SLM, as well as the other L2 sound learning models, has to a large extent been tested on the acquisition of vowel categories (e.g. Bohn & Flege, 1992; Flege & MacKay, 1999; Pallier et al., 2001) or stop consonants (mainly focusing on voice onset time; Flege, 1987; Kim, 1994; Bond & Fokes, 1991; Pater, 2003) with a focus on English as the target language, or, less frequently, French. As for the acquisition of liquids, the majority of studies deal with the rhotic-lateral contrast, most prominently with the acquisition of the English /l/-/ɹ/ contrast by L1 Japanese speakers (e.g., Ingvalson et al, 2011; Aoyama et al., 2004). The focus of the present study is on the production of L2 laterals. Previous studies on laterals were mostly concerned with the clear-dark continuum in languages with one or more lateral phonemes or allophones. The articulatory basis for clear and dark laterals is the alveolar lateral approximant, where clear laterals refer to plain alveolar laterals and dark laterals to velarized lateral approximants. The acoustic realization of laterals depends on vowel context and language variety (Recasens, 2004; Carter 2002; Carter & Local, 2007). It also may be subject to psychosocial factors (Moosmüller et al., 2016; Simonet, 2010), and depend on the phonemic status, if more than one lateral is involved (Moosmüller et al., 2016; Müller, 2015). Only a few studies deal with the influence of language contact on the realization of different lateral sounds. Simonet (2010) describes the pronunciation of laterals by early Catalan-Spanish and Spanish-Catalan bilinguals, who encounter a velarized lateral in Catalan and a plain alveolar lateral in Spanish. In terms of the SLM, this study found evidence for new category formation in the L2. However, at the same time some evidence was found for category assimilation, as the mean values for L2 laterals were produced as closer to the laterals of the L1 than to the laterals of speakers for whom their L2 was the dominant language. While most speakers were found to differentiate both lateral categories, some of the female Spanish-dominant speakers appeared to have developed a single, merged category by assimilating the velarized lateral to the plain alveolar one, suggesting a sociolinguistic aspect of language contact. The current study investigates the pronunciation of laterals by Bosnian-German late bilinguals living in Vienna who migrated in adulthood during the Bosnian war in the 1990s (estimates suggest that a total of 40,000 people were affected, see Magistrat der Stadt Wien, 2018). Bosnian1, as a western south Slavic language based on the Stokavian dialect, distinguishes two lateral phonemes: a palatalized and a velarized lateral phoneme, which are also described as soft (palatalized) and hard (velarized) laterals in Slavistics (Petrovic & Grubisic, 2010). The velarized lateral in Bosnian is described as darker compared to the German lateral (Maric, 2005; Gick et al., 2006; Recasens, 1995, e.g. for the German lateral) and is orthographically identical to the

1 In the present study, the terminus “Bosnian” is used for the language spoken by the people living in Bosnia, irrespective of religious or ethnic attributions. Laterals in the L2 phoneme inventories of Bosnian-German late bilinguals 9 German lateral /l/. Standard Austrian German (SAG) and the Viennese dialect (VD) only feature one lateral phoneme, which is in both varieties a (plain) alveolar lateral (Moosmüller et al., 2015; Schmid et al., 2015). From a sociolinguistic perspective, it is noteworthy that this lateral phoneme has a velarized variant in the VD, even though this variant is avoided in careful speech, as it is a salient feature for the rather negatively evaluated dialect (Schmid et al., 2015, Moosmüller, 1991). However, since it is a possible contact variant, the present study makes use of two L1 German control groups, one speaking SAG and one VD. Overall, and in contrast to most previous studies, the present study assesses the production of an L2 phoneme that could be considered to have two counterparts in the learners' L1. However, for reasons of space, and acoustic similarity between sound categories (see below), the present study focuses on only two comparisons: (1) the production of German /l/ by BoV speakers as compared to the L1 German control groups and (2) production of the German /l/ by the BoV speakers compared to their production of the velarized Bosnian lateral. 1.1 Articulation and acoustics of alveolar and velarized laterals In the present study, the phonetic realizations of the L2 German plain alveolar lateral (hereafter, this lateral will be referred to as alveolar lateral) and the L1 Bosnian velarized lateral are investigated as those sharing greater acoustic similarity (see also Simonet, 2010). In terms of the SLM, the German L2 lateral is supposed to be considered a "similar" phoneme and hence should be assimilated to the Bosnian L1 lateral category. The similarity of the L1 velarized (instead of the palatalized) and the L2 alveolar lateral sounds arises because of the (primary) place of articulation, the acoustic output (particularly as the laterals are gradual in nature and strongly dependent on the vowel context, see also Recasens, 2012), and especially because of the orthographic representation. Lateral sounds are articulated with the tongue forming a closure at the mid-sagittal line of the vocal tract, while the airstream escapes by the sides of the tongue. The German lateral is characterized as having an alveolar constriction (likewise for Standard Austrian German) (Wängler, 1961; Delattre, 1965; Recasens, 2012), that is, the front part of the tongue is pressed against the hard palate. For velarized laterals this front gesture can be even more fronted with contact against the teeth. In addition, velarized laterals have a secondary articulatory gesture in which the back of the tongue body is retracted. Acoustically, laterals are characterized by high sonority and therefore “[…] by well-defined formant-like resonances.” (see Ladefoged & Maddieson, 1996: 193). At the transitions between vowels and laterals, abrupt changes in formant location and overall spectral intensity are often observed (ibid.). These are caused by the presence of an additional resonator, more precisely, the cavity above the constriction and the side channels, which acoustically lead to the spectral zeros and antiformants (Stevens, 1998). 
Especially F₂ strongly depends on the place of articulation of the closure and the general shape of the tongue body (Ladefoged & Maddieson, 1996: 193). The longer cavity behind the primary place of articulation in velarized laterals leads to lower F₂ values and to a perceptually darker lateral quality

10 Carolin Schmid of velarized laterals compared to alveolar laterals. The F₂ values display the most important differences in the comparison of these two lateral types (in velarized laterals, F₂ is below 1000 Hz in the /a/ context and around 1000 Hz in the /i/ context, and above 1220 Hz in the /a/ context and even above 1500 Hz in the /i/ context in alveolar laterals, see Recasens, 2012). F₁ in contrast is higher in velarized laterals and lower in alveolar laterals, but overall has a relatively low frequency in all laterals (mostly below 400 Hz). Thus, F₂ is considered to be the most important parameter for the differentiation of alveolar and velarized laterals (amongst others Thomas, 2011; Sproat & Fujimura, 1993; Carter & Local, 2007). 1.2 Research Questions Based on the literature reviewed in the introductory section, the following hypotheses are formulated within the SLM framework: 1. L2 German laterals pronounced by BoV speakers will differ significantly from L1 German laterals pronounced by SAG and VD speakers. 2. L2 German laterals pronounced by BoV speakers will not differ significantly from L1 Bosnian laterals. This is because the L2 German lateral category is rather similar to the L1 Bosnian lateral category of the velarized lateral. SLM predicts that similar categories are more difficult to acquire than new categories and hence will be merged.

2 Methods

2.1 Participants

A bilingual group of 14 L1 Bosnian L2 German speakers (7 female) living in Vienna was recorded, as were 12 L1 German speakers (6 female). In the German group, six were selected to be speakers of Standard Austrian German (SAG) and six speakers of Viennese dialect (VD; see below for details). Demographic information was assessed in a qualitative interview. The BoV speakers were all born and raised in the region of today's Bosnia and came to Vienna as young adults, between the ages of 20 and 35. Since all BoV speakers were migrants during the war in Bosnia, they had been living in Vienna for about 20 years. At the time of recording, they were between 40 and 60 years of age. They learned German mainly through immersion; for them, German was the only, or by far the most dominant, foreign language. The educational background of the BoV speakers was relatively homogeneous, as all had started university studies in Bosnia before moving to Austria. The L1 German speakers spanned the same age range as the BoV speakers (40-60 years). They were all born in Vienna, as were their parents. The SAG speakers and their parents had a higher educational background than the VD speakers and their parents (see Moosmüller, 1991; Moosmüller et al., 2015).

2.2 Materials and Recordings

In the present study, read speech was analysed. 104 German target words were selected to be balanced for the position of the lateral within the word (word-initial, medial but morpheme-initial, medial, and final position), syllable stress (stressed, secondary stressed, or unstressed), and vowel context (front vs. back vowels preceding and following the lateral). For each target word, a short context sentence was constructed such that the targets occurred sentence medially (see Table 1). Since a pure reading task was considered not sufficiently engaging for the participants, 46 sentences were constructed such that participants could be asked to finish the sentences as they thought fit.

Table 1: Examples for laterals embedded in target words and carrier sentences in the German reading task

Word      Sentence                   English translation
leben     Sie leben nur einmal.      You only live once.
Volumen   Das Volumen ist groß.      The volume is large.
Kiel      Das ist in Kiel anders.    This is different in Kiel.

For the BoV speakers, an additional 36 target words in Bosnian were selected that contained the palatalized and the velarized lateral. 28 of the targets formed /l'/-/ɫ/ minimal pairs. The other words formed near-minimal pairs to control for vowel context. The words were embedded in scripted dialogue sequences (Table 2).

Table 2: Examples for laterals embedded in target words/minimal pairs and carrier sentences/dialogues in the Bosnian reading task (minimal pair – carrier dialogue – English translation of the reply)

ljudi: „Puno ljudi ima na ulici.“ - „Ljudi si rekao?“ - „Da, ljudi sam rekao.“ (Yes, I said people.)
ludi: „Imao sam ludi period.“ - „Ludi si rekao?“ - „Da, ludi sam rekao.“ (Yes, I said mad.)
bolje: „Možete to i bolje napraviti.“ - „Bolje si rekao?“ - „Da, bolje sam rekao.“ (Yes, I said better.)
bole: „Bole me noge.“ - „Bole si rekao?“ - „Da, bole sam rekao.“ (Yes, I said hurt.)
koralj: „Koralj je u moru.“ - „Koralj si rekao?“ - „Da, koralj sam rekao.“ (Yes, I said coral.)
koral: „U koral idu ovce.“ - „Koral si rekao?“ - „Da, koral sam rekao.“ (Yes, I said chorus.)

Recordings were made between September 2015 and 2017 at the Acoustics Research Institute of the Austrian Academy of Sciences. They took place in a sound-attenuated booth using an Edirol Roland R-44 recorder and a suspended microphone. The signal was digitized at 44.1 kHz. The sentences were read off a sheet of paper. All speakers read the German sentences twice, BoV speakers additionally read the Bosnian sentences twice. Before the reading task, the author conducted a semi-structured interview to elicit spontaneous speech and assess participants' biographic and sociocultural background. The interview was conducted using the same equipment as

12 Carolin Schmid for the reading task. Speech of the interviewee and the author were recorded. Recordings took approximately 90 minutes. 2.3 Measurements and Statistics The laterals, the vowels surrounding the laterals, and the whole target word were segmented manually by the author using STx software (Noll et al., 2019). The boundaries of the lateral segments were determined by a drop in intensity relative to the surrounding vowel, changes in the spectral composition of the signal and formant transitions. Formant frequencies (F₁ and F₂) of the laterals were calculated using the linear prediction coding method (LPC), and, if necessary, were manually corrected. Mean F₁ and F₂ values over the whole lateral segment were calculated. In order to normalize differences in vocal tract length, the distance between F₁ and F₂ in bark was calculated for each lateral (mean F₂-mean F₁, see Simonet, 2010; Nance, 2014). Velarized laterals were indicated by a smaller distance between F₁ and F₂, as compared to plain alveolar laterals. This distance measure was subjected to statistical analyses. A total of 5206 laterals was analyzed. For statistical analyses, linear mixed-effects models were fitted in R (R core team, 2019) using the lme4 package (Bates et al., 2015). Two main models were built: one model to compare the L2 German pronunciation of the BoV speakers to the monolingual German speaking control groups, and the second to compare the L2 German pronunciation of the BoV speakers to the pronunciation of laterals in their L1 Bosnian. In both models, the F₂- F₁ difference was used as the dependent variable. Fixed factors in the model comparing L2 vs. L1 German were lateral Position within the word (word initial/medial but morpheme initial/medial/final), Speaker Gender (m/f), Syllable Stress (primary/secondary/unstressed), Phoneme Context (back- back/back-front/front-back/front-front), Speaker Group (Bosnian/SAG/VD), the interaction between the latter two factors, as well as the interactions between Speaker Gender and Speaker Group, and lateral Position and Speaker Group. In the model comparing BoV speakers' laterals in their L1 vs. L2, fixed factors were lateral Position within the word (word initial/medial2/final), Speaker Gender (m/f), Language (Bosnian/German) and the interaction between Language and Gender, and lateral Position and Language. Phoneme Context and Syllable Stress could not be analyzed in this model due to the smaller number of tokens in the Bosnian recordings. Speaker and Word were entered as random factors. Significance of factors was first assessed through model comparisons using log- likelihood ratio tests as implemented in the Anova() function in R. Interactions and factors were eliminated one at a time. Interactions and factors were retained in the model if the more complex model with this factor included fit the data better than the simpler model without this interaction or factor. The best fitting models were further inspected as described below.
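To make the measurement and modelling procedure more concrete, the following R sketch illustrates the pipeline described above. It is not the authors' script: the Hz-to-bark conversion (Traunmüller's 1990 formula is assumed here, since the paper does not state which transform was used), the data frame `laterals` and its column names are hypothetical, and the term-elimination procedure is sketched with nested-model anova() comparisons.

```r
library(lme4)

# Hz-to-bark conversion (Traunmüller, 1990); assumed, not specified in the paper.
hz_to_bark <- function(f_hz) 26.81 * f_hz / (1960 + f_hz) - 0.53

# 'laterals' is a hypothetical data frame with one row per lateral token,
# holding the mean F1/F2 values (Hz) and the predictors coded as in 2.3.
laterals$dist_bark <- hz_to_bark(laterals$f2_hz) - hz_to_bark(laterals$f1_hz)

# Model for the L2 German vs. L1 German comparison: fixed effects as listed
# in the text, random intercepts for Speaker and Word.
m_full <- lmer(dist_bark ~ Position + Stress + Gender + Context * Group +
                 Gender:Group + Position:Group +
                 (1 | Speaker) + (1 | Word),
               data = laterals, REML = FALSE)

# Backward elimination via log-likelihood ratio tests: drop one term,
# refit, and keep it only if the richer model fits significantly better.
m_reduced <- update(m_full, . ~ . - Gender:Group)
anova(m_reduced, m_full)

summary(m_full)  # inspect the coefficients of the best-fitting model
```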

2 Note that there were no Bosnian tokens in which the laterals occurred morpheme initially but in word medial position. Laterals in the L2 phoneme inventories of Bosnian-German late bilinguals 13 3 Results 3.1 L2 German laterals compared to L1 German laterals Model comparisons using log-likelihood ratio tests showed that the best fitting model included the fixed factors lateral Position, Speaker Group, Phoneme Context and the interaction between the latter two factors. To assess the main effect of Position, the model was inspected using the summary() function in R. The F₂- F₁ difference of the laterals significantly differed between the final and initial position (where the final position was mapped onto the intercept; b(initial) = 0.463, t = 2.63, p =.009). To follow up on the interaction and specifically assess the effect of Speaker Group with regard to possible differences between the BoV (i.e., L2 German) speakers and the L1 German groups, a set of linear mixed-effects models was fit, one for each of the levels of Phoneme Context. Again, the F₂- F₁ difference served as the dependent variable, and fixed factors were lateral Position and Speaker Group, as well as random effects for participants and items. For the Factor Speaker Group, BoV was mapped onto the intercept such that the model inspection would show differences in the two L1 German groups. Note that Position was kept in the models as a control variable and effects of Speaker Group are averaged across lateral Positions. Results are presented in Table 3 and show that the laterals produced by BoV speakers in German significantly differ from the laterals of both L1 German groups in all phoneme contexts, specifically, the laterals of the BoV speakers were produced with lower F₂- F₁ values (see also Figure 1).

Table 3. Results of the follow-up models that were fit to assess differences in lateral production between the speaker groups for the different Phoneme Contexts

                  difference SAG - BoV speakers    difference VD - BoV speakers
Phoneme context     b       t        p               b       t        p
back_back          2.12    4.454    < .001           1.28    2.926    .008
back_front         1.64    3.623    .001             1.39    3.327    .003
front_back         1.74    3.114    .005             1.48    2.897    .009
front_front        1.05    2.317    .03              1.34    3.226    .004
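The follow-up comparison summarized in Table 3 can be sketched as follows (again with hypothetical object and column names): Speaker Group is releveled so that the BoV speakers form the reference level, the data are split by Phoneme Context, and Position is retained as a control predictor.

```r
library(lme4)

# BoV as reference level, so the SAG and VD coefficients express the
# difference of each L1 German group from the bilingual speakers.
laterals$Group <- relevel(factor(laterals$Group), ref = "BoV")

# One follow-up model per Phoneme Context (back_back, back_front, ...).
followups <- lapply(split(laterals, laterals$Context), function(d) {
  lmer(dist_bark ~ Position + Group + (1 | Speaker) + (1 | Word), data = d)
})

# The GroupSAG and GroupVD rows correspond to the b and t values in Table 3
# (p-values as reported would additionally require, e.g., lmerTest).
lapply(followups, function(m) summary(m)$coefficients)
```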


Figure 1. Interaction of Speaker Group and phoneme context.

3.2 L2 German laterals compared to L1 Bosnian laterals

Model comparisons using log-likelihood ratio tests showed that the best fitting model included the fixed factors Position, Gender, Language, and the interaction between the two latter factors. An inspection of the model again showed a significant difference between F₂-F₁ values for laterals in word-final vs. word-initial position (final position mapped onto the intercept: b(initial) = 0.738, t = 2.40, p = .017). To follow up on the interaction and specifically assess the effect of Language (L1 Bosnian vs. L2 German), two additional models were fit, one for each gender (i.e., male and female). The F₂-F₁ difference in the speakers' Bosnian vs. German laterals differed for both male and female speakers (male: b(L2_German) = 1.027, t = 3.024, p = .003; female: b(L2_German) = 1.580, t = 5.235, p < .001); however, the difference between languages appeared larger for the female than the male speakers.
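The analogous follow-up for the Language × Gender interaction, as a hedged sketch under the same assumptions (the data frame is restricted to the bilingual speakers and is assumed to contain both their Bosnian and their German tokens):

```r
library(lme4)

# Bilingual speakers only; Bosnian (L1) as the reference level for Language.
bov <- subset(laterals, Group == "BoV")
bov$Language <- relevel(factor(bov$Language), ref = "Bosnian")

# One model per gender.
by_gender <- lapply(split(bov, bov$Gender), function(d) {
  lmer(dist_bark ~ Position + Language + (1 | Speaker) + (1 | Word), data = d)
})

# The LanguageGerman coefficient per gender corresponds to the b values
# reported in 3.2 (larger for the female than for the male speakers).
sapply(by_gender, function(m) fixef(m)["LanguageGerman"])
```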


Figure 2. Interaction of Language and Gender

4 Discussion and Conclusions The present study examined the pronunciation of lateral phonemes in the L2 German speech of highly experienced, late bilingual L1 Bosnian speakers. First, lateral sounds in L2 German were compared to laterals of L1 German speakers. It was shown that the L2 German laterals were more velarized (thus have smaller F₂-F₁ values) than the L1 German laterals, irrespective of lateral position within the word and Phoneme context, i.e., the laterals in L2 German and L1 German differed significantly in their degree of velarization. Crucially, this result held for both the SAG and the VD, even though the latter is often characterized by having a velarized lateral variant3 (e.g., Schmid et al., 2015). Second, laterals in L2 German were compared to the laterals of the same speakers’ L1 Bosnian. In Bosnian speech, the velarized lateral phoneme served as reference, as it was hypothesized that – due to articulatory and orthographic similarity – the L2 German lateral would be assimilated to the velarized lateral phoneme in L1 Bosnian (according to the SLM, see Flege, 2007). Results showed that laterals in L2 German

3 The present data also revealed that the VD, compared to SAG, has a more velarized lateral variant, which, however, differs only in the back-back context.

by the BoV speakers were not only different from the laterals of the L1 German speakers, but also significantly different from the velarized laterals in their L1 Bosnian. Laterals in L2 German were less velarized (thus having greater F₂-F₁ differences) than the L1 Bosnian laterals, irrespective of lateral position within the word.

Concerning the laterals in L1 Bosnian and L2 German, the F₂-F₁ difference also varied as a function of gender. Female speakers differentiated more between laterals in their L1 and their L2, insofar as their L2 German laterals were closer to those of the L1 German speakers (with greater F₂-F₁ distances) than the laterals of the male speakers. This suggests that female speakers were more prone than men to use L2 speech categories within the range of native speakers of that language. This is in line with previous variationist studies concerning the use of standard language within the scope of L1 variation and the leading role of female speakers in sound change (e.g., Labov, 1990; see also Chambers, 1995). Female speakers could thus be “leaders in sound change” by building new L2 sound categories closer to the more prestigious target language categories than the male speakers, as has been suggested recently by a number of studies (e.g., Maclagan et al., 1999; Eckert & McConnell-Ginet, 1999; Shin, 2013; van der Slik et al., 2015; Moyer, 2016).

With regard to L2 sound learning models, and specifically SLM, the results of the present study suggest that L2 German laterals are considered similar to the L1 velarized laterals by BoV learners, but apparently not similar enough for full equivalence classification to occur. This is shown by the finding that the laterals of the bilingual speakers are significantly more velarized than the laterals of the L1 German speakers. However, since the L2 German laterals of the bilingual speakers were significantly less velarized than their L1 Bosnian laterals, the results also provide evidence for new category formation. Notably, the present findings for late bilingual Bosnian-German speakers are in line with the study of Simonet (2010) on early Catalan-Spanish bilinguals, who also realized the lateral sounds as a new category, significantly different from both the L2 and the L1 categories. Overall, the similar L2 lateral phoneme, which would be expected to be merged with the L1 lateral phoneme category, seems to be identified as a new sound category. But instead of undergoing a dissimilation process, as would be predicted for a perceived new category in an L2, it is not realized within the range of the L2 and is still geared towards the L1 category. The results show that even late learners are sensitive to fine phonetic details which are not important for signaling phonological contrast.

5 Acknowledgements I thank Eva Reinisch for valuable input and help in the writing process, an anonymous reviewer for her/his helpful comments, and all participants of this study.

References Aoyama, K., Flege, J., Guion, S., Akahane-Yamada, R., & Yamada, T. 2004. Perceived phonetic dissimilarity and L2 speech learning: The case of Japanese /r/ and English /l/ and /r/. Journal of Phonetics, 32, 233-250. Laterals in the L2 phoneme inventories of Bosnian-German late bilinguals 17 Bates, D., Maechler, M., Bolker, B., & Walker, S. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48. Best, C. & Tylor, M. 2007. Nonnative and second-language speech perception: Commonalities and complementarities. In: Munro, M., Bohn, O.-S. (eds.): Language experience in second language speech learning: In honor of James Emil Flege. Amsterdam: John Benjamins. 13- 34. Bohn, O.-S. & Flege, J. 1992. The production of new and similar vowels by adult German learners of English. Studies in Second Language Acquisition, 14, 131-158. Bond, Z. & Fokes, J. 1991. Perception of English voicing by native and non-native adults. Studies in Second language acquisition, 13(4), 471-492. Carter, P. 2002. Structured variation in British English liquids: The role of resonance. Unpublished PhD thesis. York: University of York, Department of Language and Linguistic Science. Carter, P. & Local, J. 2007. F2 variation in Newcastle and Leeds English liquid systems. Journal of the International Phonetic Association, 37, 183-199. Chambers, J. 1995. Sociolinguistic Theory. Oxford: Blackwell. Colantoni, L., Steele, J., & Escudoro, P. 2015. Second language speech: Theory and practice. Cambridge: Cambridge University Press. Delattre, P. 1965. Comparing the phonetic features of English, French, German and Spanish: An interim report. Heidelberg: Julius Gross Verlag. de Leeuw, E., Mennen, I., & Scobbie, J. 2013. Dynamic systems, maturational constraints and L1 phonetic attrition. International Journal of Bilingualism, 17(6), 683-700. Derwing, T. & Munro, M. 2015. Pronunciation fundamentals: Evidence-based perspectives for L2 teaching and research. Amsterdam: John Benjamins. Eckert, P. & McConnell-Ginet, S. 1999. New generalizations and explanations in language and gender research. Language in Society, 28, 185-201. Escudero, P. 2009. Linguistic perception of “similar” L2 sounds. In: Boersma, P. & Hamann, S. (eds.): in Perception. Berlin: Mouton de Gruyter. 151-190. Flege, J. 1987. The production of and phones in a foreign language. Evidence for the effect of equivalence classification. Journal of Phonetics, 15, 47-65. Flege, J. 1995. Second language speech learning: Theory, findings, and problems. In: Strange, W. (ed.): Speech perception and linguistic experience: Issues in cross-language research. Baltimore: York Press. 233-277. Flege, J. 2007. Language contact in bilingualism: Phonetic system interactions. Laboratory Phonology, 9, 353-381. Flege, J. & MacKay, I. 1999. Perceiving vowels in a second language. Studies in Second Language Acquisition, 26(1), 1-34. Gick, B., Campbell, F., Oh, S., & Tamburri-Watt, L. 2006. Toward universals in the gestural organization of syllables: A cross-linguistic study of liquids. Journal of Phonetics, 34(1), 49-72. Ingvalson, E., McClelland, J., & Holt, L. 2011. Predicting native English-like performance by native Japanese speakers. Journal of Phonetics, 39(4), 571-584. Kim, M.-R. C. 1994. Acoustic characteristics of Korean stops and perception of English stop consonants. Ph.D.Thesis. Madison, Wisconsin: University of Wisconsin-Madison. Krzyzanowski, M. & Wodak, R. 2007. Multiple identities, migration and belonging: 'Voices of migrants'. 
In: Caldas-Coulthard C.R., Iedema R. (eds.): Identity Troubles. London: Palgrave Macmillan. 95-119. Labov, W. 1990. The intersection of sex and social class in the course of linguistic change. Language Variation and Change, 2, 205-254. Ladefoged, P. & Maddieson, I. 1996. The sounds of the world’s languages. Oxford: Blackwell Publishers.

18 Carolin Schmid Maclagan, M., Gordon, E., & Lewis, G. 1999. Women and sound change. Conservative and innovative behavior by the same speakers. Language Variation and Change, 11(1), 19-41. Magistratsabteilung 17. 2018. Daten und Fakten Stadt Wien: MigrantInnen in Wien 2018. Available from: https://www.wien.gv.at/menschen/integration/daten-fakten/bevoelkerung- migration.html. Major, R. 1992. Losing English as a . The Modern Language Journal, 76(2), 190-208. Maric, D. 2005. Das System der Aussprachefehler der Bosnisch/Kroatisch/Serbisch lernenden Deutschen. Pismo, 3, 116-138. Moosmüller, S. 1991. Hochsprache und Dialekt in Österreich. Soziophonologische Untersuchungen zu ihrer Abgrenzung in Wien, Graz, Salzburg und Innsbruck. Sprachwissenschaftliche Reihe, 1. Wien, Köln, Weimar: Böhlau. Moosmüller, S., Schmid, C., & Brandstätter, J. 2015. Standard Austrian German. Journal of the International Phonetic Association, 45(3), 339-348. Moosmüller, S., Schmid, C., & Kasess, C. 2016. Alveolar and velarized laterals in Albanian and in the Viennese dialect. Language and Speech, 58(4), 488-514. Moyer, A. 2016. The puzzle of gender effects in L2 phonology. Journal of Second Language Pronunciation, 2(1), 8-28. Müller, D. 2015, Cue weighting in the perception of phonemic and allophonic laterals along the darkness continuum: evidence from Greek and Albanian. Albanohellenica 6. Nance, C. 2014. Phonetic variation in Scottish Gaelic laterals. Journal of Phonetics, 47, 1-17. Noll, A., Stuefer, J., Klingler, N., Leykum, H., Lozo, C., Luttenberger, J., Pucher, M., & Schmid, C. 2019. Sound Tools eXtended (STx) 5.0 – a powerful sound analysis tool optimized for speech. Proceedings of Interspeech 2019, Graz. 2370-2371. Pallier, C., Colomé, A., & Sebastiàn-Gallés, N. 2001. The influence of native language phonology on lexical access: Exemplar-based versus abstract lexical entries. Psychological Science, 12, 445-449. Pater, J. 2003. The perceptual acquisition of Thai phonology by English speakers: Task and stimulus effects. Second Language Research, 19(3), 209-223. Petrovic, D. & Grubisic, S. 2010. Fonologija srpskoga jezika. Beograd: SANU. R Core Team. 2019. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available from: https://www.R-project.org/. Recasens, D. 1995. Velarization degree and coarticulatory resistance for /l/ in Catalan and German. Journal of Phonetics, 23, 37-52. Recasens, D. 2004. Darkness in [l] as a scalar phonetic property: implications for phonology and articulatory control. Clinical Linguistics & Phonetics, 18 (6-8), 593-603. Recasens, D. 2012. A cross-language acoustic study of initial and final allophones of /l/. Speech Communication, 54, 368-383. Riehl, C. 2004. Sprachkontaktforschung: Eine Einführung. Tübingen: Narr. Schmid, C., Moosmüller, S., & Kasess, C. 2015. Sociophonetics of the velarized lateral in the Viennese dialect. In: The Scottish Consortium for IcPhS 2015 (ed.): Proceedings of the 18th International Congress of Phonetic Sciences. Glasgow: The University of Glasgow. Schmid, M. 2011. Language attrition. New York: Cambridge University Press. Schmid, M., Steinkrauss, R., & Lahmann, C. 2014. Sprachverlust im Kontext von Migration und Asyl. In: Bischoff, D., Gabriel, C., & Kilchmann, E. (eds.): Sprache(n) im Exil. München: Edition text + kritik.121-131. Sharwood Smith, M. 1983. Cross-linguistic Aspects of Second Language Acquisition. Applied Linguistics, 4(3), 192–199. Shin, N. L. 2013. 
Women as leaders of language change: A qualification from the bilingual perspective. In: Carvalho, A. & Beaudrie, S. (eds.): Selected Proceedings of the 6th Workshop on Spanish Sociolinguistics. 135-147. Simonet, M. 2010. Dark and clear laterals in Catalan and Spanish: Interaction of phonetic categories in early bilinguals. Journal of Phonetics, 38(4), 663-678. Laterals in the L2 phoneme inventories of Bosnian-German late bilinguals 19 Sproat, R. & Fujimura, O. 1993. Allophonic variation in English /l/ and its implications for phonetic implementation. Journal of Phonetics 21(3), 291-311. Stevens, K. 1998. Acoustic phonetics. Cambridge: Cambridge University Press. Thomas, E. 2011. Sociophonetics: An introduction. Basingstoke, Hampshire, New York: Palgrave Macmillan. van der Slik, F., van Hout, R., & Schepens, J. 2015. The gender gap in second language acquisition: Gender differences in the acquisition of Dutch among immigrants from 88 countries with 49 mother tongues. PLoS ONE, 10(11). Van Leussen, J.-W. & Escudero, P. 2015. Learning to perceive and recognize a second language: The L2LP model revised. Frontiers in Psychology, 6, 1-12. Wängler, H.-H. 1961. Atlas deutscher Sprachlaute. Berlin: Akademie-Verlag.

TOWARDS BUILDING A CROSS-LINGUAL SPEECH RECOGNITION SYSTEM FOR SLOVENIAN AND AUSTRIAN GERMAN

Andrej Žgank¹ and Barbara Schuppler²
¹ Laboratory for Digital Signal Processing, University of Maribor, Slovenia
² Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria
[email protected], [email protected]

Abstract

Methods of cross-lingual speech recognition have a high potential to overcome the limited availability of spoken language resources for under-resourced languages. Not only can they be applied to build automatic speech recognition (ASR) systems for such languages, they can also be utilized to generate further spoken language resources. This paper presents a cross-lingual ASR system based on data from two languages, Slovenian and Austrian German. Both were used as source and target language for cross-lingual transfer (i.e., the acoustic models were trained on material from the source language, and recognition was tested on material from the target language). The cross-lingual mapping between the Slovenian phone set (40 phones) and the Austrian German phone set (33 phones) was carried out using expert knowledge about the acoustic-phonetic properties of the phones. For the experiments, we used data from two speech corpora: the Slovenian BNSI Broadcast News speech database and the Austrian German GRASS corpus. We trained HMM and DNN acoustic models for monolingual and cross-lingual speech recognition. The evaluation showed that the DNN acoustic models outperformed the HMM models, and that the speech recognition results with Austrian German as the target language clearly outperformed those with Slovenian as the target language. Possible explanations for this difference in performance are: 1) the higher number of phones in the Slovenian language, 2) the speaking style discrepancies between the databases (i.e., a mix of read and spontaneous speech in the Slovenian data vs. read speech only in the Austrian data), and 3) the recording quality mismatch (i.e., GRASS is recorded under better conditions than BNSI).

Keywords: Cross-lingual, Automatic Speech Recognition, Slovenian, Austrian German

1 Introduction

In recent years, deep learning approaches have improved the area of spoken language technologies significantly, particularly automatic speech recognition (ASR).

However, despite this progress, the question of available language resources still presents an important limitation. In order to be able to develop an automatic speech recognition system successfully, language resources are needed in the form of a transcribed speech database with a duration of 10 to 100 hours. The main aim behind building cross-lingual ASR systems is to develop methods that allow a reduction in the data quantity needed from each language. The automatic speech recognizer developed for one (source) language can be ported to another (target) language, taking into account some decrease of speech recognition accuracy. The first Slovenian – Austrian German cross-lingual ASR is proposed in this paper. Considering the large number of languages worldwide, particularly those which are under-resourced, cross-lingual speech recognition plays an important role.

The paper is organized as follows. A general overview of cross-lingual speech recognition, previous work on cross-lingual ASR and the development of corpora for low-resourced languages are provided in the remaining part of Section 1. The spoken language resources used for the experiments and the automatic speech recognition set-up are presented in Section 2. Our speech recognition results are presented and discussed in Section 3, while the paper ends with the conclusion in Section 4.

1.1 Cross-lingual speech recognition

Cross-lingual speech recognition is based on the hypothesis that speech models (acoustic and/or language) from the source language can be applied successfully to recognize speech in the target language without using any spoken language resources from the target language. The common phones between different languages can be used as a foundation for cross-lingual speech recognition. The number of common phones depends on the languages' similarities, and is usually highest for similar languages belonging to the same language group. For the remaining phones, the most suitable match must be found between the existing source phones and target phones. The similarity estimation between the source and target phones can be carried out either with expert knowledge or by data-driven methods (Besacier et al., 2014). In the case of an expert knowledge approach, the acoustic-phonetic properties are taken into account when a human expert tries to find the best match between a target language phone and existing source phones. When there is no perfect match, the expert must decide which articulatory characteristics of the source phone are the most important ones. In the case of a data-driven method, information based directly on speech data or on parameters of acoustic models is used to estimate the best match between the source and target phones. The data-driven method depends on the availability of spoken language material, which can be challenging in the case of a very limited amount of target language data. An alternative cross-lingual speech recognition approach is to build the source acoustic models in a multilingual way (Huang et al., 2013), which usually covers a wider set of acoustic-phonetic characteristics, and can, thus, improve the cross-lingual representation. In our work presented here, we follow the knowledge-based approach,

where the mapping between source and target language is defined according to the acoustic-phonetic characteristics.

1.2 Previous work

Research on cross-lingual speech recognition has addressed various methods and languages (Besacier et al., 2014). Cross-lingual speech recognition systems can be built either for similar languages (Cerva et al., 2011) or for languages that belong to different language groups (Müller et al., 2016). One of the most extensive works on cross-lingual (and multilingual) speech recognition was done by Tanja Schultz and her colleagues, who built and used the multilingual GlobalPhone speech database for their experiments (Schultz, 2004; Vu et al., 2010; Schlippe et al., 2013). The GlobalPhone speech database (Schultz et al., 2013) covers 20 languages and has in total more than 400 hours of recordings. In the case of the GlobalPhone database, the closest language pair to our case was the German – Czech pair: German and Austrian German are varieties of the same pluricentric language, while Czech belongs to the Slavic language group, which also includes Slovenian. The GlobalPhone baseline cross-lingual German-Czech system achieved 75.2% WER on a test set similar to that used in the GRASS speech database (Vu et al., 2010). Stahlberg and his colleagues (2014) used Slovenian as the target language in a zero-resource experiment on the BMED medical domain corpus, with Croatian, English, and German as the source languages. The Slovenian cross-lingual phone error rate was 56.8% for Croatian and 57.8% for German source acoustic models. Another set of Slovenian cross-lingual experiments was carried out by Diehl and his colleagues (2007), where Spanish-English-German multilingual acoustic models were used as the source. To the best of the authors' knowledge, there has been no previous work on cross-lingual speech recognition for the Austrian German – Slovenian language pair. In the scope of the COST MASPER project (Zgank et al., 2004), experiments were performed on German as the source language and Slovenian as the target language, but only on the limited domain of isolated word recognition. The German – Slovenian cross-lingual speech recognition system (Zgank et al., 2004) achieved 61.68% Word Error Rate (WER) for phone acoustic models, and 74.23% WER for grapheme acoustic models. A phonetically balanced isolated test set with 1,491 words in the vocabulary was used for this evaluation.

1.3 Development of corpora for low-resourced languages

The creation of speech corpora is very time intensive and expensive, especially when creating different annotation layers (orthographic transcription, phonetic transcription, prosodic annotation, etc.). In the last two decades, a large number of studies have focused on the development of automatic transcription tools in order to facilitate the creation of language resources. Tools have been created to annotate speech databases automatically on an orthographic level (e.g., Lamel et al., 2004), to segment speech automatically into its words and phone-segments (e.g., Schiel, 1999; Schuppler, 2011), to create prosodic annotations (for an overview cf. Strömbergsson, 2016) and to identify prosodic phrase boundaries automatically (e.g., Ludusan & Schuppler, 2019). When these resources are being used for speech technology applications, the output of such automatic tools is mostly sufficient.
Inaccuracies are mostly consistent throughout the whole database, and, thus do not decrease the accuracy of statistically built models. There are even reports that models trained on automatically created segmentations yield better results than those trained on manual segmentations, as the latter show high inconsistencies among transcribers, and even in material transcribed by the same person. The discrepancy among transcribers increases with the level of spontaneity of the speech material, from 5.6% in read speech to 21.2% in spontaneous speech (Kipp et al., 1997). When automatically created annotations of speech resources are being used for phonetic studies, a manual correction step frequently is needed. Segment boundaries need to be adjusted and additional annotations of segmental or supra-segmental detail may be needed for the specific study. Also, for the case of smaller phonetic production and corpus-studies, it has been shown that the existence of an automatically created annotation makes the creation of the final phonetic annotation not only faster, but also more consistent among transcribers (e.g., Schuppler, 2011). As such, ASR based tools facilitate the creation of speech resources. For a long period of time, these tools themselves, however, already required the existence of relatively large data inputs from the respective language. Thus, the creation of resources for those languages could be facilitated by tools which already had a relatively large number of resources available, and the resource-gap between high- and low- resourced languages was increasing even more. Not surprisingly, the desire to fill this gap and to develop zero-resource and cross-lingual ASR systems was identified. This paper contributes to this field of research. This paper focuses on the development of a cross-lingual ASR system for the Slovenian and the Austrian German languages. Slovenian is a language spoken by only a small number of people (approx. 2.5 million) and Austrian German is a non- dominant variety of German (Clyne, 1995), spoken by a relatively small number of speakers (8.5 million). Neither can be considered to be under-resourced following the definition given by Krauwer (2003) and Berment (2004). They both have electronic resources available as they are, in principle, present on the web (even though Austrian German does not have its own official writing system, but the variety is reflected by a set of specific lexical items and used in a non-standard orthography, for instance in personal written communication and poetry), and as there is linguistic expertise available. However, compared to other European languages/nations, the Slovenian and Austrian markets do not have high economic relevance and as a result, commercial speech recognition systems are not widely available. Furthermore, the resources available for academic speech research are limited and do not allow for machine learning techniques requiring big data. For instance, MAUS tools for automatic transcription, segmentation and lexicon creation are not available for either

Slovenian or Austrian German (Reichel, 2012; Kisler et al., 2017). The current paper contributes to the development of ASR methods (i.e., automatic speech transcription) for which the currently available amount of resources is sufficient, by employing a cross-lingual approach.
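Since the results quoted in Section 1.2 (and those reported later in the paper) are expressed as word error rate (WER), a minimal, generic R sketch of the metric is given below. It is the standard Levenshtein-based WER, not code taken from the paper.

```r
# WER = (substitutions + deletions + insertions) / number of reference words,
# obtained from a dynamic-programming edit-distance alignment.
wer <- function(reference, hypothesis) {
  ref <- strsplit(reference, "\\s+")[[1]]
  hyp <- strsplit(hypothesis, "\\s+")[[1]]
  d <- matrix(0, nrow = length(ref) + 1, ncol = length(hyp) + 1)
  d[, 1] <- 0:length(ref)
  d[1, ] <- 0:length(hyp)
  for (i in seq_along(ref)) {
    for (j in seq_along(hyp)) {
      cost <- if (ref[i] == hyp[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # deletion
                             d[i + 1, j] + 1,   # insertion
                             d[i, j] + cost)    # substitution or match
    }
  }
  d[length(ref) + 1, length(hyp) + 1] / length(ref)
}

wer("the cat sat on the mat", "the cat sit on mat")  # 2 errors / 6 words = 0.33
```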

2. Materials & Methods 2.1 Slovenian BNSI Broadcast News Database The Slovenian BNSI Broadcast News speech database (Zgank et al., 2005) was used for the Slovenian speech recognition experiments. The speech corpus has 2,160 minutes (36 hours) of read and spontaneous speech from 42 daily TV news shows. The evening and late-night shows were produced in the period between 1998 and 2004 by the Slovenian national TV broadcasting service RTV Slovenija. The training set is comprised of 30 hours of speech. The remaining 3 hours are used for development, and another 3 hours for evaluation. There are 1,565 different speakers present in the speech database, 1,069 of them being male and 477 are female. The gender of the remaining 19 speakers could not be determined. The speech in the BNSI speech database has different acoustic backgrounds in its various components, due to the type of material that was captured. Approximately one half of the speech consists of read and spontaneous speech in a studio environment. The other half has different acoustic backgrounds present, which can extend from normal open places` background noise, over background speech, to background music found in jingles and announcements. These acoustic conditions can degrade speech recognition performance severely, and are given as f-conditions:

- f0: read speech, wide-band, studio environment
- f1: spontaneous speech, wide-band, studio environment
- f2: read/spontaneous speech, narrow-band, no background
- f3: read/spontaneous speech, music in the background
- f4: read/spontaneous speech, other backgrounds
- f5: non-native speech, various conditions
- fx: other conditions

The f0-f4 conditions were used for training the acoustic models, whereas the f5 and fx conditions were omitted due to their acoustic incompatibility with the remainder of the training set. A separate part of the BNSI database is a text corpus containing scripts/scenarios from news show productions. This part comprises approximately 10M words, which were used as a separate set for training a language model in combination with the Slovenian FidaPLUS text corpus, which is based solely on written text and contains approx. 621M words.

2.2 Graz Corpus of Read and Spontaneous Speech (GRASS)
The Austrian German speech recognition experiments are based on the GRASS corpus (Schuppler et al., 2014; Schuppler et al., 2017). In total, it contains 1,900 min of read and conversational speech from 38 speakers (19 female, 19 male), recorded with a close-talking headset and a large-diaphragm microphone in a sound-proof recording studio. The speakers were born in rural or urban areas of eastern Austria and have lived most of their lives in either Vienna or Graz. At the time of the recordings, they were between 23 and 65 years old and were at least high-school graduates. We used only the read speech component for the speech recognition experiments presented here. Each of the 38 speakers read 76 phonetically balanced sentences, which summed up to 2,744 utterances with 19,511 word tokens from 1,660 word types. The orthographic transcriptions were created from the reading material and corrected manually. As for the BNSI corpus, phonetic segmentations were created automatically using a forced alignment approach (Schuppler et al., 2017).

2.3 Phone mapping
The cross-lingual mapping between the Slovenian phone set (40 phones) and the Austrian German phone set (33 phones) was carried out using expert knowledge about the acoustic-phonetic properties of the phones. Both mappings are presented in Appendix 1. This section discusses potential problems with mapping phones between Slovenian and Austrian German.
Regarding consonants, Austrian German and Slovenian differ with respect to how plosives are distinguished. Whereas Slovenian plosives can be classified into voiced and voiceless, Austrian German plosives are classified into aspirated and unaspirated (e.g., Moosmüller & Ringen, 2004). Furthermore, Austrian German does not distinguish voiced from voiceless fricatives (e.g., Klaaß, 2008), while Slovenian does. In the phone mapping, Slovenian voiced and voiceless fricatives can therefore only be mapped together onto voiceless Austrian fricatives, which potentially affects ASR performance. The same holds for voiced vs. voiceless affricates. Additionally, affricates are modeled as a whole in the Slovenian data (e.g., the phone model tS), but as a sequence of a plosive model and a fricative model in the Austrian German data, because there were not enough instances of affricates in the data to train them as single units. Since a sequence of two models has a higher minimum duration than one model, mappings from two phones in the source language to one phone in the target language cause temporal recognition problems and poor recognition performance, especially for speech at a high speaking rate. The issue of mapping two acoustic phone models onto one phone model also occurs for the diphthongs in Austrian German.
Not only are they spoken faster than a sequence of two vowels in Slovenian, but they are also frequently monophthongized. The process of monophthongization has been documented for Vienna (Moosmüller, 1997; Moosmüller, 1998) and reported to be spreading across Austria (Vollmann & Moosmüller, 2001). Monophthongization has also been observed in the read speech of the GRASS corpus, especially among speakers from Upper and Lower Austria, even though they had lived in Graz for most of their lives.

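To make the kind of knowledge-based mapping discussed above more concrete, the sketch below applies a small, illustrative subset of the Slovenian-to-Austrian-German mapping (our own reading of the tables in Appendix 1) to a phone sequence; the helper function and the exact subset chosen are ours, not part of the authors' tooling.

```python
# Illustrative subset of the Slovenian -> Austrian German phone mapping
# (our reading of Appendix 1); one source phone may map to a sequence of
# target models, e.g. the affricate tS, which is modelled as t + S in the
# Austrian German phone set.
SI_TO_AT = {
    "i": ["i:"],       # short vowel mapped to the closest Austrian vowel model
    "z": ["s"],        # voiced fricative folded onto the voiceless model
    "tS": ["t", "S"],  # affricate -> plosive + fricative sequence
    "r=": ["6"],       # syllabic /r/ mapped to the vocalized-/r/ model [ɐ]
}

def map_pronunciation(si_phones):
    """Map a Slovenian phone sequence onto Austrian German phone models."""
    at_phones = []
    for phone in si_phones:
        # Fall back to the identical symbol when no explicit mapping is listed.
        at_phones.extend(SI_TO_AT.get(phone, [phone]))
    return at_phones

print(map_pronunciation(["tS", "a", "z", "i:"]))  # -> ['t', 'S', 'a', 's', 'i:']
```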
Another problematic phone mapping concerns /r/, because it has different segmental and phonotactic characteristics in Slovenian and Austrian German. In the Austrian German data, the acoustic model for /r/ includes realizations as an approximant, trill, and fricative (i.e., [ɹ, r, ɣ]); vocalized realizations of /r/ are modeled in the separate phone model [ɐ]. Given that Slovenian allows the rhotic consonant /r/ to be syllabic and even to bear stress, we expect that the simple mapping from Slovenian to Austrian German might lead to a lower performance of the ASR system.

2.4 Automatic speech recognition set-up
The experimental set-up for the proposed cross-lingual speech recognition comprised two parallel systems, each representing a complete monolingual speech recognizer for one source language. The Kaldi toolkit (Povey et al., 2011) with different state-of-the-art methods was used to develop the automatic speech recognition systems. Specifications were not compatible between languages; thus an individual data preparation step was needed for each speech database. Initially, we prepared the transcriptions, the lexicon with the respective phone set, and the speaker list. The second part of the data preparation was the feature extraction procedure. The audio signal was converted into feature vectors using Mel Frequency Cepstral Coefficients (MFCC) and their first and second derivatives; the final feature vector had 39 values. Increased robustness of the feature extraction was achieved with cepstral mean normalization, a step that was necessary because different speech databases were merged.
The GMM-HMM acoustic models with a 3-state left-to-right topology were trained first, using a flat-start approach. The Austrian German set had 33 phones and the Slovenian set 40 phones. The first aim was to produce transcription alignments at the phone level, which were needed for the deep neural network development; this was carried out by training monophone and triphone models. These GMM-HMM acoustic models were later also included in the evaluation for comparison purposes. Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transformation were applied to further improve the robustness of the speech recognizer.
The second part was dedicated to developing the DNN-HMM acoustic models, which build on the earlier experiment. Here, p-norm based feedforward deep neural networks were applied with a frame-level training procedure. Two hidden layers were included in the architecture, and 30 training epochs were applied. The monolingual speech recognition systems were evaluated with the corresponding test set. Finally, the cross-lingual automatic speech recognition was performed on the target languages. The initial goal was to use only a word-loop language model and perform single-word recognition, but the cross-lingual evaluation showed that the word error rate for such a scheme was above 85%. A reasonable cause could be the limited amount of training data. To overcome this, n-gram language models were used to produce the final models for cross-lingual speech recognition.
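As a concrete illustration of the 39-dimensional feature layout described above (13 MFCCs plus first and second derivatives, followed by cepstral mean normalization), here is a minimal sketch using librosa; it is not the Kaldi front-end actually used in the experiments, and the sampling rate and window/hop settings are assumptions.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """13 MFCCs + deltas + delta-deltas with utterance-level cepstral mean normalization."""
    y, sr = librosa.load(wav_path, sr=16000)       # assumed sampling rate, not from the paper
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
    d1 = librosa.feature.delta(mfcc)               # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)      # second derivatives
    feats = np.vstack([mfcc, d1, d2])              # shape: (39, n_frames)
    feats -= feats.mean(axis=1, keepdims=True)     # cepstral mean normalization
    return feats.T                                 # (n_frames, 39)
```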
3 Results

3.1 Monolingual speech recognition
The first part of the evaluation focused on Austrian German and Slovenian monolingual speech recognition. These results were used as reference values for the cross-lingual speech recognition. The word error rate (WER) was used as the evaluation metric, and both the HMM and the DNN acoustic models were tested for all setups. Although the HMM acoustic models represent an older architecture, we nevertheless included them in the evaluation for comparison with other systems. The Austrian German and Slovenian monolingual results are given in Table 1.

Table 1: Austrian German and Slovenian monolingual speech recognition results

Acoustic models   Language model   Austrian WER (%)   Slovenian WER (%)
HMM               3-gram           10.62              34.03
DNN               3-gram           11.25              31.39

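The WER values in Tables 1 and 2 follow the usual definition (substitutions plus deletions plus insertions, divided by the number of reference words); below is a minimal sketch of that computation via dynamic-programming alignment. It is the standard textbook formulation, not the exact Kaldi scoring script used in the experiments.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference), in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(round(word_error_rate("danes je lep dan", "danes je en dan"), 2))  # 25.0
```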
The Austrian German monolingual speech recognition system produced good results, with a WER around 11%. The WER was 10.62% with the combination of HMM acoustic models and the 3-gram language model, which was also the best overall result in our experiments. The change to DNN acoustic models degraded the recognition accuracy slightly, and the WER increased to 11.25%. A possible cause for this degradation is the limited amount of speech material used for training the DNN models. The Slovenian monolingual evaluation produced results with a WER above 30%. The Slovenian HMM acoustic models achieved 34.03% WER. The DNN acoustic models, using the same data and setup, improved the result to 31.39% WER, even though only 30 hours of speech material were used for training. The higher Slovenian monolingual WER in comparison with the Austrian German monolingual WER results from the complex acoustic background present in the BNSI speech database, which reduces the accuracy. An additional reason for the lower Slovenian monolingual performance is that Slovenian is a highly inflected language, which influences speech recognition accuracy (Rotovnik et al., 2007; Donaj & Kacic, 2017).

3.2 Cross-lingual speech recognition
The second part of the evaluation was devoted to cross-lingual speech recognition. Each of the languages was used once as the target language for speech recognition with acoustic models from the other language. The cross-lingual speech recognition results in Table 2 are given as WER.

Table 2: Cross-lingual speech recognition results for Austrian German and Slovenian as the target language

Source acoustic models   Language model   Austrian WER (%)   Slovenian WER (%)
HMM                      3-gram           92.22              90.27
DNN                      3-gram           59.31              72.97

The Slovenian HMM acoustic models produced 92.22% WER on the Austrian German evaluation set. The Slovenian DNN acoustic models improved the WER to 59.13%, which was the best cross-lingual speech recognition result in our experiments. The Austrian German monolingual DNN reference system achieved 11.25% WER, so the difference between the monolingual reference and the cross-lingual system was 47.88%. This large accuracy gap indicates the divergence between Austrian German and Slovenian from the acoustic-phonetic perspective. Another important outcome is that HMM acoustic models have insufficient modeling power for cross-lingual speech recognition in such complex test scenarios.
The basic Slovenian cross-lingual configuration (Table 2) provided results similar to those observed for Austrian German as the target language. The 3-gram language model in combination with HMM source acoustic models produced a 90.27% WER. The best result was achieved with the combination of DNN source acoustic models and the 3-gram language model, where the evaluation ended with 72.97% WER; the Slovenian cross-lingual speech recognition accuracy thus improved by 25.18%. The Slovenian monolingual reference system had 31.39% WER with DNN acoustic models and 3-grams. The difference between the monolingual and the best cross-lingual Slovenian system was 41.58%, which is somewhat smaller than in the case of Austrian German. A possible reason is that the Austrian German monolingual system generally achieved better results than the Slovenian system, with high accuracy for this type of speech recognition.
The evaluation results can also be compared with the MASPER cross-lingual results (Zgank et al., 2004), where German source acoustic models were used to recognize Slovenian target speech. The WER of 61.68% with the MASPER system outperformed the cross-lingual WER of 72.97% in the last part of this experiment. A possible cause of this performance gap is that the BNSI evaluation set uses continuous speech, while the SpeechDat(II) test set used in MASPER is based on isolated speech.
Overall, the recognition performance for Austrian German as a source language is better (10.62% WER) than for Slovenian as a source language (34.03% WER). We interpret this result as meaning that high recording quality and a clear speaking style are beneficial for the creation of models, even though the target speech database was not recorded with a comparably noiseless recording setup and does not contain speech in an equally clearly produced speaking style. This is in line with what Schuppler (2017) has shown for American English monolingual classification experiments with different speaking styles for the same language. By comparing experiments based on TIMIT (read speech, studio quality) and Switchboard (spontaneous speech, over the telephone), where each served as source and target in turn, she found that, overall, classification performance degrades with lower recording quality and increased spontaneity. Interestingly, models trained on TIMIT outperformed models trained on Switchboard, both on the matched condition (TIMIT) and on the mismatched condition (Switchboard).

3.3 Phone level confusions
Table 3 shows the phone-level confusions that occurred in the best performing system (DNN based) for the two different experimental settings.
The term deletion means that the ASR did not annotate a phone where there was one in the reference transcription, insertion means that the ASR annotated a phone where there was none, and substitution means that the ASR annotated a phone with a label different from the reference transcription. The main difference between the two settings (Slovenian as source and Austrian German as target language vs. Austrian German as source and Slovenian as target language) is that with Slovenian as the target, the number of deletions was extremely high. The phone with the highest deletion rate was [n] (total number: 8,405), followed by [ɛ] (8,000), [o] (7,659), [i] (6,907), [ɑ] (6,558), [t] (6,225) and /r/ (6,051). Overall, vowels showed the highest deletion rates and fricatives the lowest. The high number of deletions may result from the different temporal characteristics of the two corpora (Adda-Decker & Snoeren, 2011). Slovenian itself is at most marginally faster (e.g., Tivadar (2017) reports speaking rates of 4.5–6.5 syllables per second for Slovenian, compared to approx. 6 syllables per second reported by Bosker & Reinisch (2017) for German). The databases, however, also differ in speaking style (i.e., read speech for Austrian German vs. broadcast news for Slovenian). Given previous findings, broadcast news produced by trained TV speakers can be expected to show a higher speaking rate, and thus shorter phone durations, than sentences read by untrained speakers. We therefore interpreted the high number of deletions as a result of the speaking style differences between the two corpora.
Given the phonological differences between the two languages outlined in Section 2.3, we carried out a further analysis of the substitutions, deletions and insertions of /r/. With Austrian German as the source and Slovenian as the target, only 62 out of 6,254 /r/s in the reference transcription were classified correctly (1.0% correctly classified). The largest portion of /r/s (6,051) were deleted, i.e., not detected by the ASR system; 92 /r/s were labeled as vowels, 35 as nasals, 9 as fricatives, and 5 as other consonants (with only one occurrence per confusion), and there was one /r/ insertion. We interpreted the high deletion rates as a result of the more spontaneous speaking style in the Slovenian than in the Austrian German test data. With Slovenian as the source and Austrian German as the target language, the picture of /r/ errors was

different. Out of a total of 349 /r/s, 52 were labeled correctly (14.9% correctly classified), only 163 were deleted, 9 were inserted, 32 were labeled as vowels, 27 as fricatives, 35 as plosives, 20 as liquids, 14 as approximants and 7 as nasals. With Austrian German as the source, the target-language /r/s were most often confused with vowels and nasals and hardly ever misidentified as other consonants; with Slovenian as the source, however, /r/s were identified as vowels about as often as they were identified as fricatives, plosives and other consonants. This ambiguity in the /r/ confusions indicates that /r/s in read Austrian German are vocalized in positions where the pronunciation dictionary does not account for this process. /r/ has previously been reported to be deleted, accompanied by a prolongation of the preceding vowel (Jackschina et al., 2014), or vocalized in 49.1% of the tokens in the read speech component of GRASS (Schuppler et al., 2014). Given that the /r/ model trained on GRASS was trained on a substantial portion of vocalized tokens, it is not surprising that it is not well prepared to recognize the Slovenian /r/ and that /r/ is confused more or less equally with phones of every manner of articulation. One possible way to deal with the broad variation covered by the phone model /r/ in future experiments would be to improve the segmentation and labeling before training the phone models, by providing a lexicon with pronunciation variants in addition to the canonical forms.

Table 3: Phone level confusions for Austrian German and Slovenian as target language

Type of confusion   Austrian German   Slovenian
Correct             2,919             2,949
Deletions           3,668             114,132
Insertions          249               356
Substitutions       4,485             5,257
Total               11,321            122,694
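Counts like those in Table 3 can be tallied from an alignment between the reference and the ASR phone sequences. The sketch below assumes such an alignment is already available as (reference, hypothesis) pairs, with None marking a missing phone on either side; this is a simplification of the actual scoring pipeline, and the example pairs are invented.

```python
from collections import Counter

def confusion_counts(aligned_pairs):
    """Tally correct, deleted, inserted and substituted phones from aligned (ref, hyp) pairs."""
    totals = Counter()
    per_phone_deletions = Counter()
    for ref, hyp in aligned_pairs:
        if ref is not None and hyp is not None:
            totals["correct" if ref == hyp else "substitutions"] += 1
        elif hyp is None:               # reference phone with no ASR output
            totals["deletions"] += 1
            per_phone_deletions[ref] += 1
        else:                           # ASR output with no reference phone
            totals["insertions"] += 1
    return totals, per_phone_deletions

pairs = [("n", "n"), ("E", None), ("r", "a"), (None, "t"), ("o", None)]
totals, deleted = confusion_counts(pairs)
# totals: correct 1, substitutions 1, deletions 2, insertions 1; deleted phones: E, o
print(totals, deleted)
```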

4 Conclusion
This paper presented the first results for a cross-lingual ASR system with Slovenian and Austrian German as source and target languages. Neither of the two languages is highly resourced. We used resources from both languages that differ in speaking style and recording conditions: the Slovenian speech material consists of broadcast material, and the Austrian German database consists of speech read in studio recordings. In line with our expectations regarding the differences in speaking style and acoustic background, the monolingual ASR system for Austrian German performed better (10.62% WER) than the monolingual ASR system for Slovenian (34.03% WER). The performance of the cross-lingual ASR system was comparably low for both target languages when recognized with HMMs. The cross-lingual results with DNNs were significantly better, although the improvement was greater for Austrian German as the target language (59.31% WER with Austrian German vs. 72.97% WER with Slovenian as the target language). Our results showed clearly that the quantity of available speech material was sufficient for training the DNNs, which outperformed the HMMs. It can be concluded that DNNs handle a cross-lingual mismatch in speaking style and acoustic background better than HMMs. In future work, we plan to evaluate a joint acoustic model on the two target languages, as we expect such an approach to level out the effects of the bias in acoustic quality to some extent. Overall, our experiments showed that cross-lingual ASR systems make it possible to build an automatic transcription system in a resource-efficient way by combining expert linguistic knowledge with suitable modelling techniques.

Acknowledgements The work by Barbara Schuppler was supported by the Elise Richter grant (V638 N33) from the Austrian Science Fund. Andrej Zgank’s research work was partially funded by the Slovenian Research Agency (Research Core Funding No. P2-0069). We would like to thank Lucija Krušić for her support with Slovenian phonetics and phonology and the anonymous reviewer of an earlier version of this paper for his/her constructive criticism and the suggestions on how to improve our analysis.

References
Adda-Decker, M. & Snoeren, N.D. 2011. Quantifying temporal speech reduction in French using forced speech alignment. J. Phonetics, 39, 261-270.
Berment, V. 2004. Méthodes pour informatiser des langues et des groupes de langues peu dotées. Unpublished PhD Thesis. Grenoble I: J. Fourier University.
Besacier, L., Barnard, E., Karpov, A., & Schultz, T. 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85-100.
Bosker, H. R. & Reinisch, E. 2017. Foreign languages sound fast: Evidence from implicit rate normalization. Frontiers in Psychology, 8, 1063. https://doi.org/10.3389/fpsyg.2017.01063
Cerva, P., Nouza, J., & Silovsky, J. 2011. Study on cross-lingual adaptation of a Czech LVCSR system towards Slovak. In: Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (eds.): Analysis of verbal and nonverbal communication and enactment. The processing issues. Berlin, Heidelberg: Springer. 81-87.
Clyne, M. 1995. The German language in a changing Europe. Cambridge: Cambridge University Press.
Diehl, F., Moreno, A., & Monte, E. 2007. Cross-lingual acoustic model development for automatic speech recognition. In: 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU). 425-430.
Donaj, G. & Kacic, Z. 2017. Context-dependent factored language models. J. Audio Speech Music Processing, 6. https://doi.org/10.1186/s13636-017-0104-6
Huang, J. T., Li, J., Yu, D., Deng, L., & Gong, Y. 2013. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 7304-7308.
Jackschina, A., Schuppler, B., & Muhr, R. 2014. Where /aR/ the /R/s in Standard Austrian German? In: Proceedings of Interspeech. 1698-1702.
Kipp, A., Wesenick, M., & Schiel, F. 1997. Pronunciation modeling applied to automatic segmentation of spontaneous speech. In: Proceedings of Eurospeech 1997. 1023-1026.
Kisler, T., Reichel, U., & Schiel, F. 2017. Multilingual processing of speech via web services. Computer, Speech & Language, 45, 326-347.
Klaaß, D. 2008. Untersuchungen zu ausgewählten Aspekten des Konsonantismus bei österreichischen Nachrichtensprechern. Duisburger Papers on Research in Language and Culture, 74, 7-277.
Krauwer, S. 2003. The basic language resource kit (BLARK) as the first milestone for the language resources roadmap. In: Proceedings of the 2003 International Workshop Speech and Computer SPECOM-2003, Moscow, Russia. 8-15.
Lamel, L., Gauvain, J. L., Adda, G., Adda-Decker, M., Canseco, L., Chen, L., Galibert, O., Messaoudi, A., & Schwenk, H. 2004. Speech transcription in multiple languages. In: Proceedings of the ICASSP 2004. https://doi.org/10.1109/ICASSP.2004.1326655
Ludusan, B. & Schuppler, B. 2019. Automatic detection of prosodic boundaries in two varieties of German. Conference presentation. Pluricentric Languages in Speech Technology, Satellite Workshop of Interspeech. September 14, 2019, Graz, Austria.
Moosmüller, S. 1997. Diphthongs and the process of monophthongization in Austrian German: A first approach. In: Proceedings of Eurospeech. 787-790.
Moosmüller, S. 1998. The process of monophthongization in Austria (reading material and spontaneous speech). In: Papers and Studies in Contrastive Linguistics. 9-25.
Moosmüller, S. & Ringen, C. 2004. Voice and aspiration in Austrian German plosives. Folia Linguistica, 38, 43-62.
Müller, M., Stüker, S., & Waibel, A. 2016. Language adaptive DNNs for improved low resource speech recognition. In: Proceedings of Interspeech. 3878-3882.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., & Silovsky, J. 2011. The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. 1-4.
Reichel, U. D. 2012. PermA and Balloon: Tools for string alignment and text processing. In: Proceedings of Interspeech 2012. Paper no. 346.
Rotovnik, T., Maucec, M. S., & Kacic, Z. 2007. Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Communication, 49(6), 437-452.
Schiel, F. 1999. Automatic phonetic transcription of non-prompted speech. In: Proceedings of the ICPhS 1999. 607-610.
Schlippe, T., Volovyk, M., Yurchenko, K., & Schultz, T. 2013. Rapid bootstrapping of a Ukrainian large vocabulary continuous speech recognition system. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 7329-7333.
Schultz, T. 2004. Towards rapid language portability of speech processing systems. In: Conference on Speech and Language Systems for Human Communication. Delhi, India. 1-4.
Schultz, T., Vu, N. T., & Schlippe, T. 2013. Globalphone: A multilingual text & speech database in 20 languages. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 8126-8130.
Schuppler, B. 2011. Automatic analysis of acoustic reduction in spontaneous speech. Nijmegen: Radboud Universiteit Nijmegen. ISBN: 978-90-9025869-0.
Schuppler, B. 2017. Rethinking classification results based on read speech, or: why improvements do not always transfer to other speaking styles. International Journal of Speech Technology, 20(3), 699-713. https://doi.org/10.1007/s10772-017-9436-y
Schuppler, B., Hagmüller, M., Morales-Cordovilla, J. A., & Pessentheiner, H. 2014. GRASS: the Graz corpus of read and spontaneous speech. In: Proceedings of LREC. 1465-1470.
Schuppler, B., Hagmüller, M., & Zahrer, A. 2017. A corpus of read and conversational Austrian German. Speech Communication, 94, 62-74.
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. 2014. Towards automatic speech recognition without pronunciation dictionary, transcribed speech and text resources in the target language using cross-lingual word-to-phoneme alignment. In: Spoken Language Technologies for Under-Resourced Languages. 73-80.
Strömbergsson, S. 2016. Today’s most frequently used F0 estimation methods, and their accuracy in estimating male and female pitch in clean speech. In: Proceedings of Interspeech 2016. 525-529.
Tivadar, H. 2017. Speech rate in phonetic-phonological analysis of public speech (using the example of political and media speech). Jazykovedny časopis, 68, 37-56.
Vollmann, R. & Moosmüller, S. 2001. ’Natürliches Driften’ im Lautwandel: die Monophthongierung im österreichischen Deutsch. Zeitschrift für Sprachwissenschaft, 20, 42-65.
Vu, N.T., Schlippe, T., Kraus, F., & Schultz, T. 2010. Rapid bootstrapping of five Eastern European languages using the rapid language adaptation toolkit. In: 11th Annual Conference of the International Speech Communication Association. 865-868.
Zgank, A., Kacic, Z., Vicsi, K., Szaszak, G., Diehl, F., Juhar, J., & Lihan, S. 2004. Crosslingual transfer of source acoustic models to two different target languages. In: COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction. 1-4.
Zgank, A., Verdonik, D., Markus, A. Z., & Kacic, Z. 2005. BNSI Slovenian broadcast news database - speech and text corpus. In: Proceedings of Interspeech 2005, 9th European Conference on Speech Communication and Technology. 1537-1540.

Appendix

The Slovenian phone set and the cross-lingual mapping to the Austrian German phone set

SI i m ɛ n iː k p r ɔ g aː s x a j l ts d uː oː

AT iː m ɛ n iː k p r ɔ g a s x a j l t s d ʊ ɔ

SI ɛː ʃ eː t tʃ v ə z ʍ o ʒ u w f b tⁿ dⁿ r̩ ɔː dz ɱ

AT eː ʃ eː t t ʃ v ə s ʊ ɔ s ʊ ʊ f b t d ɐ ɔ d s n

The Austrian German phone set and the cross-lingual mapping to the Slovenian phone set.

AT ə a aɪ aʊ b ç d ɛ eː f g h ɪ iː j k l

SI ə a a j a u b x d ɛ eː f g x i iː j k l

AT m n ŋ øː ɐ ɔ ɔʏ p r s ʃ t ʊ v x y

SI m n n ɔ ɛ r̩ ɔ ɔ j p r s ʃ t u v x uː

REVISITING NONSTANDARD VARIETY TTS AND ITS EVALUATION IN AUSTRIA

Carina Lozo and Michael Pucher Acoustics Research Institute, Austrian Academy of Sciences {carina.lozo, michael.pucher}@oeaw.ac.at

Abstract

Speech technologies for nonstandard language varieties are turning from a technical possibility into a real demand. Good performance of human language technologies (HLT) for nonstandard language varieties, such as automatic speech recognition (ASR) or text-to-speech synthesis (TTS), is not only of interest to the research community, but will soon be expected by the users of the respective systems and applications. With a growing scope of applications to be offered by various kinds of robot agents in the future, the interest of the commercial and public sectors in nonstandard language variety synthesis is also increasing. As Clyne already noted in 1991, progress in speech technologies for local language varieties, such as dialects, is highly anticipated, especially in countries with a pluricentric language such as Austria. Here, people exhibit competence both in the standard and in the dialect varieties (Moosmüller, 1995), and their speech technologies should reflect this. This paper presents a meta-analysis and discussion of previous work, focusing on the prospects of speech synthesis for language varieties, such as dialects, on their evaluation, and on the challenges that emerge with it.

Keywords: Text-to-Speech synthesis, TTS evaluation, Austrian German, nonstandard language varieties

1 Introduction
Digital assistants, such as Apple’s Siri or Samsung’s Bixby, arrived in the pockets of the general public with each smartphone long ago. Their voices are pleasant, formal, intelligible – and exclusively employ a standard pronunciation of the user’s native language. But since there are context-dependent language registers in human-human dialogue, it is only intuitive to assume that there should also be different registers available for synthetic voices in spoken human-computer dialogue. This contribution argues that a high-quality text-to-speech (TTS) system is characterized not only by intelligibility or naturalness, but also by appropriateness with regard to its evaluation and application, making it a multifaceted problem.
Ammon (2015) states that a language is only a set of varieties. Following this, German represents an independent language; its dialects and sociolects (e.g. Styrian

or Viennese) form only varieties within the German language. A further distinction to make at this point is between the concepts of standard variety and nonstandard variety. Standard varieties are codified, official languages for which dictionaries and reference works are available and used for correction purposes in schools or other educational contexts, like Standard Austrian German (SAG). This does not apply to nonstandard varieties like dialects or sociolects: while there are dialects of German that have been described scientifically in detail, there is neither a standardized orthography nor are there educational books that propagate these dialects. Moreover, standard varieties are more prestigious than their nonstandard counterparts. German is considered a pluricentric language, meaning that there is more than just one dominant standard variety. Standard German German (SGG), Swiss Standard German (SSG) and Standard Austrian German (SAG) are independent varieties in their own right, with nonstandard varieties existing in parallel. As for SGG and SAG, these standard varieties each roof their own nonstandard varieties. Roofing is an asymmetric relation between two varieties: the standard variety plays the dominant role in the dialect-standard continuum and roofs the nonstandard varieties, but not vice versa (Ammon, 2015). Hereafter, Austrian dialects, as well as sociolects, will be referred to as nonstandard varieties of German in Austria.
With a roofing standard variety like SAG, it may seem hard to argue for nonstandard variety TTS in Austria. Dedicating additional resources to TTS or human language technologies (HLT) for nonstandard varieties appears to be a redundant, niche task, given that the available TTS systems for SAG work acceptably. The current announcements used by the Austrian Federal Railways or the screen reader of Vienna’s governmental website are high-quality synthetic voices with high intelligibility. With prominent Austrian voices from well-known TV and radio presenters, they also create cultural identification. However, studies by Tamagawa et al. (2011) and Navas et al. (2014) show how beneficial TTS, or HLT in general, for local and nonstandard varieties can be to users. Tamagawa et al. (2011) found that local varieties of a pluricentric language are preferred for synthetic voices in robots. They investigated the preference for different robot voices, positive and negative emotions towards robots, and identification of the nationality of the robot’s voice for three varieties of English (US English, British English and New Zealand English). The study showed that participants reported more positive emotions towards the robot employing their own variety (NZ) compared to the others (US, UK), while there was no significant difference between the voices with respect to negative feelings. Navas et al. (2014) created a TTS system for the Navarro-Lapurdian dialect by extending an already existing system for Standard Basque. In the evaluation, the listeners were speakers of the respective variety and showed a clear preference for the dialectal voice.
With the advent of artificial intelligence, the need for speech technologies for language varieties becomes apparent. A synthetic voice will influence the user’s perception and assessment of the system. Hence, further development of social service

robots needs to take a more user-centered approach than the common digital assistants currently offer. Consequently, social robotics needs a clear embodiment of the social agent, presented through the system’s voice. Krenn et al. (2012; 2014) showed that users largely evaluate the voice when making an assessment of a social agent, which makes neutral and featureless voices in interactive situations like medical examinations less appropriate and even counterproductive for the well-being of the user. As Pucher et al. (2009) have shown, possible applications for standard and nonstandard variety voices in Austria almost seem to be diametrically opposed: whereas the nonstandard voices are appropriate for “fun” applications, such as games, the standard voices are more appropriate for formal scenarios, like online banking. Appropriateness is therefore an indicator of the suitability of a synthetic voice.
What makes a synthetic voice appropriate in the first place? Appropriateness is a subjective concept; it describes something as suitable for a particular use or person. What is appropriate for some people in a particular context may thus be inappropriate for others. Considering, for instance, age differences and how young people may be more used to synthetic voices than the elderly, appropriateness becomes a legitimate concern for building voices in the future. Moore (2017) states that a synthetic voice is appropriate when it meets the user’s expectations and that a human-like voice may lead to an overestimation of the system’s capabilities, thereby fostering frustration with the system. Hence, and this may be surprising, the appropriateness of a synthetic voice is not necessarily associated with a human-like voice. Further, it seems to be problematic for users to engage successfully with speech-enabled devices that deploy human-like voices when they are clearly not human. Thus, an appropriate voice also needs to be consistent with its visual appearance (Moore, 2017).
The building of synthetic voices is affected by the notion of appropriateness throughout its development phases. As mentioned before, the field of application and the inherent qualities of the entity using the system, as well as the possible application scenarios, have to be taken into account when evaluating the quality of a particular voice. This leads to the conclusion that TTS evaluation must also meet criteria of appropriateness, which is not just a binary issue and should be treated as a multidimensional problem. The traditional approach to assessing the quality of TTS involves a listening test, in which participants are exposed to isolated sentences generated by a synthetic voice and are asked to rate those sentences with respect to the quality of the signal. Metrics like Mean Opinion Scores (MOS) for subjective impressions of synthetic voices and the Word Error Rate (WER) to objectively measure intelligibility are widely used in evaluations. However, as Wagner and Betz (2017) pointed out, TTS evaluation still relies on decontextualized listening tests, and traditional approaches lack embedded, meaningful human-computer communication. The complex problems emerging with TTS evaluation can be summed up as follows: the apparently “ideal” environment for evaluating TTS does not correspond to real-life scenarios.
When listeners are presented with isolated synthetic speech (in the form of context-free sentences) in a low-noise environment, this is not equivalent to how TTS is applied “in the wild”. In agreement with King (2014), a “de-idealised” evaluation environment (e.g. adding noise) is essential. Alternative approaches like these and new metrics must find their way into the realm of TTS evaluation, reflecting the need and the awareness of the community to rethink its standards.
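As a concrete illustration of such a “de-idealised” listening condition, the sketch below mixes background noise into a synthesized utterance at a chosen signal-to-noise ratio; the SNR value and the numpy/soundfile-based implementation are our own choices, not a procedure prescribed by the cited studies.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(speech_path, noise_path, out_path, snr_db=10.0):
    """Add background noise to a (synthetic) utterance at a target SNR in dB (mono files assumed)."""
    speech, sr = sf.read(speech_path)
    noise, sr_n = sf.read(noise_path)
    assert sr == sr_n, "expecting matching sampling rates"
    noise = np.resize(noise, speech.shape)            # loop/trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_speech / p_noise_scaled) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    mixture = speech + scale * noise
    mixture /= max(1.0, np.max(np.abs(mixture)))      # avoid clipping
    sf.write(out_path, mixture, sr)
```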

2 Nonstandard variety TTS
In the past decade, there have been several efforts to create nonstandard variety voices for TTS. The following dialects have received consideration: Swedish dialects (Beskow & Gustafson, 2009), the Tianjin dialect (Hu et al., 2011), the Navarro-Lapurdian dialect (Navas et al., 2014), the Mymensinghiya dialect of the Bangla language (Begum et al., 2019) and Austrian dialects (Pucher et al., 2010; Toman et al., 2015). The respective systems used a statistical parametric approach with hidden Markov models (HMM) to create synthetic voices. On the one hand, these systems provide high stability; on the other, HMM-based TTS also offers the flexibility to create new voices through adaptation or interpolation (Zen et al., 2009). For instance, Toman et al. (2015) showed that intermediate varieties between standard and nonstandard voices could be created. Hence, there is great interest in improving the existing systems and extending the scope of synthetic voices beyond standard varieties. For German dialects, simply retraining existing acoustic models based on a standard variety is not sufficient. Nonstandard variety synthesis has proven to be a challenging area of research: complicated by the lack of a standardized orthography, there are also difficulties in collecting proper corpus data. Improvements are needed, as the current state of statistical language modelling requires large amounts of training data to ensure high-quality performance of the system. Nonstandard varieties need further research. Obtaining proper data, however, is difficult: speakers need to meet specific criteria, like age, gender or education, for the creation of the right persona, and they also need to satisfy established criteria for being a speaker of the nonstandard variety. Speakers of nonstandard varieties often have a specific social background, as is the case for sociolects like the Viennese dialect (VD) (Moosmüller, 1987), or a specific geographic background.

2.1 TTS voices for Austrian varieties
Four different voices were created by Pucher et al. (2017, 2018): Standard Austrian German (SAG), Viennese dialect (VD), Innervillgraten dialect (IVG) and Bad Goisern dialect (GOI). These voices were developed as speaker-dependent voices using the HSMM-based speech synthesis system published by the EMIME project (Yamagishi & Watts, 2010). For the corpus, one autochthonous male and one female speaker from each of the respective regions were recorded at 44,100 Hz, 16 bits/sample. The synthesized samples were volume-normalized. A 5 ms frame shift was used for the

extraction of 40-dimensional mel-cepstral features, fundamental frequency and 25-dimensional band-limited aperiodicity measures. More details on the voice training can be found in Toman et al. (2015), and details regarding the data used for training in Pucher et al. (2016). The data for the VD and SAG voices was collected within the research project “Viennese sociolect and dialect synthesis” (VSDS). The other two voices were collected as part of the “Goisern and Innervillgraten Dialect Speech Corpus” (GIDS), which contains audio-visual speech recordings from the respective regions. These eight speakers (four males, four females) recorded a total of 7,068 sentences, of which two thirds are in the speaker’s respective dialect and one third in a regional variation of Standard Austrian German (RSAG).
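For readers unfamiliar with this parameterization, the sketch below extracts comparable acoustic features (F0, mel-cepstrum and coded aperiodicity at a 5 ms frame shift) with the pyworld and pysptk packages. This is not the toolchain used for the voices described above; the all-pass constant is a conventional value assumed for 44.1 kHz audio, and pyworld's coded aperiodicity has fewer bands than the 25 reported in the paper.

```python
import numpy as np
import soundfile as sf
import pyworld as pw
import pysptk

def world_features(wav_path, frame_period_ms=5.0, mcep_order=39, alpha=0.55):
    """F0, 40-dim mel-cepstrum and coded band aperiodicity, WORLD-style (mono input assumed)."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)                       # pyworld expects float64 audio
    f0, t = pw.dio(x, fs, frame_period=frame_period_ms)
    f0 = pw.stonemask(x, f0, t, fs)                # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)               # spectral envelope
    ap = pw.d4c(x, f0, t, fs)                      # aperiodicity
    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)  # 40 mel-cepstral coefficients
    bap = pw.code_aperiodicity(ap, fs)             # band aperiodicity (band count depends on fs)
    return f0, mcep, bap
```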

3 Previous work on nonstandard TTS
Previous work (Pucher et al., 2017; study one) led to the questioning of common practices regarding the evaluation of nonstandard synthesis. That study involved a method for synthesizing new dialects with existing dialect models of a similar dialect; both dialects belonged to the South Bavarian dialect group. A small amount of training data was used to transfer original prosodic features (i.e., duration and fundamental frequency (F0)) of a speaker from southwestern Styria (STY) to a synthetic voice (Innervillgraten dialect, IVG), in order to evaluate how a basic phone mapping model can be improved to reflect dialect authenticity. Subsequent work investigated evaluation methods for synthesized nonstandard varieties of Austrian German (Pucher et al., 2018; study two). While both studies deal with synthetic nonstandard variety voices, they differ with regard to the evaluation setting. Study one had listeners evaluate the synthetic output in comparison to an original speaker under laboratory conditions, whereas study two took context into account and presented the synthesized material in a possible real-life scenario for the listener to rate. Sections 3.1 and 3.2 deal with the respective studies in more detail.

3.1 Study one
For voice evaluation in the first study, we had 10 listeners from different regions of Austria. Six female and four male listeners, aged 22 to 66 years, volunteered. We chose the participants according to their dialect familiarity. Seven of the listeners were considered dialect speakers since they were born and raised in rural regions of Austria. Three of them were raised in the particular region of southwestern Styria where our STY speaker came from, and three other listeners were born and raised in Upper Styria. The remaining three dialect speakers had their main place of residence in Carinthia and Tyrol until adulthood. One listener can be considered a speaker of the standard variety of Austrian German.
Based on prosody transfer from the original speaker’s voice (orig), three different outputs were generated: a basic synthetic voice (syn), a synthetic voice with the segment durations of the original speaker (syn_dur), and a synthetic voice with the segment durations and the F0 of the original speaker (syn_dur_f0). The evaluation consisted of an intelligibility test (with WER), a Mean Opinion Score (MOS) test on dialect authenticity, and a pairwise comparison of the voices. For all tests, we also included the original samples, such that we had 14 sentences * 4 conditions = 56 samples in total. For the first part, the listeners had to write the perceived content of the audio samples into a text field. In the second part, each listener had to rate all audio samples, including the synthesized and the original dialect samples. For the evaluation, the listeners rated each sample on an ordinal scale (1 – “schlecht (bad)”, 2 – “dürftig (poor)”, 3 – “mittelmäßig (average)”, 4 – “gut (good)”, 5 – “hervorragend (great)”). Figure 1 shows the average scores for each sample type. As expected, the orig audio samples were considered the best in terms of authenticity. We could also see that the synthesized samples that used prosodic transfer (syn_dur, syn_dur_f0) are in general slightly better than the basic synthesized samples (syn).

Figure 1. Mean Opinion Scores for the different synthesis methods (orig = original speaker’s voice, syn = basic synthetic voice, syn_dur = synthetic voice with segment durations of the original speaker, syn_dur_f0 = synthetic voice with segment durations and F0 of the original speaker)

3.2 Study two
In order to gain more insight into how sensitive listeners can be to synthesized dialect, we proposed an adapted evaluation method based on the data from Pucher et al. (2017). The evaluation for this study was conducted with an online survey tool (Leiner, 2018), where the listener was presented with one audio sample at a time and provided a rating before the next sample was presented. A total of 26 listeners, three Germans and 23 Austrians aged 17 to 67 years, comprising 15 female and 10 male listeners (one did not state their gender), completed the task. Unlike in study one, the listeners were not chosen by the authors: a link to the survey was distributed online, so the distribution of listeners regarding age or gender could not be controlled. The Austrian listeners were distributed across the federal states of

Austria: Lower Austria, Upper Austria, Styria and Vienna. Since we had a particular interest in the relation between the dialect background of a listener and their evaluation of the adequacy of a synthesized dialect, we asked the listeners to rate themselves as dialect or non-dialect speakers. Overall, 78% of the Austrian listeners rated themselves as dialect speakers. In the evaluation, we presented 13 synthesized sentences to the listener. We used four different synthetic voices (VD male; GOI female; SAG male; and IVG male) and six different contexts (navigation, reservation, public transport, gaming, weather and public service) in the survey. First, the listener was asked to rate the adequacy of the synthesized sentence in the given context on a slider offered by the evaluation interface. The two ends of the slider were labelled “very inappropriate” (0%) and “very appropriate” (100%). Statement (1) shows an example of the public transport task for VD.

(1) Stellen Sie sich vor, Sie fahren mit der S-Bahn durch Wien und die Ansage der Haltestellen erfolgt mit dieser Stimme (Imagine that you are travelling through Vienna by suburban train and the stops are announced with this voice). Bitte bewerten Sie, ob Ihnen diese Stimme in der angegebenen Situation als passend erscheint (Please evaluate whether this voice seems appropriate to you in the given situation).

A Mean Opinion Score (MOS) test on the quality of the synthetic voice followed, since we wanted to look into the connection between the adequacy rating and the quality rating of a voice. Each sample was rated on an ordinal scale (1 – “sehr gut (very good)”, 2 – “gut (good)”, 3 – “neutral (neutral)”, 4 – “eher schlecht (poor)”, and 5 – “sehr schlecht (very bad)”). Figure 2 shows the MOS of the different voices’ quality. An ANOVA revealed a significant effect of voice type. Significant differences (p<0.05) were found between the following voice pairs: VD and GOI, VD and SAG, GOI and SAG, GOI and IVG, SAG and IVG. Only the comparison between the VD and IVG voices revealed no significant difference. Overall, the SAG voice was rated significantly better than the nonstandard voices. Correlational analyses revealed significant correlations between the rating of contextual appropriateness and the quality of the synthetic voice. Contrary to our expectations, a positive correlation between the dialect background of the listener and the rating of adequacy was not found. In addition, there was no significant difference between young (<30 years) and older (>30 years) listeners. However, it is worth mentioning that the mean rating of younger participants with a dialect background was slightly higher than that of the participants without a dialect background and of the older participants. The findings in study two showed that there is low appropriateness for high-quality voices in certain tasks (the VD voice is not appropriate for the online banking task) and high adequacy for low-quality voices in certain tasks (GOI is appropriate for a reservation task).

Figure 2. Mean Opinion Score for synthetic voices of different Austrian varieties (GOI = Bad Goisern dialect; IVG = Innervillgraten dialect; SAG = Standard Austrian German; VD = Viennese Dialect)
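The voice-type effect and the adequacy-quality correlation reported above were established over listener ratings; below is a minimal sketch of that style of analysis using scipy on made-up MOS and adequacy values (the numbers are purely illustrative, not the study's data).

```python
from scipy import stats

# Hypothetical MOS ratings per voice (1 = very good ... 5 = very bad); illustrative only.
ratings = {
    "SAG": [1, 2, 1, 2, 2, 1],
    "VD":  [3, 4, 3, 3, 4, 3],
    "GOI": [4, 4, 5, 3, 4, 4],
    "IVG": [3, 3, 4, 3, 3, 4],
}
means = {voice: sum(r) / len(r) for voice, r in ratings.items()}
print("mean MOS per voice:", means)

# One-way ANOVA across the four voices (effect of voice type on the ratings).
f_stat, p_value = stats.f_oneway(*ratings.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Rank correlation between adequacy (0-100 slider) and quality ratings (hypothetical pairs).
adequacy = [90, 75, 60, 40, 30, 20]
quality = [1, 2, 2, 4, 4, 5]
rho, p_corr = stats.spearmanr(adequacy, quality)
print(f"Spearman rho = {rho:.2f}, p = {p_corr:.4f}")
```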

4 Conclusion
We conclude that TTS for nonstandard varieties like dialects or sociolects, and for different registers, will inevitably gain societal relevance in the future. It is not enough to bring synthetic standard variety voices to perfection. In fact, we have to think beyond the synthesis of standard languages, because a need for a versatile landscape of synthetic voices is foreseeable. Low-resourced languages or language varieties, like local dialects, can also benefit from HLT: the thorough documentation of a language needed to create HLT helps slow down language extinction. Digitizing local dialects may also have a positive effect on their speakers, since language is a tool of empowerment and carries a cultural identity (Besacier et al., 2014) that the roofing standard variety could never provide. With the Special Research Programme project “German in Austria” (Lenz, 2018), there is currently a systematic effort to collect a wide range of speech data across all registers, varieties, demographics and regions of Austria. The versatile connotations of regional dialects can be utilized to achieve a higher acceptance of synthetic voices in everyday life and thereby facilitate the application of these dialects in various situations. This can be seen as a unique chance to diversify HLT for Austrian varieties.

Apart from improving human-computer interaction, TTS can also bring valuable insights to dialectological questions. Pucher et al. (2019) showed how speech synthesis can be used as a tool to investigate questions concerning the prosodic distance between dialects. This paper therefore argues that TTS and HLT can operate beyond their realm of human-computer interaction and can be in a reciprocal relationship with other disciplines, discovering new frontiers yet again.
Evaluation of TTS is a pressing issue in its own right in the speech synthesis community. The awareness of the need for new approaches, or even for rethinking the conception of TTS evaluation, is illustrated by researchers like Wester et al. (2015), Mendelson & Aylett (2017) and especially Wagner et al. (2019). As Section 3 showed, evaluating nonstandard variety TTS brings up additional issues. Since the standard-dialect continuum is characterized by an asymmetrical prestige distribution, nonstandard variety TTS is highly dependent on being licensed by an appropriate context. For nonstandard TTS applications, the performance of the system even seems to be secondary to the user. For further studies on nonstandard variety TTS, it is indispensable not only to improve the systems and voices, but the evaluation methods as well. The presentation of synthesized nonstandard varieties is crucial to the assessment of these systems, as study one (Pucher et al., 2017) and study two (Pucher et al., 2018) showed. Due to the versatile indexical meaning of dialects and sociolects, mono-cultured evaluations must be dismissed. That a setting like the one in study one is unfavourable for the listener can also be seen in the rather low MOS the original speaker achieved (see Figure 1). When presented without context, even original dialectal speech does not seem genuinely authentic. From this, a further issue arises that is not discussed in this paper: only a small fraction of resources in TTS development is dedicated to evaluating the listeners’ needs. This contribution therefore closes by endorsing Wagner et al.’s (2019) call for making TTS evaluation a research area in its own right. Improving TTS by diversifying standards, methods and contributing disciplines will enable the community to meet the increasing societal demand for authentic synthetic voices.

References
Ammon, U. 2015. Die Stellung der deutschen Sprache in der Welt. Berlin, Boston: De Gruyter.
Begum, A., Askari, S. Md. S., & Sharma, U. 2019. Text-to-speech synthesis system for Mymensinghiya dialect of Bangla language. In: Panigrahi, C. R., Pujari, A. K., Misra, S., Pati, B., & Li, K.-C. (eds.): Progress in advanced computing and intelligent engineering, Vol. 714. Singapore: Springer. 291-303.
Besacier, L., Barnard, E., Karpov, A., & Schultz, T. 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85-100.
Beskow, J. & Gustafson, J. 2009. Experiments with synthesis of Swedish dialects. In: Proceedings of Fonetik 2009. 28-29.
Clyne, M. (ed.). 1991. Pluricentric languages: Differing norms in different nations. Berlin, Boston: De Gruyter.
Hu, Q., Tao, J., Pan, S., & Zhao, C. 2011. HMM-based Tianjin dialect speech synthesis using bilateral question set. In: 2011 IEEE International Workshop on Machine Learning for Signal Processing. 1-4.
King, S. 2014. Measuring a decade of progress in Text-to-Speech. Loquens, 1(1), e006.
Krenn, B., Endrass, B., Kistler, F., & André, E. 2014. Effects of language variety on personality perception in embodied conversational agents. In: Human-Computer Interaction. Advances Interaction Modalities and Techniques - 16th International Conference. 429-439.
Krenn, B., Schreitter, S., Neubarth, F., & Sieber, G. 2012. Social evaluation of artificial agents by language varieties. In: Nakano, Y., Neff, M., Paiva, A., & Walker, M. (eds.): Intelligent virtual agents, Vol. 7502. Berlin, Heidelberg: Springer. 377-389.
Leiner, D. J. 2018. SoSci Survey (Version 3.2.03-i) [Computer software]. Available from: https://www.soscisurvey.de.
Lenz, A. N. 2018. The special research programme “German in Austria. Variation – Contact – Perception”. In: Ammon, U. & Costa, M. (eds.): Sprachwahl im Tourismus – mit Schwerpunkt Europa. Language choice in tourism – Focus on Europe. Choix de langues dans le tourisme – focus sur l’Europe. Berlin, Boston: De Gruyter. 269-277.
Mendelson, J. & Aylett, M. P. 2017. Beyond the listening test: An interactive approach to TTS evaluation. In: Interspeech 2017. 249-253.
Moore, R. K. 2017. Appropriate voices for artefacts: Some key insights. In: 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR-2017). Skövde, Sweden. 7-11.
Moosmüller, S. 1987. Soziophonologische Variation im gegenwärtigen Wiener Deutsch. Eine empirische Untersuchung. Stuttgart: Steiner (= Zeitschrift für Dialektologie und Linguistik, Beihefte 56).
Moosmüller, S. 1995. Evaluation of language use in public discourse. In: Stevenson, P. (ed.): The German language and the real world. Sociolinguistic, cultural, and pragmatic perspectives on contemporary German. Oxford: Clarendon Press. 257-278.
Navas, E., Hernaez, I., Erro, D., Salaberria, J., Oyharçabal, B., & Padilla, M. 2014. Developing a Basque TTS for the Navarro-Lapurdian dialect. In: Navarro Mesa, J. L., Ortega, A., Teixeira, A., Hernández Pérez, E., Quintana Morales, P., Ravelo García, A., Guerra Moreno, I., & Toledano, D. T. (eds.): Advances in speech and language technologies for Iberian languages, Vol. 8854. Heidelberg, Dordrecht, London, New York: Springer. 11-20.
Pucher, M., Lozo, C., & Moosmüller, S. 2017. Phone mapping and prosodic transfer in speech synthesis of similar dialect pairs. In: Trouvain, J., Steiner, I., & Möbius, B. (eds.): Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2017. Dresden: TUDpress. 180-185.
Pucher, M., Lozo, C., & Moosmüller, S. 2018. Evaluation methods for dialect speech synthesis of similar dialect pairs. In: DAGA 2018 - 44. Jahrestagung für Akustik. 515-517.
Pucher, M., Lozo, C., Vergeiner, P., & Wallner, D. 2019. Diphthong interpolation, phone mapping, and prosody transfer for speech synthesis of similar dialect pairs. In: 10th ISCA Speech Synthesis Workshop. 200-204.
Pucher, M., Neubarth, F., Strom, V., Moosmüller, S., Hofer, G., Kranzler, C., Schuchmann, G., & Schabus, D. 2010. Resources for speech synthesis of Viennese varieties. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC). Valletta, Malta. 105-108.
Pucher, M., Rausch-Supola, M., Moosmüller, S., Toman, M., Schabus, D., & Neubarth, F. 2016. Open data for speech synthesis of Austrian German language varieties. In: Draxler, C. & Kleber, F. (eds.): Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum. München. 147-150.
Pucher, M., Schuchmann, G., & Fröhlich, P. 2009. Regionalized text-to-speech systems: Persona design and application scenarios. In: Esposito, A., Hussain, A., Marinaro, M., & Martone, R. (eds.): Multimodal signals: Cognitive and algorithmic issues, Vol. 5398. 216-222.
Tamagawa, R., Watson, C. I., Kuo, I. H., MacDonald, B. A., & Broadbent, E. 2011. The effects of synthesized voice accents on user perceptions of robots. International Journal of Social Robotics, 3(3), 253-262.
Toman, M., Pucher, M., Moosmüller, S., & Schabus, D. 2015. Unsupervised and phonologically controlled interpolation of Austrian German language varieties for speech synthesis. Speech Communication, 72, 176-193.
Wagner, P. & Betz, S. 2017. Speech synthesis evaluation: Realizing a social turn. In: Trouvain, J., Steiner, I., & Möbius, B. (eds.): Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2017. Dresden: TUDpress. 167-173.
Wagner, P., Beskow, J., Betz, S., Edlund, J., Gustafson, J., Eje Henter, G., Le Maguer, S., Malisz, S., Székely, É., Tånnander, C., & Voße, J. 2019. Speech synthesis evaluation: State-of-the-art assessment and suggestion for a novel research program. In: 10th ISCA Speech Synthesis Workshop. 105-110.
Wester, M., Valentini-Botinhao, C., & Henter, G. E. 2015. Are we using enough listeners? No! - An empirically-supported critique of Interspeech 2014 TTS evaluations. In: Interspeech 2015. 3476-3480.
Yamagishi, J. & Watts, O. 2010. The CSTR/EMIME HTS system for Blizzard Challenge 2010. In: Proceedings of Blizzard Challenge 2010. Available from: https://era.ed.ac.uk/bitstream/handle/1842/4864/CSTR_Blizzard2010.pdf?sequence=1&isAllowed=y
Zen, H., Tokuda, K., & Black, A. W. 2009. Review: Statistical parametric speech synthesis. Speech Communication, 51(11), 1039-1064.

“VIENNESE MONOPHTHONGS”: PRESENT – MARKED – GIVEN? ON INTRA-INDIVIDUAL VARIATION OF DIPHTHONGS IN STANDARD GERMAN PRONUNCIATION IN RURAL AUSTRIA

Jan Luttenberger1 and Johanna Fanta-Jende2
1Acoustics Research Institute, Austrian Academy of Sciences
2Department of German Studies, University of Vienna
[email protected], [email protected]

Abstract

Since the late 19th century, the "Viennese Monophthongization", a process in which the /aɛ̯/ and /ɑɔ̯/ diphthongs are levelled to /æː/ and /ɒː/, has emerged as a complex feature of Eastern Central Bavarian in Vienna and Lower Austria. While often recognized by other speakers of German as a nonstandard variant and serving as an example of "classic" sound change in the Viennese dialect, it has also gradually reached non-dialectal language use, suggesting high acceptability as an unmarked form among its speakers (Moosmüller, 2011; Moosmüller & Vollmann, 2001). Since the status of diphthong realization in the area outside of Vienna remains largely unclear, we investigate the phenomenon of monophthongization in rural Austria by looking at intra-individual variation in three different settings, drawing on the corpus of the Special Research Programme "German in Austria" (Lenz, 2018). The samples comprise the reading of a list of single-word lexemes, a reading of the fable "Northwind and Sun", and a formal-style interview. Applying a new mathematical model to the F1 and F2 trajectories of the examined instances of phonological /aɛ̯/ and /ɑɔ̯/ vowels, we obtained scores describing the degree of diphthongization; these show a trend toward monophthongized realizations in spontaneous speech compared to the reading tasks.

Keywords: Viennese Monophthongization, Standard Austrian German, intra-individual variation, formant measurements, mathematical modelling

1 State of Research: The "Viennese Monophthongization"
In the context of variationist linguistics in Austria, the term "Viennese monophthongization" often describes two processes, both emerging from Austria's capital Vienna. The first process can be dated back to 1120 (see Wiesinger, 2001: 98-122) or at least to the late 13th century (see Kranzmayer, 1956: 60). At this time, a language shift driven by historical and phonological factors seems to have taken place in Vienna, promoting the use of an /a/ monophthong instead of the former /ɔɐ̯/ diphthong for all words deriving from Middle High German (MHG) ei. Note that the Middle High German vowel system is typically used in German dialectology as a

reference system for current variationist research. By assuming that all German dialects root back to the same idealized historic sound framework, comparisons between differing developments and sound changes can be captured (cf. Löffler, 2003: 65-58 and Wiesinger, 1983: 813 for further discussion). Typical examples are /braːt/ or /haːs/ instead of /brɔɐ̯t/ and /hɔɐ̯s/ for New High German (NHG) breit "broad" and heiß "hot". The /ɔɐ̯/ diphthong has often been referred to as an "indexical Central Bavarian feature" ("bairisches Leitmerkmal", Scheutz, 1999: 118) or as a "default realization" ("Normalrealisierung", Scheuringer, 1990: 235), dominating almost the entire Austrian landscape apart from minor variants on a local level (e.g. /ʊɐ̯/ or /ɔɪ̯/ diphthongs; see Wiesinger's map for the lexeme heim "home", WEK, 1962-1969). At the beginning of the 20th century, the /a/ monophthong began to reach beyond the city borders of Vienna (see Pfalz, 1910: IX), continuously spreading to other parts of Austria until today (see Scheutz, 1999: 117; Unger, 2014; Lenz, 2019).
The second process connected to the term "Viennese monophthongization" describes the assimilation of the first and second parts of the diphthongs /aɛ̯/ and /ɑɔ̯/ (see Moosmüller & Scheutz, 2013: 83), resulting, for example, in [væːs] NHG weiß "white" or [hɒːs] NHG Haus "house". This (second) Viennese Monophthongization will be the main object of investigation in the present paper. Gartner (1900) was the first to document this phenomenon, characterizing the Viennese realizations of the written diphthong as "äi with an a that approaches [+open] ę" ("äi mit einem a, das sich dem ę nähert", Gartner, 1900: 143; emphasis in original; translation by the authors)1. Likewise, Luick (1932) declares for the "most articulated Viennese" that "no diphthongs can be found anymore but instead only very open e and [in the case of /ɑɔ̯/] o sounds" ("Im ausgesprochendsten Wienerischen werden überhaupt keine Diphthonge mehr, sondern bloß sehr offene e und [im Falle von /ɑɔ̯/] o dafür gesprochen"; Luick, 1932; translation by the authors). While Gartner (1900: 143) describes the "simplification" ("Vereinfachung") of the diphthong as a particular feature of young speakers from the lower social classes, the semi-monophthongized form "æi" is described by Bíró (1910: 22) as a characteristic specifically of the "Austrian educated class" ("gebildeter Kreis"). This (second) process of monophthongization was presumably completed around 1940, being used not only in weak prosodic positions without compensatory lengthening but in all structurally similar positions of the entire lexicon (see Moosmüller, 1998; Moosmüller, 2002: 100; Moosmüller & Scheutz, 2013: 83). Furthermore, a "sociological-vertical extension to all social classes" (Moosmüller & Scheutz, 2013: 83) must currently be assumed. In contrast to the previously described process of monophthongization (/ɔɐ̯/ to /a/), the present shift from /aɛ̯/ to /æː/ and /ɑɔ̯/ to /ɒː/ is a comparatively new phenomenon, affecting not only the Viennese dialect, but particularly the Viennese standard

language as well. This is especially the case with the young urban generation, who use a local intended standard variety as the primary language of everyday life, with dialectal forms holding only an indexical function in ironic, expressive or imitational speech (see Soukup, 2009: 39; Glauninger, 2012). The consequence is homonymy between, for example, [væːs] for (ich) weiß "(I) know" and [væːs] as in the color Weiß "white". As these lexemes go back to different MHG vowels (MHG ei and MHG î), the distinction is still maintained in the Viennese dialect, but it is represented neither in general Standard Austrian German nor in the Viennese standard variety. For a better understanding, we provide an overview of the complex constellation of the two processes labelled "Viennese Monophthongization" (Figure 1).

1 When citing an author, we try to depict the vowels in question according to the original transcription convention (e.g. Teuthonista), marked by double quotation marks. In all other cases, the International Phonetic Alphabet (IPA) is used (International Phonetic Association, 2014).

Figure 1. Overview of the processes labelled as "Viennese Monophthongization"2
(1st) Viennese Monophthongization (12th/13th century): from /ɔɐ̯/ to /aː/ in the Viennese dialect (only MHG ei)
(2nd) Viennese Monophthongization (around 1900): from /aɛ̯/ to /æː/ and /ɑɔ̯/ to /ɒː/ in the Viennese standard (MHG ei, î, û, ou) and (partly) in the Viennese dialect (only MHG î and MHG ou)

Even though the typical Viennese variants are often stigmatized by other speakers in Austria (see Scheutz, 1999: 117), attitudinal-perceptual studies have demonstrated that standard-near recordings as well as the concept of “high German” are often associated with educated speakers from Vienna (see Lenz, 2019; Moosmüller, 1991). Hence, an increasing spread of Viennese forms can be observed, possibly covering

2 Note that the pronunciation of MHG ou in Bavarian dialects (Viennese dialect included) is usually dependent on the phonetic context: it becomes /aː/ before bilabial and labiodental consonants (e.g. in Baum "tree") and remains /ɑɔ̯/, as in NHG, before velar consonants (e.g. Auge "eye"), vowels (e.g. Mauer "wall") and in word-final position (e.g. Frau "woman"; see Lenz, 2019).

great parts of the Eastern Central Bavarian dialect region and also slowly pushing forward into South-Central and Southern Bavarian areas (see Moosmüller, 1991: 35-37; Moosmüller, 1998: 10; Moosmüller & Scheutz, 2013). Within Standard Austrian German (SAG), however, the normative status remains unclear. The "Viennese monophthongs" may not be salient to the respective speakers and therefore seem to be a legitimate form used among nonprofessional speakers throughout their entire linguistic spectrum (Fanta-Jende, in print). Still, the full diphthongs /aɛ̯/ and /ɑɔ̯/ are prescribed as the only variants in the "higher" standard pronunciation of professional speakers (Krech et al., 2009). Additionally, previous studies have identified inter-situational differences depending on the method of investigation or elicitation task. Comparing her results to those of Iivonen (1989, 1994), Moosmüller (1998) identifies less articulatory movement in her own data, which "could be explained by the fact that Iivonen's measurements were made on isolated words", while Moosmüller relied on connected speech material from a reading task and spontaneous conversations. Based not only on formant measurements but also on timing relations within the diphthong, her results demonstrate higher monophthongization rates in the spontaneous speech material, especially in weak prosodic positions (see Moosmüller, 1998: 20).

2 Intra-individual Variation
A great number of sociolinguistic studies show that variation in speech corresponds only very broadly to large social categories such as class and gender. Instead, speech variation depends much more on pragmatic, topical and situational context, illustrating a wide range of registers within a single speaker (see Labov, 1967; Gumperz, 1982; Eckert, 2008). The language situation in Austria appears to offer an "ideal research laboratory" (Lenz, 2018: 269) for variationist linguistics in this regard due to its complex variety constellations and vivid "social-vertical" dynamics. While there have been thorough investigations of patterns of "switching" and "shifting" along the dialect-standard axis for particular regions of Germany (see Lenz, 2003; Kehrein, 2012), comparable endeavors are still uncommon in the Austrian context (see Scheutz, 1985; Scheuringer, 1990). With regard to the 1st Viennese Monophthongization, Scheutz (1985), for instance, registered more /aː/ monophthongs instead of /ɔɐ̯/ diphthongs in formal conversations than in informal speech situations in the Central Bavarian town of Ulrichsberg in Upper Austria. The fact that the individual speakers reduced their number of diphthongs in favor of another nonstandard variant (in this case the Viennese one, and not necessarily the standard variant /aɛ̯/) indicates that the Viennese variant is perceived as less dialectal by the respective speakers and, consequently, as the "most appropriate" for the given formality and orality of the situation (Scheutz, 1985: 243). To our knowledge, there are no studies focusing particularly on the pronunciation of the diphthongs /aɛ̯/ and /ɑɔ̯/ and their intra-individual variation according to varying situational settings. Hence, it is the goal of the present paper to fill this research gap.

3 Survey Methods
To tackle the questions of "areal-horizontal" and "social-vertical" variation of monophthongized /aɛ̯/ and /ɑɔ̯/ in Austria, we draw on the corpus of the Special Research Programme (FWF F60 – project parts PP02 and PP03) "German in Austria. Variation – Contact – Perception" (henceforth SFB). Its broad methodological repertoire has proven to be a good means of accessing the participants' overall language repertoires and individual dialect-standard spectra (Fanta-Jende, in print). Consequently, we assume differences within the individual standard repertoire, e.g. according to the formality or "freedom" of the elicitation task (reading of isolated words vs. connected-speech reading vs. "free" conversation). Hence, all participants completed the following speech tasks:
i) The reading of single word lexemes (SWL) to elicit the "best" individual standard variety (n = 18 /aɛ̯/ and n = 6 /ɑɔ̯/ per person).
ii) A reading of the fable "Northwind and Sun" (N&S) in its German translation as an indicator of highly controlled but full-sentence reading pronunciation (n = 10 /aɛ̯/ and n = 3 /ɑɔ̯/ per person). It has to be noted that only n = 5 instances of /aɛ̯/ occur in sentence accent positions.
iii) A formal interview (INT) about the speakers' language habits, conducted by an interviewer of the SFB speaking in intended SAG (WEIS: n = 34 /aɛ̯/ and n = 11 /ɑɔ̯/; NMYB: n = 30 /aɛ̯/ and n = 16 /ɑɔ̯/; NECK: n = 42 /aɛ̯/ and n = 17 /ɑɔ̯/; for abbreviations see the second paragraph below).
While tasks i) and ii) aimed at eliciting SAG exclusively, the interview allows for much broader stylistic variation. The interviewers strived to maintain a standard-near pronunciation throughout the entire interview, assumedly offering an authentic language situation with a high degree of formality (e.g. formal Sie "you"). Yet, it appears that the interviewees do not completely accommodate, but rather make use of their entire linguistic spectrum, including frequent switches to dialectal or regiolectal variants. Since we concentrate on variation in the "higher" spectra of the vertical axis, we will not consider those dialectal variants in our investigation.
Since the focus of our study lies strongly on intra-individual variation, we analyzed the individual repertoires of only three persons. All of them are elderly women (aged above 65) and autochthonous in their hometown, i.e. a rural village/town in Austria, with at least one parent from the same location; hence all speakers are comparable from a sociolinguistic viewpoint. The selected villages have a population of 500 to 2000 inhabitants, each representing one major Bavarian dialect area of Austria: Neumarkt an der Ybbs (NMYB) in western Lower Austria as a representative of Central Bavarian, Neckenmarkt (NECK) in central Burgenland for South Central Bavarian, and Weißbriach (WEIS) in Carinthia as the South Bavarian location.

Based on the aforementioned assumptions about the spreading of the 2nd Viennese Monophthongization, we hypothesize that monophthongized realizations are most common in the Central Bavarian region and weakest in the South Bavarian region due to the influence of Vienna. A tendency toward monophthongization should therefore appear more prominently in NMYB and NECK.

4 Data Analysis
Phenomena like the /ɔɐ̯/ → /a/ monophthongization are relatively easy to identify and to categorize because they are phonologically prominent input-switch rules in the sense of Dressler & Wodak (1982). In contrast, the variants of /aɛ̯/ and /ɑɔ̯/ are assumed to be gradual, ranging from clear diphthongs to clear monophthongs without falling into separate phonological categories. Although the endpoints of this continuous space, as well as the general direction of articulatory movement, can be aurally distinguished and can bear (sociophonetic) meaning, we strive for more reliability and comparability. Therefore, to capture the range of articulation, we drew on acoustic measurements for a description of the encountered sounds. Since our selected phenomena revolve around vowels, the analysis included the measurement of the first two formants in Hertz at 20 points for each vowel. The formants were obtained automatically using the formant tracker implemented in the software STx (Noll et al., 2019). These automatic formant measurements were checked and corrected along the relative maxima in the spectrum in cases where the formant tracker had failed.
To achieve a method of comparison independent of subjective human interpretation, we used a simple derivation that tries to formalize the procedure a human expert would use to distinguish between those settings. More specifically, we manually marked the start and end point of the vowel and sampled the first and second formant at 20 equidistant points. Thus, we obtained the samples $f_i^{(1)}$ and $f_i^{(2)}$ for $i = 1, \ldots, 20$ for each vowel. Note that this approach also resulted in time normalization. In preliminary experiments, we observed that the first and last four samples were unreliable positions to use in a quantification step, as the formants in these regions were heavily affected by coarticulation. Thus, we used the arithmetic mean $f_{\mathrm{start}}^{(j)} = \frac{1}{3}\bigl(f_5^{(j)} + f_6^{(j)} + f_7^{(j)}\bigr)$ of samples 5-7 for $j = 1, 2$ to represent the formant frequencies of the starting vowel and the arithmetic mean $f_{\mathrm{end}}^{(j)} = \frac{1}{3}\bigl(f_{14}^{(j)} + f_{15}^{(j)} + f_{16}^{(j)}\bigr)$ of samples 14-16 to represent the formant frequencies of the ending vowel. The difference between the formant frequencies of the starting and the ending vowel was then normalized by their average, as we expected larger variations at higher frequencies. We thus obtained the two parameters
$$d^{(j)} = \frac{f_{\mathrm{start}}^{(j)} - f_{\mathrm{end}}^{(j)}}{\bigl(f_{\mathrm{start}}^{(j)} + f_{\mathrm{end}}^{(j)}\bigr)/2} \quad \text{for } j = 1, 2,$$
whose magnitude ranges, in theory, between 0 and 2. A high value of these parameters indicates that the formants changed significantly from the start to the end of the vowel. Since a change in formant frequency also indicates a change in vowel quality, a high value indicates a (strong) diphthong. A small value close to zero indicates no change in formant frequency and hence a monophthongic realization. Consider the following case as an example: to reach an index of (-)1, the formant frequency has to diminish by two thirds or to triple relative to the starting value. Practically, human articulation does not allow for values much higher than 1, considering a prototypical range of 250-900 Hz for F1 and 600-2500 Hz for F2. Additionally, the algebraic sign "+" denotes a falling trajectory, i.e. formant values are decreasing, while the sign "-" analogously denotes a rising trajectory. Depending on the specific vowel, the differences can be more prominent in either formant, as depicted in the examples below. Also note that this method does not discriminate between rising and falling diphthongs.
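To make the computation concrete, the following minimal Python sketch reproduces the scoring procedure as defined above. The function and variable names, as well as the example formant values, are our own illustration under the stated assumptions and are not part of the STx workflow.

```python
from statistics import mean

def diphthong_scores(f1_samples, f2_samples):
    """Compute the diphthongization scores d(1) and d(2) from 20 equidistant
    F1/F2 samples (Hz) taken between the manually marked start and end of a vowel.

    A score near 0 indicates a monophthongic realization; values around 0.3-0.4
    indicate a clear diphthong. A positive sign marks a falling formant
    trajectory, a negative sign a rising one.
    """
    scores = []
    for samples in (f1_samples, f2_samples):
        if len(samples) != 20:
            raise ValueError("expected 20 equidistant formant samples per vowel")
        f_start = mean(samples[4:7])    # samples 5-7 (1-based numbering in the text)
        f_end = mean(samples[13:16])    # samples 14-16
        # difference normalized by the average of the two means
        scores.append((f_start - f_end) / ((f_start + f_end) / 2))
    return tuple(scores)  # (d_F1, d_F2)

# Hypothetical example: an /aɛ̯/-like token with falling F1 and rising F2
f1 = [700 - 15 * i for i in range(20)]
f2 = [1400 + 40 * i for i in range(20)]
d1, d2 = diphthong_scores(f1, f2)
print(round(d1, 2), round(d2, 2))  # positive d1, negative d2 -> diphthongal /aɛ̯/
```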
We expected three broad categories of possible realizations:
i) (Articulatory) rising diphthongs /aɛ̯/ or /ɑɔ̯/, associated with SAG, but also still present in the Southern Bavarian dialect as comparably distant from the Viennese influence (see e.g. Kranzmayer, 1956; Fanta-Jende, in print), and assumed to have been present in Central Bavarian prior to the 20th century (see e.g. Pfalz, 1913). These would result in a decreasing F1 and an increasing F2 for /aɛ̯/, while both formants for /ɑɔ̯/ would be decreasing. Therefore, /aɛ̯/ should exhibit positive F1 scores and negative F2 scores, while for /ɑɔ̯/ both scores should be positive. In practice, clearly distinguishable diphthongs of these categories should reach indices between 0.3 and 0.4.
ii) Monophthongs /æː/ or /ɒː/, associated with the Viennese dialect and the Eastern Central Bavarian dialect in general, would result in (nearly) no formant movement. Under the present circumstances, we consider a formant score below 0.1 a clear monophthong, which corresponds to a change in formant frequency of around 10% from start to end.
iii) "Reversed", falling or centralizing diphthongs /ɛɐ̯/ and /ɔɐ̯/, an even more salient variant of the aforementioned monophthong, first reported by Kranzmayer (1956) and reaffirmed by base dialect recordings from the SFB, would have formants that travel in the opposite directions compared to variant i).
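For illustration, a rough rule-of-thumb mapping of score pairs to these three categories, for the /aɛ̯/ case, could look as follows. The thresholds (0.1 for monophthongs, roughly 0.3 for clear diphthongs) are taken from the description above, but the decision rule itself is only a sketch and not the procedure used to produce the figures below.

```python
def classify_ae(d_f1, d_f2, mono_threshold=0.1, diph_threshold=0.3):
    """Rough categorization of an /aɛ̯/ token from its F1/F2 scores.

    (i)   rising diphthong [aɛ̯]: falling F1 (positive d_f1), rising F2 (negative d_f2)
    (ii)  monophthong [æː]: (nearly) no formant movement
    (iii) reversed/centralizing diphthong [ɛɐ̯]: trajectories opposite to (i)
    """
    if abs(d_f1) < mono_threshold and abs(d_f2) < mono_threshold:
        return "monophthong"
    if d_f1 >= diph_threshold and d_f2 <= -mono_threshold:
        return "rising diphthong"
    if d_f1 <= -mono_threshold and d_f2 >= mono_threshold:
        return "reversed diphthong"
    return "intermediate"

print(classify_ae(0.35, -0.25))  # -> "rising diphthong"
print(classify_ae(0.05, -0.02))  # -> "monophthong"
```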

5 Results and Interpretation
First, we will take a look at the mean formant scores for each person in each of the three settings (Figure 2). Both formant scores decrease with formality for each participant. Although the setting N&S is a reading task similar to SWL, its scores seemed to be much closer to those of the spontaneous speech of INT. This might be partially explained by the fact that half of the instances do not bear any sentence accent. We also observed that F1 scores appeared to be more prominent than F2 scores. For the spontaneous speech of INT, we found that the pronunciation tends toward monophthongized forms for all three persons.
Looking at the averaged /ɑɔ̯/ scores, we encountered a slightly different picture (Figure 3). While, in general, INT scores lowest compared to the reading settings, we identified a strong movement of F1 with a slight reversal of F2 for the informant WEIS, even when speaking spontaneously. An examination of the corresponding items strongly suggested a centralization process for the second part of /ɑɔ̯/ in the informal speech of WEIS in the INT setting. This would indicate a shift toward a more dialectal realization, as a centralization of vowels is a known phenomenon for some

areas in the Southern Bavarian region (cf. Kranzmayer, 1956: 50). In contrast, NMYB and NECK tended to pronounce monophthongs quite unambiguously. Although based on an arguably very small sample of only three items, we also found an unexpectedly high number of monophthongs in the reading speech of the N&S setting for the informant NECK.

Figure 2. Mean F1 (left) and F2 (right) indices by person and setting for realizations of /aɛ̯/

Figure 3. Mean F1 (left) and F2 (right) indices by person and setting for realizations of /ɑɔ̯/

For a more detailed view of the score-wise opposing settings SWL and INT, we show the plots resulting from entering the computed values into a coordinate system with F1 scores on the x-axis and F2 scores on the y-axis (Figures 4-7).

Figures 4, 5. Plots of combined scores for F1 (x-axis) and F2 (y-axis) for SWL setting


Figures 6, 7. Plots of combined scores for F1 (x-axis) and F2 (y-axis) for INT setting

For the SWL setting, all data points fell well within the space of /aɛ̯/ and /ɑɔ̯/. For /aɛ̯/, we also observed an F1 movement as the more conclusive feature indicating a diphthong, while for /ɑɔ̯/ both formants seemed to change similarly. Regarding the INT setting, most data points fell closer to the origin, indicating monophthong realizations, although a rather wide range of variation can be observed. This included instances where one or both trajectory directions appeared to be reversed. Given their small number, they might rather be outliers in a dynamic range than instances of a separate variety. Again, for /aɛ̯/, F1 seems to be the more prominent constituting element.

6 Discussion
Throughout the examined data, the variable of "setting" in which the participants produced vowels showed a clear influence. Reading tasks elicited more prominent diphthongs than spontaneous speech. This correlates well with the authors' impression that the participants exhibited much more colloquial registers in the interview, even though they were talking to a stranger who tended toward a formal style of speech. This was true regardless of which dialectal region the person originated from. An influence of the dialectal background nevertheless seems likely, since NECK and especially NMYB tended to use more monophthongs in comparison to WEIS. Both towns, Neckenmarkt and Neumarkt an der Ybbs, are much closer to Vienna than Weißbriach, and it is safe to assume that their local dialects are more strongly under the influence of the Viennese dialect. On the other hand, less pronounced articulation combined with a high speech rate seemed to play a role in the diminished scores. Assuming that there is only one phonological category for the whole exhibited spectrum, it remains unclear whether the realizations striving toward monophthongs signify a shift in sociolinguistic speech style or merely indicate a less careful pronunciation resulting from an increased speech rate.

7 Résumé
In this article, we took a first glance at the intra-individual variation of diphthong pronunciation regarding Austrian German /aɛ̯/ and /ɑɔ̯/. We looked at the realization of these vowels in three different tasks performed by three informants from different dialectal regions of Austria. By applying an automated measurement of formant trajectories to analyze fine-grained gradual variation, we employed a simple yet effective mathematical model to obtain scores for easy comparison and maximum objectivity. Since this modeling approach was used for the first time, its reliability has yet to be proven and future adjustments seem to be needed. While the investigated samples are far too small to allow for any general statement, they demonstrated that diphthong realization in non-standard Austrian German varies considerably even intra-individually, hinting at the complex interplay between standard and nonstandard varieties in Austria. Of course, future studies would have to include more informants to gain more representative data. Moreover, not only phonetic and phonological, but also pragmatic and conversational parameters will have to be examined thoroughly to reveal (socio-)phonetic variation at the micro-level. In particular, speech rate and other physiological factors have to be ruled out if variance in vowel realization is to be interpreted as stemming from stylistic or sociolinguistic factors. The development and refinement of mathematical tools and models to process the data gained from acoustic measurements is another very important field. Complementing the aural and visual judgements of researchers, especially those with an academic background in philology and linguistics, such mathematical tools and models allow for greater inter-subjective comparability and overall robustness of judgement. At the same time, further

automatization of data analysis seems highly desirable to account for limited resources, especially in time and workforce. In this regard, we hope to have provided a starting point for further investigation and development.

Acknowledgements
Many thanks to Günther Koliander from the Acoustics Research Institute of the Austrian Academy of Sciences, who "translated" our understanding of diphthong formant structure into the mathematical model used in this article.

References

Bíró, L. A. 1910. Lautlehre der heanzischen Mundart von Neckenmarkt: phonetisch und historisch bearbeitet [Sound theory of the Heanzisch dialect of Neckenmarkt: phonetically and historically investigated]. Leipzig: Seele.
Dressler, W. U. & Wodak, R. 1982. Sociophonological methods in the study of sociolinguistic variation in Viennese German. Language in Society, 11, 339-370.
Eckert, P. 2008. Variation and the indexical field. Journal of Sociolinguistics, 12(4), 453-476.
Fanta-Jende, J. in print. Varieties in contact. Horizontal and vertical dimensions of phonological variation in Austria. In: Lenz, A. N. & Maselko, M. (eds.): Variationist linguistics meets contact linguistics. Wiener Arbeiten zur Linguistik [Variationist linguistics meets contact linguistics. Viennese works on linguistics]. Göttingen: Vienna University Press.
Gartner, T. 1900. Lautbestand der Wiener Mundart [Sound inventory of the Viennese dialect]. Zeitschrift für hochdeutsche Mundarten, 1, 141-147.
Glauninger, M. M. 2012. Zur Metasoziosemiose des »Wienerischen«. Aspekte einer funktionalen Sprachvariationstheorie [On the meta-sociosemiosis of the »Viennese«. Aspects of a functional theory of language variation]. Zeitschrift für Literaturwissenschaft und Linguistik (LiLi), 166, 110-118.
Gumperz, J. 1982. Discourse Strategies. Cambridge: Cambridge University Press.
Kehrein, R. 2012. Regionalsprachliche Spektren im Raum: Zur linguistischen Struktur der Vertikale [Spectra of regional varieties in space: On the linguistic structure of the vertical]. Stuttgart: Steiner.
Kranzmayer, E. 1956. Historische Lautgeographie des gesamtbairischen Dialektraumes [Historic sound geography of the whole Bavarian dialect region]. Wien/Graz: Verlag der Österreichischen Akademie der Wissenschaften.
Krech, E.-M., Stock, E., Hirschfeld, U., & Anders, L. C. 2009. Deutsches Aussprachewörterbuch [German pronunciation dictionary]. 1st ed. Berlin: de Gruyter.
Iivonen, A. 1989. Regional German vowel studies. [Mimeographed Series of the Department of Phonetics] Helsinki: Helsingin Yliopiston Fonetiikan Laitoksen Moniteista. University of Helsinki. Vol. 15.
Iivonen, A. 1994. Zur gehobenen regionalen phonetischen Realisierung des Deutschen [On the elevated regional realization of the German language]. In: Viereck, W. (ed.): Verhandlungen des Internationalen Dialektologenkongresses Bamberg 29.7.-4.8.1990, Vol. 3. Regional Variation, Colloquial and Standard Languages (= Zeitschrift für Dialektologie und Linguistik. Beiheft 76). Stuttgart: Steiner. 311-330.
International Phonetic Association. 2014. Handbook of the International Phonetic Association. Cambridge: Cambridge University Press.
Labov, W. 1967. Sprache im sozialen Kontext. Beschreibung und Erklärung struktureller und sozialer Bedeutung von Sprachvariation [Language in its social context. Description and explanation of structural and social meaning of language variation]. Edited by Norbert Dittmar and Bert-Olaf Rieck. Vol. 1. Kronberg: Scriptor.
Lenz, A. N. 2003. Struktur und Dynamik des Substandards: Eine Studie zum Westmitteldeutschen (Wittlich/Eifel) [Structure and dynamics of the substandard: A study of West Middle German (Wittlich/Eifel)]. In: Zeitschrift für Dialektologie und Linguistik. Beihefte. Stuttgart: Steiner.
Lenz, A. N. 2018. The special research programme "German in Austria. Variation – Contact – Perception". In: Ammon, U. & Costa, M. (eds.): Sprachwahl im Tourismus – mit Schwerpunkt Europa. Language Choice in Tourism – Focus on Europe. Choix de langues dans le tourisme – focus sur l'Europe. Berlin, Boston: de Gruyter. 269-277.
Lenz, A. N. 2019. Bairisch und Alemannisch in Österreich [Bavarian and Alemannic in Austria]. In: Herrgen, J. & Schmidt, J.-E. (eds.): Deutsch: Sprache und Raum – Ein internationales Handbuch der Sprachvariation. Bd. 4 [Language and Space – German. An International Handbook of Linguistic Variation. Vol. 4]. Berlin, Boston: de Gruyter Mouton. 318-363.
Löffler, H. 2003. Dialektologie. Eine Einführung [Dialectology. An introduction]. Tübingen: Narr.
Luick, K. 1932. Deutsche Lautlehre: Mit besonderer Berücksichtigung der Sprechweise Wiens und der österreichischen Alpenländer [German sound theory: Especially considering the pronunciation of Vienna and the Alpine regions of Austria]. Wien: ÖBV Pädagogischer Verlag.
Moosmüller, S. 1991. Hochsprache und Dialekt in Österreich: Soziophonologische Untersuchungen zu ihrer Abgrenzung in Wien, Graz, Salzburg und Innsbruck [Standard language and dialect in Austria: Socio-phonological investigations on their boundaries in Vienna, Graz, Salzburg and Innsbruck]. Wien, Köln, Weimar: Böhlau.
Moosmüller, S. 1998. The process of monophthongization in Austria (Reading material and spontaneous speech). Papers and Studies in Contrastive Linguistics, 34, 9-25.
Moosmüller, S. 2002. Der Stellenwert der phonologischen und phonetischen Variation in der Sprechererkennung [The significance of phonological and phonetic variation in speaker recognition]. In: Braun, A. & Masthoff, H. R. (eds.): Phonetics and its applications: Festschrift for Jens-Peter Köster on the occasion of his 60th birthday. Stuttgart: Steiner. 97-109.
Moosmüller, S. 2011. Sound changes and variation in the vowel system of the Viennese dialect. In: Dębowska-Kozłowska, K. & Dziubalska-Kołaczyk, K. (eds.): On Words and Sounds. A Selection of Papers from the 40th PLM, 2009. Newcastle upon Tyne: Cambridge Scholars Publishing. 138-154.
Moosmüller, S. & Scheutz, H. 2013. Der Vokalismus in den Stadtdialekten von Salzburg und Wien zwischen Monophthongierung und E-Verwirrung: Eine phonetische Studie [The vowels in the city dialects of Salzburg and Vienna between monophthongization and e-confusion: A phonetic study]. In: Harnisch, R. (ed.): Strömungen in der Entwicklung der Dialekte und ihrer Erforschung. Beiträge zur 11. Bayerisch-österreichischen Dialektologentagung [Trends in the development of dialects and research on them. Contributions to the 11th Bayerisch-österreichische Dialektologentagung]. Passau: Edition Vulpes. 83-89.
Moosmüller, S. & Vollmann, R. 2001. "Natürliches Driften" im Lautwandel: die Monophthongierung im österreichischen Deutsch ["Natural drift" in sound change: the monophthongization in Austrian German]. Zeitschrift für Sprachwissenschaft, 20(1), 42-65.
Noll, A., Stuefer, J., Klingler, N., Leykum, H., Lozo, C., Luttenberger, J., Pucher, M., & Schmid, C. 2019. Sound Tools eXtended (STx) 5.0 – a powerful sound analysis tool optimized for speech. In: Proceedings of Interspeech 2019 – Show & Tell. Graz, Austria. 2370-2371. Available from: https://www.isca-speech.org/archive/Interspeech_2019/pdfs/8022.pdf.
Pfalz, A. 1910. Lautlehre der Mundart von D. Wagram und Umgebung [Sound theory of the dialect of Deutsch Wagram and its vicinity]. Unpublished PhD thesis. Vienna: University of Vienna.
Pfalz, A. 1913. Deutsche Mundarten IV. Die Mundart des Marchfeldes [German dialects IV. The dialect of the Marchfeld]. Wien: Alfred Hölder.
Scheuringer, H. 1990. Sprachentwicklung in Bayern und Österreich. Beiträge zur Sprachwissenschaft 3 [Language development in Bavaria and Austria. Contributions to linguistics 3]. Hamburg: Buske.
Scheutz, H. 1985. Strukturen der Lautveränderung: Variationslinguistische Studien zur Theorie und Empirie sprachlicher Wandlungsprozesse am Beispiel des Mittelbairischen von Ulrichsberg/Oberösterreich [Structures of sound change: Variationist linguistic studies on the theory and empiricism of processes of language change, exemplified by the Central Bavarian dialect of Ulrichsberg in Upper Austria]. Wien: Braumüller.
Scheutz, H. 1999. Umgangssprache als Ergebnis von Konvergenz- und Divergenzprozessen zwischen Dialekt und Standardsprache [Vernacular as a result of convergent and divergent processes between dialect and standard language]. In: Stehl, T. (ed.): Dialektfunktionen – Dialektgenerationen – Dialektwandel [Functions of dialect – dialect generations – dialect change]. Tübingen: Narr. 105-131.
Soukup, B. 2009. Dialect use as interaction strategy: A sociolinguistic study of contextualization, speech perception, and language attitudes in Austria. Wien: Braumüller.
Unger, J. 2014. Der Nonstandard in Deutsch-Wagram: Unter Berücksichtigung der Orte Aderklaa und Parbasdorf [The non-standard in Deutsch-Wagram: With consideration of the towns Aderklaa and Parbasdorf]. Dissertation: University of Vienna.
WEK = Wiesinger, P. 1962-1969. Ergänzungskarten zum Deutschen Sprachatlas: Nacherhebungen in Süd- und Osteuropa [Supplementary maps for the Deutscher Sprachatlas: Additional surveys in Southern and Eastern Europe]. Marburg: Handgezeichnet.
Wiesinger, P. 1983. Die Einteilung der deutschen Dialekte [The classification of the German dialects]. In: Besch, W., Knoop, U., Putschke, W. & Wiegand, H. E. (eds.): Dialektologie: Ein Handbuch zur deutschen und allgemeinen Dialektforschung. 2. Halbband [Dialectology: A handbook of German and general dialect research. Second half volume]. Berlin, New York: de Gruyter. 807-900.
Wiesinger, P. 2001. Zum Problem der Herkunft des Monophthongs A für Mittelhochdeutsch EI in Teilen des Bairischen [On the problem of the origin of the monophthong A for Middle High German EI in parts of Bavarian]. In: Bentzinger, R., Nübling, D. & Steffens, R. (eds.): Sprachgeschichte, Dialektologie, Onomastik, Volkskunde. Beiträge zum Kolloquium am 3./4. Dezember 1999 an der Johannes Gutenberg-Universität Mainz. Wolfgang Kleiber zum 70. Geburtstag [Language history, dialectology, onomastics, ethnology. Contributions to the colloquium on 3/4 December 1999 at the Johannes Gutenberg University Mainz. For Wolfgang Kleiber's 70th birthday]. Stuttgart: Franz Steiner. 91-126.

DISTRIBUTION OF VOT IN CONVERSATIONAL SPEECH: THE CASE OF AUSTRIAN GERMAN WORD-INITIAL STOPS

Petra Hödl
University of Teacher Education Burgenland
[email protected]

Abstract

Standard Austrian German shows a phonemic contrast between lenis /b d g/ and fortis /p t k/ which is phonetically realized as an aspiration contrast. In word-initial position, however, this aspiration contrast is affected by a post-lexical lenition process (Moosmüller, 1991). The aim of the present paper is to empirically investigate the acoustic consequences of this process. While previous studies on VOT in Austrian German dealt with read speech (Moosmüller & Ringen, 2004), this study analyses VOT in a corpus of conversational speech. The main research question is not whether mean VOT durations differ but whether and how much the distributions of VOT values obtained for /b d g/ and /p t k/ resemble each other. Results showed that the VOT distributions of bilabial and alveolar lenis and fortis stops are characterized by a large amount of overlap while this overlap is much smaller for velar stops. Comparison with VOT distributions of English stops reported in the literature (Stuart-Smith et al., 2015; Nakai & Scobbie, 2016) suggests that the degree of overlap observed in the present data may be language-specific and a manifestation of the lenition process affecting word-initial /p/ and /t/ in Austrian German.

Keywords: VOT, Austrian German, semi-spontaneous speech

1 Introduction
Voice onset time (VOT) is arguably one of the most extensively investigated and reported temporal features of speech. Since its introduction to the phonetic sciences more than 50 years ago (Lisker & Abramson, 1964), hundreds of studies have been dedicated to it (for overviews see Abramson & Whalen, 2017; Cho et al., 2019). Its continuing popularity is reflected by its presence in the proceedings of the 19th International Congress of Phonetic Sciences (ICPhS). In said volume, the terms "VOT" or "voice onset time" occur in the titles of 14 (of 787) papers, and the languages covered range from Japanese (Hwang & Mazuka, 2019) and Finnish (Horslund, 2019) to Michif (Rosen et al., 2019) and Warlpiri (Bundgaard-Nielsen & O'Shannessy, 2019).

Phonetic studies of VOT in Austrian German, however, are sparse, although this specific variety of German represents an interesting case for the investigation of voicing contrasts. Similar to English, the phonemic difference between /b d g/ and /p t k/ in word-initial position is realized as an aspiration contrast rather than a "true" voicing contrast. This means that /b d g/ are produced with a short-lag VOT while /p t k/ are produced with a long-lag VOT. This circumstance has led a number of phonologists to assume a distinctive feature [+/– spread glottis] for languages such as English and German (Jessen, 1998; Beckman et al., 2013; Iverson & Salmons, 1995). Consequently, German is labelled an "aspirating language". Austrian German is also assumed to be an aspirating language (Moosmüller & Ringen, 2004). However, in this German variety, the aspiration contrast is said to be affected by a post-lexical lenition process (see Moosmüller, 1991: 64). This optional process applies to the front stops /p/ and /t/ in word-initial position and renders them less aspirated or even unaspirated (see e.g. Muhr, 2001; Krech et al., 2009: 239).
The aim of the present paper is to investigate how this assumed reduced degree of aspiration manifests itself in conversational speech produced by adult native speakers of Austrian German. The most crucial question is how much the VOT durations of lenis and fortis stops overlap with each other when a large corpus of observations is taken into account. Previous studies on VOT in Austrian German generally observed statistically significant VOT differences between lenis and fortis stops (Grassegger, 1988; Moosmüller & Ringen, 2004). However, these studies dealt with formal (read) speech, while the focus of the present investigation is on conversational speech. By studying VOT production in a semi-spontaneous speech setting, insights may be gained into whether aspiration in Austrian German is restricted to formal speech registers or whether it can also be found in casual communicative situations.

2 Overview of previous studies on German VOT
There are several studies of VOT in various varieties of German, such as Jessen (1998) for Standard German by North German speakers, Kleber (2018) for Standard German by Bavarian and Saxon speakers, Ladd & Schmid (2018) for Swiss German by Zurich speakers, Moosmüller & Ringen (2004) for Standard Austrian German by Viennese speakers, Grassegger (1988) for Standard Austrian German by Styrian speakers and Luef (2020) for Standard Austrian German by Upper Austrian speakers.1 Additionally, there are two unpublished academic theses (Kornder, 2016; Puntschuh, 2018) dealing with the acquisition of non-native voicing contrasts by Austrian learners. These theses also contain small-scale pilot studies of the learners' native pronunciation of Austrian stops. Table 1 below gives an overview of the mean VOT values reported in these studies.

1 For the latter literature reference, I would like to thank Nicola Klingler.

Table 1: Summary of reported mean VOT values in different varieties of German

Study | Speakers | /p/ | /b/ | /t/ | /d/ | /k/ | /g/
Ladd & Schmid (2018) | Swiss German (n = 20) | 14 ms (fortis), 13 ms (lenis), 50 ms (aspirated); not broken down by place of articulation
Jessen (1998) | North German (n = 6) | 63 ms | 15 ms | 75 ms | 21 ms | 83 ms | 26 ms
Kleber (2018) | Saxon (n = 22), old | 64 ms | 20 ms | 69 ms | 28 ms | 80 ms | 38 ms
Kleber (2018) | Saxon (n = 22), young | 69 ms | 18 ms | 68 ms | 26 ms | 77 ms | 32 ms
Kleber (2018) | Bavarian (n = 20), old | 60 ms | 18 ms | 56 ms | 23 ms | 67 ms | 32 ms
Kleber (2018) | Bavarian (n = 20), young | 68 ms | 16 ms | 63 ms | 22 ms | 73 ms | 29 ms
Moosmüller & Ringen (2004) | Viennese (n = 6), high vowel | 52 ms | 13 ms | 61 ms | 26 ms | 100 ms | 31 ms
Moosmüller & Ringen (2004) | Viennese (n = 6), mid vowel | 29 ms | 12 ms | 33 ms | 14 ms | 64 ms | 26 ms
Grassegger (1988) | Styrian (n = 25), reading | 29 ms | 17 ms | 39 ms | 21 ms | 87 ms | 28 ms
Grassegger (1988) | Styrian (n = 25), picture naming | 27 ms | 18 ms | 37 ms | 22 ms | 85 ms | 30 ms
Luef (2020) | Upper Austrian (n = 32), old | 18 ms | 12 ms | 40 ms | 16 ms | 64 ms | 24 ms
Luef (2020) | Upper Austrian (n = 32), young | 34 ms | 4 ms | 52 ms | 9 ms | 68 ms | 13 ms
Puntschuh (2018) | Styrian (n = 5) | 51 ms | 11 ms | 57 ms | 17 ms | 92 ms | 21 ms
Puntschuh (2018) | Styrian (n = 1) | 15 ms | 11 ms | 20 ms | 18 ms | 105 ms | 29 ms
Kornder (2016) | Styrian (n = 5) | 13 ms | 11 ms | 20 ms | 16 ms | 66 ms | 26 ms
Kornder (2016) | Styrian (n = 6) | 12 ms | 9 ms | 7 ms | 12 ms | 85 ms | 25 ms
Kornder (2016) | Styrian (n = 2) | 72 ms | 10 ms | 75 ms | 15 ms | 98 ms | 18 ms
Kornder (2016) | Styrian (n = 2) | 21 ms | 11 ms | 24 ms | 15 ms | 60 ms | 21 ms

With the exception of Puntschuh (2018), Grassegger (1988) and Luef (2020), all studies are exclusively dedicated to read speech. These authors used pictures to elicit their target words, in the case of Grassegger (1988) and Luef (2020) in addition to a reading task. Puntschuh (2018) is the only study that aimed at collecting conversational speech by incorporating an interactive game. However, due to the specific nature of this game, the elicited target words were produced in a highly emphatic way (see Puntschuh, 2018: 149), which explains the high mean VOT values reported in this study. In the investigations of read speech, target words were elicited in isolation or embedded in sentences, usually presented in standard German orthography. Here the exceptions are Ladd & Schmid (2018), who presented their sentences in Swiss German orthography, and Kornder (2016), who used a non-standard orthography.

Crucially, with the exception of Ladd & Schmid (2018) for Swiss German, all studies found differences in mean VOT values between lenis and fortis stops, at least for some speakers. Another observation from the summary in Table 1 is that VOT differences are in all cases reported to be larger in velars than in bilabials or alveolars. In addition to the mean as the most commonly reported measure of central tendency, measures of dispersion, such as the standard deviation, can also be found in most studies. Moosmüller & Ringen (2004), for instance, point out that the variation of VOT values is much higher in fortis than in lenis stops. While measurements such as the mean and the standard deviation are certainly very useful, I argue that more holistic descriptive information about the data, such as visualisations using histograms, can be very insightful. Hence, the present study focuses more on the overall shape of the VOT distributions and less on individual statistical measurements.
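To make this point concrete, overlapping histograms of the kind used later in this paper can be drawn with a few lines of Python and matplotlib. The sketch below is illustrative only; the 5 ms bin width and the toy VOT values are assumptions and are not settings taken from the study.

```python
import matplotlib.pyplot as plt

def plot_vot_histograms(lenis_vot_ms, fortis_vot_ms, bin_width=5):
    """Overlay VOT histograms for lenis and fortis stops (values in ms)."""
    upper = max(max(lenis_vot_ms), max(fortis_vot_ms))
    bins = range(0, int(upper) + bin_width, bin_width)
    plt.hist(fortis_vot_ms, bins=bins, alpha=0.5, label="fortis")
    plt.hist(lenis_vot_ms, bins=bins, alpha=0.5, label="lenis")
    plt.xlabel("VOT (ms)")
    plt.ylabel("Number of observations")
    plt.legend(title="Category")
    plt.show()

# hypothetical measurements, for illustration only
plot_vot_histograms(lenis_vot_ms=[9, 12, 14, 17, 22, 30],
                    fortis_vot_ms=[10, 15, 36, 52, 70, 95])
```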

3 The advantages of (semi-)spontaneous speech for the study of VOT in Austrian German
Large-scale phonetic studies of word-initial VOT in German that take (semi-)spontaneous speech data into account are – at least to my knowledge – virtually absent. Especially when it comes to Austrian German, assumptions about the contrast between word-initial lenis and fortis stops in non-scripted speech are merely based on anecdotal observations and auditory impressions. However, while (semi-)spontaneous speech is challenging at all stages of the phonetic research process, "the increased ability to generalise results from these data to naturally occurring speech makes this challenge worth overcoming" (Baker & Hazan, 2011: 761). I argue that for the research topic of the present study, insights from (semi-)spontaneous speech may be especially fruitful compared to read or, more generally, laboratory speech. One reason for this claim is that the phonological opposition between /b d g/ and /p t k/ is reflected in German graphematics. Words with initial lenis stops are written with the letters <b>, <d> and <g>, while fortis stops are written with

<p>, <t> and <k>, respectively. Hence there is a contrast depicted in written language, and consequently speakers might feel tempted to produce a contrast in their speech production simply because there is one in writing. Therefore, elicitation techniques that avoid direct influence from writing are certainly favorable. Furthermore, it is quite likely that the production of initial fortis stops in stressed, isolated positions (as in a word list) may lead to an overly strong articulation of these sounds. This would be rather detrimental for answering the research question of the current study, since lenition of word-initial /p/ and /t/ is considered to be an optional process in Standard Austrian German, which is more likely to occur in prosodically less prominent positions within connected speech and in casual communicative situations. Analysing Austrian German stops only in contexts that trigger carefully produced utterances in isolation may not do the contrast justice, as the conditions beneficial for lenition processes to take place might be missing. This issue has also been briefly mentioned by Moosmüller et al. (2015: 341, footnote 3), who acknowledge the difficulty of eliciting optional phonological processes in a task of reading from a word list.

4 Methodology
4.1 Material
In order to elicit conversational speech, a picture description task involving the identification of differences was designed. It was based on the idea of the diapix pictures (see e.g. van Engen et al., 2010; Baker & Hazan, 2011). During the task, talkers had to verbally describe pictures to one another in order to find differences between the pictures. This technique was used to elicit specific target words across all participants in a relaxed, communicative and unscripted setting. In total, four DIN A4 picture pairs with a range of different scenes were compiled. These pictures contained drawings of various lexical items with word-initial stops in a stressed position followed by either the vowel /a/, /ɪ/ or /ɔ/. Note that the number of words in each category was not balanced. This weakness of the design was mainly due to the restricted set of items that could be visually depicted in a clear and unambiguous way.
4.2 Talkers
For this study, the results of 33 talkers (16 females and 17 males) are reported. All were native speakers of Austrian German with self-reported normal hearing and neither speech nor language disorders. Their mean age was 28.7 years (range: 21-65 years) and everyone had at least some sort of entrance qualification for higher education. All talkers were born and raised in Styria, which is located in the southeast of Austria. Note that Standard Austrian German typically refers to the standard variety of German spoken by educated Viennese speakers (see Moosmüller, 1991). However, Vienna is located in the Middle Bavarian dialect region – consequently, Standard Austrian German is labelled a "Middle Bavarian variety" (see e.g. Moosmüller & Brandstätter, 2014) – while a considerable part of Austria is not. Styria, for instance, is located in a transition zone of Southern Bavarian and Middle Bavarian dialects (Wiesinger, 1983). Since regional differences are said to affect the phonetic realization of the lenis-fortis distinction (see e.g. Krech et al., 2009: 239), it is worthwhile examining whether the claims made about Standard Austrian German (as investigated in Viennese speakers) also hold for non-dialectal speech produced by educated Styrian speakers.
4.3 Recording procedure
All recordings were made under living-room conditions in a quiet room at the Department of Linguistics at the University of Graz. Two cardioid condenser lavalier microphones and an external audio interface were used to make two-channel recordings at a sampling frequency of 44.1 kHz and a quantization level of 16 bit. The software used for recording was Audacity (version 2.1.2).

During the recordings, which lasted approximately 90 minutes including breaks, participants were seated back to back and instructed to find as many differences in the pictures as they could. One talker (A) was assigned the role of the main describer, while the other talker's (B) task was to identify the differences. No instructions were given to the participants regarding their pronunciation or speech style. For the current study, only the speech productions of talkers (A) were analysed. All sessions were conducted by the same experimenter and participants received 20 Euro each as compensation.
4.4 Data set
In total, the production experiment yielded 1938 minutes (i.e. 32.3 hours) of recordings. From these recordings, VOT measurements of 6585 word-initial stops from items with word-initial stress were analysed. Of these, 2579 positive VOT values were taken from lenis stops (39%) and 4006 from fortis stops (61%). Table 2 shows the number of measurements across the three different places of articulation.

Table 2: Number of tokens for each place of articulation and voicing category

Category | bilabial | alveolar | velar
lenis (n = 2579) | 1065 | 895 | 619
fortis (n = 4006) | 1226 | 1315 | 1465

VOT segmentation was done manually in Praat (version 6.0.28) (Boersma & Weenink, 2017). As demarcation points for the segmentations, the onset of the stop burst on the left and the first indication of voicing in the speech signal on the right were used. For judging the position of the burst and the beginning of voicing, the waveform, as well as the spectrogram of the signal, were considered (see Abramson & Whalen, 2017).
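Once the burst onset and the onset of voicing have been marked, the VOT values themselves reduce to the difference between the two time points. A minimal Python sketch, assuming the manual Praat annotations have been exported to a hypothetical CSV file with columns category, burst_onset_s and voicing_onset_s (these column names are our own and not from the paper):

```python
import csv
from collections import defaultdict

def load_vot_ms(csv_path):
    """Read manually annotated burst and voicing onsets (in seconds) and
    return positive VOT values in milliseconds, grouped by voicing category."""
    vot_by_category = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            vot_ms = (float(row["voicing_onset_s"]) - float(row["burst_onset_s"])) * 1000.0
            if vot_ms > 0:  # only positive VOTs are considered here
                vot_by_category[row["category"]].append(vot_ms)
    return vot_by_category

# e.g. vot = load_vot_ms("vot_annotations.csv"); len(vot["lenis"]), len(vot["fortis"])
```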

5 Results
The following sections present descriptive statistical statements and figures to illustrate the distribution of VOT values in the compiled data. For results from inferential statistical analyses of word-initial VOT in Austrian German, I would like to refer readers to Hödl (2019). The data set analysed there is composed of the same data presented in this paper plus an additional set of stops in unstressed position.
5.1 Distribution of VOT in Austrian German (full data set)
When pooled across all three places of articulation, fortis stops showed a clearly longer VOT duration (x̄ = 42 ms) than lenis stops (x̄ = 17 ms). Fortis stops were also characterized by a much higher variability of VOT values (see Figure 1 below). Histograms depicting the distribution of VOT values for lenis and fortis stops in comparison to each other were even more informative. As can be seen in Figure 2 further below, the histograms of the two categories overlap quite extensively. Within the range of VOTs between 0-25 ms, the distribution of lenis stops was almost entirely superimposed by the distribution of fortis stops.

Figure 1. Boxplots of VOT in word-initial stops pooled across places of articulation (fortis: n = 4006; lenis: n = 2579)

It can also be stated that the mode, i.e. the most frequently occurring VOT value, is almost the same in the two subsets of the data, namely 9 ms for lenis stops and 10 ms for fortis stops. Table 3 below provides further descriptive information of the data set.

Table 3: Descriptive statistical summary of the full VOT data set

Statistic | lenis | fortis
mean | 17 ms | 42 ms
standard deviation | 9 ms | 29 ms
median | 14 ms | 36 ms
mode | 9 ms | 10 ms
min | 1 ms | 2 ms
max | 75 ms | 181 ms
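Summary figures of this kind can be recomputed from the raw measurements with standard tooling. The sketch below is an illustration under assumptions of our own: in particular, it rounds VOT to whole milliseconds before taking the mode, which is one plausible way to obtain modes such as 9 ms or 10 ms from continuous measurements, not necessarily the author's procedure.

```python
from statistics import mean, median, stdev
from collections import Counter

def describe_vot(values_ms):
    """Descriptive summary of a list of VOT measurements in milliseconds."""
    rounded = [round(v) for v in values_ms]            # 1 ms resolution for the mode
    mode_ms, _ = Counter(rounded).most_common(1)[0]
    return {
        "mean": mean(values_ms),
        "standard deviation": stdev(values_ms),
        "median": median(values_ms),
        "mode": mode_ms,
        "min": min(values_ms),
        "max": max(values_ms),
    }

# e.g. describe_vot(vot["lenis"]) with the measurements loaded earlier
```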

Figure 2. Histogram of VOT in word-initial stops pooled across places of articulation (fortis: n = 4006; lenis: n = 2579)

Note that in the full data set, all three places of articulation were included. This lumped together two places of articulation for which deaspiration was expected (i.e., bilabial and alveolar) with one (i.e., velar) for which it was not. This may also be why the fortis VOT distribution looks bimodal. Obviously, this is not very useful for detailed statements about VOT in Austrian German. Hence, for further analysis, each place of articulation was analysed individually.
5.2 VOT distribution for different places of articulation
When the VOT data is analysed and depicted separately for each place of articulation, it becomes evident that velar stops show the largest difference between the VOT values of lenis and fortis stops. As can be seen in Figure 3, for the velars /g/ and /k/, the boxes do not overlap. In contrast, for /b/ and /p/ as well as for /d/ and /t/, they do. What all three places of articulation have in common, though, is that the fortis categories show greater dispersion of VOT values, as well as higher medians (as indicated by the horizontal black lines in the boxes) and maxima than the lenis categories. However, the distance between the lenis and fortis medians is much larger in velars than in bilabials or alveolars.

Figure 3. Boxplots of VOT in bilabials (fortis: n = 1226; lenis: n = 1065), alveolars (fortis: n = 1315; lenis: n = 895) and velars (fortis: n = 1465; lenis: n = 619)

The individual histograms in Figure 4 show even more clearly that the distributions of VOT values for /b/ and /p/, as well as for /d/ and /t/, overlap to a very large extent, while for /g/ and /k/ two distinct distributions were noted. Note that even for the velars there is a bit of overlap between the VOT values of lenis and fortis stops, but it is far less than for the other two places of articulation. While the modes for /b/ and /p/ are 9 ms and 10 ms, respectively, and for /d/ and /t/ 12 ms and 16 ms, the modes for the velars lie much further apart from each other: 24 ms for /g/ and 65 ms for /k/. Also, the means and medians differ to a greater extent in velars than in the other two places of articulation. For further descriptive statistical information, see Table 4 below.


Figure 4. Histograms of VOT (ms) in word-initial stops, depicted separately for each place of articulation: (a) bilabial, (b) alveolar, (c) velar

Table 4: Descriptive statistical summary of VOT, given separately for each place of articulation

                        bilabials            alveolars            velars
                      lenis    fortis      lenis    fortis      lenis    fortis
mean                  11 ms    24 ms       18 ms    30 ms       24 ms    66 ms
standard deviation     6 ms    23 ms        9 ms    21 ms        9 ms    21 ms
median                10 ms    15 ms       15 ms    23 ms       23 ms    64 ms
mode(s)                9 ms    9/10 ms     12 ms    16 ms       24 ms    62/65 ms
min                    1 ms     2 ms        3 ms     4 ms        4 ms    10 ms
max                   70 ms   178 ms       75 ms   143 ms       63 ms   181 ms

A long-lag VOT can be caused by various phonetic events, aspiration being only one of them. In fact, VOT is only an indirect measure of aspiration (see e.g. Keating, 1984) and longer VOT durations may easily be caused by other factors such as affrication. In order to disentangle the contributions of aspiration and affrication to the longer VOT durations in Austrian German velars, spectral analyses would be necessary.2 For now, one can only assume that affrication could play a role in certain cases, in particular when VOT durations are exceptionally long. This might be especially true for velar stops preceding front vowels. For them, Moosmüller & Ringen (2004) have observed a fair amount of affrication in their data as well.

2 I am grateful to an anonymous reviewer for bringing this issue to my attention.

6 Discussion
An obvious question that may arise from these results is whether these distributions and the overlap between lenis and fortis categories are specific to Austrian German or whether one can find this large amount of overlap in VOTs in other aspiration languages as well. It boils down to the following question: Is the observed overlap caused by the tendency for deaspiration of fortis stops in Austrian German, or is it a general feature of (semi-)spontaneous speech? Evidence that the observed VOT distributions may indeed reflect a language-specific factor comes from a comparison with VOT in spontaneous Scottish English, which has been investigated by Stuart-Smith et al. (2015). In their study, the authors analysed word-initial stops taken from a spontaneous speech corpus of Glaswegian. Like Standard Southern British English, Scottish English has a contrast between lenis and fortis stops. However, fortis stops are reported to be produced with less aspiration than in other varieties of British English (see e.g. Wells, 1982: 74, 112). This feature makes Scottish English in some ways similar to Standard Austrian German. However, when we compare the Austrian German data with the Scottish English data by Stuart-Smith et al. (2015), the differences between the two data sets become evident. While the Austrian data is characterized by a remarkable overlap between lenis and fortis VOT values, the Scottish data showed a clear distinction between the two categories (see Figure 5 for a comparison of the two data sets; the plot on the left-hand side is reprinted from Stuart-Smith et al., 2015: 519).

Figure 5. Comparison between Scottish English ((a) Glaswegian, left; graph taken from Stuart-Smith et al., 2015: 519) and Austrian German ((b), right). The Scottish data contains stressed syllable-initial lenis (n = 4088; red) and fortis (n = 3247; turquoise) stops from 23 female speakers from a Glaswegian spontaneous speech corpus. The Austrian data contains word-initial lenis (n = 2579; orange) and fortis (n = 4006; blue) stops from 33 speakers.

As Stuart-Smith et al. (2015) state, the Glaswegian VOT distribution for voiced and voiceless stops “clearly shows that the voicing contrast is maintained through positive VOT for these speakers” (Stuart-Smith et al., 2015: 522). Another study on spontaneous English VOT by Nakai & Scobbie (2016) examined a corpus of BBC interviews conducted by 10 native speakers of various English varieties (American, British and Australian). In their paper, they depict histograms of VOT values for bilabials, alveolars and velars separately (Nakai & Scobbie, 2016). Again, there is some overlap between the two voicing categories. On the whole, however, the two categories show distinct VOT distributions. Crucially, and unlike the Austrian German data, this is true for all three places of articulation – also for /b/ vs. /p/ and /d/ vs. /t/. To sum up, the comparison of the Austrian VOT distribution with the distributions reported by Stuart-Smith et al. (2015) and Nakai & Scobbie (2016) for English indicates that the large overlap of VOT values for /b/ and /p/ and /d/ and /t/ in the Austrian data may indeed be language-specific – and likely to be caused by the optional process of lenition postulated in the literature (Moosmüller, 1991). At least, it seems safe to conclude that the degree of overlap observed in the presented data is not a general feature of VOT in (semi-)spontaneous speech.

References
Abramson, A. S. & Whalen, D. H. 2017. Voice Onset Time (VOT) at 50. Theoretical and practical issues in measuring voicing distinctions. Journal of Phonetics, 63, 75-86.
Audacity Team. Audacity (version 2.1.2). Available from: https://audacityteam.org/.
Baker, R. & Hazan, V. 2011. DiapixUK. Task materials for the elicitation of multiple spontaneous speech dialogs. Behavior Research Methods, 43, 761-770.
Beckman, J., Jessen, M., & Ringen, C. 2013. Empirical evidence for laryngeal features. Aspirating vs. true voice languages. Journal of Linguistics, 49, 259-284.
Boersma, P. & Weenink, D. 2017. Praat. Doing phonetics by computer (version 6.0.28). Available from: http://www.praat.org/.
Bundgaard-Nielsen, R. & O’Shannessy, C. 2019. Voice onset time and constriction duration in Warlpiri stops (Australia). In: Calhoun, S., Escudero, P., Tabain, M., & Warren, P. (eds.): Proceedings of the 19th International Congress of Phonetic Sciences. Melbourne 2019. 3612-3616.
Cho, T., Whalen, D. H., & Docherty, G. 2019. Voice onset time and beyond. Exploring laryngeal contrast in 19 languages. Journal of Phonetics, 72, 52-65.
Grassegger, H. 1988. Signalphonetische Untersuchungen zur Differenzierung italienischer Plosive durch österreichische Sprecher. Forum Phoneticum, 40. Hamburg: Buske.
Hödl, P. 2019. Production and perception of voice onset time in Austrian German. Unpublished doctoral thesis. Graz: University of Graz. Available from: https://unipub.uni-graz.at/download/pdf/4795325
Horslund, C. S. 2019. VOT in loanwords in Finnish. Evidence for prevoicing of initial /b, d, g/. In: Calhoun, S., Escudero, P., Tabain, M., & Warren, P. (eds.): Proceedings of the 19th International Congress of Phonetic Sciences. Melbourne 2019. 1605-1609.
Hwang, H. K. & Mazuka, R. 2019. Shift of voice onset time and enhancement in Japanese infant-directed speech. In: Calhoun, S., Escudero, P., Tabain, M., & Warren, P. (eds.): Proceedings of the 19th International Congress of Phonetic Sciences. Melbourne 2019. 3255-3259.
Iverson, G. K. & Salmons, J. C. 1995. Aspiration and laryngeal representation in Germanic. Phonology, 12, 369-396.
Jessen, M. 1998. Phonetics and phonology of tense and lax obstruents in German. Studies in functional and structural linguistics, 44. Amsterdam, Philadelphia: Benjamins.
Keating, P. A. 1984. Phonetic and phonological representation of stop consonant voicing. Language, 60, 286-319.
Kleber, F. 2018. VOT or quantity. What matters more for the voicing contrast in German regional varieties? Results from apparent-time analyses. Journal of Phonetics, 71, 468-486.
Kornder, L. 2016. Cross-linguistic study of voicing in learners of English as a foreign language. Unpublished master thesis. Graz: University of Graz.
Krech, E.-M., Stock, E., Hirschfeld, U., & Anders, L. C. (eds.). 2009. Deutsches Aussprachewörterbuch. Mit Beiträgen von Walter Haas, Ingrid Hove und Peter Wiesinger. Berlin, New York: de Gruyter.
Ladd, R. D. & Schmid, S. 2018. Obstruent voicing effects on F0, but without voicing. Phonetic correlates of Swiss German lenis, fortis, and aspirated stops. Journal of Phonetics, 71, 229-248.
Lisker, L. & Abramson, A. S. 1964. A cross-language study of voicing in initial stops. Acoustical measurements. Word, 20(3), 384-422.
Luef, E. M. 2020. Development of voice onset time in an ongoing phonetic differentiation in Austrian German plosives. Reversing a near-merger. Zeitschrift für Sprachwissenschaft, ahead of print, 1-23.
Moosmüller, S. 1991. Hochsprache und Dialekt in Österreich. Soziophonologische Untersuchungen zu ihrer Abgrenzung in Wien, Graz, Salzburg und Innsbruck. Sprachwissenschaftliche Reihe, 1. Wien, Köln, Weimar: Böhlau.
Moosmüller, S. & Brandstätter, J. 2014. Phonotactic information in the temporal organization of Standard Austrian German and the Viennese dialect. Language Sciences, 46, 84-95.
Moosmüller, S. & Ringen, C. 2004. Voice and aspiration in Austrian German plosives. Folia Linguistica, 38, 43-62.
Moosmüller, S., Schmid, C., & Brandstätter, J. 2015. Standard Austrian German. Journal of the International Phonetic Association, 45(3), 339-348.
Muhr, R. 2001. Varietäten des Österreichischen Deutsch. Revue belge de philologie et d'histoire, 79, 779-803.
Nakai, S. & Scobbie, J. M. 2016. The VOT category boundary in word-initial stops. Counter-evidence against rate normalization in English spontaneous speech. Laboratory Phonology, 7, 1-31.
Puntschuh, S. 2018. Voice Onset Time in Plosiven des L2-Spanischen bei österreichischen Studierenden. Unpublished doctoral thesis. Graz: University of Graz.
Rosen, N., Stewart, J., Pesch-Johnson, M., & Sammons, O. 2019. Michif VOT. In: Calhoun, S., Escudero, P., Tabain, M., & Warren, P. (eds.): Proceedings of the 19th International Congress of Phonetic Sciences. Melbourne. 1372-1376.
Stuart-Smith, J., Sonderegger, M., Rathcke, T., & Macdonald, R. 2015. The private life of stops. VOT in a real-time corpus of spontaneous Glaswegian. Laboratory Phonology, 6, 505-549.
van Engen, K. J., Baese-Berk, M., Baker, R. E., Choi, A., Kim, M., & Bradlow, A. R. 2010. The Wildcat corpus of native- and foreign-accented English. Communicative efficiency across conversational dyads with varying language alignment profiles. Language & Speech, 53, 510-540.
Wells, J. C. 1982. Accents of English (I). An introduction. Cambridge, New York, Melbourne: Cambridge University Press.
Wiesinger, P. 1983. Die Einteilung der deutschen Dialekte. In: Besch, W., Knoop, U., Putschke, W., & Wiegand, H. E. (eds.): Dialektologie. Ein Handbuch zur deutschen und allgemeinen Dialektforschung, Vol. 2. Handbücher zur Sprach- und Kommunikationswissenschaft, 1.2. Berlin, New York: de Gruyter. 807-900.

F0 CONTOURS OF IRONIC AND LITERAL UTTERANCES

Hannah Leykum Acoustics Research Institute, Austrian Academy of Sciences [email protected]

Abstract

In everyday communication, most speakers apply disambiguating cues to highlight ironic intent of utterances. These cues can be verbal, paraverbal or non-verbal. With respect to paraverbal cues, fundamental frequency (F0) is one of the commonly used parameters. The present study investigates several F0 parameters of ironic and literal utterances in Standard Austrian German: mean F0, SD of F0, minimal and maximal F0 values, F0 range, and F0 contour. A comparison of F0 contours in both types of utterances is the focus of this paper. Data recordings were made of 20 speakers of Standard Austrian German as they produced short utterances in an ironic and in a literal manner (e.g. “Sehr gut!” ‘very good’ or “Super!” ‘super’). The analysis of the recordings showed that differences exist in the F0 contours between ironic and literal utterances. The F0 contour of ironic utterances is lower and flatter than the F0 contour of literal utterances. The contour fell slightly over the course of ironic utterances. Moreover, a lower average F0, a lower standard deviation (SD) of F0, and a smaller F0 range were found for ironic utterances as compared to literal utterances.

Keywords: irony, verbal irony, F0 contour, Standard Austrian German

1 Introduction
Verbal irony is a regularly used figure of speech in interpersonal communication. In order to increase the chance that listeners understand the irony of an utterance, many speakers highlight their ironic intent by using verbal, non-verbal, and/or paraverbal cues. With regard to paraverbal cues, it has been shown that mainly fundamental frequency (F0), intensity and durational cues are used to mark irony (e.g. Attardo et al., 2003; Cheang & Pell, 2009; Chen & Boves, 2018; Kreuz & Roberts, 1995; Laval & Bert-Erboul, 2005; Lœvenbruck et al., 2013; Nauke & Braun, 2011; Rockwell, 2000; Scharrer & Christmann, 2011).
Concerning F0 cues of irony, the parameters mean F0 and its standard deviation (SD) have commonly been investigated. Here, some language-specific differences seem to exist. In studies on irony in French, Italian, and Cantonese, F0 and SD of F0 were found to be higher in ironic utterances as compared to literal utterances (Anolli et al., 2000; Cheang & Pell, 2009; Lœvenbruck et al., 2013). In contrast, studies on English and on German in Germany reported that mean F0 is lower in ironic utterances compared to literal utterances (Chen & Boves, 2018; Niebuhr, 2014; Rockwell, 2000; Scharrer & Christmann, 2011; Schmiedel, 2017). Likewise, SD of F0 and F0 range are lower in ironic realisations of utterances as compared to literal realisations (Schmiedel, 2017) and neutral realisations (Niebuhr, 2014).
Only a few studies have considered differences between F0 contours of ironic and literal utterances: In a study on German in Germany, the F0 contours of ironic realisations did not differ from F0 contours of literal realisations of the same utterances in a quantitative analysis (Scharrer & Christmann, 2011). A qualitative analysis of the same data showed minor but significant differences between ironic and literal utterances: F0 contours of literal utterances were characterised as being mostly flat, whereas F0 contours of ironic utterances were found to be either rising or falling (Scharrer & Christmann, 2011). An investigation of utterances in American situation comedies revealed flat intonation contours in ironic utterances (Attardo et al., 2003). Likewise, data from a production experiment with speakers of British English showed that sarcastic utterances were realised with a flatter F0 contour (Chen & Boves, 2018). By contrast, for French a rising contour in ironic utterances was reported (Laval & Bert-Erboul, 2005). These results point to a language-dependent use of F0 contour to mark irony.
The present study analyses whether speakers of Standard Austrian German (SAG) use different F0 contours for literal and ironic realisations of short utterances. In addition, other F0 parameters were investigated. Because studies on German in Germany found either no differences or only minor differences, it is hypothesised that, in SAG, no major F0 contour differences exist between ironic and literal realisations of short utterances. Following the results of former studies on German in Germany, ironic utterances in SAG are expected to be realised with a lower average F0, a lower SD of F0, and a smaller F0 range.

2 Methods
2.1 Participants
Recordings of 20 speakers of Standard Austrian German (as defined by Moosmüller, 1991; Moosmüller et al., 2015) were conducted in a semi-anechoic sound booth. All speakers were born and raised in Vienna and had a high educational level (university degree or currently enrolled at university). In addition, at least one of their parents was also from Vienna and had a high educational level. The speakers were balanced for gender and age group (younger speakers: 23-32 years; older speakers: 48-60 years).
2.2 Material and recordings
In order to investigate the hypotheses, ten short utterances that have a positive meaning when realised in a literal manner were chosen. The utterances consisted of either two monosyllabic words or one disyllabic word. For all utterances, the stress was on the first syllable. In a pre-test (Leykum, 2020), items were rated for their predominant manner of use (ironic or literal) and for frequency of use by Austrian speakers. Only items that are used in both manners of realisation and are rated as being used regularly by Austrian speakers were chosen for the present study. The chosen utterances were: “Danke!” ‘thanks’, “Glückwunsch!” ‘congratulations’, “Herrlich!” ‘lovely’, “Köstlich!” ‘delicious’, “Sehr gut!” ‘very good’, “Sehr nett!” ‘very nice’, “Sehr schön!” ‘very beautiful’, “Spannend” ‘exciting’, “Super!” ‘super’, and “Wahnsinn” ‘incredible’.
In order to elicit the utterances in a natural way, short scenarios were written (see the example in Table 1). The scenarios led the speakers to either a literal interpretation of the utterance or an ironic interpretation. The speakers were instructed to read the scenarios and to respond to each scenario with the given utterance in an appropriate way (hereafter: scenario-condition). In addition, in a second round, the speakers were explicitly instructed to produce the items in either an ironic or a literal manner (hereafter: explicit-condition). The speakers had to produce all stimuli twice (once in each condition) in order to obtain authentic reactions to the scenarios (scenario-condition) on the one hand, and, on the other hand, to ensure that the speakers interpreted the stimuli correctly as being ironic or not (explicit-condition). However, in the explicit-condition, an exaggerated use of paraverbal irony cues was likely to occur. Together with several additional items, the utterances were arranged in a semi-random order such that consecutive instances of the same item did not occur. Moreover, it was checked that no more than four realisations of the same kind, ironic or literal, followed each other. The speakers had to realise the utterances first in the scenario-condition and thereafter in the explicit-condition.

Table 1: Scenarios to elicit “Danke!” ‘thanks’ in an ironic and a literal manner.

ironic:
Person A gibt Person B einen dreckigen Putzfetzen und sagt: „Ich habe dir was mitgebracht.“ – „Danke!“
(Person A gives person B a dirty cleaning cloth and says: “I’ve brought you something.” – “Thanks!”)

literal:
Person A kommt lächelnd zur Tür rein und sagt: „Ich habe dir Blumen mitgebracht!“ – „Danke!“
(Person A smiles while entering the room and says: “I’ve brought you some flowers!” – “Thanks!”)

The recordings were conducted in a sound booth (IAC-1202A) with a cardioid microphone (AKG C451 EB). In addition, for further analyses, electroglottographic recordings (for a subset of the speakers) and video recordings were conducted.

2.3 Analyses
The 10 utterances realised by 20 speakers twice in an ironic manner and twice in a literal manner (800 utterances) were analysed using the sound analysis software STx (Noll et al., 2019; Noll et al., 2007). The automatic F0 calculation of STx was used to obtain F0 measurements. However, the F0 values of each utterance were checked for plausibility and, when necessary, F0 values were manually corrected. Due to the fact that the utterances differ in duration (literal: mean = 0.675 sec, SD = 0.172 sec; ironic: mean = 0.755 sec, SD = 0.214 sec), the data were normalised to obtain comparability of the F0 contours. To this end, each utterance was automatically divided into 20 parts of equal length. For each part, F0 values were extracted (whenever the part was voiced) and normalised by converting the time scale into percentage values. In addition, the SD of F0 (in semitones), F0 range, mean intensity and utterance duration were measured and included in the statistical models.
The statistical analyses were conducted with R (R Core Team, 2015) by fitting a Generalised Additive Mixed Model (GAMM) using the R package gamm4 (Wood & Scheipl, 2017). The F0 values were used as dependent variables. In addition, for checking the other F0 parameters (mean F0, SD of F0, minimal and maximal F0 value within each utterance, and F0 range) for significant differences between ironic and literal realisations of the utterances, Linear Mixed Effects Models (lmers) using the package lme4 (Bates et al., 2015) were fitted. In both the GAMM and the lmers, “speaker” and “utterance” were included as random factors. The fitting of the models followed a forward approach: Independent variables were added one by one. When a variable or an interaction of variables resulted in a significant effect or a tendency for an effect (p < 0.1), it was kept in the model. When it did not show any effect, it was excluded from the model. When necessary, Tukey post-hoc tests with p-value adjustment were carried out using the lsmeans package (Lenth, 2016).
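To make the contour modelling concrete, the following is a minimal sketch of how such a GAMM could be fitted with gamm4. It is not the original analysis script: the data frame `f0_long` and all of its column names are hypothetical (one row per F0 sample), and the 20-part binning described above is simplified here to a continuous percentage time axis.

```r
# Minimal sketch of the contour model (not the original analysis script).
# Assumed (hypothetical) columns of `f0_long`: f0 (Hz), time and duration
# (sec), realisation ("ironic"/"literal"), speaker, utterance.
library(gamm4)  # GAMMs with lme4-style random effects

# Time normalisation: each sample's position as a percentage of the utterance
# duration, so that contours of different lengths become comparable.
f0_long$time_pct <- 100 * f0_long$time / f0_long$duration
f0_long$realisation <- factor(f0_long$realisation)

# One smooth over normalised time per manner of realisation, with speaker and
# utterance as random intercepts; further covariates (gender, task, SD of F0)
# would be added one by one, following the forward approach described above.
m_contour <- gamm4(f0 ~ realisation + s(time_pct, by = realisation),
                   random = ~ (1 | speaker) + (1 | utterance),
                   data = f0_long)
summary(m_contour$gam)  # parametric terms and one smooth per realisation
```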

3 Results
3.1 F0 contour
Concerning the F0 contours, the fitted GAMM revealed a significant interaction between manner of realisation and gender (t = −10.637, p < 0.001) and a significant interaction between manner of realisation and SD of F0 (t = 12.116, p < 0.001). Moreover, main effects of word intensity (t = 15.722, p < 0.001) and task (t = −6.268, p < 0.001) were noted. The smooth terms for both ironic and literal utterances were significant (ironic: edf = 3.983, F = 64.25, p < 0.001; literal: edf = 6.166, F = 216.76, p < 0.001). Predictions of the GAMM are visualised in Figure 1. Neither age group of the speakers (t = 0.287, p = 0.774) nor utterance duration in milliseconds (t = −0.834, p = 0.404) had a significant effect on the F0 contour.

Figure 1. F0 contours of ironic and literal utterances (predictions of the GAMM model).

As can be seen in Figure 1, the F0 contour of ironic utterances was mainly flat with a slight F0 decrease during the utterance. In contrast, the F0 contour of literal utterances was rising at the beginning of the utterance and falling thereafter. The F0 movement was much larger in the literal utterances. In addition, the F0 contour of literal utterances was above the F0 contour of ironic utterances. These findings were confirmed by the statistical analyses of the other F0 parameters below. Additionally, a visual investigation of all individual F0 contours confirmed the results of the statistical analysis. Yet, some F0 contours differed from the general pattern. A few ironic utterances had a rising F0 contour and some F0 contours of literal utterances were nearly as flat as the F0 contours of the corresponding ironic realisations of the utterances. Especially for the utterance “Herrlich” ‘lovely’, most speakers did not use different F0 contours for the ironic and the literal realisations of the utterances.
3.2 Mean F0
Linear mixed effects models (lmers) were used to analyse F0 differences between ironic and literal utterances. Concerning mean F0, the fitted model revealed a significant interaction between manner of realisation and gender (F(1,771) = 26.995, p < 0.001), a significant effect of task (t(771) = 2.757, p = 0.006), and a tendency for an effect of age group (t(20) = 1.860, p = 0.078). Post-hoc tests for the interaction between manner of realisation and gender showed that all comparisons were significant (see Table 2). For both male and female speakers, mean F0 was significantly higher in the literal realisations of the utterances (Figure 2). This effect was larger for female speakers.

Table 2: Post-hoc comparisons: Interaction between manner of realisation and gender.

Contrast                        df     t-value    p-value
female: ironic vs. literal     774     -13.289    <0.001
male: ironic vs. literal       774      -5.956    <0.001
ironic: male vs. female         25       9.294    <0.001
literal: male vs. female        25      11.480    <0.001
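The contrasts in Table 2 correspond to Tukey-adjusted pairwise comparisons on the fitted lmer. As a hedged illustration (not the original analysis script; the data frame `f0_utt` and its columns, one row per utterance, are hypothetical), such a model and its post-hoc tests could be set up as follows:

```r
# Minimal sketch of the mean-F0 model and Tukey post-hoc tests (not the
# original analysis script). Assumed (hypothetical) columns of `f0_utt`:
# f0_mean (Hz), realisation, gender, task, age_group, speaker, utterance.
library(lme4)     # linear mixed effects models
library(lsmeans)  # least-squares means and pairwise comparisons

m_mean <- lmer(f0_mean ~ realisation * gender + task + age_group +
                 (1 | speaker) + (1 | utterance),
               data = f0_utt)

# Ironic vs. literal within each gender (first two rows of Table 2).
pairs(lsmeans(m_mean, ~ realisation | gender))
# Male vs. female within each manner of realisation (last two rows of Table 2).
pairs(lsmeans(m_mean, ~ gender | realisation))
```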

Figure 2. Fitted mean F0 values (in Hz): Interaction between gender and manner of realisation.

With respect to the main effect of task, the F0 values were lower in the explicit-condition as compared to the scenario-condition. Concerning the effect of age group, the analyses showed lower F0 values in the older age group as compared to the younger speakers.
3.3 Standard deviation of F0
With regard to the standard deviation (SD) of F0 (converted to semitones), only a main effect of manner of realisation occurred (t(771) = 8.348, p < 0.001). Neither task (t(771) = −0.280, p = 0.780), nor age group (t(20) = −0.112, p = 0.912), nor gender (t(20) = −0.290, p = 0.775) had a significant effect on the SD of F0. SD of F0 values were lower in ironic realisations of the utterances compared to literal realisations (Figure 3).

Figure 3. Fitted standard deviation of F0 (in semitones): Main effect of manner of realisation.

3.4 Minimal F0 value
A mixed effects model revealed a significant three-way interaction between manner of realisation, age group and task (F(1,771) = 6.378, p = 0.012) for the minimal F0 values within each utterance. Post-hoc tests showed that, of the relevant comparisons, only for the older speakers in the explicit-condition was the difference between ironic and literal utterances significant (t(778) = −3.411, p = 0.016). Here, the minimal F0 values of ironic utterances were lower when compared to the minimal F0 values of literal utterances. All other relevant pairwise comparisons were not significant. Moreover, an interaction between manner of realisation and gender was found (F(1,771) = 4.308, p = 0.038). Post-hoc tests revealed that ironic and literal utterances differed significantly only for the female participants (t(778) = −3.893, p < 0.001), but not for the male participants (t(778) = −0.971, p = 0.766). For female speakers, the minimal F0 values of ironic utterances were lower than the minimal F0 values of literal utterances.
3.5 Maximal F0 value
Concerning the maximal F0 value within each utterance, the statistical analysis showed a significant interaction between manner of realisation and gender (F(1,771) = 23.637, p < 0.001), a main effect of task (t(771) = 2.669, p = 0.008), and a tendency for an effect of age group (t(20) = 1.778, p = 0.091). The subsequent post-hoc tests revealed a lower maximal F0 value in ironic utterances as compared to literal utterances for both female speakers (t(774) = −12.731, p < 0.001) and male speakers (t(774) = −5.868, p < 0.001). This effect was smaller in the male speakers than in the female speakers.

With regard to the main effect of task, the maximal F0 value was higher in the scenario-condition when compared to the explicit-condition. Concerning the tendency for an effect of age group, the results showed that young speakers had slightly higher maximal F0 values than older speakers.
3.6 F0 range
The F0 range (in semitones) was found to have a significant interaction between manner of realisation, age group and task (F(1,771) = 4.446, p = 0.035). No effect of gender (t(20) = −0.754, p = 0.460) occurred. The post-hoc tests revealed significant effects of manner of realisation for the older age group in the scenario-condition (t(777) = −5.561, p < 0.001), and for the younger age group both in the explicit-condition (t(777) = −4.100, p = 0.001) and in the scenario-condition (t(777) = −3.243, p = 0.027). For the older age group in the explicit-condition, no significant difference between ironic and literal utterances was found (t(777) = −2.217, p = 0.343). For all three significant post-hoc comparisons, the ironic realisations of the utterances had a smaller F0 range than the literal realisations (Figure 4).

Figure 4. Fitted F0 range (in semitones): Interaction between task, age group and manner of realisation (o = older age group; y = younger age group).

4 Discussion
It was hypothesised that, in SAG, no major F0 contour differences would exist between ironic and literal realisations of literally positive short utterances. Moreover, ironic utterances in SAG were expected to be realised with a lower average F0, a lower SD of F0, and a smaller F0 range.
The hypotheses concerning the global F0 measurements (mean F0, SD of F0, and F0 range) were confirmed in the present study. In Standard Austrian German, the mean F0 of ironic utterances was lower compared to literal realisations of the same utterances. Moreover, the SD of F0 was lower and the F0 range was smaller in ironic utterances. The gender*realisation interaction for the mean F0 showed a smaller effect for male speakers compared to female speakers. This is explainable by the use of the Hertz scale; a conversion to semitones would probably have resulted in a comparable effect size in male and female speakers (see the short numerical illustration below). The same holds for the minimal and maximal F0 values within each utterance. The fact that both the minimal and the maximal F0 value within each utterance are, at least for female speakers, lower in ironic utterances points to an F0 contour which is located in lower frequency regions for ironic utterances compared to literal utterances. This was also confirmed by lower mean F0 values of ironic utterances and by the F0 contours of both types of utterances.
For both the mean F0 and the maximum F0 value, a main effect of task revealed higher values in the scenario-condition compared to the explicit-condition. Intuitively, one would expect more extreme values when the speaker is explicitly asked to produce ironic and literal realisations of the utterances. However, the values were lower for both ironic and literal utterances. Thus, a probable explanation is an effect of task order. The participants first produced the realisations in the scenario-condition and then in the explicit-condition. Due to fatigue, the second realisation of the utterance may be slightly lower in frequency than its realisation in the first round. This could also explain why the F0 range did not differ between ironic and literal realisations in the explicit-condition, but did in the scenario-condition for the group of older participants.
Turning to the F0 contours of ironic and literal utterances, the hypothesis was that the contours would not differ from each other. However, statistical analysis, as well as visual investigation of the individual F0 contours, did show significant differences between F0 contours of ironic and literal utterances. The ironic utterances were mostly flat and slightly falling over the course of the utterance, whereas literal utterances were rising at the beginning of the utterance and falling thereafter for the remaining part of the utterance. Since only disyllabic utterances, which were stressed on the first syllable, were chosen for the analyses, it is not surprising that the F0 peak was in the first half of the utterances. The statistical analysis of the F0 contours revealed an interaction between manner of realisation and gender. As mentioned above, the effect of gender emerged due to the fact that values on the Hertz scale were compared. The main effect of task results from lower F0 contours in the explicit-condition than in the scenario-condition. The main effect of word intensity was explained by a generally higher F0 when speaking louder.
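As a purely numerical illustration of the Hertz-versus-semitone point made above (the F0 values below are invented for the example and are not measurements from the present study), the same proportional F0 lowering yields a larger difference in Hz for a typical female speaker than for a typical male speaker, but the same difference in semitones for both:

```r
# Illustration only: hypothetical mean F0 values, not data from this study.
# The ironic value is 10% lower than the literal value in both speaker groups.
hz_to_semitones <- function(f, ref = 100) 12 * log2(f / ref)  # semitones re 100 Hz

female <- c(literal = 210, ironic = 0.9 * 210)   # Hz
male   <- c(literal = 120, ironic = 0.9 * 120)   # Hz

female["ironic"] - female["literal"]   # -21 Hz
male["ironic"]   - male["literal"]     # -12 Hz: smaller effect on the Hertz scale

hz_to_semitones(female["ironic"]) - hz_to_semitones(female["literal"])  # about -1.8 st
hz_to_semitones(male["ironic"])   - hz_to_semitones(male["literal"])    # about -1.8 st (same size)
```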
When comparing the present results with earlier studies investigating F0 contours in ironic and literal utterances, the findings of the present study do not confirm the results for German in Germany (Scharrer & Christmann, 2011). However, the results are in accordance with studies on American English (Attardo et al., 2003) and British English (Chen & Boves, 2018) revealing flat F0 contours in ironic utterances. The differences in the results on F0 contours between the present study and the study on German in Germany (Scharrer & Christmann, 2011) could have occurred due to methodological differences: the present study investigated short but complete utterances, whereas in the study of Scharrer & Christmann (2011) only one monosyllabic target word within each sentence was investigated. The divergence of the present results from the findings on F0 contour differences in French (Laval & Bert-Erboul, 2005) is in accordance with other language-specific differences in the use of F0 parameters to mark irony: In French (Laval & Bert-Erboul, 2005; Lœvenbruck et al., 2013) as well as in Italian (Anolli et al., 2000), Cantonese (Cheang & Pell, 2009) and Japanese (Adachi, 1996), higher mean F0 values were found in ironic utterances compared to literal utterances. Most of these studies also revealed a larger F0 range or SD of F0 for ironic utterances in the aforementioned languages.

5 Conclusion and outlook
Most studies on verbal irony have investigated differences in the mean F0, SD of F0, and F0 range between ironic and literal utterances. However, only a few studies have considered differences in F0 contours. These studies reported mixed results, with either no differences between ironic and literal utterances, a flat F0 contour in ironic utterances, or a rising F0 contour in ironic utterances. The analyses of the present study show that, in Standard Austrian German, not only do mean F0 and SD of F0 differ between ironic and literal utterances, but the F0 contours also have different shapes: Literal utterances are characterised by a rising F0 at the beginning of the utterance followed by a decrease of F0 towards the end of the utterance. In contrast, in ironic utterances, F0 showed a flat contour decreasing slightly over the whole utterance. These results underline the importance of analysing F0 movement within utterances when investigating the acoustic characteristics of verbal irony.
The importance of F0 parameters for the recognition of irony will be analysed in subsequent perception experiments. With the help of regression analyses, the performance of listeners in combination with the acoustic parameters of the utterances will provide insights into the relevant cues for irony perception in Standard Austrian German. Additional acoustic analyses of a subset of the data can be found in Leykum (2019).

References
Adachi, T. 1996. Sarcasm in Japanese. Studies in Language, 20(1), 1-36.
Anolli, L., Ciceri, R., & Infantino, M. G. 2000. Irony as a game of implicitness: Acoustic profiles of ironic communication. Journal of Psycholinguistic Research, 29(3), 275-311.
Attardo, S., Eisterhold, J., Hay, J., & Poggi, I. 2003. Multimodal markers of irony and sarcasm. Humor - International Journal of Humor Research, 16(2), 243-260.
Bates, D., Mächler, M., Bolker, B., & Walker, S. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48.
Cheang, H. S. & Pell, M. D. 2009. Acoustic markers of sarcasm in Cantonese and English. The Journal of the Acoustical Society of America, 126(3), 1394-1405.
Chen, A. & Boves, L. 2018. What's in a word: Sounding sarcastic in British English. Journal of the International Phonetic Association, 48(1), 57-76.
Kreuz, R. J. & Roberts, R. M. 1995. Two cues for verbal irony: Hyperbole and the ironic tone of voice. Metaphor and Symbolic Activity, 10(1), 21-31.
Laval, V. & Bert-Erboul, A. 2005. French-speaking children’s understanding of sarcasm: The role of intonation and context. Journal of Speech, Language, and Hearing Research, 48, 610-620.
Lenth, R. V. 2016. Least-squares means: The R package lsmeans. Journal of Statistical Software, 69(1).
Leykum, H. 2019. Acoustic characteristics of verbal irony in Standard Austrian German. Proceedings of the 19th International Congress of Phonetic Sciences (ICPhS). 3398-3402.
Leykum, H. 2020. A pilot study on the diversity in irony production and irony perception. In: Colston, H. L. & Athanasiadou, A. (eds.): The Diversity of Irony (Cognitive Linguistics Research 65). Berlin, Boston: De Gruyter Mouton. 278-303.
Lœvenbruck, H., Jannet, A. B. M., D'Imperio, M., Spini, M., & Champagne-Lavau, M. 2013. Prosodic cues of sarcastic speech in French: Slower, higher, wider. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 3537-3541.
Moosmüller, S. 1991. Hochsprache und Dialekt in Österreich: Soziophonologische Untersuchungen zu ihrer Abgrenzung in Wien, Graz, Salzburg und Innsbruck. Sprachwissenschaftliche Reihe, 1. Wien: Böhlau.
Moosmüller, S., Schmid, C., & Brandstätter, J. 2015. Standard Austrian German. Journal of the International Phonetic Association, 45(3), 339-348.
Nauke, A. & Braun, A. 2011. The production and perception of irony in short context-free utterances. Proceedings of the 17th International Congress of Phonetic Sciences (ICPhS). 1450-1453.
Niebuhr, O. 2014. "A little more ironic" – Voice quality and segmental reduction differences between sarcastic and neutral utterances. 7th International Conference of Speech Prosody. 608-612.
Noll, A., Stuefer, J., Klingler, N., Leykum, H., Lozo, C., Luttenberger, J., Pucher, M., & Schmid, C. 2019. Sound Tools eXtended (STx) 5.0 - A powerful sound analysis tool optimized for speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2370-2371.
Noll, A., White, J., Balazs, P., & Deutsch, W. 2007. STx - Intelligent Sound Processing. Programmer's Reference. Available from: https://www.kfs.oeaw.ac.at/stx.
R Core Team. 2015. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. Available from: http://www.R-project.org/.
Rockwell, P. 2000. Lower, slower, louder: Vocal cues of sarcasm. Journal of Psycholinguistic Research, 29(5), 483-495.
Scharrer, L. & Christmann, U. 2011. Voice modulations in German ironic speech. Language and Speech, 54(4), 435-465.
Schmiedel, A. 2017. Phonetik ironischer Sprechweise: Produktion und Perzeption sarkastisch ironischer und freundlich ironischer Äußerungen. Schriften zur Sprechwissenschaft und Phonetik, Band 8. Berlin: Frank & Timme.
Wood, S. & Scheipl, F. 2017. Gamm4: Generalized additive mixed models using ’mgcv’ and ’lme4’. R package version 0.2-5. Available from: http://CRAN.R-project.org/package=gamm4.

ISPHS MEMBERSHIP APPLICATION FORM

Please mail the completed form to:
Treasurer: Prof. Dr. Ruth Huntley Bahr, Ph.D.
Treasurer’s Office: Dept. of Communication Sciences and Disorders
4202 E. Fowler Ave., PCD 1017
University of South Florida
Tampa, FL 33620 USA

I wish to become a member of the International Society of Phonetic Sciences

Title:____ Last Name: ______First Name: ______Company/Institution: ______Full mailing address: ______Phone: ______Fax: ______E-mail: ______Education degrees: ______Area(s) of interest: ______

The Membership Fee Schedule (check one):
1. Members (Officers, Fellows, Regular) $ 30.00 per year
2. Student Members $ 10.00 per year
3. Emeritus Members NO CHARGE
4. Affiliate (Corporate) Members $ 60.00 per year
5. Libraries (plus overseas airmail postage) $ 32.00 per year
6. Sustaining Members $ 75.00 per year
7. Sponsors $ 150.00 per year
8. Patrons $ 300.00 per year
9. Institutional/Instructional Members $ 750.00 per year
Go online at www.isphs.org and pay your dues via PayPal using your credit card.
☐ I have enclosed a cheque (in US $ only), made payable to ISPhS.
Date ______ Full Signature ______
Students should provide a copy of their student card.

NEWS ON DUES

Your dues should be paid as soon as it is convenient for you to do so. Please send them directly to the Treasurer:
Prof. Ruth Huntley Bahr, Ph.D.
Dept. of Communication Sciences & Disorders
4202 E. Fowler Ave., PCD 1017
University of South Florida
Tampa, FL 33620-8200 USA
Tel.: +1.813.974.3182, Fax: +1.813.974.0822
e-mail: [email protected]

VISA and MASTERCARD: You now have the option to pay your ISPhS membership dues by VISA or MASTERCARD using PayPal. Please visit our website, www.isphs.org, click on the Membership tab and look under Dues for “paid online via PayPal.” Click on this phrase and you will be directed to PayPal.

The Fee Schedule:
1. Members (Officers, Fellows, Regular) $ 30.00 per year
2. Student Members $ 10.00 per year
3. Emeritus Members NO CHARGE
4. Affiliate (Corporate) Members $ 60.00 per year
5. Libraries (plus overseas airmail postage) $ 32.00 per year
6. Sustaining Members $ 75.00 per year
7. Sponsors $ 150.00 per year
8. Patrons $ 300.00 per year
9. Institutional/Instructional Members $ 750.00 per year

Special members (categories 6–9) will receive certificates; Patrons and Institutional members will receive plaques, and Affiliate members will be permitted to appoint/elect members to the Council of Representatives (two for each national group; one each for other organizations).
Libraries: Please encourage your library to subscribe to The Phonetician. Library subscriptions are quite modest – and they aid us in funding our mailings to phoneticians in Third World Countries.
Life members: Based on the request of several members, the Board of Directors has approved the following rates for Life Membership in ISPhS:
Age 60 or older: $ 150.00
Age 50–60: $ 250.00
Younger than 50 years: $ 450.00