<<

9th International Conference on Speech Prosody 2018 13-16 June 2018, Poznań, Poland

Classification of Swedish dialects using a hierarchical prosodic analysis

Marcin Włodarczak1, Juraj Simkoˇ 2, Antti Suni 2, Martti Vainio 2

1 University, 2 University of Helsinki, [email protected] juraj.simko,antti.suni,martti.vainio @helsinki.fi { }

Abstract Table 1: Tonal typology of Swedish dialects according to the number and timing of pitch peaks, based on [5]. The present study investigates dialectal variation of Swedish word accents by means of wavelet-based analysis of f0 and en- Type Accent 1 Accent 2 ergy. The analysis yields a measure of prosodic similarity be- tween dialects expressed in terms of mutual perplexity of uni- 0 — — gram models trained on derivatives of the wavelet-decomposed One peak One peak input features. A comparison of models trained on energy, f 0 1A Early in the stressed Late in the stressed syl- and a combination of both features indicates that the energy+f 0 syllable lable model reaches the highest classification accuracy, in line with 1B Late in the stressed syl- Early in the post-tonic the existing descriptions of tonal dialects in terms of the number lable syllable and timing of pitch peaks with respect to the stressed syllable. At the same time, prosodic similarity between geographically One peak Two peaks close but typologically distinct dialects suggests an interaction 2A Late in the stressed or One in each syllable between the traditional distinction between type-1 and type-2 early in the post-tonic dialects and areal variation, giving rise to northern and southern syllable type-2 dialects (with little difference between 2A and 2B sub- 2B In the post-tonic sylla- One in each syllable types), and a parallel distinction between 1A and 1B varieties. ble Index Terms: word accent, tonal dialect, prosodic typology

1. Introduction of analysed speakers was not specified). The distribution of ac- To the delight of Scandinavian phoneticians, Swedish speakers cent realisations resembled to a large extent Gårding’s classifi- use tonal word accents on top of lexical stress. The distinction cation of Meyers’s material. consists of two accent types, ‘Accent 1’ (‘acute’) and ‘Accent In addition to these manual methods, there have been some 2’ (‘grave’), associated with mono- or bisyllabic stems, respec- attempts at automatic classification of Swedish regional tonal tively [1]. While the distinction is of limited use in terms of lex- variants, based on low-level acoustic features. For instance, Frid ical contrast with only around 350 reported minimal pairs [2], it [8] used CART trees to predict, among others, Gårding’s di- is has been described as highly sensitive to regional variation. alect in realisations of the words “dollar” and “kronor” from a A preliminary description of Swedish tonal dialects was subset of the SweDia 2000 locations. Three pitch parametrisa- done by Meyer [3, 4], who distinguished three major groups, tion methods were used, based on (1) temporal and pitch-related depending on the number and timing of pitch peaks with respect properties of the first fall beginning before the stressed vowel to the stressed syllable. Meyer’s material was subsequently re- offset, (2) f0 level at the stressed vowel onset and (3) the Tilt analysed by Gårding [5], who proposed five distinct dialects, model [9]. The best performing feature set (using method 1) summarised in Table 1. Briefly, the distinction between types 1 achieved an accuracy of 59% against a random 20% baseline. and 2 reflects the number of pitch peaks in Accent 2 (one or two Given that the same f0 pattern might correspond to different peaks, respectively) while the distinction between types A and word accent depending on the dialect and presence of narrow B is linked to the pitch peak timing (early or late, respectively). focus, separate classification experiments were conducted for Gårding assigned each type to a specific area: type 1A being accent type and focus. Overall, focused words and Accent 2 spoken in Southern Sweden, 1B in and , 2A words reached higher accuracy than the other word classes. in Central Sweden (Svea dialect region) and 2B in the area be- More recently, Lidberg and Bromqvist [10] used Gaus- tween Southern and Central Sweden. In addition, she included sian mixture models and convolutional neural networks trained type 0, spoken in Finland and in the North of Sweden, which on MFCCs and path signatures of a subset of SweDia 2000 has no contrastive tonal word accents, as well as a subtype of wordlists to distinguish between five geographically defined di- 2A (not included in Table 1), used in Oland.¨ alects. The overall accuracy for the Gaussian classifier trained Bruce [6] compiled an updated map of Gårding’s Swedish on individual words equalled 61%. Performance further im- tonal dialects using a larger number of sites, sampled from the proved (80%) when several single-word classifiers were com- SweDia 2000 corpus [7]. The analysis was based on focally ac- bined. Dialect classification on spontaneous speech from three cented phrase-final occurrences of ‘dollar’ (Accent 1) and ‘kro- regions reach still higher accuracy (88%), which was expected nor’ (Accent 2) in a sample of older men’s speech (the number given the reduced number of categories.

304 10.21437/SpeechProsody.2018-62 In the present paper, we revisit the problem of automatic identification of Swedish tonal dialects using a hierarchical Dialect 0 1A 1B 2A 2B prosodic analysis method. The method was previously used for comparison of languages and was found to capture typological relationships between language families [11]. Specifically, we 1.00 use unigram models trained on f0 and energy derivative (∆) fea- tures of f0 and energy signals decomposed using Continuous Wavelet Transform (CWT). The individual components have 0.75 been demonstrated to accurately characterise familiar levels of the prosodic hierarchy, such as syllables, words and phrases [12]. Given that the regional variation of Swedish word accents 0.50 involves both the number of peaks and their timing with respect to the stressed vowel, the hierarchical character of the analy- accuracy sis, preserving the relationships between distinct prosodic lev- els, is particularly well-suited for this task. The distances be- 0.25 tween dialects are then expressed in terms of a perplexity-based measure, allowing comparisons at a chosen level of granularity, from regions and dialects to villages to speakers to individual 0.00 words. energy f0 energy + f0 model 2. Method Figure 1: Prediction accuracy for energy, f and energy + f 2.1. Material 0 0 models grouped by dialect. The material from the present study consisted of wordlists recorded as part of the SweDia 2000 corpus of Swedish di- alects collected in 107 locations around Sweden and Swedish- cluded. speaking areas of Finland in 1999. The wordlists were com- The models were evaluated by calculating perplexity (mean posed to represent both segmental and prosodic characteristics log(p)) for each word in the material. Subsequently, the indi- of the dialects and for this reason are not completely identical vidual− perplexity values were averaged per-site and collected in across recordings sites. a confusion matrix with cell [site , site ] referring to the mean Given that SweDia has not been annotated for tonal dialect i j perplexity of the site model across all words from site . The types and since it includes more sites than analysed by Gård- i j confusion matrix can be used for deriving mean perplexities ing [5], the labels from [8], inferred from neighbouring loca- for higher-order units. For instance, the mean perplexity of a tions in Gårding, were used. Borderline cases, for which could dial model on a dial material can be obtained by averag- not be easily assigned to a single category, were excluded from n m ing cells [site , site ], for all i and j such that site dial the analysis. Altogether, the analysed material amounted to over i j i n and site dial , etc. Similarly, the distance between site∈ and 210,122 words from 86 different sites. j ∈ m i sitej was calculated by averaging perplexities in [sitei, sitej] and [site , site ]. We refer to this measure as mutual perplexity. 2.2. Prosodic analysis and language model comparison j i In addition, for each site we predict dialect type by selecting Pitch was extracted with Praat [13] at 10 ms time step, us- the category with the lowest per-row mean perplexity. ing the two-step extraction procedure described in [14]. The resulting contour was then smoothed (10 Hz bandwidth) and 3. Results the voiceless frames were interpolated linearly. Energy was ob- tained from downsampled (8 kHz) waveforms decomposed with Using both f0 and energy, the dialect type has been predicted wavelet transform (Morlet mother wavelet, ω0 = 3). Compo- correctly in 78% of sites (67 out of 86 cases), against a 48% nents with pseudoperiods of 0.25, 0.5 and 1 s were then summed majority baseline. By comparison, energy- and f0-only mod- to estimate amplitude envelope of the signal. els achieved much lower accuracy, 30% and 42%, respectively. We have trained a separate unigram model for each SweDia Figure 1, shows the same results grouped by dialect type. With location, using the procedure described in [11]. Briefly, deriva- the exception of 0, 1B and 2B dialects, addition of f0 improves tives (∆) were calculated for 200, 400 and 800 ms pseudope- prediction accuracy over energy-only models, with further im- riod components obtained from the continuous wavelet trans- provement when the two feature sets are combined. Inspection form (Morlet mother wavelet, ω0 = 3) of the energy and f0 sig- of confusion matrices (not included here) revealed that for 2B th nals in Matlab. The values of the derivatives between the 5 dialects, inclusion of f0 resulted in increased confusion with 1A and the 95th percentiles of each speaker were subsequently dis- and 2A dialects, which, however, is outweighed by the cumula- cretised into an odd number of bins. Three models were tested: tive improvement of energy and f0. 0-type dialects achieve high models using energy and f0 components only used a five-bin accuracy for all three models (with a slight deterioration for the discretisation, while a combined energy + f0 model used only f0-only model due to a misclassification of a single village). 1B three states due to the greater number of parameters to learn. dialects were completely unaffected by the feature set used. Each of the three bins effectively carries information about the Notably, the same feature set is quite good at predicting re- slope of a particular signal component: rising, falling or rela- gion (Gotaland,¨ , , Finland). Here, the accu- tively flat. Finally, the discretised components of all signals at racy of the energy + f0 model equals 72% (against a majority each time point were combined into a single state. States with baseline of 36%). The model trained on f0 achieves somewhat values falling outside the 5th-95th percentile interval were ex- lower accuracy of 55% (but substantially higher than in case of

305 Dialect 0 1A 1B 2A 2B

Figure 2: Mutual perplexities for eight nearest neighbours of each recording site, grouped by tonal dialect. Red indicates low perplexity (and thus higher degree of prosodic similarity) and yellow indicates high perplexity. dialect type), with energy-only model performing worse than 2A and 2B varieties and Svealand 2A type. Notably, with the the majority baseline (34%). exception of Norrland-0 (represented by only one village) this Taken together, these results pose a question about an in- part of the tree groups all occurrences of type-2 dialects. The teraction between dialect type and region. To investigate this highest-level links include Svealand-2A, Finland, as well as the further, in Figure 2 we plot mutual perplexity for eight nearest 1B dialects spoken in Gotaland¨ and Svealand, the last two form- neighbours of each SweDia site overlayed on a map of Swe- ing a separate branch. den and Finland. Perhaps the most striking feature of that plot is lack of clear-cut dialect boundaries, which would manifest 4. Discussion and conclusions themselves as high-perplexity (low prosodic similarity) links between neighbouring locations belonging to different dialectal The present paper presents an attempt at automatic description varieties. Instead, geographically close villages from different of the Swedish tonal dialects using a CWT-based hierarchical dialects show a high degree of similarity. For instance, all Nor- analysis of prosody, previously applied to classification of a ty- rland dialects (2A, 2B and even 0) form one prosodic cluster. By pologically and genealogically varied sample of languages. By contrast, the variety of the 2A dialect spoken in Svealand seems contrast, this study deals with what is a relatively a subtle vari- to differ from its Norrland counterpart, forming (together with ation of a prosodic parameter, spoken over a well-defined and 1B) an area of relatively high-perplexity cutting across Cen- limited territory. The difficulty of this task is indeed reflected in tral Sweden. Southern dialects and the Finnish Swedish dialects greater model perplexity values, indicating a greater degree of also form low-perplexity groupings. prosodic similarity between the analysed types. In order to verify whether regional effects do in fact to some Nevertheless, the results clearly demonstrate that inclusion extent override (or mediate) dialectal classifications, we calcu- of f0 improves the overall accuracy of dialect classification over lated mean mutual perplexity values for all region-dialect com- a model trained on energy-based features only, and combining binations and input the resulting distance matrix into a hierar- the two features offers further improvement. This is expected chical clustering procedure, using the hclust function in R. The given that the dialectal variation of Swedish word accents is resulting dendrogram, depicted in Figure 3, indeed supports this predominantly characterised by the pattern of f0 and its timing hypothesis. Namely, Norrland dialects (including 0, which lacks with respect to syllabic boundaries. Indeed, it is only when both the word accent distinction) cluster first, followed by Gotaland’s¨ energy and f0 are used together that the overall accuracy ex-

306 r n otenSee ihahg-epeiybl cutting belt high-perplexity a with Sweden Southern and North- in ern perplexity) mutual (low rela- similarity of prosodic regions high indicate tively values results between-group the Rather, high perplexity. mutual and of char- within-group dialects low not between by did boundaries acterised work clear-cut this of la- in presence reveal employed dialect prox- analysis geographical familiar by the the overridden Specifically, partly imity. that least suggests at indeed 2 are bels Figure neighbouring in of plotted perplexity Mutual sites similarity. prosodic of from gree material speech In whether proximity. interested di territorial were ex- we by what words, mediated to other is question variation the dialectal investigated tent subsequently we land), (G variation regional for monosyl- of presence di dialectal the the for where controlled words, not labic were record- the and con- across sites By identical ing entirely 10]. not [8, were words wordlists specific our similar trast, for in models than used varied e more which also these studies, was of used robustness material the relative in addition, the contrasts to tonal testifies of variability Swedish dialectal to related and relationship energy of variability the nevertheless capture the to That manage deltas. signal by represented memory, ited is data unseen on research. results future the for of left evaluation formal an A of site. perplexity ticular the on based is averaged label predicted the addition, In charac- sequential and general intensity express of teristics models classification the dialect Rather, of mind. task train- particular in the the that with noted done be not was should ing it trained, were models the which for than (1A) dialects some for greater (1B). is others gain the although that dialects, individual seems of it level the on e observed be combined also The can baseline. majority the ceeds perplexity. mutual on based 3: Figure ff rn u egahclycoedaet xiissm de- some exhibits dialects close geographically but erent ie htcmaal lsicto cuaywsachieved was accuracy classification comparable that Given lim- rather a have work this in used models the Notably, on data same the on calculated was accuracy the While Gotaland-1B¨ edormo einaddaettp combinations type dialect and region of Dendrogram ilc oe ahrta nteprlxt fapar- a of perplexity the on than rather model dialect Svealand-1B

Finland-0

Gotaland-1A¨ f 0 tln,Nrln,Seln n Fin- and Svealand Norrland, otaland, ¨

ovraini h nlsdmaterial. analysed the in co-variation Svealand-2A

Svealand-2B ff rne r neutralised. are erences Norrland-2B

Norrland-0 ff f c fenergy of ect 0 Norrland-2A swl stheir as well as Gotaland-2A¨ ff cs In ects. Gotaland-2B¨ + f 0 307 lcs ihadtoa einlsbgopnso G of sub-groupings regional di- side additional type-2 right-hand with to exclusively alects, the almost where corresponds dendrogram 3, the of Figure in clustering hierarchical Sweden. Central across ilc aesfrteSei 00corpus. 2000 SweDia the for labels dialect second the to 12933481) author. (No. project DLT Finland of Academy 2014-1072 project samtal Council Research part Swedish in between the and grant Helsinki by of collaboration University a the by and University part Stockholm in funded was work This sites. individual across generalise rep- to few models too the involve for might pre- etitions it purposes, that our revealed for has suited analysis well liminary particularly is set data the former While interviewer. an with conversations unscripted volving un- elicited di ‘D-mark’ set, and der “prosody” ‘kronor’ the ‘dollar’, namely words data, of SweDia consisting the of subsets other the signal. or energy the segmental land- from relevant inferred the the are Instead, of marks data. knowledge the of any structure the require suprasegmental [8], not Frid by does used method procedures particu- parametrisation In the unlike descriptions. lar, dialectological sizes conventional set data in the magnitude found of speech orders of several by amounts exceeding large auto- data, of literature. fully a processing in of allowing light means analysis, found by matic new accents at arrived sheds word were insights thus Swedish these Notably, study of present typology the The on varieties. between northern 1B distinction parallel and to a 1A rise and dialects, giving type-2 variation, southern di- regional and type-2 by and mediated type-1 between alects distinction traditional tree. the the tween in branch separate a forms accents word tonal from of distant tion relatively are and G G together in spoken grouped dialect of are 1B clusters Svealand the 2B discernible contrast, no and By with 2A type. role either between secondary distinction a the plays that Norrland-0). varieties seems of it instance particular, lone In the (including dialects Norrland G Bruce, G. [6] Gårding, E. [5] —,“i noaini cwdshn I i norrl Die II: Schwedischen. in Intonation “Die ——, [4] Meyer, A. E. [3] C-.Eet Tnlt nSeih ue n ito minimal of list a and Rules Swedish: in “Tonality Elert, C.-C. [2] Riad, T. [1] tln’ Adaet rdcal,Fnadwt t neutralisa- its with Finland Predictably, dialect. 1A otaland’s ¨ ovrigcnlsosaesgetdb h eut fthe of results the by suggested are conclusions Converging h uhr ol iet hn oa rdfrsaighis sharing for Frid Johan thank to like would authors The on analysis the repeat to planning are we work, future In be- interaction strong a towards point results the Altogether, c uttal och 1977. 1954. 11, vol. Mundarten,” mundarten 1971, Mouton, Hague: The Eds. 151–173. pp. O’Neil, W. and Hasselmo, N. ar, in pairs,” 2014. Press, ff rn ou odtos n h sotnos es in- sets, “spontaneous” the and conditions, focus erent ( rahn nconversation in Breathing h hnlg fSwedish of phonology The ud tdnltrtrA,2010. AB, Studentliteratur Lund: . tde o ia Haugen Einar for Studies å oeik ega.O vnkn cetr melodi accenter, svenskans Om geografi. fonetiska Vår esnfr:Fizs 1937. Fritzes, Helsingfors: . .Acknowledgements 5. h cniainwr accents word Scandinavian The tchl tde nSadnva Philology Scandinavian in Studies Stockholm i noaini cwdshn i Svea- Die Schwedischen. im Intonation Die .References 6. otefis uhr n the and author, first the to ) xod xodUniversity Oxford Oxford: . .S rco,G Kaaren, G. Frichow, S. E. , ud Gleerup, Lund: . tln and otaland tln and otaland ¨ ¨ nnn i Andning andischen ¨ , [7] A. Eriksson, “SweDia 2000: A Swedish dialect database,” in Babylonic Confusion Resolved: Proceedings of the Nordic Sympo- sium on the Comparison of Spoken Languages, ser. Copenhagen Working Papers in LSP, P. J. Henrichsen, Ed., no. 1, 2004, pp. 33–48. [8] J. Frid, “Lexical and acoustic modelling of Swedish prosody,” Ph.D. dissertation, Lund University, Lund, 2003. [9] P. Taylor, “Analysis and synthesis of intonation using the Tilt model,” The Journal of the Acoustical Society of America, vol. 107, no. 3, pp. 1697–1714, 2000. [10] D. Lidberg and V. Blomqvist, “Swedish dialect classification using artificial neural networks and gaussian mixture models,” Master’s thesis, Chalmers University of Technology, , 2017. [11]J. Simko,ˇ A. Suni, K. Hiovain, and M. Vainio, “Comparing lan- guages using hierarchical prosodic analysis,” in Proceedings of Interspeech 2017, Stockholm, Sweden, 2017, pp. 1213–1217. [12] A. Suni, J. Simko,ˇ D. Aalto, and M. Vainio, “Hierarchical rep- resentation and estimation of prosody using continuous wavelet transform,” Computer Speech & Language, vol. 45, pp. 123–136, 2017. [13] P. Boersma and D. Weenink, “Praat: doing phonetics by computer (version 6.0.24),” Computer program, 2017, http://www.praat.org/. [14] C. De Looze and D. Hirst, “Detecting changes in key and range for the automatic modelling of intonation,” in Proceedings of Speech Prosody 2008, Campinas, Brazil, 2008, pp. 135–138.

308