A Quantitative Categorization of Phonemic Dialect Features in Context
Naomi Nagy, University of New Hampshire, [email protected]
Xiaoli Zhang, Rensselaer Polytechnic Institute, [email protected]
George Nagy, Rensselaer Polytechnic Institute, [email protected]
Edgar W. Schneider, Regensburg University, [email protected]
(World map image: www.jayzeebear.com/map/world.gif)

Summary
We test a method of clustering dialects of English according to patterns of shared phonological features. Previous linguistic research has generally treated phonological features as independent of one another, but context is important: rather than considering each phonological feature individually, we compare the patterns of co-occurring features using Mutual Information (MI). The dependence of one phonological feature on the others is quantified and exploited. The results of categorizing 59 dialect varieties by 168 binary internal (pronunciation) features are compared to traditional groupings based on external features (e.g., ethnic, geographic). The MI and the size of the groups are calculated for taxonomies at various levels of granularity, and these groups are compared to other analyses of geographic and ethnic distribution.

Data Organization
• Each element w_ij corresponds to a variant of a phonological feature for variety V_i.
• 69 phonological features F_j, with 2-7 variants (possible values) per feature.
• Each binary feature vector w_i has 168 elements (a subset is shown, for 13 dialects, in the table below).

The list of vowel features builds on the lexical sets devised by J.C. Wells, a system of distinct vowel types identified by key words (e.g., KIT for the vowel in "this" and "ridge"; DRESS for the vowel in "bet" and "said").

Possible variants of the vowel of KIT:
(1) canonical or basic high front [ɪ]
(2) raised and fronted [i] (as in "seed")
(3) centralized [ə] (as in "cup")
(4) with an offglide, e.g. [iə]

The matrix W: binary features (F_j) for two vowels in 13 dialects (V_i)

                        KIT                     DRESS
Variety                 basic  raised  central  close  open
Orkney & Shetland       0      0       1        0      1
North of England        1      0       0        0      1
East Anglia             1      0       0        0      1
Philadelphia            1      0       0        0      1
Newfoundland            0      0       1        0      1
Cajun English           1      0       0        0      1
Jamaican Creole         0      1       0        0      1
Tobago Basilect         1      0       0        0      1
Australian Creoles      1      0       0        1      0
Tok Pisin               0      1       0        1      0
Fijian English          0      0       1        0      1
Nigerian Pidgin         1      0       0        0      1
Cape Flats English      0      0       1        1      0
TOTAL                   7      2       4        3      10

Feature type     # features   # variants
Vowel            28           121
Vowel merger     4            4
Consonant        32           38
Prosodic         5            5
TOTAL            69           168

Next steps
• Test these methods at all levels of the continuum from idiolect to language, using many idiolects from each dialect.
• Predict, for a partially unanalyzed dialect, which features it will exhibit (based on knowledge of some subset of features that it does exhibit).
• Apply to speaker identification:
  o stochastic description of a speaker's full dialect
  o based on a sample containing a subset of phonemes
• Automated speech recognition:
  o accuracy could be raised by exploiting the consistency and the statistical dependencies in the pronunciation of speakers of a given dialect cluster

Methods: Clustering
• Dissimilarity between two varieties: ρ_ij = 1 - |w_i ∧ w_j| / (|w_i| |w_j|) = 1 - cos(w_i, w_j), i.e. the cosine dissimilarity between their binary feature vectors.
• The Complete Link Algorithm is used to create clusters.
• Two clusters are merged only when the maximum dissimilarity between a variety in one cluster and a variety in the other cluster is < θ (the reported taxonomy uses θ = 0.63, which yields K = 10 clusters).

Pairwise dissimilarities ρ_ij among five varieties:

Varieties   1     2     3     4     5
1           0    .36   .40   .46   .50
2                 0    .47   .44   .40
3                       0    .50   .55
4                             0    .55
5                                   0
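A minimal computational sketch of these two steps (cosine dissimilarity, then complete-link merging) is given below. It is illustrative rather than the authors' implementation: the toy matrix W copies five feature columns (KIT and DRESS variants) for four of the varieties in the table above, and SciPy's hierarchical-clustering routines stand in for the poster's algorithm.

```python
# Minimal sketch of the clustering step (illustrative only, not the poster's code).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy binary feature matrix: rows = varieties, columns = feature variants
# (here only the KIT basic/raised/central and DRESS close/open columns).
W = np.array([
    [1, 0, 0, 0, 1],   # North of England
    [1, 0, 0, 0, 1],   # East Anglia
    [0, 0, 1, 0, 1],   # Orkney & Shetland
    [0, 1, 0, 1, 0],   # Tok Pisin
], dtype=float)

# Dissimilarity rho_ij = 1 - cos(w_i, w_j); for 0/1 vectors the dot product
# counts the variants that two varieties share.
rho = pdist(W, metric="cosine")
print(squareform(rho).round(2))

# Complete-link agglomeration: flat clusters are cut so that the largest
# dissimilarity between any two members of a cluster stays within theta.
theta = 0.63
Z = linkage(rho, method="complete")
labels = fcluster(Z, t=theta, criterion="distance")
print(labels)  # cluster index assigned to each variety
```

Run on the full 59 x 168 matrix, the same two calls produce the kind of taxonomy reported below; lowering θ yields more, smaller clusters.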
Results: Clustering
[Figure: the 59 varieties, plotted against the dissimilarity threshold (θ), grouped by the complete-link algorithm into K = 10 clusters at θ = 0.63. The varieties range from Received Pronunciation, Scottish English, and Standard American English to pidgins and creoles such as Tok Pisin, Bislama, and Jamaican Creole. The clusters are shown alongside the traditional family-tree model of English varieties (Crystal 2003:70).]

Methods: Mutual Information (MI)
• The amount of context = the average MI between pairs of features.
• MI is based on the marginal and joint probabilities of the features within a cluster: the joint frequency p(F_j,m, F_l,n | V_i ∈ C_k) and the marginal frequencies p(F_j,m | V_i ∈ C_k) and p(F_l,n | V_i ∈ C_k) are calculated for each pair of features, where F_j,m is the m-th variant of the j-th feature of variety V_i in dialect cluster C_k.
• MI is the relative entropy between the joint distribution and the product of the marginals: it indicates how much each feature's distribution reveals about the other's.

Example calculation for KIT and DRESS in the 13 dialects above: the marginal frequencies of the KIT variants (basic, raised, central) are .54, .16, and .30. The six pointwise components I(x_i, y_j), one per variant pair, are -.06, .08, .01, .08, -.05, and -.01, summing to I_DRESS,KIT = 0.05. As expected, the MI is bounded by the entropy of either feature: I_DRESS,KIT = 0.05 < H(DRESS) = 0.54 < log_2 2 = 1.00, and H(KIT) = 1.41 < log_2 3 = 1.59.
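A short Python sketch of the same calculation follows; it is illustrative, not the authors' code, and the (KIT, DRESS) variant counts are simply read off the 13-dialect table in the Data Organization section.

```python
# Minimal sketch of the MI calculation for one feature pair within one cluster.
from collections import Counter
from math import log2

# (KIT variant, DRESS variant) observed for each of the 13 dialects above.
pairs = (
    [("basic", "open")] * 6 + [("basic", "close")] +
    [("raised", "open")] + [("raised", "close")] +
    [("central", "open")] * 3 + [("central", "close")]
)

n = len(pairs)
joint = Counter(pairs)                   # counts for p(F_j,m, F_l,n | V_i in C_k)
kit   = Counter(k for k, _ in pairs)     # marginal counts for the KIT variants
dress = Counter(d for _, d in pairs)     # marginal counts for the DRESS variants

# MI = sum over observed variant pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mi = sum(
    (c / n) * log2((c / n) / ((kit[k] / n) * (dress[d] / n)))
    for (k, d), c in joint.items()
)
print(f"I(KIT; DRESS) within this cluster: {mi:.2f} bits")
```

Averaging this quantity over all pairs of features in a cluster gives the "amount of context" defined above.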
Results: Mutual Information
MI for four lax vowels (KIT, DRESS, FOOT, THOUGHT) and four tense vowels (FLEECE, FACE, GOAT, GOOSE) across all 59 dialects. Diagonal cells are auto-comparisons; the highest value for a distinct pair is FACE-GOAT (1.30).

          KIT    DRESS  FOOT   THOUGHT  FLEECE  FACE   GOAT   GOOSE
KIT       2.00   0.41   0.58   0.33     0.52    0.61   0.69   0.51
DRESS            1.48   0.13   0.30     0.24    0.30   0.40   0.32
FOOT                    1.40   0.28     0.48    0.58   0.53   0.29
THOUGHT                        1.41     0.24    0.44   0.41   0.56
FLEECE                                  1.53    0.57   0.68   0.42
FACE                                            2.24   1.30   0.58
GOAT                                                   2.33   0.57
GOOSE                                                         1.56

There is a degree of MI across every pair (there are no cases of completely independent variation), so any word-recognition application would be improved by including MI in its calculations.

[Figure: vowel chart locating lexical-set key words in the vowel space, e.g. FLEECE [i] (high, front), TRAP [æ] (low, front), GOOSE [u] (high, back); adapted from the Language Samples Project (www.ic.arizona.edu/~lsp).]

Results: Clustering and Mutual Information
MI for KIT and DRESS paired with the eight vowels, within each of clusters 1-6 of the K = 10 taxonomy (θ = 0.63):

Feature pair    Cluster 1   2     3     4     5     6
KIT, KIT        1.16        0     0.92  1.92  1.37  1.37
KIT, DRESS      0.57        0     0.07  0.92  0.72  0.97
KIT, FOOT       0           0     0.25  0     0.17  0
KIT, THOUGHT    0           0     0.31  0     0.82  0
KIT, FLEECE     0.47        0     0.46  0.58  0.82  0
KIT, FACE       0.09        0     0.46  0.79  0.97  0.42
KIT, GOAT       0.04        0     0.46  1.58  1.37  0.97
KIT, GOOSE      0.13        0     0.20  0.32  1.37  0
DRESS, DRESS    1.55        0.44  0.50  1.25  0.72  1.52
DRESS, FOOT     0           0.01  0.04  0     0.07  0
DRESS, FLEECE   0.24        0.44  0.04  0.71  0.72  0
DRESS, FACE     0.11        0.01  0.04  0.46  0.32  0.17
DRESS, GOAT     0.05        0.03  0.04  0.92  0.72  1.12
DRESS, GOOSE    0.13        0.26  0.02  0.11  0.72  0

Highlighted cells show the value of combining clustering and MI: these values are all greater within their clusters than for the 59 dialects as a whole (where, for example, the KIT-DRESS MI is 0.41).

"0" = no variation within that cluster for that vowel pair: if one of the two vowels is completely predictable, then knowing about the other cannot improve our prediction of the first. Aside from these cases, MI would always improve the performance of ASR.
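Why the zero cells must occur follows from a standard information-theoretic bound (general background, not a result specific to this study): MI never exceeds the entropy of either feature, so a feature that shows only one variant inside a cluster has zero entropy there and therefore zero MI with every other feature. In the poster's notation:

```latex
% MI of two features within cluster C_k, and its upper bound.
I(F_j; F_l \mid C_k)
  = \sum_{m,n} p(F_{j,m}, F_{l,n} \mid C_k)\,
    \log_2 \frac{p(F_{j,m}, F_{l,n} \mid C_k)}
                {p(F_{j,m} \mid C_k)\, p(F_{l,n} \mid C_k)}
  = H(F_j \mid C_k) - H(F_j \mid F_l, C_k)
  \le H(F_j \mid C_k).

% If every variety in C_k shows the same variant m^* of F_j, then
% p(F_{j,m^*} \mid C_k) = 1, so H(F_j \mid C_k) = 0 and hence
% I(F_j; F_l \mid C_k) = 0.
```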