Computational analysis of Gondi dialects

Taraka Rama, Çağrı Çöltekin and Pavel Sofroniev
Department of Linguistics, University of Tübingen, Germany
[email protected], [email protected], [email protected]

Abstract

This paper presents a computational analysis of Gondi dialects spoken in central India. We present a digitized data set of the dialect area, and analyze the data using different techniques from dialectometry, deep learning, and computational biology. We show that the methods largely agree with each other and with the earlier non-computational analyses of the language group.

1 Introduction

Gondi languages are spoken across a large region in the central part of India (cf. figure 1). The languages belong to the Dravidian language family and are closely related to Telugu, a major literary language of South India. The Gondi languages received wide attention in comparative linguistics (Burrow and Bhattacharya, 1960; Garapati, 1991; Smith, 1991) due to their dialectal variation. On the one hand, the languages look like a dialect chain while, on the other hand, some of the dialects are shown to exhibit high levels of mutual unintelligibility (Beine, 1994).

Smith (1991) and Garapati (1991) perform classical comparative analyses of the dialects and classify the Gondi dialects into two subgroups: Northwest and Southeast. Garapati (1991) compares Gondi dialects where most of the dialects belong to the Northwest subgroup and only three dialects belong to the Southeast subgroup. In a different study, Beine (1994) collected lexical word lists transcribed in the International Phonetic Alphabet (IPA) for 210 concepts belonging to 46 sites and attempted to perform a classification based on word similarity. Beine (1994) determines two words to be cognate (having descended from the same common ancestor) if they are identical in form and meaning. The average similarity between two sites is determined as the average number of identical words between the two sites. The author describes the results of the experiments qualitatively and does not perform any quantitative analysis. Until now, there has been no computational analysis of the lexical word lists to determine the exact relationship between these languages. We digitize the dataset and then perform a computational analysis.

Recent years have seen an increase in the number of computational methods applied to the study of both dialect and language classification. For instance, Nerbonne (2009) applied Levenshtein distance to the classification of Dutch and German dialects. Nerbonne finds that the classification offered by Levenshtein distance largely agrees with the traditional dialectological knowledge of the Dutch and German areas. In this paper, we apply the same dialectometric analysis to the Gondi language word lists.

In the related field of computational historical linguistics, Gray and Atkinson (2003) applied Bayesian phylogenetic methods from computational biology to date the age of the Proto-Indo-European language tree. The authors use cognate judgments given by historical linguists to infer both the topology and the root age of the Indo-European family. In parallel to this work, Kondrak (2009) applied phonetically motivated string similarity measures and word-alignment-inspired methods for the purpose of determining whether two words are cognates. This work was followed by List (2012) and Rama (2015), who employed statistical and string kernel methods for determining cognates in multilingual word lists.

In typical dialectometric studies (Nerbonne, 2009), the assumption that all the pronunciations of a particular word are cognates is often justified by the data. However, we cannot assume that this is the case in Gondi dialects, since there is a significant amount of lexical replacement due to borrowing (from contact) and internal lexical innovations. Moreover, the previous comparative linguistic studies classify the Gondi dialects using sound correspondences and lexical cognates. In this paper, we will use the Pointwise Mutual Information (Wieling et al., 2009) method for obtaining sound change matrices and use the matrices to automatically identify cognates.

The comparative linguistic research classified the Gondi dialects into five different genetic groups (cf. table 1). However, the exact branching of the Gondi dialects is yet an open question. In this paper, we apply both dialectometric and phylogenetic approaches to determine the exact branching structure of the dialects.

The paper is organized as follows. In section 2, we describe the dataset and the gold standard dialect classification used in our experiments. In section 3, we describe the various techniques for computing and visualizing the dialectal differences. In section 4, we describe the results of the different analyses. We conclude the paper in section 5.

Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 26-35, Valencia, Spain, April 3, 2017. (c) 2017 Association for Computational Linguistics

Figure 1: The Gondi language area with major cities in focus. The dialect/site codes and the geographical distribution of the codes are based on Beine (1994).

2 Datasets

The word lists for our experiments are derived from the fieldwork of Beine (1994). Beine (1994) provides multilingual word lists for 210 meanings at 46 sites in central India, shown in figure 1. In the following sections, we use the Glottolog classification (Nordhoff and Hammarström, 2011) as the gold standard to evaluate the various analyses. Glottolog is an openly available resource that summarizes the genetic relationships of the world's dialects and languages from published scholarly linguistic articles. For reference, we provide the Glottolog classification[1] of the Gondi dialects in table 1. The Glottolog classification is derived from comparative linguistics (Garapati, 1991; Smith, 1991) and dialect mutual intelligibility tests (Beine, 1994).

Dialect codes                                   Classification
gdh, gam, gar, gse, glb, gtd, gkt, gch, prg,    Northwest Gondi, Northern Gondi
gka, gwa, grp, khu, ggg, gcj, bhe, pmd, psh,
pkh, ght
rui, gki, gni, dog, gut, gra, lxg               Northwest Gondi, Southern Gondi
met, get, mad, gba, goa, mal, gja, gbh, mbh     Southeast Gondi, General Southeast Gondi,
                                                Hill Maria
mku, mdh, ktg, mud, mso, mlj, gok               Southeast Gondi, General Southeast Gondi,
                                                Muria
bhm, bhb, bhs                                   Southeast Gondi, General Southeast Gondi,
                                                Bison Horn Maria

Table 1: Classification of the 46 sites according to Glottolog (Nordhoff and Hammarström, 2011).

The whole dialect region is divided into two major groups, Northwest Gondi and Southeast Gondi, which are in turn divided into five major subgroups: Northern Gondi, Southern Gondi, Hill Maria, Bison Horn Maria, and Muria. Northern Gondi and Southern Gondi belong to the Northwest Gondi branch, whereas the rest of the subgroups belong to the Southeast Gondi branch.

[1] http://glottolog.org/resource/languoid/id/gond1265
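The word lists described above can be organized as a site-by-concept table, from which Beine's identical-word similarity between two sites falls out directly. The tab-separated format, field names, and the toy forms below are assumptions for illustration, not the actual released data format (only the site codes gdh, rui, and met are taken from table 1):

```python
# Sketch: load a digitized word list into {concept: {site: form}} and compute
# the Beine (1994)-style similarity: the proportion of shared concepts for
# which two sites use an identical form. Data format and forms are made up.
import csv
from collections import defaultdict
from io import StringIO

RAW = """site\tconcept\ttranscription
gdh\tdog\tnai
rui\tdog\tnai
met\tdog\tnay
gdh\teye\tkan
rui\teye\tkan
met\teye\tkan
"""

def load_wordlists(text):
    """Return {concept: {site: transcription}} from a TSV word list."""
    table = defaultdict(dict)
    for row in csv.DictReader(StringIO(text), delimiter="\t"):
        table[row["concept"]][row["site"]] = row["transcription"]
    return table

def identical_word_similarity(table, site_a, site_b):
    """Share of concepts attested at both sites with an identical form."""
    shared = [c for c in table if site_a in table[c] and site_b in table[c]]
    if not shared:
        return 0.0
    same = sum(table[c][site_a] == table[c][site_b] for c in shared)
    return same / len(shared)

table = load_wordlists(RAW)
```

On the toy data, `identical_word_similarity(table, "gdh", "rui")` is 1.0 (both concepts identical) and `identical_word_similarity(table, "gdh", "met")` is 0.5.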

It has to be noted that there is no gold standard for the internal structure of the dialects belonging to each dialect group.

3 Methods for comparing and visualizing dialectal differences

We use the IPA-transcribed data to compute both unweighted and weighted string similarity/distance between two words. We use the same IPA data to train the LSTM autoencoders introduced by Rama and Çöltekin (2016) and project the autoencoder-based distances onto a map.

As mentioned earlier, dialectometric analyses typically assume that all words that share the same meaning are cognates. However, as shown by Garapati (1991), some Gondi dialects exhibit a clear tree structure. Both the dialectometric and autoencoder methods only provide an aggregate amount of similarity between dialects and do not work with cognates directly. The methods are sensitive to lexical differences only through high dissimilarity of phonetic strings. Since lexical and phonetic differences are likely to indicate different processes of linguistic change, we also analyze the categorical differences due to lexical borrowings/changes. For this purpose, we perform automatic cognate identification and then use the inferred cognates to perform both Bayesian phylogenetic analysis and dialectometric analysis.

3.1 Dialectometry

3.1.1 Computing aggregate distances

In this subsection, we describe how Levenshtein distance and autoencoder-based methods are employed for computing site-site distances.

Levenshtein distance: Levenshtein distance is defined as the minimum number of edit operations (insertion, deletion, and substitution) that are required to transform one string into another. We use the Gabmap (Nerbonne et al., 2011) implementation of Levenshtein distance to compute site-site differences.

Autoencoders: Rama and Çöltekin (2016) introduced LSTM autoencoders for the purpose of dialect classification. Autoencoders were employed by Hinton and Salakhutdinov (2006) for reducing the dimensionality of images and documents. Autoencoders learn a dense representation that can be used for clustering the documents and images. An autoencoder network consists of two parts: an encoder and a decoder. The encoder network takes a word as input and transforms the word into a fixed-dimension representation. The fixed-dimension representation is then supplied as input to a decoder network that attempts to reconstruct the input word. In our paper, both the encoder and decoder networks are Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997).

In this paper, each word is represented as a sequence (x_1, ..., x_T) of one-hot vectors of dimension |P|, where |P| is the total number (58) of IPA symbols across the dialects. The encoder is an LSTM network that transforms each word into h ∈ R^k, where k is predetermined beforehand (in this paper, k is assigned a value of 32). The decoder consists of another LSTM network that takes h as input at each timestep to predict an output representation. Each output representation is then supplied to a softmax function to yield x̂_t ∈ R^|P|. The autoencoder network is trained by minimizing the binary cross-entropy -Σ_t (x_t log(x̂_t) + (1 - x_t) log(1 - x̂_t)), where x_t is a one-hot vector and x̂_t is the output of the softmax function at timestep t, to learn the parameters of both the encoder and decoder LSTMs. Following Rama and Çöltekin (2016), we use a bidirectional LSTM as the encoder network and a unidirectional LSTM as the decoder network. Our autoencoder model was implemented using Keras (Chollet, 2015) with Tensorflow (Abadi et al., 2016) as the backend.

3.1.2 Visualization

We use Gabmap, a web-based application for dialectometric analysis, for visualizing the site-site distances (Nerbonne et al., 2011; Leinonen et al., 2016).[2] Gabmap provides a number of methods for analyzing and visualizing dialect data. Below, we present maps and graphics that are the results of multi-dimensional scaling (MDS) and clustering.

For all analyses, Gabmap aggregates the differences calculated over individual items (concepts) into a site-site distance matrix. With phonetic data, it uses site-site differences based on string edit distance, with a higher penalty for vowel-consonant alignments and a lower penalty for alignments of sound pairs that differ only in IPA diacritics. With binary data, Gabmap uses Hamming distances to compute the site-site differences. The cognate clusters obtained from the automatic identification procedure (section 3.2.1) form the categories which are analyzed using binary distances. Finally, we also visualize the distances from the autoencoders (section 3.1.1) using Gabmap.

Gabmap provides various agglomerative hierarchical clustering methods for clustering analyses. In all the results below, we use Ward's method for calculating cluster differences. For our analyses, we present the clustering results on (color) maps and as dendrograms. Since clustering is known to be relatively unstable, we also present probabilistic dendrograms that are produced by noisy clustering (Nerbonne et al., 2008). In noisy clustering, a single cluster analysis is performed a large number of times (~100), each time adding to the distance matrix a small amount of noise that is proportional to the standard deviation of the original distance matrix. The combined analysis then provides statistical support for the branches in a dendrogram.

Multi-dimensional scaling (MDS) is a useful analysis/visualization technique for verifying the clustering results and for visualizing the dialect continuum. A site-site (linguistic) distance matrix represents each site in a multi-dimensional space. MDS 'projects' these distances to a smaller-dimensional space that can be visualized easily. In dialect data, the distances in the few most important MDS dimensions correlate highly with the original distances, and these dimensions often correspond to linguistically meaningful dimensions. Below, we also present maps where the areas around linguistically similar locations are plotted using similar colors.

3.2 Phylogenetic approaches

3.2.1 Automatic cognate detection

Given a multilingual word list for a concept, the automatic cognate detection procedure (Hauer and Kondrak, 2011) can be broken into two parts:

1. Compute a pairwise similarity score for all word pairs in the concept.
2. Supply the pairwise similarity matrix to a clustering algorithm to output clusters that show high similarity with one another.

The Needleman-Wunsch algorithm (NW; Needleman and Wunsch (1970); the similarity counterpart of Levenshtein distance) is a possible choice for computing the similarity between two words. The NW algorithm maximizes similarity whereas Levenshtein distance minimizes the distance between two words. The NW algorithm assigns a score of 1 for a character match and a score of -1 for a character mismatch. Unlike Levenshtein distance, the NW algorithm assigns a penalty for opening a gap (deletion operation) and a penalty for gap extension, which models the fact that deletion operations occur in chunks (Jäger, 2013).

The NW algorithm is not sensitive to different sound segment pairs, but a realistic algorithm should assign a higher similarity score to sound correspondences such as /l/ ~ /r/ than to sound correspondences such as /p/ ~ /r/. The weighted Needleman-Wunsch algorithm takes a segment-segment similarity matrix as input and then aligns the two strings so as to maximize the similarity between the two words.

In dialectometry (Wieling et al., 2009), the segment-segment similarity matrix is estimated using pointwise mutual information (PMI). The PMI score for two sounds x and y is defined as follows:

    pmi(x, y) = log( p(x, y) / (p(x) p(y)) )        (1)

where p(x, y) is the probability of x and y being matched in a pair of cognate words, whereas p(x) is the probability of x. A positive PMI value between x and y indicates that the probability of x being aligned with y in a pair of cognates is higher than what would be expected by chance. Conversely, a negative PMI value indicates that an alignment of x with y is more likely the result of chance than of shared inheritance.

The PMI-based computation requires a prior list of plausible cognates for computing a weighted similarity matrix between sound segments. In the initial step, we extract cross-lingual word pairs that have a Levenshtein distance of less than 0.5 and treat them as the list of plausible cognates. The PMI estimation procedure is as follows:

1. Compute alignments between the word pairs using a vanilla Needleman-Wunsch algorithm.[3]
2. The computed alignments from step 1 are then used to estimate the segment-segment PMI matrix using equation (1).

[2] Available at http://gabmap.nl/.
[3] We set the gap-opening penalty to -2.5 and the gap-extension penalty to -1.75.

3. The PMI matrix obtained from step 2 is used to realign the word pairs and generate a new list of segment alignments. The new list of alignments is employed to compute a new PMI matrix.
4. Steps 2 and 3 are repeated until the difference between successive PMI matrices reaches zero.

In our experience, five iterations were sufficient to reach convergence. At this stage, we use the PMI matrix to compute a word similarity matrix between the words belonging to a single meaning. The word similarity matrix is converted into a word distance matrix using the transformation (1 + exp(x))^-1, where x is the PMI score between two words. We use the InfoMap clustering algorithm (List et al., 2016) for the purpose of identifying cognate clusters.

3.2.2 Bayesian phylogenetic inference

Bayesian phylogenetics originated in evolutionary biology and works by inferring the evolutionary relationships (trees) between the DNA sequences of species. The same method can be applied to binary traits of species (Yang, 2014). A binary trait is typically the presence or absence of an evolutionary character in a biological organism. Computational biologists employ a probabilistic substitution model θ that models the transition probabilities from 0 → 1 and 1 → 0. The substitution matrix is a 2 × 2 matrix in the case of a binary data matrix.

An evolutionary tree that explains the relationship between languages consists of a topology (τ) and branch lengths (T). The likelihood of the binary data given a tree is computed using the pruning algorithm (Felsenstein, 1981). Ideally, identifying the best tree would involve exhaustive enumeration of the trees and calculation of the likelihood of the binary matrix for each tree. However, the number of possible binary tree topologies grows factorially ((2n - 3)!!, where n is the number of languages) and is hence intractable even for a small number (20) of languages. The inference problem is then to estimate the joint posterior density of τ, θ, and T.

The Bayesian phylogenetic inference program (MrBayes;[4] Ronquist and Huelsenbeck (2003)) requires a binary matrix (languages × number of clusters) of 0s and 1s, where each column shows whether a language is present in a cluster or not. The cognate clusters are converted into a binary matrix of 0s and 1s in the following manner. A word for a meaning belongs to one or more cognate sets. For example, in the case of German, Swedish, and Hindi, the German word for dog, 'Hund', and the Swedish 'hund' would belong to the same cognate set, while Hindi 'kutta' would belong to a different category. The binary trait matrix for these languages for the single meaning dog would be as in table 2.

                  Cognate set 1   Cognate set 2
German 'Hund'           1               0
Swedish 'hund'          1               0
Hindi 'kutta'           0               1

Table 2: Binary matrix for the meaning "dog".

A Bayesian phylogenetic analysis employs a Markov chain Monte Carlo (MCMC) procedure to navigate the tree space. In this paper, we ran two independent runs until the trees inferred by the two runs did not differ beyond a threshold of 0.01. In summary, we ran both chains for 4 million states and sampled trees at every 500 states to avoid auto-correlation. Then, we threw away the initial one million states as burn-in and generated a summary tree of the post-burn-in runs (Felsenstein, 2004). The summary tree consists of only those branches which occur more than 50% of the time in the posterior sample, which consists of 25,000 trees.

[4] http://mrbayes.sourceforge.net/

4 Results

In this section, we present visualizations of the differences in the language area using MDS and noisy clustering.

4.1 String edit distance

In the left map in figure 2, the first three MDS dimensions are mapped to RGB color space, visualizing the differences between the locations. Note that the major dialectal differences outlined in table 1 are visible in this visualization. For example, the magenta and yellow-green regions separate the Bison Horn Maria and the Hill Maria groups from the surrounding areas with sharp contrasts. The original linguistic distances and the distances based on the first three MDS dimensions correlate with r = 0.90, hence retaining about 81% of the variation in the original distances. The middle map in figure 2 displays only the first dimension, which seems to represent a difference between north and south.
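Returning to the automatic cognate detection pipeline of section 3.2.1, steps 1 and 2 of the PMI estimation can be sketched as follows. For brevity, this sketch uses a single linear gap penalty instead of the affine gap-opening/extension penalties of footnote 3, treats words as plain symbol strings rather than IPA segment sequences, and uses invented word pairs; it is an illustration of the technique, not the authors' implementation:

```python
# Sketch of PMI estimation: align plausible cognate pairs with a vanilla
# Needleman-Wunsch algorithm, then score segment pairs with equation (1)
# from relative alignment frequencies. Linear (not affine) gap penalty.
import math
from collections import Counter

def nw_align(a, b, match=1.0, mismatch=-1.0, gap=-1.5):
    """Needleman-Wunsch alignment; returns aligned symbol pairs, gaps as None."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs, i, j = [], n, m          # traceback from the bottom-right cell
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    while i > 0:
        pairs.append((a[i - 1], None)); i -= 1
    while j > 0:
        pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]

def pmi_scores(word_pairs):
    """One PMI estimation pass (equation 1) over aligned non-gap segments."""
    joint, marginal, total = Counter(), Counter(), 0
    for w1, w2 in word_pairs:
        for x, y in nw_align(w1, w2):
            if x is None or y is None:
                continue            # gap alignments carry no segment pair
            joint[x, y] += 1
            marginal[x] += 1
            marginal[y] += 1
            total += 1
    scores = {}
    for (x, y), c in joint.items():
        p_xy = c / total
        p_x = marginal[x] / (2 * total)
        p_y = marginal[y] / (2 * total)
        scores[x, y] = math.log(p_xy / (p_x * p_y))
    return scores

# invented toy cognate pairs, not Gondi data
pmi = pmi_scores([("kala", "kara"), ("kala", "kala"), ("pala", "para")])
```

Steps 3 and 4 of the procedure would then re-run the alignment with the estimated scores in place of the fixed match/mismatch scores, iterating to convergence.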

Figure 2: MDS analysis performed by Gabmap with string edit distance. The left map shows the first three MDS dimensions mapped to RGB color space. The middle map shows only the first dimension, and the right map shows the second MDS dimension. The first three dimensions correlate with the original distances with r = 0.73, r = 0.55 and r = 0.41, respectively, and the three dimensions combined with r = 0.90.
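The MDS-to-color mapping behind these maps can be illustrated with a small sketch: each site's first three MDS coordinates are rescaled to [0, 255] and read as an RGB color, so sites that are close in MDS space receive similar colors. Gabmap's exact scaling is internal to the tool, so the min-max rescaling and the coordinates below are assumptions for illustration:

```python
# Sketch: map three MDS coordinates per site to an RGB hex color via
# per-dimension min-max rescaling. Coordinates below are hypothetical.

def mds_to_rgb(coords):
    """coords: {site: (d1, d2, d3)} -> {site: '#rrggbb'}."""
    lo = [min(c[i] for c in coords.values()) for i in range(3)]
    hi = [max(c[i] for c in coords.values()) for i in range(3)]

    def scale(v, i):
        span = hi[i] - lo[i] or 1.0          # guard against a flat dimension
        return round(255 * (v - lo[i]) / span)

    return {site: "#%02x%02x%02x" % tuple(scale(c[i], i) for i in range(3))
            for site, c in coords.items()}

colors = mds_to_rgb({"gdh": (0.9, 0.1, 0.0),   # hypothetical coordinates
                     "rui": (0.8, 0.2, 0.1),
                     "met": (-0.7, -0.3, 0.2)})
```

With these toy coordinates, the two northern sites get similar warm colors while `met` comes out at the opposite corner of the color cube.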


Figure 3: Clustering analysis performed by Gabmap with string edit distance using Ward's method; the colors in the map indicate 5 dialect groups. The probabilistic dendrogram is from the default Gabmap analysis (string edit distance).

On the other hand, the right map (second MDS dimension) seems to indicate a difference between Bison Horn Maria (and to some extent Muria) and the rest.

The clustering results are also complementary to the MDS analysis. The 5-way cluster map presented in figure 3 indicates the expected dialect groups described in table 1. Despite some unexpected results in the detailed clustering, the probabilistic dendrogram presented in figure 3 also shows that the main dialect groups are stable across the noisy clustering experiments. For instance, the Bison Horn Maria group (bhm, bhs, bhb) presented in the top part of the dendrogram forms a very stable group: these locations are clustered together in all the noisy clustering experiments. Similarly, the next three locations (mso, mud, mlj, belonging to the Muria area) also show a very strong internal consistency, and combine with the Bison Horn Maria group in 72% of the noisy clustering experiments. However, other members of the Muria group (mdh, mku, ktg, gok, at the bottom of the probabilistic dendrogram) are often placed apart from the rest of the group.

4.2 Binary distances

Next, we present the MDS analysis based on lexical distances in figure 4. For this analysis, we identify cognates for each meaning (cf. section 3.2.1), and treat the cognate clusters found in each location as the only (categorical) features for analysis. The overall picture seems to be similar to the analysis based on the phonetic data, although the north-south differences are more visible in this analysis. Besides the first three dimensions (left map), both the first (middle map) and second (right map) dimensions indicate differences between north and south.
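The binary distance used in this analysis is easy to make concrete: each site becomes a 0/1 vector of cognate-cluster memberships (as in table 2), and site-site distance is the normalized Hamming distance, as Gabmap uses for binary data. The four-cluster membership vectors below are invented for illustration:

```python
# Sketch of the categorical (binary) distance of section 4.2: sites as 0/1
# cognate-cluster membership vectors, compared by normalized Hamming distance.
# The membership vectors are hypothetical, not derived from the Gondi data.

def hamming(u, v):
    """Proportion of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

sites = {
    "gdh": [1, 0, 1, 0],
    "rui": [1, 0, 1, 1],
    "met": [0, 1, 0, 1],
}
# one distance per unordered site pair
dist = {(a, b): hamming(sites[a], sites[b])
        for a in sites for b in sites if a < b}
```

On the toy vectors, `gdh` and `rui` (both Northwest sites in table 1) are close at distance 0.25, while `gdh` and `met` disagree on every cluster.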

Figure 4: MDS analysis performed by Gabmap with categorical differences. The left map shows the first three MDS dimensions mapped to RGB color space. The middle map shows only the first dimension, and the right map shows the second MDS dimension. The first three dimensions correlate with the original distances with r = 0.77, r = 0.53 and r = 0.41, respectively, and the three dimensions combined with r = 0.94.


Figure 5: The dendrogram shows the results of the hierarchical clustering (left) based on the binary matrix. The probabilistic dendrogram is from the Gabmap analysis with Hamming distances.

The left map of figure 4 shows that there is a gradual transition from the Northern dialects (gtd, gkt, prg, ggg, khu, bhe, gcj, pmd, psh, pkh) to the rest of the northern dialects that share borders with the Muria and Southern dialects. There is a transition from the Southern dialects to the Hill Maria dialects, where the Hill Maria dialects do not show much variation.

The clustering analysis of the binary matrix from the cognate detection step is projected onto the geographical map of the region in figure 5. The map retrieves the five clear subgroups listed in table 1. Then, we perform a noisy clustering analysis of the Hamming distance matrix, which is shown in the same figure. The dendrogram places the Bison Horn Maria dialects (bhm, bhs, bhb) along with the eastern dialects of the Muria subgroup. It also places all the Northern Gondi dialects into a single cluster with high confidence. The dendrogram likewise places all the southern dialects into a single cluster. On the other hand, the dendrogram incorrectly places the Hill Maria dialects along with the western dialects of the Muria subgroup. With slight variation in the details, the cluster analysis and the probabilistic dendrogram presented in figure 5 are similar to the analysis based on phonetic differences.

4.3 Autoencoder distances

The MDS analysis of the autoencoder-based distances is shown in figure 6. The RGB color map of the first three dimensions shows the five dialect regions. The figure shows a clear boundary between the Northern and Southern Gondi dialects. The map shows the Bison Horn Maria region in a distinct blue color that does not show much variance. The autoencoder MDS dimensions correlate the highest with the autoencoder distance matrix. The first dimension (middle map in figure 6) clearly distinguishes the Northern dialects from the rest. The second dimension distinguishes the Southern Gondi and Muria dialects from the rest of the dialects.

Figure 6: MDS analysis performed by Gabmap with autoencoder differences. The left map shows the first three MDS dimensions mapped to RGB color space. The middle map shows only the first dimension, and the right map shows the second MDS dimension. The first three dimensions correlate with the original distances with r = 0.74, r = 0.57 and r = 0.49, respectively, and the three dimensions combined with r = 0.92.

The clustering analysis of the autoencoder distances is projected onto the geographical map in figure 7. The map retrieves the five subgroups in table 1. The noisy clustering clearly puts the Bison Horn Maria group into a single cluster. It also places all the northern dialects into a single group with 100% confidence. On the other hand, the dendrogram splits the Southern Gondi dialects into eastern and western parts. The eastern parts are placed along with the Hill Maria dialects. The clustering analysis also splits the Muria dialects into three parts. However, the dendrogram places gok (an eastern Muria dialect) incorrectly with Far Western Muria (mku).

Figure 7: Probabilistic dendrogram from the Gabmap analysis with autoencoder distances. The clustering result is similar to the left map in figure 3.

Figure 8: The majority consensus tree of the Bayesian posterior trees.

4.4 Bayesian analysis

The summary tree of the Bayesian analysis is shown in figure 8. The figure also shows the percentage of times each branch occurs in the posterior sample of trees. The tree clearly divides the Northwest Gondi from the Southeast Gondi groups. The tree places all the Northern Gondi dialects into a single group in 99% of the trees. The southern dialects are split into two different branches, with rui, dog, and gki branching from the common Northwest Gondi branch later than the rest of the Southern Gondi dialects. The tree clearly splits the Hill Maria dialects from the rest of the Southeast Gondi dialects. The tree also places all the Bison Horn Maria dialects into a single group, but does not separate them from the rest of the Muria dialects.

5 Conclusion

In this paper, we performed an analysis of the Gondi dialects using tools from dialectometry and computational historical linguistics. The dialectometric analysis rightly retrieves all the subgroups in the region. However, the edit distance and autoencoder distances differ in the noisy clustering analysis. On the other hand, the noisy clustering analysis of the binary cognate matrix yields the best results. The Bayesian tree based on the cognate analysis also retrieves the top-level subgroups correctly, but does not clearly distinguish the Bison Horn Maria group from the Muria dialects. As a matter of fact, the Bayesian tree agrees the most with the gold standard classification from Glottolog.

The contributions of the paper are as follows. We digitized a multilingual lexical word list for 46 dialects, applied both dialectometric and phylogenetic methods to the classification of the dialects, and find that the phylogenetic methods perform the best when compared to the gold standard classification.

Acknowledgments

The authors thank the five reviewers for the comments which helped improve the paper. The first and third authors are supported by the ERC Advanced Grant 324246 EVOLAEMP, which is gratefully acknowledged. The code and the data for the experiments are available at https://github.com/PhyloStar/Gondi-Dialect-Analysis

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

David K. Beine. 1994. A sociolinguistic survey of the Gondi-speaking communities of central India. Master's thesis, San Diego State University, San Diego.

Thomas Burrow and S. Bhattacharya. 1960. A comparative vocabulary of the Gondi dialects. Journal of the Asiatic Society, 2:73-251.

François Chollet. 2015. Keras. GitHub repository: https://github.com/fchollet/keras.

Joseph Felsenstein. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17(6):368-376.

Joseph Felsenstein. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.

Umamaheshwar Rao Garapati. 1991. Subgrouping of the Gondi dialects. In B. Lakshmi Bai and B. Ramakrishna Reddy, editors, Studies in Dravidian and general linguistics: a festschrift for Bh. Krishnamurti, pages 73-90. Centre of Advanced Study in Linguistics, Osmania University.

Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435-439.

Bradley Hauer and Grzegorz Kondrak. 2011. Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 865-873, Chiang Mai, Thailand, November. Asian Federation of Natural Language Processing.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Gerhard Jäger. 2013. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change, 3(2):245-291.

Grzegorz Kondrak. 2009. Identification of cognates and recurrent sound correspondences in word lists. Traitement Automatique des Langues et Langues Anciennes, 50(2):201-235, October.

Therese Leinonen, Çağrı Çöltekin, and John Nerbonne. 2016. Using Gabmap. Lingua, 178:71-83.

Johann-Mattis List, Philippe Lopez, and Eric Bapteste. 2016. Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599-605, Berlin, Germany, August. Association for Computational Linguistics.

Johann-Mattis List. 2012. LexStat: Automatic detection of cognates in multilingual wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 117-125, Avignon, France, April. Association for Computational Linguistics.

Saul B. Needleman and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453.

John Nerbonne, Peter Kleiweg, Wilbert Heeringa, and Franz Manni. 2008. Projecting dialect differences to geography: Bootstrap clustering vs. noisy clustering. In Christine Preisach, Lars Schmidt-Thieme, Hans Burkhardt, and Reinhold Decker, editors, Data Analysis, Machine Learning, and Applications. Proc. of the 31st Annual Meeting of the German Classification Society, pages 647-654, Berlin. Springer.

John Nerbonne, Rinke Colen, Charlotte Gooskens, Peter Kleiweg, and Therese Leinonen. 2011. Gabmap - a web application for dialectology. Dialectologia, Special Issue II:65-89.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175-198.

Sebastian Nordhoff and Harald Hammarström. 2011. Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In Proceedings of the First International Workshop on Linked Science, volume 783, pages 53-58.

Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25-32, Osaka, Japan, December. The COLING 2016 Organizing Committee.

Taraka Rama. 2015. Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1227-1231.

Fredrik Ronquist and John P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572-1574.

Ian Smith. 1991. Interpreting conflicting isoglosses: Historical relationships among the Gondi dialects. In B. Lakshmi Bai and B. Ramakrishna Reddy, editors, Studies in Dravidian and general linguistics: a festschrift for Bh. Krishnamurti, pages 27-48. Centre of Advanced Study in Linguistics, Osmania University.

Martijn Wieling, Jelena Prokić, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education, pages 26-34. Association for Computational Linguistics.

Ziheng Yang. 2014. Molecular evolution: A statistical approach. Oxford University Press, Oxford.