INTERNATIONAL JOURNAL OF HUMANITIES AND CULTURAL STUDIES, Volume 3, Issue 1, June 2016, ISSN 2356-5926

North Caucasian languages: comparison of three classification approaches

Valery Solovyev
Kazan State University, Kazan, Russia
[email protected]

Abstract

This paper compares three approaches to reconstructing language evolution trees, using North Caucasian languages as material: the expert approach (the comparative-historical method), lexicostatistics, and the application of phylogenetic algorithms to databases. It is shown that the degree of coherence among different computer solutions is approximately the same as the degree of coherence among expert solutions. A new classification of North Caucasian languages is proposed, obtained by applying the consensus method to different known classifications.

Keywords: North Caucasian languages, phylogenetic algorithms, evolution trees, linguistic databases, consensus method.

http://www.ijhcs.com/index Page 1309

1. Introduction

In recent years comparative linguists have developed language classification methods based on computer-aided calculations of linguistic similarities. Such methods have added substantially to the toolset of comparative linguistics. Methods that use computer programs to construct phylogenetic trees are conventionally called “automated”. The most complete overview of the state of affairs in this area is given in Nichols and Warnow (2008). That overview is concerned both with the comparison of algorithms for constructing trees and with the analysis of attempts to apply them to various language families. In order to determine the possibilities and usefulness of phylogenetic algorithms, it is proposed to test them on data from well-described families with unquestionable structure (a benchmark or Gold Standard) and to compare the trees generated by computational algorithms with those obtained in the traditional manner.
The Indo-European language family has been the subject of the largest number of works applying computational classification methods (Gray & Atkinson 2003; Atkinson & Gray 2006; Nakhleh et al. 2005; Nicholls & Gray 2008; Rexová et al. 2003; Ringe et al. 2002; Nakhleh et al. 2005a). A large number of works have also been devoted to Bantu languages (Holden 2002; Holden & Gray 2006; Rexová et al. 2006; Brown et al. 2008; Serva & Petroni 2008; Bastin 1983; Holden et al. 2005; Marten 2006). Several papers focus on Austronesian and Papuan languages (Gray & Jordan 2000; Dunn et al. 2005; Donohue & Musgrave 2007; Dunn et al. 2007; Saunders 2005). Finally, some papers look at Native American languages (Wichmann & Saunders 2007; Cysouw et al. 2006; Brown et al. 2008).

In the view of Nichols and Warnow (2008: 814), much of this work is somewhat disappointing: “One of the main observations in the studies reviewed here is that trees obtained for the same language family but using different datasets and/or different methods can differ in substantial ways … while the development of methods for phylogenetic estimation in linguistics is exciting, we still do not have evidence that any of these methods is capable of accurate estimation of linguistic phylogenies.” Thus, it is necessary to develop new sets of empirical data and to improve both the models of language evolution and the phylogenetic methods.

Let us point out that while there are more than 300 language families in the world, computational phylogenetic methods have been applied to only a minority of them. It makes sense to extend the set of families on which phylogenetic methods are tested. Different language families evolve under different conditions. For example, Indo-Europeans populated vast territories, migrated frequently, and established many contacts with other peoples.
In contrast, the peoples of the North Caucasus live in an extremely limited territory with specific conditions of communication (mountains and gorges complicate contacts), and have occupied this region for a long time without substantial resettlement. It is quite probable that such differences have correlates in the way that phylogenies evolve. It is in any case important to extend case studies of phylogenetic methods to new families.

In this paper we consider North Caucasian languages, a family for which we have high-quality sets of data and a well-studied phylogenetic structure. The situation with the classification of North Caucasian languages is approximately the same as that of Indo-European. There is a long tradition of studies of these languages, and researchers tend to concur in their opinions on basic issues. Higher-level subgroups are generally accepted, although controversy does persist in some cases. Some clades at the lower classificatory levels are also not well established.

There are some recent works on the application of computational methods to the classification of North Caucasian languages (Koryakov 2006; Kassian 2015). Kassian (2015) compares the application of six phylogenetic algorithms to the classification of the Lezgian subgroup. In all cases 110-item Swadesh lists constitute the input.

Phylogenetic methods can be used to solve problems other than the reconstruction of trees. For example, in Shijulal et al. (2011) these methods are applied to the investigation of borrowings; however, such applications are not yet widespread. An overview of such problems and approaches can be found in Dunn (2014).

The aim of the present paper is to discuss and analyse previously published classifications of North Caucasian languages. Four of them were obtained by computational methods.
Two of these four classifications were developed within the framework of the Automated Similarity Judgment Program (ASJP) (Müller et al. 2013; Jäger 2013). A third was developed within the framework of the Global Lexicostatistical Database (GLD) (Starostin 2015). The fourth tree is the result of a more traditional lexicostatistical method (Burlak 2005). In addition, seven expert classifications are considered: Ethnologue (Lewis 2013); Haspelmath et al. (2005); Ruhlen (1987); Schulze (2014); Alekseev (2001); Burlak (2005); and Diakonov & Starostin (1988), as well as other works of relevance for the classification of North Caucasian languages (Nichols 2003; Nikolaev & Starostin 1994; Kassian 2015; Talibov 1980). Thus, all modern classifications of North Caucasian languages are taken into account.

When comparing the classifications obtained automatically, attention is focused on comparing the datasets used rather than the algorithms for constructing trees. We are interested only in the structure of the trees (pure topologies), not in the time of divergence of languages (branch lengths). All the databases used are lexical ones. Rama & Kolachina (2012) give an overview of the application of phylogenetic algorithms to typological databases. However, there are no publications in which typological databases are applied to North Caucasian languages. Moreover, a comparison of the results of applying the neighbour-joining (NJ) phylogenetic algorithm to the typological databases Jazyki Mira (Polyakov & Solovyev 2006) and WALS and to the lexical database ASJP revealed significant advantages of ASJP (Polyakov et al. 2009).

Let us first describe the main differences between the approaches used in the ASJP and GLD projects and in traditional lexicostatistics. In both the ASJP and GLD projects phonetic similarities among languages are determined automatically.
The main shared principles of these approaches are the following: (1) a short list of basic vocabulary is chosen (some variant of the Swadesh list); (2) the phonetic similarity between words representing the same meanings in different languages is determined; (3) an algorithm is applied to construct a language family tree using the phonetic similarity measure. The above-mentioned procedures differ in the selection of basic vocabulary (in GLD it is larger), in the methods of calculating phonetic similarity, and in the algorithms for tree construction. However, both approaches are similar in spirit. In particular, they circumvent some of the ideas peculiar to the comparative method of historical linguistics, including the identification of shared innovations and the establishment of cognates. The differences between them mainly concern details of how phonetic similarities are computed. The only difference between the two ASJP versions lies in the approaches used to calculate distances between languages on the basis of words from the lists of basic vocabulary.

The lexicostatistical method (Burlak 2005) has features that link it both to the two approaches described above and to the comparative method. The etymological approach to establishing cognates links the lexicostatistical method to the comparative method, whereas the use of a restricted set of lexical data (which typically varies from 35 to 200 basic items in different approaches) and the use of computer programs for tree construction are similar to the ASJP and GLD procedures. The three approaches differ in the extent to which expert knowledge is drawn upon and in the computer-aided calculations used. The lexicostatistical method requires cognates to be established and hence relies on non-trivial expertise.
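
The three shared steps listed above (a basic word list, a word-level phonetic similarity, and a tree-building algorithm) can be illustrated with a minimal sketch. It is not the actual ASJP or GLD procedure. It assumes a length-normalized edit distance as a stand-in for the similarity measure and UPGMA (average-linkage clustering) as a stand-in for the tree-building algorithm. The language names and word forms in `TOY_WORDLISTS` are invented for illustration and are not real data. The output is only the list of clades, that is, the pure topology without branch lengths.

```python
from itertools import combinations

# Step 1: a (tiny) basic word list per language.
# ASSUMPTION: invented forms for hypothetical languages, not real data.
TOY_WORDLISTS = {
    "LangA": {"one": "sa", "two": "qwa", "water": "xin"},
    "LangB": {"one": "sa", "two": "qwe", "water": "xin"},
    "LangC": {"one": "ca", "two": "tqo", "water": "psi"},
    "LangD": {"one": "ca", "two": "tqu", "water": "psa"},
}


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def word_distance(a, b):
    """Step 2: edit distance divided by the length of the longer word."""
    return levenshtein(a, b) / (max(len(a), len(b)) or 1)


def language_distance(words_a, words_b):
    """Mean word distance over the meanings attested in both lists."""
    shared = [m for m in words_a if m in words_b]
    if not shared:
        raise ValueError("the two word lists share no meanings")
    return sum(word_distance(words_a[m], words_b[m]) for m in shared) / len(shared)


def upgma_clades(wordlists):
    """Step 3: UPGMA clustering; returns the clades in the order they form."""
    names = sorted(wordlists)
    leaf = {frozenset(p): language_distance(wordlists[p[0]], wordlists[p[1]])
            for p in combinations(names, 2)}

    def cluster_distance(x, y):
        # Average distance over all pairs of languages drawn from x and y.
        return sum(leaf[frozenset((a, b))] for a in x for b in y) / (len(x) * len(y))

    clusters = [frozenset([n]) for n in names]
    clades = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
        clades.append(x | y)
    return clades


if __name__ == "__main__":
    for clade in upgma_clades(TOY_WORDLISTS):
        print(sorted(clade))
```

Because the result is a list of clades, two classifications of the same languages can be compared by intersecting their clade sets, which matches the focus on topologies rather than divergence times. A lexicostatistical variant would replace `word_distance` with an expert cognate judgment (0 for cognate, 1 for non-cognate), which is where the non-trivial expertise enters.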