Linguistic Homoplasy and Phylogeny Reconstruction. The Cases of Lezgian and Tsezic Languages (North Caucasus)
Alexei Kassian (Institute of Linguistics of the Russian Academy of Sciences), 21 October 2014

The paper deals with the problem of linguistic homoplasy (parallel or back developments): how it can be detected, what kinds of linguistic homoplasy can be distinguished, and which kinds are more deleterious for language phylogeny reconstruction. It is proposed that language phylogeny reconstruction should consist of two main stages. First, a consensus tree is built from high-quality input data with the help of the main phylogenetic methods (such as NJ, Bayesian MCMC, MP), and ancestral character states are reconstructed, which allows us to reveal a certain number of homoplastic characters. Second, after these homoplastic characters are eliminated from the input matrix, the consensus tree is compiled again. It is expected that, after homoplastic optimization, individual problem clades can be better resolved and that the homoplasy-optimized phylogeny should generally be more robust than the initially reconstructed tree. The proposed procedure is tested on the 110-item Swadesh wordlists of the Lezgian and Tsezic groups. The Lezgian and Tsezic results generally support the theoretical expectations. The Minimal lateral network method, currently implemented in the LingPy software, is a helpful tool for linguistic homoplasy detection.

1. How to reveal homoplasy
2. What kind of homoplasy is more deleterious?
3. Data
4. Phylogenetic methods
5. Lezgian case
6. Tsezic case
7. Conclusions
8. References

1. How to reveal homoplasy [1]

[1] This section partially overlaps with List et al. 2014b. List et al. focus on loanword detection, but borrowings of any kind can be formally treated as a particular case of homoplasy.

Homoplasy is parallel or back (reverse) development arising in the evolutionary process. This phenomenon perturbs the input data and makes it difficult to produce a robust phylogenetic tree of a language family. In some cases, intensive homoplasy makes it impossible to reveal the true phylogeny.

A good indicator of the potential presence of secondary, i.e., homoplastic, matches between two lects is a situation in which the lexicostatistical distances between the lects involved do not fulfil the condition of additivity. The distances in Fig. 1a–b are more normal for natural language evolution than those in Fig. 1c–d (in Fig. 1a, the lects L2 & L3 form a distinct clade; in Fig. 1b, the lects L1, L2 & L3 form a ternary node).

As concerns lexicostatistics and the Swadesh wordlist, there are different views on the rate of cognate replacement.
For example, the original idea of Morris Swadesh (Swadesh 1952; Swadesh 1955; Lees 1953) was that cognate replacement within the basic vocabulary can be described by the strict clock model (evolutionary rates across lineages are constant or nearly constant). The linguistic data collected by the Moscow school (the Tower of Babel and Global Lexicostatistical Database projects) generally conform to this approach, although with certain “relaxing” improvements proposed by Sergei Starostin (S. Starostin 1989/2007; S. Starostin 1999/2000; Novotná & Blažek 2007; Balanovsky et al. 2011). On the other hand, a number of scholars prefer to apply the relaxed molecular clock model to language evolution, implying that the mean rate of lexical replacement varies among branches (e.g., Gray & Atkinson 2003; Kitchen et al. 2009). In any case, it is unlikely that the mean rate of basic vocabulary replacement varies in practice within a very large range (except for some rare special cases such as Icelandic). Thus, the pairs L1-L2 or L2-L3 in Fig. 1c and L1-L2 or L1-L3 in Fig. 1d are suspected to have secondary matches.

Fig. 1. Reverse distances between three lects (L1, L2, L3). A higher percentage of shared character states means greater closeness. (a) L2 & L3 are close to each other, both are equally remote from L1; (b) the three are equally distant from each other; (c) the three distances are not equal to each other; (d) L2 & L3 are remote from each other, both are equally close to L1.
[Figure: four triangle diagrams (a)–(d) with the lects L1, L2, L3 at the vertices and the percentage of shared character states (40%, 50% or 60%) on each edge.]

A more difficult task is to detect exactly which characters are homoplastic. The original linguistic dataset represents a multistate matrix (for matrix compilation, see, e.g., Atkinson & Gray 2006: 93–94). If we are dealing with lexical characters (lexicostatistics), synonyms, i.e., more than one word in one slot, are almost inevitable. To the best of my knowledge, Starling (S.
Starostin 1993/2007; Burlak & Starostin 2005: 270 ff.) is the only phylogenetic software which is able to process input matrices with synonyms: when the same Swadesh slot is occupied by more than one word, i.e., by several synonyms, all possible pairs of the words involved are compared between two languages within this slot, and if there is at least one matching pair, Starling treats the whole slot as a match. In order to make the dataset importable into the most popular phylogenetic packages, Gray & Atkinson 2003 and Atkinson & Gray 2006 proposed converting the original multistate matrix into binary format. Binarization is the coding of the presence (“1”) or absence (“0”) of a specific proto-root with a specific Swadesh meaning in the given language; Swadesh items superseded by loanwords or simply not documented are marked as “?”. (The difference between this procedure, accepted in the Global Lexicostatistical Database project, and the conversion described in Atkinson & Gray 2006 is that Atkinson and Gray treat loanwords as full-fledged items with distinct cognate indices.) It remains unclear how seriously such a conversion corrupts the input data and causes model misspecification (cf. Barbançon et al. 2013: 164), but to date all available tests suggest that the phylogenetic results of a multistate matrix and its binary counterpart are quite similar if not identical.

Not all homoplastic developments can be revealed. Firstly, some cases of back evolution cannot be detected, at least without extra evidence such as ancient texts or old borrowings in neighboring languages (Fig. 2).

Fig. 2. A character has two states: A, B.
[Tree diagram not reproduced.]

Secondly, parallel evolution within the same clade can hardly be distinguished from evolution of the intermediate ancestral state (Fig. 3).

Fig. 3. A character has two states: A, B.
[Two alternative tree diagrams, not reproduced.]
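The presence/absence coding described earlier in this section can be illustrated with a minimal sketch. It assumes a toy data layout (language, then Swadesh meaning, then a set of cognate-class labels, with None standing for a slot that is a loanword or not documented). The function name `binarize`, the variable `toy`, and all labels are hypothetical and do not come from the Global Lexicostatistical Database or from any phylogenetic package.

```python
from typing import Dict, List, Optional, Set

# Hypothetical layout: language -> Swadesh meaning -> set of cognate-class
# labels (several labels = synonyms). None marks a slot that is occupied by
# a loanword or is not documented.
Wordlist = Dict[str, Dict[str, Optional[Set[str]]]]


def binarize(data: Wordlist) -> Dict[str, List[str]]:
    """Convert a multistate matrix into presence/absence rows."""
    languages = sorted(data)
    meanings = sorted({m for slots in data.values() for m in slots})
    # One binary character per (meaning, proto-root) pair attested anywhere.
    columns = [
        (m, root)
        for m in meanings
        for root in sorted({r for lang in languages
                            for r in (data[lang].get(m) or set())})
    ]
    rows: Dict[str, List[str]] = {}
    for lang in languages:
        row = []
        for meaning, root in columns:
            slot = data[lang].get(meaning)
            if slot is None:
                row.append("?")  # loanword or undocumented slot
            else:
                row.append("1" if root in slot else "0")
        rows[lang] = row
    return rows


# Toy example: L2 has two synonyms for "eye" and a loanword for "fire".
toy: Wordlist = {
    "L1": {"eye": {"A"}, "fire": {"C"}},
    "L2": {"eye": {"A", "B"}, "fire": None},
    "L3": {"eye": {"B"}, "fire": {"D"}},
}
matrix = binarize(toy)
for lang, row in matrix.items():
    print(lang, "".join(row))
```

In this sketch synonyms simply yield several “1” cells within one meaning, and a missing slot yields “?” for every proto-root of that meaning. Whether a real project codes these cases in exactly this way is an assumption of the example.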
From the formal point of view, if we have two characters in a multistate matrix, each with at least two states and an equal cost of change between the states (e.g., one has the states A & B, the other the states C & D), and they take all four possible pairs of states in the matrix (“AC”, “AD”, “BC”, “BD”), then these characters are incompatible and at least one of them is homoplastic (see, e.g., Semple & Steel 2003: 69 ff.). In such and some other cases, the reconstructed tree topology can suggest exactly which character is homoplastic (Fig. 4).

Fig. 4. Two incompatible characters. The first character has the states A, B; the second one has the states C, D. The second character demonstrates homoplasy.
[Tree diagram not reproduced.]

As one can see, the reconstructed tree also helps to detect homoplasy within a single multistate character (the so-called “criss-crossed” configuration; Fig. 5).

Fig. 5. A character has two states: C, D.
[Tree diagram not reproduced.]

However, the maximum number of homoplastic developments in a multistate or binary matrix can be revealed if ancestral character states, i.e., character states for the proto-language, are reconstructed. Such a reconstruction is a non-trivial theoretical and practical task (Kassian, Zhivlov & Starostin forth.); in particular, it is impossible without an established phylogenetic tree.

The picture is somewhat different when we are dealing with a binary lexicostatistical matrix, converted from an original multistate matrix, with “1” denoting a marked state of the character and “0” an unmarked one (i.e., “1” = presence, whereas “0” = absence of the specific proto-root with the specific Swadesh meaning in the given language; the so-called presence/absence matrix). Even if there are two incompatible characters in the input matrix which take all four possible pairs of states (“00”, “01”, “10”, “11”), the change 1 > 0 (loss of the root)