
From wordlists to proto-wordlists: reconstruction as ʻoptimal selectionʼ

George Starostin*

INTRODUCTION

In several of my previous publications (Starostin 2010, 2013a, 2013b), I have repeatedly stressed the importance of combining, rather than opposing, the classic comparative-historical method, elaborated by several generations of historical linguists over the past two hundred years, and lexicostatistical methodology, originally introduced by Morris Swadesh and his colleagues in the 1950s and once again popular these days due to a massive influx of computational phylogenetic methods from adjacent branches of science. It was with that precise purpose, to integrate the two approaches, that a team of Moscow-based historical linguists set up The Global Lexicostatistical Database (GLD)1, a large-scale project that aims at applying a uniform, maximally formalized lexicostatistical methodology to all the languages of the world in order to arrive at a reasonable genetic classification, while at the same time corroborating the results with knowledge gained from traditional historical linguistics and philology. One of the most important postulates of the GLD is that lexicostatistics, applied on the basis of superficial phonetic comparison of words that belong to thousands of different languages (similar to the procedure currently employed in S. Wichmann's ASJP project, see Holman et al. 2008), will not be of much use in describing language relationships that go back more than several thousand years. In other words, the best way to explore the issue of, say, the external relations of Indo-European on a lexicostatistical basis would be to operate with a single wordlist for Proto-Indo-European, rather than with several hundred wordlists for its daughter languages.
Even more than that — to explore the far less controversial, but almost as problematic issue of the internal classification of Indo-European, it would make more sense to operate with the reconstructed proto-wordlists for Proto-Germanic, Proto-Celtic, Proto-Slavic, etc., than with mass data from modern languages, which clutter the data with large amounts of secondary accumulated ʻnoiseʼ and raise the risks of distorting the resulting classification.

* Russian State University for the Humanities, Moscow; Russian Presidential Academy, Moscow.
1 The GLD currently operates as an autonomous website, subordinate to the larger ʻTower of Babelʼ project, originally launched as a digital hub for etymological databases (http://starling.rinet.ru).

Downloaded from Brill.com09/29/2021 06:39:37AM via free access 178 George Starostin

Consequently, in this paper I will discuss some important, but frequently overlooked or understated details of how one should go about trying to reconstruct what might be called an ʻoptimalʼ proto-wordlist for a protolanguage. By wordlist I typically understand a Swadesh-type wordlist, i.e. a list of basic words whose meanings are relatively well defined2 and in most cases (although there may be solitary exceptions) find precise equivalents in any given language of the world. As a rule, the GLD operates with the regular 100-item Swadesh wordlist or its truncated 50-item version (see Starostin 2013a for details on the validity and usage of the shortened version), but the rules and recommendations presented below would work equally well for any other version, as long as the elements on the wordlist are assigned discrete meanings3.

1. PRELIMINARY REMARKS

Although from the point of view of phylogenetic thinking, step-by-step reconstruction of proto-wordlists would seem to be a logical move, such a procedure is not commonly encountered in historical linguistics. Reconstructing a proto-wordlist is not at all the same thing as reconstructing morphemes, a far more natural occupation for etymologists all over the world. Above all, it presumes that the reconstructed morpheme (or word) (1) will be assigned a concrete, rather than vague, semantic definition, coinciding with the required meaning of the corresponding element on the wordlist (e.g. ʻseeʼ rather than ʻlook, see, observe, examine, thinkʼ; ʻliverʼ rather than ʻk. of internal organ or intestineʼ; ʻredʼ rather than ʻblood, , brightly colouredʼ, etc.); (2) will have a better chance of representing this particular meaning in its basic (most stylistically neutral or frequent) usage on the proto-level than any other reconstructible morpheme (or word).

We should stress that within the context of a lexicostatistical evaluation, the main goal of reconstructing a proto-wordlist is not a thorough and conclusive study of phonetic correspondences, including rare and non-trivial ones (although such a study is always useful). First and foremost, such a reconstruction helps define the optimal probable set of etyma4 that could have served as the main

2 It is, of course, well-known that many of the elements on the original Swadesh wordlist were not all that precisely defined in Swadesh's original works, which gave researchers a solid pretext to replace any of the elements that they deemed uncomfortable for their own purposes. An attempt to eliminate many of these problems by associating various Swadesh items with diagnostic syntactic contexts and semantic commentary has been published as Kassian et al. 2010 and is currently used as a set of guidelines in the construction of the GLD. 3 This paper essentially represents a somewhat reworked and improved English translation of one single subsection in the ʻMethodologyʼ chapter of the author's recent Russian language monograph on the lexicostatistical classification of (Starostin 2013), with the addition of several important details and a special illustrative appendix. 4 We typically define an etymon as a correlated pair that may be reconstructed for a proto-language based on regular reflexes in descendant languages.

carriers for meanings constituting the wordlist on the proto-level. It is sufficient for our starred forms to be phonetically compatible, i.e. cognacy decisions should be made based on transparent argumentation and should not contradict any historical or typological evidence. This circumstance is important inasmuch as the application of this methodology to language units all over the world will inevitably run into situations where there is simply not enough linguistic data to allow for a thorough clarification of all phonetic correspondences between poorly described, extinct, or too distantly related languages. Much more significant is the need to properly account for the distribution of various etyma with identical basic semantics across languages belonging to the investigated taxon. Just as it is very important to minimize (preferably, completely eliminate) synonymy while collecting material for Swadesh-type wordlists of attested languages, reconstruction of the proto-wordlist also demands minimization of available candidates: operating on the principle of uniformitarianism, we have no theoretical reason to assume that lexical synonymy on the proto-level, particularly when it comes to strictly defined basic lexicon items, should have been more rampant in reconstructed proto-languages than it is in their modern descendants5.

To the best of our knowledge, so far there have not been any detailed formalized attempts at describing manual algorithms for the selection of optimal proto-language etyma in comparative-historical linguistics; this can be attributed, on the one hand, to a generally skeptical attitude towards lexicostatistics (for which this procedure is of particular importance), and on the other, to the traditional unpopularity of rigorous semantic reconstruction among historical linguists.
One of the most important exceptions from this overall tendency is a (relatively) recent study by Leonid Kogan (2006), where such an attempt has been undertaken on the basis of Semitic languages; nevertheless, the procedure suggested by Kogan clearly does not exhaust the entire potential of the comparative-historical method and is therefore in need of a series of additions, a brief list of which was suggested in Starostin 2010. In order to make the description of the procedure as transparent and demonstrative as possible, let us define it on a concrete object — a so-called

The term etymon should be technically distinguished from root or stem, since the latter refer primarily to the phonetic shape of the entity, leaving its semantic content relatively vague and ambiguous; that said, it is natural that in many particular contexts all these terms may be partially synonymous and interchangeable.
5 Our views on the issue of synonymy (particularly in the sphere of basic lexicon, where it is easier to operate with discrete, rather than vague and continuous, meanings than in the sphere of cultural lexicon) were originally described in Starostin 2010. Later, in Starostin 2013b: 109-116, it was even more rigidly stated that there are almost no cases of total synonymy (i.e. complete interchangeability of two different words with the exact same meaning in any given context, register, or sociolinguistic situation) that could be reliably attested in any of the world's languages, and that most cases where such a synonymy could be suspected are really cases of technical synonymy (where we simply do not have sufficient data to understand the proper differentiation) or transit synonymy (where we deal with a temporary situation in which the old word is caught in a state of being gradually replaced by the new word).


ʻtaxon of the 1st levelʼ, i.e. language group6, whose lexicostatistical classification, based on the analysis of 100-item wordlists in attested languages, would result in the following tree structure:

[Figure: tree diagram with root *P, intermediate node *P1, minimal nodes *PAB, *PCD and *PEF, and attested languages A, B, C, D, E, F.]

The issue of correlation between chronology and node lengths is of no crucial importance in this context, much like the issue of the exact number of tree levels. Clearly, for a relatively young 1st level taxon a multi-level branching system would be somewhat unrealistic (or unprovable in practice), but even if the analysed taxon were to reveal a more complex branching system, the rules described below would be recursively applicable to any degree of branching (i.e. an etymon reconstructible for level *P1 may be projected upwards as one of the possible candidates for level *P and so on). Likewise, the same rules would be equally applicable for less complex situations than the one illustrated by the tree figure above. Assuming that the original 100-item wordlists have been constructed properly, and that they are already accompanied with cognacy indexes based on phonetic correspondences or phonetic compatibility, the mechanism for the reconstruction of the protowordlist will be defined by two sets of rules: distributional rules, which only take into account the distribution of potential reflexes of protolanguage etyma in daughter languages, and extra-distributional rules that can help clarify the situation in those complicated cases where distributional rules alone are not sufficient to make the best possible choice.

2. DISTRIBUTIONAL RULES

Let us now first list the principal distributional rules of the reconstruction for protolanguage P of an etymon *EP, with the Swadesh meaning S:

D.1. If the meaning S is expressed by the same root in all the languages of the taxon (i.e. EA = EB = ... = EF), the same root in the meaning S should be projected onto the proto-level, i.e. EA = *EP.

6 A ʻlanguage groupʼ, for our purposes, may be defined as a set of genetically related languages whose period of divergence from the common ancestor does not exceed 2,000 – 3,000 years (a number that is usually correlated with no less than 40% of etymological matches on the Swadesh wordlist); as a rule, such a relationship is intuitively obvious for specialists (and sometimes for speakers as well) and completely non-controversial. Typical examples of groups would be Slavic, Germanic, Turkic, Polynesian, etc.


Example: The stem *golva ʻheadʼ is unquestionably reconstructed on the Proto-Slavic level, since regular reflexes of this stem are attested in every Slavic language, without exception. (This rule is essentially the same as the one formulated in Kogan 2006: 465: ʻIf a PS [Proto-Semitic; G. S.] root functions with the same basic meaning in all Semitic languages, there is hardly any reason to doubt that it did so also in the proto-languageʼ.)

D.2. If (a) the meaning S is expressed by the same root in at least one language from each of the minimal nodes of the taxon and (b) the other languages do not show any additional etymological matches between any of the minimal nodes (for instance, EA = EC = EE, but EA ≠ EB ≠ ED ≠ EF), that root is reconstructed in the required meaning S for the proto-language, i.e. EA = EC = EE = *EP. It should be noted that such situations are fairly uncommon on group level (especially if we are talking about the ultra-stable 50-item wordlist); however, as we move on to deeper chronological levels, the chance of encountering them becomes progressively higher.

Example: The transition from two different binary nodes, ʻTokyo dialect / Kagoshima dialectʼ and ʻShuri dialect / Hateruma dialectʼ, to the level of Proto-Japonic7 in regard to the etymon ʻtongueʼ joins together the distantly related dialects of Tokyo (shita) and Hateruma (sɨta); the Kagoshima dialect shows a different entity (bero), and so does the Shuri dialect (šiba). The logical reconstruction for Proto-Japonic is *sita ʻtongueʼ (Starostin S. 1991: 263, 268), even despite the fact that no reliable etymology has been found for either bero or šiba. Theoretically, it is not completely excluded that it is really one of these two forms, and not *sita, that may have functioned as the primary equivalent for ʻtongueʼ in Proto-Japonic.
However, such a hypothesis would have to assume an uneconomical historical scenario: an independent semantic development *sita ʻ?ʼ → ʻtongueʼ in two dialects that are not even areally close to each other. Without any additional arguments in favor of such a scenario, it is automatically precluded from receiving the status of optimal solution.

D.3. If (a) the meaning S is expressed by the same root in at least two languages that represent two different primary nodes of the taxon (i.e. nodes that reflect the first chronological split of protolanguage *P) and (b) this isogloss is unique, i.e. no other distributionally similar isogloss is attested between other languages of the taxon (e.g., EA = EE, but EB ≠ EE, EC ≠ EE, ED ≠ EE, EB ≠ EF, EC ≠ EF, ED ≠ EF), the isogloss in question is projected onto the proto-level in the required meaning S (i.e. EA = EE = *EP). On the chronological level of a group this rule is purely theoretical: situations where the older root is preserved in its original meaning only on the periphery of the language continuum, but in no less than two different points of this periphery, are quite hard to come by empirically. However, as we make the transition from

7 These four dialects have been selected for illustrative purposes (the complete picture for all known dialects of Japanese is, of course, much more complicated). Shuri and Hateruma here represent the Ryukyuan ʻdialectsʼ, recognized today by many linguists as independent languages in their own right, and joined with the Japanese dialects spoken on the main islands of Japan in a language family called ʻJaponicʼ.


1st level taxa to deeper and wider linguistic units, the probability of encountering them also begins to rise.
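
Rules D.1-D.3 lend themselves to a compact procedural reading. The following Python sketch is one possible formalization, not the GLD's actual implementation: the function name `select_etymon`, the node labels, and the representation of cognacy indexes as plain strings are all illustrative assumptions. It also assumes that in the sample tree the intermediate node *P1 dominates *PAB and *PCD, so that the primary nodes are *P1 and *PEF.

```python
from collections import defaultdict

def spread(cognates, nodes):
    """Map each cognate class to the set of nodes in which it is attested."""
    attested = defaultdict(set)
    for node, languages in nodes.items():
        for language in languages:
            attested[cognates[language]].add(node)
    return attested

def select_etymon(cognates, minimal_nodes, primary_nodes):
    """Return (cognate class, rule) for one Swadesh meaning, or (None, None).

    cognates: language -> cognate class of its word for the meaning S
    minimal_nodes / primary_nodes: node label -> list of languages
    """
    # D.1: a single root expresses S in every language of the taxon.
    classes = set(cognates.values())
    if len(classes) == 1:
        return next(iter(classes)), "D.1"

    # D.2: one root is attested in every minimal node, and no other
    # root links two different minimal nodes.
    by_minimal = spread(cognates, minimal_nodes)
    everywhere = [c for c, n in by_minimal.items() if len(n) == len(minimal_nodes)]
    linking = [c for c, n in by_minimal.items() if len(n) > 1]
    if len(everywhere) == 1 and linking == everywhere:
        return everywhere[0], "D.2"

    # D.3: exactly one root links two different primary nodes.
    by_primary = spread(cognates, primary_nodes)
    bridging = [c for c, n in by_primary.items() if len(n) > 1]
    if len(bridging) == 1:
        return bridging[0], "D.3"

    # Competitive or otherwise undecidable on distribution alone.
    return None, None

# Sample tree (hypothetical labels mirroring the tree figure in section 1).
MINIMAL = {"PAB": ["A", "B"], "PCD": ["C", "D"], "PEF": ["E", "F"]}
PRIMARY = {"P1": ["A", "B", "C", "D"], "PEF": ["E", "F"]}

# D.3 situation: EA = EE, every other language has its own root.
d3_case = {"A": "x", "B": "b", "C": "c", "D": "d", "E": "x", "F": "f"}
print(select_etymon(d3_case, MINIMAL, PRIMARY))  # ('x', 'D.3')
```

Applied to the Japonic ʻtongueʼ example, with the two binary nodes serving as both minimal and primary nodes, the same function selects the class shared by Tokyo and Hateruma under D.2. A result of `(None, None)` marks the cases that the distributional rules leave open.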

3. EXTRA-DISTRIBUTIONAL RULES: NON-INTERSECTING COMPETITIVE SITUATIONS

These three distributional rules should suffice to reconstruct the optimal etymon for the proto-language in non-competitive situations, conditions for which have been listed in subpoints (b) of rules D.2 and D.3: isoglosses between languages are either limited to just a single case (D.1, D.2), or to a single parallel between different high-level nodes (D.3; in the case of two different isoglosses (1) EA /= EB/ = EE /= EF/ and (2) EC = ED, only isogloss (1) can lay claim to the status of proto-etymon *EP, since the alternate scenario will be less economical).

In addition to these types of situations, however, quite frequently even on group level (and far more frequently on the level of linguistic families and macro-families) we have to deal with competitive cases, when the distribution of roots across languages alone is insufficient to allow us to make a single choice. These cases may in turn be classified into cases of non-intersecting and intersecting competition. Of these two, the non-intersecting competitive situations, as a rule, tend to be more frequent and naturally expected. In these cases a certain number n of roots that express the same basic meaning S find themselves in complementary distribution across n primary nodes of the tree (usually n = 2, but there are also cases when the ancestral protolanguage rapidly splits into three or even more branches, in each of which the required meaning has its own etymological equivalent, different from all the others).

Example: in regard to the three primary nodes of the Germanic group of languages, the basic meaning ʻmeatʼ is correlated with the phonological equivalent (1) mimz in the Gothic branch, (2) *flaiskaz in the Western branch, (3) *kjɔt in the Northern branch. Clearly, remaining solely within the confines of rules D.1-D.3, we are incapable of selecting the optimal candidate for Proto-Germanic.
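
The notion of non-intersecting competition can be stated as a simple check. The sketch below is a hypothetical illustration (the function name and the uniform weighting are assumptions made here, not part of the procedure as published): it tests whether n distinct roots are in complementary distribution across n primary nodes and, if so, returns them as candidates with equal shares, leaving any finer ranking to further analysis.

```python
from fractions import Fraction

def non_intersecting_candidates(root_by_primary_node):
    """Return {root: share} if every primary node has its own distinct root.

    root_by_primary_node: primary node label -> root expressing meaning S.
    Returns None when the situation is not a non-intersecting competition.
    """
    roots = list(root_by_primary_node.values())
    # At least two branches are needed, and no root may recur across branches.
    if len(roots) < 2 or len(set(roots)) != len(roots):
        return None
    # Assumption: with no further evidence, all candidates weigh the same.
    share = Fraction(1, len(roots))
    return {root: share for root in roots}

# Germanic 'meat': three primary branches, three unrelated roots.
# The branch keys are neutral placeholders.
meat = {"branch_1": "mimz", "branch_2": "*flaiskaz", "branch_3": "*kjɔt"}
candidates = non_intersecting_candidates(meat)
for root, share in candidates.items():
    print(root, share)
```

Keeping all n roots as weighted candidates, rather than returning nothing, preserves the slot for later comparison on a deeper level.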
The simplest and, in a way, the most justified solution would be to admit the impossibility of adequately filling in the corresponding slot in the proto-wordlist, and simply leave it empty. However, such a solution would be directly acceptable only if we have no intention to use the accumulated information on the basic lexicon of the protolanguage on a deeper level of historical-linguistic comparison (e.g. if we are only interested in Proto-Germanic, but not in Proto-Indo-European). In this case it is permissible to simply state the fact that for protolanguage *P we have been able, with the aid of distributional rules D.1-D.3, to recover n elements from the 100-item or the 50-item wordlist, and stop right there; in particular, this is the decision made by L. Kogan for the Semitic material, where he ultimately admits that it is only possible to offer a credible phonetic-semantic reconstruction for 52 out of 100 etyma on the Proto-Semitic 100-item wordlist (Kogan 2006: 483).

However, if our reconstruction of the proto-wordlist on relationship level a (group or family) is needed as an intermediate stage on the way towards reconstructing the proto-wordlist on relationship level a+1 (family or macro-family), it would be risky at best and useless or harmful at worst to restrict ourselves exclusively to etyma that are unambiguously reconstructible on the basis of rules D.1-D.3. To some, such a self-imposed restriction might look like an explicit demonstration of healthy scientific skepticism; I would rather qualify it as a textbook example of hyper-skepticism, which not only blocks us from slipping into the world of subjective personal fantasy, but also prevents us from constructing perfectly realistic and highly probable historical-linguistic scenarios.

The key point here is that, whenever we have to deal with situations like Germanic ʻmeatʼ, our scientific conclusion should not be a negative ʻthe Proto-Germanic slot for ʻmeatʼ in the 100-item wordlist may not be filled with any concrete etymonʼ, but rather a positive ʻthe Proto-Germanic slot for ʻmeatʼ has a comparable probability of being filled with any one of the three roots that are represented in the three primary branches of this groupʼ8. Consciously or subconsciously, the former conclusion would lead us to utterly reject the use of any data on this slot for any of our purposes, be it internal or external reconstruction; the latter, on the other hand, given the proper precautions, would help us reliably expand the material foundation on which we plan to construct the next levels of our classification.

Thus, in the situation when (EA = EB) ≠ (EC = ED) ≠ (EE = EF), the first step is to assume, in the place of a single *EP, three hypothetical entities: EA = EB = *E1P, EC = ED = *E2P, EE = EF = *E3P. One may surmise a priori that all three of them (*E1P, *E2P, *E3P) could represent actual roots or stems that were present in protolanguage *P, but only one of them could represent an etymon that would be eligible for the respective Swadesh slot in the protolanguage9. In many cases, to

8 There is yet another possible (and, occasionally, even provable) outcome, namely that the etymon in the protolanguage was actually represented by a completely different stem, lost (at least in the Swadesh meaning) in all daughter languages. However, the principle of parsimony would seem to recommend avoiding this scenario in practice, provided no specific arguments are available to back it up.
9 The probability of transit synonymy in this situation is very low, since the very idea of transit synonymy presupposes that the replacement of the old etymon by the new etymon is all but inevitable and is merely a matter of time (several generations at most). Reconstruction of transit synonymy would be permitted only in those cases where the old etymon is still being preserved in descendant languages with either transformed semantics, bound (idiomatic) distribution, or stylistically defined archaic connotations (e.g. restricted to ritual / literary contexts, etc.). Thus, for instance, the common Indo-European stem *ōus- ʻearʼ in Proto-Iranian has been generally replaced by the innovative stem *gauša-, derived from the Indo-Iranian verbal root *ghauš- ʻto be heard, to resoundʼ (Rastorguyeva, Edelman 2007: 247). However, in Avestan the new lexeme gaoša- is still found to co-exist with the old dual form uši ʻearsʼ, usually applied to animals or used in the figurative meaning ʻhearing, perceptionʼ (Bartholomae 1961: 414). This implies that at a certain stage in the history of Proto-Iranian it must have gone through transit synonymy (*uš- → *gaoša-), a scenario that agrees reasonably well with the larger Indo-European perspective. There are, however, no straightforward arguments that could convince us that on the eve of the disintegration of Proto-Iranian into several branches its speakers were still actively using the old stem *uš-

evaluate the relative probability for each of these candidates to be selected as the required proto-etymon, it is possible to apply several extra-distributional rules that require additional historical analysis of the material. Although a complete and better formalized set of such rules is yet to be established, even today our accumulated experience of such analysis and linguistic reconstruction allows us to suggest at least a few useful preliminaries.

Of crucial importance in this matter could be successful mutual etymologization of the competing items; in our particular case, this means the discovery of etymological (not lexicostatistical, i.e. semantically different) parallels for stems of the protolanguage *PAB in protolanguages *PCD / *PEF, and vice versa. As a rule, it is not particularly difficult to come up with such parallels on low (group) levels, provided that (a) such parallels really do exist, i.e. the basic etymon was not eliminated completely on the level of a particular branch, but was preserved in a semantically transformed fashion; and that (b) we have enough data at our disposal to prove (a). If mutual etymologization is impossible for some reason, another way to rank our candidates in the order of increasing or decreasing probability to represent the original Swadesh meaning is to rely on certain techniques of internal reconstruction, which can sometimes yield useful hints at whether the item in question is archaic or represents the result of recent linguistic evolution. Finally, a third possible method to eliminate some of the competition requires a careful investigation of the areal connections of the linguistic taxon under analysis.
In our practice of compilation of 100-item wordlists we attempt to uncover and isolate the most obvious borrowings from the very beginning, since exclusion of borrowed items (particularly in cases of mass borrowing) is vital for producing reliable results of lexicostatistical calculations. However, it sometimes makes sense to return to this issue once more when we analyze our competitive cases, because some of the less obvious borrowings (for instance, those having undergone additional phonetic or semantic change already after being borrowed) may be easily missed during the wordlist compilation stage, and any particular competitive case can actually serve as a warning that additional investigation is required.

Let us now list the main set of extra-distributional rules that can be applied to cases of binary oppositions, when, for instance, EA ≠ EB and, consequently, distributional rules alone do not suffice to determine what *E(PAB) must have been. For simplicity's sake, we shall demonstrate these rules on the material of etymologically well studied binary groups, where we only have two languages (or two clusters of closely related dialects, sometimes defined as ʻmacrolanguagesʼ) or where sufficient lexical data are available only for two languages. The rules themselves, however, may be used on a recursive basis.

as the basic equivalent for ʻearʼ; on the contrary, data from all Iranian languages (including Avestan) suggests the opposite. Consequently, if we are constructing a 100-item or a 50-item wordlist for Proto-Iranian, we should only fill in the corresponding slot with *gaoša-; its connection with the semantically close stem *uš- should be defined as a typologically common semantic correlation, but not as ʻsynonymyʼ from a lexicostatistical perspective.


E.1. Suppose that etymon EA shows polysemy, i.e. conveys both the Swadesh meaning S1 and a semantically close meaning S2, while etymon EB is only attested in the meaning S1; additionally, language B has an etymological parallel for EA only in the meaning S2. If so, it is natural to assign the meaning S2 to EA and the Swadesh meaning S1 to EB on the proto-level.

Example: In the Kott language the word halčiːg ʻS1 = nail (of finger or toe)ʼ; ʻS2 = hoofʼ corresponds to Ket qɔĺeś ʻS2 = hoofʼ; the meaning ʻS1 = nailʼ in Ket is expressed by the unrelated word iń (Starostin S. 1995: 195, 304). Theoretically this can be explained in two ways: (a) Proto-Yeniseian had just one stem, *χɔlV[č]iG10, with polysemy ʻnailʼ / ʻhoofʼ, which was subsequently replaced in Ket by the stem iń of unclear origin; (b) Proto-Yeniseian had two stems, *χɔlV[č]iG ʻhoofʼ and *ʔiːń- ʻnailʼ, with subsequent merger in Kott. Without further arguments, scenario (b) should be preferred as the one with more explanatory power.

We should stress that this reconstruction of Proto-Yeniseian *ʔiːń- ʻnailʼ rather than Proto-Yeniseian *χɔlV[č]iG ʻnailʼ should by no means be treated as conclusive (and the same disclaimer should be rightfully applied to most of the other extra-distributional rules below). What we are really talking about here is ranking the possible hypotheses in an optimal order, one that may easily find itself re-arranged as we move on to the next levels of comparison. In general, however, one should expect that optimal arrangements on classificatory level n will usually correlate with the same arrangements on level n+1; classifications where this does not happen will contain an internal conflict that will have to be resolved by special means.

E.2. If EA is attested in the Swadesh meaning S1, but in language B the same root has a non-Swadesh meaning S2, the decision on whether the meaning S1 in A is archaic or not may depend on the data of diachronic semantic typology.
If accumulated evidence shows that the semantic development S1 → S2 is improbable or impossible, one should assume a semantic innovation in A, and vice versa. Example: Mingrelian dudi ʻheadʼ ← Proto-Kartvelian *dud- is opposed to Laz ti id. ← *(s1)taw- (Klimov 1964: 75, 175). Additionally, Laz ti regularly corresponds to the Mingrelian stem ti- that has an entire series of meanings, all of which share the common semantics of ʻchiefʼ, ʻhead of the familyʼ. The metaphorical development ʻheadʼ → ʻchiefʼ is very common and natural among the world's languages, with numerous supporting parallels (cf. Russian глава, French chef, etc.), whereas the reverse, even if it cannot be theoretically excluded once and for all, would still happen to be a rather unique exception. This allows us, even without any additional Kartvelian data, to assume (as a highly probable working hypothesis) that the Zan (Laz-Mingrelian) root *ti- was the basic means of expressing the Swadesh meaning ʻheadʼ in Proto-Zan, and to regard Mingrelian dudi as an innovation. Regarding the latter, in Laz the stem dud- means ʻedge, tipʼ, which could very well represent the primary semantics of

10 Proto-Yeniseian forms are quoted in accordance with Sergei Starostin's reconstructions in Starostin 1995.

this root; once again, the development ʻtipʼ → ʻtop, headʼ in Mingrelian would be quite common and natural. It should, however, be stressed that this particular development, as opposed to ʻheadʼ → ʻchiefʼ, could be easily reversed (ʻheadʼ → ʻtipʼ is every bit as common as ʻtipʼ → ʻheadʼ), and could not be used as an argument to illustrate rule E.2.

E.3. If it can be shown that EA in the Swadesh meaning S is a derived stem, so that its derived character can be demonstrated either on the synchronic level of language A or on the diachronic level of the protolanguage *AB, whereas EB in the same Swadesh meaning S is represented with a simple stem, this should be counted as a significant argument for projecting EB rather than EA onto the proto-level in the basic Swadesh meaning.

Example: the word ʻbloodʼ in the Nilgiri subgroup of the South Dravidian group is expressed by the word netr in the Kota language and the word poːx in the Toda language (Burrow & Emeneau 1984: 335, 482). In Kota as well as on the Proto-Nilgiri level, the word netr is not segmentable and cannot be derived from any other lexeme. The Toda noun poːx, on the other hand, even on the synchronic Toda level may be regarded as a productive formation (through grammatical conversion) from the verbal stem poːx- ʻto flow (out)ʼ. It makes sense to assume that the original meaning of the word poːx may have been ʻflowing bloodʼ (e.g. ʻblood spilled in the process of killing an animalʼ), with subsequent generalization of the meaning, and that, consequently, netr rather than poːx expressed the general (basic) meaning ʻbloodʼ in Proto-Nilgiri.
(Somewhat ironically, subsequent external comparison, as any Dravidologist knows, will show that netr is a derived stem as well, going back to an archaic root concatenation of *ney- ʻoil, fatʼ + *toːr- ʻto flowʼ, but for our current example this higher level etymologization is irrelevant; it does, however, confirm that Kota netr ʻbloodʼ is indeed more archaic in that meaning than Toda poːx). E.4. If EA in its Swadesh meaning S is represented by a stem that cannot be tentatively identified as a borrowing on the basis of specific arguments, whereas for EB one may find such arguments, preference should be given to EA. This easily understandable rule should nevertheless be accompanied with certain detailed comments and restrictions. Experience shows that, when the need arises, just about any etymon that breaks out of the formal historical framework envisioned by the researcher may be branded a potential ʻborrowingʼ — for instance, some form or other that is claimed to violate the system of regular correspondences that the researcher has set up, even if the reliability of that very system still remains questionable. Truly substantial arguments that would allow us to suggest borrowed origins for EB would have to match at least one of the following situations. (1) Language B is situated in the same geographical area as the donor language C (preferably bordering areas, although this is not a strict requirement, since linguistic boundaries change over time); the lexicon of B contains a set of words {E} that coincide phonetically or show regular phonetic correspondences with words of C, however, B and C are not closely genetically related, since they share very few items in their basic lexica; etymon EB belongs to the set {E}. This is an

ideal situation for us to assume that EB was originally borrowed from C and exclude it from our reconstruction. (2) Same situation, except that B and C do not share common linguistic boundaries and are, in fact, rather remote from each other. Such a situation is rare and, if we are talking group level rather than family level, all but impossible, since it would presume that the period of intense contacts between B and C was interrupted by migratory processes (migration of speakers of B or C to a different territory or, alternately, their separation by a third party D, infiltrating the area where contacts between B and C originally took place) that usually require some time. However, on deeper levels of comparison these situations are more common (cf., for instance, multiple Turkic borrowings in Hungarian, expectedly absent from Hungarian's closest relatives, the Ob-Ugrian languages; or, on an even deeper level, Indo-Iranian borrowings in Proto-Fenno-Ugric). (3) The basic lexicon of language B contains a set of words {E} that are phonetically similar to words of language or language group C, but do not show discernible regular phonetic correspondences with this language or language group. If the set {E} has no etymological parallels in languages that are closely related to B, one might assume a hidden substrate-like influence on the part of language group C, including languages that may have become extinct, their traces recoverable only through investigation of such hidden borrowings in unrelated languages such as B. The set {E}, however, must be representative enough to lower or exclude the probability of accidental resemblances between B and C. 
(4) The basic lexicon of language B contains a set of words {E} that may not be transparently etymologized on the basis of comparison with closely related languages (by ʻtransparentʼ we surmise etymology that is not tainted with the assumption of irregular phonetic correspondences or unique semantic developments) and may not be explained by influence on the part of any attested languages that are not closely related to B — that is to say, a set of words that gives the impression of ʻcoming from nowhereʼ. Like (3), such a situation may also imply the presence of a concealed substrate language, but unlike (3), this would have to be a substrate that became extinct without leaving behind any descendants or close linguistic relatives. Examples of such phantom substrates are well known to specialists in various language families all over the world (e.g. the Pre-Indo-European substrate in Greek, or a much less popular substrate in Kwegu, obscuring its genetic status as a Southeast Surmic language11). If the available data are sufficient, situations (1) and (2) are easily identifiable already at the stage of compiling wordlists: in Africa, for instance, such obvious contact scenarios as Berberisms in Songay, Ethiosemitisms in East Sudanic languages, Bantuisms in Khoisan languages, etc., are well known to specialists and commonly accepted as incontestable facts. Situations (3) and (4) are far more controversial, since most of the argumentation in such cases remains indirect by definition.
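The four situations above can be read as a small decision procedure. The following Python sketch is only an illustration: the function name, its boolean parameters, and the reduction of each situation to yes/no answers are assumptions introduced here, and real cases still require the detailed argumentation described in the text.

```python
from typing import Optional


def borrowing_situation(
    resembles_known_source: bool,   # set {E} resembles words of some language or group C
    regular_correspondences: bool,  # the words coincide or correspond regularly with C
    same_area: bool,                # B and C occupy the same or bordering areas
    parallels_in_relatives: bool,   # {E} has etymological parallels in B's close relatives
) -> Optional[int]:
    """Return the number (1-4) of the matching situation, or None if none applies."""
    if resembles_known_source and regular_correspondences:
        # An identifiable donor: adjacent (1) or geographically remote (2).
        return 1 if same_area else 2
    if parallels_in_relatives:
        # Hypothetical simplification: inheritance remains the simpler assumption.
        return None
    # No regular correspondences and no parallels in close relatives:
    # a substrate linked to a known group (3) or a "phantom" substrate (4).
    return 3 if resembles_known_source else 4


# Turkic loans in Hungarian: regular matches, but no shared boundary today.
print(borrowing_situation(True, True, False, False))
```

Situations 3 and 4 are returned only as labels; as the text stresses, they remain controversial because the supporting arguments are indirect.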

11 See Starostin 2013: 125-129 for an explanation of how the pre-Kwegu substrate may be identified on the basis of lexicostatistical calculations.

As long as we remain at the first stage of lexicostatistics — scoring cognacies between attested languages — we are not obliged to make rigorous proclamations on whether such suspicious entries should be considered as borrowed or inherited. Technically, one might simply follow the convention that no suspicious etymon should be marked as a borrowing unless it is possible to specify a precise, non-controversial source of this borrowing. However, all such cases should be remembered, and if their total amount exceeds a certain critical mass (for instance, their exclusion from lexicostatistical calculations as borrowings changes the position of the language on the resulting tree), the respective branch / group should be represented with two alternate tree models, to be verified already on the next level of comparison. (Such an approach would be well justified for linguistic families whose classification remains controversial, e.g. Semitic, where there is no scholarly consensus on the position of Modern South Arabian languages on the overall tree due to strange behaviour of its basic lexicon). However, once we move on to the stage of proto-wordlist reconstruction, a less equivocal decision becomes desirable, as part of the general strategy to minimize pseudo-synonymy on proto-levels. Hence, the last and, in a certain sense, the most important one of the extra-distributional rules: E.5. If EA in the Swadesh meaning S is represented by a stem that has an etymological parallel (with related meaning S1) in B, whereas EB in the same Swadesh meaning S has no identifiable etymological parallels in A, our reconstruction preferences should be given to EA, since under these conditions, EB has a higher chance of being an undisclosed borrowing from an unattested source. Example: in choosing between Kolyma Yukaghir poynǝ- ʻwhiteʼ and Tundra Yukaghir ɲaːwe- id. 
(Nikolayeva 2006: 291, 355) preference will be given to the Kolyma item, since poynǝ- finds an etymological parallel in Tundra Yukaghir poyine- ʻwhitish, off colourʼ (this nuanced semantics is found in Kurilov 2001: 376), whereas Tundra ɲaːwe- has no parallels in Kolyma whatsoever. Although on the whole rule E.5 seems simple and logical, its application may result in internal conflict with rules E.1 and E.2. In these cases, rules E.1 and E.2 should normally take preference, since otherwise we may end up with a contradictory historical scenario. Let us illustrate this with an example where typology of semantic change turns out to be crucial in our decision-making. Suppose that languages A and B are represented with the following two etyma:

      A          B
E1    ʻbloodʼ    ʻoil, fatʼ
E2    –          ʻbloodʼ

In accordance with rule E.5, in choosing the optimal candidate for protolanguage *AB, we have to give preference to etymon E1, which is present in both languages, albeit in different meanings. However, there is a serious typological problem if we assign the meaning ʻbloodʼ to E1 on the proto-level: while the semantic development ʻfat / oilʼ → ʻbloodʼ is not uncommon in the world's languages (cf. the already mentioned example of Dravidian *ney-toːr

ʻbloodʼ as literally ʻoil-flowʼ above), the reverse seems to have so far not been encountered anywhere on the planet12. Consequently, in order to make our reconstruction more natural, it would be preferable to reconstruct E1 in the meaning ʻoil, fatʼ and E2 in the meaning ʻbloodʼ. The plausibility of such a reconstruction will be maximized if subsequent analysis helps us discover another etymon, for instance, one whose meanings are as follows:

      A              B
E3    ʻfat, oilʼ     ʻto smearʼ

In this case our historical scenario gets a logical and elegant conclusion — in all three cases, language B preserves the semantics of the protolanguage, whereas language A shows parallel developments, one metonymical (ʻto smearʼ → ʻfat, oilʼ as ʻsmeared substanceʼ) and one metaphorical-euphemistic (the old ʻfat, oilʼ becomes the new ʻbloodʼ). One might even suggest a relative chronology for these shifts: since complete polysemy ʻfat, oil / bloodʼ will naturally result in communicative problems, the language must first provide for a new, auxiliary synonym for ʻfat, oilʼ, not associated with ʻbloodʼ, before the other shift can safely take place. This means that the general importance of rule E.5, no matter how convenient it would be to apply it in the reconstruction process, should not be overestimated; as a rule, it should be applied last, only after the semantics of the analyzed etyma and of their correlates in closely related languages has been scrupulously analyzed.
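The interaction between rule E.5 and the typology of semantic change in the ʻbloodʼ / ʻoil, fatʼ example can be sketched in Python. This is a minimal illustration under stated assumptions: the tiny catalogue of attested shifts, the data layout, and all function names are hypothetical, and the typological check merely stands in for the semantic rules that the text says should take precedence.

```python
# Hypothetical mini-catalogue of attested semantic shifts (source, target).
# Only 'oil, fat' -> 'blood' is listed; the reverse is treated as unattested.
ATTESTED_SHIFTS = {("oil, fat", "blood")}

# Reflexes of each etymon in languages A and B (None = no reflex).
ETYMA = {
    "E1": {"A": "blood", "B": "oil, fat"},
    "E2": {"A": None, "B": "blood"},
}


def expresses(etymon, meaning):
    """True if the etymon has the given meaning in at least one language."""
    return meaning in ETYMA[etymon].values()


def has_parallel(etymon):
    """E.5 criterion: the etymon has a reflex in both languages."""
    return all(m is not None for m in ETYMA[etymon].values())


def typologically_plausible(etymon, proto_meaning):
    """Each reflex keeps the proto-meaning or derives from it by an attested shift."""
    return all(
        m is None or m == proto_meaning or (proto_meaning, m) in ATTESTED_SHIFTS
        for m in ETYMA[etymon].values()
    )


def choose_by_e5(meaning):
    """Apply E.5 alone: prefer the candidate with a parallel in the other language."""
    candidates = [e for e in ETYMA if expresses(e, meaning)]
    return max(candidates, key=has_parallel)


def choose(meaning):
    """Apply the typological filter first, and E.5 only among the survivors."""
    candidates = [e for e in ETYMA if expresses(e, meaning)]
    plausible = [e for e in candidates if typologically_plausible(e, meaning)]
    return max(plausible or candidates, key=has_parallel)


print(choose_by_e5("blood"))  # E1
print(choose("blood"))        # E2
```

Applied alone, E.5 selects E1; once the unattested shift ʻbloodʼ → ʻoil, fatʼ is ruled out, E2 is selected instead, which mirrors the order of application recommended above.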

4. EXTRA-DISTRIBUTIONAL RULES: INTERSECTING COMPETITIVE SITUATIONS

Let us now consider the most complex of all possible situations: intersecting competition. Turning again to our example tree on p. 180, we realize that up to now we have only been discussing cases where competition was attested between members of binary oppositions, connected by close relationship (for instance, different roots that express the same meaning in languages A and B, making it hard to reconstruct a single etymon for *PAB; or different reconstructions on levels *PAB and *PCD, complicating further reconstruction on proto-level *P1). Let us suppose that EA ≠ EB and EC ≠ ED, but EA = EC. Such a situation would be easily explainable based on simple distributional rules: the original etymon preserved its semantics in one out of two languages in each branch and shifted it

12 It goes without saying that our current empirical knowledge of semantic typology is far from perfect, due both to lack of comprehensive reference bases and to insufficient synchronic and diachronic research on numerous linguistic families around the world. However, this problem is gradually being remedied, and enough knowledge has been extracted from available data for us to be already capable of drawing conclusions based on attested or not attested patterns of semantic change.

to something different (or was altogether eliminated) in the other member of each pair. But what if, at the exact same time, EB = ED?

*P1

*PAB *PCD

A B C D

EAC EBD
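The configuration in the diagram can be recognized mechanically once each language's etymon for a given Swadesh meaning carries a cognate-class label. The Python sketch below is an illustration only; the function name, the string labels, and the returned messages are assumptions introduced here.

```python
def classify_competition(a, b, c, d):
    """Classify the distribution of etyma in A, B (branch *PAB) and C, D (branch *PCD).

    Each argument is a cognate-class label for the same Swadesh meaning.
    """
    if a == b or c == d:
        # At least one branch is internally uniform.
        return "competition in at most one branch"
    shared = {a, b} & {c, d}
    if len(shared) == 2:
        # EA = EC and EB = ED (or the mirror image): the criss-cross case.
        return "intersecting"
    if len(shared) == 1:
        # Exactly one etymon recurs across branches and is projected upwards.
        return "distributional: " + shared.pop()
    return "unresolved"


print(classify_competition("E1", "E2", "E1", "E2"))  # intersecting
print(classify_competition("E1", "E2", "E1", "E3"))  # distributional: E1
```

Only the first outcome calls for the extra-distributional reasoning discussed in this section; the second is settled by the simple distributional rules.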

On group levels, such semantic criss-crossings are encountered fairly rarely, but as we move to deeper levels of comparison, their number tends to increase to such an extent that ignoring the issue becomes untenable (Starostin 2013b). Therefore, it is necessary from the very start to offer both a theoretical explanation and a practical method for dealing with these situations. There seem to be four types of circumstances that may give rise to them. (1) Incorrectly generated tree. Each identified case of semantic criss-crossing on the Swadesh wordlist is a good pretext to check and re-check whether the original wordlists were accurately constructed and whether the original cognation indexes were correctly assigned. Nevertheless, it does not make sense to state that any correctly generated tree (i.e. a tree that claims to reflect, roughly but essentially accurately, the true historical process of language divergence) should be completely free from such intersections: any specialist in the comparative-historical lexicology of at least one large family knows all too well that an ideal tree, all the characters on which are mapped with perfect parsimony, simply does not exist. (2) Synonymy. EAC and EBD used to express the required meaning as complete, freely interchangeable synonyms both at the highest proto-level *P (the common ancestor for all four languages) and at the intermediate proto-levels *PAB and *PCD. Only in the modern languages (all four of them!) was the synonymy eliminated, as each of the four languages discarded one of the synonymous forms and retained the other. From a theoretical / logical point of view, such a scenario would seem tremendously improbable. 
First, it presupposes, for all protolanguage levels, an empirically unrealistic situation of complete synonymy in the sphere of basic lexicon, with the stress on complete, since in our approach, which requires maximal clarification of Swadesh meanings, partial synonymy is not taken into consideration: for each proto-level, it is assumed that one and only one primary equivalent corresponds to every Swadesh meaning. Transit synonymy is also out of the question, because if the old word had already been ʻmarked for deletionʼ in protolanguage *P, its miraculous survival in two out of four daughter languages would have been impossible. Second, it is a tremendously non-parsimonious

scenario, requiring us to assume as many as four independent semantic shifts involving all four languages. As paradoxical as it may sound, in real practice (especially in work on etymological dictionaries) it is precisely this approach to resolving such situations that tends to prevail, something that can only be explained by the general tendency to undervalue semantics in linguistic reconstruction. For instance, if one sees in the Indo-European dictionary by J. Pokorny (1958) something like five or six ʻProto-Indo-Europeanʼ root entries correlated with the exact same meaning, this must not be understood as a statement that Proto-Indo-European really had five or six roots whose meanings were precisely the same. In the majority of cases this is simply pseudo-synonymy, assumed for the sake of freeing the researchers from the necessity to spend extra time and effort on constructing optimal scenarios for the historical development of semantics. Thus, the meaning ʻseeʼ in Pokorny's dictionary is correlated with about a dozen different roots, but the superficial nature of this glossing becomes clear upon detailed analysis of the data, which shows that, for instance, the root *ṷel- really means ʻto seeʼ only in Brittonic languages (Welsh gweled, etc.), as well as, possibly, in Old English wlī-tan (although the detailed vocabulary of Bosworth, Toller 1898: 1259 prefers to render the semantics as ʻto look, gazeʼ rather than ʻto seeʼ); or that the root *sekʷ- only means ʻto seeʼ in Germanic and, possibly, in Hittite (sakuṷai-, but the situation is actually more complex, since this word is formally derived from sakuṷa- ʻeyeʼ), etc. 
Mechanistically projecting all these criss-crossings onto the semantic level of the protolanguage creates the illusion of rampant synonymy, which, in turn, may trigger unwarranted conclusions (such as hypothesizing about fundamental differences between the linguistic structures of Early Neolithic people and modern speakers). Because of this, I believe that the synonymy explanation for semantic intersections should be rejected as not only improbable, but downright harmful for any future development of historical linguistics. Protolanguage synonymy should be taken into consideration as a possibility only when at least one of the attested daughter languages (A, B, C, D) provides us with an explicit argument — that is, displays such a synonymy itself. And in that regard, it must be added that in my entire experience of working with 50-item wordlists for linguistic families of Eurasia and Africa such a situation was not reliably attested for any of the concerned languages (although it becomes somewhat more complex with 100-item wordlists, where one has to deal with less transparent semantic definitions for many of the less stable Swadesh items). (3) Previously unidentified areal convergence. It is theoretically possible that, for instance, languages B and D, despite phylogenetically belonging to different branches of the group / family, came into contact with each other through migrations of their speakers, so that the old vertical connections between their elements became complemented with new horizontal ties. In this case their common lexical element EBD may be either a straightforward loan (from B into D or vice versa) or the result of a common areal semantic innovation: a certain stem, inherited from their common protolanguage but not possessing a Swadesh meaning either in *PAB or in *PCD, acquired it in B (or D), after which the

corresponding stem in D (or B) underwent the same transformation under the influence of intense contacts with neighboring speakers (following a wave model of lexical diffusion). (4) Independent unidirectional semantic innovations (or, borrowing an interdisciplinary term, semantic homoplasies). This last possibility has until recently remained somewhat underestimated in traditional literature on comparative-historical linguistics, but it has been attracting more and more attention in recent times through the application of biological phylogenetic methods to linguistic data (List et al. 2014, Kassian 2015). These are situations in which etymon E with protolanguage meaning S shifts that meaning to S1 in two (or more) members of the same group / family as a result of similarly acting impulses that are not interdependent, but arise independently of each other in both (all) of the concerned languages. The psychological factor of being subjectively impressed with the very fact of semantic similarity or equality is usually responsible for the treatment of such homoplasies as phylogenetically meaningful isoglosses (e.g. ʻGermanic-Anatolianʼ, ʻCelto-Germanicʼ, ʻAlbanian-Balticʼ, etc.); each such isogloss may be intuitively perceived as non-accidental, and as they accumulate, the seemingly chaotic resulting picture becomes a source of mistrust towards a simple tree-like classification, which cannot account for these conflicting isoglosses. In turn, this leads to all sorts of vague hypotheses like the wave theory, overestimation of the role of dialectal continua, skepticism towards the very notion of proto-language, etc. 
However, it does not take a great leap of faith to understand the natural and possibly even inevitable presence of independent unidirectional semantic innovations if one remembers just this one simple empirical truth: the typological inventory of reliably attested or reliably reconstructed semantic shifts is actually quite small in comparison with the theoretically imaginable range of such shifts. Roughly speaking, the absolute majority of the lexical elements of language (at least, its basic elements), when in some abstract need of shifting their semantics, usually follow typical paths of semantic change, and the average number of typical paths for any element of the basic lexicon hardly exceeds half a dozen. It is certainly true that in each specific case one might meet some exotic development, but on the global scale the percentage of such developments remains negligible. Let us suppose that for the word ʻeyeʼ, the three most probable sources of semantic development are (a) the verbal stem ʻto seeʼ; (b) the verbal stem ʻto shine, give lightʼ; (c) the nominal stem ʻhole, openingʼ; all the other scenarios shall be counted as typologically much less probable (this is an approximation, but one that is not very far removed from the empirical truth). If so, the probability of an independent development of the meaning ʻeyeʼ out of meanings (a), (b), or (c) in different languages from the same family will directly depend on the number of languages in that family: the more languages there are and the more time has passed, the more probable it is that in at least two of them the exact same change will occur. This correlation easily explains the impression that one gets from scrutinizing etymological dictionaries for different families: the deeper we go (from Germanic

to Indo-European, from Indo-European to Nostratic, etc.), the more synonymous semantic definitions one finds in the dictionary entries. These definitions, though, are not the result of semantic reconstruction; at best, they represent an intuitive attempt to find a common semantic denominator for all the meanings that one finds in descendant languages. Let the Welsh and the Old English reflexes of a given Indo-European root be close to (not even fully coincide with) ʻto seeʼ, and we already find ʻto seeʼ assigned as a potential meaning to the root in the protolanguage, and even if the researcher admits the possibility of a semantic homoplasy, in most cases this possibility will probably not be explicitly stated in the dictionary. It is already obvious from this discussion that with the general rejection of the second solution (synonymy) and with the assumption that our data have been checked and re-checked to the fullest, all remaining semantic intersections that prevent us from adequately reconstructing the proto-wordlist must be resolved in terms of solutions (3) (areal contact) and (4) (independent homoplasies). As to the particular method of choosing between the two variants (returning to our original scheme, the question was whether EP should be equated with EAC or EBD), it should not be different from choosing in situations of non-intersecting competition, meaning that one should take into consideration the same extra-distributional factors (polysemy, morphological derivation, mutual etymologization, possibility of being interpreted as a borrowing, etc.). As an example, let us consider the case of Indo-European ʻtoothʼ (Indo-European is a family rather than a group, but in this context, the precise taxonomic status is irrelevant). Most classic sources, including Pokorny's dictionary, quote two main stems in this meaning: *(H)dont- ~ *(H)dn̥t- and *ĝombho-. 
The two are found in criss-crossed distribution regardless of whichever classificatory model of Indo-European we select as optimal, cf.:
a) *(H)d(o)nt-: Sanskrit dan ~ danta-, Iranian *dantan-, Greek ὀδών, Baltic *danti- (Lithuanian dantìs, Prussian dantis), Germanic *tanþu-, Latin dēn-s, Celtic *danto-;
b) *ĝombho-: Tokharian *keme, (?) Sanskrit jambha-, Albanian dhëmb, Slavic *zǫbъ, Baltic *žamb- (Latvian zùobs).
Conflicting distribution is most transparently seen on the material of Balto-Slavic languages: distributional rules suggest *danti- as the optimal form for Proto-Baltic (since this is an isogloss between Lithuanian and Prussian, representing two primary branches); however, Latvian, an unmistakably Baltic language, shows here a common isogloss with Slavic, rather than the rest of the Baltic languages. The simplest thing would be to explain this situation away through synonymy in the protolanguage; yet this conflicts with the fact that there is not a single Indo-European language in which this synonymy would be preserved. It is hardly a coincidence that the only language which might, for a brief while, look like an exception is Sanskrit, where both dan and jambha- are usually translated with the word ʻtoothʼ, because existing contexts do not allow us to establish any transparent functional distribution. However, it is worth noting that (a) both medieval Prakrits and modern Indo-Aryan languages only show reflections of dan(ta)-,

never jambha-, in the meaning ʻtoothʼ; (b) only the etymon jambha-, in addition to the general meaning ʻtoothʼ, is sometimes rendered in dictionaries by more narrow meanings such as ʻeye-tooth, fangʼ (cf. also the derived stem jambhya- ʻincisorʼ); (c) only the stem jambha- may be formally analyzed as a nominal derivative from the verbal root jabh- ʻto snap (the jaws)ʼ (Turner 1966: 282, 283, 352). All these facts indicate that in Old Indian texts as well, this synonymy was never total, but was only limited to certain contexts in which the semantic opposition between dan(ta)- and jambha- could become neutralized without any serious loss of information. If total synonymy as a means to eliminate competition is unacceptable, one of these two stems should have replaced the other in several branches of Indo-European on an independent basis. Parsimony considerations alone would lead us to regard *ĝombho- as innovative, but fortunately, we have an even more convincing argument, namely, polysemy: in all Indo-European languages where it is attested, the primary meaning of *(H)d(o)nt- is always ʻtoothʼ13, whereas *ĝombh(o)- is regularly encountered either as a verbal stem (see Sanskrit jabh- above) or in the general meaning ʻsharp objectʼ (cf. Greek γόμφος ʻpegʼ, Proto-Germanic *kamba-z ʻcombʼ, or Lithuanian žam̃bas ʻsharp objectʼ, ʻedgeʼ, ʻcapeʼ). Considering the semantics of the verbal root in Old Indian, it makes sense to suggest the primary nature of the verb *ĝembh- ʻto snap, bite hardʼ, whence one

13 According to a rather popular opinion, IE *(H)d(o)nt- may be explained as an old form of the active participle from the verb *(H)ed- ʻto eatʼ (which is especially appropriate for supporters of the laryngeal theory, since the reconstruction *Hed- → *Hd-(o)nt- could allegedly help explain the vocalic prothesis in Greek ὀ-δών). However, first of all, this internal etymology is not very convincing, since semantic derivation ʻeatʼ → ʻtoothʼ (*ʻeating /one/ʼ, ʻeaterʼ) is not supported typologically (the typical association of ʻtoothʼ is usually with such verbs as ʻto biteʼ, ʻto chewʼ, ʻto gnawʼ, denoting actions that are by their very nature carried out with teeth); the fact that in Sanskrit, Greek, and other ancient IE languages the word shares the same grammatical properties as active verbal participles need not necessarily be interpreted as evidence that it originally was a participle (any nominal root stem ending in *-nt- would have to be declined the same way as a participle in *-nt-). Second, even if we accept it, it is still an internal etymology, i.e. a speculation on a possible transition from pre-Proto-Indo-European to Proto-Indo-European. No ancient or modern IE language yields any evidence that for its speaker, the nominal stem *(H)d(o)nt- could somehow be associated with the verbal stem *(H)ed- ʻto eatʼ, and we have no solid arguments to suggest such an association on the level of the nearest common ancestor for all these languages. (It was further pointed out to me by an anonymous reviewer that an interesting argument was recently made in Garnier 2011: based on some assumed derivatives of the verbal stem *(H)ed-, the author suggested that the original meaning of the root must have been ʻto rip the flesh out with the teethʼ (ʻarracher la chair avec ses dents, déchiqueterʼ), which would seem to agree much better with the semantic derivation for ʻtoothʼ. 
Unfortunately, not only does the argumentation depend on a highly complicated and ambiguous series of phonetic and semantic assumptions, but the idea is also hard to take seriously because it is unlikely that such a specific, complex meaning (unlike something like ʻto biteʼ or ʻto eatʼ) could have been primary for this simple verbal root; at the very least, it does not seem to find any typological confirmations on IE territory.)

has the derived noun *ĝombho-s ʻ*biter, snapperʼ → ʻfang / incisorʼ, or, figuratively, ʻsharp objectʼ, opposed to the neutral / generic stem *(H)d(o)nt- ʻtoothʼ. With such a correlation of the two words' semantics on the Proto-Indo-European level there is hardly any surprise in the idea that *ĝombho- (possibly as part of a common tendency to replace older words with somewhat jargonized equivalents) may have replaced the old stem *(H)d(o)nt- in three branches of the same family independently of each other. Additionally, areal influence cannot be ruled out: for instance, the situation in Slavic may have served as a trigger for the same change both in Albanian and Latvian, although to suggest the same for Tokharian would be pushing it. In many works on comparative linguistics we sometimes encounter the conception of ʻcentreʼ and ʻperipheryʼ, assumed to have vital importance for the study of the dynamics of linguistic change. According to this conception, retention of archaisms should rather be expected in the peripheral regions of an area occupied by a language family, since it is more natural to expect innovations to originate in the ʻcentreʼ and then gradually diffuse across adjacent dialects, without necessarily reaching the periphery. From this point of view one could try to compare the scenario presented above with its opposite, namely, that *ĝombh(o)-, as a rare stem represented at the periphery rather than the centre of the Indo-European world, has survived in these areas without getting replaced by the centrifugal innovation *(H)d(o)nt-. 
However, even if this model seems logical, credible, and has been backed with substantial evidence, accepting it as a universal guiding principle is no less harmful than uncritically accepting the ʻmaximum diversityʼ hypothesis as indicative of the original homeland of any given linguistic family (another unjustly exaggerated principle, based on premature generalization of certain observations without taking into account a whole series of additional crucial parameters). First, centrifugal innovations are only possible at the earliest stage of existence of a language family, when the recently separated descendants of a formerly single protolanguage still find themselves in a state of linguistic contact with elements of mutual intelligibility. Second, one should naturally expect transparent internal etymologization from these ʻcentrifugal innovationsʼ, since they are generated later than ʻperipheral archaismsʼ. In our case, however, internal (secondary) etymologization is much easier to suggest for *ĝombho- rather than for *(H)d(o)nt-.
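The parsimony argument made for ʻtoothʼ earlier in this section can be illustrated with a small calculation that counts the minimum number of replacements required under each hypothesis about the proto-form. The Python sketch below rests on loudly stated assumptions: the tree is a deliberately simplified, hypothetical topology (a flat top level plus a Balto-Slavic subgroup in which Latvian is grouped with Lithuanian), Indo-Iranian is coded with *(H)d(o)nt- in line with the discussion of Sanskrit above, and the scoring is a generic Sankoff-style count with unit costs rather than a procedure taken from the article.

```python
D, G = "*(H)d(o)nt-", "*ĝombho-"

# Basic word for 'tooth' in each branch or language, following the lists above.
LEAF_STATE = {
    "Indo-Iranian": D, "Greek": D, "Italic": D, "Celtic": D, "Germanic": D,
    "Lithuanian": D, "Prussian": D,
    "Tokharian": G, "Albanian": G, "Slavic": G, "Latvian": G,
}

# Hypothetical simplified tree: a list is an internal node, a string is a leaf.
TREE = [
    "Tokharian", "Albanian", "Greek", "Indo-Iranian", "Italic", "Celtic", "Germanic",
    ["Slavic", ["Prussian", ["Lithuanian", "Latvian"]]],  # Balto-Slavic
]


def cost(node, state):
    """Minimum number of replacements below `node`, given that `node` has `state`."""
    if isinstance(node, str):
        # A leaf is compatible only with its attested state.
        return 0 if LEAF_STATE[node] == state else float("inf")
    # For each child, pick the cheaper option: keep the state or replace it (cost 1).
    return sum(
        min(cost(child, s) + (s != state) for s in (D, G))
        for child in node
    )


print(cost(TREE, D))  # 4
print(cost(TREE, G))  # 7
```

On this toy tree, projecting *(H)d(o)nt- onto the proto-level requires four replacements and projecting *ĝombho- requires seven. The exact figures depend entirely on the assumed topology, which is why the text treats parsimony as only one argument alongside polysemy and derivation.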

5. EXTERNAL COMPARISON AND ITS INFLUENCE ON PROTO-WORDLIST RECONSTRUCTION

Hopefully, the rules and guidelines listed and illustrated above cover most of the difficulties that might arise while attempting to reconstruct a proto-wordlist. However, there is one more important detail that should be mentioned in conclusion. Reconstructing proto-wordlists is a complex procedure that often presupposes the choice of an optimal variant on the basis of a system of filtering rules, transparently applied to each case of branching. More often than not, such an

optimal variant can be established. Even in situations where, upon first sight, arguments pro et contra the two candidates are distributed evenly, there will always be some additional considerations, no matter how indirect (e.g. oddities in the phonological or phonotactic shape of the word, suggesting non-native origin; or the absolute number of languages that retain or innovate it), that will tilt the balance in favour of one over the other. Yet regardless of the relative strength of our distributional and extra-distributional arguments, optimality of our chosen variant should by no means imply elimination of all alternatives. Any reconstruction, no matter how solid, is always an approximation: even if Proto-Slavic *golva ʻheadʼ is established on the basis of perfectly agreeing data from all Slavic languages, this is not equivalent to proof that *golva ʻheadʼ was indeed the basic equivalent for this meaning in Proto-Slavic. In theory, one could imagine a scenario according to which *golva only became this basic equivalent after the disintegration of Proto-Slavic, on the level of daughter protolanguages or dialectal clusters, sometimes independently of each other, sometimes due to areal linguistic interaction. One could only state that material arguments in favour of Proto-Slavic *golva ʻheadʼ significantly exceed arguments in favour of any other candidate (in this particular case, ʻsignificantlyʼ = 100%): this allows us to estimate the probability of any alternate scenario as negligible, and subsequently operate with Proto-Slavic *golva ʻheadʼ as a linguistic element that is almost as much of a reality as its attested reflexes in daughter languages. In competitive situations, however, arguments are distributed differently, and practice shows that it is easy to make a mistake by overrating some of them to the disadvantage of others. 
Therefore, even upon having already made a choice that looks optimal, one must allow for the possibility of changing one's decision in those cases when either internal data (such as our gaining access to relevant, but previously unavailable, linguistic data) or external data (comparison on a higher taxonomic level) come into direct contradiction with previous conclusions. In other words, when we make our optimal choice between etyma E1, E2 ... En, what is meant is not our selecting one and forgetting about all the others, but rather a procedure of ranking all these elements in order of increasing / decreasing probability of their representing the required meaning on the proto-level. The optimal choice is always a working hypothesis, and it should remain as such until it becomes evident that new data which could reduce its rank are undiscoverable (an almost impossible situation, but if our internal choice has been corroborated with external evidence, this means that we are getting close). Thus, returning once again to the already mentioned difficult situation of choice between Gothic mimz, West Germanic *flaiskaz and Scandinavian *kjɔt in the meaning ʻmeatʼ, one might remark that consistent application of our set of rules to these three items will result in our selecting *flaiskaz as the optimal candidate, if only because out of all three stems, only *flaiskaz is reliably reconstructible on the Common Germanic level, since the West Germanic proto-form corresponds to Old Norse flesk ʻpork, baconʼ (Orel 2003: 104; Vries 1962: 130), whereas neither mimz nor *kjɔt show any further parallels in Germanic

Downloaded from Brill.com09/29/2021 06:39:37AM via free access From wordlists to proto-wordlists 197 languages from other branches. As for semantics, the narrowing ʻmeatʼ → ʻporkʼ is just as logical as the extension ʻporkʼ → ʻmeatʼ. If, based on this, we make a final decision to regard *flaiskaz and nothing but *flaiskaz as the Proto-Germanic equivalent for ʻmeatʼ, we will be making a lexicostatistical and a historical error, because the next stage of reconstruction makes it obvious that the main Indo-European equivalent for ʻmeatʼ is the nominal stem *mēms-, of which the regular and logical continuation in Germanic is Gothic mimz. Parsimony as well as the commonly accepted principle of irreversibility of language change (at least when one deals with relatively short time spans) dictate us that it must be the Gothic etymon that should be projected onto the Proto-Germanic level, and it is no wonder that, for instance, V. Orel assuredly reconstructs Proto-Germanic *memzan ʻmeatʼ (Orel 2003: 267) exclusively on the basis of the Gothic form — just because it regularly reflects the Indo-European stem. Therefore, our last recommendation is that reconstruction of the proto-wordlist must necessarily include the option of correction of said reconstruction on subsequent stages of data analysis — provided, of course, that our reconstruction is not a means in itself and that we intend to use it in order to see how our language group or family behaves within a larger taxonomy. The general results of the reconstruction procedure should ideally look thus: 1. 
For each of the analyzed elements of the Swadesh list there should be a ranked list of candidates, in order of increasing / decreasing probability (ʻprobabilityʼ is not understood here in a strictly mathematical sense — at present, it is too difficult to assign a mathematical weight to the entire agglomeration of phonetic, semantic, and distributional arguments; rather, we view the duty of the researcher in providing a detailed and transparent explication of his/her reasons for such a particular ranking). 2. The optimal candidates for each Swadesh list entry should be arranged in an optimal wordlist. My own standard procedure for this (followed in Starostin 2013, etc.) is as follows: a) if optimality of the candidate may be reliably established on the basis of nothing but distributional rules, the corresponding slot in the proto-wordlist database is filled with this and only this candidate; b) if optimality of the candidate is established on the basis of extra- distributional rules, the corresponding slot in the proto-wordlist database is filled with this candidate, while others are being listed in a special commentary to the database; c) if neither distributional nor extra-distributional rules work adequately, the corresponding slot in the proto-wordlist database is left empty14.

14 Actually, for technical reasons, the slot may be filled with a random candidate (with all the others listed in the commentary), or it may be split into several sub-slots in which all the candidates will be treated as technical synonyms.
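The slot-filling procedure in (a) to (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Candidate`, `Slot` and `fill_slot` are hypothetical and do not describe the actual GLD database format. The rank is a plain ordinal, since no mathematical weight is assigned to the arguments, and the decision about which kind of rule established optimality is supplied by the researcher rather than computed. Listing all candidates in the commentary in case (c) is likewise an assumption. The sample data are taken from the Nakh examples in the appendix.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    form: str       # reconstructed or attested form, e.g. "*čo"
    rank: int       # ordinal rank only (1 = most probable), not a numeric probability
    rationale: str  # the researcher's explicit argument for this rank


@dataclass
class Slot:
    meaning: str
    entry: Optional[str] = None                          # form placed in the proto-wordlist
    commentary: List[str] = field(default_factory=list)  # alternatives kept on record


def fill_slot(meaning: str, candidates: List[Candidate], basis: str) -> Slot:
    """Fill one proto-wordlist slot.

    basis is "distributional", "extra-distributional" or "none" and is
    decided by the researcher (hypothetical labels, not GLD terminology).
    """
    ranked = sorted(candidates, key=lambda c: c.rank)  # stable: ties keep input order
    forms = [c.form for c in ranked]
    if not forms:
        return Slot(meaning)
    if basis == "distributional":        # case (a): this and only this candidate
        return Slot(meaning, forms[0])
    if basis == "extra-distributional":  # case (b): best candidate, others in commentary
        return Slot(meaning, forms[0], forms[1:])
    return Slot(meaning, None, forms)    # case (c): slot left empty


hair = fill_slot("hair", [
    Candidate("*čo", 1, "attested in both primary branches"),
    Candidate("mas", 2, "Chechen only, treated as an innovation"),
], "distributional")

egg = fill_slot("egg", [
    Candidate("*gaga-n", 1, "Batsbi word means only 'egg'"),
    Candidate("*foʔ", 2, "Vainakh word is polysemous: 'egg / forage, grain'"),
], "extra-distributional")

yellow = fill_slot("yellow", [
    Candidate("moːž-a", 1, "traceable to Proto-Vainakh"),
    Candidate("ʕaža-ɣa", 1, "traceable to Proto-Vainakh"),
    Candidate("kʼapʼraš", 2, "no etymology, unusual phonetic shape"),
], "none")

for slot in (hair, egg, yellow):
    print(slot.meaning, slot.entry, slot.commentary)
```

Keeping the lower-ranked candidates in the commentary instead of discarding them is what makes later correction possible: if external comparison favours another etymon, only the ranks need to change.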


APPENDIX. SOME EXAMPLES OF DISTRIBUTIONAL / EXTRA-DISTRIBUTIONAL RULES ON THE MATERIAL OF NAKH LANGUAGES.

In order to illustrate most of the rules described above very briefly on the material of a single linguistic taxon, we have chosen a small linguistic group (Nakh) that consists of only three languages: the closely related Chechen and Ingush, constituting one primary branch (Vainakh), and the somewhat more remote Batsbi language, constituting the other primary branch15. Complete annotated wordlists for this group have been compiled by the author of the present paper and are freely available online, along with the reconstructed proto-wordlist; below we list only a few samples, one or two for each distributional and extra-distributional rule.

D.1 (most numerous group):
ʻashesʼ: Chechen yuqʼ = Ingush yoqʼ = Batsbi yopʼqʼ ← PN *yobqʼ16. Proto-etymon is preserved in all three languages.
ʻblackʼ: Chechen ʕärž-a = Ingush ʕärž-a = Batsbi ʕarčʼ-ĩ ← PN *ʡaːrčʼ-i. Proto-etymon is preserved in all three languages.

D.2, D.3 (these two rules are essentially the same when we only deal with three languages):
ʻhair (of )ʼ: (a) Chechen mas; (b) Ingush čo, Batsbi čo ← PN *čo. Since Ingush and Batsbi represent two primary branches of Nakh, *čo is a more optimal (in this case, more parsimonious) candidate than Chechen mas, which is treated as an innovation.
ʻseedʼ: (a) Chechen hu, Batsbi huw ← PN *fuw; (b) Ingush gi. Since Chechen and Batsbi represent two primary branches of Nakh, *fuw is a more optimal (in this case, more parsimonious) candidate than Ingush gi, which is treated as an innovation.

E.1 (the rule of polysemy):
ʻeggʼ: (a) Chechen hoa, Ingush fuʔ; (b) Batsbi gagã ← PN *gaga-n. In the Vainakh branch, the word shows polysemy ʻegg / forage, grainʼ, whereas Batsbi gagã has no meaning other than ʻeggʼ, and the Batsbi correlate for (a) is oʔ ʻgrainʼ. This allows us to reconstruct *gaga-n ʻeggʼ vs. *foʔ ʻgrainʼ.

E.2 (the rule of semantic typology):
ʻearthʼ: (a) Chechen latta, Ingush lätta ← PN *laːtte; (b) Batsbi yobstʼ. Vainakh forms correspond to Batsbi latt ʻrubbishʼ; Batsbi yobstʼ, in its turn, corresponds to Ingush yostʼ ʻloose earthʼ. The semantic narrowing ʻearthʼ → ʻrubbishʼ is natural, whereas the reverse (ʻrubbishʼ → ʻearthʼ in general) currently remains unconfirmed and also seems less likely than the shift from ʻloose earthʼ to ʻearthʼ in general. This allows us to reconstruct the optimal configuration as *laːtte ʻearthʼ vs. *yobstʼ ʻloose earthʼ.

Combination of E.1 and E.2:
ʻtreeʼ: (a) Chechen ditt; (b) Ingush ga; (c) Batsbi xẽ. This is a very interesting and elegant case. Chechen ditt means either ʻtreeʼ (in general) or ʻmulberry treeʼ (specific); the latter is the word's only meaning in Ingush. Ingush ga means either ʻtreeʼ or ʻbranchʼ; it also exclusively means ʻbranchʼ in Chechen, and the Batsbi cognate gag means ʻbunch (of grapes)ʼ, a meaning that is metonymically closer to ʻbranchʼ than to ʻtreeʼ. Finally,

15 Since there are only three languages, it is impossible to use this group to illustrate situations of intersecting competition, which require no less than two different languages in two primary branches. For some examples of such situations on the data of another North Caucasian language group, Lezghian, see Kassian 2015.
16 Proto-Nakh reconstructions generally follow the phonetic correspondences and coincide with the forms adduced in Nikolayev, Starostin 1994, although in a few cases modifications were made, based on additional data analysis.


Batsbi xẽ corresponds to Vainakh *xeːn ʻwood (material)ʼ > Chechen xen, Ingush xi id. The optimal (most parsimonious and typologically plausible) scenario is to reconstruct PN *xeːn ʻtreeʼ (preserved in Batsbi; common shift ʻtreeʼ → ʻwoodʼ in Proto-Vainakh), PN *gag ʻbranchʼ (with subsequent semantic widening in Ingush), and PN *ditt ʻmulberry treeʼ (with generalization in Chechen).

E.3 (the rule of derivation):
ʻwomanʼ: (a) Chechen zud-a; (b) Ingush qal-sag; (c) Batsbi pstʼuy-no. All three stems are derived, but only for the first two are the semantics of the original stem significantly different from the semantics of the derived stem. Thus, Chechen zud-a is formally generated from zud ʻbitchʼ; Ingush qal-sag is formed from qal ʻmareʼ and sag ʻpersonʼ; Batsbi pstʼuy-no, however, is a secondary derivative of pstʼu ʻwifeʼ, which is almost the same meaning as (and a very typical and natural polysemous extension of) ʻwomanʼ. The optimal scenario is to postulate independent innovations in Chechen and Ingush, and to reconstruct PN *pstʼuw ʻwomanʼ. Additionally, it should be noted that the Batsbi word regularly corresponds to Chechen stuː, Ingush suw ʻprincessʼ. The semantic development ʻwomanʼ → ʻprincessʼ is typologically natural (cf. English queen from Indo-European *gʷen- ʻwomanʼ), but the reverse is also quite possible (cf. Norwegian dame ʻwomanʼ, etc.), so we would not invoke rule E.2 here on its own, without the confirming support of rule E.3.

E.4 (the rule of borrowing):
ʻbarkʼ: (a) Chechen kewst-ig, Ingush kɔst ← PN *kaːbst; (b) Batsbi kerk. The Batsbi word is ineligible, since it represents a transparent borrowing from Georgian. There are several more examples like this on the same list, confirming that the similarity with Georgian kerki is not accidental; nor should we suspect the opposite direction of borrowing, since most of these Georgian words have Common Kartvelian etymologies.

E.5 (the ʻno etymologyʼ rule):
ʻyellowʼ: (a) Chechen moːž-a; (b) Ingush ʕaža-ɣa; (c) Batsbi kʼapʼraš. Chechen and Ingush entries are at least traceable to the Proto-Vainakh stage: Chechen moːž-a corresponds to Ingush mɔža ʻorange (colour)ʼ, and Ingush ʕaža-ɣa = Chechen ʕoːža ʻlight-bay (of horses)ʼ. By contrast, Batsbi kʼapʼraš has no etymology whatsoever and, in addition, has a rather unusual phonetic shape for an archaism. Chechen and Ingush have comparable chances of representing this Swadesh item on the proto-level, but the Batsbi word has to be ranked much lower.
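The distributional examples (D.1 to D.3) can be checked mechanically once each attested form has been assigned to an etymon. The following Python sketch is a simplified reading of those rules, not the author's own implementation: the function name `distributional_choice` and the dictionary layout are hypothetical, cognacy judgements are assumed to be given in advance, and forms without a Proto-Nakh reconstruction in the text are labelled by the attested word itself. An etymon is selected only if it is reflected in both primary branches; otherwise the function returns `None`, meaning that extra-distributional rules (E.1 to E.5) have to decide.

```python
from typing import Dict, Optional

# Primary branches of Nakh as described in the appendix.
PRIMARY_BRANCH = {"Chechen": "Vainakh", "Ingush": "Vainakh", "Batsbi": "Batsbi"}


def distributional_choice(etyma: Dict[str, str]) -> Optional[str]:
    """Return the etymon reflected in at least two primary branches, if unique.

    etyma maps each language to the etymon (cognate class) of its word
    for the given Swadesh meaning.
    """
    coverage: Dict[str, set] = {}
    for language, etymon in etyma.items():
        coverage.setdefault(etymon, set()).add(PRIMARY_BRANCH[language])
    winners = [e for e, branches in coverage.items() if len(branches) >= 2]
    return winners[0] if len(winners) == 1 else None


# Etymon assignments follow the examples in the appendix.
WORDLIST = {
    "ashes": {"Chechen": "*yobqʼ", "Ingush": "*yobqʼ", "Batsbi": "*yobqʼ"},
    "hair": {"Chechen": "mas", "Ingush": "*čo", "Batsbi": "*čo"},
    "seed": {"Chechen": "*fuw", "Ingush": "gi", "Batsbi": "*fuw"},
    "earth": {"Chechen": "*laːtte", "Ingush": "*laːtte", "Batsbi": "yobstʼ"},
    "tree": {"Chechen": "ditt", "Ingush": "ga", "Batsbi": "xẽ"},
}

results = {meaning: distributional_choice(e) for meaning, e in WORDLIST.items()}
for meaning, choice in results.items():
    print(meaning, "->", choice if choice else "undecided (extra-distributional rules needed)")
```

For ʻearthʼ and ʻtreeʼ the sketch deliberately returns no answer: each candidate is confined to one primary branch, which is exactly the situation in which the text falls back on polysemy and semantic typology.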

BIBLIOGRAPHY

Bartholomae Ch., 1961, Altiranisches Wörterbuch. 2. Unveränderte Auflage, Berlin, Walter de Gruyter & Co.
Bosworth B., 1898, An Anglo-Saxon Dictionary. Edited and Enlarged by T. Northcote Toller, Oxford University Press.
Burrow Th. & Emeneau M. B., 1984, A Dravidian Etymological Dictionary. Second Edition, Oxford, Clarendon Press.
Garnier R., 2011, Sur l’étymologie du grec ὠμός ʻcruʼ, Bulletin de la société de linguistique de Paris 106/1, p. 249-262.
Holman E. W., Wichmann S., Brown C., Velupillai V., Müller A. & Bakker D., 2008, Explorations in automated lexicostatistics, Folia Linguistica 42.2, p. 331-354.
Kassian A., 2015, Towards a Formal Genealogical Classification of the Lezgian Languages (North Caucasus): Testing Various Phylogenetic Methods on Lexical Data, PLoS ONE 10(2): e0116950.


Kassian A., Starostin G., Dybo A. & Chernov V., 2010, The Swadesh wordlist: an attempt at semantic specification, Journal of Language Relationship 4, p. 46-89.
Klimov G. A., 1964, Etimologicheskij slovar' kartvel'skikh jazykov [Etymological dictionary of Kartvelian languages], Moscow, Izdatel'stvo Akademii Nauk SSSR.
Kogan L., 2006, Lexical Evidence and the Genealogical Position of Ugaritic (I), Babel und Bibel 3. Orientalia et Classica vol. XIV, Winona Lake, Indiana, Eisenbrauns, p. 429-488.
Kurilov G. N., 2001, Jukagirsko-russkij slovar' [Yukaghir-Russian dictionary], Novosibirsk, Nauka Publishers.
List J.-M., Nelson-Sathi S., Geisler H. & Martin W., 2014, Networks of lexical borrowing and lateral gene transfer in language and genome evolution, BioEssays 36.2, p. 141-150.
Nikolayev S. & Starostin S., 1994, A North Caucasian Etymological Dictionary, Moscow, Asterisk Publishers.
Nikolayeva I., 2006, A Historical Dictionary of Yukaghir, Berlin – New York, Mouton de Gruyter.
Orel V., 2003, A Handbook of Germanic Etymology, Leiden – Boston, Brill.
Pokorny J., 1958, Indogermanisches etymologisches Wörterbuch, Bern – München, Francke Verlag.
Rastorguyeva V. S. & Edelman D. I., 2007, Etimologicheskij slovar' iranskikh jazykov. Tom 3: f-h [Etymological dictionary of Iranian languages. Vol. 3: f-h], Moscow, Vostochnaja literatura.
Starostin G., 2010, Preliminary lexicostatistics as a basis for language classification: A new approach, Journal of Language Relationship 3, p. 79-116.
Starostin G., 2013a, Lexicostatistics as a basis for language classification: increasing the pros, reducing the cons, in H. Fangerau, H. Geisler, Th. Halling, W. Martin (ed.), Classification and Evolution in Biology, Linguistics and the History of Science: Concepts – Methods – Visualization, Stuttgart, Franz Steiner Verlag, p. 125-146.
Starostin G., 2013b, Jazyki Afriki. Opyt postrojenija leksikostatisticheskoj klassifikacii. Tom I: Metodologija. Kojsanskije jazyki [Languages of Africa: an attempt at a lexicostatistical classification. Volume I: Methodology. Khoisan Languages], Moscow, Jazyki slav'anskoj kul'tury.
Starostin S., 1991, Altajskaja problema i proiskhozhdenije japonskogo jazyka [The Altaic problem and the origins of the Japanese language], Moscow, Nauka Publishers.
Starostin S., 1995, Sravnitel'nyj slovar' jenisejskikh jazykov [Comparative dictionary of Yeniseian languages], Ketskij sbornik. Lingvistika [Ket Volume. Linguistics], Moscow, Vostochnaja literatura, p. 176-315.
Turner R. L., 1966, A Comparative Dictionary of the Indo-Aryan Languages, Oxford University Press.
Vries J. de, 1962, Altnordisches Etymologisches Wörterbuch, Leiden, Brill.
