From wordlists to proto-wordlists: reconstruction as ʻoptimal selectionʼ

George Starostin*

INTRODUCTION

In several of my previous publications (Starostin 2010, 2013a, 2013b), I have repeatedly stressed the importance of combining, rather than opposing, the classic comparative method, elaborated by several generations of historical linguists over the past two hundred years, and lexicostatistical methodology, originally introduced by Morris Swadesh and his colleagues in the 1950s and once again popular these days due to a massive influx of computational phylogenetic methods from adjacent branches of science. It was with that precise purpose, namely to integrate the two approaches, that a team of Moscow-based historical linguists set up The Global Lexicostatistical Database (GLD)1, a large-scale project that aims at applying a uniform, maximally formalized lexicostatistical methodology to all the languages of the world in order to arrive at a reasonable genetic classification, while at the same time corroborating the results with knowledge gained from traditional and philology. One of the most important postulates of the GLD is that lexicostatistics, applied on the basis of superficial phonetic comparison of words that belong to thousands of different languages (similar, for example, to the procedure currently employed in S. Wichmann's ASJP project, see Holman et al. 2008), will not be of much use for describing language relationships that go back more than several thousand years. In other words, the best way to explore the issue of, say, the external relations of Indo-European on a lexicostatistical basis would be to operate with a single wordlist for Proto-Indo-European, rather than with several hundred wordlists for its daughter languages.
Even more than that: to explore the far less controversial, but almost as problematic, issue of the internal classification of Indo-European, it would make more sense to operate with reconstructed proto-wordlists for Proto-Germanic, Proto-Celtic, Proto-Slavic, etc., than with mass data from modern languages, which clutter the comparison with large amounts of secondarily accumulated ʻnoiseʼ and raise the risk of distorting the resulting classification.
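
The kind of lexicostatistical comparison referred to here can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions, not the procedure actually implemented in the GLD: the function name `shared_percentage`, the encoding of cognacy judgements as integer class labels, and the toy wordlists are all hypothetical and introduced solely for the example.

```python
from typing import Dict, Optional

# A wordlist maps a Swadesh-type meaning to a cognate-class label.
# None marks a meaning for which no entry is available.
# (Hypothetical encoding, chosen only for this illustration.)
Wordlist = Dict[str, Optional[int]]


def shared_percentage(a: Wordlist, b: Wordlist) -> float:
    """Percentage of comparable meanings whose entries fall into the same cognate class."""
    # Only meanings that are filled in both lists can be compared.
    comparable = [m for m in a if a[m] is not None and b.get(m) is not None]
    if not comparable:
        raise ValueError("the two wordlists share no filled meanings")
    matches = sum(1 for m in comparable if a[m] == b[m])
    return 100.0 * matches / len(comparable)


# Toy data: the labels are arbitrary and do not reflect any real languages.
list_a: Wordlist = {"see": 1, "liver": 1, "red": 1, "eye": 1}
list_b: Wordlist = {"see": 1, "liver": 2, "red": 1, "eye": None}

print(shared_percentage(list_a, list_b))
```

The same function applies unchanged whether the two inputs are wordlists of attested languages or reconstructed proto-wordlists; the argument of this paper concerns which inputs are fed into such a comparison, not the arithmetic itself.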

* Russian State University for the Humanities, Moscow; Russian Presidential Academy, Moscow: [email protected]
1 The GLD currently operates as an autonomous website, subordinate to the larger ʻTower of Babelʼ project, originally launched by Sergei Starostin as a digital hub for etymological databases (http://starling.rinet.ru).

Consequently, in this paper I will discuss some important, but frequently overlooked or understated details of how one should go about trying to reconstruct what might be called an ʻoptimalʼ proto-wordlist for a protolanguage. By wordlist I typically understand a Swadesh-type wordlist, i.e. a list of basic words whose meanings are relatively well defined2 and in most cases (although there may be solitary exceptions) find precise equivalents in any given language of the world. As a rule, the GLD operates with the regular 100-item Swadesh wordlist or its truncated 50-item version (see Starostin 2013a for details on the validity and usage of the shortened version), but the rules and recommendations presented below would work equally well for any other version, as long as the elements on the wordlist are assigned discrete meanings3.

1. PRELIMINARY REMARKS

Although from the point of view of phylogenetic thinking, step-by-step reconstruction of proto-wordlists would seem to be a logical move, such a procedure is not commonly encountered in historical linguistics. Reconstructing a proto-wordlist is not at all the same thing as reconstructing morphemes, a far more natural occupation for etymologists all over the world. Above all, it presumes that the reconstructed morpheme (or word) (1) will be assigned a concrete, rather than vague, semantic definition, coinciding with the required meaning of the corresponding element on the wordlist (e.g. ʻseeʼ rather than ʻlook, see, observe, examine, thinkʼ; ʻliverʼ rather than ʻk. of internal organ or intestineʼ; ʻredʼ rather than ʻblood, red, brightly colouredʼ, etc.); (2) will have a better chance of representing this particular meaning in its basic (most stylistically neutral or frequent) usage on the proto-level than any other reconstructible morpheme (or word). We should stress that within the context of a lexicostatistical evaluation, the main goal of reconstructing a proto-wordlist is not a thorough and conclusive study of phonetic correspondences, including rare and non-trivial ones (although such a study is always useful). First and foremost, such a reconstruction helps define the optimal probable set of etyma4 that could have served as the main

2 It is, of course, well-known that many of the elements on the original Swadesh wordlist were not all that precisely defined in Swadesh's original works, which gave researchers a solid pretext to replace any of the elements that they deemed uncomfortable for their own purposes. An attempt to eliminate many of these problems by associating various Swadesh items with diagnostic syntactic contexts and semantic commentary has been published as Kassian et al. 2010 and is currently used as a set of guidelines in the construction of the GLD.
3 This paper essentially represents a somewhat reworked and improved English translation of one single subsection in the ʻMethodologyʼ chapter of the author's recent Russian language monograph on the lexicostatistical classification of (Starostin 2013), with the addition of several important details and a special illustrative appendix.
4 We typically define an etymon as a correlated pair that may be reconstructed for a proto-language based on regular reflexes in descendant languages.