From Wordlists to Proto-Wordlists: Reconstruction as ‘Optimal Selection’

George Starostin*

INTRODUCTION

In several of my previous publications (Starostin 2010, 2013a, 2013b), I have repeatedly stressed the importance of combining, rather than opposing, the classic comparative method, elaborated by several generations of historical linguists over the past two hundred years, and lexicostatistical methodology, originally introduced by Morris Swadesh and his colleagues in the 1950s and once again popular these days due to a massive influx of computational phylogenetic methods from adjacent branches of science. It was with that precise purpose, namely to integrate the two approaches, that a team of Moscow-based historical linguists set up The Global Lexicostatistical Database (GLD)¹, a large-scale project that aims at applying a uniform, maximally formalized lexicostatistical methodology to all the languages of the world in order to arrive at a reasonable genetic classification, while at the same time corroborating the results with knowledge gained from traditional historical linguistics and philology.

One of the most important postulates of the GLD is that lexicostatistics, applied on the basis of superficial phonetic comparison of words that belong to thousands of different languages (e.g. along the lines of the procedure currently employed in S. Wichmann's ASJP project; see Holman et al. 2008), will not be of much use for describing language relationships that go back more than several thousand years. In other words, the best way to explore the issue of, say, the external relations of Indo-European on a lexicostatistical basis would be to operate with a single wordlist for Proto-Indo-European, rather than with several hundred wordlists for its daughter languages.
Even more than that: to explore the far less controversial, but almost as problematic, issue of the internal classification of Indo-European, it would make more sense to operate with the reconstructed proto-wordlists for Proto-Germanic, Proto-Celtic, Proto-Slavic, etc., than with mass data from modern languages, which clutter the analysis with large amounts of secondarily accumulated ‘noise’ and raise the risk of distorting the resulting classification.

* Russian State University for the Humanities, Moscow; Russian Presidential Academy, Moscow.

¹ The GLD currently operates as an autonomous website, subordinate to the larger ‘Tower of Babel’ project, originally launched by Sergei Starostin as a digital hub for etymological databases (http://starling.rinet.ru).

Downloaded from Brill.com 09/24/2021 07:10:16AM via free access

Consequently, in this paper I will discuss some important, but frequently overlooked or understated, details of how one should go about reconstructing what might be called an ‘optimal’ proto-wordlist for a proto-language. By wordlist I typically understand a Swadesh-type wordlist, i.e. a list of basic words whose meanings are relatively well defined² and in most cases (although there may be solitary exceptions) find precise equivalents in any given language of the world. As a rule, the GLD operates with the regular 100-item Swadesh wordlist or its truncated 50-item version (see Starostin 2013a for details on the validity and usage of the shortened version), but the rules and recommendations presented below would work equally well for any other version, as long as the elements on the wordlist are assigned discrete meanings³.

1. PRELIMINARY REMARKS

Although, from the point of view of phylogenetic thinking, step-by-step reconstruction of proto-wordlists would seem to be a logical move, such a procedure is not commonly encountered in historical linguistics.
Reconstructing a proto-wordlist is not at all the same thing as reconstructing morphemes, a far more natural occupation for etymologists all over the world. Above all, it presumes that the reconstructed morpheme (or word) (1) will be assigned a concrete, rather than vague, semantic definition, coinciding with the required meaning of the corresponding element on the wordlist (e.g. ‘see’ rather than ‘look, see, observe, examine, think’; ‘liver’ rather than ‘kind of internal organ or intestine’; ‘red’ rather than ‘blood, red, brightly coloured’, etc.); and (2) will have a better chance of representing this particular meaning in its basic (most stylistically neutral or frequent) usage on the proto-level than any other reconstructible morpheme (or word).

We should stress that, within the context of a lexicostatistical evaluation, the main goal of reconstructing a proto-wordlist is not a thorough and conclusive study of phonetic correspondences, including rare and non-trivial ones (although such a study is always useful). First and foremost, such a reconstruction helps define the optimal probable set of etyma⁴ that could have served as the main carriers for the meanings constituting the wordlist on the proto-level.

² It is, of course, well known that many of the elements on the original Swadesh wordlist were not all that precisely defined in Swadesh's original works, which gave researchers a solid pretext to replace any of the elements that they deemed uncomfortable for their own purposes. An attempt to eliminate many of these problems by associating various Swadesh items with diagnostic syntactic contexts and semantic commentary has been published as Kassian et al. 2010 and is currently used as a set of guidelines in the construction of the GLD.

³ This paper essentially represents a somewhat reworked and improved English translation of a single subsection in the ‘Methodology’ chapter of the author's recent Russian-language monograph on the lexicostatistical classification of the languages of Africa (Starostin 2013), with the addition of several important details and a special illustrative appendix.

⁴ We typically define an etymon as a correlated pair <sound (X) : meaning (Y)> that may be reconstructed for a proto-language based on regular reflexes in descendant languages.

It is sufficient for our starred forms to be phonetically compatible, i.e. cognacy decisions should be made based on transparent argumentation and should not contradict any historical or typological evidence. This circumstance is important inasmuch as the application of this methodology to language units all over the world will inevitably run into situations where there is simply not enough linguistic data to allow for a thorough clarification of all phonetic correspondences between poorly described, extinct, or too distantly related languages. It is much more significant to account properly for the distribution of various roots with identical basic semantics across the languages belonging to the investigated taxon. Just as it is very important to minimize (preferably, completely eliminate) synonymy while collecting material for Swadesh-type wordlists of attested languages, reconstruction of the proto-wordlist also demands minimization of available candidates: operating on the principle of uniformitarianism, we have no theoretical reason to assume that lexical synonymy, particularly when it comes to strictly defined basic lexicon items, should have been more rampant in reconstructed proto-languages than it is in their modern descendants⁵.
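The requirement to account for the distribution of competing roots across the taxon can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Entry` record, the function names, the branch labels, and the toy data are invented, and ranking candidates by the number of branches that attest them is only one simple heuristic, not the selection procedure of this paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    """One filled wordlist slot in one attested language (invented toy record)."""
    language: str  # hypothetical language label
    branch: str    # hypothetical primary branch of the taxon
    meaning: str   # discrete wordlist meaning, e.g. "see"
    root: str      # label of the cognate set the word belongs to


def candidate_distribution(entries, meaning):
    """Map each candidate root for `meaning` to the set of branches attesting it."""
    branches_by_root = defaultdict(set)
    for entry in entries:
        if entry.meaning == meaning:
            branches_by_root[entry.root].add(entry.branch)
    return dict(branches_by_root)


def rank_candidates(entries, meaning):
    """Order candidate roots by the number of attesting branches, widest first.

    Ties are broken alphabetically only to keep the output deterministic;
    the tie-break carries no linguistic weight.
    """
    dist = candidate_distribution(entries, meaning)
    return sorted(dist, key=lambda root: (-len(dist[root]), root))


# Invented toy data: three branches, four languages, two competing roots.
TOY = [
    Entry("A1", "A", "see", "ROOT_X"),
    Entry("A2", "A", "see", "ROOT_X"),
    Entry("B1", "B", "see", "ROOT_X"),
    Entry("C1", "C", "see", "ROOT_Y"),
]

if __name__ == "__main__":
    print(candidate_distribution(TOY, "see"))
    print(rank_candidates(TOY, "see"))
```

Counting branches rather than individual languages keeps a root that is frequent within one well-documented branch from outweighing a root shared by several branches; whether that weighting is appropriate for a given taxon is left open here.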
To the best of our knowledge, there have so far been no detailed, formalized attempts to describe manual algorithms for the selection of optimal proto-language etyma in comparative-historical linguistics; this can be attributed, on the one hand, to a generally skeptical attitude towards lexicostatistics (for which this procedure is of particular importance), and on the other, to the traditional unpopularity of rigorous semantic reconstruction among historical linguists. One of the most important exceptions to this overall tendency is a (relatively) recent study by Leonid Kogan (2006), where such an attempt has been undertaken on the basis of Semitic languages; nevertheless, the procedure suggested by Kogan clearly does not exhaust the entire potential of the comparative-historical method and is therefore in need of a series of additions, a brief list of which was suggested in Starostin 2010.

In order to make the description of the procedure as transparent and demonstrative as possible, let us define it on a concrete object, a so-called

⁴ (continued) The term should be technically distinguished from root or stem, since the latter refer primarily to the phonetic shape of the entity, leaving its semantic content relatively vague and ambiguous; that said, it is natural that in many particular contexts all these terms may be partially synonymous and interchangeable.

⁵ Our views on the issue of synonymy (particularly in the sphere of basic lexicon, where it is easier to operate with discrete, rather than vague and continuous, meanings than in the sphere of cultural lexicon) were originally described in Starostin 2010. Later, in Starostin 2013b: 109-116, it was even more rigidly stated that there are almost no cases of total synonymy (i.e. complete interchangeability of two different words with the exact same meaning in any given context, register, or sociolinguistic situation) that could be reliably attested in any of the world's languages, and that most cases where such a synonymy could be suspected are really cases of technical synonymy (where we simply do not have sufficient data to understand the proper differentiation) or transit synonymy (where we deal with a temporary situation in
