Reconstructing the Evolutionary History of Natural Languages

Home , Evolution of languages

University of Pennsylvania ScholarlyCommons

IRCS Technical Reports Series Institute for Research in Cognitive Science

June 1995

Reconstructing the evolutionary history of natural languages

Tandy Warnow University of Pennsylvania

Donald Ringe University of Pennsylvania, [email protected]

Ann Taylor University of Pennsylvania

Follow this and additional works at: https://repository.upenn.edu/ircs_reports

Warnow, Tandy; Ringe, Donald; and Taylor, Ann, "Reconstructing the evolutionary history of natural languages" (1995). IRCS Technical Reports Series. 129. https://repository.upenn.edu/ircs_reports/129

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-95-16.

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/ircs_reports/129 For more information, please contact [email protected]. Reconstructing the evolutionary history of natural languages

Abstract In this paper we present a new methodology for determining the evolutionary history of related languages. Our methodology uses linguistic information encoded as qualitative characters, so that prospective trees can be evaluated according to various optimization criteria, much as is done in the practice of inferring evolutionary history for biological species. By contrast with biology, however, we find that the linguistic data support evolutionary trees with extremely good compatibility scores, and that for such data it is possible to find optimal trees quickly. We have applied this method to the classification of Indo-European (IE) languages; we have been able to resolve one longstanding open problem (the Indo- Hittite hypothesis), and have indicated exactly what needs to be established in order to resolve another longstanding open problem (the Italo-Celtic hypothesis). We have also discovered rather surprising facts about the history of Germanic within this family. Thus, this method provides an ability to resolve difficult questions in Historical Linguistics that have proved resistent to traditional character-based methodologies and to the more recent distance based approaches of lexicostatistics. The results of our methodology also indicate weaknesses in methods currently accepted and practiced in historical linguistics. One of our more important results is the ability to detect and handle loan words that are not distinguishable from more important results is the ability to detect and handle loan words that are not distinguishable from true cognates by traditional methods. Finally, this methodology permits the linguist to develop and test assumptions about the evolutionary relevance of different linguistic characters.

Comments University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-95-16.

This technical report is available at ScholarlyCommons: https://repository.upenn.edu/ircs_reports/129 Institute for Research in Cognitive Science

Reconstructing the evolutionary history of natural languages

Tandy Warnow Donald Ringe Ann Taylor

University of Pennsylvania 3401 Walnut Street, Suite 400C Philadelphia, PA 19104-6228 June 1995

Site of the NSF Science and Technology Center for Research in Cognitive Science

IRCS Report 95-16

Reconstructing the evolutionary history of natural languages

Tandy Warnow Donald Ringe

Department of Computer and Information Science Department of Linguistics

UniversityofPennsylvania UniversityofPennsylvania

Ann Taylor

Department of Linguistics

UniversityofPennsylvania

June

Abstract

In this pap er wepresent a new metho dology for determining the evolutionary history of related lan

guages Our metho dology uses linguistic information enco ded as qualitativecharacters so that prosp ec

tive trees can b e evaluated according to various optimization criteria much as is done in the practice of

inferring evolutionary history for biological sp ecies By contrast with biologyhowever we nd that the

linguisti c data supp ort evolutionary trees with extremely go o d compatibility scores and that for such

data it is p ossible to nd optimal trees quicklyWehave applied this metho d to the classication of

IndoEurop ean IE languages wehave b een able to resolve one longstanding op en problem the Indo

Hittite hypothesis and have indicated exactly what needs to b e established in order to resolve another

longstandin g op en problem the ItaloCeltic hypothesis Wehave also discovered rather surprising facts

ab out the history of Germanic within this familyThus this metho d provides an ability to resolve dicult

questions in Historical Linguistics that haveproved resistent to traditional characterbased metho dolo

lexicostatistics The results of our metho dology gies and to the more recent distance based approaches of

also indicate weaknesses in metho ds currently accepted and practiced in historical linguisti cs One of our

more imp ortant results is the ability to detect and handle loan words that are not distingui shab le from

true cognates by traditional metho ds Finally this metho dology p ermits the linguist to develop and test

assumptions ab out the evolutionary relevance of dierent linguisti c characters

Intro duction

A set of languages is said to b e genetical ly related if it meets the following criteria all the languages

of the set were once a single language called a protolanguage and that protolanguage diversied

into the languages of the set through the regular pro cess of rstlanguage transmission in whichchildren

learn their native rst language from the adults of their immediate sp eech community The patterns of

similarities b etween languages that genetic relationship gives rise to are largely dierent from those caused

by contact between adult sp eechcommunities in particular they app ear chiey in the most fundamental

areas of languages structure namely their grammars and their most basic vo cabularyIfalargenumber

of sp eechcommunities all originally sp eaking a single language gradually diversify in continual contact

for centuries the pattern of relationships b etween them is p erhaps b est mo delled as what linguists call a

network just a general directed graph but if the languages lose contact as they diversifyorifthenetwork

of diversifying dialects is sampled at p oints suciently distant from one another the pattern of relationships

observed is b est mo delled as a tree The use of trees to mo del the evolution of languages in traditional

historical linguistics is thus largely justied by the observed facts

The determination of evolutionary trees for natural languages is a ma jor endeavor within historical

linguistics Muchisknown ab out the evolution of some language families such as IndoEurop ean IE while

for others essentially nothing has b een determined Even for the IE familyhowever which is the most

ery rudimentary separation extensively studied the early evolutionary history is not resolved b eyond a v

into subfamilies Germanic Italic etc to a large extent this is due to diculties in interpreting data

correctly determining which information really is relevanttoevolutionary history and to the computational

diculties of exploring the tree space Thus while it has b een p ossible for linguists to check their scholarly

interpretations of the data against a hyp othetical tree the diculties in exploring the exp onentially sized

tree space and the arguments ab out interpretation of data have led to giant impasses

In this pap er we will present a metho d for eciently inferring evolutionary history of languages known

to b e related ie memb ers of the same family of languages The metho d has three parts

We showhow to enco de information ab out languages interpreted bythescholar as qualitativechar

acters so that hyp othetical trees can then b e evaluated according to various ob jective criteria suchas

those used in evolutionary tree construction in Biology when based up on biomolecular sequences

We showhow to eciently nd the optimal and nearoptimal trees with resp ect to the compatibility

criterion a standard criterion for evaluating evolutionary trees in use for certain kinds of biological

data and appropriate to the linguistic context when the data supp orts a tree which has extremely

go o d scores

Wehave develop ed metho ds appropriate to linguistic trees for nding consensus and agreement trees

which can then b e applied to determine the common features of the b est trees for the data set In this

waywe can establish those asp ects of the evolutionary history whichhave strong supp ort as opp osed

to those whichhaveweak supp ort

Wehave applied this metho d to the problem of inferring evolutionary history for the IE family of lan

guages and have made several surprising and strikingly strongly supp orted ndings Our motivation for

studying this particular family of languages was that little progress had b een made on denitively settling the

rstorder subgrouping of IE despite the fact that it is the b est attested and b est studied family of languages

available to historical linguists A solution to the rstorder subgrouping of this family would establish the

applicability of this metho dology b eyond question We analyzed the IE data with particular interest in

determining whether our new metho dol could lay to rest the debate on two longstanding conjectures the

IndoHittite hypothesis and the ItaloCeltic hypothesis The former arms that the Anatolian subfamily is

one of two rstorder subgroups of IE In terms of the tree then this asserts that the ro ot should havetwo

children one of which is Anatolian represented in our database by Hittite its b est attested memb er The

latter claims that the Italic and Celtic subfamilies are siblings in the tree

The implications of this metho dology go b eyond settling op en problems within a single family of lan

guages however wehaveshown that standard metho ds in use in Historical Linguistics need to b e mo died

For example the incidence of undetectable b orrowing that is b orrowing that has o ccurred so early as to

b e indistinguishable from true cognation is higher than was previously thought but can b e detected and

handled through an appropriate use of our metho dology

The rest of the pap er is organized as follows In Section we b egin with a discussion of computational

asp ects of inferring evolutionary trees b oth from qualitativecharacters and distances In Section we

describ e the history of the metho dology of reconstructing evolutionary history of natural languages In

Section we discuss our metho dology and its computational complexity The application of our metho dology

to IE is presented in Section We conclude in Section with a discussion of the contribution this metho d

makes to historical linguistics

Inferring Evolutionary Trees

An evolutionary tree or phylogenyfora set S of taxa ie of sp ecies or languages describ es the evolution

of the taxa in S from their most recent common ancestor The taxa in S lab el the leaves of the tree the

internal no des of the tree represent ancestors and the tree is ro oted at the most recent common ancestor of

all the taxa in S The top ology of the tree represents the chronology of sp eciation events p oints at which

a sp ecies splits into two or more sp ecies Data of dierenttyp es can b e used as input for metho ds of tree

construction typical are distance data the basis of lexicostatistics see b elow and character data which

reect sp ecic observable characteristics of the sp ecies under study morphological data in biology there

is no comparable term in linguistics in which morphology is used narrowly to mean the grammatical

characteristics of words

Z where Z is the set of integers Thus A qualitativecharacter or simply character is a function c S

acharacter denes a partition of the set of sp ecies S into equivalence classes each class characterized bya

single state of the character

Given a set S of n taxa whether biological or linguistic describ ed by a set C of k characters wecan

represent the information byan k matrix M of character state information in which M is the state of

sp ecies s for the j character In this wayeach sp ecies is represented byavector of character states An

evolutionary tree T for S has its leaves lab elled bythevectors representing S and its internal no des also

lab elled byvectors of character states Thus for eachcharacter C theevolutionary tree T for S C

denes an extension of V T Z When the tree T is given we will denote by fv V T v

ig otherwise in the absence of such a tree we will let denote the set fs S sig

Evolutionary trees based up on characters are typically evaluated using either parsimony or compatibility

see for a discussion of these dierent criteria Wesaythata character is compatible also called

convex with a vectorlab elled tree T if for every state of the no des having that state form a connected

subgraph of T The compatibility score of a tree T is dened by cT jf C is convex on Tgj

Given an input S C nding the tree T of maximum compatibility score is called the Compatibility Criteria

Problem The other criterion in p opular use is the parsimony criterion The parsimony score of a tree T is

the sum H e where H e is the hamming distance on the edge e that is the numb er of p ositions

eE T

in which the endp oints of e dier Finding the tree with minimum parsimony score for a given data set is

called the maximum parsimony problemandis NPhard

In evaluating evolutionary trees for languages wehave found the compatibility criterion more relevant

than the parsimony criterion as a rst comparison If a character is not convex on a tree T then either

the tree is incorrect or the scholarly judgement that the character would b e convex on the true tree is false

Since this metho dology explicitly eliminates all characters for whichwebelieve there is a nonnegligible

probability of parallel development the fact that a character is not convex on the tree under consideration

is much more signicant than the precise numb er of extra evolutionary steps required bythatcharacter on

that tree Thus the numberofcharacters whicharenotconvex on T is a more accurate measure of the

badness of T than the exact numb er of extra transitions which o ccur on the tree For this reason the

compatibility score of a tree is more signicant than the parsimony score when evaluating evolutionary trees

for natural languages However the parsimony score of a tree is useful in letting us compare two trees with

equivalent compatibility scores

Wenow consider the computational complexity of the compatibility criteria problem

Indep endent Set Problem

Input Graph G V Eandaninteger k

Question Do es there exist a subset V V of size k such that for all fv wg V v w E

Theorem from The Compatibility Criteria Problem is NPhard and cannot be approximatedbya

polynomialtime algorithm within a factor of jC j unless QNP coQR

Pro of Wesketch here a reduction from Indep endent Set similar to that in Let G V Ekbean

input to the Indep endent Set problem Let S V E fr gwhere r denotes an added elementnotin

V E For eachvertex v V let c S f g b e dened by c fv gfv wv w E gItis

known that when the sp ecies set includes a ro ot that is a sp ecies r with cr forall c C then

a pair of binary characters and are compatible if and only if and are either disjointor

one contains the other As a result c and c are compatible if and only if v w E Thus the graph

v w

c v V g has a set of k pairwise G has an indep endent set of size k if and only if the character set f

compatible characters Since pairwise compatibility of binary characters ensures setwise compatibility

G has an indep endent set of size k if and only if the character set has a compatible subset of k characters

Thus Compatibility Criteria is NPhard

Since this is a linear reduction the Compatibility Criteria problem is as hard to approximate as Maximum

Indep endent Set Bellare and Sudan have proved that the Maximum Clique and therefore Maximum

Indep endent Set since it is just Clique on the complemented graph on a graph with n no des cannot b e

approximated to within a factor of n unless QNP coQR See Johnson for more details

Later in this pap er we will show that linguistic data supp orts a tree with almost p erfect data that is a

tree T which has compatibility score at least k t for very small t and that on suchdatawe can nd the

r t

provably optimal trees in time O nk

Subgrouping Metho dologies

Metho dologies for subgrouping related languages have b een debated for over a century There are two

basic typ es of metho dologies classical or traditional metho ds which are character based and lexicostatistical

metho ds which are distance based These twotyp es of metho dologies nevertheless have some features in

common All reliable metho ds of subgrouping languages must start from the comparative methodthe

simple but rigorous mathematical metho d for reconstructing protolanguages co died in Without the

comparative metho d one cannot even recognize cognate vo cabulary that is words inherited by genetically

related languages from their protolanguage as opp osed to words b orrowed through language contact or

words that happ en through sheer chance to b e similar in sound and meaning

It is also imp ortant to use the earliest wellattested stages of languages in attempting to construct

evolutionary trees for them b ecause natural language change steadily ero des and obliterates original features

of the protolanguage as the daughter languages develop over time Using more recentversions of languages

y will not rule out the correct tree but can lead to characters which t not only the correct tree but man

trees and thus leads to underdierentiated trees This is for example the problem with the placementof

Albanian in the IE tree b ecause our Albanian data is from the th century and Albanian is not a very

conservative language there are very few characters which group Albanian with any other language in the

family As a result Albanian can b e t almost anywhere in a given evolutionary tree with equally go o d

scores Thus using more ancientwellattested forms of languages allows us to retain information and thus

increases the probability of selecting the correct tree

Classical Metho ds

In classical metho ds languages are assigned to the same subgroup that is memb ers of a set of related

languages are b elieved to dep end from the same no de of the evolutionary tree only if two conditions are

met the languages in question exclusively share innovations and those innovations are unlikely to

have o ccurred indep endently To use the terms wehave dened ab ove the interpretation of character

information for classical subgrouping is as follows character states which are innovations should b e

convex on the true evolutionary tree and only characters which are unlikely to b e aected by parallel

development are used

Lexicostatistics

An alternative metho d of subgrouping lexicostatistics was develop ed in the s and s

For lexicostatistical analysis one determines what prop ortion of the most basic vo cabulary is shared by each

pair of languages under investigation it is assumed that most shared items are retained inheritances and that

the prop ortion of basic items replaced correlates roughly with the time elapsed from the p oint at whichthe

two languages in question were still a single language The validity of these assumptions while questioned

has not led to a complete rejection of these distancebased metho ds Lexicostatistics refers in general to

any metho d for constructing trees for languages based up on distances obtained in this wa y but the usual

metho d makes languages whichhave the smallest distance b etween them into siblings a new parent language

is created and the metho d applied recursively

Critique of these metho dologies

The ma jor weaknesses of lexicostatistical metho ds are that they rely up on derived rather than primary

data and most asso ciated optimization problems are NPhard Indeed it seems that the real reason

that distance based metho ds are so p opular in Linguistics is that software is available for constructing trees

optimal or not from distance data and the software is easy to use and fast These software packages

however are based up on heuristics whichhave unproven p erformance and thus do not reliably nd trees

which are optimal with resp ect to any ob jective criterion While many linguists use the softwareasanal

to ol the b est linguists use it only to generate trees which can then b e evaluated through the use of

classical metho ds

The classical metho ds are character based and indicate a key insightinto the correct waytodoevolu

tionary tree construction Determining whichcharacters denote evolutionary information is a sophisticated

and dicult matter and a fair amount of the debate in the eld concerns these judgements and howto

use the linguistic information to dene characters The signicance of inectional p eculiarities is esp ecially

often unclear There are other problems b eyond the choice of characters however whichinvolve the limited

use by historical linguists of the character information Sp ecically linguists have known that all character

states should if p ossible b e convex on the true tree but this understanding was never made explicit in

ere innovations the literature That is the co died knowledge only sp ecically indicated that states that w

should b e convex the realization that this implied convexity for all states of the character was never made

precise

The situation is dierent for Lithuanian b ecause although the Lithuanian data are likewise from the th century the

language is unusually conservative

Our metho dology for constructing evolutionary trees from lin

guistic characters

Our metho dology has three essential comp onents encoding linguistic information using qualitative charac

ters an algorithm to nd the optimal and nearoptimal trees and methods for nding the common features

of the best trees

The enco ding of the linguistic information as qualitativecharacters involves a great deal of linguistic

scholarship and to some degree the judgements can b e op en to debate as dierent linguists will deem

dierent information as relevant or not relevant to the evolutionary tree and even when agreeing to the

relevance of a character they may dier in their enco ding of the character states The enco ding of linguistic

information as characters thus involves linguistic judgementaswell as mathematical mo delling This is

describ ed in Section

The input to an algorithm for evolutionary tree construction from linguistic data is as we will show a

combination of character state and directionality information whichwehave enco ded as qualitativechar

acters in which certain asp ects of the output trees are required while others are desirable but not forced

We will dene an optimization criterion called directedcompatibility for trees constructed from such data

Having dened our optimization criterion we will then showthatwe can eciently nd all the optimal and

nearoptimal trees for this criterion Our algorithm is given in Section

The motivation wehave for lo oking at all the optimal and nearoptimal trees is that while we b elievethe

ve a go o d score we cannot b e sure that it is absolutely the b est tree and not for example true tree will ha

the second b est tree Instead by examining all the close to optimal trees we can with high condence b e

sure that the correct tree is among these trees and we can also b e condent therefore that the features which

are true ab out most or p erhaps all of the nearoptimal trees will b e true ab out the true evolutionary tree

Thus our algorithm for nding the optimal tree is extended to nd all trees within some sp ecied b ound of

optimum This p ermits us to apply consensus and agreement metho ds to the prole of nearoptimal trees

so that we can infer the common features This is describ ed in Section

Typ es of Linguistic Characters

Lexical For lexical characters the character is the semantic slot as for example the meaning hand

Languages whichhave reexes of the same protolexeme for this semantic slot exhibit the same state for the

character Determining whichwords are cognate is accomplished through the application of the Comparative

Metho d whichwas co died by Henry Ho enigswald in The Comparative Metho d pro duces equivalence

classes of cognates and not just a similarity score and except in unusual cases discussed later on in this

pap er these judgements are entirely accurate Thus for semantic slots we can dene equivalence classes

and hence represent lexical information as characters

Morphological For morphological characters the character is generally a grammatical feature as for

example the formation of the future stem the way the passiveismarked the genitive singular ending of

ostem nouns and adjectives etc Languages in which the feature is instantiated in the same wayorbya

reex of the same protomorpheme exhibit the same state for the character

Phonological For phonological characters the character is a sound change Languages which share the

same outcome generally those that undergo the change versus those that do not exhibit the same state for

the character Phonological characters are not as useful as morphological and lexical ones however b ecause

of the high probability of indep endent parallel dev elopment in this area Most sound changes are natural

and the fact that two languages b oth undergo the same change do es not if the change is natural enough

necessarily indicate common innovation Thus only sound changes that are rare or fairly complex can b e

safely used as characters An example of this typ e of sound change in the IE family is the socalled ruki

rule whichinvolves the retraction of s after r u k and i

Enco ding linguistic information as characters

The selection of characters requires determining which linguistic information is genetic rather than indica

tive either of chance relationship or historical contact This is accomplished through adhering strictly to the

comparative metho d and using only basic vo cabulary as the basis of the lexical characters b ecause wemust

use prop erties of language which are passed genetically and are resistant to b orrowing Resistance to change

of any kind although desirable in that it is more likely to result in characters which narrowdown the space

of optimal trees is not required

The enco ding of linguistic data is in many cases quite straightforward Reliable cognation judgements can

b e made through a rigorous application of the comparative metho d In the absence of indep endent parallel

development the characters derived in this manner will b e compatible with the true evolutionary tree so

that a p erfect phylogeny will exist for the data set However b ecause indep endent parallel developmentof

character states do es o ccur wehave develop ed techniques for detecting and handling it

Detecting and handling parallel development

Borrowing The most obvious kind of b orrowing event o ccurs when one language uses a word from another

language Obvious b orrowings are easily detected the use of croissant in English for example and can

b e treated as lexical innovations In this way when the directionality of the b orrowing is clear we can still

use the lexical character in the analysis of the data set by assigning a unique state to this character for the

language doing the b orrowing and using cognation judgements for the remaining languages Undetected

b orrowings which cannot b e distinguished from true cognates are a more serious problem Fortunately these

wings are rare b ecause they must o ccur b etween languages that are so similar that words undetected b orro

in one language lo ok likewords in the other in other words languages that havenotdiverged very much

from their common ancestor If these b orrowings o ccur in sucientnumb ers however they create a distinct

pattern in the data in which a certain language shares states for a substantial group of lexical characters

with one language or group of languages while for the rest of the lexical characters and the ma jorityofthe

morphological characters it shares states with a dierent language or group This is the case of Germanic in

the IE family see Section for details

Indep endent parallel semantic shift Asecondtyp e of innovation which can create false cognates is

indep endent parallel semantic shift In this case two or more languages indep endently shift the meaning of

one lexical item to another eg a numb er of IE languages use the stem wiro originally meaning young

man warrior in the meaning man Most of these cases are fairly obvious eg the use of a ro ot meaning

give light for mo on ro ots meaning blow breathe for wind etc When the directionality of the shift

can b e established an obvious case is words for animal b eing based up on words for live or breathe then

we can again include the character in our analysis by enco ding all languages exhibiting innovations arising

from parallel semantic shift with their own unique states This allows us to include these characters in our

analysis As there is no way to predict what kind of semantic shift a language will undergo undetected cases

will no doubt remain

Same choice among alternative ro ots In some languages a semantic slot is asso ciated with two or

more lexical items b oth of which can b e reconstructed for the protolanguage without detectable distinction

w w

in meaning eg ProtoIE warm g her tepwash lewh neyg etc Although the choice

of one or other of these alternatives may representaninnovation on the part of a subgroup of languages

since there are a small number of choices the probability that the languages indep endently chose the same

alternativeishighFor this reason all characters with two states reconstructible to the protolanguage are

eliminated