<<

University of Pennsylvania ScholarlyCommons

IRCS Technical Reports Series Institute for Research in Cognitive Science

June 1995

Reconstructing the evolutionary history of natural languages

Tandy Warnow University of Pennsylvania

Donald Ringe University of Pennsylvania, [email protected]

Ann Taylor University of Pennsylvania

Follow this and additional works at: https://repository.upenn.edu/ircs_reports

Warnow, Tandy; Ringe, Donald; and Taylor, Ann, "Reconstructing the evolutionary history of natural languages" (1995). IRCS Technical Reports Series. 129. https://repository.upenn.edu/ircs_reports/129

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-95-16.

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/ircs_reports/129 For more information, please contact [email protected]. Reconstructing the evolutionary history of natural languages

Abstract In this paper we present a new methodology for determining the evolutionary history of related languages. Our methodology uses linguistic information encoded as qualitative characters, so that prospective trees can be evaluated according to various optimization criteria, much as is done in the practice of inferring evolutionary history for biological species. By contrast with biology, however, we find that the linguistic data support evolutionary trees with extremely good compatibility scores, and that for such data it is possible to find optimal trees quickly. We have applied this method to the classification of Indo-European (IE) languages; we have been able to resolve one longstanding open problem (the Indo- Hittite hypothesis), and have indicated exactly what needs to be established in order to resolve another longstanding open problem (the Italo-Celtic hypothesis). We have also discovered rather surprising facts about the history of Germanic within this family. Thus, this method provides an ability to resolve difficult questions in Historical that have proved resistent to traditional character-based methodologies and to the more recent distance based approaches of lexicostatistics. The results of our methodology also indicate weaknesses in methods currently accepted and practiced in . One of our more important results is the ability to detect and handle loan words that are not distinguishable from more important results is the ability to detect and handle loan words that are not distinguishable from true cognates by traditional methods. Finally, this methodology permits the linguist to develop and test assumptions about the evolutionary relevance of different linguistic characters.

Comments University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-95-16.

This technical report is available at ScholarlyCommons: https://repository.upenn.edu/ircs_reports/129 Institute for Research in Cognitive Science

Reconstructing the evolutionary history of natural languages

Tandy Warnow Donald Ringe Ann Taylor

University of Pennsylvania 3401 Walnut Street, Suite 400C Philadelphia, PA 19104-6228 June 1995

Site of the NSF Science and Technology Center for Research in Cognitive Science

IRCS Report 95-16

Reconstructing the evolutionary history of natural languages

Tandy Warnow Donald Ringe

Department of Computer and Information Science Department of Linguistics

UniversityofPennsylvania UniversityofPennsylvania

Ann Taylor

Department of Linguistics

UniversityofPennsylvania

June

Abstract

In this pap er wepresent a new metho dology for determining the evolutionary history of related lan

guages Our metho dology uses linguistic information enco ded as qualitativecharacters so that prosp ec

tive trees can b e evaluated according to various optimization criteria much as is done in the practice of

inferring evolutionary history for biological sp ecies By contrast with biologyhowever we nd that the

linguisti c data supp ort evolutionary trees with extremely go o d compatibility scores and that for such

data it is p ossible to nd optimal trees quicklyWehave applied this metho d to the classication of

IndoEurop ean IE languages wehave b een able to resolve one longstanding op en problem the Indo

Hittite hypothesis and have indicated exactly what needs to b e established in order to resolve another

longstandin g op en problem the ItaloCeltic hypothesis Wehave also discovered rather surprising facts

ab out the history of Germanic within this familyThus this metho d provides an ability to resolve dicult

questions in Historical Linguistics that haveproved resistent to traditional characterbased metho dolo

lexicostatistics The results of our metho dology gies and to the more recent distance based approaches of

also indicate weaknesses in metho ds currently accepted and practiced in historical linguisti cs One of our

more imp ortant results is the ability to detect and handle loan words that are not distingui shab le from

true cognates by traditional metho ds Finally this metho dology p ermits the linguist to develop and test

assumptions ab out the evolutionary relevance of dierent linguisti c characters

Intro duction

A set of languages is said to b e genetical ly related if it meets the following criteria all the languages

of the set were once a single language called a protolanguage and that protolanguage diversied

into the languages of the set through the regular pro cess of rstlanguage transmission in whichchildren

learn their native rst language from the adults of their immediate sp eech community The patterns of

similarities b etween languages that genetic relationship gives rise to are largely dierent from those caused

by contact between adult sp eechcommunities in particular they app ear chiey in the most fundamental

areas of languages structure namely their grammars and their most basic vo cabularyIfalargenumber

of sp eechcommunities all originally sp eaking a single language gradually diversify in continual contact

for centuries the pattern of relationships b etween them is p erhaps b est mo delled as what linguists call a

network just a general directed graph but if the languages lose contact as they diversifyorifthenetwork

of diversifying dialects is sampled at p oints suciently distant from one another the pattern of relationships

observed is b est mo delled as a tree The use of trees to mo del the evolution of languages in traditional

historical linguistics is thus largely justied by the observed facts

The determination of evolutionary trees for natural languages is a ma jor endeavor within historical

linguistics Muchisknown ab out the evolution of some language families such as IndoEurop ean IE while

for others essentially nothing has b een determined Even for the IE familyhowever which is the most

ery rudimentary separation extensively studied the early evolutionary history is not resolved b eyond a v

into subfamilies Germanic Italic etc to a large extent this is due to diculties in interpreting data

correctly determining which information really is relevanttoevolutionary history and to the computational

diculties of exploring the tree space Thus while it has b een p ossible for linguists to check their scholarly

interpretations of the data against a hyp othetical tree the diculties in exploring the exp onentially sized

tree space and the arguments ab out interpretation of data have led to giant impasses

In this pap er we will present a metho d for eciently inferring evolutionary history of languages known

to b e related ie memb ers of the same family of languages The metho d has three parts

We showhow to enco de information ab out languages interpreted bythescholar as qualitativechar

acters so that hyp othetical trees can then b e evaluated according to various ob jective criteria suchas

those used in evolutionary tree construction in Biology when based up on biomolecular sequences

We showhow to eciently nd the optimal and nearoptimal trees with resp ect to the compatibility

criterion a standard criterion for evaluating evolutionary trees in use for certain kinds of biological

data and appropriate to the linguistic context when the data supp orts a tree which has extremely

go o d scores

Wehave develop ed metho ds appropriate to linguistic trees for nding consensus and agreement trees

which can then b e applied to determine the common features of the b est trees for the data set In this

waywe can establish those asp ects of the evolutionary history whichhave strong supp ort as opp osed

to those whichhaveweak supp ort

Wehave applied this metho d to the problem of inferring evolutionary history for the IE family of lan

guages and have made several surprising and strikingly strongly supp orted ndings Our motivation for

studying this particular family of languages was that little progress had b een made on denitively settling the

rstorder subgrouping of IE despite the fact that it is the b est attested and b est studied family of languages

available to historical linguists A solution to the rstorder subgrouping of this family would establish the

applicability of this metho dology b eyond question We analyzed the IE data with particular interest in

determining whether our new metho dol could lay to rest the debate on two longstanding conjectures the

IndoHittite hypothesis and the ItaloCeltic hypothesis The former arms that the Anatolian subfamily is

one of two rstorder subgroups of IE In terms of the tree then this asserts that the ro ot should havetwo

children one of which is Anatolian represented in our database by Hittite its b est attested memb er The

latter claims that the Italic and Celtic subfamilies are siblings in the tree

The implications of this metho dology go b eyond settling op en problems within a single family of lan

guages however wehaveshown that standard metho ds in use in Historical Linguistics need to b e mo died

For example the incidence of undetectable b orrowing that is b orrowing that has o ccurred so early as to

b e indistinguishable from true cognation is higher than was previously thought but can b e detected and

handled through an appropriate use of our metho dology

The rest of the pap er is organized as follows In Section we b egin with a discussion of computational

asp ects of inferring evolutionary trees b oth from qualitativecharacters and distances In Section we

describ e the history of the metho dology of reconstructing evolutionary history of natural languages In

Section we discuss our metho dology and its computational complexity The application of our metho dology

to IE is presented in Section We conclude in Section with a discussion of the contribution this metho d

makes to historical linguistics

Inferring Evolutionary Trees

An evolutionary tree or phylogenyfora set S of taxa ie of sp ecies or languages describ es the evolution

of the taxa in S from their most recent common ancestor The taxa in S lab el the leaves of the tree the

internal no des of the tree represent ancestors and the tree is ro oted at the most recent common ancestor of

all the taxa in S The top ology of the tree represents the chronology of sp eciation events p oints at which

a sp ecies splits into two or more sp ecies Data of dierenttyp es can b e used as input for metho ds of tree

construction typical are distance data the basis of lexicostatistics see b elow and character data which

reect sp ecic observable characteristics of the sp ecies under study morphological data in biology there

is no comparable term in linguistics in which is used narrowly to mean the grammatical

characteristics of words

Z where Z is the set of integers Thus A qualitativecharacter or simply character is a function c S

acharacter denes a partition of the set of sp ecies S into equivalence classes each class characterized bya

single state of the character

Given a set S of n taxa whether biological or linguistic describ ed by a set C of k characters wecan

represent the information byan k matrix M of character state information in which M is the state of

ij

th

sp ecies s for the j character In this wayeach sp ecies is represented byavector of character states An

i

evolutionary tree T for S has its leaves lab elled bythevectors representing S and its internal no des also

lab elled byvectors of character states Thus for eachcharacter C theevolutionary tree T for S C

denes an extension of V T Z When the tree T is given we will denote by fv V T v

i

ig otherwise in the absence of such a tree we will let denote the set fs S sig

i

Evolutionary trees based up on characters are typically evaluated using either parsimony or compatibility

see for a discussion of these dierent criteria Wesaythata character is compatible also called

convex with a vectorlab elled tree T if for every state of the no des having that state form a connected

subgraph of T The compatibility score of a tree T is dened by cT jf C is convex on Tgj

Given an input S C nding the tree T of maximum compatibility score is called the Compatibility Criteria

Problem The other criterion in p opular use is the parsimony criterion The parsimony score of a tree T is

P

the sum H e where H e is the hamming distance on the edge e that is the numb er of p ositions

eE T

in which the endp oints of e dier Finding the tree with minimum parsimony score for a given data set is

called the maximum parsimony problemandis NPhard

In evaluating evolutionary trees for languages wehave found the compatibility criterion more relevant

than the parsimony criterion as a rst comparison If a character is not convex on a tree T then either

the tree is incorrect or the scholarly judgement that the character would b e convex on the true tree is false

Since this metho dology explicitly eliminates all characters for whichwebelieve there is a nonnegligible

probability of parallel development the fact that a character is not convex on the tree under consideration

is much more signicant than the precise numb er of extra evolutionary steps required bythatcharacter on

that tree Thus the numberofcharacters whicharenotconvex on T is a more accurate measure of the

badness of T than the exact numb er of extra transitions which o ccur on the tree For this reason the

compatibility score of a tree is more signicant than the parsimony score when evaluating evolutionary trees

for natural languages However the parsimony score of a tree is useful in letting us compare two trees with

equivalent compatibility scores

Wenow consider the computational complexity of the compatibility criteria problem

Indep endent Set Problem

Input Graph G V Eandaninteger k

Question Do es there exist a subset V V of size k such that for all fv wg V v w E

Theorem from The Compatibility Criteria Problem is NPhard and cannot be approximatedbya

o

polynomialtime algorithm within a factor of jC j unless QNP coQR

Pro of Wesketch here a reduction from Indep endent Set similar to that in Let G V Ekbean

input to the Indep endent Set problem Let S V E fr gwhere r denotes an added elementnotin



V E For eachvertex v V let c S f g b e dened by c fv gfv wv w E gItis

v

v

known that when the sp ecies set includes a ro ot that is a sp ecies r with cr forall c C then

 

a pair of binary characters and are compatible if and only if and are either disjointor

one contains the other As a result c and c are compatible if and only if v w E Thus the graph

v w

c v V g has a set of k pairwise G has an indep endent set of size k if and only if the character set f

v

compatible characters Since pairwise compatibility of binary characters ensures setwise compatibility

G has an indep endent set of size k if and only if the character set has a compatible subset of k characters

Thus Compatibility Criteria is NPhard

Since this is a linear reduction the Compatibility Criteria problem is as hard to approximate as Maximum

Indep endent Set Bellare and Sudan have proved that the Maximum Clique and therefore Maximum

Indep endent Set since it is just Clique on the complemented graph on a graph with n no des cannot b e

o

approximated to within a factor of n unless QNP coQR See Johnson for more details

Later in this pap er we will show that linguistic data supp orts a tree with almost p erfect data that is a

tree T which has compatibility score at least k t for very small t and that on suchdatawe can nd the

r t

provably optimal trees in time O nk

Subgrouping Metho dologies

Metho dologies for subgrouping related languages have b een debated for over a century There are two

basic typ es of metho dologies classical or traditional metho ds which are character based and lexicostatistical

metho ds which are distance based These twotyp es of metho dologies nevertheless have some features in

common All reliable metho ds of subgrouping languages must start from the comparative methodthe

simple but rigorous mathematical metho d for reconstructing protolanguages co died in Without the

comparative metho d one cannot even recognize cognate vo cabulary that is words inherited by genetically

related languages from their protolanguage as opp osed to words b orrowed through or

words that happ en through sheer chance to b e similar in sound and meaning

It is also imp ortant to use the earliest wellattested stages of languages in attempting to construct

evolutionary trees for them b ecause natural language change steadily ero des and obliterates original features

of the protolanguage as the daughter languages develop over time Using more recentversions of languages

y will not rule out the correct tree but can lead to characters which t not only the correct tree but man

trees and thus leads to underdierentiated trees This is for example the problem with the placementof

Albanian in the IE tree b ecause our Albanian data is from the th century and Albanian is not a very

conservative language there are very few characters which group Albanian with any other language in the

family As a result Albanian can b e t almost anywhere in a given evolutionary tree with equally go o d

scores Thus using more ancientwellattested forms of languages allows us to retain information and thus

increases the probability of selecting the correct tree

Classical Metho ds

In classical metho ds languages are assigned to the same subgroup that is memb ers of a set of related

languages are b elieved to dep end from the same no de of the evolutionary tree only if two conditions are

met the languages in question exclusively share innovations and those innovations are unlikely to

have o ccurred indep endently To use the terms wehave dened ab ove the interpretation of character

information for classical subgrouping is as follows character states which are innovations should b e

convex on the true evolutionary tree and only characters which are unlikely to b e aected by parallel

development are used

Lexicostatistics

An alternative metho d of subgrouping lexicostatistics was develop ed in the s and s

For lexicostatistical analysis one determines what prop ortion of the most basic vo cabulary is shared by each

pair of languages under investigation it is assumed that most shared items are retained inheritances and that

the prop ortion of basic items replaced correlates roughly with the time elapsed from the p oint at whichthe

two languages in question were still a single language The validity of these assumptions while questioned

has not led to a complete rejection of these distancebased metho ds Lexicostatistics refers in general to

any metho d for constructing trees for languages based up on distances obtained in this wa y but the usual

metho d makes languages whichhave the smallest distance b etween them into siblings a new parent language

is created and the metho d applied recursively

Critique of these metho dologies

The ma jor weaknesses of lexicostatistical metho ds are that they rely up on derived rather than primary

data and most asso ciated optimization problems are NPhard Indeed it seems that the real reason

that distance based metho ds are so p opular in Linguistics is that software is available for constructing trees

optimal or not from distance data and the software is easy to use and fast These software packages

however are based up on heuristics whichhave unproven p erformance and thus do not reliably nd trees

which are optimal with resp ect to any ob jective criterion While many linguists use the softwareasanal

to ol the b est linguists use it only to generate trees which can then b e evaluated through the use of

classical metho ds

The classical metho ds are character based and indicate a key insightinto the correct waytodoevolu

tionary tree construction Determining whichcharacters denote evolutionary information is a sophisticated

and dicult matter and a fair amount of the debate in the eld concerns these judgements and howto

use the linguistic information to dene characters The signicance of inectional p eculiarities is esp ecially

often unclear There are other problems b eyond the choice of characters however whichinvolve the limited

use by historical linguists of the character information Sp ecically linguists have known that all character

states should if p ossible b e convex on the true tree but this understanding was never made explicit in

ere innovations the literature That is the co died knowledge only sp ecically indicated that states that w

should b e convex the realization that this implied convexity for all states of the character was never made

precise

The situation is dierent for Lithuanian b ecause although the Lithuanian data are likewise from the th century the

language is unusually conservative

Our metho dology for constructing evolutionary trees from lin

guistic characters

Our metho dology has three essential comp onents encoding linguistic information using qualitative charac

ters an algorithm to nd the optimal and nearoptimal trees and methods for nding the common features

of the best trees

The enco ding of the linguistic information as qualitativecharacters involves a great deal of linguistic

scholarship and to some degree the judgements can b e op en to debate as dierent linguists will deem

dierent information as relevant or not relevant to the evolutionary tree and even when agreeing to the

relevance of a character they may dier in their enco ding of the character states The enco ding of linguistic

information as characters thus involves linguistic judgementaswell as mathematical mo delling This is

describ ed in Section

The input to an algorithm for evolutionary tree construction from linguistic data is as we will show a

combination of character state and directionality information whichwehave enco ded as qualitativechar

acters in which certain asp ects of the output trees are required while others are desirable but not forced

We will dene an optimization criterion called directedcompatibility for trees constructed from such data

Having dened our optimization criterion we will then showthatwe can eciently nd all the optimal and

nearoptimal trees for this criterion Our algorithm is given in Section

The motivation wehave for lo oking at all the optimal and nearoptimal trees is that while we b elievethe

ve a go o d score we cannot b e sure that it is absolutely the b est tree and not for example true tree will ha

the second b est tree Instead by examining all the close to optimal trees we can with high condence b e

sure that the correct tree is among these trees and we can also b e condent therefore that the features which

are true ab out most or p erhaps all of the nearoptimal trees will b e true ab out the true evolutionary tree

Thus our algorithm for nding the optimal tree is extended to nd all trees within some sp ecied b ound of

optimum This p ermits us to apply consensus and agreement metho ds to the prole of nearoptimal trees

so that we can infer the common features This is describ ed in Section

Typ es of Linguistic Characters

Lexical For lexical characters the character is the semantic slot as for example the meaning hand

Languages whichhave reexes of the same protolexeme for this semantic slot exhibit the same state for the

character Determining whichwords are cognate is accomplished through the application of the Comparative

Metho d whichwas co died by Henry Ho enigswald in The Comparative Metho d pro duces equivalence

classes of cognates and not just a similarity score and except in unusual cases discussed later on in this

pap er these judgements are entirely accurate Thus for semantic slots we can dene equivalence classes

and hence represent lexical information as characters

Morphological For morphological characters the character is generally a grammatical feature as for

example the formation of the future stem the way the passiveismarked the genitive singular ending of

ostem nouns and adjectives etc Languages in which the feature is instantiated in the same wayorbya

reex of the same protomorpheme exhibit the same state for the character

Phonological For phonological characters the character is a sound change Languages which share the

same outcome generally those that undergo the change versus those that do not exhibit the same state for

the character Phonological characters are not as useful as morphological and lexical ones however b ecause

of the high probability of indep endent parallel dev elopment in this area Most sound changes are natural

and the fact that two languages b oth undergo the same change do es not if the change is natural enough

necessarily indicate common innovation Thus only sound changes that are rare or fairly complex can b e

safely used as characters An example of this typ e of sound change in the IE family is the socalled ruki

rule whichinvolves the retraction of s after r u k and i

Enco ding linguistic information as characters

The selection of characters requires determining which linguistic information is genetic rather than indica

tive either of chance relationship or historical contact This is accomplished through adhering strictly to the

comparative metho d and using only basic vo cabulary as the basis of the lexical characters b ecause wemust

use prop erties of language which are passed genetically and are resistant to b orrowing Resistance to change

of any kind although desirable in that it is more likely to result in characters which narrowdown the space

of optimal trees is not required

The enco ding of linguistic data is in many cases quite straightforward Reliable cognation judgements can

b e made through a rigorous application of the comparative metho d In the absence of indep endent parallel

development the characters derived in this manner will b e compatible with the true evolutionary tree so

that a p erfect phylogeny will exist for the data set However b ecause indep endent parallel developmentof

character states do es o ccur wehave develop ed techniques for detecting and handling it

Detecting and handling parallel development

Borrowing The most obvious kind of b orrowing event o ccurs when one language uses a word from another

language Obvious b orrowings are easily detected the use of croissant in English for example and can

b e treated as lexical innovations In this way when the directionality of the b orrowing is clear we can still

use the lexical character in the analysis of the data set by assigning a unique state to this character for the

language doing the b orrowing and using cognation judgements for the remaining languages Undetected

b orrowings which cannot b e distinguished from true cognates are a more serious problem Fortunately these

wings are rare b ecause they must o ccur b etween languages that are so similar that words undetected b orro

in one language lo ok likewords in the other in other words languages that havenotdiverged very much

from their common ancestor If these b orrowings o ccur in sucientnumb ers however they create a distinct

pattern in the data in which a certain language shares states for a substantial group of lexical characters

with one language or group of languages while for the rest of the lexical characters and the ma jorityofthe

morphological characters it shares states with a dierent language or group This is the case of Germanic in

the IE family see Section for details

Indep endent parallel semantic shift Asecondtyp e of innovation which can create false cognates is

indep endent parallel semantic shift In this case two or more languages indep endently shift the meaning of

one lexical item to another eg a numb er of IE languages use the stem wiro originally meaning young

man warrior in the meaning man Most of these cases are fairly obvious eg the use of a ro ot meaning

give light for mo on ro ots meaning blow breathe for wind etc When the directionality of the shift

can b e established an obvious case is words for animal b eing based up on words for live or breathe then

we can again include the character in our analysis by enco ding all languages exhibiting innovations arising

from parallel semantic shift with their own unique states This allows us to include these characters in our

analysis As there is no way to predict what kind of semantic shift a language will undergo undetected cases

will no doubt remain

Same choice among alternative ro ots In some languages a semantic slot is asso ciated with two or

more lexical items b oth of which can b e reconstructed for the protolanguage without detectable distinction

w w

in meaning eg ProtoIE warm g her tepwash lewh neyg etc Although the choice

of one or other of these alternatives may representaninnovation on the part of a subgroup of languages

since there are a small number of choices the probability that the languages indep endently chose the same

alternativeishighFor this reason all characters with two states reconstructible to the protolanguage are

eliminated

Other enco ding problems

Unlike linguists who employ lexicostatistics we need not eliminate a lexical character b ecause some of the

languages lackaword for that semantic slot Lackofwords for semantic slots is a fairly prevalent problem

however for a variety of reasons For extinct languages this is generally b ecause the word is simply not

attested in our sometimes limited corpus It could also happ en however that a language simply do es not

enco de a particular semantic slot by means of a single lexical item as for example a tropical language might

lackaword meaning ice or snow A character which is missing one or more states can still b e used

however as long as each missing lexical item is co ded as a separate state

Similarly for morphological characters the problem is how to enco de loss Take as an example the

IE augment which marks the past tenses in some archaic IE languages All the languages whichhavethe

augment clearly exhibit the same state for this character but b ecause loss is an easily rep eatable indep endent

innovation those languages whichdonothave the augment cannot b e assumed to exhibit a single state

Rather without any evidence to the contraryitmust b e assumed that each could have lost it indep endently

We enco de this byhaving each of the languages whichdonothave the augment exhibit a unique state for

this character This enco ding do es not force us to group the languages lacking the augment together in one

subtree

Some of the linguistic data gives directionalitybetween character states For example some words can

wn to b e later forms of others b ecause of regular sound changes This implies a directionality b e clearly sho

in the evolutionary history of the states of the character representing the semantic slot for those words

Sometimes the information is more limited and only indicates that a particular character state is ancestral

without indicating anything ab out the relationship of the other states to each other All such information

allows us to eliminate certain unro oted trees from consideration and for those trees which are not eliminated

it identies a region within the tree in which the ro ot protolanguage must b e placed

To summarize the directionality constraints we observehave one of the following two forms

The state of the ro ot maybeknown for some character

For some character wemay know that state o ccupies a subtree ab ove state ie the path from

i j

the ro ot r in T to the subtree of no des lab elled by passes through at least one no de v such that

j

v i Both and are required to b e convex on T

i j

The additional constraints we are considering ab ove are considered absolute constraints as opp osed to

desirable constraints This means that any tree we wish to consider as an evolutionary tree for our data

set must have the prop erties stated ab ove W enowshowhow to enco de these directionality constraints as

undirected characters

Lemma We can encode the additional directionality information by adding at most one character per

information item and one additional species

Pro of Let x indicate the added sp ecies For each of the input characters c C wewillset cxtobe

a state unused byany other sp ecies unless otherwise sp ecied b elow To enco de a constraintoftyp e

whichsays that state i of character is ancestral weset xiWe do not need to add any additional

characters to the data set for this kind of constraint For a constraintoftyp e requiring that be above

i

such that we addacharacter and set x s for all s such that si and s for all s

j

sj All remaining sp ecies are set to unique states one for each sp ecies We note that the convexity

requirementof and translates directly into a convexity requirementof sothatwe can transfer the

i j

convexity requirement to the newly added character It can then b e veried that these added characters are

compatible with a tree T if and only if T ro oted at x satises these directionality constraints

Finding optimal trees

Our optimization criterion is dened as follows In eachcasewe will assume that we are given a set C of



characters with additional character set C enco ding the directionality constraints as dened ab ove



Denition The directedcompatibility score of T with resp ect to C C is if one of the directionality



constraints is violated and otherwise it is the numberofcharacters in C C which are convex on T

Even though we will b e examining trees primarily with resp ect to directed compatibilitywe also need to

dene the directed parsimony score of a tree



Denition The directedparsimony score of T with resp ect to C C is if one of the directionality

constraints is violated and otherwise it is the parsimony score of T

The b est p ossible tree for a set of sp ecies dened bycharacters has every character convex on it Sucha

geny It is not hard to see that when a p erfect phylogeny exists it has a minimum tree is called a perfect phylo

parsimony score and a maximum compatibility score Determining if a p erfect phylogeny exists called the

Perfect Phylogeny ProblemisNPComplete but bycontrast with the parsimony and compatibility

problems it can b e solved in p olynomial time when any of the relevant parameters n jS jk jC jorthe

maximum number r of states p er character is b ounded We will show that linguistic data

is close to p erfect in the sense that the compatibility and parsimony scores are close to those achievable

by p erfect phylogenies Preciselywe will dene the imp erfection of a data set as follows

Denition Aset S of sp ecies dened by the character set C has imperfection t if the optimal tree has

jC j t characters convex on it

Wenowgivethekey observation for an ecient metho d for nding optimal trees on data sets with

very small imp erfection

r r t

Theorem We can nd the best treewithresp ect to directedcompatibility in O nc nk time



where n jS j t is the imperfection of the input set c jC j and k jC j



Pro of We b egin by ensuring that there is at least one p erfect phylogeny consistent with C by running the

 r

p erfect phylogeny algorithm of onC This costs us O nc time If these characters are compatible

 

then we can search among all subsets C such that C C C C in decreasing order of cardinality

t

calls to for a total cost of until we nd a set which supp orts a p erfect phylogeny This requires O k

r t

O nk time for this second phase The total of the two phases is thus as stated ab ove

The problem of inferring the optimal trees with resp ect to parsimony andor directed parsimony is more

complicated If the b est tree T with resp ect to parsimony or directed parsimony has parsimony score pT

P

then letting t pT r we note that T has imp erfection b ounded from ab oveby t Here

c

cC

P

r is the numb er of states attained on S for character csothat r is the parsimony score that

c c

cC

a p erfect phylogenywould attain were it to exist Thus in particular T has compatibility score b ounded

from b elowby jC j t and so we can use the greedy algorithm describ ed ab ovetondT provided that we

can explicitly examine al l trees with compatibility scores ab ove a given threshold

Unfortunately the numb er of all such trees is not necessarily p olynomial even for b ounded r and t This

has not b een a problem on our linguistic data sets in which there are v ery few trees with optimal or near

optimal compatibility scores so that eachsuch tree can b e examined At this p oint the question is then how

to set the lab els of the internal no des of each xed tree T so as to obtain a minimum directed parsimony

score for T Todothiswe use Fitchs algorithm This algorithm takes as input a leaflab elled tree where

k k

the lab els are vectors in Z and assigns lab els from Z to the internal no des so as to obtain a minimum

parsimony score and do es so in O nk time Optimizing for the parsimony criterion on a xed top ology

ensures an optimum score for b oth the directed parsimonyaswell as the directed compatibility criteria and

given a lab elled tree it is straightforward to compute the directed parsimony andor directed compatibility

scores

Computing Minimal Trees

In Linguistics as in Biologyweareinterested in minimal trees A tree T is said to b e minimal with respect

to directed compatibility if the contracting of any edge decreases the directed compatibility score of T

Similarly we will say that a tree T is minimal with respect to directed parsimony if the contraction of any

edge increases the directed parsimony score of T The reason weareinterested in minimalphylogenies is

that we wish the tree to represent the information forced by the data set and no other Thus for example

a tree T of the IE family will indicate supp ort for the in ItaloCeltic hypothesis if and only if the leaves for

Old Irish and representatives of Celtic and Italic subfamilies resp ectively are siblings and haveno

additional siblings so that the parent of these leaves has no other children If the tree is not minimalit

may falsely indicate supp ort

To compute minimal trees whether with resp ect to directed parsimony or directed compatibilityis

not dicult Since the denition of minimality do es not imply that the score is minimal only that edge

contractions change the score for the worse we can takeany tree T as a starting p oint and simply contract

all unnecessary edges That is we identify and contract all edges whose contractions do not change the

directed parsimony score for example and contract all such edges Identifying these edges requires one

application p er edge of the algorithm describ ed in the section ab ove for computing the directed parsimony

score of a leaflab elled tree Since each xed top ology costs us O nk time and there are O n edges this is

O n k time At the end of this pro cess the tree which results is minimal with resp ect to directed parsimony

A similar pro cess can construct trees minimal with resp ect to directed compatibility

To nd trees which are minimal with resp ect to directed compatibility and whichalsohaveoptimal

or nearoptimal compatibility scores can b e accomplished by using the p olynomial time p olynomial delay

listing algorithm in This version of the algorithm only outputs minimal p erfect phylogenies and so the

optimal trees with resp ect to compatibility criteria that result from using this algorithm are necessarily

minimal

Finding Common Themes

Finding the common themes of a prole of trees eac h leaflab elled by the same sp ecies set S is a standard

problem in evolutionary tree construction In our case we will construct the prole by using either the

directed parsimony or directed compatibility criteria and selecting all the trees which are suciently close to

optimal One waytoachieve this is to use the p olynomial time p olynomial delay p erfect phylogeny algorithm

in as part of a greedy heuristic to compute all the b est minimal trees with resp ect to compatibility or

directed compatibility Having gathered these trees we can then consider the question of inferring from

these trees either a single tree called a consensus tree on the entire set of languages or a set of trees each

called an agreement tree on subsets of the languages

There are many mo dels of consensus trees but the one whch seems most relevanttothe

evolutionary tree problem for languages is the relaxed discord lo cal consensus tree Another relevant

approachistoidentify all maximal agreement subsets of the set S

Since there is little variation in the optimal trees a third approachistoidentify all the common themes

indicated by the maximum agreement trees and a limited set of options which each tree in the prole

selects from This has b een our approach in the analysis of IE

The subgrouping of IndoEurop ean

In order to test the metho dology we attempted a subgrouping of IE among the b est understo o d of the worlds

language families We selected from each of the subfamilies within IE the oldest wellattested language to

represent the subfamilyThus wehave Latin LA st century BCE representing Italic Old Irish OI

thth cc CE representing Celtic Hittite HI thth cc BCE representing Anatolian Vedic VE

ca BCE representing Indic Avestan thth cc BCE representing Iranian OE

thth cc CE representing Germanic Tocharian B TB thth cc CE Greek GK Classical Attic

dialect th c BCE Armenian AR th c CE Albanian AL th c CE Lithuanian LI th c

CE representing Baltic and OCS th c CE representing Slavic The following

is a detailed description of our ndings for the IE family

Cho osing characters

In order to reduce the p ossibilityofborrowings among the lexical characters and bias on our part in cho osing

these characters we used an existing basic vo cabulary list of semantic slots Each semantic slot

was treated as a single character and judgements of cognation were made on the basis of this metho d Once

the states were enco ded for eachcharacter we detected evidence of parallel development These included

all characters for whichtwo or more lexical ro ots are reconstructable for the protolanguag total

other characters in which parallel semantic shift or b orrowing has clearly taken place or in whichthe

probability that it has app ears to b e very high total

Of the characters in the directionality of the parallel semantic shift or b orrowing or could b e detected

in all but cases so that we could include in our analysis all but characters Of the full set of characters

ery p ossible tree on the leaf set were informative ie characters that do not t ev

Since nothing similar to a basic vo cabulary list exists for morphological and phonological characters and

since these will vary from family to family an appropriate set of morphophonological characters has to b e

develop ed for each familyFor the IE test we used ten ProtoIndoEurop ean morphological items whichhave

a reex in most of the IE languages and four phonological developments whichwe judged to b e suciently

abnormal as not to b e easily rep eatable These characters are organization of the verb system presence of

the augment presence of a thematized aorist pro ductive function of skeo function of dhi mediopassive

primary marker sg and pl thematic optative sux most archaic future stem genitive singular of o

stem nouns and adjs sup erlative sux satem sound change retraction of s in rukienvironments shap e

of oblique dual and plural case endings and initial d in tears Of these morphologicalphonological

characters ten proved to b e informative

Thus at the end we had informative lexical characters and informative morphological characters

Applying the metho dology

Our initial analysis of the data determined that the inclusion of Germanic represented by Old English OE

resulted in trees with low compatibility score We analyzed the data without Old English and found four

trees with extremely high compatibility scores ranging from all but to all but characters convex on the

tree

Comparing the b est trees without Germanic

Our analysis indicates that Albanian can b e placed anywhere in the tree with equal b enet provided as

it is ab ove the Satem Core and not in the minimal subtree containing b oth Greek and Armenian there is



Our list has one more item than Tischlers b ecause we split the item day into two items p eriod of hours and p eriod

of daylight

not enough data linking Albanian to sp ecic other IE languages As a consequence we do not indicate the

placement of Albanian in any of these trees

The four b est trees have many common features and can b e distinguished only in terms of the following

two criteria

the placementofTocharian B it has however only two p ossible lo cations and

whether the ItaloCeltic hyp othesis is denied or not note that none of these trees actively supp orts

this hyp othesis

Thus most of the structure of the IE family tree can b e deduced from examining these four trees In

particular all of our trees supp ort the IndoHittite hyp othesis and in fact all of the trees we examined with

parsimony scores anywhere close to the optimal parsimony score supp orted this hyp othesis This is the rst

evidence for the IndoHittite hyp othesis that do es not rest on inconclusive traditional arguments

The problem of Germanic

Our analysis with Germanic indicated a sharp distinction b etween the lexical data and the morphological

data in that the lexical data supp orted a placement of Old English much higher in the tree as compared to

the morphological information This dual allegiance of Germanic is unique among the rstorder subgroups

of IE It app ears to p oint to a situation in which Germanic b egan to develop within the Satem Core as

evidenced by its morphology but moved away b efore the nal satem innovations It then moved into close

contact with the western languages Celtic and Italic and b orrowed much of its distinctivevo cabulary

from them at a p erio d early enough that these b orrowings cannot b e distinguished from true cognates We

represent this situation with two graphs one is the genetic tree given in Figure and the other is the

historical tree given in Figure at a time after the migration of Germanic out of the Satem Core Note that

in the historical tree we do not indicate the placement of Germanic with tree edges but rather with directed

edges to indicate historical contact rather than genetic descent

Conclusions for IndoEurop ean

Our analysis establishes the following conclusions

The IndoHittite hyp othesis is completely supp orted without question by the data

The ItaloCeltic hyp othesis is weakly denied by the data To argue in favor of the this conjecture it is

necessary to impugn the two lexical characters eye and yewhich are incompatible with the b est tree

and at the same time to nd at least one reasonable character morphological or lexical which forces

a grouping of Latin and Old Irish together Suchacharacter would have the same state for Latin and

Old Irish and a dierent state shared b etween Hittite or ProtoIE and some other IE language

Germanic b egan to develop in the Satem Core and then migrated out of the core at an early date to

join the western languages

Summary

The relative merits of traditional subgrouping lexicostatistics and the metho d wehave b een developing can

b e seen by comparing the results of those metho dologies as applied to a traditionally intractable problem

the rstorder subgrouping of the IE language familyTraditional metho ds failed to pro duce a convincing

tree and conservative IEists settled for a description of the relations b etween the languages resembling

a network of geographical dialects Subsequentwork along traditional lines pro duced no further

p ositive results arguments in favor of a more articulated tree structure supp orting the IndoHittite

hyp othesis and the ItaloCeltic hyp othesis were debated at length and rejected However many

linguists continue to susp ect that new arguments to supp ort these hyp otheses can b e found Lexicostatistical

work made no substantial advances even the most careful and sophisticated applications of lexicostatistics

to the IE problem pro duced equivo cal and contradictory results and the b estinformed mathematical

linguist who has attempted suchwork makes notably mo dest and reserved claims for the metho d

By contrast wehave b een able to construct a robust evolutionary tree of the IE languages as detailed in

Section wehaveeven b een able to show that the Germanic subgroup of the family underwent a surprising

shift in its aliations at a very early p erio d of its indep endent historyan unexp ected but thoroughly plausible

nding that has startling implications for the history of Germanic While a considerable number of

problems remain to b e solved our promising preliminary results give us reason to hop e that wehave nally

evolved a metho d which preserves the strengths of traditional subgrouping techniques as lexicostatistics

do es not while avoiding the wellknown weaknesses of traditional metho dology

Acknowledgements

The rst author wishes to thank Paul Angello the National Science Foundation for a National Young

Investigator Award CCR and ARO GrantDAALC whose nancial supp ort has

made this research p ossible

Bibliography

References

Agarwala R and D FernandezBaca Fast and simple algorithms for p erfect phylogeny and

triangulating colored graphs D IMACS TR

Agarwala R and FernandezBaca D A p olynomial time algorithm for the phylogeny problem

when the numb er of states is xed SIAM Journal on Computing

M Bellare and M Sudan Improved nonapproximability results Proceedings of the TwentySixth

Annual ACM Symposium on Theory of Computing Montreal ACM pp

Bo dlaender H Fellows M and Warnow T Two strikes against p erfect phylogeny Pro ceedings

of the International Congress on Automata and Language Pro cessing

Cowgill Warren Italic and Celtic sup erlatives and the dialects of IndoEurop ean in Cardona

George Henry M Ho enigswald and Alfred Senn eds IndoEuropean and IndoEuropeans University

of Pennsylvania Press Philadelphia

Cowgill Warren More evidence for IndoHittite the tenseasp ect systems in Heilmann Luigi

ed Proceedings of the Eleventh International Congress of Linguists Mulino Bologna

Cowgill Warren Anatolian hiconjugation and IndoEurop ean p erfect instalment I I in Neu

Erich and Wolfgang Meid eds Hethitisch und Indogermanisch Innsbrucker Beitrage zur Sprachwis

senschaft Innsbruck

W H E Day and D Sankoff Computational complexity of inferring phylogenies by comp atibility

Syst Zo ol Vol No pp

Day WHE Computationally dicult parsimony problems in phylogenetic systematics Journal

of Theoretical Biology

Day WHE Optimal algorithms for comparing trees with lab eled leaves Journal of Classication

Day WHE Johnson DS and Sanko D The computational complexity of inferring ro oted

phylogenies by parsimony Mathematical Biosciences

Day WHE Computational complexity of inferring phylogenies from dissimilarity matrices

Bulletin of Mathematical Biology

Dyen Isidore The lexicostatistically determined relationship of a language group IJAL

Dyen Isidore On the validity of comparative lexicostatistics in Linguistic Subgrouping and Lexi

costatistics pp Mouton Paris

Dyen Isidore Kruskal Joseph B and Black Paul An Indo europ ean Classication A Lexicosta

tistical Exp eriment Transactions of American Philosophical Society Philadelphia PA

Embleton Sheila M S tatistics in historical linguistics Bro ckmeyer Bo chum

Farach M Kannan S and Warnow T A robust mo del for nding optimal evolutionary trees

app eared Algorithmica

Farach M and Thorup M Fast Comparison of Evolutionary Trees Pro ceedings of the Fifth

Annual ACMSIAM Symp osium on Discrete Algorithms

Felsenstein J Numerical metho ds for inferring evolutionary trees The Quarterly Review of

biologyVol No

olution minimum change for a sp ecied tree Fitch WM Toward dening the course of ev

top ology Systematic Zoology

Foulds LR and Graham RL The steiner problem in phylogeny is NPComplete Advances in

Applied Mathematics

Guseld D Ecient algorithms for inferring evolutionary trees NetworksVol pp

Ho enigswald Henry M Language Change and Linguistic ReconstructionUniversity of Chicago

Press Chicago

D S Johnson A catalog of complexity classes in Algorithms and Complexityvolume A of Handbook

of Theoretical Computer Science Elsevier science publishing company Amsterdam pp

Kannan S and Warnow T Inferring evolutionary history from DNA sequences SIAM Journal

on Computing

Kannan S and Warnow T A fast algorithm for the computation and enumeration of p erfect

phylogenies Pro ceedings of ACMSIAM Symp osium on Discrete Algorithms

Kannan S Warnow T and Yo oseph S Computing the lo cal consensus of trees Pro ceedings

of ACMSIAM Symp osium on Discrete Algorithms

Keselman D and Amir A Maximum agreement subtree in a set of evolutionary trees metrics

and ecient algorithms Proceedings of the ACMSIAM Symposium on Discrete Algorithms pp

McMorris FR Warnow T and Wimer T Triangulating vertex colored graphs S IAM Journal

on Discrete Mathematics

Meillet A La Methode Comparative en Linguistique Historique H Aschehoug Co Oslo

Porzig Walter Die Gliederung des indogermanischen Sprachgebiets Carl Winter Heidelb erg

Phillips CA and Warnow TJ A new model for building consensus treesmanuscript

Ringe Donald A Jr Laryngeal isoglosses in the western IndoEurop ean languages in Bammes

b erger Alfred ed Die Laryngaltheorie Carl Winter Heidelb erg

Ringe Donald A Jr Evidence for the p osition of Tocharian in the IndoEurop ean family Die

Sprache

Ringe Donald A Jr On Calculating the Factor of Chance in Language Comparison Transactions

of American Philosophical Society Vol no Philadelphia PA

Ruvolo Maryellen Reconstructing genetic and linguistic trees phenetic and cladistic approaches

in Henry M Ho enigswald and Linda F Wiener edsBiological Metaphor and Cladistic Classication

pp UniversityofPennsylvania Press Philadelphia

Steel MA The complexity of reconstructing trees from qualitativecharacters and subtrees

Journal of Classication

Steel M and Warnow T Kaikoura tree theorems computing the maximum agreement subtree Infor

mation Pro cessing Letters pp

Sturtevant Edgar H Acomparative grammar of the Linguistic So ciety of Amer

ica Philadelphia

gie und Lexikostatistik Innsbrucker Beitrage zur Sprachwis Tischler Johann Glottochronolo

senschaft Innsbruck

Warnow T Tree compatibility and inferring evolutionary history J of Algorithms pp

PIE

H

H

H

Hi X

H

H

H

H

OI

X

H

H

H

H

H

La X

H

H

H

H

H

X X

H H

H

H

H

H

Ar X

BaltoSlavic IndoIranian

H

H

H

H

H H

Gk TB

OCS Li Ve Av

Figure Best tree not including Germanic nonconvex lex morph

PIE

H

H

H

X Hi

H

H

H

X OI

H

H

H

H

X La

H

H

H

H

X TB

H

H

H

H

H

X X

H

H

H

H

H

Ar Gk

H

BaltoSlavic IndoIranian

H

H

H H

OCS Li Ve Av

Figure Second b est tree not including Germanic nonconvex lex morph

PIE

H

H

H

H

H

H

H

X Hi

H

H

H

H

H

H

H

H

H

H

X La OI

H

H

H

H

H

X X

H H

H

H

H

H

Ar X

BaltoSlavic IndoIranian

H

H

H

H

H H

Gk TB

Ve Av OCS Li

Figure Third b est tree not including Germanic nonconvex lex morph

PIE

H

H

H

H

H

Hi X

H

H

H

H

H

H

H

X La OI

H

H

H

H

X TB

H

H

H

H

H

X X

H

H

H

H

H

Ar Gk

H

BaltoSlavic IndoIranian

H

H

H H

OCS Li Ve Av

Figure Fourth b est tree not including Germanic nonconvex lex morph

PIE

H

H

H

X Hi

H

H

H

OI

X

H

H

H

H

X La

H

H

H

H

X TB

H

H

H

H

H

X X

H

H

H

H

H

Ar Gk

H

X IndoIranian

H

H

H

H

H

Ve Av

BaltoSlavic OE

H

H

OCS Li

Figure The most likely genetic tree for the entire IE family

HI

OI

OE LA

OCS LI VE AV AR GK TB

Figure The historical contact tree after the migration of Germanic