Non-Parametric Bayesian Areal Linguistics

Hal Daumé III
School of Computing, University of Utah
Salt Lake City, UT 84112
[email protected]

Abstract

We describe a statistical model over linguistic areas and phylogeny. Our model recovers known areas and identifies a plausible hierarchy of areal features. The use of areas improves genetic reconstruction of Indo-European both qualitatively and quantitatively according to a variety of metrics. We model linguistic areas by a Pitman-Yor process and linguistic phylogeny by Kingman's coalescent.

1 Introduction

Why are some languages more alike than others? This question is one of the most central issues in historical linguistics. Typically, one of three answers is given (Aikhenvald and Dixon, 2001; Campbell, 2006). First, the languages may be related "genetically." That is, they may have all derived from a common ancestor. Second, the similarities may be due to chance. Some language properties are simply more common than others, which is often attributed to be mostly due to linguistic universals (Greenberg, 1963). Third, the languages may be related areally. Languages that occupy the same geographic area often exhibit similar characteristics, not due to genetic relatedness, but due to sharing. Regions (and the languages contained within them) that exhibit sharing are called linguistic areas and the features that are shared are called areal features.

Much is not understood or agreed upon in the field of areal linguistics. Different linguists favor different definitions of what it means to be a linguistic area (are two languages sufficient to describe an area or do you need three (Thomason, 2001; Katz, 1975)?), what areal features are (is there a linear ordering of "borrowability" (Katz, 1975; Curnow, 2001) or is that too prescriptive?), and what causes sharing to take place (does social status or number of speakers play a role (Thomason, 2001)?).

In this paper, we attempt to provide a statistical answer to some of these questions. In particular, we develop a Bayesian model of typology that allows for, but does not force, the existence of linguistic areas. Our model also allows for, but does not force, a preference for some features to be shared areally. When applied to a large typological database of linguistic features (Haspelmath et al., 2005), we find that it discovers linguistic areas that are well documented in the literature (see Campbell (2005) for an overview), and a small preference for certain features to be shared areally. The latter agrees, to a lesser degree, with some of the published hierarchies of borrowability (Curnow, 2001). Finally, we show that reconstructing genetic trees is significantly aided by knowledge of areal features.

We note that Warnow et al. (2005) have independently proposed a model for phonological change in Indo-European (based on the Dyen dataset (Dyen et al., 1992)) that includes notions of borrowing. Our model is different in that we (a) base our model on typological features rather than just lexical patterns and (b) explicitly represent language areas, not just one-time borrowing phenomena.

2 Background

We describe (in Section 3) a non-parametric, hierarchical Bayesian model for finding linguistic areas and areal features. In this section, we provide necessary background—both linguistic and statistical—for understanding our model.

593 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 593–601, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

2.1 Areal Linguistics

Areal effects on linguistic typology have been studied since, at least, the late 1920s by Trubetzkoy, though the idea of tracing family trees for languages goes back to the mid 1800s and the comparative study of historical linguistics dates back, perhaps, to Giraldus Cambrenis in 1194 (Campbell, In press).

A recent survey provides a short introduction to both the issues that surround areal linguistics, as well as an enumeration of many of the known language areas (Campbell, 2005). A fairly wide, modern treatment of the issues surrounding areal diffusion is also given by essays in a recent book edited by Aikhenvald and Dixon (2001). The essays in this book provide a good introduction to the issues in the field. Campbell (2006) provides a critical survey of these and other hypotheses relating to areal linguistics.

There are several issues which are basic to the study of areal linguistics (these are copied almost directly from Campbell (2006)). Must a linguistic area comprise more than two languages? Must it comprise more than one language family? Is a single trait sufficient to define an area? How "nearby" must languages in an area be to one another? Are some features more easily borrowed than others?

Despite these formal definitional issues of what constitutes a language area and areal features, most historical linguists seem to believe that areal effects play some role in the change of languages.

2.1.1 Established Linguistic Areas

Below, we list some of the well-known linguistic areas; Campbell (2005) provides a more complete listing together with example areal features for these areas. For each area, we list associated languages:

The Balkans: Albanian, Bulgarian, Greek, Macedonian, Rumanian and Serbo-Croatian. (Sometimes: Romani and Turkish.)

South Asian: Languages belonging to the Dravidian, Indo-Aryan, Munda and Tibeto-Burman families.

Meso-America: Cuitlatec, Huave, Mayan, Mixe-Zoquean, Nahua, Otomanguean, Tarascan, Tequistlatecan, Totonacan and Xincan.

North-west America: Alsea, Chimakuan, Coosan, Eyak, Haida, Kalapuyan, Lower Chinook, Salishan, Takelman, Tlingit, Tsimshian and Wakashan.

The Baltic: Baltic languages, Baltic German, and Finnic languages (especially Estonian and Livonian). (Sometimes many more are included, such as: Belorussian, Latvian, Lithuanian, Norwegian, Old Prussian, Polish, Romani, Russian, Ukrainian.)

Ethiopia: Afar, Amharic, Anyuak, Awngi, Beja, Ge'ez, Gumuz, Janjero, Kefa, Sidamo, Somali, Tigre, Tigrinya and Wellamo.

Needless to say, the exact definition and extent of the actual areas is up to significant debate. Moreover, claims have been made in favor of many linguistic areas not defined above. For instance, Dixon (2001) presents arguments for several Australian linguistic areas and Matisoff (2001) defines a South-East Asian language area. Finally, although "folk lore" is in favor of identifying a linguistic area including English, French and certain Norse languages (Norwegian, Swedish, Low Dutch, High German, etc.), there are counter-arguments to this position (Thomason, 2001) (see especially Case Study 9.8).

2.1.2 Linguistic Features

Identifying which linguistic features are most easily shared "areally" is a long-standing problem in contact linguistics. Here we briefly review some of the major claims. Much of this overview is adopted from the summary given by Curnow (2001).

Haugen (1950) considers borrowability only as far as the lexicon is concerned. He provided evidence that nouns are the easiest, followed by verbs, adjectives, adverbs, prepositions, etc. Ross (1988) corroborates Haugen's analysis and deepens it to cover morphology, syntax and phonology. He proposes the following hierarchy of borrowability (easiest items coming first): nouns > verbs > adjectives > syntax > non-bound function words > bound morphemes > phonemes. Coming from a "constraints" perspective, Moravcsik (1978) suggests that: lexical items must be borrowed before lexical properties; inflected words before bound morphemes; verbal items can never be borrowed; etc.

Curnow (2001) argues that coming up with a reasonable hierarchy of borrowability is so difficult that "we may never be able to develop such constraints." Nevertheless, he divides the space of borrowable features into 15 categories and discusses the evidence supporting each of these categories, including: phonetics (rare), phonology (common), lexical (very common), interjections and discourse markers (common), free grammatical forms (occasional), bound grammatical forms (rare), position of morphology (rare), syntactic frames (rare), clause-internal syntax (common), between-clause syntax (occasional).

2.2 Non-parametric Bayesian Models

We treat the problem of understanding areal linguistics as a statistical question, based on a database of typological information. Due to the issues raised in the previous section, we do not want to commit to the existence of a particular number of linguistic areas, or particular sizes thereof. (Indeed, we do not even want to commit to the existence of any linguistic areas.) However, we will need to "unify" the languages that fall into a linguistic area (if such a thing exists) by means of some statistical parameter. Such problems have been studied under the name non-parametric models. The idea behind non-parametric models is that one does not commit a priori to a particular number of parameters. Instead, we allow the data to dictate how many parameters there are. In Bayesian modeling, non-parametric distributions are typically used as priors; see Jordan (2005) or Ghahramani (2005) for overviews. In our model, we use two different non-parametric priors: the Pitman-Yor process (for modeling linguistic areas) and Kingman's coalescent (for modeling linguistic phylogeny), both described below.

2.2.1 The Pitman-Yor Process

One particular example of a non-parametric prior is the Pitman-Yor process (Pitman and Yor, 1997), which can be seen as an extension to the better-known Dirichlet process (Ferguson, 1974). The Pitman-Yor process can be understood as a particular example of a Chinese Restaurant Process (CRP) (Pitman, 2002). The idea in all CRPs is that there exists a restaurant with an infinite number of tables. Customers come into the restaurant and have to choose a table at which to sit.

The Pitman-Yor process is described by three parameters: a base rate α, a discount parameter d and a mean distribution G_0. These combine to describe a process denoted by PY(α, d, G_0). The parameters α and d must satisfy 0 ≤ d < 1 and α > −d. In the CRP analogy, the model works as follows. The first customer comes in and sits at any table. After N customers have come in and seated themselves (at a total of K tables), the Nth customer arrives. In the Pitman-Yor process, the Nth customer sits at a new table with probability proportional to α + Kd, and sits at a previously occupied table k with probability proportional to #_k − d, where #_k is the number of customers already seated at table k. Finally, with each table k we associate a parameter θ_k, with each θ_k drawn independently from G_0. An important property of the Pitman-Yor process is that draws from it are exchangeable: perhaps counterintuitively, the distribution does not care about customer order.

The Pitman-Yor process induces a power-law distribution on the number of singleton tables (i.e., the number of tables that have only one customer). In general, the number of singleton tables grows as O(αN^d). When d = 0, we obtain a Dirichlet process, with the number of singleton tables growing as O(α log N).

2.2.2 Kingman's Coalescent

Kingman's coalescent is a standard model in population genetics describing the common genealogy (ancestral tree) of a set of individuals (Kingman, 1982b; Kingman, 1982a). In its full form it is a distribution over the genealogy of a countable set.

Consider the genealogy of n individuals alive at the present time t = 0. We can trace their ancestry backwards in time to the distant past t = −∞. Assume each individual has one parent (in genetics, haploid organisms), and therefore genealogies of [n] = {1, ..., n} form a directed forest. Kingman's n-coalescent is simply a distribution over genealogies of n individuals. To describe the Markov process in its entirety, it is sufficient to describe the jump process (i.e., the embedded, discrete-time Markov chain over partitions) and the distribution over coalescent times. In the n-coalescent, every pair of lineages merges independently with rate 1, with parents chosen uniformly at random from the set of possible parents at the previous time step.

The n-coalescent has some interesting statistical properties (Kingman, 1982b; Kingman, 1982a). The marginal distribution over tree topologies is uniform and independent of the coalescent times. Secondly, it is infinitely exchangeable: given a genealogy drawn from an n-coalescent, the genealogy of any m contemporary individuals alive at time t ≤ 0 embedded within the genealogy is a draw from the m-coalescent. Thus, taking n → ∞, there is a distribution over genealogies of a countably infinite population for which the marginal distribution of the genealogy of any n individuals gives the n-coalescent. Kingman called this the coalescent.

Teh et al. (2007) recently described efficient inference algorithms for Kingman's coalescent. They applied the coalescent to the problem of recovering linguistic phylogenies. The application was largely successful—at least in comparison to alternative algorithms that use the same data. Unfortunately, even in the results they present, one can see significant areal effects. For instance, in their Figure 3a, Romanian is very near Albanian and Bulgarian. This is likely an areal effect: specifically, an effect due to the Balkan language area. We will revisit this issue in our own experiments.

3 A Bayesian Model for Areal Linguistics

We will consider a data set consisting of N languages and F typological features. We denote the value of feature f in language n as X_{n,f}. For simplicity of exposition, we will assume two things: (1) there is no unobserved data and (2) all features are binary. In practice, for the data we use (described in Section 4), neither of these is true. However, both extensions are straightforward.

When we construct our model, we attempt to be as neutral as possible to the "areal linguistics" questions defined in Section 2.1. We allow areas with only two languages (though for brevity we do not present them in the results). We allow areas with only one family (though, again, do not present them). We are generous with our notion of locality, allowing a radius of 1000 kilometers (though see Section 5.4 for an analysis of the effect of radius).¹ And we allow, but do not enforce, trait weights. All of this is accomplished through the construction of the model and the choice of the model hyperparameters.

At a high level, our model works as follows. Values X_{n,f} appear for one of two reasons: they are either areally derived or genetically derived. A latent variable Z_{n,f} determines this. If it is derived areally, then the value X_{n,f} is drawn from a latent variable corresponding to the value preferences in the language area to which language n belongs. If it is derived genetically, then X_{n,f} is drawn from a variable corresponding to value preferences for the genetic substrate to which language n belongs. The set of areas, and the area to which a language belongs, are given by yet more latent variables. It is this aspect of the model for which we use the Pitman-Yor process: languages are customers, areas are tables and area value preferences are the parameters of the tables.

3.1 The formal model

We assume that the value a feature takes for a particular language (i.e., the value of X_{n,f}) can be explained either genetically or areally.² We denote this by a binary indicator variable Z_{n,f}, where a value 1 means "areal" and a value 0 means "genetic." We assume that each Z_{n,f} is drawn from a feature-specific binomial parameter π_f. By making the parameter feature-specific, we express the fact that some features may be more or less likely to be shared than others. In other words, a high value of π_f would mean that feature f is easily shared areally, while a low value would mean that feature f is hard to share. Each language n has a known latitude/longitude ℓ_n.

We further assume that there are K linguistic areas, where K is treated non-parametrically by means of the Pitman-Yor process. Note that in our context, a linguistic area may contain only one language, which would technically not be allowed according to the linguistic definition. When a language belongs to a singleton area, we interpret this to mean that it does not belong to any language area.

Each language area k (including the singleton areas) has a set of F associated parameters φ_{k,f}, where φ_{k,f} is the probability that feature f is "on" in area k. It also has a "central location" given by a longitude and latitude denoted c_k. We only allow languages to belong to areas that fall within a given radius R of them (distances computed according to geodesic distance). This accounts for the "geographical" constraints on language areas. We denote the area to which language n belongs as a_n.

We assume that each language belongs to a "family tree."

¹A reader might worry about exchangeability: our method of making language centers and locations part of the Pitman-Yor distribution ensures this is not an issue. An alternative would be to use a location-sensitive process such as the kernel stick-breaking process (Dunson and Park, 2007), though we do not explore that here.

²As mentioned in the introduction, (at least) one more option is possible: chance. We treat "chance" as noise and model it in the data generation process, not as an alternative "source."

X_{n,f} ~ Bin(θ_{p_n,f}) if Z_{n,f} = 0, Bin(φ_{a_n,f}) if Z_{n,f} = 1   (feature values are derived genetically or areally)
Z_{n,f} ~ Bin(π_f)   (feature source is a biased coin, parameterized per feature)
ℓ_n ~ Ball(c_{a_n}, R)   (language position is uniform within a ball around the area center, with radius R)
π_f ~ Bet(1, 1)   (bias for a feature being genetic/areal is uniform)
(p, θ) ~ Coalescent(π_0, m_0)   (language hierarchy and genetic traits are drawn from a coalescent)
⟨a, φ, c⟩ ~ PY(α_0, d_0, Bet(1, 1) × Uni)   (area features are drawn from a Beta and centers uniformly across the globe)

Figure 1: Full hierarchical Areal model; see Section 3.1 for a complete description.

We denote the parent of language n in the family tree by p_n. We associate with each node i in the family tree and each feature f a parameter θ_{i,f}. As in the areal case, θ_{i,f} is the probability that feature f is on for languages that descend from node i in the family tree. We model genetic trees by Kingman's coalescent with binomial mutation.

Finally, we put non-informative priors on all the hyperparameters. Written hierarchically, our model has the form shown in Figure 1. There, by (p, θ) ~ Coalescent(π_0, m_0), we mean that the tree and parameters are given by a coalescent.

3.2 Inference

Inference in our model is mostly by Gibbs sampling. Most of the distributions used are conjugate, so Gibbs sampling can be implemented efficiently. The only exceptions are: (1) the coalescent, for which we use the GreedyRate1 algorithm described by Teh et al. (2007); and (2) the area centers c, for which we use a Metropolis-Hastings step. Our proposal distribution is a Gaussian centered at the previous center, with a standard deviation of 5. Experimentally, this resulted in an acceptance rate of about 50%.

In our implementation, we analytically integrate out π and φ and sample only over Z, the coalescent tree, and the area assignments. In some of our experiments, we treat the family tree as given. In this case, we also analytically integrate out the θ parameters and sample only over Z and area assignments.

4 Typological Data

The database on which we perform our analysis is the World Atlas of Language Structures (henceforth, WALS) (Haspelmath et al., 2005). The database contains information about 2150 languages (sampled from across the world). There are 139 typological features in this database. The database is sparse: only 16% of the possible language/feature pairs are known. We use the version extracted and preprocessed by Daumé III and Campbell (2007).

In WALS, languages are grouped into 38 language families (including Indo-European, Afro-Asiatic, Austronesian, Niger-Congo, etc.). Each of these language families is grouped into a number of language geni. The Indo-European family includes ten geni, including: Germanic, Romance, Indic and Slavic. The Austronesian family includes seventeen geni, including: Borneo, Oceanic, Palauan and Sundic. Overall, there are 275 geni represented in WALS.

We further preprocess the data as follows. For the Indo-European subset (henceforth, "IE"), we remove all languages with ≤ 10 known features and then remove all features that appear in at most 1/4 of the languages. This leads to 73 languages and 87 features. For the whole-world subset, we remove languages with ≤ 25 known features and then features that appear in at most 1/10 of the languages. This leads to 349 languages and 129 features.

5 Experiments

5.1 Identifying Language Areas

Our first experiment is aimed at discovering language areas. We first focus on the IE family, and then extend the analysis to all languages. In both cases, we use a known family tree (for the IE experiment, we use a tree given by the language genus structure; for the whole-world experiment, we use a tree given by the language family structure). We run each experiment with five random restarts and 2000 iterations. We select the MAP configuration from the combination of these runs.

In the IE experiment, the model identified the areas shown in Figure 2. The best area identified by our model is the second one listed, which clearly correlates highly with the Balkans. There are two areas identified by our model (the first and last) that include only Indic and Iranian languages. While we are not aware of previous studies of these as linguistic areas, they are not implausible given the history of the region.
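The Metropolis-Hastings update for area centers described in Section 3.2 can be sketched as follows (a simplified stand-in, not the paper's code; `toy_log_posterior` is an invented score standing in for the true conditional posterior over a center):

```python
import math
import random

def mh_center_step(center, log_posterior, rng, proposal_std=5.0):
    """One Metropolis-Hastings update for a (longitude, latitude) area
    center: propose from a symmetric Gaussian random walk (standard
    deviation 5, as in Section 3.2) and accept with the usual ratio."""
    proposal = tuple(x + rng.gauss(0.0, proposal_std) for x in center)
    log_accept = log_posterior(proposal) - log_posterior(center)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal, True   # accepted
    return center, False        # rejected

# Made-up posterior for illustration: languages clustered near (10, 50).
def toy_log_posterior(c):
    return -((c[0] - 10.0) ** 2 + (c[1] - 50.0) ** 2) / (2.0 * 5.0 ** 2)

rng = random.Random(1)
center = (0.0, 0.0)
for _ in range(200):
    center, accepted = mh_center_step(center, toy_log_posterior, rng)
```

Because the Gaussian random walk is symmetric, the proposal densities cancel and only the posterior ratio enters the acceptance probability.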

(Indic) Bhojpuri, Darai, Gujarati, Hindi, Kalami, Kashmiri, Kumauni, Nepali, Panjabi, Shekhawati, Sindhi (Iranian) Ormuri, Pashto

(Albanian) Albanian (Greek) Greek (Modern) (Indic) Romani (Kalderash) (Romance) Romanian, Romansch (Scharans), Romansch (Sursilvan), Sardinian (Slavic) Bulgarian, Macedonian, Serbian-Croatian, Slovak, Slovene, Sorbian

(Baltic) Latvian, Lithuanian (Germanic) Danish, Swedish (Slavic) Polish, Russian

(Celtic) Irish (Germanic) English, German, Norwegian (Romance) French

(Indic) Prasuni, Urdu (Iranian) Persian, Tajik

Plus 46 non-areal languages.

Figure 2: IE areas identified. Areas that consist of just one genus are not listed, nor are areas with two languages.

Model        Rand    F-Sc    Edit    NVI
K-means      0.9149  0.0735  0.1856  0.5889
Pitman-Yor   0.9637  0.1871  0.6364  0.7998
Areal model  0.9825  0.2637  0.8295  0.9090

Table 1: Area identification scores for two baseline algorithms (K-means and Pitman-Yor clustering) that do not use hierarchical structure, and for the Areal model we have presented. Higher is better and all differences are statistically significant at the 95% level.

(Mayan) Huastec, Jakaltek, Mam, Tzutujil (Mixe-Zoque) Zoque (Copainalá) (Oto-Manguean) Mixtec (Chalcatongo), Otomí (Mezquital) (Uto-Aztecan) Nahuatl (Tetelcingo), Pipil

(Baltic) Latvian, Lithuanian (Finnic) Estonian, Finnish (Slavic) Polish, Russian, Ukrainian

(Austro-Asiatic) Khasi (Dravidian) Telugu (IE) Bengali (Sino-Tibetan) Bawm, Garo, Newari (Kathmandu)

Figure 3: A small subset of the world areas identified.

The fourth area identified by our model corresponds roughly to the debated "English" area. Our area includes the requisite French/English/German/Norwegian group, as well as the somewhat surprising Irish. However, in addition to being intuitively plausible, it is not hard to find evidence in the literature for the contact relationship between English and Irish (Sommerfelt, 1960).

In the whole-world experiment, the model identified too many linguistic areas to fit (39 in total that contained at least two languages and at least two language families). In Figure 3, we depict the areas found by our model that best correspond to the areas described in Section 2.1.1. We acknowledge that this gives a warped sense of the quality of our model. Nevertheless, our model is able to identify large parts of the Meso-American area, the Baltic area and the South Asian area. (It also finds the Balkans, but since these languages are all IE, we do not consider it a linguistic area in this evaluation.) While our model does find areas that match the Meso-American and North-west American areas, neither is represented in its entirety (according to the definition of these areas given in Section 2.1.1).

Despite the difficulty humans have in assigning linguistic areas, in Table 1 we explicitly compare the quality of the areal clusters found on the IE subset. We compare against the most inclusive areal lists from Section 2.1.1 for IE: the Balkans and the Baltic. When there is overlap (e.g., Romani appears in both lists), we assigned it to the Balkans.

We compare our model with a flat Pitman-Yor model that does not use the hierarchy. We also compare to a baseline K-means algorithm. For K-means, we ran with K ∈ {5, 10, 15, ..., 80, 85} and chose the value of K for each metric that did best (giving an unfair advantage). Clustering performance is measured on the Indo-European task according to the Rand Index, F-score, Normalized Edit Score (Pantel, 2003) and Normalized Variation of Information (Meila, 2003). In these results, we see that the Pitman-Yor process model dominates the K-means model and the Areal model dominates the Pitman-Yor model.

5.2 Identifying Areal Features

Our second experiment is an analysis of the features that tend to be shared areally (as opposed to genetically). For this experiment, we make use of the whole-world version of the data, again with known language family structure. We initialize a Gibbs sampler from the MAP configuration found in Section 5.1. We run the sampler for 1000 iterations and take samples every ten steps.

From one particular sample, we can estimate a posterior distribution over each π_f. Due to conjugacy, we obtain a posterior distribution of π_f ~ Bet(1 + Σ_n Z_{n,f}, 1 + Σ_n [1 − Z_{n,f}]). The 1s come from the prior. From this Beta distribution, we can ask the question: what is the probability that a value of π_f drawn from this distribution will have value < 0.5?
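The posterior computation just described can be sketched in a few lines (an illustration only; the Z counts below are invented):

```python
import random

def prob_feature_is_genetic(z_values, num_draws=100_000, seed=0):
    """Monte Carlo estimate of P(pi_f < 0.5) under the posterior
    Bet(1 + sum_n Z_{n,f}, 1 + sum_n [1 - Z_{n,f}]); a high value
    marks feature f as "genetic", a low value as "areal"."""
    rng = random.Random(seed)
    a = 1 + sum(z_values)                  # areal count plus prior
    b = 1 + sum(1 - z for z in z_values)   # genetic count plus prior
    draws = (rng.betavariate(a, b) for _ in range(num_draws))
    return sum(d < 0.5 for d in draws) / num_draws

# Invented example: a feature areally transmitted in 3 of 20 languages.
z = [1] * 3 + [0] * 17
p_genetic = prob_feature_is_genetic(z)  # close to 1: a "genetic" feature
```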

If this value is high, then the feature is likely to be a "genetic feature"; if it is low, then the feature is likely to be an "areal feature." We average these probabilities across all 100 samples.

The features that are most likely to be areal according to our model are summarized in Table 2. In this table, we list the categories to which each feature belongs, together with the number of features in that category, and the average probability that a feature in that category is genetically transmitted. Apparently, the vast majority of features are not areal.

p(gen)  #f  Feature Category
.00      1  Tea
.73     19  Phonology
.73      9  Lexicon
.74      4  Nominal Categories / Numerals
.79      5  Simple Clauses / Predication
.80      5  Verbal Categories / Tense and Aspect
.87      8  Nominal Syntax
.87      8  Simple Clauses / Simple Clauses
.91     12  Nominal Categories / Articles and Pronouns
.94     17  Word Order
.99     10  Morphology
.99      6  Simple Clauses / Valence and Voice
.99      7  Complex Sentences
.99      7  Nominal Categories / Gender and Number
.99      5  Simple Clauses / Negation and Questions
1.0      1  Other / Clicks
1.0      2  Verbal Categories / Suppletion
1.0      9  Verbal Categories / Modality
1.0      4  Nominal Categories / Case

Table 2: Average probability of genetic transmission for each feature category and the number of features in that category.

We can treat the results presented in Table 2 as a hierarchy of borrowability. In doing so, we see that our hierarchy agrees to a large degree with the hierarchies summarized in Section 2.1.2. Indeed (aside from "Tea", which we will ignore), the two most easily shared categories according to our model are phonology and the lexicon; this is in total agreement with the agreed state of affairs in linguistics.

Lower in our list, we see that noun-related categories tend to precede their verb-related counterparts (nominal categories before verbal categories, nominal syntax before complex sentences). According to Curnow (2001), the most difficult features to borrow are phonetics (for which we have no data), bound grammatical forms (which appear low on our list), morphology (which is 99% genetic, according to our model) and syntactic frames (which would roughly correspond to "complex sentences", another item which is 99% genetic in our model).

5.3 Genetic Reconstruction

In this section, we investigate whether the use of areal knowledge can improve the automatic reconstruction of language family trees. We use Kingman's coalescent (see Section 2.2.2) as a probabilistic model of trees, endowed with a binomial mutation process on the language features.

Our baseline model is to run the vanilla coalescent on the WALS data, effectively reproducing the results presented by Teh et al. (2007). This method was already shown to outperform competing hierarchical clustering algorithms such as average-link agglomerative clustering (see, e.g., Duda and Hart (1973)) and the Bayesian Hierarchical Clustering algorithm (Heller and Ghahramani, 2005).

We run the same experiment both on the IE subset of data and on the whole-world subset. We evaluate the results qualitatively, by observing the trees found (on the IE subset), and quantitatively (below). For the qualitative analysis, we show the subset of IE that does not contain Indic or Iranian languages (just to keep the figures small). The tree derived from the original data is on the left in Figure 4; the tree based on areal information is on the right in Figure 4. As we can see, the use of areal information qualitatively improves the structure of the tree. Where the original tree had a number of errors with respect to the Romance and other languages, these are sorted out in the areally-aware tree. Moreover, Greek now appears in a more appropriate part of the tree and English appears on a branch that is further out from the Norse languages.

Model        Accuracy         Log Prob
Indo-European
Baseline     0.635 (±0.007)   −0.583 (±0.008)
Areal model  0.689 (±0.010)   −0.526 (±0.027)
World
Baseline     0.628 (±0.001)   −0.654 (±0.003)
Areal model  0.635 (±0.002)   −0.565 (±0.011)

Table 3: Prediction accuracies and log probabilities for IE (top) and the world (bottom).

We perform two varieties of quantitative analysis. In the first, we attempt to predict unknown feature values. In particular, we hide an additional 10% of the feature values in the WALS data and fit a model to the remaining 90%.
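The hide-and-predict evaluation can be sketched as follows (a simplified stand-in for the actual experimental code; `predict_prob` is a hypothetical placeholder for a model's predictive distribution):

```python
import math
import random

def heldout_eval(data, predict_prob, hide_frac=0.10, seed=0):
    """Hide a random fraction of known (language, feature) cells, then
    score a model's predictions on them by accuracy and mean log
    probability, as in the quantitative comparison of Section 5.3.

    `data` maps (language, feature) -> 0/1; `predict_prob` maps a cell
    to the predicted probability that its value is 1."""
    rng = random.Random(seed)
    cells = sorted(data)
    hidden = rng.sample(cells, max(1, int(hide_frac * len(cells))))
    correct, log_prob = 0, 0.0
    for cell in hidden:
        p = predict_prob(cell)
        truth = data[cell]
        correct += (p >= 0.5) == (truth == 1)
        log_prob += math.log(p if truth == 1 else 1.0 - p)
    return correct / len(hidden), log_prob / len(hidden)

# Toy check with a near-uniform predictor on an invented data grid:
toy = {(n, f): (n + f) % 2 for n in range(10) for f in range(10)}
acc, lp = heldout_eval(toy, lambda cell: 0.5 + 1e-9)
```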

to the remaining 90%. We then use that model to predict the hidden 10%. For both settings, we compute both the absolute accuracy of the predictions as well as the log probability of the hidden data under the model (the latter is less noisy). We compute these both for those features identified as genetic and for those features identified as areal. We repeat this experiment 10 times, each with a different random subset hidden. The results of this experiment are shown in Table 3, below. The differences are not large, but are outside one standard deviation.

For the second quantitative analysis, we use the reconstructed trees. The baseline model is to make predictions according to the genetic family tree; the areal model is to make predictions according to the family tree augmented with the linguistic areas. We present purity scores (Heller and Ghahramani, 2005), subtree scores (the number of interior nodes with pure leaf labels, normalized) and leave-one-out log accuracies (all scores are between 0 and 1, and higher scores are better). These scores are computed against both the language genus and the language family as "classes." The results are shown in Table 4, below. As can be seen, the results are generally in favor of the Areal model, depending on the evaluation metric (the LOO Acc scores on the IE-versus-genus evaluation notwithstanding).

[Figure 4: Genetic trees of IE languages. (Left) no areal knowledge; (Right) with areal model.]

[Table 4: Purity, subtree and LOO Acc scores for the Baseline and Areal models: for IE against genus (top), and for world against genus (mid) and against family (low).]
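The feature-prediction evaluation (hiding a fraction of known WALS feature values, refitting, and scoring the reconstruction of the hidden cells) can be sketched as follows. This is a minimal sketch, not the paper's actual implementation: the data layout and the `fit_model`/`predict_proba` interface are hypothetical stand-ins.

```python
import math
import random

def heldout_eval(features, fit_model, mask_frac=0.10, seed=0):
    """Hide a random fraction of known (language, feature) cells, refit,
    and score the model's reconstruction of the hidden cells.

    `features` maps (language, feature) -> observed value.
    `fit_model` takes the visible cells and returns an object with a
    hypothetical predict_proba(lang, feat) -> {value: probability} method.
    """
    rng = random.Random(seed)
    cells = sorted(features)
    k = max(1, int(round(mask_frac * len(cells))))
    hidden = set(rng.sample(cells, k))
    visible = {c: v for c, v in features.items() if c not in hidden}

    model = fit_model(visible)

    correct, logprob = 0, 0.0
    for cell in hidden:
        probs = model.predict_proba(*cell)
        guess = max(probs, key=probs.get)
        correct += (guess == features[cell])
        logprob += math.log(probs.get(features[cell], 1e-12))

    # Report absolute accuracy and per-cell log probability; the text
    # notes the latter is the less noisy measure. Repeating over several
    # seeds gives the standard deviations reported in Table 3.
    return correct / len(hidden), logprob / len(hidden)
```

Averaging the two returned scores over repeated runs with different seeds corresponds to the 10-repetition protocol described above.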

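The clustering scores used in the tree evaluation can likewise be sketched. In this sketch, `purity` is ordinary majority-vote purity of a flat clustering (Heller and Ghahramani (2005) define a tree-aware dendrogram purity), `subtree_score` is the normalized count of interior nodes whose leaves all share one class label, and trees are assumed to be nested Python tuples with language-name leaves; these representations are illustrative assumptions, not the paper's code.

```python
from collections import Counter

def leaves(tree):
    """Leaf languages of a tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def purity(clusters, label):
    """Majority-vote purity of a flat clustering: each cluster votes for
    its most common class; score is the fraction of items so covered."""
    total = sum(len(c) for c in clusters)
    hit = sum(Counter(label[x] for x in c).most_common(1)[0][1]
              for c in clusters)
    return hit / total

def subtree_score(tree, label):
    """Fraction of interior nodes whose leaves all share one class label
    (the normalized 'subtree' score described in the text)."""
    interior, pure = 0, 0
    def walk(node):
        nonlocal interior, pure
        if isinstance(node, str):
            return
        interior += 1
        pure += (len({label[l] for l in leaves(node)}) == 1)
        for child in node:
            walk(child)
    walk(tree)
    return pure / interior
```

For example, on a toy IE tree `(("English", "Dutch"), ("French", "Spanish"))` with genus labels, both leaf pairs are pure but the root is not, giving a subtree score of 2/3.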

5.4 Effect of Radius

Finally, we evaluate the effect of the radius hyperparameter on performance. Table 5 shows performance for models built with varying radii. As can be seen by the purity and subtree scores, there seems to be a "sweet spot" around 500 to 1000 kilometers where the model is optimal. LOO continues (strangely) to improve as we allow areas to grow arbitrarily large; this is perhaps overfitting. Nevertheless, performance is robust for a range of radii.

[Table 5: Scores for IE versus genus at varying radii (125 to 4000 kilometers).]

6 Discussion

We presented a model that is able to recover well-known linguistic areas. Using the recovered areas, we have shown an improvement in the ability to recover genetic phylogenetic trees of languages. It is important to note that, despite our successes, there is much our model does not account for: borrowing is known to be asymmetric; contact is temporal; borrowing must obey universal implications. Despite the failure of our model to account for these issues, it appears largely successful. Moreover, like any "data mining" expedition, our model suggests new linguistic areas (particularly in the "whole world" experiments) that deserve consideration.

Acknowledgments

Deep thanks to Lyle Campbell, Eric Xing and Yee Whye Teh for discussions; comments from the three anonymous reviewers were very helpful. This work was partially supported by NSF grant IIS-0712764.
References

Alexandra Aikhenvald and R.M.W. Dixon, editors. 2001. Areal diffusion and genetic inheritance: problems in comparative linguistics. Oxford University Press.

Lyle Campbell. 2005. Areal linguistics. In Keith Brown, editor, Encyclopedia of Language and Linguistics. Elsevier, 2nd edition.

Lyle Campbell. 2006. Areal linguistics: the problem to the answer. In April McMahon, Nigel Vincent, and Yaron Matras, editors, Language contact and areal linguistics.

Lyle Campbell. In press. Why Sir William Jones got it all wrong, or Jones' role in how to establish language families. In Joseba Lakarra, editor, Festschrift/Memorial volume for Larry Trask.

Timothy Curnow. 2001. What language features can be "borrowed"? In Aikhenvald and Dixon, editors, Areal diffusion and genetic inheritance: problems in comparative linguistics, pages 412–436. Oxford University Press.

Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

R.M.W. Dixon. 2001. The Australian linguistic area. In Aikhenvald and Dixon, editors, Areal diffusion and genetic inheritance: problems in comparative linguistics, pages 64–104. Oxford University Press.

R. O. Duda and P. E. Hart. 1973. Pattern Classification and Scene Analysis. Wiley and Sons, New York.

David Dunson and Ju-Hyun Park. 2007. Kernel stick breaking processes. Biometrika, 95:307–323.

Isidore Dyen, Joseph Kruskal, and Paul Black. 1992. An Indoeuropean classification: a lexicostatistical experiment. Transactions of the American Philosophical Society, 82(5).

Thomas S. Ferguson. 1974. Prior distributions on spaces of probability measures. The Annals of Statistics, 2(4):615–629, July.

Zoubin Ghahramani. 2005. Nonparametric Bayesian methods. Tutorial presented at the UAI conference.

Joseph Greenberg, editor. 1963. Universals of Languages. MIT Press.

Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie, editors. 2005. The World Atlas of Language Structures. Oxford University Press.

E. Haugen. 1950. The analysis of linguistic borrowing. Language, 26:210–231.

Katherine Heller and Zoubin Ghahramani. 2005. Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning (ICML), volume 22.

Michael I. Jordan. 2005. Dirichlet processes, Chinese restaurant processes and all that. Tutorial presented at the NIPS conference.

Harmut Katz. 1975. Generative Phonologie und phonologische Sprachbünde des Ostjakischen und Samojedischen. Wilhelm Fink.

J. F. C. Kingman. 1982a. The coalescent. Stochastic Processes and their Applications, 13:235–248.

J. F. C. Kingman. 1982b. On the genealogy of large populations. Journal of Applied Probability, 19:27–43. Essays in Statistical Science.

James A. Matisoff. 2001. Genetic versus contact relationship: prosodic diffusibility in South-East Asian languages. In Aikhenvald and Dixon, editors, Areal diffusion and genetic inheritance: problems in comparative linguistics, pages 291–327. Oxford University Press.

Marina Meila. 2003. Comparing clusterings. In Proceedings of the Conference on Computational Learning Theory (COLT).

E. Moravcsik. 1978. Language contact. In J.H. Greenberg, C. Ferguson, and E. Moravcsik, editors, Universals of Human Language, volume 1: Method and Theory, pages 3–123. Stanford University Press.

Patrick Pantel. 2003. Clustering by Committee. Ph.D. thesis, University of Alberta.

J. Pitman and M. Yor. 1997. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900.

Jim Pitman. 2002. Combinatorial stochastic processes. Technical Report 621, University of California at Berkeley. Lecture notes for St. Flour Summer School.

M.D. Ross. 1988. Proto Oceanic and the Austronesian languages of western Melanesia. Pacific Linguistics, Australian National University, Canberra.

Alf Sommerfelt. 1960. External versus internal factors in the development of language. Norsk Tidsskrift for Sprogvidenskap, 19:296–315.

Yee Whye Teh, Hal Daumé III, and Daniel Roy. 2007. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems (NIPS).

Sarah Thomason. 2001. Language contact: an introduction. Edinburgh University Press.

T. Warnow, S.N. Evans, D. Ringe, and L. Nakhleh. 2005. A stochastic model of language evolution that incorporates homoplasy and borrowing. In Phylogenetic Methods and the Prehistory of Language. Cambridge University Press. Invited paper.
