Non-Parametric Bayesian Areal Linguistics
Hal Daumé III
School of Computing
University of Utah
Salt Lake City, UT 84112
[email protected]

Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 593–601, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

Abstract

We describe a statistical model over linguistic areas and phylogeny. Our model recovers known areas and identifies a plausible hierarchy of areal features. The use of areas improves genetic reconstruction of languages both qualitatively and quantitatively according to a variety of metrics. We model linguistic areas by a Pitman-Yor process and linguistic phylogeny by Kingman’s coalescent.

1 Introduction

Why are some languages more alike than others? This question is one of the most central issues in historical linguistics. Typically, one of three answers is given (Aikhenvald and Dixon, 2001; Campbell, 2006). First, the languages may be related “genetically.” That is, they may have all derived from a common ancestor language. Second, the similarities may be due to chance. Some language properties are simply more common than others, which is often attributed mostly to linguistic universals (Greenberg, 1963). Third, the languages may be related areally. Languages that occupy the same geographic area often exhibit similar characteristics, not due to genetic relatedness, but due to sharing. Regions (and the languages contained within them) that exhibit sharing are called linguistic areas and the features that are shared are called areal features.

Much is not understood or agreed upon in the field of areal linguistics. Different linguists favor different definitions of what it means to be a linguistic area (are two languages sufficient to describe an area or do you need three (Thomason, 2001; Katz, 1975)?), what areal features are (is there a linear ordering of “borrowability” (Katz, 1975; Curnow, 2001) or is that too prescriptive?), and what causes sharing to take place (does social status or number of speakers play a role (Thomason, 2001)?).

In this paper, we attempt to provide a statistical answer to some of these questions. In particular, we develop a Bayesian model of typology that allows for, but does not force, the existence of linguistic areas. Our model also allows for, but does not force, preference for some feature to be shared areally. When applied to a large typological database of linguistic features (Haspelmath et al., 2005), we find that it discovers linguistic areas that are well documented in the literature (see Campbell (2005) for an overview), and a small preference for certain features to be shared areally. This latter agrees, to a lesser degree, with some of the published hierarchies of borrowability (Curnow, 2001). Finally, we show that reconstructing language family trees is significantly aided by knowledge of areal features.

We note that Warnow et al. (2005) have independently proposed a model for phonological change in Indo-European (based on the Dyen dataset (Dyen et al., 1992)) that includes notions of borrowing. Our model is different in that we (a) base our model on typological features rather than just lexical patterns and (b) explicitly represent language areas, not just one-time borrowing phenomena.

2 Background

We describe (in Section 3) a non-parametric, hierarchical Bayesian model for finding linguistic areas and areal features. In this section, we provide the necessary background, both linguistic and statistical, for understanding our model.

2.1 Areal Linguistics

Areal effects on linguistic typology have been studied since, at least, the late 1920s by Trubetzkoy, though the idea of tracing family trees for languages goes back to the mid 1800s and the comparative study of historical linguistics dates back perhaps to Giraldus Cambrensis in 1194 (Campbell, In press). A recent article provides both a short introduction to the issues that surround areal linguistics and an enumeration of many of the known language areas (Campbell, 2005). A fairly wide, modern treatment of the issues surrounding areal diffusion is also given by essays in a recent book edited by Aikhenvald and Dixon (2001). The essays in this book provide a good introduction to the issues in the field. Campbell (2006) provides a critical survey of these and other hypotheses relating to areal linguistics.

There are several issues which are basic to the study of areal linguistics (these are copied almost directly from Campbell (2006)). Must a linguistic area comprise more than two languages? Must it comprise more than one language family? Is a single trait sufficient to define an area? How “nearby” must languages in an area be to one another? Are some features more easily borrowed than others?

Despite these formal definitional issues of what constitutes a language area and areal features, most historical linguists seem to believe that areal effects play some role in the change of languages.

2.1.1 Established Linguistic Areas

Below, we list some of the well-known linguistic areas; Campbell (2005) provides a more complete listing together with example areal features for these areas. For each area, we list associated languages:

The Balkans: Albanian, Bulgarian, Greek, Macedonian, Rumanian and Serbo-Croatian. (Sometimes: Romani and Turkish.)

South Asian: Languages belonging to the Dravidian, Indo-Aryan, Munda, Tibeto-Burman families.

Meso-America: Cuitlatec, Huave, Mayan, Mixe-Zoquean, Nahua, Otomanguean, Tarascan, Tequistlatecan, Totonacan and Xincan.

North-west America: Alsea, Chimakuan, Coosan, Eyak, Haida, Kalapuyan, Lower Chinook, Salishan, Takelman, Tlingit, Tsimshian and Wakashan.

The Baltic: Baltic languages, Baltic German, and Finnic languages (especially Estonian and Livonian). (Sometimes many more are included, such as: Belorussian, Latvian, Lithuanian, Norwegian, Old Prussian, Polish, Romani, Russian, Ukrainian.)

Ethiopia: Afar, Amharic, Anyuak, Awngi, Beja, Ge’ez, Gumuz, Janjero, Kefa, Sidamo, Somali, Tigre, Tigrinya and Wellamo.

Needless to say, the exact definition and extent of the actual areas is subject to significant debate. Moreover, claims have been made in favor of many linguistic areas not defined above. For instance, Dixon (2001) presents arguments for several Australian linguistic areas and Matisoff (2001) defines a South-East Asian language area. Finally, although “folk lore” is in favor of identifying a linguistic area including English, French and certain Norse languages (Norwegian, Swedish, Low Dutch, High German, etc.), there are counter-arguments to this position (Thomason, 2001) (see especially Case Study 9.8).

2.1.2 Linguistic Features

Identifying which linguistic features are most easily shared “areally” is a long-standing problem in contact linguistics. Here we briefly review some of the major claims. Much of this overview is adapted from the summary given by Curnow (2001).

Haugen (1950) considers only borrowability as far as the lexicon is concerned. He provided evidence that nouns are the easiest, followed by verbs, adjectives, adverbs, prepositions, etc. Ross (1988) corroborates Haugen’s analysis and deepens it to cover morphology, syntax and phonology. He proposes the following hierarchy of borrowability (easiest items coming first): nouns > verbs > adjectives > syntax > non-bound function words > bound morphemes > phonemes. Coming from a “constraints” perspective, Moravcsik (1978) suggests that: lexical items must be borrowed before lexical properties; inflected words before bound morphemes; verbal items can never be borrowed; etc.

Curnow (2001) argues, regarding the search for a reasonable hierarchy of borrowability, that “we may never be able to develop such constraints.” Nevertheless, he divides the space of borrowable features into 15 categories and discusses the evidence supporting each of these categories, including: phonetics (rare), phonology (common), lexical (very common), interjections and discourse markers (common), free grammatical forms (occasional), bound grammatical forms (rare), position of morphology (rare), syntactic frames (rare), clause-internal syntax (common), between-clause syntax (occasional).

2.2 Non-parametric Bayesian Models

We treat the problem of understanding areal linguistics as a statistical question, based on a database of typological information. Due to the issues raised in the previous section, we do not want to commit to the existence of a particular number of linguistic areas, or particular sizes thereof. (Indeed, we do not even want to commit to the existence of any linguistic areas.) However, we will need to “unify” the languages that fall into a linguistic area (if such a thing exists) by means of some statistical parameter. Such problems have been studied under the name non-parametric models. The idea behind non-parametric models is that one does not commit a priori to a particular number of parameters. Instead, we allow the data to dictate how many parameters there are.

[...] the Pitman-Yor process, the Nth customer sits at a new table with probability proportional to $\alpha + Kd$ and sits at a previously occupied table $k$ with probability proportional to $\#_k - d$, where $\#_k$ is the number of customers already seated at table $k$. Finally, with each table $k$ we associate a parameter $\theta_k$, with each $\theta_k$ drawn independently from $G_0$. An important property of the Pitman-Yor process is that draws from it are exchangeable: perhaps counterintuitively, the distribution does not care about customer order.

The Pitman-Yor process induces a power-law distribution on the number of singleton tables (i.e., the number of tables that have only one customer). This can be seen by noticing two things. In general, the number of singleton tables grows as $\mathcal{O}(\alpha N^d)$. When $d = 0$, we obtain a Dirichlet process with the number of singleton tables growing as $\mathcal{O}(\alpha \log N)$.

2.2.2 Kingman’s Coalescent

Kingman’s coalescent is a standard model in population genetics describing the common genealogy
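The Pitman-Yor seating rule quoted above (a new table with weight α + Kd, an occupied table k with weight #_k − d) can be simulated in a few lines. The following is a minimal sketch, not the paper's implementation. It assumes that K is the number of tables occupied so far (the excerpt does not define it), that α > 0 and 0 ≤ d < 1, and it leaves out the per-table parameters θ_k drawn from G_0. The function names `pitman_yor_seating` and `count_singletons` are hypothetical.

```python
import random


def pitman_yor_seating(n_customers, alpha, d, seed=0):
    """Seat customers one at a time; return the list of table sizes.

    Assumes alpha > 0 and 0 <= d < 1 so that every weight is positive.
    """
    rng = random.Random(seed)
    tables = []  # tables[k] is #_k, the number of customers at table k
    for _ in range(n_customers):
        k_occupied = len(tables)  # K, assumed to be the occupied-table count
        # Unnormalised weights: (#_k - d) per occupied table,
        # then (alpha + K * d) for opening a new table.
        weights = [count - d for count in tables]
        weights.append(alpha + k_occupied * d)
        choice = rng.choices(range(k_occupied + 1), weights=weights)[0]
        if choice == k_occupied:
            tables.append(1)      # customer opens a new table
        else:
            tables[choice] += 1   # customer joins existing table
    return tables


def count_singletons(tables):
    """Number of tables that hold exactly one customer."""
    return sum(1 for count in tables if count == 1)
```

Running `count_singletons(pitman_yor_seating(n, alpha, d))` for increasing `n`, once with `d = 0` and once with `d` closer to 1, is one informal way to compare the two growth regimes mentioned in the text. The exact counts depend on the random seed.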