Comparative Linguistics via Language Modeling Scott Drellishak Department of Linguistics University of Washington Box 35430 Seattle, WA 98195-4340
[email protected] language modeling techniques, a metric of textual Abstract (and, it is hoped, genetic) similarity between pairs of single-language texts. Pairs of texts that are Natural languages are related to each oth- more similar, it is hypothesized, ought to be more er genetically. Traditionally, such genetic closely related genetically than pairs that are less relationships have been demonstrated us- similar. The metric of similarity described in §3 is ing the comparative method. I describe a based on an n-gram model of the distribution of software system that implements an alter- individual letters in each text. native method for detecting such relation- ships using simple statistical language 2 Data modeling. In order to compare texts using statistical model- ing, it is necessary that they be in the same writing 1 Introduction system, or nearly so. Consider the case of a pair of texts, one of which is written in the Latin alphabet, The historical processes of change that affect natu- the other in Japanese kanji. Statistically modeling ral languages result in genetic relationships be- the Latin text based on kanji character frequencies, tween those languages. Speakers of a single or vice-versa, would produce a probability of zero language may become divided into multiple iso- (modulo smoothing) because the character sets do lated speech communities, each of which subse- not overlap at all. In order to avoid this problem, quently develops a different new language. Well- all texts were selected from languages written in known examples include the Indo-European lan- some variant of the Latin alphabet.