An automated framework for fast cognate detection and Bayesian phylogenetic inference in computational historical linguistics

Taraka Rama
Department of Linguistics
University of North Texas
[email protected]

Johann-Mattis List
Dep. of Ling. and Cult. Evolution (DLCE)
MPI-SHH (Jena)
[email protected]

Abstract

We present a fully automated workflow for phylogenetic reconstruction on large datasets, consisting of two novel methods, one for fast detection of cognates and one for fast Bayesian phylogenetic inference. Our results show that the methods take less than a few minutes to process language families that have so far required large amounts of time and computational power. Moreover, the cognates and the trees inferred from the method are quite close, both to gold standard cognate judgments and to expert language family trees. Given its speed and ease of application, our framework is specifically useful for the exploration of very large datasets in historical linguistics.

1 Introduction

Computational historical linguistics is a relatively young discipline which aims to provide automated solutions for those problems which have traditionally been dealt with in an exclusively manual fashion in historical linguistics. Computational historical linguists thus try to develop automated approaches to detect historically related words (called "cognates"; Jäger et al. 2017; List et al. 2017; Rama et al. 2017; Rama 2018a), to infer language phylogenies ("language trees"; Rama et al. 2018; Greenhill and Gray 2009), to estimate the time depths of language families (Rama, 2018b; Chang et al., 2015; Gray and Atkinson, 2003), to determine the homelands of their speakers (Bouckaert et al., 2012; Wichmann et al., 2010), to determine diachronic word stability (Pagel and Meade, 2006; Rama and Wichmann, 2018), or to estimate evolutionary rates for linguistic features (Greenhill et al., 2010).

Despite the general goal of automating traditional workflows, the majority of studies concerned with phylogenetic reconstruction (including studies on dating and homeland inference) still make use of expert judgments to determine cognate words in linguistic datasets, because detecting cognates is usually regarded as hard to automate. The problem of manual annotation is that the process is very time-consuming and may lack objectivity, as inter-annotator agreement is rarely tested when creating new datasets. The last twenty years have seen a surge of work in the development of methods for automatic cognate identification. Current methods reach high accuracy scores compared to human experts (List et al., 2017), and even fully automated workflows in which phylogenies are built from automatically inferred cognates do not differ much from phylogenies derived from experts' cognate judgments (Rama et al., 2018).

Despite the growing amount of research devoted to automated word comparison and fully automated phylogenetic reconstruction workflows, scholars have so far ignored the computational effort required to apply the methods to large amounts of data. While the speed of the current workflows can be ignored for small datasets, it becomes a challenge with increasing amounts of data, and some of the currently available methods for automatic cognate detection can only be applied to datasets with maximally 100 languages. Although methods for phylogenetic inference can handle far more languages, they require enormous computational effort, even for small language families of less than 20 varieties (Kolipakam et al., 2018), which makes it impossible for scholars to perform exploratory studies in Bayesian frameworks.

In this paper, we propose an automated framework for fast cognate detection and fast Bayesian phylogenetic inference. Our cognate detection algorithm uses an alignment-free technique based on character skip-grams (Järvelin et al., 2007), which has the advantage of requiring neither hand-crafted nor statistically trained matrices of probable sound correspondences to be supplied.¹ Our fast approach to Bayesian inference uses a simulated annealing variant (Andrieu et al., 2003) of the original MCMC algorithm to compute a maximum-a-posteriori (MAP) tree in a very short amount of time.

Testing both our fast cognate detection and our fast phylogenetic reconstruction approach on publicly available datasets, we find that the results presented in the paper are comparable to those of the alternative, much more time-consuming algorithms currently in use. Our automatic cognate detection algorithm shows results comparable to those achieved by the SCA approach (List, 2014), one of the best currently available algorithms that work without inferring regular sound correspondences prior to computation (List et al., 2017). Our automatically inferred MAP trees come close to the expert phylogenies reported in Glottolog (Hammarström et al., 2017), and are at least as good as the phylogenies inferred with MrBayes (Ronquist et al., 2012), one of the most popular programs for phylogenetic inference. In combination, our new approaches offer a fully automated workflow for phylogenetic reconstruction in computational historical linguistics, which is so fast that it can easily be run on single-core machines, yielding results of considerable quality in less than 15 minutes for datasets of more than 50 languages.

In the following, we describe the fast cognate detection program in Section 2. We describe both the regular variant of the phylogenetic inference program and our simulated annealing variant in Section 3. We present the results of our automated cognate detection and phylogenetic inference experiments and discuss them in Section 4. We conclude the paper and present pointers to future work in Section 5.

¹ Although Rama (2015) uses skip-grams, the approach in that paper requires hand-annotated data, which we intend to overcome in this paper.

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6225–6235, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics

2 Fast Cognate Detection

Numerous methods for automatic cognate detection in historical linguistics have been proposed in the past (Jäger et al., 2017; List, 2014; Rama et al., 2017; Turchin et al., 2010; Arnaud et al., 2017). Most of them are based on the same general workflow, by which – in a first stage – all possible pairs of words within the same meaning slot of a wordlist are compared with each other in order to compute a matrix of pairwise distances or similarities. In a second stage, a flat cluster algorithm or a network partitioning algorithm is used to partition all words into cognate sets, taking the information in the matrix of word pairs as basis (List et al., 2018b). Differences between the algorithms can be found in the way in which the pairwise word comparisons are carried out, to which degree some kind of pre-processing of the data is involved, or which algorithm for flat clustering is being used.

Since any automated word comparison that starts from the comparison of word pairs needs to calculate similarities or distances for all (n² − n) / 2 possible word pairs in a given concept slot, the computation cost for all algorithms which employ this strategy grows quadratically with the number of words being compared. If methods additionally require pre-processing of the data, for example to search across all language pairs for language-specific similarities, such as regularly corresponding sounds (List et al., 2017; Jäger et al., 2017), the computation becomes impractical for datasets of more than 100 languages.

A linear-time solution was first proposed by Dolgopolsky (1964). Its core idea is to represent all sound sequences in a given dataset by their consonant classes. A consonant class is hereby understood as a rough partitioning of speech sounds into groups that are conveniently used by historical linguists when comparing languages (such as velars [k, g, x], dentals [t, d, θ], or liquids [r, l, ʁ], etc.). The major idea of this approach is to judge as cognate all words whose initial two consonant classes match. Given that the method requires only that all words be converted to their first consonant classes, this approach, which is now usually called the consonant-class matching approach (CCM, Turchin et al. 2010), is very fast, since its computation costs are linear with respect to the number of words being compared. The task of assigning a given word to a given cognate set is already fulfilled by assigning each word its string of consonant classes.

The drawback of the CCM approach is a certain lack of accuracy. While being quite conservative when applied to words showing the same meaning, the method likewise misses many valid matches and thus generally shows a low recall.
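To make the CCM idea concrete, a toy version of the consonant-class key can be written in a few lines of Python (our own sketch with invented names, not the implementation evaluated below; the sound-class strings TVTVR/TVKTVR for English daughter and German Tochter are discussed in the next section):

```python
def ccm_key(sound_classes, vowel="V"):
    """First two consonant classes of a sound-class string (CCM/Dolgopolsky)."""
    consonants = [c for c in sound_classes if c != vowel]
    return tuple(consonants[:2])

# English "daughter" (TVTVR) and German "Tochter" (TVKTVR) are cognate,
# but their CCM keys differ, so the pair is missed -- hence the low recall.
print(ccm_key("TVTVR"))   # ('T', 'T')
print(ccm_key("TVKTVR"))  # ('T', 'K')
```

Because each word is reduced to a single key, clustering amounts to grouping words by key, which is what makes the approach linear in the number of words.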

This is most likely due to the fact that the method does not contain any alignment component. Words are converted to sound-class strings and only complete matches are allowed, while good partial matches can often be observed in linguistic data, as can be seen from the comparison of English daughter, represented as TVTVR in sound classes, with German Tochter TVKTVR.

In order to develop an algorithm for automatic cognate detection which is both fast and shows a rather high degree of accuracy, we need to (1) learn from the strategy employed by the CCM method in avoiding any pairwise word comparison, while – at the same time – (2) avoiding the problems of the CCM method by allowing for a detailed sequence comparison based on some kind of alignment technique. Since the CCM method only compares the first two consonants per word, it cannot identify words like English daughter and German Tochter as cognate, although the overall similarity is obvious when comparing the whole strings.

A straightforward way to account for these two requirements is to use skip-grams of sound-class representations and to represent the words and sound-class skip-grams of a given dataset in form of a bipartite network, in which words are assigned to one type of node, and skip-grams to another one. In such a network, we can compute multiple representations of TVTVR and TVKTVR directly and later see in which of them the two sequences match. If, for example, we computed all n-grams of length 5, allowing to skip one segment, we would receive TVTVR for English (the only possible solution) and VKTVR, TKTVR, TVTVR, TVKVR, TVKTR, and TVKTV for German, with TVTVR matching the English word, and thus being connected to both words by an edge in our bipartite network (see Figure 1).

Similarly, when computing a modified variant of skip-grams based on n-grams of size 3, where only consonants are taken into account, and in which we allow to replace up to one segment systematically by a gap symbol ("-"), we can see from Table 1 that the structure of matching n-grams directly reflects the cognate relations, with Greek [çɛri] "hand" opposed to German Hand and English hand (both cognate), as well as Russian [ruka] and Polish [rɛ̃ŋka] (both cognate).

Note that the use of skip-grams here mimics the alignment component of those automatic cognate detection methods in which alignments are used. The difference is that we do not compute the alignments between a sequence pair only, but project each word to a potential (and likewise also restricted) alignment representation. Note also that – even if skip-grams may take some time to compute – our approach is essentially linear in its computation time requirements, since the skip-gram calculation represents a constant factor.

When searching for potential cognates in our bipartite network, we can either (A) treat all connected components as cognate sets, or (B) use some additional algorithm to partition the bipartite network into our putative cognate sets. While computation time will be higher in the latter case, both cases will be drastically faster than existing popular methods for automatic cognate detection, since our bipartite-graph-based approach essentially avoids pairwise word comparisons.

Following these basic ideas, we have developed a new method for fast cognate detection using bipartite networks of sound-class-based skip-grams (BipSkip), implemented as a Python library (see SI 1). The basic working procedure is extremely straightforward and consists of three stages. In a first stage, a bipartite network of words and their corresponding skip-grams is constructed, with edges drawn between all words and their corresponding skip-grams. In a second, optional stage, the bipartite graph is refined by deleting all skip-gram nodes which are linked to fewer word nodes than a user-defined threshold. In a third stage, the bipartite graph is projected to a monopartite graph and partitioned into cognate sets, either by its connected components, or with the help of graph partitioning algorithms such as, e.g., Infomap (Rosvall and Bergstrom, 2008).

Since it is difficult to assess which kinds of skip-grams and which kinds of sound-class systems would yield the most promising results, we conducted an exhaustive parameter training using the data of List (2014, see details reported in SI 2). This resulted in the following parameters used as defaults for our approach: (1) compute skip-grams exclusively from consonant classes, (2) compute skip-grams of length 4, (3) include a gapped version of each word form (allowing for matches with a replacement), (4) use the SCA sound class model (List, 2014), and (5) prune the graph by deleting all skip-gram nodes which link to less than 20% of the median degree of all skip-gram nodes in the data.
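The skip-gram computation just described can be reproduced with the standard library (a sketch under our own names, not the authors' package): for a word of six sound classes, the n-grams of length 5 with one skip are exactly its length-5 subsequences.

```python
from itertools import combinations

def skip_grams(word, n):
    """All length-n subsequences (n-grams with skips) of a sound-class string."""
    return {"".join(chars) for chars in combinations(word, n)}

english = skip_grams("TVTVR", 5)   # {'TVTVR'} -- the only possible solution
german = skip_grams("TVKTVR", 5)   # six skip-grams, including 'TVTVR'
shared = english & german          # {'TVTVR'} links both words in the network
```

The shared skip-gram TVTVR is the node that connects English daughter and German Tochter in the bipartite network of Figure 1.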

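The three stages of the BipSkip procedure can be sketched in plain Python (a simplified toy using connected components only; the names and thresholding details are ours, not those of the published plug-in):

```python
from collections import defaultdict
from itertools import combinations

def skip_grams(word, n):
    return {"".join(chars) for chars in combinations(word, n)}

def bipskip_components(words, n=4, min_degree=2):
    # Stage 1: bipartite network, words linked to their skip-grams.
    gram_to_words = defaultdict(set)
    for word in words:
        for gram in skip_grams(word, n):
            gram_to_words[gram].add(word)
    # Stage 2 (optional): prune skip-gram nodes with too few word neighbours.
    pruned = {g: ws for g, ws in gram_to_words.items() if len(ws) >= min_degree}
    # Stage 3: project to words and take connected components (union-find).
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w
    for linked in pruned.values():
        first, *rest = linked
        for other in rest:
            parent[find(other)] = find(first)
    clusters = defaultdict(set)
    for word in words:
        clusters[find(word)].add(word)
    return list(clusters.values())

# daughter/Tochter share length-4 skip-grams and merge; the others stay apart.
print(bipskip_components(["TVTVR", "TVKTVR", "HANT", "RYKA"]))
```

Note that no pair of words is ever compared directly: words meet only through shared skip-gram nodes, which is what keeps the procedure fast.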

[Figure 1: Bipartite graph of English daughter, German Tochter, and their corresponding sound-class-based skip-grams of size 5 (skip-gram nodes: TKTVR, TVKTR, TVKTV, TVKVR, TVTVR, VKTVR).]

This setting yielded F-scores of 0.854 (connected components partitioning) and 0.852 (Infomap partitioning) on the training data (using B-Cubed scores as measure, cf. Amigó et al. 2009 and Section 4.2), suggesting that our BipSkip method performs in a manner comparable to the SCA method for automatic cognate detection (List, 2014), which is based on pairwise sequence comparison using improved sound class models and alignment techniques. This also means that it clearly outperforms the CCM approach on the training data (which scores 0.8), as well as the computationally rather demanding edit distance approach (which scores 0.814, see List et al. 2017).

IPA            çɛri   hant   hænd   ruka   rɛ̃ŋka
Cognacy        1      2      2      3      3
Sound Classes  CERI   HANT   HENT   RYKA   RENKA
H-T            -      +      +      -      -
HN-            -      +      +      -      -
HNT            -      +      +      -      -
R-K            -      -      -      +      +

Table 1: Shared skip-grams in words meaning "hand" in Greek, German, English, Russian, and Polish reflect the known cognate relations of the words.

3 Fast Phylogenetic Inference

Methods for Bayesian phylogenetic inference in evolutionary and historical linguistics (Yang and Rannala, 1997) are all based on the following Bayes rule:

    f(Ψ|X) = f(X|Ψ) f(Ψ) / f(X),    (1)

where each state Ψ is composed of τ, the tree topology, T, the branch length vector of the tree, and θ, the substitution model parameters, and where X is a binary cognate data matrix in which each column codes a cognate set as a binary vector. The posterior distribution f(Ψ|X) is difficult to calculate analytically, since the number of possible rooted topologies, (2L−3)! / (2^(L−2) (L−2)!), grows factorially with the number of languages L in the sample. Therefore, Markov Chain Monte Carlo (MCMC) methods are used to estimate the posterior probability of Ψ.

The Metropolis-Hastings algorithm (an MCMC algorithm) is used to sample the parameters from the posterior distribution. This algorithm constructs a Markov chain by proposing a new state Ψ* and then accepting the proposed state Ψ* with the probability given in Equation 2, where q(·) is the proposal distribution:

    r = [f(X|Ψ*) f(Ψ*) / f(X|Ψ) f(Ψ)] · [q(Ψ|Ψ*) / q(Ψ*|Ψ)]    (2)

The likelihood of the data given the new parameters is computed using the pruning algorithm (Felsenstein, 2004, 251–255), which is a special case of the variable elimination algorithm (Jordan et al., 2004). We assume that the parameters τ, T, and θ are independent of each other. In the above procedure, a Markov chain is run for millions of steps and sampled at regular intervals (called thinning) to reduce autocorrelation between the sampled states.

A problem with the above procedure is that the chain can get stuck in a local maximum when the posterior has multiple peaks. A different approach known as Metropolis-coupled Markov Chain Monte Carlo (MC³) has been applied to phylogenetics to explore the tree space more efficiently (Altekar et al., 2004).

3.1 MC³

In the MC³ approach, n chains are run in parallel, where n − 1 chains are heated by raising the posterior probability to a power 1/T_i, where T_i is the temperature of the ith chain, defined as 1 + δ(i − 1) with δ > 0. A heated chain (i > 1) can explore peaks more efficiently than the cold chain, since the posterior density is flattened. The MC³ approach swaps the states between a cold chain and a hot chain at regular intervals using a modified Metropolis-Hastings ratio. This swapping procedure allows the cold chain to explore multiple peaks in the tree space successfully.

The MC³ procedure is computationally expensive, since it requires multiple CPU cores to run the Markov chains in parallel. As a matter of fact, Rama et al. (2018) employ the MC³ procedure (as implemented in MrBayes; Ronquist et al., 2012) to infer family phylogenetic trees from automatically inferred cognate judgments.

3.2 Simulated Annealing

In this paper, we employ a computationally less intensive and fast procedure inspired by simulated annealing (Andrieu et al., 2003) to infer the maximum-a-posteriori (MAP) tree. We refer to the simulated annealing MCMC as MAPLE (MAP estimation for Language Evolution) in the rest of the paper. In this procedure, the Metropolis-Hastings ratio is computed according to Equation 3:

    r = [f(X|Ψ*) f(Ψ*) / f(X|Ψ) f(Ψ)]^(1/T_i) · [q(Ψ|Ψ*) / q(Ψ*|Ψ)]    (3)

In this equation, the initial temperature T_0 is set to a high value and then decreased according to a cooling schedule until T_i → 0. The final state of the chain is treated as the maximum-a-posteriori (MAP) estimate of the inference procedure. We implement our own tree inference software in Cython, which is made available along with the paper.

All our Bayesian analyses use binary datasets with states 0 and 1. We employ the Generalized Time Reversible model (Yang, 2014, Ch. 1) for computing the transition probabilities between the individual states (0, 1). The rate variation across cognate sets is modeled using a four-category discrete Γ distribution (Yang, 1994) which is sampled from a Γ distribution with shape parameter α.

MCMC moves  We employ multiple moves to sample the parameters. For continuous parameters such as branch lengths and the shape parameter, we use a multiplier move with an exponential distribution (µ = 1) as the proposal distribution. In the case of the stationary frequencies, we employ a uniform slider move that randomly selects two states and proposes a new frequency such that the sum of the frequencies of the two states does not change. We use two tree moves: nearest neighbor interchange (NNI) and a specialized subtree pruning and regrafting move that operates on leaf nodes to propose new trees (Lakner et al., 2008).

Cooling Schedule  The cooling schedule is very important for the performance of a simulated annealing algorithm (Andrieu et al., 2003). We experimented with a linear cooling schedule that starts with a high initial temperature T_0 and reduces the temperature at iteration i through T_i = λT_{i−1}, where 0.85 ≤ λ ≤ 0.96 (Du and Swamy, 2016). We decrease the value of T_i until T_i = 10⁻⁵. In this paper, we experiment with reducing the temperature over a step size s, starting from an initial temperature T_0.

4 Evaluation

4.1 Materials

All the data for training and testing was taken from publicly available sources and has further been submitted along with the supplementary material accompanying this paper. For the training of the parameters of our BipSkip approach for fast cognate detection, the data by List (2014) was used in the form provided by List et al. (2017). This dataset consists of six subsets, each covering a subgroup of a language family of moderate size and time depth (see SI 2). To test the BipSkip method, we used both the test set of List et al. (2017), consisting of six distinct datasets of moderate size, as well as five large datasets from five different language families (Austronesian, Austro-Asiatic, Indo-European, Pama-Nyungan, and Sino-Tibetan) used for the study by Rama et al. (2018) on the potential of automatic cognate detection methods for the purpose of phylogenetic reconstruction. The latter datasets were also used to test the MAPLE approach for phylogenetic inference. The other two datasets could not be used for the phylogenetic inference task, since these datasets contain a large number of largely unresolved dialect varieties for which no expert classifications are available at the moment. More information on all datasets is given in Table 2.

4.2 Evaluation Methods

We evaluate the results of the automatic cognate detection task through B-Cubed scores (Amigó et al., 2009), a measure now widely used for the task of assessing how well a given cognate detection method performs on a given test dataset (Hauer and Kondrak, 2011; List et al., 2016; Jäger et al., 2017; List et al., 2017). B-Cubed scores are reported in form of precision, recall, and F-scores, with high precision indicating a high amount of true positives, and high recall indicating a high amount of true negatives.
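The B-Cubed measure itself is easy to state in code (an illustrative sketch of the definition, not LingPy's implementation):

```python
def b_cubed(gold, predicted):
    """B-Cubed precision, recall, and F-score for two clusterings.

    gold, predicted: dicts mapping each word to its cluster label.
    """
    words = list(gold)
    precision = recall = 0.0
    for w in words:
        pred_cluster = {v for v in words if predicted[v] == predicted[w]}
        gold_cluster = {v for v in words if gold[v] == gold[w]}
        overlap = len(pred_cluster & gold_cluster)
        precision += overlap / len(pred_cluster)
        recall += overlap / len(gold_cluster)
    p, r = precision / len(words), recall / len(words)
    return p, r, 2 * p * r / (p + r)

# hypothetical toy data: two gold cognate sets, imperfect prediction
gold = {"hand_en": 1, "hand_de": 1, "cheri_el": 2}
pred = {"hand_en": "a", "hand_de": "b", "cheri_el": "b"}
print(b_cubed(gold, pred))
```

Averaging per word (rather than per cluster) is what makes B-Cubed robust to differing cluster sizes.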

Details, along with an example of how B-Cubed scores can be computed, are given in List et al. (2017). An implementation of the B-Cubed measure is available from LingPy, a Python library for quantitative tasks in historical linguistics (List et al., 2018a).

We evaluate the performance of the phylogenetic reconstruction methods by comparing the inferred trees to expert phylogenies through the Generalized Quartet Distance (GQD), a variant of the quartet distance originally developed in bioinformatics (Christiansen et al., 2006) and adapted for linguistic trees by Pompei et al. (2011). A quartet consists of four languages and can either be a star or a butterfly. The quartet distance is defined as the total number of different quartets divided by the total number of possible quartets, C(n, 4), in the tree. This definition of quartet distance penalizes the tree when the gold standard tree has non-binary nodes, which is quite common in linguistic phylogenies. The GQD version disregards star quartets and computes the distance between the inferred tree and the gold standard tree as the ratio between the number of different butterflies and the total number of butterflies in the gold standard tree.

Dataset        Concepts  Languages  Cognates
Austronesian   210       20         2864
Bai            110       9          285
Chinese        140       15         1189
Indo-European  207       20         1777
Japanese       200       10         460
Ob-Ugrian      110       21         242

(a) BipSkip training data.

Dataset   Concepts  Languages  Cognates
Bahnaric  200       24         1055
Chinese   180       18         1231
Huon      139       14         855
Romance   110       43         465
Tujia     109       5          179
Uralic    173       7          870

(b) BipSkip test data.

Dataset         Concepts  Languages  Cognates
Austronesian    210       45         3804
Austro-Asiatic  200       58         1872
Indo-European   208       42         2157
Pama-Nyungan    183       67         6634
Sino-Tibetan    110       64         1402

(c) BipSkip and MAPLE test data.

Table 2: Datasets (name, concepts, and languages) used for training (a) and testing of BipSkip (b, c) and MAPLE (c). Data in (a) is from List (2014), data in (b) is from List et al. (2017), and data in (c) comes from Rama et al. (2018).

4.3 Implementation

Both methods are implemented in form of Python packages, available – along with detailed installation instructions – from the supplementary material accompanying the paper (SI 1 and SI 4). While the BipSkip method for fast cognate detection is implemented in form of a plug-in for the LingPy library, and thus accepts the standard wordlist formats used in LingPy as input, MAPLE reads the data from files encoded in the Nexus format (Maddison et al., 1997).

4.4 Results

Fast Cognate Detection  We tested the two variants of the new BipSkip approach for automatic cognate detection, connected components and Infomap (Rosvall and Bergstrom, 2008), on the two test sets (see Table 2) and calculated the B-Cubed precision, recall, and F-scores. To allow for a closer comparison with cognate detection algorithms of similar strength, we also calculated the results for the SCA method for cognate detection described in List et al. (2017) and for the CCM approach described in Section 2. The SCA method uses the Sound-Class-Based Alignment algorithm (List, 2014) to derive distance scores for all word pairs in a given meaning slot and uses a flat version of the UPGMA method (Sokal and Michener, 1958) to cluster words into cognate sets. Table 3 lists the detailed results for all four approaches and all 11 subsets of the two datasets, including the computation time.

As can be seen from the results in Table 3, the BipSkip algorithm clearly outperforms the CCM method in terms of overall accuracy on both datasets. It also comes very close in performance to the SCA method, while requiring only a small fraction of the time needed to run the SCA analysis. An obvious weakness of our current BipSkip implementation is its performance on South-East Asian language data. Here, we can see that the exclusion of tones and vowels, dictated by our training procedure, leads to a higher amount of false positives.

              CCM              BipSkip-CC       BipSkip-IM       SCA
Dataset       P    R    FS     P    R    FS     P    R    FS     P    R    FS
Bahnaric      0.92 0.63 0.75   0.82 0.87 0.84   0.85 0.85 0.85   0.88 0.84 0.86
Chinese       0.81 0.74 0.78   0.66 0.95 0.77   0.68 0.93 0.78   0.80 0.79 0.79
Huon          0.89 0.84 0.87   0.73 0.95 0.80   0.73 0.93 0.81   0.79 0.93 0.86
Romance       0.94 0.61 0.74   0.91 0.89 0.90   0.92 0.86 0.89   0.93 0.81 0.87
Tujia         0.97 0.74 0.84   0.89 0.95 0.90   0.89 0.90 0.90   0.97 0.83 0.89
Uralic        0.96 0.86 0.91   0.84 0.93 0.88   0.84 0.93 0.88   0.91 0.91 0.91
TOTAL         0.92 0.74 0.81   0.81 0.91 0.85   0.82 0.90 0.85   0.88 0.85 0.86
TIME          0m1.400s         0m2.960s         0m5.909s         0m25.768s

(a) Test data from List et al. (2017).

               CCM              BipSkip-CC       BipSkip-IM       SCA
Dataset        P    R    FS     P    R    FS     P    R    FS     P    R    FS
Austro-Asiatic 0.79 0.64 0.71   0.61 0.81 0.70   0.67 0.77 0.72   0.73 0.80 0.76
Austronesian   0.88 0.58 0.70   0.72 0.72 0.72   0.77 0.68 0.72   0.82 0.74 0.77
Indo-European  0.89 0.64 0.75   0.82 0.73 0.77   0.86 0.69 0.77   0.89 0.74 0.81
Pama-Nyungan   0.64 0.82 0.72   0.71 0.79 0.75   0.75 0.77 0.76   0.59 0.85 0.69
Sino-Tibetan   0.78 0.35 0.48   0.59 0.62 0.60   0.61 0.59 0.60   0.73 0.46 0.56
TOTAL          0.80 0.61 0.67   0.69 0.73 0.71   0.73 0.70 0.71   0.75 0.72 0.72
TIME           0m2.938s         0m9.642s         0m17.642s        2m40.472s

(b) Test data from Rama et al. (2018).

Table 3: Results of the cognate detection experiments. Table (a) presents the results for the four methods tested on the dataset by List et al. (2017): the CCM method, our new BipSkip method in two variants (with connected component clusters, labelled CC, and Infomap clusters, labelled IM), and the SCA method. Table (b) presents the results on the large test set by Rama et al. (2018). The TIME entries indicate the time the code needed to run on a Linux machine (Thinkpad X280, i5, 8GB, ArchLinux OS), using the Unix "time" command (reporting the "real" time value).

Unfortunately, this cannot be overcome by simply including tones in the skip-grams, since not all languages in the South-East Asian datasets (Sino-Tibetan and Austro-Asiatic) are tonal, and tone matchings would thus lead to an unwanted clustering of tonal and non-tonal languages in the data, which would contradict certain subgroups in which tone developed only in a few language varieties, such as Tibetan.

The most promising approach to deal consistently with language families such as Sino-Tibetan would therefore be to extend the current approach to identify partial instead of complete cognates (List et al., 2016), given the prominence of processes such as compounding and derivation in the history of Sino-Tibetan and its descendants. Partial cognates, however, do not offer a direct solution to the problem, since we currently lack phylogenetic algorithms that could handle partial cognates (List, 2016), while approaches to convert partial into full cognates usually require taking semantic information into account (Sagart et al., 2019, 10321). In addition to any attempt to improve on BipSkip by enhancing the training of features used for South-East Asian languages, consistent approaches for the transformation of partial into complete cognate sets will have to be developed in the future.

Neither of the two BipSkip approaches can compete with the LexStat-Infomap approach, which yields F-scores of 0.89 on the first test set (see List et al. 2017) and 0.77 on the second test set (see Rama et al. 2018), but this is not surprising, given that none of the four approaches compared here computes regular sound correspondence information. The obvious drawback of LexStat is its computation time, with more than 30 minutes for the first and more than two hours for the second test set. While the superior results surely justify its use, the advantage of methods like BipSkip is that they can be used for the purpose of exploratory data analysis or in web-based applications.

6231 Fast Phylogenetic Inference We present the re- Family s/T0 GQD NGens Time (s) sults of the phylogenetic experiments in Table4. Austro-Asiatic 80/10 0.0155 18080 282.548 Each sub-table shows the setting for s, T0 that Austronesian 20/80 0.0446 5320 46.698 yielded the lowest GQD for each cognate detec- Indo-European 20/40 0.0138 5060 46.014 tion method. We experimented over a wide range Pama-Nyungan 40/60 0.1476 10440 224.036 of settings for s ∈ {1, 5, 10, 20, 40, 80, 100} and Sino-Tibetan 80/60 0.0958 20880 295.157 (a) Results for CCM cognates. T0 ∈ {10, 20,..., 90, 100}. We provide the time and the number of generations taken to infer the MAP tree for each cognate inference program and Family s/T0 GQD NGens Time (s) language family. We note that the longest run Austro-Asiatic 100/90 0.0135 26900 439.005 takes less than fifteen minutes across all the fami- Austronesian 100/80 0.0148 26600 285.659 lies. In comparison, the results reported by Rama Indo-European 20/80 0.0211 5320 41.544 Pama-Nyungan 80/100 0.1318 21680 435.8 et al.(2018) using MrBayes takes at least four Sino-Tibetan 100/10 0.0722 22600 235.774 hours on six cores for each of the language fam- (b) Results for SCA cognates. ily using the SCA method.

We examined which settings of s/T0 give the Family s/T0 GQD NGens Time (s) lowest results and found that low step sizes such Austro-Asiatic 40/60 0.0415 10440 151.561 as 1 give the lowest results for a wide range of Austronesian 20/20 0.1022 4780 42.097 T0. We examined the results across the settings Indo-European 80/10 0.0322 18080 190.48 and found that the best results can be achieved Pama-Nyungan 100/40 0.1647 25300 759.023 with a step size above 20 with initial tempera- Sino-Tibetan 80/20 0.5218 19120 233.173 ture set to 50. The lowest GQD distances were (c) Results for BipSkip-CC cognates. obtained with the SCA cognates. The BipSkip- IM method emerged as the winner in the case of Family s/T0 GQD NGens Time (s) the Pama-Nyungan language family. The best re- Austro-Asiatic 80/80 0.0245 21280 310.403 sult for Pama-Nyungan is better than the average Austronesian 40/10 0.0927 9040 82.443 GQD obtained through expert cognate judgments Indo-European 10/100 0.046 2710 28.691 reported in Rama et al.(2018). The weakness Pama-Nyungan 80/70 0.0777 21120 662.447 Sino-Tibetan 40/80 0.3049 10640 129.903 of the BipSkip methods with respect to the Sino- (d) Results for BipSkip-IM cognates. Tibetan language family is also visible in terms of the GQD distance. Table 4: Results for the MAPLE approach to fast phy- Comparing the results obtained for the SCA logenetic inference for each method. The best step size cognates obtained with MAPLE against the ones and initial temperature setting is shown as s/T0. NGens inferred with MrBayes as reported in Rama et al. is the number of generations, Time is the time taken to run the inference in number of seconds on a single core (2018), it becomes also clear that our method is Linux machine. at least as good as MrBayes, showing better re- sults in Austro-Asiatic, Austronesian, and Pama- Nyungan. 
MAPLE with gold standard cognates We further tested if gold standard cognates make a difference in the inferred tree quality. We find that the tree quality improves if we employ gold standard cognates to infer the trees. This result supports the research track of developing high-quality automated cognate detection systems which can be employed to analyze hitherto less studied language families of the world.

Family           s/T0    GQD     NGens   Time (s)
Austro-Asiatic   100/90  0.0058  26900   476.113
Austronesian     80/80   0.0389  21280   123.167
Indo-European    10/10   0.0135   2260    16.713
Pama-Nyungan     100/10  0.061   22600   605.319
Sino-Tibetan     100/50  0.0475  25700   206.952

Table 5: Results for gold standard cognates.

Convergence We investigated if the MAPLE algorithm infers trees whose quality improves across the generations by plotting the GQD of the sampled trees against the temperature for all the five best settings of s/T0 (in bold in Table 4) in Figure 2. The figure clearly shows that at high temperature settings the quality of the trees is low, whereas as the temperature approaches zero, the tree quality gets better for all the language families. Moreover, the curves are monotonically decreasing once the temperature is below 12.

Figure 2: Lineplot of GQD against temperature for all the five different language families. The trendlines are drawn using LOESS smoothing.

5 Conclusion

In this paper we proposed an automated framework for very fast and still highly reliable phylogenetic reconstruction in historical linguistics. Our framework introduces two new methods. The BipSkip approach uses bipartite networks of sound-class-based skip-grams for the task of automatic cognate detection. The MAPLE approach makes use of a simulated annealing technique to infer a MAP tree for linguistic evolution. Both methods are not only very fast, but – as our tests show – also quite accurate in their performance when compared to similar, much slower algorithms proposed in the past. In combination, the methods can be used to assess preliminary phylogenies from linguistic datasets of more than 100 languages in less than half an hour on an ordinary single-core machine.

We are well aware that our framework is by no means perfect, and that it should be used with a certain amount of care. Our methods are best used for the purpose of exploratory analysis on larger datasets which have so far not been thoroughly studied. Here, we believe that the new framework can provide considerable help to future research, specifically because it does not require the technical support of high-end clusters.

Both methods can be further improved in multiple ways. Our cognate detection method's weak performance on South-East Asian languages could be addressed by enabling it to detect partial cognates instead of complete cognates. At the same time, new models, allowing for a consistent handling of multi-state characters and a direct handling of partial cognates, could be added to our fast Bayesian phylogenetic inference approach.

Acknowledgments

We thank the three reviewers for the comments which helped improve the paper. TR took part in the BigMed project (https://bigmed.no/) at the University of Oslo when the work was performed. JML's work was supported by the ERC Starting Grant 715618 "Computer-Assisted Language Comparison" (http://calc.digling.org).

References

Gautam Altekar, Sandhya Dwarkadas, John P. Huelsenbeck, and Fredrik Ronquist. 2004. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3):407–415.

Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. 2009. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486.

Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. 2003. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43.

Adam S. Arnaud, David Beck, and Grzegorz Kondrak. 2017. Identifying cognate sets across dictionaries of related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2509–2518. Association for Computational Linguistics.

Remco Bouckaert, Philippe Lemey, Michael Dunn, Simon J. Greenhill, Alexander V. Alekseyenko, Alexei J. Drummond, Russell D. Gray, Marc A. Suchard, and Quentin D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.

Will Chang, Chundra Cathcart, David Hall, and Andrew Garrett. 2015. Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91(1):194–244.

Chris Christiansen, Thomas Mailund, Christian N. S. Pedersen, Martin Randers, and Martin Stig Stissing. 2006. Fast calculation of the quartet distance between trees of arbitrary degrees. Algorithms for Molecular Biology, 1(1).

Aron B. Dolgopolsky. 1964. Gipoteza drevnejšego rodstva jazykovych semej severnoj evrazii s verojatnostej točky zrenija [A hypothesis on the oldest relationships among the language families of Northern Eurasia from a probabilistic point of view]. Voprosy Jazykoznanija, 2:53–63.

Ke-Lin Du and M. N. S. Swamy. 2016. Simulated annealing. In Search and Optimization by Metaheuristics, pages 29–36. Springer.

Joseph Felsenstein. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts.

Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439.

Simon J. Greenhill, Quentin D. Atkinson, Andrew Meade, and Russell D. Gray. 2010. The shape and tempo of language evolution. Proceedings of the Royal Society B: Biological Sciences, 277(1693):2443–2450.

Simon J. Greenhill and Russell D. Gray. 2009. Austronesian language phylogenies: Myths and misconceptions about Bayesian computational methods. In Austronesian Historical Linguistics and Culture History: A Festschrift for Robert Blust, pages 375–397.

Harald Hammarström, Robert Forkel, and Martin Haspelmath. 2017. Glottolog. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Bradley Hauer and Grzegorz Kondrak. 2011. Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 865–873. AFNLP.

Gerhard Jäger, Johann-Mattis List, and Pavel Sofroniev. 2017. Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Long Papers, pages 1204–1215, Valencia. Association for Computational Linguistics.

Anni Järvelin, Antti Järvelin, and Kalervo Järvelin. 2007. s-grams: Defining generalized n-grams for information retrieval. Information Processing & Management, 43(4):1005–1019.

Michael I. Jordan et al. 2004. Graphical models. Statistical Science, 19(1):140–155.

Vishnupriya Kolipakam, Fiona M. Jordan, Michael Dunn, Simon J. Greenhill, Remco Bouckaert, Russell D. Gray, and Annemarie Verkerk. 2018. A Bayesian phylogenetic study of the Dravidian language family. Royal Society Open Science, 5:171504.

Clemens Lakner, Paul Van Der Mark, John P. Huelsenbeck, Bret Larget, and Fredrik Ronquist. 2008. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Systematic Biology, 57(1):86–103.

Johann-Mattis List. 2014. Sequence Comparison in Historical Linguistics. Düsseldorf University Press, Düsseldorf.

Johann-Mattis List. 2016. Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution, 1(2):119–136.

Johann-Mattis List, Simon Greenhill, Tiago Tresoldi, and Robert Forkel. 2018a. LingPy. A Python library for quantitative tasks in historical linguistics. Max Planck Institute for the Science of Human History, Jena.

Johann-Mattis List, Simon J. Greenhill, and Russell D. Gray. 2017. The potential of automatic word comparison for historical linguistics. PLOS ONE, 12(1):1–18.

Johann-Mattis List, Philippe Lopez, and Eric Bapteste. 2016. Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the Association for Computational Linguistics 2016 (Volume 2: Short Papers), pages 599–605, Berlin. Association for Computational Linguistics.

Johann-Mattis List, Mary Walworth, Simon J. Greenhill, Tiago Tresoldi, and Robert Forkel. 2018b. Sequence comparison in computational historical linguistics. Journal of Language Evolution, 3(2):130–144.

David R. Maddison, David L. Swofford, and Wayne P. Maddison. 1997. NEXUS: An extensible file format for systematic information. Systematic Biology, 46(4):590–621.

Mark Pagel and Andrew Meade. 2006. Estimating rates of lexical replacement on phylogenetic trees of languages. In Peter Forster and Colin Renfrew, editors, Phylogenetic Methods and the Prehistory of Languages, pages 173–182. McDonald Institute Monographs, Cambridge.

Simone Pompei, Vittorio Loreto, and Francesca Tria. 2011. On the accuracy of language trees. PLoS ONE, 6(6):e20109.

Taraka Rama. 2015. Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1227–1231.

Taraka Rama. 2018a. Similarity dependent Chinese restaurant process for cognate identification in multilingual wordlists. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 271–281.

Taraka Rama. 2018b. Three tree priors and five datasets. Language Dynamics and Change, 8(2):182–218.

Taraka Rama, Johann-Mattis List, Johannes Wahle, and Gerhard Jäger. 2018. Are automatic methods for cognate detection good enough for phylogenetic reconstruction in historical linguistics? In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 393–400.

Taraka Rama, Johannes Wahle, Pavel Sofroniev, and Gerhard Jäger. 2017. Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938.

Taraka Rama and Søren Wichmann. 2018. Towards identifying the optimal datasize for lexically-based Bayesian inference of linguistic phylogenies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1578–1590.

Fredrik Ronquist, Maxim Teslenko, Paul van der Mark, Daniel L. Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc A. Suchard, and John P. Huelsenbeck. 2012. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, 61(3):539–542.

Martin Rosvall and Carl T. Bergstrom. 2008. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123.

Laurent Sagart, Guillaume Jacques, Yunfan Lai, Robin Ryder, Valentin Thouzeau, Simon J. Greenhill, and Johann-Mattis List. 2019. Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences of the United States of America, 116:1–6.

Robert R. Sokal and Charles D. Michener. 1958. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 28:1409–1438.

Peter Turchin, Ilja Peiros, and Murray Gell-Mann. 2010. Analyzing genetic connections between languages by matching consonant classes. Journal of Language Relationship, 3:117–126.

Søren Wichmann, André Müller, and Viveka Velupillai. 2010. Homelands of the world's language families: A quantitative approach. Diachronica, 27(2):247–276.

Ziheng Yang. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution, 39(3):306–314.

Ziheng Yang. 2014. Molecular Evolution: A Statistical Approach. Oxford University Press, Oxford.

Ziheng Yang and Bruce Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Molecular Biology and Evolution, 14(7):717–724.

A Supplemental Material

The supplemental material was submitted along with this paper and also uploaded to Zenodo (https://doi.org/10.5281/zenodo.3237508). The packages provide all data needed to replicate the analyses, as well as detailed instructions on how to apply the methods. In the paper, we point to the relevant sections in the supplemental material.