Identifying Cognate Sets Across Dictionaries of Related Languages

Identifying Cognate Sets Across Dictionaries of Related Languages Adam St Arnaud David Beck Grzegorz Kondrak Dept of Computing Science Dept of Linguistics Dept of Computing Science University of Alberta University of Alberta University of Alberta [email protected] [email protected] [email protected] Abstract 1992). While cognates are valuable to linguists, their identification is a time-consuming process, We present a system for identifying cog- even for experts, who have to sift through hun- nate sets across dictionaries of related lan- dreds or even thousands of words in related languages. The likelihood of a cognate re- guages. The languages that are the least well stud- lationship is calculated on the basis of a ied, and therefore the ones in which historical lin- rich set of features that capture both pho- guists are most interested, often lack cognate in- netic and semantic similarity, as well as formation. the presence of regular sound correspon- A number of computational methods have been dences. The similarity scores are used proposed to automate the process of cognate iden- to cluster words from different languages tification. Many of the systems focus on iden- that may originate from a common proto- tifying cognates within classes of semantically word. When tested on the Algonquian lan- equivalent words, such as Swadesh lists of basic guage family, our system detects 63% of concepts. Those systems, which typically con- cognate sets while maintaining cluster pu- sider only the phonetic or orthographic forms of rity of 70%. words, can be further divided into the ones that operate on language pairs (pairwise) vs. multilin- 1 Introduction gual approaches. However, because of seman- Cognates are words in related languages that have tic drift, many cognates are no longer exact syn- originated from the same word in an ancestor lan- onyms, which severely limits the effectiveness of guage; for example English earth and German such systems. For example, a cognate pair like En- Erde. On average, cognates display higher pho- glish bite and French fendre “to split” cannot be netic and semantic similarity than random word detected because these words are listed under dif- pairs between languages that are indisputably re- ferent basic meanings in the Comparative Indoeu- lated (Kondrak, 2013). The term cognate is some- ropean Database (Dyen et al., 1992). In addition, times used within computational linguistics to de- the number of basic concepts is typically small. note orthographically similar words that have the In this paper, we address the challenging task of same meaning (Nakov and Tiedemann, 2012). In identifying cognate sets across multiple languages this work, however, we adhere to the strict linguis- directly from dictionary lists representing related tic definition of cognates and aim to distinguish languages, by taking into account both the forms them from lexical borrowings by detecting regular of words and their dictionary definitions (c.f. Fig- sound correspondences. ure1). Our methods are designed for less-studied Cognate information between languages is crit- languages — we assume only the existence of ba- ical to the field of historical and comparative lin- sic dictionaries containing a substantial number of guistics, where it plays a central role in determin- word forms in a semi-phonetic notation, with the ing the relations and structures of language fami- meaning of words conveyed using one of the major lies (Trask, 1996). Automated phylogenetic recon- languages. Such dictionaries are typically created structions often rely on cognate information as in- before Bible translations, which have been accom- put (Bouchard-Cotˆ e´ et al., 2013). The percentage plished for most of the world’s languages. of shared cognates can also be used to estimate the While our approach is unsupervised, assuming time of pre-historic language splits (Dyen et al., no cognate sets from the analyzed language fam- 2519 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics aːyweːpiwin ease, rest kaːskipiteːw he pulls him scraping C mihkweːkin red cloth C aːyweːpiwin ease, rest … 42 O aːnweːpiwin rest, repose ahkawaːpiwa he watches meškweːkenwi red woolen handcloth F yoːwe earlier, before … C mihkweːkin red cloth F meškweːkenwi red woolen handcloth 1725 M mæhkiːkan red flannel iːnekænok they are so big, tall O miskweːkin red cloth kaːskeponæːw he scratches him M mæhkiːkan red flannel … C kaːskipiteːw he pulls him scraping aːnweːpiwin rest, repose M kaːskeponæːw he scratches him 872 kaːškipin scrape, claw O kaːškipin scrape, claw O miskweːkin red cloth … Figure 1: Example of multilingual cognate set identification across four Algonquian dictionaries: Cree (C), Fox (F), Menominee (M) and Ojibwa (O). Cognate set numbers are shown on the right. ily to start with, it incorporates supervised ma- netic forms. Turchin et al.(2010) apply a heuris- chine learning models that either leverage cognate tic based on consonant classes to identify the ratio data from unrelated families, or use self-training of cognate pairs to non-cognate pairs between lan- on subsets of likely cognate pairs. We derive guages in an effort to determine the likelihood that two types of models to classify pairs of words they are related. Ciobanu and Dinu(2013) find across languages as either cognate or not. The cognate pairs by referring to dictionaries contain- language-independent general model employs a ing etymological information. Rama(2015) ex- number of features defined on both word forms periments with features motivated by string ker- and definitions, including word vector representa- nels for pairwise cognate classification. tions. The additional specific models exploit regu- A more challenging version of the task is to lar sound correspondences between specific pairs cluster cognates within lists of words that have of languages. The scores from the general and identical definitions. Hauer and Kondrak(2011) specific models inform a clustering algorithm that use confidence scores from a binary classifier that constructs the proposed cognate sets. incorporates a variety of string similarity features We evaluate our system on dictionary lists that to guide an average score clustering algorithm. represent four indigenous North American lan- Hall and Klein(2010, 2011) define generative guages from the Algonquian family. On the task models that model the evolution of words along of pairwise classification, we achieve a 42% error a phylogeny according to automatically learned reduction with respect to the state of the art. On sound laws in the form of parametric edit dis- the task of multilingual clustering, our system de- tances. List and Moran(2013) propose an ap- tects 63% of gold sets, while maintaining a cluster proach based on sound class alignments and an av- purity score of 70%. The system code is publicly 1 erage score clustering algorithm. List et al.(2016) available. extend the approach to include partial cognates 2 Related Work within word lists. Cognate identification that considers semantic Most previous work in automatic cognate identi- information is a less-studied problem. Again, the fication only consider words as cognates if they task can be framed as either a pairwise classifi- have identical definitions. As such, they make lim- cation or multi-lingual clustering. In a pairwise ited or no use of semantic information. The sim- context, Kondrak(2004) describes a system for plest variant of this task is to make pairwise cog- identifying cognates between language dictionar- nate classifications based on orthographic or pho- ies which is based on phonetic similarity, com- 1https://github.com/ajstarna/SemaPhoR plex multi-phoneme correspondences, and seman- 2520 tic information. The method of Wang and Sitbon LCSR is the longest common subsequence ra- • (2014) employs word sense disambiguation com- tio of the words. bined with classic string similarity measures for Alignment score reflects an overall phonetic finding cognate pairs in parallel texts to aid lan- • similarity, provided by the ALINE phonetic guage learners. aligner (Kondrak, 2009). Finally, very little has been published on creat- ing cognate sets based on both phonetic and se- Consonant match returns the number of • mantic information, which is the task that we fo- aligned consonants normalized by the num- cus on in this paper. Kondrak et al.(2007) com- ber of consonants in the longer word. bine phonetic and sound correspondence scores For example, consider the words meskweˇ :kenwi with a simple semantic heuristic, and create cog- and mæhki:kan (meSkwekenwi and mEhkikan in nate sets by using graph-based algorithms on con- ASJP notation) from cognate set 1725 in Figure1. nected components. Steiner et al.(2011) aim at The corresponding values for the above four fea- a fully automated approach to the comparative tures are 0.364, 0.364, 0.523, and 0.714, respec- method, including cognate set identification and tively. language phylogeny construction. Neither of those The semantic features refer to the dictionary systems and datasets are publicly available for the definitions of words. We assume that the defini- purpose of direct comparison to our method. tions are provided in a single meta-language, such as English or Spanish. We consider not only a def- 3 Methods inition in its entirety, but also its sub-definitions, In this section, we describe the design of our which are separated by commas and semicolons. language-independent general model, as well as We distinguish between a closed class of about the language-specific models. Given a pair of 300 stop words, which express grammatical re- words from related languages, the models produce lationships, and an open class of content words, a score that reflects the likelihood of the words which carry a meaning. Filtering out stopwords re- being cognate. The models are implemented as duces the likelihood of spurious matches between Support Vector Machine (SVM) classifiers via the dictionary definitions.

Identifying Cognate Sets Across Dictionaries of Related Languages

Interdisciplinary Approaches to Stratifying the Peopling of Madagascar

Cognate Words in Mehri and Hadhrami Arabic

Comparing the Cognate Effect in Spoken and Written Second Language Word Production

1 Roger Schwarzschild Rutgers University 18 Seminary Place New

Palatalization/Velar Softening: What It Is and What It Tells Us About the Nature of Language Morris Halle

Jicaque As a Hokan Language Author(S): Joseph H

Cognet: a Large-Scale Cognate Database

Phonetics in Phonology

Working Paper on Exonyms

First Name Americanization Patterns Among Twentieth-Century Jewish Immigrants to the United States

Testing the Efficacy of a Cognate Curriculum

Identification of Cognates and Recurrent Sound Correspondences