Identifying Cognate Sets Across Dictionaries of Related Languages

Adam St Arnaud David Beck Grzegorz Kondrak Dept of Computing Science Dept of Dept of Computing Science University of Alberta University of Alberta University of Alberta [email protected] [email protected] [email protected]

Abstract 1992). While cognates are valuable to linguists, their identification is a time-consuming process, We present a system for identifying cog- even for experts, who have to sift through hun- nate sets across dictionaries of related lan- dreds or even thousands of in related lan- guages. The likelihood of a cognate re- guages. The languages that are the least well stud- lationship is calculated on the basis of a ied, and therefore the ones in which historical lin- rich set of features that capture both pho- guists are most interested, often lack cognate in- netic and semantic similarity, as well as formation. the presence of regular sound correspon- A number of computational methods have been dences. The similarity scores are used proposed to automate the process of cognate iden- to cluster words from different languages tification. Many of the systems focus on iden- that may originate from a common proto- tifying cognates within classes of semantically . When tested on the Algonquian lan- equivalent words, such as Swadesh lists of basic guage family, our system detects 63% of concepts. Those systems, which typically con- cognate sets while maintaining cluster pu- sider only the phonetic or orthographic forms of rity of 70%. words, can be further divided into the ones that operate on language pairs (pairwise) vs. multilin- 1 Introduction gual approaches. However, because of seman- Cognates are words in related languages that have tic drift, many cognates are no longer exact syn- originated from the same word in an ancestor lan- onyms, which severely limits the effectiveness of guage; for example English earth and German such systems. For example, a cognate pair like En- Erde. On average, cognates display higher pho- glish bite and French fendre “to split” cannot be netic and semantic similarity than random word detected because these words are listed under dif- pairs between languages that are indisputably re- ferent basic meanings in the Comparative Indoeu- lated (Kondrak, 2013). The term cognate is some- ropean Database (Dyen et al., 1992). In addition, times used within computational linguistics to de- the number of basic concepts is typically small. note orthographically similar words that have the In this paper, we address the challenging task of same meaning (Nakov and Tiedemann, 2012). In identifying cognate sets across multiple languages this work, however, we adhere to the strict linguis- directly from dictionary lists representing related tic definition of cognates and aim to distinguish languages, by taking into account both the forms them from lexical borrowings by detecting regular of words and their dictionary definitions (c.f. Fig- sound correspondences. ure1). Our methods are designed for less-studied Cognate information between languages is crit- languages — we assume only the existence of ba- ical to the field of historical and comparative lin- sic dictionaries containing a substantial number of guistics, where it plays a central role in determin- word forms in a semi-phonetic notation, with the ing the relations and structures of language fami- meaning of words conveyed using one of the major lies (Trask, 1996). Automated phylogenetic recon- languages. Such dictionaries are typically created structions often rely on cognate information as in- before Bible , which have been accom- put (Bouchard-Cotˆ e´ et al., 2013). The percentage plished for most of the world’s languages. of shared cognates can also be used to estimate the While our approach is unsupervised, assuming time of pre-historic language splits (Dyen et al., no cognate sets from the analyzed language fam-

2519 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics aːyweːpiwin ease, rest kaːskipiteːw he pulls him scraping C mihkweːkin red cloth C aːyweːpiwin ease, rest … 42 O aːnweːpiwin rest, repose

ahkawaːpiwa he watches meškweːkenwi red woolen handcloth F yoːwe earlier, before … C mihkweːkin red cloth F meškweːkenwi red woolen handcloth 1725 M mæhkiːkan red flannel iːnekænok they are so big, tall O miskweːkin red cloth kaːskeponæːw he scratches him M mæhkiːkan red flannel …

C kaːskipiteːw he pulls him scraping aːnweːpiwin rest, repose M kaːskeponæːw he scratches him 872 kaːškipin scrape, claw O kaːškipin scrape, claw O miskweːkin red cloth …

Figure 1: Example of multilingual cognate set identification across four Algonquian dictionaries: Cree (C), Fox (F), Menominee (M) and Ojibwa (O). Cognate set numbers are shown on the right. ily to start with, it incorporates supervised ma- netic forms. Turchin et al.(2010) apply a heuris- chine learning models that either leverage cognate tic based on consonant classes to identify the ratio data from unrelated families, or use self-training of cognate pairs to non-cognate pairs between lan- on subsets of likely cognate pairs. We derive guages in an effort to determine the likelihood that two types of models to classify pairs of words they are related. Ciobanu and Dinu(2013) find across languages as either cognate or not. The cognate pairs by referring to dictionaries contain- language-independent general model employs a ing etymological information. Rama(2015) ex- number of features defined on both word forms periments with features motivated by string ker- and definitions, including word vector representa- nels for pairwise cognate classification. tions. The additional specific models exploit regu- A challenging version of the task is to lar sound correspondences between specific pairs cluster cognates within lists of words that have of languages. The scores from the general and identical definitions. Hauer and Kondrak(2011) specific models inform a clustering algorithm that use confidence scores from a binary classifier that constructs the proposed cognate sets. incorporates a variety of string similarity features We evaluate our system on dictionary lists that to guide an average score clustering algorithm. represent four indigenous North American lan- Hall and Klein(2010, 2011) define generative guages from the Algonquian family. On the task models that model the evolution of words along of pairwise classification, we achieve a 42% error a phylogeny according to automatically learned reduction with respect to the state of the art. On sound laws in the form of parametric edit dis- the task of multilingual clustering, our system de- tances. List and Moran(2013) propose an ap- tects 63% of gold sets, while maintaining a cluster proach based on sound class alignments and an av- purity score of 70%. The system code is publicly 1 erage score clustering algorithm. List et al.(2016) available. extend the approach to include partial cognates 2 Related Work within word lists. Cognate identification that considers semantic Most previous work in automatic cognate identi- information is a less-studied problem. Again, the fication only consider words as cognates if they task can be framed as either a pairwise classifi- have identical definitions. As such, they make lim- cation or multi-lingual clustering. In a pairwise ited or no use of semantic information. The sim- context, Kondrak(2004) describes a system for plest variant of this task is to make pairwise cog- identifying cognates between language dictionar- nate classifications based on orthographic or pho- ies which is based on phonetic similarity, com- 1https://github.com/ajstarna/SemaPhoR plex multi-phoneme correspondences, and seman-

2520 tic information. The method of Wang and Sitbon LCSR is the longest common subsequence ra- • (2014) employs word sense disambiguation com- tio of the words. bined with classic string similarity measures for Alignment score reflects an overall phonetic finding cognate pairs in parallel texts to aid lan- • similarity, provided by the ALINE phonetic guage learners. aligner (Kondrak, 2009). Finally, very little has been published on creat- ing cognate sets based on both phonetic and se- Consonant match returns the number of • mantic information, which is the task that we fo- aligned consonants normalized by the num- cus on in this paper. Kondrak et al.(2007) com- ber of consonants in the longer word. bine phonetic and sound correspondence scores For example, consider the words meskweˇ :kenwi with a simple semantic heuristic, and create cog- and mæhki:kan (meSkwekenwi and mEhkikan in nate sets by using graph-based algorithms on con- ASJP notation) from cognate set 1725 in Figure1. nected components. Steiner et al.(2011) aim at The corresponding values for the above four fea- a fully automated approach to the comparative tures are 0.364, 0.364, 0.523, and 0.714, respec- method, including cognate set identification and tively. language phylogeny construction. Neither of those The semantic features refer to the dictionary systems and datasets are publicly available for the definitions of words. We assume that the defini- purpose of direct to our method. tions are provided in a single meta-language, such as English or Spanish. We consider not only a def- 3 Methods inition in its entirety, but also its sub-definitions, In this section, we describe the design of our which are separated by commas and semicolons. language-independent general model, as well as We distinguish between a closed class of about the language-specific models. Given a pair of 300 stop words, which express grammatical re- words from related languages, the models produce lationships, and an open class of content words, a score that reflects the likelihood of the words which carry a meaning. Filtering out stopwords re- being cognate. The models are implemented as duces the likelihood of spurious matches between Support Vector Machine (SVM) classifiers via the dictionary definitions. software package SVM-Light (Joachims, 1999). Our semantic features can be divided into those The scores from both types of models are used to that focus on surface definition resemblance, and cluster words from different languages into cog- those that attempt to detect the affinity of meaning. nate sets. The features of the first type are the following: Sub-definition match denotes an exact match • 3.1 Features of the General Model between any of the word sub-definitions (c.f. The general model is a supervised classifier that set 42 of Figure1). makes cognate judgments on pairs of words ac- Sub-definition content match is performed af- companied by their semantic definitions. The • ter removing stop words from definitions. model is intended to be language-independent, so that it can be trained on cognate annotations from Normalized word-level edit distance calcu- • well-studied languages, and applied to completely lates the minimum distance between sub- unrelated families. The features of the general definitions at the level of words, normalized model are of two kinds: phonetic, which pertain by the length of the longer sub-definition. to the analyzed word forms, and semantic, which Content overlap fires if any sub-definitions refer to their definitions. • have at least one content word in common. The phonetic features are defined on the word forms, represented in ASJP format (Brown et al., The second type of semantic features are aimed 2008), which is a simplified phonetic representa- at detecting deeper meaning connections between tion. definitions. We use WordNet (Fellbaum, 1998) to identify the relations of synonymy and hyper- Normalized edit distance is calculated at the nymy, and to associate different inflectional forms • character level, and normalized by the length of words. The WordNet-based features are as fol- of the longer word. lows:

2521 Synonym overlap indicates a WordNet syn- For each language pair, we derive a specific • onymy relation between content words across model by implementing the approach of Bergsma sub-definitions (e.g. “ease” and “repose”). and Kondrak(2007). As features, we extract pairs of substrings, up to length 3, that are consistent Hypernym overlap indicates a WordNet hy- • with the alignment induced by the minimum edit pernymy relation between content words distance algorithm. The models are able to learn across sub-definitions (e.g. “flannel” and when a certain substring in one language corre- “cloth”). sponds to a certain substring in another language. Inflection overlap is a feature that associates In order to train the specific models, we need • inflectional variants of content words (e.g. a substantial number of cognate pairs, which are “scrape” and “scraping”). not initially available in our unsupervised setting. We use a heuristic method to overcome this lim- Inflection synonym overlap indicates a syn- • itation. We create sets of words that satisfy the onymy relation between lemmas of content following two constraints: (1) identical dictionary words (e.g. “scratches” and “scraping”). definition, and (2) identical first letter. For ex- Inflection hypernym overlap is defined analo- ample, this heuristic will correctly cluster the two • gously to the inflection synonym overlap fea- words defined as “red cloth” in Figure1, but will ture. miss the two other cognates from Set 1725. We ensure that every set contains words from at least In order to detect subtle definition similarity that two languages. The resulting word sets are mu- goes beyond inflectional variants and simple se- tually exclusive, and contain mostly cognates. (In mantic relations, we add two features designed to fact, we use this method as our baseline in the Ex- take advantage of recent advances in word vec- periments section.) We extract positive training tor representations. The two vector-based features examples from these high-precision sets, and cre- are: ate negative examples by sampling random entries Vector cosine similarity is the cosine sim- from the language dictionaries. A separate specific • ilarity between the two vectors that repre- model is learned for each language pair in order to sent the average of each vector within a sub- capture regular sound correspondences. Note that definition. the specific models include no semantic features. We combine the specific models with the general Content vector cosine similarity is analogous, • model by simply averaging their respective scores. but only includes content words.

As an example, consider the definitions “he is 3.3 Cognate Clustering in mourning” and “she is widowed,” from Table5, which do not fire any of the WordNet-based fea- We apply our general and specific models to score tures. Using the entire definitions yields a vec- pairs of words across languages. Featurizing all tor cosine similarity of 0.566, while considering possible pairs of words from all languages is very only the content words “mourning” and “wid- time consuming, so we first filter out dissimilar owed” produces a feature value of 0.146. word pairs that obtain a normalized score below 0.35 from ALINE. In development experiments, 3.2 Regular Correspondences we observed that over 95% of cognate pairs ex- The features described in the previous section ceed this threshold. are language-independent, but we would also like Once pairwise scores have been computed, we to take into account cognate information that is cluster words into putative cognate sets using a specific to pairs of languages, namely regular variant of the UPGMA clustering algorithm (Sokal sound correspondences. For example, th/d is a and Michener, 1958), which has been used in pre- sound correspondence between English and Ger- vious work on cognate clustering (Hauer and Kon- , occurring in words such as think/denken and drak, 2011; List et al., 2016). Initially, all words leather/Leder. A model trained on another lan- are placed into their own cluster. The score be- guage family would not be able to learn that a cor- tween clusters is computed as the average of all responding th and d is an important indicator of pairwise scores between the words within those cognation in English/German pairs. clusters. In each iteration, the two clusters with

2522 the highest average score are merged. For effi- Family Lang. Entries Sets Cognates Algonquian 4 26,985 3,661 8,675 ciency reasons, only positive scores are included Polynesian 62 27,049 3,690 26,699 in the pairwise similarity matrix, which implies Totonacan 10 43,073 ? ? that merges are only performed if all pairwise scores between two clusters are positive. The al- Table 1: The number of languages, total dictionary gorithm terminates when no pair of clusters have entries, cognate sets, and cognate words for each a positive average score. language family.

4 Experiments proximately 100 billion English words using the In this section, we discuss two evaluation exper- approach of Mikolov et al.(2013). 2 The posi- iments. After describing the datasets, we com- tive training instances are constrained to involve pare our cognate classifier to the current state of languages that belong to different Polynesian sub- the art in pairwise classification. We then consider families. the evaluation metrics for our main task of cog- The final dataset consists of 10 dictionaries of nate set recovery from raw language dictionaries, the Totonacan language family spoken in Mexico. which is followed by the results on the Algonquian Since the definitions of the Totonacan dictionar- dataset. We refer to our system as SemaPhoR, to ies are in Spanish, we use the Spanish WordNet, reflect the fact that it exploits three kinds of ev- a list of 200 stop words, and approximately 1 bil- idence: Semantic, Phonetic, and Regular Sound lion pre-trained Spanish word vectors (Cardellino, Correspondences. 2016) for this dataset.3 The Totonacan data is yet to be fully analyzed by historical linguists, and as 4.1 Data Sets such provides an important motivation for devel- Our experiments involve three different language oping our system. families: Algonquian, Polynesian, and Totonacan. Although the Totonacan dataset includes no The Algonquian dataset consists of four dic- cognate information, we manually evaluated a tionary lists (c.f. Figure1) compiled by Hewson number of candidate cognate sets generated by (1993) and normalized by Kondrak(2004). We our system in the development stage. From these convert the phonetic forms into a Unicode en- annotations, we created a pairwise development coding. The gold-standard annotation consists set, including all possible 6755 cognate pairs and of 3661 cognate sets, which were established by 67,550 randomly selected non-cognate pairs, and Hewson on the basis of the regular correspon- used it for testing our general model that was dences identified by Bloomfield(1946). The trained on the Polynesian dataset. The resulting dataset contains as many as 22,747 unique defini- pairwise F-Score of 88.0% shows that our cognate tions, which highlights the difference between our classification model need not be trained on the task and previous work in cognate identification same language family that it is applied to. More- within word lists, where cognate relationships are over, it confirms that our system can function on restricted to a limited set of basic concepts. datasets where definitions are written in a meta- The second dataset corresponds to a version language that is different from the one used in the of POLLEX, a large-scale comparative dictionary training set. of over 60 Polynesian languages (Greenhill and 4.2 Pairwise Classification Results Clark, 2011). Table1 shows that nearly 99% of words in the POLLEX dataset belong to a cognate Although our main objective is multilingual clus- set, meaning that it is composed almost entirely tering, the goal of the first experiment is to com- of cognate sets rather than language dictionaries. pare the effectiveness of our pairwise classifiers This makes the POLLEX dataset unsuitable for against the system of Kondrak(2004), which was system evaluation; however, we use it to train our designed to process one language pair at a time. general classifier, by randomly selecting 25,000 As much as possible, we try to follow the original cognate and 250,000 non-cognate word pairs. For evaluation methodology, which reports 11-point calculating our word vector based features, we use interpolated precision (Manning et al., 2008, page the Python package gensim (Rehˇ u˚rekˇ and Sojka, 2https://code.google.com/archive/p/word2vec 2010) applied to word vectors pre-trained on ap- 3http://crscardellino.me/SBWCE

2523 K-2004 SemaPhoR 4.3 Evaluation Metrics for Clustering Dev/Train: CO CO POLLEX The choice of evaluation metrics for multilingual CO 78.7 84.8 82.3 cognate clustering, which is our main task, re- CF 69.8 77.8 76.6 quires careful consideration. Pairwise F-score CM 61.8 78.4 80.5 works well for pairwise cognate classification, but FM 65.2 81.7 81.8 in the context of clustering, the number of word FO 69.5 83.3 79.3 pairs grows quadratically with the size of a set, MO 64.1 80.3 81.7 which creates a bias against smaller sets. For ex- Average 66.1 80.3 80.0 ample, a set containing 10 words may contribute as much to the pairwise recall as 45 two-word sets. Table 2: 11-Point interpolated precision on the Al- For the task of clustering words with identi- gonquian dataset. cal definitions, Hauer and Kondrak(2011) pro- pose to use B-Cubed F-score (Bagga and Baldwin, 1998). However, we found that B-Cubed F-score 158) on lists of positively classified word pairs assigns counter-intuitive scores to clusterings in- that have been sorted according to their confidence volving datasets of dictionary size, in which many scores. We also use the same dataset, which is words are outside of any cognate set in the refer- limited to the in the Algonquian data. As ence annotation. For example, on the Algonquian the original system contained no machine-learning dataset, a trivial strategy of placing each word into component, it required no training data, but the its own cluster (MaxPrecision) would achieve a B- Cree-Ojibwa language pair served as the develop- Cubed F-Score of 89.6%. ment and tuning set. In search for a better metric, we considered We evaluate two variants of our general model: MUC (Vilain et al., 1995), which is designed to one trained on the Cree-Ojibwa (CO) noun sub- score co-reference algorithms. MUC assigns pre- set, and another on the POLLEX dataset. The cision, recall and F-Score based on the number of language-specific models are trained on each re- missing links in the proposed clusters. However, as spective language pair, using the unsupervised pointed out by Bagga and Baldwin(1998), when heuristic approach described in Section 3.2. penalizing incorrectly placed elements, MUC is Table2 shows the results on each language pair. insensitive to the size of the cluster in question. K-2004 denotes the results reported in Kondrak For example, a completely useless clustering of all (2004). The increase in the average 11-point pre- Algonquian words into one giant set (MaxRecall) cision on the five test sets (except Cree-Ojibwa) yields a higher MUC F-Score than most of the rea- from 66.1% to 80.3% represents an error reduc- sonably effective approaches. tion of 42%. This improvement demonstrates the We believe that an appropriate measure of re- superiority of a machine learning approach with a call for a cognate clustering system is the total rich feature set over a categorical approach with number of found sets. A set that exists in the manually-tuned parameters. When our classifier gold annotation is considered found if any of the is trained instead on cognate data from an unre- words that belong to the set are clustered together lated Polynesian language family, the average 11- by the system. We report both partially and com- point precision on the test sets drops only slightly pletely found sets. Arguably, the number of par- to 80.0%, which confirms its generality. tially found sets may be more important, as it is The correspondence-based specific models con- easier for a linguist to extend a found set to other tribute towards the high accuracy of our system. languages than to discover the set in the first place. Without them, the average results on the test sets In fact, a discovery of a single pair of cross-lingual decrease by 0.9% to 79.4% for the CO-trained cognates implies the existence of a correspond- model, and by 3.0% to 77.0% for the POLLEX- ing proto-word in their ancestor language, which trained model. We conjecture that the language- is likely to have reflexes in the other languages of specific models are less helpful in the former case the family. because the general model already incorporates As a corresponding measure of precision, we much of the information that is particular to the report cluster purity, which has previously been Algonquian family. used to evaluate cognate clusterings by Hall and

2524 Klein(2011) and Bouchard-C otˆ e´ et al.(2013). In System Found Sets Purity order to calculate purity, each output set is first Heuristic Baseline 18.9 (9.9) 96.4 matched to a gold set with which it has the most LEXSTAT 19.6 (10.5) 97.1 words in common. Then purity is calculated as SemaPhoR 63.1 (48.2) 70.3 the fraction of total words that are matched to the correct gold set. More formally, let G = Table 3: Cognate clustering results on the Algo- G ,G , ..., G be a gold clustering and C = nquian dataset (in %). The absolute percentage of { 1 2 n} C ,C , ..., C be a proposed clustering. Then fully found sets is given in parentheses. { 1 2 m} 1 m purity(C,G) = max G C N j| j ∩ i| number of found sets for systems that are designed Xi=1 to operate on word lists, rather than dictionaries. where N is the total number of words. The trade- For example, most of the cognates in Figure1 can- off between the number of found sets and clus- not be captured by such systems. ter purity gives a good idea of the performance Our system, SemaPhoR, finds approximately of a cognate clustering. For example, both of three times as many cognate sets as LEXSTAT, and the MaxRecall and MaxPrecision strategies men- over 75% of those sets are complete with respect tioned above would obtain 100% scores according to the gold annotation. In practical terms, our sys- to one of the measures, but close to 0% according tem is able to provide concrete evidence for the to the other. existence of most of the proto-words that have re- flexes in the recorded languages, and identifies the 4.4 Cognate Clustering Results majority of those reflexes in the process. The pu- In our main experiment, we apply our system to rity of the produced clusters indicates that there the task of creating cognate sets from the Algo- are many more hits than misses in the system out- nquian dataset. The general classification model put. In addition, the clusters can be sorted accord- is trained on the POLLEX dataset, as described ing to their confidence scores, in order to facilitate in Section 4.2, while the language-specific mod- the analysis of the results by an expert linguist. els are derived following the procedure described in Section 3.2. The scores from both models are 5 Discussion then used to guide the clustering process. Only In this section, we interpret the results of our one word from each language is allowed per clus- feature ablation experiments, and analyze several ter. types of errors made by our system. Since most work done in the area of cognate clustering starts from semantically aligned word 5.1 Feature Ablation lists, it is difficult to make a direct comparison. In order to determine the relative effect of the fea- We report the results obtained with LEXSTAT (List 4 tures described in Section 3.1, we test four vari- and Moran, 2013). The system has no capability ants of the general model, which employ increas- to consider the degree of semantic similarity be- ingly complex subsets of features. The simplest tween words, so we first group together the words variant uses only the phonetic features that are de- that have identical definitions and provide these as fined on the word forms. The next variant adds its input. As a baseline, we adopt the heuristic the features that consider surface definition resem- described in Section 3.2, which creates sets from blance. The third variant also includes WordNet- words that have identical definitions and start with based semantic features. The final variant is the the same letter. full system configuration that incorporates the fea- Table3 shows the results. L EXSTAT performs tures defined on word vector representations, but slightly better than the heuristic baseline, but both without language-specific models. are limited by their inability to relate words that Table4 shows the results. The phonetic fea- have non-identical definitions. In fact, only 21.4% tures alone are sufficient to detect just over half of all gold cognate sets in the Algonquian dataset of the cognate sets. Each successive variant sub- contain at least two words with the same defi- stantially improves the recall at a cost of slightly nition, which establishes an upper bound on the lower precision. The full feature set yields a 27% 4http://lingpy.org relative increase in the number of found sets over

2525 Features Found Sets Purity C ma:ya:citeˇ :he:w he is angry M miana:cetæhæˇ :w he is nauseated Phonetic only 52.0 (36.3) 70.2 C pi:sisiw he is in bits + Definitions 57.4 (41.7) 68.4 O pi:ssisi he is ground up + WordNet 61.9 (46.9) 68.1 C ayiwiskawe:w he is taller than someone else O aniwiskawˇ precede, surpass someone + Word Vectors 66.2 (51.3) 66.5 C si:ka:wiw he is in mourning M se:kawew she is widowed Table 4: Cognate clustering results on the Algo- nquian dataset (in %) with subsets of features. Table 5: Examples of cognates found with the as- sistance of word vector features. the phonetic-only variant, with only a 5% drop in cluster purity. very difficult to identify without word vector tech- In comparison with our full system, which in- nology. corporates the language-specific models, the fi- A substantial number of apparent errors made nal variant finds a greater number of the cognate by our system are due to the complex polysyn- sets, but with a trade-off in overall precision (c.f. thetic morphology of Algonquian, in which a SemaPhoR in Table3). This shows that our sys- single Algonquian word can express a meaning tem is able to exploit regular sound correspon- of several English words. A number of dis- dences to filter out a substantial number of false tinct cognate sets are highly similar in their def- cognates, such as lexical borrowings or chance re- initions and phonetic forms. For example, our semblances. However, the overall contribution of system erroneously places the Menominee word the specific models is relatively small. One possi- a:kuaqtæ:hsen into a cluster with two similar Cree ble explanation is that the Algonquian languages and Ojibwa words, instead of associating it with are relatively closely related, which enables the the Ojibwa word a:kawa:tte:sˇsinˇ (Table6). Al- general model to discern most of the cognate re- though it could be argued that such closely-related lationships on the basis of phonetic and semantic forms are all cognate, we refrain from modifying similarity. For example, many of the regular corre- any gold annotations, even if this negatively im- spondences detected by the specific models, such pacts the overall accuracy of our system. as s:s and hk:kk, involve identical phonemes. The C a:kawa:ste:simo:w he lies down in the shade impact of the specific models could be greater for O a:kawa:tte:sˇsimoˇ :n be in the shadow a more distantly-related language family. M a:kuaqtæ:hsen he is in the shade O a:kawa:tte:sˇsinˇ make shadow 5.2 Error Analysis Table 6: A clustering error due to morphology. A number of omission errors can be traced to the imperfect heuristic that constrains the positive Finally, some apparent errors made by our sys- training instances for the language-specific mod- tem may not be errors at all, but rather reflect the els to begin with the same letter. Indeed, 88.1% of incompleteness of the gold annotation. For exam- Algonquian cognate sets are composed of words ple, consider the two false positive pairs in Table that share the initial phoneme. While this con- 7. Even though they are not listed in Hewson’s straint yields high-precision training sets that sat- (1993) etymological dictionary, the exact defini- isfy the transitivity condition, it also introduces a tion match, coupled with striking phonetic simi- bias against cognates that differ in the first letter. larity and the presence of regular sound correspon- The second type of errors made by our system dences strongly suggest that they are actually cog- are caused by semantic drift that has altered the nates. meaning of the original proto-word. For exam- ple, “sickness” is difficult for our general model M pekuacˇ growing wild to associate with “bitterness, pain.” On the other O pekwaciˇ growing wild hand, there are many instances where our system C niso:te:w twin is successful in identifying non-obvious semantic O ni:soˇ :te:nq twin similarity, often thanks to the word vector features of our model. Table5 provides examples of cog- Table 7: Examples of proposed cognate sets that nates found by our system that would have been are not found in the gold data.

2526 6 Conclusion struction of ancient languages using probabilistic models of sound change. PNAS. We have presented a comprehensive system for the novel task of identifying cognate sets directly Cecil H. Brown, Eric W. Holman, and Soren Wich- from dictionaries of related languages by leverag- mann. 2008. Automated classification of the world’s languages: A description of the method and prelimi- ing both word forms and word definitions. To the nary results. In Language Typology and Universals, best of our knowledge, it is the first system to use volume 61. word vector representations for cognate identifi- cation. The main insight from our work is that Cristian Cardellino. 2016. Spanish billion words cor- pus and embeddings. a cognate classification model can be trained on one language family, and achieve impressive re- Aline Maria Ciobanu and Liviu P. Dinu. 2013. A sults when classifying a completely unrelated lan- dictionary-based approach for evaluating ortho- guage family. This allows cognate information graphic methods in cognates identification. In Pro- ceedings of Recent Advances in Natural Language from a high-resource language family to guide Processing (RANLP). cognate identification between languages that lit- tle is known about. Isidore Dyen, Joseph B Kruskal, and Paul . 1992. There are aspects of cognate identification that An Indoeuropean classification: A lexicostatistical experiment. Transactions of the American Philo- can only be detected by human experts, such as sophical society, 82(5). cognates that have undergone extensive phonetic and semantic changes, or large-scale lexical bor- Christiane Fellbaum. 1998. WordNet: An Eletronic rowing between languages. However, we believe Lexical Database. MIT Press. that our system represents a step towards auto- Simon J. Greenhill and Ross Clark. 2011. Pollex- mated cognate identification, and will prove a use- online: The Polynesian lexicon project online. In ful tool for historical linguists. Oceanic Linguistics, volume 50.

Acknowledgments David Hall and Dan Klein. 2010. Finding cognate groups using phylogenies. In Proceedings of the We thank the following students that contributed 48th Annual Meeting of the Association for Compu- to our cognate-related projects over the last few tational Linguistics (ACL). years (in alphabetical order): Matthew Darling, David Hall and Dan Klein. 2011. Large-scale cognate Jacob Denson, Philip Dilts, Bradley Hauer, Mil- recovery. In Proceedings of the 2011 Conference on dred Lau, Tyler Lazar, Garrett Nicolai, Dylan Empirical Methods in Natural Language Processing Stankievech, Nicholas Tam, and Cindy Xiao. (EMNLP). This research was partially funded by the Natu- Bradely Hauer and Grzegorz Kondrak. 2011. Cluster- ral Sciences and Engineering Research Council of ing semantically equivalent words into cognate sets Canada. in multilinguial lists. In Proceedings of the 5th In- ternational Joint Conference on Natural Language Processing (IJCNLP).

References John Hewson. 1993. A computer-generated dictionary Amit Bagga and Breck Baldwin. 1998. Algorithms for of proto-Algonquian. Canadian Museum of Civi- scoring coreference chains. In In The First Interna- lization, Hull, Quebec. tional Conference on Language Resources and Eval- uation Workshop on Linguistics Coreference. T. Joachims. 1999. Making large-scale SVM learning practical. In B. Scholkopf,¨ C. Burges, and A. Smola, Shane Bergsma and Grzegorz Kondrak. 2007. Align- editors, Advances in Kernel Methods - Support Vec- ment-based discriminative string similarity. In Pro- tor Learning. MIT Press. ceedings of the 45th Annual Meeting of the Associa- tion of Computational Linguistics. Grzegorz Kondrak. 2004. Combining evidence in cog- nate identification. In Proceedings of the Seven- Leonard Bloomfield. 1946. Algonquian. In Harry Hoi- teenth Canadian Conference on Artificial Intelli- jer et al., editor, Linguistic Structures of Native gence. America, volume 6 of Viking Fund Publications in Anthropology. Grzegorz Kondrak. 2009. Identification of cognates and recurrent sound correspondences in word lists. Alexandre Bouchard-Cotˆ e,´ David Hall, Thomas L. Traitement automatique des langues et langues an- Griffiths, and Dan Klein. 2013. Automated recon- ciennes, 50(2):201–235.

2527 Grzegorz Kondrak. 2013. Word similarity, cognation, Marc Vilain, John Burger, John Aberdeen, Dennis Con- and translational equivalence. In Approaches to nolly, and Lynette Hirschman. 1995. A model- Measuring Linguistic Differences, pages 375–386. theoretic coreference scoring scheme. In Proceed- De Gruyter Mouton. ings of the 6th Conference on Message Understand- ing. Grzegorz Kondrak, David Beck, and Philip Dilts. 2007. Creating a comparative dictionary of Totonac- Haoxing Wang and Laurianne Sitbon. 2014. Multi- Tepehua. In Proceedings of the ACL Workshop on lingual lexical resources to detect cognates in non- Computing and Historical Phonology (9th Meeting aligned texts. In The Twelfth Annual Workshop of of SIGMORPHON), pages 134–141. the Australasia Language Technology Association.

Johann-Mattis List, Philippe Lopez, and Eric Bapteste. 2016. Using sequence similarity networks to iden- tify partial cognates in multilingual wordlists. In Proceedings of the 54th Annual Meeting of the As- sociation for Computational Linguistics (ACL).

Johann-Mattis List and Steven Moran. 2013. An open source toolkit for quantitative . In Proceedings of the ACL 2013 System Demonstra- tions.

Christopher D Manning, Prabhakar Raghavan, Hinrich Schutze,¨ et al. 2008. Introduction to information re- trieval, volume 1. Cambridge University Press.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013. Distributed representa- tions of words and phrases and their compositional- ity. In Advances in neural information processing systems.

Preslav Nakov and Jorg¨ Tiedemann. 2012. Combin- ing word-level and character-level models for ma- chine between closely-related languages. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Taraka Rama. 2015. Automatic cognate identification with gap-weighted string subsequences. In Proceed- ings of the Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies.

Radim Rehˇ u˚rekˇ and Petr Sojka. 2010. Software frame- work for topic modelling with large corpora. In Pro- ceedings of the LREC Workshop on New Challenges for NLP Frameworks.

Robert R. Sokal and Charles D. Michener. 1958. A statistical method for evaluating systematic relation- ships. In University of Kansas Science Bulletin, vol- ume 38.

Lydia Steiner, Peter F Stadler, and Michael Cysouw. 2011. A pipeline for computational historical lin- guistics. Language Dynamics and Change, 1(1).

R.L. Trask. 1996. Historical Linguistics. St Martin’s Press, Inc., New York, NY.

Peter Turchin, Ilia Peiros, and Murray Gell-Mann. 2010. Analyzing genetic connections between lan- guages by matching consonant classes. In Journal of Language Relationship.

2528