
CogNet: a Large-Scale Cognate Database

Khuyagbaatar Batsuren†, Gábor Bella†, Fausto Giunchiglia†§
†DISI, University of Trento, Trento, Italy; §Jilin University, Changchun, China
{k.batsuren; gabor.bella; fausto.giunchiglia}@unitn.it

Abstract

This paper introduces CogNet, a new, large-scale lexical database that provides cognates—words of common origin and meaning—across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available wordnets, with an accuracy evaluated at 94%. Finally, statistics and early insights about the cognate data are presented, hinting at a possible future exploitation of the resource1 by various fields of linguistics.

1 Introduction

Cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre. Cognates and the problem of cognate identification have been extensively studied in the fields of language typology and historical linguistics, as cognates are considered useful for researching the relatedness of languages (Bhattacharya et al., 2018). Cognates are also used in computational linguistics, e.g., for lexicon extension (Wu and Yarowsky, 2018) or to improve cross-lingual NLP tasks such as machine translation or bilingual word recognition (Kondrak et al., 2003; Tsvetkov and Dyer, 2015).

Despite the interest in using cognate data for research, state-of-the-art cognate databases have had limited practical uses from an applied perspective, for two reasons. Firstly, popular cognate-coded databases that are used in historical linguistics, such as ASJP (Wichmann et al., 2010), IELex2, or ABVD (Greenhill et al., 2008), cover only the small set of 225 Swadesh basic concepts, although with an extremely wide coverage of up to 4,000 languages. Secondly, in these databases, lexical entries that belong to scripts other than Latin or Cyrillic mostly appear in phonetic transcription instead of using their actual orthographies in their original scripts. These limitations prevent such resources from being used in real-world computational tasks on written language.

This paper describes CogNet, a new large-scale, high-precision, multilingual cognate database, as well as the method used to build it. Our main technical contributions are (1) a general method to detect cognates from multilingual lexical resources, with precision and recall parametrable according to usage needs; (2) a large-scale cognate database containing 3.1 million pairs across 338 languages, generated with the method above; (3) WikTra, a multilingual transliteration dictionary and library derived from Wiktionary data; and (4) an online platform that lets users explore the resource.

The paper is organised as follows. Section 2 presents the state of the art. Section 3 describes the main cognate discovery algorithm and section 4 the way various forms of evidence used by the algorithm are computed. The method is parametrised and the results are evaluated in section 5. Section 6 describes the resulting CogNet database in terms of structure and statistical insights. Finally, section 7 concludes the paper.

2 State of the Art

To our knowledge, cognates have so far been defined and explored in two fundamental ways by two distinct research communities. On the

1 The CogNet resource and WikTra tool are available at http://cognet.ukc.disi.unitn.it.
2 Indo-European Lexical Cognacy Database, http://ielex.mpi.nl/

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3136–3145, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics

one hand, cognate identification has been studied within linguistic typology and historical linguistics. On the other hand, computational linguists have been researching methods for cognate production.

The very definition of the term 'cognate' varies according to the research community. In historical linguistics, cognates must have a provable etymological relationship and must be translated into each language (Bhattacharya et al., 2018). Accordingly, the English skyscraper and the German Wolkenkratzer are considered as cognates but the English song and the Japanese ソング (/songu/) are not. In computational linguistics, the notion of cognate is relaxed with respect to etymology and loanwords are also considered as cognates (Kondrak et al., 2003). For our work we adopted the latter, computational point of view.

In language typology and historical linguistics, cognate identification methods proceed in two main steps. First, a similarity matrix of all words is estimated by three types of similarity measures: semantic similarity, phonetic similarity, and orthographic similarity. For information on semantic similarity, special-purpose multilingual dictionaries, such as the well-known Swadesh List, are used. For orthographic similarity, string metrics (Hauer and Kondrak, 2011; St Arnaud et al., 2017) are often employed, e.g., edit distance, Dice's coefficient, or LCSR. As these methods do not work across scripts, they are complemented by phonetic similarity, exploiting transformations and sound changes across related languages (Kondrak, 2000; Jäger, 2013; Rama et al., 2017). Phonetic similarity measures, however, require phonetic transcriptions to be a priori available. More recently, historical linguists have started exploiting identified cognates to infer phylogenetic relationships across languages (Rama et al., 2018; Jäger, 2018).

In computational linguistics, cognate production consists of finding for a word in a given language its cognate pair in another language. State-of-the-art methods (Beinborn et al., 2013; Sennrich et al., 2016) have employed character-based machine translation, trained from parallel corpora, to produce cognates or transliterations. Wu and Yarowsky (2018) also employ similar techniques, as well as multilingual dictionaries, to produce large-scale cognate clusters for Romance and Turkic languages. Although the cognates produced in this manner are, in principle, a good source for improving certain cross-lingual tasks in NLP, the quality of the output often suffers due to not being able to handle certain linguistic phenomena properly. For example, words in languages such as Arabic or Hebrew are written without vowels and machine-produced transliterations often fail to vowelize such words (Karimi et al., 2011). The solution we propose is the use of a dictionary-based transliteration tool over machine transliteration.

Our method provides new contributions for both research directions. Firstly, to our knowledge no other work on cognate generation has so far used high-quality multilingual lexical resources on a scale as large as ours, covering hundreds of languages and more than 100,000 cross-lingual concepts. Secondly, this large cross-lingual coverage could only be achieved thanks to a robust transliteration tool that is part of the contributions of our paper. Finally, our novel, combined use of multiple—orthographic, semantic, geographic, and etymological—sources of evidence for detecting cognates was crucial to obtain high-quality results, in terms of both precision and recall.

3 The Algorithm

For our work we have adopted a computational-linguistic interpretation of the notion of cognate (Kondrak et al., 2003): two words in different languages are cognates if they have the same meaning and present a similarity in orthography, resulting from a supposed underlying etymological relationship (common ancestry or borrowing).

Based on this interpretation, our algorithm is built on three main principles: (1) semantic equivalence, i.e., that the two words share a common meaning; (2) sufficient proof of etymological relatedness; and (3) the logical transitivity of the cognate relationship.

The core resource for obtaining cross-lingual evidence on semantic equivalence—i.e., the sameness of word meanings—is the Universal Knowledge Core (UKC), a large multilingual lexico-semantic database (Giunchiglia et al., 2018) already used both in linguistics research as well as for practical applications (Bella et al., 2016; Giunchiglia et al., 2017; Bella et al., 2017). The UKC includes the lexicons and lexico-semantic relations for 338 languages, containing 1,717,735 words and 2,512,704 language-specific word meanings. It was built from wordnets (Miller, 1995) and wiktionaries converted

into wordnets (Bond and Foster, 2013). As all of the resources composing the UKC were built and validated by humans (Giunchiglia et al., 2015), we consider the quality of our input data to be high enough for obtaining accurate results on cognates (Giunchiglia et al., 2017). As most wordnets map their units of meaning (synsets in WordNet terminology) to English meanings, they can effectively be interconnected into a cross-lingual lexical resource. The UKC reifies all of these mappings as supra-lingual lexical concepts (107,196 in total, excluding named entities such as Ulaanbaatar). For example, if the German Fahrrad and the Italian bicicletta are mapped to the English bicycle then a single concept is created to which all three language-specific meanings (i.e., wordnet synsets) will be mapped.

In terms of etymological evidence, we use both direct and indirect evidence of etymological relatedness. Direct evidence is provided by gold-standard etymological resources, such as the one we use and present in section 4.1. Such evidence, however, is relatively sparse and would not, in itself, provide high recall. We therefore also consider indirect evidence in the form of a combined orthographic–geographic relatedness: a measure of geographic proximity of languages combined with the orthographic similarity of words, involving transliteration, can provide strong clues on language contact and probable cross-lingual lexical borrowing.

Finally, we exploit logical transitivity in order further to improve recall: we build on the intuition that if words wa and wb are cognates and wb and wc are cognates then wa and wc are also cognates. For example, if the German Katze is found to be a cognate of the English cat (based on direct etymological evidence) and cat is found to be a cognate of the French chat (based on orthography) then Katze and chat are also considered to be cognates.

Based on these principles, we have implemented a cognate discovery algorithm as shown in algorithm 1. Its input is a single lexical concept from the UKC (the algorithm being applicable to every concept in a loop). It builds an undirected graph where each node represents a word and each edge between two nodes represents a cognate relationship.

Algorithm 1: Cognate Discovery Algorithm
  Input:  c, a lexical concept
  Input:  R, a lexical resource
  Output: G+, the graph of all cognates of c
   1  V, E ← ∅, ∅
   2  L ← Languages_R(c)
   3  for each language l ∈ L do
   4      for each word w ∈ Words_R(c, l) do
   5          V ← V ∪ {v = ⟨w, l⟩}
   6  for each node v1 = ⟨w1, l1⟩ ∈ V do
   7      for each node v2 = ⟨w2, l2⟩ ∈ V do
   8          if l1 = l2 then
   9              continue
  10          if EtyRel(w1, l1, w2, l2) then
  11              E ← E ∪ {e = ⟨v1, v2⟩}
  12          else if OrthSim(w1, l1, w2, l2) + TG × GeoProx(l1, l2) > TF then
  13              E ← E ∪ {e = ⟨v1, v2⟩}
  14  G ← ⟨V, E⟩
  15  G+ ← TransitiveClosure(G)
  16  return G+

The process starts by retrieving the lexicalisations of the input concept in all available languages and creating the corresponding word nodes in the graph (lines 2–5). All such words thus fulfil the criterion of semantic equivalence above. Then, for all different-language word pairs that express the concept (lines 6–9), we verify whether etymological evidence exists for a potential cognate relationship. The latter may either be direct evidence (EtyRel, line 10) or indirect, which we implement as a combined relatedness score of orthographic similarity (OrthSim) and geographic proximity (GeoProx). We consider indirect evidence to be sufficient if this combined score is superior to an experimental threshold TF (line 12). In case either direct or indirect evidence is found, an edge between the two word nodes is created (lines 10–13). As the last step, in order to apply the principle of logical transitivity, the transitive closure of the graph is computed (line 15). In the resulting graph G+ each connected subgraph represents a group of cognate words.

4 Computing Etymological Relatedness

Our method predicts the etymological relatedness of words based on both direct and indirect etymological evidence. Section 4.1 below describes how the EtyRel function provides direct evidence. Sections 4.2 and 4.3 explain how indirect evidence is computed based on orthographic similarity using

the OrthSim function and on geographic proximity using the GeoProx function.

4.1 Direct Etymological Evidence

The EtyRel function in algorithm 1 uses gold-standard evidence to compute the etymological relatedness of words. It exploits etymological ancestor (marked as Anc below) relations for each word of the word pair being evaluated as cognates. Two words are considered as etymologically related if they are found to have at least one common etymological ancestor word (such as the German Ross and the English horse having as ancestor the proto-Germanic root *harss-):

  EtyRel(w1, l1, w2, l2) = true, if Anc(w1, l1) ∩ Anc(w2, l2) ≠ ∅; false, otherwise    (1)

Ancestor relations are retrieved from the Etymological WordNet (EWN)3 (De Melo, 2014), a lexical resource providing relations between words, e.g., derivational or etymological. EWN was automatically built by harvesting etymological information encoded in Wiktionary. In this work, we have only used its 94,832 cross-lingual etymological relations.

4.2 Orthographic Similarity

Orthographic similarity is computed using a string similarity metric LCSSim based on the longest common subsequence (LCS) of the two input words, returning a similarity score between 0 and 1:

  LCSSim(w1, w2) = (2 × len(LCS(w1, w2))) / (len(w1) + len(w2))    (2)

When w1 and w2 belong to different writing systems, LCS returns 0 and thus the formula above is not directly usable. In order to be able to identify cognates across writing systems, we apply transliteration to the Latin script (also known as romanization) using the WikTra tool. Orthographic similarity is thus computed as:

  OrthSim(w1, w2) = max{ LCSSim(w1, w2), LCSSim(WikTra(w1), WikTra(w2)) }    (3)

WikTra is a dictionary-based transliteration tool compiled from information collected from Wiktionary and developed specifically for this work by the authors4. It is Unicode-based and supports 85 languages in 35 writing systems, defining transliteration rules and codes according to international standards, as developed by the Wiktionary community (the largest community in lexicography).

An illustration of the output provided by WikTra compared to three existing transliteration tools is provided in table 1. The use of WikTra with respect to existing tools is justified by a need for high-quality results that also cover complex cases of orthography, e.g., in Semitic scripts where vowels are typically omitted. In particular, Junidecode5 is a character-based transliterator, an approach that seriously limits its accuracy. The Google transliterator is dictionary-based and is therefore of higher quality, but it supports a lower number of languages and is not freely available. Finally, uroman (Hermjakob et al., 2018) is a new, high-quality, dictionary-based tool that nevertheless provides limited support for scripts without vowels (e.g., Arabic or Hebrew), as also visible in table 1.

While WikTra gains its high accuracy from human-curated Wiktionary data, it still needs to be improved for Thai and Japanese. In Thai, WikTra only works on monosyllabic words, and it needs an additional tool to recognize syllables. In Japanese, it only works with the Hiragana and Katakana scripts and not with Kanji (Chinese characters). We therefore combined WikTra with the Kuromoji transliteration tool6.

4.3 Geographic Proximity

We exploit geographic information on languages in order to take into account the proximity of language speakers for the prediction of borrowing. Our hypothesis is that, even if in the last century lexical borrowing on a global scale has been faster than ever before, the effect of geographic distance is still a significant factor when applying cognate discovery to entire vocabularies. This effect is combined with orthographic similarity in line 12 of algorithm 1, in a way that geographic proximity increases the overall likelihood of word pairs being cognates, without being a necessary condition.

3 http://www1.icsi.berkeley.edu/~demelo/etymwn/, accessed on 10/14/2018.
4 https://github.com/kbatsuren/wiktra
5 https://github.com/gcardone/junidecode
6 https://github.com/atilika/kuromoji
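As a concrete illustration, the evidence functions of equations (1)–(3) and the pairwise loop of algorithm 1 can be sketched in Python. This is a minimal sketch, not the released implementation: `ancestors` stands in for the Etymological WordNet lookup of section 4.1, `translit` for the WikTra romanizer, and `geo_prox` for the geographic proximity measure of this section.

```python
from itertools import combinations

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (not substring)."""
    dp = [0] * (len(b) + 1)
    for ch in a:
        prev = 0
        for j, bj in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if ch == bj else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def lcs_sim(w1: str, w2: str) -> float:
    """Equation (2): 2 * |LCS| / (|w1| + |w2|)."""
    return 2 * lcs_len(w1, w2) / (len(w1) + len(w2)) if w1 and w2 else 0.0

def orth_sim(w1: str, w2: str, translit=lambda w: w) -> float:
    """Equation (3): best of raw and romanized similarity;
    `translit` is a stand-in for the WikTra romanizer."""
    return max(lcs_sim(w1, w2), lcs_sim(translit(w1), translit(w2)))

def ety_rel(w1, l1, w2, l2, ancestors) -> bool:
    """Equation (1): at least one shared etymological ancestor;
    `ancestors` is a stand-in for the EWN lookup of section 4.1."""
    return bool(ancestors(w1, l1) & ancestors(w2, l2))

def discover_cognates(words, ancestors, geo_prox, t_g, t_f, translit=lambda w: w):
    """Sketch of algorithm 1 for one lexical concept. `words` is a list of
    (word, language) pairs lexicalising the concept, so semantic
    equivalence already holds for every pair. Returns cognate groups."""
    parent = {n: n for n in words}
    def find(n):                      # union-find with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for (w1, l1), (w2, l2) in combinations(words, 2):
        if l1 == l2:                  # lines 8-9: never link same-language words
            continue
        direct = ety_rel(w1, l1, w2, l2, ancestors)                       # line 10
        indirect = orth_sim(w1, w2, translit) + t_g * geo_prox(l1, l2) > t_f  # line 12
        if direct or indirect:
            parent[find((w1, l1))] = find((w2, l2))                       # add edge
    groups = {}
    for n in words:
        groups.setdefault(find(n), []).append(n)
    return [g for g in groups.values() if len(g) > 1]
```

Grouping the words by connected components of the evidence graph yields the same result as computing the transitive closure of line 15 and reading off the connected subgraphs.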

Table 1: Comparison of WikTra with state-of-the-art transliteration tools.

#  | Language  | Uroman        | Junidecode       | Google         | WikTra
1  | English   | book          | book             | book           | book
2  | Malayalam | malayaallam   | mlyaallN         | malayāḷaṁ      | malayāḷam
3  | Arabic    | nwaa          | nw@              | nawa           | nawātun
4  | Japanese  | konpyuta      | konpiyuta        | konpyūtā       | konpyūtā*
5  | Thai      | raachaatiraa  | raachaathiraad   | rā chā thi rād | raa-chaa-tí-râat(b)
6  | Russian   | moskva        | moskva           | moskva         | moskva
7  | Hindi     | devanaa       | devnaagrii       | devanaagaree   | devnāgrī
8  | Bengali   | baangla       | baaNlaa          | bānlā          | bangla
9  | Greek     | anaute        | anauteo          | anaftéō        | anautéō
10 | Kashmiri  | kampivwuttar  | khampy[?]w?ttar  | –              | kampeuṭar
11 | Persian   | armnstan      | rmnstn           | –              | armanestân
12 | Hebrew    | yshshkr       | yshshkr          | yissachar      | yiśśākār
13 | Tamil     | rehs          | reHs             | reh.s          | rex
14 | Ethiopic  | aadise aababaa| 'aadise 'aababaa | ādīsi abeba    | ädis-äbäba
15 | Tibetan   | kha pa        | kh-pr            | –              | kha par
16 | Korean    | megapon       | megapon          | megapon        | megapon
17 | Armenian  | hayiastan     | hayastan         | hayastan       | hayastan
18 | Uyghur    | yeayealae     | y'y'-lae         | –              | a'ile
19 | Khmer     | kromaaro      | krmaar           | krama r        | krâméar
20 | Telugu    | amkapali      | aNkpaalli        | aṅkapāḷi       | aṅkapāḷi
21 | Odia      | oddishaa      | rodd'ishaa       | –              | oṛisā
22 | Burmese   | sannykhre     | snny[?]:kh[?]e   | saeehkyay      | sany:hkre

* In Japanese, WikTra only works with the Hiragana and Katakana scripts.
b In Thai, WikTra only works with a sequence of syllables.
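The gap between the character-based and dictionary-based approaches compared in table 1 can be illustrated with a toy sketch. The Hebrew character map and the dictionary entry below are illustrative stand-ins, not actual Junidecode or WikTra data:

```python
# Toy contrast between character-based and dictionary-based transliteration.
CHAR_MAP = {"ס": "s", "פ": "p", "ר": "r"}  # illustrative consonant mappings

def char_translit(word: str) -> str:
    """Character-by-character mapping (the Junidecode-style approach):
    a consonant-only script comes out without vowels."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in word)

DICT_MAP = {"ספר": "sefer"}  # Hebrew 'book'; the lexicon entry restores vowels

def dict_translit(word: str) -> str:
    """Dictionary lookup first (the WikTra-style approach), with a
    character-level fallback for out-of-vocabulary words."""
    return DICT_MAP.get(word, char_translit(word))
```

On the dictionary word, the character mapping yields the vowel-less "spr", while the lookup recovers the vowelized "sefer"; words outside the toy dictionary fall back to the character mapping.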

Our relatively simple solution considers only the languages of the input words, computing a language proximity value between 0 and 1, as follows:

  GeoProx(l1, l2) = min( TD / GeoDist(l1, l2), 1.0 )    (4)

The function GeoDist(l1, l2) is an approximate 'geographic distance' between two languages l1 and l2, based on the geographical areas where the languages are spoken. The constant TD corresponds to a minimal distance: if two languages are spoken within this distance then they have maximum geographic relatedness. TD is empirically set as described in section 5.2.

Distances between languages are provided by the WALS resource7, one of the most comprehensive language databases. WALS provides latitude and longitude coordinates for a language given as input. While a single coordinate returned for a language may in some cases be a crude approximation of linguistic coverage (e.g., Spanish is spoken both in Spain and in most countries of Latin America), even this level of precision was found to improve our evaluation results.

7 https://wals.info

5 Evaluation

This section describes how CogNet was evaluated on a manually built cognate corpus and how its parameters were tuned to optimise results.

5.1 Dataset Annotation

While our initial idea was to use existing cognate datasets for evaluation, the most comprehensive databases turned out to represent cognates in their phonetic transcriptions instead of having words written in their original scripts. Such data was not usable to test our method, which performs transliteration on its own.

Consequently, we created a dataset of 40 concepts with fully annotated sets of cognate groups. On average, a concept was represented in 107 languages by 129 words: 5,142 words in total for the 40 concepts. The concepts were chosen from the Swadesh basic word list and from the WordNet core concepts (Boyd-Graber et al., 2006). The lexicalizations (words) corresponding to these concepts were retrieved from the UKC. For each concept, we asked two language experts to find cognate clusters among its lexicalizations. The experts made their decisions based on online resources such as Wiktionary and the Online Etymology Dictionary8. Cohen's Kappa score for inter-annotator agreement was 95.15%. The resulting human-annotated dataset contained 5,142 words, 38,447 pairs of cognate words and 320,338 pairs of non-cognate words. We divided this dataset into two equal parts: the first 20 concepts for parameter configuration and the second 20 concepts for evaluation.

8 https://www.etymonline.com

5.2 Algorithm Configuration

The goal of configuration was to optimise the algorithm with respect to three hyperparameters: the threshold of combined orthographic–geographic relatedness TF (section 3), the geographic proximity contribution parameter TG, and the minimum distance TD (section 4.3).

We have created a three-dimensional grid with TF = [0.0; 1.0] (the higher the value, the more the strings need to be similar to be considered as cognates), TG = [0.0; 1.0] (the higher the value, the more geographic proximity is considered as evidence), and TD = [0.0; 22.0] (here, the unit of 1.0 corresponds to a distance of 1,000 km, within which geographic relatedness is a constant maximum).

In this grid, we computed optimal values for each parameter (in increments of 0.01) based on performance on the configuration dataset described in section 5.1. With these optimal settings, we evaluated all possible combinations of the various components of the cognate generation method, in order to understand their relative contribution to the overall score. Since our ultimate goal is to generate high-quality knowledge, we favoured precision over recall, setting our minimum precision threshold to 95% and maximizing recall with respect to this constraint. The best settings (computed on the parameter configuration dataset) as well as the corresponding precision–recall figures (computed on the evaluation dataset) are reported in table 2. Although we set the precision threshold to 95% for the configuration dataset, we obtained precision results that are slightly lower, about 94%, on the evaluation dataset.

Table 2: Parameter configuration and comparisons.

Methods                           | TF   | TG   | TD  | P     | R     | F1
Baseline 1: LCS                   | 0.60 | –    | –   | 94.70 | 25.62 | 40.32
Baseline 2: Consonant             | –    | –    | –   | 98.07 | 19.11 | 31.98
LCS + Geo                         | 0.60 | 0.01 | 1.3 | 94.02 | 27.63 | 42.71
LCS + Geo + EWN                   | 0.60 | 0.01 | 1.3 | 94.10 | 30.41 | 45.97
LCS + Geo + WikTra                | 0.63 | 0.02 | 1.2 | 94.15 | 42.42 | 58.49
LCS + Geo + WikTra + EWN          | 0.63 | 0.02 | 1.2 | 94.20 | 44.86 | 60.78
LCS + Geo + Trans                 | 0.68 | 0.02 | 1.2 | 95.94 | 44.27 | 60.59
LCS + Geo + Trans + EWN           | 0.70 | 0.06 | 1.3 | 97.32 | 53.53 | 69.07
LCS + Geo + Trans + WikTra        | 0.72 | 0.06 | 1.2 | 94.14 | 77.59 | 85.07
LCS + Geo + Trans + WikTra + EWN  | 0.71 | 0.04 | 1.1 | 93.94 | 86.32 | 89.97

The results of configuration can be seen in table 2. The optimal geographic region parameter TD varies between 1.1 and 1.3, which corresponds to a radius of 1,100–1,300 km: languages spoken within such a distance tend to share more cognates.

One interesting insight from table 2 concerns the use of logical transitivity. While it is an extremely efficient component in our algorithm, in order to maintain precision it requires the relatedness threshold TF to be increased from [0.60; 0.63] to [0.68; 0.71] and the influence of geographic relatedness TG from [0.01; 0.02] to [0.02; 0.06]. This means that in order for transitivity to hold, both the overall relatedness criterion and the geographic proximity need to become stricter.

5.3 Evaluation Results

We evaluated the effect of the various components of our method (geographic relatedness, WikTra transliteration, Etymological WordNet, transitivity) on its overall performance. As a baseline, we used two string similarity methods often used in cognate identification (St Arnaud et al., 2017): LCS, i.e., the longest common subsequence ratio of two words (which we also use in equation 2), and Consonant, a heuristic method that checks whether the first two consonants of the words are identical. Although the baseline Consonant method achieved the highest precision of 98.07%, its recall is the lowest, 19.11%, due to being lim-

ited to Latin characters.

Adding geographic proximity, direct etymological evidence, and transliteration to the algorithm increased recall in a consistent manner, by about 2%, 3%, and 15%, respectively, all the while maintaining precision at the same level. Computing the transitive closure, finally, had a major multiplicator effect on recall, bringing it to 86.32%. With this full setup we were able to generate 3,167,642 cognate pairs across 338 languages.

In order to cross-check the quality of the output, we randomly sampled 400 cognate pairs not covered by the evaluation corpus and had them re-evaluated by the same experts. Accuracy was found to fall in the 93–97% range, very much in line with the goal of 95% we initially set in section 5.2.

6 Exploring CogNet

At an accuracy of 94%, our algorithm has generated 3,167,642 cognates. They cover 567,960 words and 80,836 concepts, corresponding to 33.06% of all words and 73.52% of all concepts in the UKC: one word out of three and three concepts out of four have at least one cognate relationship.

In terms of WordNet formalism, cognate relationships can be expressed as cross-lingual sense relations that connect (word, synset) pairs—reified in wordnets as senses—across languages. As not all wordnets represent senses explicitly, CogNet encodes these relationships in the following tuple form:

  (PWN_synset, w1, l1, w2, l2, metadata)

where PWN_synset is the Princeton WordNet English synset ID representing the shared meaning of the cognate pair, w1 and w2 are the two words, l1 and l2 are their respective languages (expressed as ISO 639-3 codes), and metadata is a set of attributes describing the cognate pair, such as the type of evidence for the relationship (direct etymological or indirect). The entire CogNet resource is described and freely downloadable from the web9.

While we expect CogNet to provide linguistic insights for both theoretical and applied research, we are just starting to exploit its richness. As a first result, we have developed an online tool10 for the visual exploration of cognate data (see figure 1 for an illustration). In the long term, this web tool is intended for linguists both for the exploration of data and for collaborative work on extending the resource.

Figure 1: Cognate sets of the concept 'song', represented with different colours. It is easy to observe the effects of language families (e.g., the red triangles) and geographic proximity (e.g., the higher density of orange in South-West Asia and green in Central Asia).

9 http://cognet.ukc.disi.unitn.it
10 http://linguarena.eu

We also carried out an initial exploration of cognate data along the axes of language, language family, and geographic distance. Figure 2 shows the number of cognates found at a given geographic distance (i.e., the distance of the speakers of the two languages, as defined in section 4.3). We observe that the vast majority of cognates is found within a distance of about 3,000 km. Our interpretation of these results is that, by and large, locality is still a major influence on modern lexicons, despite the globalising effects of the last centuries. Let us note that the geographic proximity component of our algorithm alone could not have caused this distribution, as it had a relatively minor overall contribution to the results (see the geographic factor TG = 0.04 in table 2).

In order to avoid biasing per-language statistics by the incompleteness of the lexicons (wordnets) used, we limited our study to the 45 languages with a vocabulary size larger than 10,000 words. As a further abstraction from lexicon size, we introduce the notion of cognate density, defined over a set of words as the ratio of words covered by at least one cognate pair of CogNet. In other words, working with cognate densities allows us to characterise the 'cognate content' of each language independently of the wordnet size.

Cognate densities for the 45 languages studied show a wide spread between languages with the highest density (the top five languages being Indonesian: 60.80%, Czech: 59.05%, Catalan: 58.66%, Malay: 57.63%, and French: 57.25%) and those with the lowest (the bottom five languages being Thai: 7.87%, Arabic: 9.01%, Persian: 9.64%, Mongolian: 10.37%, and Mandarin Chinese: 11.03%). The main factor behind high cognate density is the presence of closely related languages in our data: as Malay and Indonesian are mutually intelligible registers of the same language, the existence of separate wordnets for the two naturally results in a high proportion of shared vocabulary. Inversely, languages on the other end of the spectrum tend not to have major living languages that are closely related. Let us finally note that non-perfect transliteration and failed transliteration-based matches may also be a reason for low cognate recall for languages with very different scripts, such as Chinese, Arabic, or

Thai.

In order to verify these intuitions, we examined cognate densities for the 45 languages manually clustered into 16 language families (see table 3; the language name was kept for clusters of size 1). Indeed, families such as Malay, Romance, Slavic, or Indo-Aryan, well known for containing several mutually intelligible language pairs, came out on top, while families with generally fewer or mutually non-intelligible members ended up at the bottom. The only outlier is Basque that, despite being an isolate, is close to the resource-wide average cognate density of 33%.

Table 3: Cognate density by language family, computed over the 45 largest-vocabulary languages.

Family       | Density    Family        | Density
Malay        | 59.22%     Greek         | 22.99%
Romance      | 53.32%     Niger-Congo   | 18.63%
Slavic       | 36.67%     Japanese      | 12.16%
Indo-Aryan   | 36.08%     Sino-Tibetan  | 11.22%
Germanic     | 34.10%     Mongolian     | 10.37%
Basque       | 32.82%     Persian       |  9.64%
Dravidian    | 24.79%     Arabic        |  9.01%
Finno-Ugric  | 24.57%     Thai          |  7.87%

Figure 2: The number of cognates according to the geographic distance of the language speakers (x-axis: geographic distance, 1 unit = 1,000 km; y-axis: number of cognates).

7 Conclusions

In this paper, we have demonstrated a general method for building a cognate database using existing wordnet resources. Identifying cognates based on orthography for words written in 35 different writing systems, as opposed to phonetic data, made the problem statement novel with respect to existing research in cognate identification. The use of a large-scale cross-lingual database and a combination of linguistic, semantic, etymological, and geographic evidence resulted in what is, to our knowledge, the largest cognate database both in terms of the number of concepts and of the writing systems covered. The evaluation showed that the resource has promisingly high quality, with precision and recall adjustable through the algorithm parameters. The resource has been made available online, together with a graphical web-based tool for the exploration of cognate data, our hope being to attract both linguists and computer scientists as potential users.

Acknowledgments

This paper was partly supported by the InteropEHRate project, co-funded by the European Union (EU) Horizon 2020 programme under grant number 826106.

The first author is supported by the Cyprus Center for Algorithmic Transparency, which has received funding from the European Union's Horizon 2020 Research and Innovation Program under Grant Agreement No. 810105.

References

Lisa Beinborn, Torsten Zesch, and Iryna Gurevych. 2013. Cognate production using character-based machine translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 883–891.

Gabor Bella, Fausto Giunchiglia, and Fiona McNeill. 2017. Language and domain aware lightweight ontology matching. Web Semantics: Science, Services and Agents on the World Wide Web.

Gabor Bella, Alessio Zamboni, and Fausto Giunchiglia. 2016. Domain-based sense disambiguation in multilingual structured data. In The Diversity Workshop at the 22nd European Conference on Artificial Intelligence (ECAI 2016).

Tanmoy Bhattacharya, Nancy Retzlaff, Damián E. Blasi, William Croft, Michael Cysouw, Daniel Hruschka, Ian Maddieson, Lydia Müller, Eric Smith, Peter F. Stadler, et al. 2018. Studying language evolution in the age of big data. Journal of Language Evolution.

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In ACL (1), pages 1352–1362.

Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire. 2006. Adding dense, weighted connections to WordNet. In Proceedings of the Third International WordNet Conference, pages 29–36. Citeseer.

Gerard De Melo. 2014. Etymological Wordnet: Tracing the history of words. In LREC, pages 1148–1154. Citeseer.

Fausto Giunchiglia, Khuyagbaatar Batsuren, and Gabor Bella. 2017. Understanding and exploiting language diversity. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 4009–4017.

Fausto Giunchiglia, Khuyagbaatar Batsuren, and Abed Alhakim Freihat. 2018. One world—seven thousand languages. In Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2018), 18–24 March 2018.

Fausto Giunchiglia, Mladjan Jovanovic, Mercedes Huertas-Migueláñez, and Khuyagbaatar Batsuren. 2015. Crowdsourcing a large scale multilingual lexico-semantic resource. In AAAI Conference on Human Computation and Crowdsourcing (HCOMP-15).

Simon J. Greenhill, Robert Blust, and Russell D. Gray. 2008. The Austronesian Basic Vocabulary Database: from bioinformatics to lexomics. Evolutionary Bioinformatics, 4:EBO–S893.

Bradley Hauer and Grzegorz Kondrak. 2011. Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 865–873.

Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of ACL 2018, System Demonstrations, pages 13–18.

Gerhard Jäger. 2013. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change, 3(2):245–291.

Gerhard Jäger. 2018. Global-scale phylogenetic linguistic inference from lexical resources. CoRR, abs/1802.06079.

Sarvnaz Karimi, Falk Scholer, and Andrew Turpin. 2011. Machine transliteration survey. ACM Computing Surveys (CSUR), 43(3):17.

Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 288–295. Association for Computational Linguistics.

Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003, Short Papers, Volume 2, pages 46–48. Association for Computational Linguistics.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Taraka Rama, Johann-Mattis List, Johannes Wahle, and Gerhard Jäger. 2018. Are automatic methods for cognate detection good enough for phylogenetic reconstruction in historical linguistics? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2, pages 393–400.

Taraka Rama, Johannes Wahle, Pavel Sofroniev, and Gerhard Jäger. 2017. Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 1715–1725.

Adam St Arnaud, David Beck, and Grzegorz Kondrak. 2017. Identifying cognate sets across dictionaries of related languages. In Proceedings of EMNLP 2017, pages 2519–2528.

Yulia Tsvetkov and Chris Dyer. 2015. Lexicon stratification for translating out-of-vocabulary words. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 125–131.

Søren Wichmann, André Müller, Viveka Velupillai, Cecil H. Brown, Eric W. Holman, Pamela Brown, Sebastian Sauppe, Oleg Belyaev, Matthias Urban, Zarina Molochieva, et al. 2010. The ASJP database (version 13). URL: http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm.

Winston Wu and David Yarowsky. 2018. Creating large-scale multilingual cognate tables. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
