CogNet: a Large-Scale Cognate Database

Khuyagbaatar Batsuren† Gábor Bella† Fausto Giunchiglia†§
DISI, University of Trento, Trento, Italy†
Jilin University, Changchun, China§
{k.batsuren; gabor.bella; fausto.giunchiglia}@unitn.it

Abstract

This paper introduces CogNet, a new, large-scale lexical database that provides cognates—words of common origin and meaning—across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available wordnets, with an accuracy evaluated at 94%. Finally, statistics and early insights about the cognate data are presented, hinting at a possible future exploitation of the resource [1] by various fields of linguistics.

1 Introduction

Cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre. Cognates and the problem of cognate identification have been extensively studied in the fields of language typology and historical linguistics, as cognates are considered useful for researching the relatedness of languages (Bhattacharya et al., 2018). Cognates are also used in computational linguistics, e.g., for lexicon extension (Wu and Yarowsky, 2018) or to improve cross-lingual NLP tasks such as machine translation or bilingual word recognition (Kondrak et al., 2003; Tsvetkov and Dyer, 2015).

Despite the interest in using cognate data for research, state-of-the-art cognate databases have had limited practical use from an applied perspective, for two reasons. Firstly, popular cognate-coded databases used in historical linguistics, such as ASJP (Wichmann et al., 2010), IELex [2], or ABVD (Greenhill et al., 2008), cover only the small set of 225 Swadesh basic concepts, although with an extremely wide coverage of up to 4000 languages. Secondly, in these databases, lexical entries that belong to scripts other than Latin or Cyrillic mostly appear in phonetic transcription instead of using their actual orthographies in their original scripts. These limitations prevent such resources from being used in real-world computational tasks on written language.

This paper describes CogNet, a new large-scale, high-precision, multilingual cognate database, as well as the method used to build it. Our main technical contributions are (1) a general method to detect cognates from multilingual lexical resources, with precision and recall parametrable according to usage needs; (2) a large-scale cognate database containing 3.1 million word pairs across 338 languages, generated with the method above; (3) WikTra, a multilingual transliteration dictionary and library derived from Wiktionary data; and (4) an online platform that lets users explore the resource.

The paper is organised as follows. Section 2 presents the state of the art. Section 3 describes the main cognate discovery algorithm and section 4 the way the various forms of evidence used by the algorithm are computed. The method is parametrised and the results are evaluated in section 5. Section 6 describes the resulting CogNet database in terms of structure and statistical insights. Finally, section 7 concludes the paper.

[1] The CogNet resource and WikTra tool are available on http://cognet.ukc.disi.unitn.it.
[2] Indo-European Lexical Cognacy Database, http://ielex.mpi.nl/

2 State of the Art

To our knowledge, cognates have so far been defined and explored in two fundamental ways by two distinct research communities.
On the one hand, cognate identification has been studied within linguistic typology and historical linguistics. On the other hand, computational linguists have been researching methods for cognate production.

The very definition of the term ‘cognate’ varies according to the research community. In historical linguistics, cognates must have a provable etymological relationship and must be translated into each language (Bhattacharya et al., 2018). Accordingly, the English skyscraper and the German Wolkenkratzer are considered as cognates but the English song and the Japanese ソング (/songu/) are not. In computational linguistics, the notion of cognate is more relaxed with respect to etymology and loanwords are also considered as cognates (Kondrak et al., 2003). For our work we adopted the latter, computational point of view.

In historical linguistics, cognate identification methods proceed in two main steps. First, a similarity matrix of all words is estimated by three types of similarity measures: semantic similarity, phonetic similarity, and orthographic similarity. For information on semantic similarity, special-purpose multilingual dictionaries, such as the well-known Swadesh List, are used. For orthographic similarity, string metrics (Hauer and Kondrak, 2011; St Arnaud et al., 2017) are often employed, e.g., edit distance, Dice's coefficient, or LCSR. As these methods do not work across scripts, they are complemented by phonetic similarity, exploiting transformations and sound changes across related languages (Kondrak, 2000; Jäger, 2013; Rama et al., 2017). Phonetic similarity measures, however, require phonetic transcriptions to be a priori available. More recently, historical linguists have started exploiting identified cognates to infer phylogenetic relationships across languages (Rama et al., 2018; Jäger, 2018).
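The orthographic metrics named above are standard string measures. The short Python sketch below is given purely for illustration of those metrics and is not the code used to build CogNet; it computes edit (Levenshtein) distance and LCSR, the longest common subsequence ratio, for the letter/lettre pair from the introduction.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: |LCS(a, b)| / max(|a|, |b|)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)] / max(len(a), len(b), 1)

print(edit_distance("letter", "lettre"))   # 2
print(round(lcsr("letter", "lettre"), 2))  # 0.83

Dice's coefficient over character bigrams follows the same pattern; all such measures presuppose a shared script, which is the limitation that transliteration is meant to address.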
Phonetic similar- based on three main principles: (1) semantic ity measures, however, require phonetic transcrip- equivalence, i.e., that the two words share a com- tions to be a priori available. More recently, his- mon meaning; (2) sufficient proof of etymological torical linguists have started exploiting identified relatedness; and (3) the logical transitivity of the cognates to infer phylogenetic relationships across cognate relationship. languages (Rama et al., 2018; Jäger, 2018). The core resource for obtaining cross-lingual In computational linguistics, cognate produc- evidence on semantic equivalence—i.e., the same- tion consists of finding for a word in a given lan- ness of word meanings—is the Universal Knowl- guage its cognate pair in another language. State- edge Core (UKC), a large multilingual lexico- of-the-art methods (Beinborn et al., 2013; Sen- semantic database (Giunchiglia et al., 2018) al- nrich et al., 2016) have employed character-based ready used both in linguistics research as well machine translation, trained from parallel corpora, as for practical applications (Bella et al., 2016; to produce cognates or transliterations. (Wu and Giunchiglia et al., 2017; Bella et al., 2017). Yarowsky, 2018) also employs similar techniques, The UKC includes the lexicons and lexico- as well as multilingual dictionaries, to produce semantic relations for 338 languages, contain- large-scale cognate clusters for Romance and Tur- ing 1,717,735 words and 2,512,704 language- kic languages. Although the cognates produced specific word meanings. It was built from word- in this manner are, in principle, a good source for nets (Miller, 1995) and wiktionaries converted 3137 into wordnets (Bond and Foster, 2013)). As all Algorithm 1: Cognate Discovery Algorithm of the resources composing the UKC were built Input : c, a lexical concept and validated by humans(Giunchiglia et al., 2015), Input : , a lexical resource R we consider the quality of our input data to be Output : G+, graph of all cognates of c high enough for obtaining accurate results on cog- 1 V,E ; ; nates (Giunchiglia et al., 2017). As most wordnets 2 Languages (c); L R map their units of meaning (synsets in WordNet 3 for each language l do 2L terminology) to English meanings, they can effec- 4 for each word w Words (c, l) do 2 R tively be interconnected into a cross-lingual lexi- 5 V V v =<w, l> ; cal resource. The UKC reifies all of these map- [{ } 6 for each node v1 =<w1,l1> V do pings as supra-lingual lexical concepts (107,196 2 7 for each node v2 =<w2,l2> V do in total, excluding named entities such as Ulan- 2 8 if l1 = l2 then baatar). For example, if the German Fahrrad and 9 continue; the Italian bicicletta are mapped to the English 10 if EtyRel(w1,l1,w2,l2) then bicycle then a single concept is created to which 11 E E e = <v1,v2> ; all three language-specific meanings (i.e., wordnet [{ } 12 else if OrthSim(w1,l1,w2,l2)+TG synsets) will be mapped. ⇥ GeoP rox(l1,l2) >TF then In terms of etymological evidence, we use both 13 E E e = <v1,v2> ; [{ } direct and indirect evidence of etymological re- 14 G < V, E >; + latedness. Direct evidence is provided by gold- 15 G = TransitiveClosure(G) + standard etymological resources, such as the one 16 return G ; we use and present in section 4.1.