Cognate or False Friend? Ask the Web!
Svetlin Nakov, Sofia University, 5 James Boucher Blvd., Sofia, Bulgaria, [email protected]
Preslav Nakov, Univ. of Cal. Berkeley, EECS, CS division, Berkeley, CA 94720, [email protected]
Elena Paskaleva, Bulgarian Academy of Sciences, 25A Acad. G. Bonchev Str., Sofia, Bulgaria, [email protected]

Abstract

We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic "bridges", and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.

Keywords

Cognates, false friends, semantic similarity, Web as a corpus.

1 Introduction

Linguists define cognates as words derived from a common root. For example, the Electronic Glossary of Linguistic Terms gives the following definition [5]:

    Two words (or other structures) in related languages are cognate if they come from the same original word (or other structure). Generally cognates will have similar, though often not identical, phonological and semantic structures (sounds and meanings). For instance, Latin tu, Spanish tú, Greek sú, German du, and English thou are all cognates; all mean 'second person singular', but they differ in form and in whether they mean specifically 'familiar' (non-honorific).

Following previous researchers in computational linguistics [4, 22, 25], we adopt a simplified definition, which ignores origin, defining cognates (or true friends) as words in different languages that are translations and have a similar orthography. Similarly, we define false friends as words in different languages with similar orthography that are not translations. Here are some identically-spelled examples of false friends:

• pozor (позор) means a disgrace in Bulgarian, but attention in Czech;
• mart (март) means March in Bulgarian, but a market in English;
• Gift means a poison in German, but a present in English;
• Prost means cheers in German, but stupid in Bulgarian.

And some examples with a different orthography:

• embaraçada means embarrassed in Portuguese, while embarazada means pregnant in Spanish;
• spenden means to donate in German, but to spend means to use up or to pay out in English;
• bachelier means a person who passed his bac exam in French, but in English bachelor means an unmarried man;
• babichka (бабичка) means an old woman in Bulgarian, but babochka (бабочка) is a butterfly in Russian;
• godina (година) means a year in Russian, but godzina is an hour in Polish.

In the present paper, we describe a novel semantic approach to distinguishing cognates from false friends. The paper is organised as follows: Section 2 explains the method, section 3 describes the resources, section 4 presents the data set, section 5 describes the experiments, section 6 discusses the results of the evaluation, and section 7 points to important related work. We conclude with directions for future work in section 8.

2 Method

2.1 Contextual Web Similarity

We propose an unsupervised algorithm which, given a Russian word wru and a Bulgarian word wbg to be compared, measures the semantic similarity between them using the Web as a corpus and a glossary G of known Russian-Bulgarian translation pairs, used as "bridges". The basic idea is that if two words are translations, then the words in their respective local contexts should be translations as well. The idea is formalised using the Web as a corpus, a glossary of known word translations serving as cross-linguistic "bridges", and the vector space model. We measure the semantic similarity between a Bulgarian and a Russian word, wbg and wru, by constructing corresponding contextual semantic vectors Vbg and Vru, translating Vru into Bulgarian, and comparing it to Vbg.

The process of building Vbg starts with a query to Google limited to Bulgarian pages for the target word wbg. We collect the resulting text snippets (up to 1,000), and we remove all stop words: prepositions, pronouns, conjunctions, interjections and some adverbs. We then identify the occurrences of wbg, and we extract three words on either side of each of them. We filter out the words that do not appear on the Bulgarian side of G. Finally, for each retained word, we calculate the number of times it has been extracted, thus producing a frequency vector Vbg. We repeat the procedure for wru to obtain a Russian frequency vector Vru, which is then "translated" into Bulgarian by replacing each Russian word with its translation(s) in G, retaining the co-occurrence frequencies. In case of multiple Bulgarian translations for some Russian word, we distribute the corresponding frequency equally among them, and in case of multiple Russian words with the same Bulgarian translation, we sum up the corresponding frequencies. As a result, we end up with a Bulgarian vector Vru→bg for the Russian word wru. Finally, we calculate the semantic similarity between wbg and wru as the cosine between their corresponding Bulgarian vectors, Vbg and Vru→bg.
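A minimal Python sketch of this procedure follows. It assumes the Google snippets for each target word have already been retrieved; the tokenisation, function names, and data structures are illustrative only, not a description of the authors' actual implementation.

```python
# A minimal sketch of the contextual Web similarity (section 2.1).
# Assumes snippets for each target word were already fetched;
# names and data structures are illustrative only.
import math
import re
from collections import defaultdict

def context_vector(snippets, target, glossary_side, stop_words, window=3):
    """Count the words within `window` positions of each occurrence of
    `target`, keeping only words found on the given side of the glossary G."""
    vec = defaultdict(float)
    for snippet in snippets:
        tokens = [t for t in re.findall(r"\w+", snippet.lower())
                  if t not in stop_words]            # drop stop words first
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in glossary_side:
                    vec[tokens[j]] += 1
    return vec

def translate_vector(vec_ru, glossary):
    """Map a Russian context vector into Bulgarian through G: a frequency is
    split equally over multiple translations, and Russian words sharing a
    Bulgarian translation have their frequencies summed."""
    vec_bg = defaultdict(float)
    for word_ru, freq in vec_ru.items():
        translations = glossary[word_ru]             # list of Bulgarian words
        for word_bg in translations:
            vec_bg[word_bg] += freq / len(translations)
    return vec_bg

def cosine(v1, v2):
    dot = sum(f * v2.get(w, 0.0) for w, f in v1.items())
    n1 = math.sqrt(sum(f * f for f in v1.values()))
    n2 = math.sqrt(sum(f * f for f in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# The similarity of a candidate pair is then:
#   cosine(v_bg, translate_vector(v_ru, glossary))
```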
2.2 Reverse Context Lookup

The reverse context lookup is a modification of the above algorithm. The original algorithm implicitly assumes that, given a word w, the words in the local context of w are semantically associated with it, which is often wrong due to Web-specific words like home, site, page, click, link, download, up, down, back, etc. Since their Bulgarian and Russian equivalents are in the glossary G, we can get very high similarity for unrelated words. For the same reason, we cannot judge such navigational words as true/false friends.

The reverse context lookup copes with the problem as follows: in order to consider w associated with a word wc from the local context of w, it requires that w appear in the local context of wc as well¹ (a short sketch of this symmetric test follows section 2.3 below). More formally, let #(x, y) be the number of occurrences of x in the local context of y. The strength of association is calculated as p(w, wc) = min{#(w, wc), #(wc, w)} and is used in the vector coordinates instead of #(wc, w), which is used in the original algorithm.

2.3 Web Similarity Using Seed Words

For comparison purposes, we also experiment with the seed words algorithm of Fung&Yee'98 [12], which we adapt to use the Web. We prepare a small glossary of 300 Russian-Bulgarian word translation pairs, which is a subset of the glossary used for our contextual Web similarity algorithm². Given a Bulgarian word wbg and a Russian word wru to compare, we build two vectors, one Bulgarian (Vbg) and one Russian (Vru), both of size 300, where each coordinate corresponds to a particular glossary entry (gru, gbg). Therefore, we have a direct correspondence between the coordinates of Vbg and Vru. The coordinate value for gbg in Vbg is calculated as the total number of co-occurrences of wbg and gbg on the Web, where gbg immediately precedes or immediately follows wbg. This number is calculated using Google page hits as a proxy for bigram frequencies: we issue two exact phrase queries "wbg gbg" and "gbg wbg", and we sum the corresponding numbers of page hits. We repeat the same procedure with wru and gru in order to obtain the values for the corresponding coordinates of the Russian vector Vru. Finally, we calculate the semantic similarity between wbg and wru as the cosine between Vbg and Vru (see the second sketch below).
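A sketch of the reverse context lookup from section 2.2, reusing context_vector() from the earlier sketch; fetch_snippets() is a hypothetical helper standing in for the search-engine query step, not a real API.

```python
# A sketch of the reverse context lookup (section 2.2). Reuses
# context_vector() from the previous sketch; fetch_snippets() is a
# hypothetical wrapper around the search-engine query step.
def reverse_context_vector(w, vec, stop_words):
    """Replace each raw count #(wc, w) with the symmetric strength
    p(w, wc) = min{#(w, wc), #(wc, w)}: w must itself occur in the local
    context of wc, which is collected with a separate query for wc."""
    result = {}
    for wc, forward in vec.items():                  # forward = #(wc, w)
        wc_snippets = fetch_snippets(wc)             # hypothetical helper
        backward = context_vector(wc_snippets, wc, {w}, stop_words).get(w, 0.0)
        result[wc] = min(forward, backward)          # p(w, wc)
    return result
```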
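And a sketch of the adapted seed-words similarity from section 2.3. Here page_hits() is a hypothetical wrapper returning the number of page hits for an exact phrase query; the rest follows the coordinate definition above.

```python
# A sketch of the adapted seed-words similarity (section 2.3).
# page_hits() is a hypothetical wrapper returning the number of
# page hits for an exact phrase query.
import math

def seed_vector(word, seeds):
    """One coordinate per glossary entry: the summed page hits of the
    exact phrases "word seed" and "seed word" (a bigram-frequency proxy)."""
    return [page_hits(f'"{word} {seed}"') + page_hits(f'"{seed} {word}"')
            for seed in seeds]

def seed_similarity(w_bg, w_ru, seed_pairs):
    """Cosine between the two 300-dimensional vectors; coordinate i of
    both corresponds to the same translation pair (g_ru, g_bg)."""
    v_bg = seed_vector(w_bg, [g_bg for g_ru, g_bg in seed_pairs])
    v_ru = seed_vector(w_ru, [g_ru for g_ru, g_bg in seed_pairs])
    dot = sum(a * b for a, b in zip(v_bg, v_ru))
    n_bg = math.sqrt(sum(a * a for a in v_bg))
    n_ru = math.sqrt(sum(b * b for b in v_ru))
    return dot / (n_bg * n_ru) if n_bg and n_ru else 0.0
```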
3 Resources

3.1 Grammatical Resources

We use two monolingual dictionaries for lemmatisation. For Bulgarian, we have a large morphological dictionary, containing about 1,000,000 wordforms and 70,000 lemmata [29], created at the Linguistic Modeling Department, Institute for Parallel Processing, Bulgarian Academy of Sciences. Each dictionary entry consists of a wordform and a corresponding lemma, followed by morphological and grammatical information. There can be multiple entries for the same wordform, in case of multiple homographs. We also use a large grammatical dictionary of Russian in the same format, consisting of 1,500,000 wordforms and 100,000 lemmata, based on the Grammatical Dictionary of A. Zaliznjak [35]. Its electronic version was supplied by the Computerised fund of Russian language, Institute of Russian language, Russian Academy of Sciences.

3.2 Bilingual Glossary

We built a bilingual glossary using an online Russian-Bulgarian dictionary³ with 3,982 entries in the following format: a Russian word, an optional grammatical marker, optional stylistic references, and a list of Bulgarian translation equivalents. First, we removed all multi-word expressions. Then we combined each Russian word with each of its Bulgarian translations; due to polysemy/homonymy some words had multiple translations (a small illustrative sketch of this construction is given below, after section 4). As a result, we obtained a glossary G of 4,563 word-word translation pairs (3,794 if we exclude the stop words).

3.3 Huge Bilingual Glossary

Similarly, we adapted a much larger Bulgarian-Russian electronic dictionary, transforming it into a bilingual glossary with 59,583 word-word translation pairs.

4 Data Set

4.1 Overview

Our evaluation data set consists of 200 Bulgarian-Russian pairs: 100 cognates and 100 false friends. It has been extracted from two large lists of cognates and false friends, manually assembled by a linguist from several monolingual and bilingual dictionaries.

¹ These contexts are collected using a separate query for wc.
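Returning to section 3.2, the following sketch illustrates the cross-product construction of G. The parsed entry format (russian_word, [bulgarian_translations]) is an assumption for illustration; parsing of the grammatical markers and stylistic references is omitted.

```python
# An illustrative sketch of the glossary construction (section 3.2).
# Each parsed entry is assumed to be (russian_word, [bulgarian_translations]).
def build_glossary(entries):
    """Drop multi-word expressions, then pair each Russian word with
    each of its Bulgarian translations (one pair per sense)."""
    pairs = set()
    for word_ru, translations in entries:
        if " " in word_ru:                    # remove multi-word expressions
            continue
        for word_bg in translations:
            if " " not in word_bg:
                pairs.add((word_ru, word_bg))
    return pairs
```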