Identifying False Friends Between Closely Related Languages
Nikola Ljubešić
Faculty of Humanities and Social Sciences
University of Zagreb
Ivana Lučića 3
10000 Zagreb, Croatia
[email protected]

Darja Fišer
Faculty of Arts
University of Ljubljana
Aškerčeva 2
1000 Ljubljana, Slovenija
[email protected]

Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, pages 69–77, Sofia, Bulgaria, 8-9 August 2013. © 2010 Association for Computational Linguistics

Abstract

In this paper we present a corpus-based approach to automatic identification of false friends for Slovene and Croatian, a pair of closely related languages. By taking advantage of the lexical overlap between the two languages, we focus on measuring the difference in meaning between identically spelled words by using frequency and distributional information. We analyze the impact of corpora of different origin and size together with different association and similarity measures and compare them to a simple frequency-based baseline. With the best performing setting we obtain average precision of 0.973 and 0.883 on different gold standards. The presented approach works on non-parallel datasets, is knowledge-lean and language-independent, which makes it attractive for natural language processing tasks that often lack the lexical resources and cannot afford to build them by hand.

1 Introduction

False friends are words in two or more languages that are orthographically or semantically similar but do not have the same meaning, such as the noun burro, which means butter in Italian but donkey in Spanish (Allan, 2009). For that reason, they represent a dangerous pitfall for translators, language students as well as bilingual computer tools, such as machine translation systems, which would all benefit greatly from a comprehensive collection of false friends for a given language pair.

False friends between related languages, such as English and French, have been discussed by lexicographers, translators and language teachers for decades (Chacón Beltrán, 2006; Granger and Swallow, 1988; Holmes and Ramos, 1993). However, they have so far played a minor role in NLP and have been almost exclusively limited to parallel data (Inkpen et al., 2005; Nakov and Nakov, 2009). In this paper we tackle the problem of automatically identifying false friends in weakly comparable corpora by taking into account the distributional and frequency information collected from non-parallel texts.

Identifying false friends automatically has the same prerequisite as the problem of detecting cognates: identifying similarly (and identically) spelled words between two languages, which is far from trivial if one takes into account the specificity of inter-language variation of a specific language pair. In this contribution we focus on the problem of false friends on two quite similar languages with a high lexical overlap, Croatian and Slovene, which enables us to circumvent the problem of identifying similarly spelled words and use only identical words as the word pair candidate list for false friends.

Our approach to identifying false friends relies on two types of information extracted from corpora. The first is the frequency of a false friend candidate pair in the corresponding corpora: the greater the difference in frequency, the more certain one can be that the words are used in different meanings. The second is the context from the corresponding corpora, where the context dissimilarity of the two words in question is calculated through a vector space model.

The paper is structured as follows: in Section 2 we give an overview of the related work. In Section 3 we describe the resources we use and in Section 4 we present the gold standards used for evaluation. Section 5 describes the experimental setup and Section 6 reports on the results. We conclude the paper with final remarks and ideas for future work.

2 Related Work

Automatic detection of false friends was initially limited to parallel corpora but has been extended to comparable corpora and web snippets (Nakov et al., 2007). The approaches to automatically identifying false friends fall into two categories: those that only look at orthographic features of the source and the target word, and those that combine orthographic features with semantic ones.

Orthographic approaches typically rely on combinations of a number of orthographic similarity measures and machine learning techniques to classify source and target word pairs into cognates, false friends or unrelated words, and evaluate the different combinations against a manually compiled list of legitimate and illegitimate cognates. This has been attempted for English and French (Inkpen et al., 2005; Frunza and Inkpen, 2007) as well as for Spanish and Portuguese (Torres and Aluísio, 2011).

Most of the approaches that combine orthographic features with semantic ones have been performed on parallel corpora, where word frequency information and alignments at paragraph, sentence as well as word level play a crucial role in singling out false friends; this has been tested on Bulgarian and Russian (Nakov and Nakov, 2009). Work on non-parallel data, on the other hand, often treats false friend candidates as search queries, and considers the retrieved web snippets for these queries as contexts that are used to establish the degree of semantic similarity of the given word pair (Nakov and Nakov, 2007).

Apart from the web snippets, comparable corpora have also been used to extract and classify pairs of cognates and false friends between English and German, English and Spanish, and French and Spanish (Mitkov et al., 2007). In their work, the traditional distributional approach is compared with the approach of calculating the N nearest neighbors for each false friend candidate in the source language, translating the nearest neighbors via a seed lexicon and calculating the set intersection with the N nearest neighbors of the false friend candidate from the target language.

A slightly different setting has been investigated by Schulz et al. (2004), who built a medical domain lexicon from a closely related language pair (Spanish-Portuguese) and used the standard distributional approach to filter out false friends from cognate candidates by catching orthographically most similar but contextually most dissimilar word pairs.

The feature weighting used throughout the related work is mostly plain frequency, with one case of using TF-IDF (Nakov and Nakov, 2007), whereas cosine is the most widely used similarity measure (Nakov and Nakov, 2007; Nakov and Nakov, 2009; Schulz et al., 2004), while Mitkov et al. (2007) use skew divergence, which is very similar to Jensen-Shannon divergence.

The main differences between the work we report on in this paper and the related work are:

1. we identify false friends on a language pair with a large lexical overlap; hence we can look for false friends only among identically spelled words, such as boja, which means buoy in Slovene but colour in Croatian, and not among similarly spelled words, such as the Slovene adjective bučen (made of pumpkins and noisy) and its Croatian counterpart bučan (only noisy);

2. we inspect multiple association and similarity measure combinations on two different corpora pairs, which enables us to assess the stability of those parameters in the task at hand;

3. we work on two different corpora pairs which we have full control over (that is not the case with web snippets), and are therefore able to examine the impact of corpus type and corpus size on the task;

4. we use three categories for the identically spelled words:

(a) we use the term true equivalents (TE) to refer to the pairs that have the same meaning and usage in both languages (e.g. adjective bivši, which means former in both languages),

(b) the term partial false friends (PFF) describes pairs that are polysemous and are equivalent in some of the senses but false friends in others (e.g. verb dražiti, which can mean either irritate or make more expensive in Slovene but only irritate in Croatian), and

(c) we use the term false friends (FF) for word pairs which represent different concepts in the two languages (e.g. noun slovo, which means farewell in Slovene and letter of the alphabet in Croatian).

By avoiding the problem of identifying relevant similarly spelled words prior to the identification of false friends, in this paper we focus only on the latter and avoid adding noise from the preceding task.

3 Resources Used

In this paper we use two types of corpora: Wikipedia corpora (hereafter WIKI), which have gained in popularity lately because of their simple construction and decent size, and web corpora (hereafter WAC), which are becoming the standard for building big corpora.

We prepared the WIKI corpora from the dumps of the Croatian and Slovene Wikipedias by extracting their content, tokenizing and annotating them with morphosyntactic descriptions and lemma information.

4 Gold Standards

The gold standards for this research were built from identically spelled nouns, adjectives and verbs that appeared with a frequency equal to or higher than 50 in the web corpora for both languages.

The false friend candidates were categorized into the three categories defined in Section 2: false friends, partial false friends and true equivalents.

Manual classification was performed by three annotators, all of them linguists. Since identifying false friends is hard even for a well-trained linguist, all of them consulted monolingual dictionaries and corpora for both languages before making the final decision.

The first annotation session was performed by a single annotator only. Out of 8491 candidates, he