Multilingual Lexical Resources to Detect Cognates in Non-Aligned Texts
Total Page:16
File Type:pdf, Size:1020Kb
Multilingual lexical resources to detect cognates in non-aligned texts Haoxing Wang Laurianne Sitbon Queensland University of Technology Queensland University of Technology 2 George St, Brisbane, 4000 2 George St, Brisbane, 4000 [email protected] [email protected] ers. As word frequency can be partially estimated Abstract by word length, it remains the principal feature for estimation second language learners with the The identification of cognates between assumption that less frequent words are less like- two distinct languages has recently start- ly to have been encountered and therefore acces- ed to attract the attention of NLP re- sible in the memory of the learnt language by search, but there has been little research readers. However such features are not entirely into using semantic evidence to detect adapted to existing reading abilities of learners cognates. The approach presented in this with a different language background. In particu- paper aims to detect English-French cog- lar, for a number of reasons, many languages nates within monolingual texts (texts that have identical words in their vocabulary, which are not accompanied by aligned translat- opens a secondary access to the meaning of such ed equivalents), by integrating word words as an alternative to memory in the learnt shape similarity approaches with word language. For example, English and French lan- sense disambiguation techniques in order guages belong to different branches of the Indo- to account for context. Our implementa- European family of languages, and additionally tion is based on BabelNet, a semantic European history of mixed cultures has led their network that incorporates a multilingual vocabularies to share a great number of similar encyclopedic dictionary. Our approach is and identical words. These words are called cog- evaluated on two manually annotated da- nates. tasets. The first one shows that across Cognates have often slightly changed their or- different types of natural text, our method thography (especially in derived forms), and can identify the cognates with an overall quite often meaning as well in the years and cen- accuracy of 80%. The second one, con- turies following the transfer. Because of these sisting of control sentences with semi- changes, cognates are generally of one of three cognates acting as either true cognates or different types. First of all, true cognates are false friends, shows that our method can English-French word pairs that are viewed as identify 80% of semi-cognates acting as similar and are mutual translations. The spelling cognates but also identifies 75% of the could be identical or not, e.g., prison and prison, semi-cognates acting as false friends. ceramic and céramique. False cognates are pairs of words that have similar spellings but have to- 1 Introduction tally unrelated meanings. For example, main in French means hand, while in English it means Estimating the difficulty of a text for non-native principal or essential. Lastly, semi-cognates are speakers, or learners of a second language, is pairs of words that have similar spellings but gaining interest in natural language processing, only share meanings in some contexts. One way information retrieval and education communities. to understand it in a practical setting is that they So far measures and models to predict readability behave as true cognates or false cognates de- in a cross-lingual context have used mainly text pending on their sense in a given context. features used in monolingual readability contexts In this paper, we present a method to identify the (such as word shapes, grammatical features and words in a text in a given target language and individual word frequency features). These fea- that could acceptably be translated by a true cog- tures have been tested when estimating readabil- nate in a given source language (native language ity levels for K-12 (primary to high school) read- of a reader learning the target language). Accept- Haoxing Wang and Laurianne Sitbon. 2014. Multilingual lexical resources to detect cognates in non-aligned texts. In Proceedings of Australasian Language Technology Association Workshop, pages 14−22. ability in this context does not necessarily mean the candidates in the target language to establish that a translator (human or automatic) would orthographic/phonetic similarity. Formula 1 chose the true cognate as the preferred transla- shows how we measure the cognateness C of an tion, but rather that the true cognate is indeed a English word W based on the word shape simi- synonym of the preferred translation. The meth- larity WSS of all its possible translations CW, od we present takes into account both character- and will motivated further in sections 3 and 4. istics of true cognates, which are similar spelling ( ) ( ( )) (1) and similar meaning. Because there are several types of orthographic Most of the previous work in cognate identifica- and phonetic similarities used in the literature, tion has been operating with bilingual (aligned) we first establish which is most discriminative of corpora by using orthographic and phonetic cognates. We then evaluate a threshold-based measurements only. In such settings, the similari- approach and machine learning based approach ty of meaning is measured by the alignment of to leverage orthographic/phonetic similarity to sentences in parallel texts. Basically, all the discriminate cognates from non cognates. words in a parallel sentence become candidates The first evaluation focuses on the performance that are then evaluated for orthographic similari- of our method on a cognate detection task in nat- ty. ural data. The natural dataset contains 6 different In the absence of aligned linguistic context, we genres of text. A second evaluation focuses spe- propose that candidates with similar meaning can cifically on semi-cognates classification in con- be proposed by a disambiguation system coupled trolled sentences, where 20 semi-cognates were with multilingual sense based lexicon where each each presented in a sentence where they would word is associated to a set of senses, and senses translate as a cognate and a sentence where they are shared by all languages. A multilingual ver- would not. sion of WordNet is an example of such lexicons. The paper is organized as follows. Section 2 pre- In this paper, we use BabelNet, which is an open sents related research on cognate identification resource and freely accessible semantic network and introduces word sense disambiguation with that connects concepts and named entities in a Babelfy. Section 3 describes a general approach very large network of semantic relations built to tackle the cognate identification work, while from WordNet, Wikipedia and some other the- section 4 specifically presents our implementa- sauri. Furthermore, it is also a multilingual ency- tion process. Finally, section 5 focuses on the clopedic dictionary, with lexicographic and en- evaluation and experiment results. Discussion, cyclopedic coverage of terms in different lan- conclusion and future work are presented in sec- guages. In order to disambiguate the word sense, tion 6 and 7 respectively. BabelNet provides an independent Java tool that is called Babelfy. It employs a unified approach 2 Related Work connecting Entity Linking (EL) and Word Sense Disambiguation (WSD) together. Moro et al. 2.1 Identifying cognates using orthograph- (2014) believe that the lexicographic knowledge ic/phonetic similarity used in WSD is useful for tackling EL task, and The most well-known approach to measuring vice versa, that the encyclopedic information how similar two words look to a reader is to utilized in EL helps disambiguate nominal men- measure the Edit Distance (ED) (Levenshtein, tions in a WSD setting. Given an English sen- 1966). The ED returns a value corresponding to tence, Babelfy can disambiguate the meaning of the minimum number of deletions, insertions and each named entity or concept. For example, substitutions needed to transform the source lan- “You will get more <volume when beating egg guage word into the target language word. The whites if you first bring them to room <tempera- Dice coefficient measurement (Brew and ture.” The words with bracket in front are cog- McKelvie, 1996) is defined as the ratio of the nates. Volume has been disambiguated as “The number of n-grams that are shared by two strings amount of 3-dimensional space occupied by an and the total number of n-grams in both strings. object”, and temperature refers to “The degree of The Dice coefficient with bi-grams (DICE) is a hotness or coldness of a body or environment”. particularly popular word similarity measure. In After the English word has been processed, it their work, Brew and McKelvie looked only at will search the words in other languages that pairs of verbs in English and French, pairs that contain this particular sense as candidates. The are extracted from aligned sentences in a parallel English word in the source is then compared to corpus. Melamed (1999) used another popular 15 technique, the Longest Common Subsequence similarity of each pair of word is calculated Ratio (LCSR), that is the ratio of the length of based on the glosses information between a pair the longest (not necessarily contiguous) common of words. The glosses are available in English for subsequence (LCS) and the length of the longer all words in both lists. The overall similarity is a word. Simard, Foster and Isabelle (1992) use linear combination of phonetic and semantic sim- cognates to align sentences in bi-texts. They only ilarity, with different importance assigned to employed the first four characters of the English- them respectively. The final outcome of this sys- French word pairs to determine whether the word tem is a list vocabulary-entry pair, sorted accord- pairs are cognates or not. ing to the estimated likelihood of their cognate- ALINE (Kondrak, 2000), is an example of a ness. Although their evaluation suggested that phonetic approach.