Cognate Identification Using Machine Translation

Cognate Identification using Machine Translation Shervin Malmasi Mark Dras Centre for Language Technology Centre for Language Technology Macquarie University Macquarie University Sydney, NSW, Australia Sydney, NSW, Australia [email protected] [email protected] Abstract The identification of such cognates have also been tackled by researchers in NLP. English and In this paper we describe an approach to French are one such pair that have received much automatic cognate identification in mono- attention, potentially because it has been posited lingual texts using machine translation. that up to 30% of the French vocabulary consists This system was used as our entry in the of cognates (LeBlanc and Seguin,´ 1996). 2015 ALTA shared task, achieving an F1- This paper describes our approach to cognate score of 63% on the test set. Our pro- identification in monolingual texts, relying on sta- posed approach takes an input text in a tistical machine translation to create parallel texts. source language and uses statistical ma- Using the English data from the shared task, the chine translation to create a word-aligned aim was to predict which words have French cog- parallel text in the target language. A ro- nates. In x2 we describe some related work in this bust measure of string distance, the Jaro- area, followed by a brief description of the data Winkler distance in this case, is then ap- in x3. Our methodology is described in x4 and re- plied to the pairs of aligned words to de- sults are presented in x5. tect potential cognates. Further extensions to improve the method are also discussed. 2 Related Work 1 Introduction Much of the previous work in this area has re- lied on parallel corpora and aligned bilingual texts. Cognates are words in different languages that Such approaches often rely on orthographic simi- have similar forms and meanings, often due to a larity between words to identify cognates. This common linguistic origin from a shared ancestor similarity can be quantified using measures such language. as the edit distance or dice coefficient with n- Cognates play an important role in Second Lan- grams. Brew and McKelvie (1996) applied such guage Acquisition (SLA), particularly between re- orthographic measures to extract English-French lated languages. However, although they can ac- cognates from aligned texts. celerate vocabulary acquisition, learners also have Phonetic similarity has also been shown to be to be aware of false cognates and partial cognates. useful for this task. Kondrak (2001), for example, False cognates are similar words that have distinct, proposed an approach that also incorporates pho- unrelated meanings. In other cases, there are par- netic cues and applied it to various language pairs. tial cognates: similar words which have a common Semantic similarity information has been em- meaning only in some contexts. For example, the ployed for this task as well; this can help iden- word police in French can translate to police, pol- tify false and partial cognates which can help im- icy or font, depending on the context. prove accuracy. Frunza and Inkpen (2010) com- Cognates are a source of learner errors and the bine various measures of orthographic similarity detection of their incorrect usage, coupled with using machine learning methods. They also use correction and contextual feedback, can be of word senses to perform partial cognate between great use in computer-assisted language learning two languages. All of their methods were applied systems. Additionally, cognates are also useful for to English-French. Wang and Sitbon (2014) com- estimating the readability of a text for non-native bined orthographic measures with word sense dis- readers. ambiguation information to consider context. Shervin Malmasi and Mark Dras. 2015. Cognate Identification using Machine Translation . In Proceedings of Australasian Language Technology Association Workshop, pages 138−141. Cognate information can also be used in other source language and identifies potential cognates tasks. One example is Native Language Identi- in a target language. The source and target lan- fication (NLI), the task of predicting an author’s guages in our work are English and French, re- first language based only on their second language spectively. writing (Malmasi and Dras, 2015b; Malmasi and The underlying motivation of our approach is Dras, 2015a; Malmasi and Dras, 2014). Nicolai et that many of the steps in this task, e.g. those re- al. (2013) developed new features for NLI based quired for WSD, are already performed by statis- on cognate interference and spelling errors. They tical machine translation systems and can thus be propose a new feature based on interference from deferred to such a pre-existing component. This cognates, positing that interference may cause a allows us to convert the text into an aligned trans- person to use a cognate from their native language lation followed by the application of word similar- or misspell a cognate under the influence of the ity measures for cognate identification. The three L1 version. For each misspelled English word, the steps in our method are described below. most probable intended word is determined using spell-checking software. The translations of this 4.1 Sentence Translation word are then looked up in bilingual English-L1 dictionaries for several of the L1 languages. If the In the first step we translate each sentence in a doc- spelling of any of these translations is sufficiently ument. This was done at the sentence-level to en- similar to the English version (as determined by sure that there is enough context information for the edit distance and a threshold value), then the effectively disambiguating the word senses.1 It word is considered to be a cognate from the lan- is also a requirement here that the translation in- guage with the smallest edit distance. The authors clude word alignments between the original input state that although only applying to four of the 11 and translated text. languages (French, Spanish, German, and Italian), For the machine translation component, we em- the cognate interference feature improves perfor- ployed the Microsoft Translator API.2 The service mance by about 4%. Their best result on the test is free to use3 and can be accessed via an HTTP was 81:73%. While limited by the availability of interface, which we found to be adequate for our dictionary resources for the target languages, this needs. The Microsoft Translator API can also ex- is a novel feature with potential for further use in pose word alignment information for a translation. NLI. An important issue to consider is that the au- We also requested access to the Google Trans- thors’ current approach is only applicable to lan- late API under the University Research Program, guages that use the same script as the target L2, but our query went unanswered. which is Latin and English in this case, and can- not be expanded to other scripts such as Arabic or 4.2 Word Alignment Korean. The use of phonetic dictionaries may be one potential solution to this obstacle. After each source sentence has been translated, the alignment information returned by the API is used to create a mapping between the words in the two 3 Data sentences. An example of such a mapping for a sentence from the test set is shown in Figure 1. The data used in this work was provided as part of the shared task. It consists of several English arti- This example shows a number of interesting cles divided into an annotated training set (11k to- patterns to note. We see that multiple words in the kens) as well as a test set (13k tokens) used for source can be mapped to a single word in the trans- evaluating the shared task. lation, and vice versa. Additionally, some words in the translation may not be mapped to anything in the original input. 4 Method 1We had initially considered doing this at the phrase-level, Our methodology is similar to those described but decided against this. 2 in x2, attempting to combine word sense disam- http://www.microsoft.com/en-us/ translator/translatorapi.aspx biguation with a measure of word similarity. Our 3For up to 2m characters of input text per month, which proposed method analyzes a monolingual text in a was sufficient for our needs. 139 The volunteers were picked to reflect a cross section of the wider population. Les volontaires ont été choisis pour refléter un échantillon représentatif de l'ensemble de la population. Figure 1: An example of word alignment between a source sentence from the test set (top) and its translation (bottom). 4.3 Word Similarity Comparison Using Jaro-Winkler Distance pr tp tp F 1 = 2 where p = ; r = In the final step, words in the alignment mappings p + r tp + fp tp + fn are compared to identify potential cognates using the word forms. Here p refers to precision and r is a measure of For this task, we adopt the Jaro-Winkler dis- recall.5 Results that maximize both will receive a tance which has been shown to work well for higher score since this measure weights both recall matching short strings (Cohen et al., 2003). This and precision equally. It is also the case that aver- measure is a variation of the Jaro similarity met- age results on both precision and recall will score ric (Jaro, 1989; Jaro, 1995) that makes it more ro- higher than exceedingly high performance on one bust for cases where the same characters in two measure but not the other. strings are positioned within a short distance of each other, for example due to spelling variations. 5 Results and Discussion The measure computes a normalized score between 0 and 1 where 0 means no similarity and Our results on the test set were submitted to the 1 denotes an exact match.

Load more