Using Cognates in a French-Romanian
Total Page:16
File Type:pdf, Size:1020Kb
Using Cognates in a French - Romanian Lexical Alignment System: A Comparative Study Mirabela Navlea Amalia Todira şcu Linguistique, Langues, Parole (LiLPa) Linguistique, Langues, Parole (LiLPa) Université de Strasbourg Université de Strasbourg 22, rue René Descartes 22, rue René Descartes BP, 80010, 67084 Strasbourg cedex BP, 80010, 67084 Strasbourg cedex [email protected] [email protected] Romanian - German (Vertan and Gavril ă, 2010). Abstract Most of the cognate identification modules used by these systems are purely statistical. As far as we know, no cognate identification method is available for French and Romanian. This paper describes a hybrid French - Roma- Cognate identification is a difficult task due to nian cognate identification module. This mod- the high orthographic similarities between bilin- ule is used by a lexical alignment system. Our gual pairs of words having different meanings. cognate identification method uses lemma- tized, tagged and sentence-aligned parallel Inkpen et al. (2005) develop classifiers for corpora. This method combines statistical French and English cognates based on several techniques, linguistic information (lemmas, dictionaries and manually built lists of cognates. POS tags) and orthographic adjustments. We Inkpen et al. (2005) distinguish between: evaluate our cognate identification module and - cognates ( liste (FR) - list (EN)); we compare it to other methods using pure sta- - false friends ( blesser (‘to injure’) (FR) - bless tistical techniques. Thus, we study the impact (EN)); of the used linguistic information and the or- - partial cognates ( facteur (FR) - factor or mail- thographic adjustments on the results of the man (EN)); cognate identification module and on cognate - genetic cognates ( chef (FR) - head (EN)); alignment. Our method obtains the best results in comparison with the other implemented sta- - unrelated pairs of words ( glace (FR) - ice (EN) tistical methods. and glace (FR) - chair (EN)). Our cognate detection method identifies cog- 1 Introduction nates, partial and genetic cognates. This method is used especially to improve a French - Roma- We present a new French - Romanian cognate nian lexical alignment system. So, we aim to ob- identification module, integrated into a lexical tain a high precision of our cognate identification alignment system using French - Romanian pa- method. Thus, we eliminate false friends and rallel law corpora. unrelated pairs of words combining statistical We define cognates as translation equivalents techniques and linguistic information (lemmas, having an identical form or sharing orthographic POS tags). We use a lemmatized, tagged and or phonetic similarities (common etymology, sentence-aligned parallel corpus. Unlike Inkpen borrowings). Cognates are very frequent between et al. (2005), we do not use other external re- close languages such as French and Romanian, sources (dictionaries, lists of cognates). two Latin languages with a rich morphology. So, To detect cognates from parallel corpora, several they represent important lexical cues in a French approaches exploit the orthographic similarity - Romanian lexical alignment system. between two words of a bilingual pair. An effi- Few linguistic resources and tools for Romanian cient method is the 4-gram method (Simard et (dictionaries, parallel corpora, MT systems) are al. , 1992). This method considers two words as currently available. Some lexically aligned cor- cognates if their length is greater than or equal to pora or lexical alignment tools (Tufi ş et al ., 4 and at least their first 4 characters are common. 2005) are available for Romanian - English or Other methods exploit Dice’s coefficient (Adam- 247 Proceedings of Recent Advances in Natural Language Processing, pages 247–253, Hissar, Bulgaria, 12-14 September 2011. son and Boreham, 1974) or a variant of this coef- 2 The Parallel Corpus ficient (Brew and McKelvie, 1996). This meas- In our experiments, we use a legal parallel cor- ure computes the ratio between the number of 1 common character bigrams of the two words and pus (DGT-TM ) based on the Acquis Communau- the total number of two word bigrams. Also, taire corpus . This multilingual corpus is availa- some methods use the Longest Common Subse- ble in 22 official languages of EU member states. quence Ratio (LCSR) (Melamed, 1999; Kraif, It is composed of laws adopted by EU member 1999). LCSR is computed as the ratio between states since 1950. DGT-TM contains 9,953,360 the length of the longest common substring of tokens in French and 9,142,291 tokens in Roma- ordered (and not necessarily contiguous) charac- nian. ters and the length of the longest word. Thus, We use a test corpus of 1,000 1:1 aligned com- two words are considered as cognates if LCSR plete sentences (starting with a capital letter and value is greater than or equal to a given thre- finishing with a punctuation sign). The length of shold. Similarly, other methods compute the dis- each sentence has at most 80 words. This test tance between two words, which represents the corpus contains 33,036 tokens in French and 28,645 in Romanian. minimum number of substitutions, insertions and 2 deletions used to transform one word into anoth- We use the TTL tagger available for Romanian (Ion, 2007) and for French (Todira şcu et al ., er (Wagner and Fischer, 1974). These methods 3 use exclusevly statistical techniques and they are 2011) (as Web service ). Thus, the parallel cor- language independent. pus is tokenized, lemmatized, tagged and anno- On the other hand, other methods use the phonet- tated at chunk level. ic distance between two words belonging to a The tagger uses the set of morpho-syntactic de- scriptors (MSD) proposed by the Multext bilingual pair (Oakes, 2000). Kondrak (2009) 4 identifies three characteristics of cognates: recur- Project for French (Ide and Véronis, 1994) and rent sound correspondences, phonetic similarity for Romanian (Tufi ş and Barbu, 1997). In the and semantic affinity. Figure 1, we present an example of TTL’s out- Thus, our method exploits orthographic and pho- put: lemma attribute represents the lemmas of netic similarities between French - Romanian lexical units, ana attribute provides morpho- cognates. We combine n-grams methods with syntactic information and chunk attribute marks linguistic information (lemmas, POS tags) and nominal and prepositional phrases. several input data disambiguation strategies (computing cognates’ frequencies, iterative ex- <seg lang="FR"><s id="ttlfr.3"> traction of the most reliable cognates and their <w lemma ="voir" ana ="Vmps-s">vu</w> deletion from the input data). Our method needs <w lemma ="le" ana ="Da-fs" no external resources (bilingual dictionaries), so chunk ="Np#1">la</w> it could easily be extended to other Romance <w lemma ="proposition" ana ="Ncfs" languages. We aim to obtain a high accuracy of chunk ="Np#1">proposition</w> our method to be integrated in a lexical align- <w lemma ="de" ana ="Spd" ment system. We evaluate our method and we chunk ="Pp#1">de</w> compare it with pure statistical methods to study <w lemma ="le" ana ="Da-fs" the influence of used linguistic information on chunk ="Pp#1,Np#2">la</w> the final results and on cognate alignment. <w lemma ="commission" ana ="Ncfs" In the next section, we present the parallel corpo- chunk ="Pp#1,Np#2">Commission ra used for our experiments. In section 3, we </w> present the lexical alignment method. We also <c>;</c> describe our cognate identification module in </s></seg> section 4. We present the evaluation of our me- thod and a comparison with other methods in Figure 1 TTL’s output for French (in XCES format) section 5. Our conclusions and further work fig- ure in section 6. 1 http://langtech.jrc.it/DGT-TM.html 2 Tokenizing, Tagging and Lemmatizing free running texts 3 https://weblicht.sfs.uni-tuebingen.de/ 4 http://aune.lpl.univ-aix.FR/projects/multext/ 248 3 Lexical Alignment Method c) we align chunks containing translation equivalents aligned in a previous step; The cognate identification module is integrated d) we align elements belonging to chunks in a French - Romanian lexical alignment system by linguistic heuristics. We develop a (see Figure 2). language dependent module applying 27 In our lexical alignment method, we first use morpho-syntactic contextual heuristic GIZA++ (Och and Ney, 2003) implementing rules (Navlea and Todira şcu, 2010). IBM models (Brown et al ., 1993). These models These rules are defined according to build word-based alignments from aligned sen- morpho-syntactic differences between tences. Indeed, each source word has zero, one or French and Romanian. more translation equivalents in the target lan- The architecture of the lexical alignment system guage. As these models do not provide many-to- is presented in the Figure 2. many alignments, we also use some heuristics (Koehn et al ., 2003; Tufi ş et al ., 2005) to detect phrase-based alignments such as chunks: nomin- al, adjectival, verbal, adverbial or prepositional phrases. In our experiments, we use the lemmatized, tagged and annotated parallel corpus described in section 2. Thus, we use lemmas and morpho- syntactic properties to improve the lexical align- ment. Lemmas are followed by the two first cha- racters of morpho-syntactic tag. This operation morphologically disambiguates the lemmas (Tufi ş et al ., 2005). For example, the same French lemma change (= exchange, modify ) can be a common noun or a verb: change_Nc vs. change_Vm. This disambiguation procedure im- proves the GIZA++ system’s performance. We realize bidirectional alignments (FR - RO and RO - FR) with GIZA++, and we intersect them (Koehn et al ., 2003) to select common alignments. To improve the word alignment results, we add Figure 2 Lexical alignment system architecture an external list of cognates to the list of the trans- lation equivalents extracted by GIZA++. This list 4 Cognate Identification Module of cognates is built from parallel corpora by our In our hybrid cognate identification method, we own method (described in the next section). use the legal parallel corpus described in section Also, to complete word alignments, we use a 2.