
Using Cognates in a French - Romanian Lexical Alignment System: A Comparative Study

Mirabela Navlea, Amalia Todiraşcu
Linguistique, Langues, Parole (LiLPa)
Université de Strasbourg
22, rue René Descartes, BP 80010, 67084 Strasbourg cedex
[email protected], [email protected]

Abstract

This paper describes a hybrid French - Romanian cognate identification module. This module is used by a lexical alignment system. Our cognate identification method uses lemmatized, tagged and sentence-aligned parallel corpora. It combines statistical techniques, linguistic information (lemmas, POS tags) and orthographic adjustments. We evaluate our cognate identification module and compare it to other methods using purely statistical techniques. We thus study the impact of the linguistic information and of the orthographic adjustments on the results of the cognate identification module and on cognate alignment. Our method obtains the best results in comparison with the other implemented statistical methods.

1 Introduction

We present a new French - Romanian cognate identification module, integrated into a lexical alignment system using French - Romanian parallel law corpora. We define cognates as translation equivalents having an identical form or sharing orthographic or phonetic similarities (common etymology, borrowings). Cognates are very frequent between closely related languages such as French and Romanian, two languages with a rich morphology, so they represent important lexical cues in a French - Romanian lexical alignment system.

Few linguistic resources and tools for Romanian (dictionaries, parallel corpora, MT systems) are currently available. Some lexically aligned corpora or lexical alignment tools (Tufiş et al., 2005) are available for Romanian - English or Romanian - German (Vertan and Gavrilă, 2010). Most of the identification modules used by these systems are purely statistical. As far as we know, no cognate identification method is available for French and Romanian.

Cognate identification is a difficult task due to the high orthographic similarity between bilingual pairs of words having different meanings. Inkpen et al. (2005) develop classifiers for French and English cognates based on several dictionaries and manually built lists of cognates. They distinguish between:
- cognates (liste (FR) - list (EN));
- false friends (blesser 'to injure' (FR) - bless (EN));
- partial cognates (facteur (FR) - factor or mailman (EN));
- genetic cognates (chef (FR) - head (EN));
- unrelated pairs of words (glace (FR) - ice (EN) and glace (FR) - chair (EN)).

Our cognate detection method identifies cognates, partial cognates and genetic cognates. This method is used especially to improve a French - Romanian lexical alignment system, so we aim at a high precision of our cognate identification method. Thus, we eliminate false friends and unrelated pairs of words by combining statistical techniques and linguistic information (lemmas, POS tags). We use a lemmatized, tagged and sentence-aligned parallel corpus. Unlike Inkpen et al. (2005), we do not use other external resources (dictionaries, lists of cognates).

To detect cognates in parallel corpora, several approaches exploit the orthographic similarity between the two words of a bilingual pair. An efficient method is the 4-gram method (Simard et al., 1992), which considers two words as cognates if their length is greater than or equal to 4 and at least their first 4 characters are common. Other methods exploit Dice's coefficient (Adamson and Boreham, 1974) or a variant of this coefficient (Brew and McKelvie, 1996). This measure computes the ratio between the number of common character bigrams of the two words and the total number of bigrams of the two words. Also, some methods use the Longest Common Subsequence Ratio (LCSR) (Melamed, 1999; Kraif, 1999). LCSR is computed as the ratio between the length of the longest common subsequence of ordered (and not necessarily contiguous) characters and the length of the longest word. Two words are considered as cognates if the LCSR value is greater than or equal to a given threshold. Similarly, other methods compute the edit distance between two words, i.e. the minimum number of substitutions, insertions and deletions needed to transform one word into the other (Wagner and Fischer, 1974). These methods use exclusively statistical techniques and are language independent. On the other hand, some methods use the phonetic distance between the two words of a bilingual pair (Oakes, 2000). Kondrak (2009) identifies three characteristics of cognates: recurrent sound correspondences, phonetic similarity and semantic affinity.

Thus, our method exploits orthographic and phonetic similarities between French - Romanian cognates. We combine n-gram methods with linguistic information (lemmas, POS tags) and several input data disambiguation strategies (computing cognates' frequencies, iterative extraction of the most reliable cognates and their deletion from the input data). Our method needs lemmatized, tagged and sentence-aligned parallel corpora, but it could easily be extended to other Romance languages. We designed our method to be integrated in a lexical alignment system, and we compare it with purely statistical methods to study the impact of the linguistic information and of the orthographic adjustments on the final results and on cognate alignment.

In section 2, we describe the parallel corpus used for our experiments. In section 3, we present the lexical alignment method. We describe our cognate identification module in section 4. We present the evaluation of our method and a comparison with other methods in section 5. Our conclusions and further work figure in section 6.

Proceedings of Recent Advances in Natural Language Processing, pages 247-253, Hissar, Bulgaria, 12-14 September 2011.

2 The Parallel Corpus

In our experiments, we use a legal parallel corpus (DGT-TM [1]) based on the Acquis Communautaire corpus. This multilingual corpus is available in 22 official languages of EU member states. It is composed of laws adopted by EU member states since 1950. DGT-TM contains 9,953,360 tokens in French and 9,142,291 tokens in Romanian.

We use a test corpus of 1,000 1:1 aligned complete sentences (starting with a capital letter and finishing with a punctuation sign). Each sentence has at most 80 words. This test corpus contains 33,036 tokens in French and 28,645 in Romanian.

We use the TTL tagger [2] available for Romanian (Ion, 2007) and for French (Todiraşcu et al., 2011) (as a Web service [3]). Thus, the parallel corpus is tokenized, lemmatized, tagged and annotated at chunk level. The tagger uses the set of morpho-syntactic descriptors (MSD) proposed by the Multext Project [4] for French (Ide and Véronis, 1994) and for Romanian (Tufiş and Barbu, 1997). In Figure 1, we present an example of TTL's output: the lemma attribute gives the lemmas of lexical units, the ana attribute provides morpho-syntactic information and the chunk attribute marks nominal and prepositional phrases.

Figure 1: TTL's output for French (in XCES format)
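As a minimal sketch of how TTL-style output of the kind shown in Figure 1 can be consumed, the snippet below reads token annotations from a simplified XCES-like fragment. The element name `w`, the sample MSD values, and the sample sentence are our assumptions for illustration; only the attribute names lemma, ana and chunk are taken from the figure's description.

```python
# Hypothetical, simplified XCES-like fragment; the real TTL output may differ.
import xml.etree.ElementTree as ET

SAMPLE = """<seg lang="fr">
  <w lemma="voir" ana="Vmps-s" chunk="Vp#1">vu</w>
  <w lemma="le" ana="Da-fs" chunk="Np#1">la</w>
  <w lemma="proposition" ana="Ncfs" chunk="Np#1">proposition</w>
</seg>"""

def read_tokens(xml_text):
    """Yield (surface, lemma, msd_prefix) triples. The first two characters
    of the morpho-syntactic tag are kept, as in the lemma disambiguation
    step of section 3 (change_Nc vs. change_Vm)."""
    root = ET.fromstring(xml_text)
    for w in root.iter("w"):
        yield w.text, w.get("lemma"), w.get("ana", "")[:2]

# Build disambiguated lemma_POS units from the annotated tokens.
pairs = [f"{lemma}_{msd}" for _, lemma, msd in read_tokens(SAMPLE)]
```

Applied to the sample, this yields units such as voir_Vm and proposition_Nc, the form used as GIZA++ input in section 3.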

Footnotes:
[1] http://langtech.jrc.it/DGT-TM.html
[2] TTL: Tokenizing, Tagging and Lemmatizing free running texts.
[3] https://weblicht.sfs.uni-tuebingen.de/
[4] http://aune.lpl.univ-aix.fr/projects/multext/

3 Lexical Alignment Method

The cognate identification module is integrated in a French - Romanian lexical alignment system (see Figure 2). In our lexical alignment method, we first use GIZA++ (Och and Ney, 2003), which implements the IBM models (Brown et al., 1993). These models build word-based alignments from aligned sentences: each source word has zero, one or more translation equivalents in the target language. As these models do not provide many-to-many alignments, we also use some heuristics (Koehn et al., 2003; Tufiş et al., 2005) to detect phrase-based alignments such as chunks: nominal, adjectival, verbal, adverbial or prepositional phrases.

In our experiments, we use the lemmatized, tagged and annotated parallel corpus described in section 2. Thus, we use lemmas and morpho-syntactic properties to improve the lexical alignment. Lemmas are followed by the first two characters of the morpho-syntactic tag. This operation morphologically disambiguates the lemmas (Tufiş et al., 2005). For example, the same French lemma change (= 'exchange', 'modify') can be a common noun or a verb: change_Nc vs. change_Vm. This disambiguation procedure improves the GIZA++ system's performance.

We build bidirectional alignments (FR - RO and RO - FR) with GIZA++ and intersect them (Koehn et al., 2003) to select the common alignments. To improve the word alignment results, we add an external list of cognates to the list of translation equivalents extracted by GIZA++. This list of cognates is built from parallel corpora by our own method (described in the next section). Also, to complete the word alignments, we use a French - Romanian dictionary of verbo-nominal collocations (Todiraşcu et al., 2008). These are multiword expressions composed of words related by lexico-syntactic relations (Todiraşcu et al., 2008). The dictionary contains the most frequent verbo-nominal collocations extracted from legal corpora.

To augment the recall of the lexical alignment method, we apply a set of linguistically motivated heuristic rules (Tufiş et al., 2005):
a) we define some POS affinity classes (a noun might be translated by a noun, a verb or an adjective);
b) we align content-words such as nouns, adjectives, verbs and adverbs, according to the POS affinity classes;
c) we align chunks containing translation equivalents aligned in a previous step;
d) we align elements belonging to chunks by linguistic heuristics. We developed a language dependent module applying 27 morpho-syntactic contextual heuristic rules (Navlea and Todiraşcu, 2010). These rules are defined according to the morpho-syntactic differences between French and Romanian.

The architecture of the lexical alignment system is presented in Figure 2.

Figure 2: Lexical alignment system architecture

4 Cognate Identification Module

In our hybrid cognate identification method, we use the legal parallel corpus described in section 2. This corpus is tokenized, lemmatized, tagged and sentence-aligned. We consider as cognates the bilingual word pairs respecting the linguistic conditions below:
1) their lemmas are translation equivalents in two parallel sentences;
2) their lemmas are identical or show orthographic or phonetic similarities;
3) they are content-words (nouns, verbs, adverbs, etc.) having the same POS tag or belonging to the same POS affinity class.

We filter out short words such as prepositions and conjunctions to limit noisy output, but we still detect short cognates such as il 'he' vs. el (personal pronouns) or cas 'case' vs. caz (nouns). We avoid ambiguous pairs such as lui 'him' (personal pronoun) (FR) vs. lui 'his' (possessive determiner) (RO), or ce 'this' (demonstrative determiner) (FR) vs. ce 'that' (relative pronoun) (RO).

To detect orthographic and phonetic similarities between cognates, we look at the beginning of the words and ignore their endings. We classify the French - Romanian cognates detected in the studied parallel corpus (at the orthographic or phonetic level) into several categories:
1) cross-lingual invariants (numbers, certain acronyms and abbreviations, punctuation signs);
2) identical cognates (document 'document' vs. document);
3) similar cognates:
   a) 4-grams (Simard et al., 1992): the first 4 characters of the lemmas are identical and the length of these lemmas is greater than or equal to 4 (autorité vs. autoritate 'authority');
   b) 3-grams: the first 3 characters of the lemmas are identical and the length of the lemmas is greater than or equal to 3 (acte vs. act 'paper');
   c) 8-bigrams: the lemmas have a common sequence of characters among the first 8 bigrams, and at least one character of each bigram is common to both words. This condition allows the jump over a non-identical character (souscrire vs. subscrie 'submit'). This method applies only to long lemmas (length greater than 7);
   d) 4-bigrams: the lemmas have a common sequence of characters among the first 4 bigrams. This method applies to long lemmas (length greater than 7) (homologué vs. omologat 'homologated') but also to short lemmas (length less than or equal to 7) (groupe vs. grup 'group').

We iteratively extract cognates by the identified categories. In addition, we use a set of orthographic adjustments and some input data disambiguation strategies. We compute the frequency of ambiguous candidates (the same source lemma occurring with several target candidates) and keep the most frequent candidate. At each iteration, we delete the reliably identified cognates from the input data.

We start by applying a set of empirically established orthographic adjustments between French and Romanian lemmas, such as diacritic removal, phonetic mapping detection, etc. (see Table 1).

| Level of orthographic adjustment | French | Romanian | Examples (FR - RO) |
|----------------------------------|--------|----------|--------------------|
| diacritics | x | x | dépôt - depozit |
| double contiguous letters | x | x | rapport - raport |
| consonant groups | ph | f [f] | phase - fază |
| | th | t [t] | méthode - metodă |
| | dh | d [d] | adhérent - aderent |
| | cch | c [k] | bacchante - bacantă |
| | ck | c [k] | stockage - stocare |
| | cq | c [k] | grecque - grec |
| | ch | ş [ʃ] | fiche - fişă |
| | ch | c [k] | chapitre - capitol |
| | q (final) | c [k] | cinq - cinci |
| | qu(+i) (medial) | c [k] | équilibre - echilibru |
| | qu(+e) (medial) | c [k] | marquer - marca |
| | qu(+a) | c(+a) [k] | qualité - calitate |
| | que (final) | c [k] | pratique - practică |
| intervocalic s | v + s + v | v + z + v | présent - prezent |
| w | w | v | wagon - vagon |
| y | y | i | yaourt - iaurt |

Table 1: French - Romanian cognate orthographic adjustments

While French uses an etymological writing and Romanian generally has a phonetic writing, we identify phonetic correspondences between lemmas. Then, we make some orthographic adjustments from French to Romanian. For example, the cognates stockage 'stock' (FR) vs. stocare (RO) become stocage (FR) vs. stocare (RO): the French consonant group ck [k] becomes c [k], as in Romanian. We also make adjustments in the ambiguous cases, by replacing with both variants (ch is [ʃ] or [k]): fiche vs. fişă 'sheet'; chapitre vs. capitol 'chapter'.

We aim to improve the precision of our method. Thus, we iteratively extract cognates by the identified categories, from the surest candidates to the less sure ones (see Table 2). To decrease the noise of the cognate identification method, we apply two supplementary strategies.

First, we filter out ambiguous cognate candidates (autorité - autoritate|autorizare) by computing their frequencies in the corpus and keeping the most frequent candidate pair. This strategy is very effective in augmenting the precision of the results, but it might decrease the recall in certain cases. Indeed, there are cases where French - Romanian cognates have one form in French but two different forms in Romanian (spécification 'specification' vs. specificare or specificaţie). We recover these pairs by using regular expressions based on specific lemma endings (ion (FR) vs. re|ţie (RO)).

Second, we delete the reliable cognate pairs (high precision) from the input data at the end of each extraction step. This helps us to disambiguate the input data. For example, the identical cognates transport vs. transport 'transportation', obtained in a previous extraction step and deleted from the input data, eliminate the occurrence of the candidate pair transport vs. tranzit as a 4-grams cognate in a later extraction step.

We apply the same method to cognates having POS affinity (N-V; N-ADJ). In this case, we keep only 4-grams cognates, due to the significant decrease of the precision for the other categories 3 (b, c, d).
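The French-side adjustments of Table 1 can be sketched as a small normalization function. This is a simplification under our own assumptions: it covers only a subset of the mappings, the rule order is ours, and the ambiguous ch case is handled by generating both variants, as the paper describes for fiche/chapitre.

```python
# Sketch of a subset of the Table 1 orthographic adjustments (our own
# simplification, not the authors' exact procedure).
import re
import unicodedata

# Unambiguous consonant-group mappings, French form -> Romanian-like form.
CONSONANT_GROUPS = [
    ("cch", "c"), ("ck", "c"), ("cq", "c"),
    ("ph", "f"), ("th", "t"), ("dh", "d"),
]

def strip_diacritics(word):
    # dépôt -> depot (Table 1: diacritic removal applies to both languages)
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def collapse_double_letters(word):
    # rapport -> raport
    return re.sub(r"(.)\1", r"\1", word)

def adjust_french(lemma):
    """Return the set of adjusted variants of a French lemma."""
    w = collapse_double_letters(strip_diacritics(lemma))
    for src, tgt in CONSONANT_GROUPS:
        w = w.replace(src, tgt)
    variants = {w}
    # Ambiguous "ch" ([ʃ] or [k]): keep both readings, as in
    # fiche -> fişă vs. chapitre -> capitol.
    if "ch" in w:
        variants.add(w.replace("ch", "s"))   # [ʃ]-like reading, simplified
        variants.add(w.replace("ch", "c"))   # [k] reading
    return variants
```

On the paper's own example, adjust_french("stockage") produces stocage, the adjusted form compared against Romanian stocare.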
| Extraction step (by cognate category) | Content-words / Same POS | Frequency | Deletion from the input data | Precision (%) |
|---------------------------------------|--------------------------|-----------|------------------------------|---------------|
| 1: cross-lingual invariants | | | x | 100 |
| 2: identical cognates | x | | x | 100 |
| 3: 4-grams (lemmas' length >= 4) | x | x | x | 99.05 |
| 4: 3-grams (lemmas' length >= 3) | x | x | x | 93.13 |
| 5: 8-bigrams (long lemmas, length > 7) | x | x | | 95.24 |
| 6: 4-bigrams (long lemmas, length > 7) | x | | | 75 |
| 7: 4-bigrams (short lemmas, length <= 7) | x | x | | 65.63 |

Table 2: Precision of the cognate extraction steps

5 Evaluation and Methods' Comparison

We evaluated our cognate identification module against a list of cognates built from the test corpus, containing 2,034 pairs of cognates. In addition, we compared the results of our method with the results provided by purely statistical methods (see Table 3). These methods are the following:

a) thresholding the Longest Common Subsequence Ratio (LCSR) of the two words of a bilingual pair. This measure computes the ratio between the length of the longest common subsequence of characters of the two words and the length of the longest word. We empirically establish the threshold of 0.68.

   LCSR(w1, w2) = length(longest_common_subsequence(w1, w2)) / max(length(w1), length(w2))

b) thresholding Dice's coefficient. We empirically establish the threshold of 0.62.

   DICE(w1, w2) = 2 × number_common_bigrams(w1, w2) / total_number_bigrams(w1, w2)

c) 4-grams: two words are considered as cognates if they have at least 4 characters and their first 4 characters are identical.
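The three baseline measures just defined can be sketched directly from their formulas. The function names and the dynamic-programming LCS routine are ours; the thresholds 0.68 and 0.62 are the empirically established values from the text.

```python
# Baseline cognate measures of section 5: LCSR, Dice's coefficient, 4-grams.

def lcs_length(w1, w2):
    """Length of the longest common subsequence (ordered, not necessarily
    contiguous characters) of two words, by standard dynamic programming."""
    prev = [0] * (len(w2) + 1)
    for c1 in w1:
        curr = [0]
        for j, c2 in enumerate(w2, 1):
            curr.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(w1, w2):
    return lcs_length(w1, w2) / max(len(w1), len(w2))

def bigrams(w):
    return [w[i:i + 2] for i in range(len(w) - 1)]

def dice(w1, w2):
    b1, b2 = bigrams(w1), bigrams(w2)
    if not b1 or not b2:
        return 0.0
    common, rest = 0, b2.copy()
    for b in b1:                 # count shared bigrams with multiplicity
        if b in rest:
            common += 1
            rest.remove(b)
    return 2 * common / (len(b1) + len(b2))

def cognates_lcsr(w1, w2, threshold=0.68):
    return lcsr(w1, w2) >= threshold

def cognates_dice(w1, w2, threshold=0.62):
    return dice(w1, w2) >= threshold

def cognates_4grams(w1, w2):
    return len(w1) >= 4 and len(w2) >= 4 and w1[:4] == w2[:4]
```

For instance, autorité vs. autoritate passes the 4-grams test (common prefix auto), while acte vs. act reaches an LCSR of 3/4 = 0.75, above the 0.68 threshold.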

We implemented these methods on the orthographically adjusted parallel corpus (see Table 1). Moreover, we evaluated the 4-grams method both on the initial parallel corpus and on the orthographically adjusted parallel corpus, to study the impact of the orthographic adjustments step on the quality of the results. These methods generally apply to words having at least 4 letters, in order to decrease the noise in the results. Cognates are searched for in aligned parallel sentences. Word characters are almost parallel (rembourser vs. rambursare 'refund').

| Methods | P (%) | R (%) | F (%) |
|-------------|-------|-------|-------|
| LCSR | 44.13 | 58.95 | 50.47 |
| DICE | 56.47 | 60.91 | 58.61 |
| 4-grams | 91.55 | 72.42 | 80.87 |
| Our method | 94.78 | 89.18 | 91.89 |

Table 3: Evaluation and methods' comparison; P = Precision; R = Recall; F = F-measure

Our method extracted 1,814 correct cognates from 1,914 provided candidates. It obtains the best scores (precision = 94.78%; recall = 89.18%; f-measure = 91.89%) in comparison with the other implemented methods. The 4-grams method obtains a high precision (90.85%) but a low recall (47.84%). The orthographic adjustments step significantly improves the recall of the 4-grams method, by 24.58% (see Table 4). This result is due to the specific properties of the law parallel corpus: many Romanian terms were borrowed from French, and these terms present high orthographic similarities.

| Methods | P (%) | R (%) | F (%) |
|-----------------------------|-------|-------|-------|
| 4-grams without adjustments | 90.85 | 47.84 | 62.68 |
| 4-grams with adjustments | 91.55 | 72.42 | 80.87 |

Table 4: Evaluation of the 4-grams method before and after the orthographic adjustments step

However, our method extracts some ambiguous candidates such as numéro 'number' - nume 'name' or compléter 'to complete' - compune 'to compose'. Some of these errors were avoided by keeping the most frequent candidate in the studied corpus, so the remaining errors mainly concern hapax candidates. Also, some cognates were not extracted: heure - oră 'hour', semaine - săptămână 'week', lieu - loc 'place'. These errors concern cognates sharing very few orthographic similarities.

The lowest scores are obtained by the LCSR method (f-measure = 50.47%), followed by Dice's coefficient (f-measure = 58.61%). These general methods produce much noise, due to the high orthographic similarities between words having different meanings. Their results might be improved by combining statistical techniques with linguistic information such as POS affinity, or by combining several association scores.

As we mentioned, the output of the cognate identification module is exploited by a French - Romanian lexical alignment system (based on GIZA++) described in section 3. We compared the set of cognates provided by GIZA++ with our results to study their impact on cognate alignment. GIZA++ extracted 1,532 cognates, representing a recall of 75.32% (see Table 5). Our cognate identification module significantly improved the recall, by 13.86%.

| Systems | Number of extracted cognates | Total number of cognates | Recall (%) |
|------------|------------------------------|--------------------------|------------|
| GIZA++ | 1,532 | 2,034 | 75.32 |
| Our method | 1,814 | 2,034 | 89.18 |

Table 5: Improvement of our method's recall

6 Conclusions and Further Work

We presented a French - Romanian cognate identification module required by a lexical alignment system. Our method combines statistical techniques and linguistic filters to extract cognates from a lemmatized, tagged and sentence-aligned parallel corpus. The use of linguistic information and orthographic adjustments significantly improves the results compared with purely statistical methods. However, these results depend on the studied languages, on the corpus domain and on the data volume. We need more experiments using corpora from other domains to be able to generalize. Our system should be improved to detect false friends by using external resources.

The cognate identification module will be integrated in a French - Romanian lexical alignment system. This system is part of a larger project aiming to develop a factored phrase-based statistical machine translation system for French and Romanian.

References

George W. Adamson and Jillian Boreham. 1974. The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Information Storage and Retrieval, 10(7-8):253-260.

Chris Brew and David McKelvie. 1996. Word-pair extraction for lexicography. In Proceedings of the International Conference on New Methods in Natural Language Processing, Bilkent, Turkey, pp. 45-55.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-312.

Nancy Ide and Jean Véronis. 1994. Multext (multilingual tools and corpora). In Proceedings of the 15th International Conference on Computational Linguistics, CoLing 1994, Kyoto, pp. 90-96.

Diana Inkpen, Oana Frunză, and Grzegorz Kondrak. 2005. Automatic Identification of Cognates and False Friends in French and English. In Proceedings of RANLP-2005, Bulgaria, September 2005, pp. 251-257.

Radu Ion. 2007. Metode de dezambiguizare semantică automată. Aplicaţii pentru limbile engleză şi română [Automatic semantic disambiguation methods. Applications for English and Romanian]. Ph.D. thesis, Romanian Academy, Bucharest, May 2007, 148 pp.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, May-June 2003, pp. 48-54.

Grzegorz Kondrak. 2009. Identification of Cognates and Recurrent Sound Correspondences in Word Lists. Traitement Automatique des Langues (TAL), 50(2):201-235.

Olivier Kraif. 1999. Identification des cognats et alignement bi-textuel: une étude empirique [Cognate identification and bi-text alignment: an empirical study]. In Actes de la 6ème conférence annuelle sur le Traitement Automatique des Langues Naturelles, TALN 99, Cargèse, 12-17 July 1999, pp. 205-214.

Dan I. Melamed. 1999. Bitext Maps and Alignment via Pattern Recognition. Computational Linguistics, 25(1):107-130.

Mirabela Navlea and Amalia Todiraşcu. 2010. Linguistic Resources for Factored Phrase-Based Statistical Machine Translation Systems. In Proceedings of the Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages, 7th International Conference on Language Resources and Evaluation, Valletta, Malta, May 2010, pp. 41-48.

Michael P. Oakes. 2000. Computer Estimation of Vocabulary in Protolanguage from Word Lists in Four Daughter Languages. Journal of Quantitative Linguistics, 7(3):233-243.

Franz J. Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.

Michel Simard, George Foster, and Pierre Isabelle. 1992. Using cognates to align sentences. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montréal, pp. 67-81.

Amalia Todiraşcu, Ulrich Heid, Dan Ştefănescu, Dan Tufiş, Christopher Gledhill, Marion Weller, and François Rousselot. 2008. Vers un dictionnaire de collocations multilingue [Towards a multilingual collocation dictionary]. Cahiers de Linguistique, 33(1):161-186, Louvain, August 2008.

Amalia Todiraşcu, Radu Ion, Mirabela Navlea, and Laurence Longo. 2011. French text preprocessing with TTL. In Proceedings of the Romanian Academy, Series A, 12(2):151-158, Bucharest, Romania, June 2011, Romanian Academy Publishing House.

Dan Tufiş and Ana Maria Barbu. 1997. A Reversible and Reusable Morpho-Lexical Description of Romanian. In Dan Tufiş and Poul Andersen (eds.), Recent Advances in Romanian Language Technology, pp. 83-93, Editura Academiei Române, Bucureşti, 1997.

Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu. 2005. Combined Aligners. In Proceedings of the Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, pp. 107-110, Ann Arbor, USA, Association for Computational Linguistics.

Cristina Vertan and Monica Gavrilă. 2010. Multilingual applications for rich morphology language pairs, a case study on German-Romanian. In Dan Tufiş and Corina Forăscu (eds.), Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, Romanian Academy Publishing House, Bucharest, pp. 448-460.

Robert A. Wagner and Michael J. Fischer. 1974. The String-to-String Correction Problem. Journal of the ACM, 21(1):168-173.
