Mining a Comparable Text Corpus for a Vietnamese - French Statistical Machine Translation System

Thi-Ngoc-Diep Do *,**, Viet-Bac Le *, Brigitte Bigi *, Laurent Besacier *, Eric Castelli **
* LIG Laboratory, CNRS/UMR-5217, Grenoble, France
** MICA Center, CNRS/UMI-2954, Hanoi, Vietnam

Proceedings of the 4th EACL Workshop on Statistical Machine Translation, pages 165–172, Athens, Greece, 30 March – 31 March 2009. © 2009 Association for Computational Linguistics

Abstract

This paper presents our first attempt at constructing a Vietnamese-French statistical machine translation system. Since Vietnamese is an under-resourced language, we concentrate on building a large Vietnamese-French parallel corpus. A document alignment method based on publication date, special words and sentence alignment results is proposed. The paper also presents an application of the obtained parallel corpus to the construction of a Vietnamese-French statistical machine translation system, where the use of different units for Vietnamese (syllables, words, or their combinations) is discussed.

1 Introduction

Over the past fifty years of development, machine translation (MT) has obtained good results when applied to several pairs of languages such as English, French, German, Japanese, etc. However, for under-resourced languages, a large gap remains. For instance, although Vietnamese is the 14th most widely used language in the world, research on MT for Vietnamese is very rare.

The earliest MT system for Vietnamese is the system from the Logos Corporation, developed as an English-Vietnamese system for translating aircraft manuals during the 1970s (Hutchins, 2001). Until now, in Vietnam, there are only four research groups working on MT for Vietnamese-English (Ho, 2005). However, the results are still modest.

MT research on Vietnamese-French is even rarer. Doan (2001) proposed a translation module for Vietnamese within ITS3, a multilingual MT system based on the classical analysis-transfer-generation approach. Nguyen (2006) worked on the Vietnamese language and Vietnamese-French text alignment. But no complete MT system for this pair of languages has been published so far.

There are many approaches for MT: rule-based (direct translation, interlingua-based, transfer-based), corpus-based (statistical, example-based) as well as hybrid approaches. We focus on building a Vietnamese-French statistical machine translation (SMT) system. Such an approach requires a parallel bilingual corpus for the source and target languages. Using this corpus, we build a statistical translation model for the source/target languages and a statistical language model for the target language. Then the two models and a search module are used to decode the best translation (Brown et al., 1993; Koehn et al., 2003).

Thus, the first task is to build a large parallel bilingual text corpus. This corpus can be described as a set of bilingual sentence pairs. At the moment, such a large parallel corpus for Vietnamese-French is unavailable. Nguyen (2006) presents a Vietnamese-French parallel corpus of law and economics documents. Our SMT system was trained using a Vietnamese-French news corpus created by mining a comparable bilingual text corpus from the Web.

Section 2 presents the general methodology of mining a comparable text corpus. We present an overview of document alignment methods and sentence alignment methods, and discuss the document alignment method we utilized, which is based on publishing date, special words, and sentence alignment results. Section 3 describes our experiments in automatically mining a multilingual news website to create a Vietnamese-French parallel text corpus. Section 4 presents our application to rapidly build Vietnamese-French SMT systems using the obtained parallel corpus, where the use of different units for Vietnamese (syllables, words, or their combination) is discussed. Section 5 concludes and discusses future work.

2 Mining a comparable text corpus

Munteanu and Marcu (2006) present a method for extracting parallel sub-sentential fragments from comparable bilingual corpora. However, this method needs an initial parallel bilingual corpus, which is not available for the Vietnamese-French language pair (in the news domain).

The overall process of mining a bilingual text corpus for use in an SMT system typically takes the following five steps (Koehn, 2005): raw data collection, document alignment, sentence splitting, tokenization and sentence alignment. This section presents the two main steps: document alignment and sentence alignment. We also discuss the proposed document alignment method.

2.1 Document alignment

Let S1 be a set of documents in language L1; let S2 be a set of documents in language L2. Extracting parallel documents or aligning documents from the two sets S1, S2 can be seen as finding the translation document D2 (in the set S2) of a document D1 (in the set S1). We call this pair of documents D1-D2 a parallel document pair (PDP).

For collecting bilingual text data for the two sets S1, S2, the Web is an ideal source as it is large, free and available (Kilgarriff and Grefenstette, 2003). For this kind of data, various methods to align documents have been proposed. Documents can be simply aligned based on the anchor link, the clue in the URL (Kraaij et al., 2003) or the web page structure (Resnik and Smith, 2003). However, this information is not always available or trustworthy. The titles of documents D1, D2 can also be used (Yang and Li, 2002), but sometimes they are completely different.

Another useful source of information is invariant words, such as named entities, dates, and numbers, which are often common in news data. We call these words special words. Patry and Langlais (2005) used numbers, punctuation, and entity names to measure the parallelism between two documents. The order of this information in the document is used as an important criterion. However, this order is not always respected in a PDP (see an example in Table 1).

Table 1. An example of a French-Vietnamese parallel document pair in our corpus.

French document:
Selon l'Administration nationale du tourisme, les voyageurs en provenance de l'Asie du Nord-Est (Japon, République de Corée,...) représentent 33 %, de l'Europe, 16 %, de l'Amérique du Nord, 13 %, d'Australie et de Nouvelle-Zélande, 6%. En outre, depuis le début de cette année, environ 2,8 millions de touristes étrangers ont fait le tour du Vietnam, 78 % d'eux sont venus par avion. Cela témoigne d'un afflux des touristes riches au Vietnam.…

Vietnamese document:
Trong số gần 2,8 triệu lượt khách quốc tế đến Việt Nam từ đầu năm đến nay, lượng khách đến bằng đường hàng không vẫn chiếm chủ đạo với khoảng 78 %. Điều này cho thấy, dòng khách du lịch chất lượng cao đến Việt Nam tăng nhanh. Theo thống kê thị khách quốc tế vào Việt Nam cho thấy khách Đông Bắc Á (Nhật Bản, Hàn Quốc) chiếm tới 33 %, châu Âu chiếm 16 %, Bắc Mỹ 13 %, Ôxtrâylia và Niu Dilân chiếm 6%.…

2.2 Sentence alignment

From a PDP D1-D2, the sentence alignment process identifies parallel sentence pairs (PSPs) between the two documents D1 and D2. For each D1-D2, we have a set SenAlignment_D1-D2 of PSPs.

SenAlignment_D1-D2 = {"sen1-sen2" | sen1 is zero/one/many sentence(s) in document D1, sen2 is zero/one/many sentence(s) in document D2, sen1-sen2 is considered as a PSP}.

We say that a PSP sen1-sen2 has alignment type m:n when sen1 contains m consecutive sentences and sen2 contains n consecutive sentences.

Several automatic sentence alignment approaches have been proposed based on sentence length (Brown et al., 1991) and lexical information (Kay and Roscheisen, 1993). A hybrid approach is presented in (Gale and Church, 1993), whose basic hypothesis is that "longer sentences in one language tend to be translated into longer sentences in the other language, and shorter sentences tend to be translated into shorter sentences". Some toolkits such as Hunalign (http://mokk.bme.hu/resources/hunalign) and Vanilla (http://nl.ijs.si/telri/Vanilla/) implement these approaches. However, they tend to work best when documents D1, D2 contain few sentence deletions and insertions, and mainly contain PSPs of type 1:1.

Ma (2006) provides open source software called Champollion to address this limitation. Champollion permits alignment type m:n (m, n = 0,1,2,3,4), so the length of a sentence does not play an important role. Champollion also uses lexical information (lexemes, stop words, bilingual dictionary, etc.) to align sentences. Champollion can easily be adapted to new pairs of languages. Available language pairs in Champollion are English-Arabic and English-Chinese (Ma, 2006).

2.3 Our document alignment method

Figure 1 describes our methodology for document alignment. For each document D1 in the set [...]

[...] we assume that D2 is published n days before or after D1. After filtering by the publishing date criterion, we obtain a subset S2' containing possible documents D2.

2.3.2 The second filter: special words

In our case, the special words are numbers and named entities. Not only numbers (0-9) but also attached symbols ('$', '%', '‰', ',', '.' …) are extracted from documents, for example: "12.000$"; "13,45"; "50%";… Named entities are specified by one or several words in which the first letter of each word is upper case, e.g. "Paris", "Nations Unies" in French.
