Adaptation of Brazilian Portuguese Texts to European Portuguese
Total Page:16
File Type:pdf, Size:1020Kb
BP2EP – Adaptation of Brazilian Portuguese texts to European Portuguese Lu´ıs Marujo 1,2,3, Nuno Grazina 1, Tiago Lu´ıs 1,2, Wang Ling 1,2,3, Lu´ısa Coheur 1,2, Isabel Trancoso 1,2 1 Spoken Language Laboratory / INESC-ID Lisboa, Portugal 2 Instituto Superior Tecnico,´ Lisboa, Portugal 3 Language Technologies Institute / Carnegie Mellon University, Pittsburgh, USA luis.marujo,ngraz,tiago.luis,wang.ling,luisa.coheur,imt @l2f.inesc-id.pt { } Abstract Portuguese (EP) translations. This discrepancy has origin in the Brazilian population size that is near This paper describes a method to effi- 20 times larger than the Portuguese population. ciently leverage Brazilian Portuguese re- This paper describes the progressive develop- sources as European Portuguese resources. ment of a tool that transforms BP texts into EP, in Brazilian Portuguese and European Por- order to increase the amount of EP parallel corpora tuguese are two Portuguese varieties very available. close and usually mutually intelligible, This paper is organized as follows: Section 2 de- but with several known differences, which scribes some related work; Section 3 presents the are studied in this work. Based on this main differences between EP and BP; the descrip- study, we derived a rule based system to tion of the BP2EP system is the focus of Section translate Brazilian Portuguese resources. 4. The following section describes an algorithm Some resources were enriched with mul- for extracting Multiword Lexical Contrastive pairs tiword units retrieved semi-automatically from SMT Phrase-tables; Section 6 presents the re- from phrase tables created using statisti- sults, and Section 7 concludes and suggests future cal machine translation tools. Our experi- work. ments suggest that applying our translation step improves the translation quality be- 2 Related Work tween English and Portuguese, relatively Few papers are available on the topic of improv- to the same process using the same re- ing Machine Translation (MT) quality by explor- sources. ing similarities in varieties, dialects and closely re- lated languages. 1 Introduction Altintas (2002) states that developing an MT Modern statistical machine translation (SMT) system between similar languages is much easier depends crucially on large parallel corpora and on than the traditional approaches, and that by putting the amount and specialization of the training data aside issues like word reordering and most of the for a given domain. Depending on the domain, semantics, which are probably very similar, it is such resources may sometimes be available in only possible to focus on more important features like one of the language varieties. For instance, class- grammar and the translation itself. This also al- room lecture and talk transcriptions such as the lows the creation of domains of closely related lan- ones that one can find in the MIT OpenCourse- guages which may be interchangeable and that, in Ware (OCW) website1, and the TED talks website2 this particular case, would allow, for instance, the (838 BP vs 308 EP) are examples of parallel cor- development of MT systems between English and pora where Brazilian Portuguese (BP) translations a set of Turkic languages instead of only Turkish. can be found much more frequently than European The system uses a set of rules, written in the XE- ROX Finite State Tools (XFST) syntax, which is c 2011 European Association for Machine Translation. 1http://ocw.mit.edu/OcwWeb based on Finite State Transducers (FST), to ap- 2http://www.ted.com/talk ply several morphological and grammatical adap- Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste (eds.) Proceedings of the 15th Conference of the European Association for Machine Translation, p. 129136 Leuven, Belgium, May 2011 tations from Turkish to Crimean Tatar. Results the BLEU score by 2.86 points (Nakov and Ng, showed that this approach does not cover all varia- 2009). tions possible in these languages, and that in some cases, there is no way of adapting the text with- 3 Corpora out an additional parser capable of determining if Several corpora were collected to use in our ex- an adaptation not covered by the rules should be periments: performed. – CETEMPublico corpus: collection of EP news- Other authors have developed very similar sys- paper articles 3 tems with identical approaches to translate from – CETEMFolha: collection of BP newspaper arti- Czech to Slovak (Hajicˇ et al., 2000), from Spanish cles 4 to Catalan (Canals-Marote et al., 2001)(Navarro et – 115 texts from the Zero Hora newspaper and 50 al., 2004), and from Irish to Gaelic Scottish (Scan- texts from the Folha Cienciaˆ da Folha de Sao˜ Paulo nell, 2006). Looking at Scannel’s system (2006) newspaper. This corpus includes about 2,200 sen- gives us a better understanding of the common sys- tences, 62,000 words, and it was initially presented tem architecture. This architecture consists of a in (Caseli et al., 2009). It has 3 versions (original pipeline of components, such as a Part-of-Speech and 2 simplified versions). A parallel original cor- tagger (POS-tagger), a Na¨ıve Bayes word sense pus version in EP was also created to allow a fair disambiguator, and a set of lexical and grammat- evaluation using very close translations. The un- ical transfer rules, based on bilingual contrastive availability of EP text simplification corpora also lexical and grammatical differences. motivated this choice. On a different level, (Nakov and Ng, 2009) de- – Ted Talks (T.T.): 761 T.T. in English were col- scribes a way of building MT systems for less- lected. From those, only 262 T.T. have a corre- resourced languages by exploring similarities with sponding EP version and 749 T.T. have a BP ver- closely related and languages with much more re- sion (Table 1). sources. More than allowing translation for less- resourced languages, this work also aims at allow- Language Nr. T.T. Nr. Sentences Nr. Words ing translation from groups of similar languages to EN 761 99,970 1,817,632 other groups of similar languages just like stated EP 262 29,284 512,233 earlier. This method proposes the merging of BP 749 95,872 1,706,223 bilingual texts and phrase-table combination in the training phase of the MT system. Merging bilin- Table 1: Description of the gathered Ted Talks cor- gual texts from similar languages (on the source pus. side), one with the less-resourced language and the other (much larger) with the extensive resource 4 Main Differences Between EP and BP language, provides new contexts and new align- Texts ments for words existing in the smaller text, in- The two Portuguese varieties involved in this creases lexical coverage on the source side and work are very close and usually mutually intelli- reduces the number of unknown words at trans- gible, but there are several sociolinguistics, ortho- lation time. Also, words can be discarded from graphic, morphologic, syntactic, and semantic dif- the larger corpus present in the phrase-table sim- ferences (Mateus, 2003). Furthermore, there are ply because the input will never match them (the also relevant phonetic differences that are beyond input will be in the low resource language). This of the scope of this work. approach is heavily based on the existence of a high number of cognates between the related lan- 4.1 Sociolinguistics Differences Silva (2008) made the point that language va- guages. Experiments performed when both ap- rieties contribute to the sociolinguistic variations. proaches are combined between the similar lan- As a matter of fact, such variations generate emo- guages show that extending Indonesian-English tive meanings (e.g.: pejorative terms), strict soci- translation models with Malaysian texts yielded a olinguistic meanings (e.g.: erudite, popular, and gain of 1.35 points in BLEU. Similar results are regional terms/expressions), discursive meanings obtained when improving Spanish-English transla- tion with larger Portuguese texts, which improved 3http://www.linguateca.pt/cetempublico 4http://www.linguateca.pt/cetenfolha/ 130 (e.g.: interjections, discourse markers) and ways possessive pronouns are frequently omitted, while of addressing people (e.g.: senhor, voceˆ, and tu). this is not correct in EP. Examples: In BP, voceˆ (you) is used as a personal pronoun – EN: I sold my car. when addressing someone, in the majority of situ- – BP: Vendi meu carro. ations, instead of tu (you) or its omission in EP. – EP: Vendi o meu carro. 4.2 Orthographic and Morphologic In addition, word expansions in BP are com- Differences The recent introduction (2009) of the ortho- monly word contractions in EP (Caseli et al., graphic agreement of 1990 5 mitigated some dif- 2009). The list of contractions was extracted from (Abreu and Murteira, 1994) and it is also ferences between the two varieties, but most ex- 6 istent linguistic resources were written using the available at Wikipedia . Examples: pre-agreement version. The germane orthographic – EN: He lived in that house. differences from EP to BP are: inclusion of muted – BP: Ele vivia em aquela casa. consonants; abolition of umlaut; and different ac- – EP: Ele vivia naquela casa. centuation in Proparoxytone words (words with 4.4 Lexical and Semantic Differences stress on third-to-last syllable), some Paroxytone In the same manner that Color and colour, or words (stress on the penultimate syllable) ending gas and petrol are examples of the differences be- in -n, -r, -s , -x; and words ending in -eica, and - tween American and British dialects, there are also oo (Teyssier, 1984). Examples: correlative lexical differences between BP and EP. – EN: project, water, tennis. At this level, there are innumerable differences be- – BP: projeto, ag´ ua,¨ tenis.ˆ tween this varieties. The origin of these differences – EP: projecto, agua,´ tenis.´ is also very varied, ranging from the influence of other languages, cultural differences and historical 4.3 Syntactic Differences Both varieties differ in their preferences when reasons.