Improved Language Modeling for English-Persian Statistical Machine Translation

Mahsa Mohaghegh
Massey University, School of Engineering and Advanced Technology

Abdolhossein Sarrafzadeh
Unitec, Department of Computing

Tom Moir
Massey University, School of Engineering and Advanced Technology

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 75–82, COLING 2010, Beijing, August 2010.

Abstract

As interaction between speakers of different languages continues to increase, the ever-present problem of language barriers must be overcome. For this reason, automatic language translation (Machine Translation) has become an attractive area of research and development. Statistical Machine Translation (SMT) has been used for translation between many language pairs, the results of which have shown considerable success. The focus of this research is on the English/Persian language pair. This paper investigates the development and evaluation of the performance of a statistical machine translation system by building a baseline system using subtitles from Persian films. We present an overview of previous related work in English/Persian machine translation, and examine the available corpora for this language pair. We then show the results of the experiments of our system using an in-house corpus and compare the results we obtained when building a language model with different sized monolingual corpora. Different automatic evaluation metrics like BLEU, NIST and IBM-BLEU were used to evaluate the performance of the system on half of the corpus built. Finally, we look at future work by outlining ways of getting highly accurate translations as fast as possible.

1 Introduction

Over the 20th century, international interaction, travel and business relationships have increased enormously. With the entrance of the World Wide Web effectively connecting countries together over a giant network, this interaction reached a new peak. In the area of business and commerce, the vast majority of companies simply would not work without this global connection. However, with this vast global benefit comes a global problem: the language barrier. As the international connection barriers continually break down, the language barrier becomes a greater issue. The English language is now the world's lingua franca, and non-English speaking people are faced with the problem of communication and limited access to resources in English.

Machine translation is the process of using computers for translation from one human language to another (Lopez, 2008). This is not a recent area of research and development. In fact, machine translation was one of the first applications of natural language processing, with research work dating back to the 1950s (Cancedda, Dymetman, Foster, & Goutte, 2009). However, due to the complexity and diversity of human language, automated translation is one of the hardest problems in computer science, and highly successful results are uncommon.

There are a number of different approaches to machine translation. Statistical Machine Translation (SMT), however, seems to be the preferred approach of many industrial and academic research laboratories (Schmidt, 2007). The advantages of SMT compared to rule-based approaches lie in their adaptability to different domains and languages: once a functional system exists, all that has to be done in order to make it work with other language pairs or text domains is to train it on new data.

Research work on statistical machine translation systems began in the early 1990s. These systems, which are based on phrase-based approaches, operate using parallel corpora (huge databases of corresponding sentences in two languages), and employ statistics and probability to learn by example which translation of a word or phrase is most likely correct. The translation moves directly from source language to target language with no intermediate transfer step. In recent years, such phrase-based MT approaches have become popular because they generally show better translation results. One major factor for this development is the growing availability of large monolingual and bilingual text corpora for a number of languages.

The focus of this paper is on statistical machine translation for the English/Persian language pair. The statistical approach has been employed only in a few experimental translation attempts for this language pair, and is still largely undeveloped. This project is considered to be a challenge for several reasons. Firstly, the Persian language structure is very different in comparison to English; secondly, there has been little previous work done for this language pair; and thirdly, effective SMT systems rely on very large bilingual corpora, which are not readily available for the English/Persian language pair.

1.1 The Persian Language

The Persian language, also known as Farsi, belongs to the Indo-European language family and is one of the more dominant languages in parts of the Middle East. It is in fact the most widely spoken language in the Iranian branch of the Indo-Iranian languages, being the official language of Iran (Persia) and also spoken in Tajikistan and Afghanistan. There also exist large groups and communities in Iraq, United Arab Emirates, People's Democratic Republic of Yemen, Bahrain, and Oman, not to mention communities in the USA.

Persian uses a script that is written from right to left. It has similarities with Arabic but has an extended alphabet and different pronunciations from Arabic.

During its long history, the language has been influenced by other languages such as Arabic, Turkish and even European languages such as English and French. Today's Persian contains many words from these languages, and in some cases words from other languages still follow the grammar of their original language, particularly in building plural, singular or different verb forms. Because of the special and different nature of the Persian language compared to other languages like English, the design of SMT systems for Persian requires special considerations.

1.2 Related Work

Several MT systems have already been constructed for the English/Persian language pair.

One such system is the Shiraz project (Amtrup, et al., 2000). The Shiraz MT system is an MT prototype that translates text one way from Persian to English. The project began in 1997 and the final version was delivered in 1999.

The Shiraz corpus is a 10 MB, manually constructed, bilingually tagged Persian-to-English collection of about 50,000 words, developed using on-line material for testing purposes in a project at New Mexico State University. The system also comprises its own syntactic parser and morphological analyzer, and is focused on the translation of news stories as its domain.

Another English/Persian system was developed by Saedi, Motazadi, & Shamsfard (2009). This system, called PEnTrans, is a bidirectional text translator, comprising two main modules (PEnT1 and PEnT2) which translate in opposite directions (PEnT1 from English to Persian; PEnT2 from Persian to English). PEnT1 employs a combination of both corpus based and extended dictionary approaches, and PEnT2 uses a combination of rule, knowledge and corpus based approaches. PEnTrans introduced a new WSD method with a hybrid measure which evaluates different word senses in a

sentence and scores them according to their condition in the sentence, together with the placement of other words in that sentence.

ParsTranslator is a machine translation system built to translate English to Persian text. It was first released for public use in mid-1997, the latest update being the PTran version released in April 2004. ParsTran takes English text as input, either typed or from a file. The latest version is able to operate on over 1.5 million words and terminologies in English. It covers 33 fields of sciences, and is a growing translation service, with word banks being continually reviewed and updated, available at: http://www.ParsTranslator.Net/eng/index.htm .

Another English to Persian MT system is the rule-based system developed by Faili & Ghassem-Sani (2005). This system was based on tree adjoining grammar (TAG), and later improved by implementing trained decision trees as a word sense disambiguation module.

Mohaghegh et al. (2009) presented the first such attempt to construct a parallel corpus from BBC news stories. This corpus is intended to be an open corpus to which more text may be added as it is collected. This corpus was used to construct a prototype for the first statistical machine translation system. The problems encountered, especially with the process of alignment, are discussed in that research (Mohaghegh & Sarrafzadeh, 2009).

Most of these systems have largely used a rule based approach, and their BLEU scores on a standard data set have not been published. Nowadays, however, most large companies employ the statistical translation approach, using exceedingly large amounts of bilingual data (aligned sentences in two languages). A good example of this is perhaps the most well-known Persian/English MT system: Google Translate, which recently released an option for this language pair. Google's MT system is based on the statistical approach, and was made available online as a BETA version in June 2009.

The Transonics Spoken Dialogue Translator is also partially a statistically based machine translation system. The complete system operates using a speech to text converter, statistical language translation, and subsequent text to speech conversion. The actual translation unit operates in two modes: in-domain and out-of-domain. A classifier attempts to assign a concept to an utterance. If the object to be translated is within the translation domain, the system is capable of highly accurate translations. Where the object is outside the translation domain, the SMT method is used. Transonics is a translation system for a specific domain (medical: doctor-to-patient interviews), and only deals with question/answer situations (Ettelaie, et al., 2005).

Another speech-to-speech English/Persian machine translation system is suggested by Xiang et al. They present an unsupervised training technique to alleviate the problem of the lack of bilingual training data by taking advantage of available source language data (Xiang, Deng, & Gao, 2008).

However, there was no large corpus available at the time of development for either of these systems. For its specific domain, the Transonics translation system relied on a dictionary approach for translation rather than a parallel text corpus. Its statistical translation approach was merely used as a backup system.

2 Corpus Development for Persian

A corpus is defined as a large compilation of written text or audible speech transcript. Corpora, both monolingual and bilingual, have been used in various applications in computational linguistics and machine translation.

A parallel corpus is effectively two corpora in two different languages comprising sentences and phrases accurately translated and aligned together phrase to phrase. When used in machine translation systems, parallel corpora must be of a very large size (billions of sentences) to be effective. It is for this reason that the Persian language poses some difficulty. There is an acute shortage of digitally stored linguistic material, and few parallel online documents, making the construction of a parallel Persian corpus extremely difficult.

There are a few parallel Persian corpora that do exist. These vary in size, and in the domains they cover. One such corpus is FLDB1, which is a linguistic corpus consisting of approximately 3 million words in ASCII format. This corpus

was developed and released by Assi (1997) at the Institute for Humanities and Cultural Studies. This corpus version was updated in 2005, in 1256 character code page, and named PLDB2. This new updated version contains more than 56 million words, and was constructed with contemporary literary books, articles, magazines, newspapers, laws and regulations, transcriptions of news, reports, and telephone speeches for lexicography purposes.

Several corpora construction efforts have been made based on online Hamshahri newspaper archives. These include Ghayoomi (2004), with 6 months of Hamshahri archives to yield a corpus of 6.5 million words, and Darrudi, Hejazi, & Oroumchian (2004), with 4 years' worth of archives to yield a 37 million-word corpus.

The ‘Peykareh’ or ‘Text Corpus’ is a corpus of 38 million words developed by Bijankhan et al., available at: http://ece.ut.ac.ir/dbrg/bijankhan/ . It comprises newspapers, books, magazine articles and technical books, together with the transcription of dialogs, monologues, and speeches for language modeling purposes.

The Shiraz corpus (Amtrup, et al., 2000) is a bilingual tagged corpus of about 3000 aligned Persian/English sentences, also collected from the Hamshahri newspaper online archive and manually translated at New Mexico State University.

Another corpus, TEP (Tehran English-Persian corpus), available at: http://ece.ut.ac.ir/NLP/resources.htm , consists of 21,000 subtitle files obtained from www.opensubtitles.org. Subtitle pairs from multiple versions of the same movie were extracted, about 1,200 in total, and the files were then aligned using the dynamic programming method proposed by Itamar & Itai (2008). This method uses the timing information contained in subtitle files to align the text accurately. The end product is a parallel corpus of approximately 150,000 sentences, with 4,100,000 tokens in Persian and 4,400,000 tokens in English.

Finally, the European Language Resources Association (ELRA), at: http://catalog.elra.info/product_info.php?products_id=1111 , has constructed a corpus which consists of about 3,500,000 English and Persian words aligned at sentence level, to give approximately 100,000 sentences distributed over 50,021 entries. The corpus was originally constructed with SQL Server, but is presented as an Access type file. The format for the files is Unicode. This corpus covers several different domains, including art, culture, idioms, law, literature, medicine, poetry, politics, proverbs, religion, and science; it is available for sale online.

3 Statistical Machine Translation

3.1 General

Statistical machine translation (SMT) can be defined as the process of maximizing the probability of a sentence s in the source language matching a sentence t in the target language. In other words, "given a sentence s in the source language, we seek the sentence t in the target language such that it maximizes P(t | s) which is called the conditional probability or the chance of t happening given s" (Koehn, et al., 2007).

It is also referred to as the most likely translation. This can be more formally written as shown in equation (1).

$$\arg\max_{t} P(t \mid s) \tag{1}$$

Using Bayes Rule from equation (2), we can write equation (1) for the most likely translation as shown in equation (3). Since P(s) does not depend on t, it can be dropped from the maximization.

$$P(t \mid s) = \frac{P(t)\, P(s \mid t)}{P(s)} \tag{2}$$

$$\arg\max_{t} P(t \mid s) = \arg\max_{t} P(t)\, P(s \mid t) \tag{3}$$

Here t is the target sentence and s is the source sentence. P(t) is the target language model and P(s | t) is the translation model. The argmax operation is the search, which is done by a so-called decoder which is a part of a statistical machine translation system.

3.2 Statistical Machine Translation Tools

There are a number of implementations of subtasks and algorithms in SMT, and even software tools that can be used to set up a fully-featured state-of-the-art SMT system.
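As a rough illustration of the decision rule in equation (3), the following sketch scores a handful of candidate target sentences by log P(t) + log P(s | t) and keeps the best one. Everything in it is an assumption made for illustration: the two lookup tables, the romanized source sentence, the probability values and the function name `decode` are hypothetical and are not taken from the system described in this paper. A real decoder searches a far larger space of hypotheses.

```python
import math

# Hypothetical toy models: every probability below is invented for illustration.
# language_model[t] plays the role of P(t);
# translation_model[(s, t)] plays the role of P(s | t).
language_model = {
    "the house is small": 0.6,
    "small is the house": 0.1,
    "the home is little": 0.3,
}
translation_model = {
    ("khaneh kuchak ast", "the house is small"): 0.5,
    ("khaneh kuchak ast", "small is the house"): 0.7,
    ("khaneh kuchak ast", "the home is little"): 0.4,
}


def decode(source, candidates):
    """Return the candidate t that maximizes log P(t) + log P(s | t)."""
    def score(target):
        return (math.log(language_model[target])
                + math.log(translation_model[(source, target)]))
    return max(candidates, key=score)


best = decode("khaneh kuchak ast", list(language_model))
print(best)
```

Working in log space turns the product in equation (3) into a sum, which avoids numerical underflow once sentences and models become realistic in size.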

Moses (Koehn, et al., 2007) is an open-source statistical machine translation system which allows one to train translation models using GIZA++ (Och & Ney, 2004) for any given language pair for which a parallel corpus exists. This tool was used to build the baseline system discussed in this paper. Moses uses a beam search algorithm in which the translated output sentence is generated left to right in the form of hypotheses. Beam search is an efficient search algorithm which quickly finds a high-probability translation among the exponential number of choices.

The search begins with an initial state where no foreign input words are translated and no English output words have been generated. New states are created by extending the English output with a phrasal translation that covers some of the foreign input words not yet translated. In principle the algorithm can search exhaustively through all possible translations, but this becomes impractical when the data gets very large. The search can be optimized by discarding hypotheses that cannot be part of the path to the best translation. Furthermore, by comparing states, one can define a beam of good hypotheses and prune out hypotheses that fall out of this beam (Dean & Ghemawat, 2008).

3.3 Building a Baseline SMT System

To build a good baseline system it is important to build a sentence aligned parallel corpus which is spell-checked and grammatically correct for both the source and target language. The alignment of words or phrases turns out to be the most difficult problem SMT faces. Words and phrases in the source and target languages normally differ in where they are placed in a sentence. Words that appear on one language side may be dropped on the other. One English word may have as its counterpart a longer Persian phrase and vice versa. The accuracy of SMT relies heavily on the existence of large amounts of data, commonly referred to as a parallel corpus. The first step taken was to develop the parallel corpus. This corpus is intended to be an open corpus to which more text can be added as it is collected. Sentences were aligned using Microsoft's bilingual sentence aligner developed by Moore (2002).

The next step we plan to take involves the construction of a statistical prototype based on the largest available English/Persian parallel corpus extracted from the domain of movie subtitles. This domain was chosen because the maximum number of words that can be displayed as a subtitle on the screen is between 10 and 12, which means both training and decoding will be a lot faster. Building a parallel corpus for any domain is generally the most time consuming process as it depends on the availability of parallel text. The domain of subtitling, however, makes it easier to get the source language in the form of scripts and the target language in the form of subtitles in many different languages.

Figure 1. A typical SMT System

A language model (LM) is usually trained on large amounts of monolingual data in the target language to ensure the fluency of the language that the sentence is being translated into. Language modeling is used not only in machine translation but also in many other natural language processing applications, such as part-of-speech tagging. A statistical language model assigns probabilities to a sequence of words and tries to capture the properties of a language.

The Language Model (LM) for this study was trained on the BBC Persian News corpus and also an in-house corpus from different genres. The SRILM toolkit was used

to train a 5-gram LM for experimentation (Stolcke, 2002).

4 Experiments and Results

4.1 Experiment setup

We used Moses, a phrase-based SMT development tool, to construct our machine translation system. This included n-gram language models trained with the SRI language modeling tool, the GIZA++ alignment tool, the Moses decoder and the script to induce phrase-based translation models from word-based ones.

4.2 Performance evaluation metrics

A lot of research has been done in the field of automatic machine translation evaluation. Human evaluations of machine translation are extensive but expensive. They can take months to finish and involve human labor that cannot be reused. This is the main motivation for automatic machine translation evaluation, which is quick, inexpensive, and language independent.

One of the most popular metrics is called BLEU (BiLingual Evaluation Understudy), developed at IBM. The closer a machine translation is to a professional human translation, the better it is. This is the central idea behind the BLEU metric. NIST is another automatic evaluation metric. Its primary differences compared to BLEU are text pre-processing, a gentler length penalty, information-weighted N-gram counts and selective use of N-grams (Li, Callison-Burch, Khudanpur, & Thornton, 2009).

4.3 Discussion and analysis of the results

In this study, Moses was used to establish a baseline system. This system was trained and tested on three in-house corpora, the first of 817 sentences, the second of 1011 sentences, and the third of 2343 sentences. The data available was split into a training and a test set. Microsoft's bilingual sentence aligner (Moore, 2002) was used to align the corpus and training sets.

Aligning was also performed manually to aid in the improvement of the results. As the corpus increased in size, we performed various experiments, such as increasing the language model in each instance.

Test No.             EN/FA 1   EN/FA 2   EN/FA 3
Test sentences       817       1011      2343
Training sentences   864       1066      7005

Table 1. Size of test set and training set (language model). EN: English, FA: Farsi

Evaluation results from these experiments are presented in Tables 2, 3 and 4. With the two larger language models (Tables 3 and 4), BLEU scores improved as the size of the corpus increased, as expected. The BLEU scores themselves were quite low; however, this was expected due to the small size of the corpus. We plan to update and increase the corpus size in the near future, which we expect to yield more satisfactory results.

LM=864               BLEU      NIST      IBM-BLEU
Corpus size 817      0.1061    1.8218    0.0060
Corpus size 1011     0.0882    1.5338    0.0050
Corpus size 2343     0.0806    1.7364    0.0067

Table 2. Result obtained using Language Model size=864

LM=1066              BLEU      NIST      IBM-BLEU
Corpus size 817      0.0920    1.6838    0.0060
Corpus size 1011     0.0986    1.5301    0.0050
Corpus size 2343     0.1127    1.6961    0.0069

Table 3. Result obtained using Language Model size=1066

LM=7005              BLEU      NIST      IBM-BLEU
Corpus size 817      0.0805    1.6721    0.0063
Corpus size 1011     0.0888    1.5512    0.0051
Corpus size 2343     0.1148    1.7554    0.0071

Table 4. Result obtained using Language Model size=7005
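To make the idea behind the BLEU scores in Tables 2, 3 and 4 more concrete, the sketch below computes a simplified, sentence-level variant of the metric: clipped n-gram precision combined by a geometric mean, multiplied by a brevity penalty. It is only an approximation under stated assumptions. The function names `ngrams` and `simple_bleu`, the single reference, the limit of bigrams and the example sentences are all hypothetical, and this is not the evaluation script that produced the scores reported here. Standard BLEU is computed over a whole test set, usually with n-grams up to length 4.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU with uniform weights up to max_n."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) >= len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)


score = simple_bleu("the house is small", "the house is very small")
print(round(score, 4))
```

A candidate identical to the reference scores 1.0 and a candidate sharing no words with it scores 0.0, which is why the metric is best read as a relative comparison between systems evaluated on the same test set.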

The first test was performed on a corpus of 817 sentences in Persian and the same number for their aligned translation in English. In this instance, the training set used was 864 sentences. Results of this translation were evaluated using three evaluation metrics (BLEU, NIST, and IBM-BLEU). An excerpt from the output of this first experiment is shown in Figure 2(a).

The second test comprised a corpus of 1011 sentences, with a 1066 sentence training set. As can be seen, the evaluation metric results improved.

The same experiment was repeated for a third time, this time with an even larger corpus of 2343 sentences, and a training set of 7005 sentences. The result can be seen in Table 4. The results obtained in this test were close to those in the previous test, apart from a small increase in BLEU scores. It must be noted that BLEU is only a tool to compare different MT systems, so an increase in BLEU scores may not necessarily mean an increase in the accuracy of translation.

The performance of the baseline English-Persian SMT system was evaluated by computing BLEU, IBM-BLEU and NIST scores (Li, et al., 2009) against different sizes of the sentence aligned corpus and different sizes of the training set.

Tables 2, 3 and 4 show the results obtained with language models trained on 864, 1066 and 7005 sentences respectively, each evaluated on corpora of 817, 1011 and 2343 sentences.

Moreover, as shown in Table 3, using a corpus and language model of 1011 and 1066 in size respectively produces better results. This can be seen in the graph in Figure 2(b).

Finally, increasing the size of the corpus to 2343 sentences, with a language model constructed using 7005 sentences, produced the best translation results, as shown in both Figure 2(c) and Table 4. This data shows that an increased corpus size will yield an improved translation quality, but only as long as the size of the language model is proportional to the corpus size. The literature suggests that the size of the corpus, although important, does not have as great an effect as a corpus and language model that match the domain of translation (Ma & Way, 2009). In the Persian language, some problems and difficulties arise due to natural language ambiguities, anaphora resolution, idioms and differences in the types and symbols used for punctuation. These issues had to be resolved before any attempt at SMT could be made. It should also be stressed that the better the alignment, the better the results of the translation.

[Figure 2: three bar charts of BLEU, NIST and IBM-BLEU scores for corpus sizes 817, 1011 and 2343. Panel titles: (a) Trained on 864 sentences Language Model, (b) Trained on 1066 sentences Language Model, (c) Trained on 7005 sentences Language Model.]

Figure 2. (a) Results obtained using training size=864 (b) Results obtained using training size=1066 (c) Results obtained using training size=7005
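Since the experiments vary the number of sentences used to train the language model, a toy example may help show what that training involves and why more text matters. The sketch below estimates a bigram model with add-one smoothing from three invented sentences and compares a fluent word order with a scrambled one. It is an assumption-laden illustration only: the paper's system used a 5-gram model built with SRILM, and the helper names `train_bigram_lm` and `log_prob`, the smoothing choice and the training sentences are all hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under add-one smoothed bigram estimates."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        total += math.log((bigrams[(prev, word)] + 1)
                          / (unigrams[prev] + vocab_size))
    return total


# Invented training text standing in for a monolingual target-language corpus.
corpus = ["the house is small", "the house is old", "the garden is small"]
unigrams, bigrams = train_bigram_lm(corpus)
fluent = log_prob("the house is small", unigrams, bigrams)
scrambled = log_prob("small is house the", unigrams, bigrams)
print(fluent > scrambled)
```

With so little training text most bigrams are unseen and the estimates lean heavily on smoothing, which is one intuition for why a larger language model training set can help the decoder prefer fluent output.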

5 Future work

Although the available parallel corpora for the English/Persian language pair are significantly smaller than those for other language pairs, the future of statistical machine translation for this language pair looks promising. We have been able to procure several very large bilingual corpora, which we intend to combine with the open corpus we used in the original tests. With the use of a much larger bilingual corpus, we expect to produce a significantly higher evaluation metric score. Our planned immediate future work will consist of combining these corpora, addressing the task of corpus alignment, and continuing the use of a web crawler to obtain further bilingual text.

6 Conclusion

This paper presented an overview of some of the work in the area of English/Persian MT systems that has been done to date, and showed a set of experiments in which our SMT system was applied to the Persian language using a relatively small corpus. The first part of this work was to test how well our system translates from Persian to English when trained on the available corpora, and to identify and try to resolve problems with the process and the output produced. According to the results we obtained, it was concluded that a corpus of much greater size would be required to produce satisfactory results. Our experience with the smaller corpus shows us that for a large corpus, there will be a significant amount of work required in aligning sentences.

References

Amtrup, J., Laboratory, C. R., & University, N. M. S. (2000). Persian-English machine translation: An overview of the Shiraz project: Computing Research Laboratory, New Mexico State University.

Assi, S. (1997). Farsi linguistic database (FLDB). International Journal of Lexicography, 10(3), 5.

Cancedda, N., Dymetman, M., Foster, G., & Goutte, C. (2009). A Statistical Machine Translation Primer.

Darrudi, E., Hejazi, M., & Oroumchian, F. (2004). Assessment of a modern farsi corpus.

Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

Ettelaie, E., Gandhe, S., Georgiou, P., Knight, K., Marcu, D., Narayanan, S., et al. (2005). Transonics: A practical speech-to-speech translator for English-Farsi medical dialogues.

Faili, H., & Ghassem-Sani, G. (2005). Using Decision Tree Approach For Ambiguity Resolution In Machine Translation.

Itamar, E., & Itai, A. (2008). Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation.

Li, Z., Callison-Burch, C., Khudanpur, S., & Thornton, W. (2009). Decoding in Joshua. The Prague Bulletin of Mathematical Linguistics, 91, 47-56.

Lopez, A. (2008). Statistical machine translation.

Ma, Y., & Way, A. (2009). Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation.

Mohaghegh, M., & Sarrafzadeh, A. (2009). An analysis of the effect of training data variation in English-Persian statistical machine translation.

Moore, R. (2002). Fast and accurate sentence alignment of bilingual corpora. Lecture notes in computer science, 135-144.

Och, F., & Ney, H. (2004). The alignment template approach to statistical machine translation. Computational Linguistics, 30(4), 417-449.

Saedi, C., Motazadi, Y., & Shamsfard, M. (2009). Automatic translation between English and Persian texts.

Schmidt, A. (2007). Statistical Machine Translation Between New Language Pairs Using Multiple Intermediaries.

Stolcke, A. (2002). SRILM-an extensible language modeling toolkit.

Xiang, B., Deng, Y., & Gao, Y. (2008). Unsupervised training for farsi-english speech-to- .
