Thai Poetry Machine Translation to English Automated Evaluation VS Human Post-Edit
The 2nd International Conference on Technical Education

Sajjaporn Waijanya
Faculty of Information Technology
King Mongkut's University of Technology North Bangkok, Bangkok, Thailand
[email protected]

Anirach Mingkhwan
Faculty of Industrial Technology and Management
King Mongkut's University of Technology North Bangkok, Prachinburi, Thailand
[email protected]

Abstract— Poetry machine translation output is very important: the output of poetry translation should still be poetry. The evaluation of a translation presents the quality of the translator, but evaluation is itself still a research topic. The reference is very important for automated evaluation methods, and it is very difficult to prepare an original reference for poetry. The result of human post-editing can be used as that reference. This paper focuses on the Thai poetry type "Klonn-Pad" and aims to translate it into English while keeping the terms of its prosody. The results of forward and backward translation are compared by automatic evaluation and by semi-automatic evaluation using human post-editing. The human post-edit evaluation uses HBLEU (Human-Targeted Bilingual Evaluation Understudy), HMETEOR (Human-Targeted Metric for Evaluation of Translation with Explicit Ordering) and HTER (Human-Targeted Translation Error Rate). The HBLEU score equals 0.977 for the forward translation and 0.982 for the backward translation. The HMETEOR F-measure equals 0.896 for forward and 0.922 for backward. The HTER score equals 0.195 for forward and 0.167 for backward. Based on this study, it can be concluded that human-targeted post-editing can create the references that are necessary for poetry machine translation.

Keywords— Thai poetry translation; Human Post-edit; Poetry Translator Evaluation

I. INTRODUCTION

Poetry translation is a constant challenge for both linguistics researchers and machine translation (MT) researchers. Robert Lee Frost (March 26, 1874 – January 29, 1963) [1], an American poet, said that "poetry is what gets lost in translation". When a human translator or an MT system translates a poem from the original language into the destination language, it should be translated so that the result is still poetry. However, it is very difficult to keep the original meaning together with the aesthetics of the language. Moreover, each form of poetry has a specific prosody. How to translate a poem from the original language into the destination language while preserving the original prosody is therefore very important. The challenge lies not only in the translation technique but also in the evaluation tools needed to assess the translation result.

The evaluation of phrase-based content or prose is concerned with the quality of meaning correctness and the percentage of grammatical completion (part of speech), but the evaluation of poetry translation is different: it is concerned not only with the correctness of the meaning but also with the quality of the prosody. In the case of poetry translation, Thai poetry has a complex prosody and the Thai language is more complex than many other languages. Even though many researchers try to translate poetry, there are only a few published human translations of Thai poetry into other languages, and none in the MT world.

The authors' previous work presented an algorithm for Thai poetry translation to English with prosody keeping [2] and analyzed the quality of the model against statistical MT using BLEU and METEOR with forward and backward translation [3, 4, 5]. BLEU and METEOR are automatic evaluation tools and require a reference for comparison and calculation. However, BLEU and METEOR cannot score the aesthetics of the language or the beauty of the words, and preparing an English reference of Thai poetry produced by an expert translator is very difficult, because it must be translated into English while keeping the Thai poetry prosody. Another way to create the reference is through post-editing. HTER (Human-targeted Translation Error Rate) is a methodology for evaluating the quality of a translator in which the result of post-editing by a human is used as the reference.

This research presents the design of a post-edit interface. The results of automatic evaluation, calculated against the original reference, are compared with the human post-edit: the evaluation methods BLEU, METEOR and TER are compared to HBLEU, HMETEOR and HTER.
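To make the HTER idea concrete, the following minimal Python sketch (not taken from the paper) treats the human post-edited sentence as the reference and scores the MT output by word-level edit distance normalized by the reference length. Full TER additionally counts shifts of word sequences, which are omitted here, and the example sentences are hypothetical.

# Minimal sketch of HTER-style scoring: the human post-edited output serves as
# the reference, and the score is word-level edit distance divided by the
# reference length. Block shifts, which full TER also counts, are omitted.

def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over word tokens (insert / delete / substitute)."""
    rows, cols = len(hyp_tokens) + 1, len(ref_tokens) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]

def hter(mt_output, post_edited):
    """Edits needed to turn the MT output into its human post-edit,
    normalized by the length of the post-edited reference."""
    hyp, ref = mt_output.split(), post_edited.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

if __name__ == "__main__":
    mt = "the moon light shine on river water"       # hypothetical MT output
    pe = "the moonlight shines on the river water"   # hypothetical human post-edit
    print(f"HTER = {hter(mt, pe):.3f}")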
II. RELATED WORK

A. The Best Lexical Metric for Phrase-Based Statistical MT System Optimization

Daniel Cer, Christopher D. Manning and Daniel Jurafsky published "The Best Lexical Metric for Phrase-Based Statistical MT System Optimization" [6]. The paper compares different evaluation metrics (BLEU, METEOR, NIST, TER) on two training sets: an Arabic-to-English system and a Chinese-to-English system. The authors systematically explore these issues for the four most popular metrics available to the MT community. They examine how well models perform both on the metric on which they were trained and on the alternative metrics. Multiple models are trained using each metric in order to determine the stability of the resulting models. Selected models are scored by human judgment in order to determine how the performance differences obtained by tuning to different automated metrics relate to actual human preferences.

The human results indicate that models trained on edit-distance metrics such as WER and TERp tend to produce lower quality translations than BLEU- or NIST-trained models. Tuning to METEOR works out reasonably well for Chinese, but it is not a good choice for Arabic. Edit-distance models tend to do poorly when evaluated on other metrics, as do models trained using METEOR. However, training models to METEOR can be made more robust by setting α to 0.5, which balances the importance the metric assigns to precision and recall.

B. Estimating Machine Translation Post-Editing Effort with HTER

Lucia Specia and Atefeh Farzindar [7] described an approach to estimate translation post-editing effort at the sentence level in terms of Human-targeted Translation Error Rate (HTER), based on a number of features reflecting the difficulty of translating the source sentence and the discrepancies between the source and translation sentences. HTER is a simple metric, and obtaining HTER-annotated data can be made part of the translation workflow. They use Pearson's correlation to evaluate the performance of the CE (Confidence Estimation) system. Their main dataset consists of English translations of French legal documents; for English to Spanish, the Spanish translations of English sentences were taken from the Europarl development and test sets provided by WMT08 (Callison-Burch et al., 2008). They then compare TER with human scores, where the quality scores are assigned by professional translators to each translation on a scale of 1 to 4. Their conclusion is that it is possible to obtain good performance with simpler and cheaper annotations by collecting a small set of machine translations and their post-edited versions and computing HTER, a semi-automatic translation error rate metric.

C. Exploiting Objective Annotations for Measuring Translation Post-editing Effort

Lucia Specia [8] continued this research on post-edit measurement using two datasets. The first is the French-to-English news-test2009 set, with 2,525 French news sentences and their Moses translations into English (corpus-level BLEU = 0.2447). The other is the English-to-Spanish news-test2010 set, with 1,000 English news sentences and their Moses translations into Spanish (corpus-level BLEU = 0.2830).

The researcher uses standard HTER, which looks for exact matches, while the Confidence Estimation framework in this research uses a Support Vector Machines regression algorithm. Three CE models were trained for each language pair using a random subset of 90% of the source–translation sentence pairs. The researcher plans to use crowd-sourcing mechanisms to include other datasets in future studies and to ensure the quality of the post-editing by including multiple post-editors and reviewers for each dataset.
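As a rough illustration of the confidence-estimation setup summarized above, the sketch below trains a Support Vector Machine regressor to predict sentence-level HTER from a few surface features and reports Pearson's correlation between predicted and annotated scores. The features and the toy data are illustrative placeholders, not the feature set or corpora of the cited work, and a real experiment would score held-out sentence pairs (the cited study trains on a random 90% subset).

# Hedged sketch of confidence estimation for HTER: an SVR model predicts
# sentence-level HTER from simple surface features and is checked with
# Pearson's correlation. Features and data are illustrative only.
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR

def surface_features(source, translation):
    """A few simple sentence-pair features (lengths and their ratio)."""
    src, tgt = source.split(), translation.split()
    return [len(src), len(tgt), len(tgt) / max(len(src), 1)]

# Hypothetical (source, MT output, HTER) triples, used only to make the example runnable.
toy_data = [
    ("la lune brille sur la riviere", "the moon shine on river", 0.30),
    ("le vent souffle dans les arbres", "the wind blows in the trees", 0.05),
    ("les etoiles dansent dans le ciel", "stars dance sky", 0.45),
    ("la nuit est calme et froide", "the night is calm and cold", 0.00),
    ("un oiseau chante au matin", "a bird sing at the morning", 0.20),
    ("la pluie tombe sur le toit", "rain fall on roof", 0.40),
]

X = np.array([surface_features(src, mt) for src, mt, _ in toy_data])
y = np.array([score for _, _, score in toy_data])

model = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X, y)
predicted = model.predict(X)            # a real study would predict on held-out pairs
r, _ = pearsonr(predicted, y)
print("predicted HTER:", predicted.round(3), "Pearson r:", round(float(r), 3))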
III. TOOLS AND METHODOLOGY

A. BLEU (Bilingual Evaluation Understudy)

BLEU (Bilingual Evaluation Understudy) [9] is an algorithm for evaluating the quality of text that has been translated by a machine translator from one natural language to another. Quality is considered to be the correspondence between the machine's output and that of a human. BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. The BLEU score is defined in equations (1) and (2):

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)    (1)

where p_n is the modified n-gram precision (so the exponential term is the geometric mean of p_1, p_2, ..., p_N) and BP is the brevity penalty, with c the length of the MT hypothesis (candidate) and r the length of the reference:

\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,(1 - r/c)} & \text{if } c \le r \end{cases}    (2)

In this baseline, N = 4 and uniform weights w_n = 1/N are used.

B. METEOR: an IR-inspired metric

METEOR (Metric for Evaluation of Translation with Explicit Ordering) [10] is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features not found in other metrics, such as stemming and synonymy matching, along with standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric and also to produce good correlation with human judgment at the sentence or segment level; this differs from BLEU, which seeks correlation at the corpus level. The alignment is a set of mappings between unigrams.

C. TER (Translation Error Rate)

TER measures the number of edits required to change the machine translation output into the reference translation. Possible edits include the insertion, deletion, and substitution of single words, as well as shifts of word sequences.
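The sketch below, which is not part of the original paper, makes equations (1) and (2) concrete under the assumption of whitespace tokenization: bleu() implements the brevity penalty and the geometric mean of modified n-gram precisions with N = 4 and w_n = 1/N, and meteor_fmean() shows only the core of METEOR, the harmonic mean of unigram precision and recall with recall weighted higher; stemming, synonym matching and the fragmentation penalty of the full metric are left out. With a human post-edited sentence supplied as the reference, the same calls yield HBLEU- and HMETEOR-style scores.

# Minimal sketches of BLEU (equations (1) and (2)) and of METEOR's weighted
# harmonic mean of unigram precision and recall. Established toolkits would
# normally be used for real evaluations; this only illustrates the formulas.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Candidate n-gram counts clipped by their maximum count in any reference."""
    cand = Counter(ngrams(candidate, n))
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    c = len(candidate)
    ref_lens = [len(ref) for ref in references]
    r = min(ref_lens, key=lambda rl: (abs(rl - c), rl))     # effective reference length
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))      # equation (2)
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0                                          # any zero precision collapses the geometric mean
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))  # equation (1)

def meteor_fmean(candidate, reference):
    """Harmonic mean of unigram precision and recall, recall weighted 9:1 (exact matches only)."""
    cand, ref = Counter(candidate), Counter(reference)
    matches = sum((cand & ref).values())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return 10 * precision * recall / (recall + 9 * precision)

if __name__ == "__main__":
    hyp = "the moonlight shines softly on the river tonight".split()   # hypothetical MT output
    ref = "the moonlight shines softly upon the river tonight".split() # hypothetical reference or post-edit
    print(f"BLEU  = {bleu(hyp, [ref]):.3f}")
    print(f"Fmean = {meteor_fmean(hyp, ref):.3f}")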