Normalization for Automated Metrics: English and Arabic Speech Translation

Sherri Condon*, Gregory A. Sanders†, Dan Parvaz*, Alan Rubenstein*, Christy Doran*, John Aberdeen*, and Beatrice Oshika*

*The MITRE Corporation, 7525 Colshire Drive, McLean, Virginia 22102
†National Institute of Standards and Technology, 100 Bureau Drive, Stop 8940, Gaithersburg, Maryland 20899–8940

{scondon, dparvaz, Rubenstein, cdoran, aberdeen, bea}@mitre.org / [email protected]

Abstract

The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program has experimented with applying automated metrics to speech translation dialogues. For translations into English, BLEU, TER, and METEOR scores correlate well with human judgments, but scores for translation into Arabic correlate with human judgments less strongly. This paper provides evidence to support the hypothesis that automated measures of Arabic are lower due to variation and inflection in Arabic by demonstrating that normalization operations improve correlation between BLEU scores and Likert-type judgments of semantic adequacy, as well as between BLEU scores and human judgments of the successful transfer of the meaning of individual content words from English to Arabic.

1 Introduction

The goal of the TRANSTAC program is to demonstrate capabilities for rapid development and fielding of two-way translation systems that enable speakers of different languages to communicate with one another in real-world tactical situations. The primary use case is conversations between US military personnel who speak only English and local civilians speaking only other languages.

The evaluation strategy adopted for TRANSTAC evaluations has been to conduct two types of evaluations: live evaluations in which users interact with the translation systems according to several different protocols, and offline evaluations in which the systems process audio recordings and transcripts of interactions. Details of the TRANSTAC evaluation methods are described in Weiss et al. (2008), Sanders et al. (2008), and Condon et al. (2008).

Because the inputs in the offline evaluation are the same for each system, we can analyze translations using automated metrics. Measures such as BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Translation Edit Rate (TER) (Snover et al., 2006), and Metric for Evaluation of Translation with Explicit word Ordering (METEOR) (Banerjee and Lavie, 2005) have been developed and widely used for translations of text and broadcast material, which have very different properties than dialog.

The TRANSTAC evaluations have provided an opportunity to explore the applicability of automated metrics to translation of spoken dialog and to compare these metrics to human judgments from a panel of bilingual judges. When comparing system-level scores (pooling all data from a given system), high correlations (typically above 0.9) have been obtained among BLEU, TER, METEOR, and scores based on human judgments (Sanders et al., 2008). When the data are more fine-grained than system-level, however, the correlations of the human judgments to the automated metrics for machine translation (MT) are much lower.
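To make the contrast between the two granularities concrete, the following sketch compares a system-level correlation (one pooled score per system) with an utterance-level correlation (one score per translated utterance). It is an illustration only, not the TRANSTAC evaluation code: the system names and score values are invented placeholders, and true corpus-level BLEU pools n-gram counts rather than averaging sentence scores as this toy does.

```python
# Illustrative sketch: system-level vs. utterance-level agreement between
# an automated metric and human adequacy judgments.  All values are
# hypothetical placeholders, not TRANSTAC data.
from statistics import mean
from scipy.stats import pearsonr

# Per-utterance metric scores and human adequacy judgments, keyed by system.
metric = {
    "sysA": [0.31, 0.52, 0.18, 0.44],
    "sysB": [0.22, 0.41, 0.15, 0.33],
    "sysC": [0.48, 0.61, 0.37, 0.55],
}
human = {
    "sysA": [3, 4, 2, 4],
    "sysB": [2, 3, 2, 3],
    "sysC": [4, 5, 3, 4],
}

systems = sorted(metric)

# System-level: collapse each system's utterances to a single score, then
# correlate the per-system pairs.  (Real corpus-level BLEU would pool
# n-gram counts instead of averaging sentence scores.)
sys_r, _ = pearsonr([mean(metric[s]) for s in systems],
                    [mean(human[s]) for s in systems])

# Utterance-level: correlate scores utterance by utterance across systems.
utt_metric = [x for s in systems for x in metric[s]]
utt_human = [x for s in systems for x in human[s]]
utt_r, _ = pearsonr(utt_metric, utt_human)

print(f"system-level r = {sys_r:.2f}, utterance-level r = {utt_r:.2f}")
```

With only a handful of pooled points, the system-level correlation is typically much higher than the utterance-level one, which is the pattern the paragraph above describes.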
The evaluations also offer a chance to study the results of applying automated MT metrics to languages other than English. Studies of the measures have primarily involved translation to English and other European languages related to English. The TRANSTAC data present some significant differences between the automated measures of translation into English vs. Arabic. In particular, the results produced by five TRANSTAC systems in July 2007 and by the best-scoring three of those five systems in June 2008 revealed that the correlations between the automated MT metrics and the human judgments are lower for translation into Arabic than for translation into English.

Another difference concerns the relative values of the automated measures. There is evidence from the human judgments that the systems' translations from English into Arabic are better than the translations from Arabic into English, and speech recognition error rates (word error rate) for English source-language utterances were much lower than for Arabic (Condon et al., 2008). Yet the scores from automated measures for translation from English to Arabic have consistently been significantly lower than for translation from Arabic to English.

We hypothesize that several features of Arabic are incompatible with assumptions that are fundamental to these automated measures of MT quality. These features of Arabic contrast with properties of English and most of the other Indo-European languages to which automated metrics have been applied. The consequence of these differences is that automated MT measures give inaccurate estimates of the success of translation into Arabic compared to languages like English: the estimates consistently correlate lower with human judgments of semantic adequacy and with human judgments of how successfully the meaning of content words is transferred from source to target language.

This report describes experiments we have conducted to assess the extent to which scores are affected by the features and the extent to which those effects can be mitigated by normalization operations applied to Arabic texts before computing automated measures.

2 Challenges for Automated Metrics

As automated measures are used more extensively, researchers learn more about their strengths and shortcomings, which allows the scores to be interpreted with greater understanding and confidence. Some of the limitations that have been identified for BLEU are very general, such as the fact that its precision-based scoring fails to measure recall, rendering it more like a document similarity measure (Culy and Riehemann, 2003; Lavie et al., 2004; Owczarzak et al., 2007). In addition to BLEU, the TRANSTAC program uses METEOR to score translations of the recorded scenarios with a measure that incorporates recall on the unigram level. METEOR and BLEU scores routinely have high correlation with each other. For those reasons, we will report only BLEU[1] results here.

[1] We use a variant of BLEU (bleu_babylon.pl) provided by IBM that produces the same result as the original IBM version of BLEU when there is no value of n for which there are zero matching n-grams. For situations where zero matches occur, this implementation uses a penalty of log(0.99 / number of n-grams in the hypothesis) to compute the final score. This modification is deemed an advantage when scoring individual sentences, because zero matches on longer n-grams are then fairly likely.

A known limitation of the BLEU metric is that it only indirectly captures sentence-level features by counting n-grams for higher values of n, but syntactic variation can produce translation variants that may not be represented in reference translations, especially for languages that have relatively free word order (Chatterjee et al., 2007; Owczarzak et al., 2007; Turian and Melamed, 2003). It is possible to run the BLEU metric on only unigrams, and as will be explained later, that ability appears to be important for accurately evaluating the advantages of the work in the current study.
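As a rough illustration of what a unigram-only score looks like, the sketch below computes a sentence-level BLEU restricted to unigrams: clipped unigram precision multiplied by the standard brevity penalty. It is a deliberate simplification, assuming a single reference, and it is not the bleu_babylon.pl implementation or its zero-match penalty described in the footnote above; the English example strings are invented.

```python
# Minimal sketch of sentence-level unigram BLEU (BLEU-1), single reference.
import math
from collections import Counter

def bleu1(hypothesis: list[str], reference: list[str]) -> float:
    if not hypothesis:
        return 0.0
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Clipped matches: each hypothesis token is credited at most as many
    # times as it occurs in the reference.
    matches = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    precision = matches / len(hypothesis)
    # Brevity penalty discourages very short hypotheses.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * precision

# Hypothetical example: 3 of 4 hypothesis unigrams match the reference.
print(bleu1("the program is new".split(), "the new program".split()))  # 0.75
```

Because the score uses only unigrams, it ignores word order entirely, which is why a unigram-level view is useful when reordering and reference variation would otherwise dominate the higher-order n-gram counts.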
Arabic, like other Semitic languages, has both a morphology and an orthography which are not immediately amenable to current approaches in automated MT scoring (and training, for that matter). All approaches to date make the following assumptions concerning the texts:

1. Ease of tokenization. Current scoring code assumes a relatively trivial means of tokenization, i.e., along white space and punctuation. Many languages, especially most Indo-European ones, orthographically separate articles and particles (prepositions, etc.). This means of tokenization isolates prepositions from noun phrases and object pronouns from verbs. In contrast, orthographic conventions in Arabic attach frequently used function words to the related content word.[2] As an example, the Arabic llbrnAmj ('to the program') consists of three separate elements (l- 'to', Al- 'the', brnAmj 'program'). So a scoring program encountering lbrnAmj ('to a program') without further tokenization would score it as entirely wrong. However, once properly stemmed and tokenized, it becomes clear that the only element missing is the definite article, which means that it is 2/3 correct (a brief code sketch of this appears below).

2. Concatenative morphology. In addition to morphological elements which are affixal in nature, Arabic has a morphology which interleaves roots, usually consisting of three or more consonants, with patterns (e.g., geminate the middle consonant and place a t- at the beginning) and characteristic vowels (a for perfect tense) to create new forms. So the root kfr (general…

[2] Arabic strings here are written according to Buckwalter notation (Habash et al., 2007); llbrnAmj = للبرنامج.

… and coherence, using n-gram matching (in BLEU) or other methods of tracking word order differences between system hypotheses and reference translations (METEOR, for example, looks at how many "chunks" of contiguous words match between hypothesis and reference translations). For languages with highly variable word order, reference translations may not (and often will not) capture all allowable orders, especially since translators may be influenced by the structure of the source text.
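Returning to the llbrnAmj example in item 1 above, the following toy sketch shows how segmenting just the two proclitics in that example changes the unigram credit from zero to 2/3. The segmentation rule is a minimal assumption invented for this illustration and covers only the preposition l- and the definite article Al- (whose alif is elided in the orthography after l-, so l + Al + X surfaces as llX); it is not the normalization procedure used in the experiments, and real Arabic normalization relies on morphological analysis rather than string stripping.

```python
# Toy illustration of clitic segmentation for the llbrnAmj / lbrnAmj example
# (Buckwalter transliteration).  Not a real Arabic segmenter.
from collections import Counter

def segment(token: str) -> list[str]:
    """Split off the preposition l- and the definite article Al-, if present."""
    pieces = []
    # preposition l- ('to')
    if token.startswith("l") and len(token) > 2:
        pieces.append("l")
        token = token[1:]
    # definite article Al- ('the'); after l- its alif is elided, so it
    # appears as a bare l in the orthography (l + Al + X -> llX)
    if token.startswith("Al") and len(token) > 3:
        pieces.append("Al")
        token = token[2:]
    elif pieces == ["l"] and token.startswith("l") and len(token) > 2:
        pieces.append("Al")
        token = token[1:]
    return pieces + [token]

def reference_coverage(hyp: list[str], ref: list[str]) -> float:
    """Fraction of reference elements matched by the hypothesis (clipped counts)."""
    hyp_c, ref_c = Counter(hyp), Counter(ref)
    matched = sum(min(hyp_c[tok], n) for tok, n in ref_c.items())
    return matched / len(ref)

ref, hyp = "llbrnAmj", "lbrnAmj"   # 'to the program' vs. 'to a program'

print(reference_coverage([hyp], [ref]))                # 0.0: whole tokens never match
print(segment(ref), segment(hyp))                      # ['l', 'Al', 'brnAmj'] ['l', 'brnAmj']
print(reference_coverage(segment(hyp), segment(ref)))  # 0.67: 2 of 3 elements recovered
```

The point of the sketch is only that, once the attached function words are separated, a unigram-level measure can give partial credit for the shared elements instead of treating the whole word as a mismatch.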