Normalization for Automated Metrics: English and Arabic Speech Translation

Sherri Condon*, Gregory A. Sanders†, Dan Parvaz*, Alan Rubenstein*, Christy Doran*, John Aberdeen*, and Beatrice Oshika*

*The MITRE Corporation
7525 Colshire Drive
McLean, Virginia 22102
{scondon, dparvaz, Rubenstein, cdoran, aberdeen, bea}@mitre.org

†National Institute of Standards and Technology
100 Bureau Drive, Stop 8940
Gaithersburg, Maryland 20899–8940
[email protected]

Abstract

The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program has experimented with applying automated metrics to speech translation dialogues. For translations into English, BLEU, TER, and METEOR scores correlate well with human judgments, but scores for translation into Arabic correlate with human judgments less strongly. This paper provides evidence to support the hypothesis that automated measures of Arabic are lower due to variation and inflection in Arabic by demonstrating that normalization operations improve correlation between BLEU scores and Likert-type judgments of semantic adequacy, as well as between BLEU scores and human judgments of the successful transfer of the meaning of individual content words from English to Arabic.

1 Introduction

The goal of the TRANSTAC program is to demonstrate capabilities for rapid development and fielding of two-way translation systems that enable speakers of different languages to communicate with one another in real-world tactical situations. The primary use case is conversations between US military personnel who speak only English and local civilians speaking only other languages.

The evaluation strategy adopted for TRANSTAC evaluations has been to conduct two types of evaluations: live evaluations in which users interact with the translation systems according to several different protocols, and offline evaluations in which the systems process audio recordings and transcripts of interactions. Details of the TRANSTAC evaluation methods are described in Weiss et al. (2008), Sanders et al. (2008), and Condon et al. (2008).

Because the inputs in the offline evaluation are the same for each system, we can analyze translations using automated metrics. Measures such as BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Translation Edit Rate (TER) (Snover et al., 2006), and Metric for Evaluation of Translation with Explicit word Ordering (METEOR) (Banerjee and Lavie, 2005) have been developed and widely used for translations of text and broadcast material, which have very different properties than dialog.

The TRANSTAC evaluations have provided an opportunity to explore the applicability of automated metrics to translation of spoken dialog and to compare these metrics to human judgments from a panel of bilingual judges. When comparing system-level scores (pool all data from a given system), high correlations (typically above 0.9) have been obtained among BLEU, TER, METEOR, and scores based on human judgments (Sanders et al., 2008). When the data are more fine-grained than system-level, however, the correlations of the human judgments to the automated metrics for machine translation (MT) are much lower.

The evaluations also offer a chance to study the results of applying automated MT metrics to languages other than English. Studies of the measures have primarily involved translation to English and other European languages related to English.

The TRANSTAC data present some significant differences between the automated measures of translation into English vs. Arabic. In particular, the results produced by five TRANSTAC systems in July 2007 and by the best-scoring three of those five systems in June 2008 revealed that the correlations between the automated MT metrics and the human judgments are lower for translation into Arabic than for translation into English.

Another difference concerns the relative values of the automated measures. There is evidence from the human judgments that the systems' translations from English into Arabic are better than the translations from Arabic into English, and speech recognition error rates (word error rates) for English source-language utterances were much lower than for Arabic (Condon et al., 2008). Yet the scores from automated measures for translation from English to Arabic have consistently been significantly lower than for translation from Arabic to English.

We hypothesize that several features of Arabic are incompatible with assumptions that are fundamental to these automated measures of MT quality. These features of Arabic contrast with properties of English and most of the other Indo-European languages to which automated metrics have been applied. The consequence of these differences is that automated MT measures give inaccurate estimates of the success of translation into Arabic compared to languages like English: the estimates consistently correlate lower with human judgments of semantic adequacy and with human judgments of how successfully the meaning of content words is transferred from source to target language.

This report describes experiments we have conducted to assess the extent to which scores are affected by these features and the extent to which those effects can be mitigated by normalization operations applied to Arabic texts before computing automated measures.

2 Challenges for Automated Metrics

As automated measures are used more extensively, researchers learn more about their strengths and shortcomings, which allows the scores to be interpreted with greater understanding and confidence. Some of the limitations that have been identified for BLEU are very general, such as the fact that its precision-based scoring fails to measure recall, rendering it more like a document similarity measure (Culy and Riehemann, 2003; Lavie et al., 2004; Owczarzak et al., 2007). In addition to BLEU, the TRANSTAC program uses METEOR to score translations of the recorded scenarios with a measure that incorporates recall on the unigram level. METEOR and BLEU scores routinely have high correlation with each other. For those reasons, we will report only BLEU¹ results here.

A known limitation of the BLEU metric is that it only indirectly captures sentence-level features by counting n-grams for higher values of n, but syntactic variation can produce translation variants that may not be represented in reference translations, especially for languages that have relatively free word order (Chatterjee et al., 2007; Owczarzak et al., 2007; Turian and Melamed, 2003). It is possible to run the BLEU metric on only unigrams, and as will be explained later, that ability appears to be important for accurately evaluating the advantages of the work in the current study.

Arabic, like other Semitic languages, has both a morphology and an orthography which are not immediately amenable to current approaches in automated MT scoring (and training, for that matter). All approaches to date make the following assumptions concerning the texts:

1. Ease of tokenization. Current scoring code assumes a relatively trivial means of tokenization, i.e., along white space and punctuation. Many languages, especially most Indo-European ones, orthographically separate articles and particles (prepositions, etc.). This means of tokenization isolates prepositions from noun phrases and object pronouns from verbs. In contrast, orthographic conventions in Arabic attach frequently used function words to the related content word.² As an example, the Arabic llbrnAmj ('to the program') consists of three separate elements (l- 'to', Al- 'the', brnAmj 'program'). So a scoring program encountering lbrnAmj ('to a program') without further tokenization would score it as entirely wrong. However, once properly stemmed and tokenized, it becomes clear that the only element missing is the definite article, which means that it is 2/3 correct.

¹ We use a variant of BLEU (bleu_babylon.pl) provided by IBM that produces the same result as the original IBM version of BLEU when there is no value of n for which there are zero matching n-grams. For situations where zero matches occur, this implementation uses a penalty of log(0.99/# of n-grams in the hypothesis) to compute the final score. This modification is deemed an advantage when scoring individual sentences, because zero matches on longer n-grams are then fairly likely.

² Arabic strings here are written according to Buckwalter notation (Habash et al., 2007): llbrnAmj = للبرنامج.

2. Concatenative morphology. In addition to morphological elements which are affixal in nature, Arabic has a morphology which interleaves roots, usually consisting of three or more consonants, with patterns (e.g., geminate the middle consonant and place a t- at the beginning) and characteristic vowels (a for perfect tense) to create new forms. So the root kfr (general semantic area: 'sacrilege', 'blasphemy') combined with the nominal pattern taCCiyC (generally, 'causing one to do X') results in the surface form takfiyr ('accusation of blasphemy'). Not only is this interleaving pattern used for coining words, it is the preferred method of forming masculine plurals.

3. Non-defective script. Languages written with Roman script have some orthographic representation (however imperfect) of both vowels and consonants, which aids both in the dictionary lookup process and in stemming or lemmatizing. Arabic is written in a defective script in which most vowels (the so-called "short" vowels) are usually unwritten. Therefore, it is often difficult to determine with any certainty whether two similarly written forms are actually the same word (e.g., the Arabic ktAb might either be kitAb 'book' or kut~Ab 'writers'). Determining which form is which on an automated basis, when possible, will depend on paying careful attention to usage, which the scoring programs generally do not do.

4. Uniform orthography. Although short vowels are typically not represented in the orthography, they may be rendered using diacritic notations. The number of distinct forms in which a word may occur is multiplied by these diacritics, by other diacritics that are variably included in Arabic spellings, and by additional orthographic variation that is unique to specific characters and morphemes. Consequently, measures that depend on exact matching of word forms may fail to match forms that differ in superficial ways.

5. Constrained word order. Arabic word order is not as free as in some languages, but it is definitely more variable than in languages like English. Automated measures depend on word order to provide indirect assessments of fluency and coherence, using n-gram matching (in BLEU) or other methods of tracking word order differences between system hypotheses and reference translations (METEOR, for example, looks at how many "chunks" of contiguous words match between hypothesis and reference translations). For languages with highly variable word order, reference translations may not (and often will not) capture all allowable orders, especially since translators may be influenced by the structure of the source text.

The normalization experiments reported here do not solve all of the problems of applying automated measures to languages like Arabic. However, they do provide some estimates of the degree to which these problems influence scores obtained by automated measures, as well as promising directions for resolving some of the problems.

3 English-Arabic Directional Asymmetry

Evaluation data analyzed in this paper is recorded audio input from human speakers engaged in dialog scenarios. The TRANSTAC scenarios have included checkpoints, searches, infrastructure surveys (sewer, water, electricity, trash, etc.), training, medical screening, inspection of facilities, and recruiting for emergency service professionals.

The gold-standard metrics for translation adequacy are commonly deemed to be judgments from a panel of bilingual human judges. In TRANSTAC, we have a panel of five bilingual judges for each evaluation, and we obtained utterance-level judgments of semantic adequacy, initially on the four-value scale in Figure 1 and later on the seven-value scale in Figure 2. The seven-value scale has an explicit numeric interpretation as equally-spaced values, and the numeric interpretation was presented to the judges during their training (that is, the judges knew we would interpret the seven values as equally-spaced). The judges were instructed that when they were torn between two of the labeled choices, they should choose the unlabeled choice between them.

- Completely adequate
- Tending adequate
- Tending inadequate
- Inadequate

Figure 1: Four-value scale for semantic adequacy

+3 Completely adequate
+2
+1 Tending adequate
 0
–1 Tending inadequate
–2
–3 Inadequate

Figure 2: Seven-value scale for semantic adequacy

We also had a highly literate native speaker of each source language mark the content words (nouns, verbs, adjectives, adverbs, important prepositions and quantifiers) in the source utterances, and asked bilingual judges to say whether the meaning of each of these pre-identified content words was successfully transferred in the translation; we then calculated the probability of successful transfer of content words (Sanders et al., 2008).

Both methods of obtaining human judgments of semantic adequacy result in higher scores for translations from English into Arabic than for translations from Arabic into English. Data from the June 2008 evaluation, using the seven-point scale for semantic adequacy, averaged approximately one point higher for translations into Arabic. Earlier evaluations using the four-point scale showed the same pattern, as illustrated by Figure 3.

The same contrast holds for the human judgments that assess whether each content word in the English input was successfully translated, deleted, or substituted in the system output: the scores are higher for English to Arabic than for Arabic to English translations (Sanders et al., 2008). Moreover, the live evaluations provide a similar pattern of results: humans judged that system performance was better for translation from English to Arabic than from Arabic to English (Condon et al., 2008). Therefore, all of the evaluations involving human judges produce directional asymmetries that suggest translations into Arabic are better than translations into English.

One automated measure, the word error rate (WER) from the speech-recognition stage, suggests that translation from English to Arabic should be better than translation from Arabic to English: the average WER of the top 3 systems in the June 2008 evaluation was 13.5 for English and 31.1 for Arabic. This should account for some of the superior performance of the translations into Arabic, because it is difficult for machine translation to overcome speech recognition errors.

Yet Figure 3 shows that the BLEU scores exhibit the opposite asymmetry: scores for translation from English to Arabic are considerably lower than scores from Arabic to English, and this pattern holds for METEOR and TER scores, too. Though it can be argued that the values of these scores should not be compared across languages, the concern is that these differences reflect serious flaws in the measures for languages like Arabic. In fact, the automated measures achieve higher correlations with human judgments for translation to English than for translation to Arabic.

4 Normalization for English and Arabic

We hypothesize that the primary responsibility for the contrasts in directional asymmetry between automated measures and human judgments lies in the features of Arabic described in section 2. Human judges are capable of ignoring minor variation in order to comprehend the meaning of language in context. The examples in (1) illustrate that even in the absence of context, errors in inflectional morphology do not prevent communication of the sender's message.

[Figure 3: July 2007 BLEU Scores Compared to Proportions Judged Completely or Tending Adequate by Bilinguals. Two panels (Arabic to English speech; English to Arabic speech) plot BLEU score (0–0.4) and percent judged adequate (0–100) for speech translation systems A–E.]

(1) a. two book (two books)
    b. Him are my brother. (He is my brother)

In contrast, scores from automated MT metrics computed with reference to the correct versions in parentheses would be low because the inflected forms do not match.

For many Arabic strings, a complete morphological analysis is not possible without taking context into account because the surface forms are ambiguous, but a complete morphological analysis is not required to provide forms that can be matched by automated measures. We began by applying two types of normalization to both the English and Iraqi Arabic dialogs.

Rule-based normalization, referred to as Norm1, focuses on orthographic variation. For Arabic, a Perl script reduces seven types of variation by deleting or replacing variants of characters with a single form. Table 1 lists the seven types, provides examples of each, and describes the normalization operation that is applied in Norm1. These seven are acceptable orthographic variants in written Arabic and may therefore occur in the reference translations. For English, the rules include operations that transform letters to lowercase, replace periods with underscores in abbreviations, replace hyphens with spaces, and expand contractions. The latter include forms such as this'll, what'll, must've, who're, and shouldn't.

Type of Variation | Example | Normalization Operation
Short vowel / shadda inclusions | جمهورية written with vs. without diacritics | Delete vowel and shadda diacritics
Explicit nunation inclusions | أحياناً vs. أحيانا | Delete nunation diacritics
Omission of the hamza | شي vs. شيء | Delete hamza
Misplacement of the seat of the hamza | الطوارء vs. الطوارئ | Delete hamza
Variations where taa marbuta should be used | بالجمجمه vs. بالجمجمة | Replace taa marbuta with haa
Confusion between yaa and alif maksura | شى vs. شي | Replace alif maksura with yaa
Initial alif with or without hamza/madda/wasla | إسم vs. اسم | Replace with bare alif

Table 1: Orthographic Normalization Operations Used in Norm1 for Iraqi Arabic

The second type of normalization, referred to as Norm2, is inspired by the normalization operations that NIST uses to compute word error rate (WER) for evaluation of automatic speech recognition. It is standard practice for NIST to normalize system outputs and reference transcriptions when computing WER, though it is not standard to apply similar operations when computing automated measures of translation quality. In addition to rule-based normalization operations such as replacing hyphens with spaces, NIST uses a global lexical mapping (GLM) that allows contractions and reduced forms such as wanna to match the corresponding un-contracted and unreduced forms.

For Iraqi Arabic, the contractor that processes TRANSTAC training data produced a list of variant spellings of Arabic words from the transcription files. Most of the variants were caused by orthographic variation that is addressed in Norm1, so that Norm2 tends to be redundant with Norm1 for Arabic. But there are a few misspellings and typographical errors that are not corrected in Norm1 (e.g., شقد vs. شكد, بيها vs. بهاء).

For English, regular contractions are expanded by the Norm1 rules, but the GLM includes forms without apostrophes such as arent. Reduced forms include gotta, gonna, 'til, 'cause, and 'em. Because the contractor, Appen Ltd., is an Australian firm, some British spellings such as centre and vandalise are also included. Other mappings link various forms of abbreviations (e.g., C.P.R.), spelling errors in the reference texts, and spellings of Arabic names. Finally, the mapping separates nouns from the form 's when these occur in reference transcriptions and translations. Where appropriate, these were later hand-normalized as contractions. Ambiguous contractions with 'd were also normalized by hand into the appropriate forms with would or had.

For the additional normalizations of Arabic that are the focus of this paper, we referred to the work of Larkey et al. (2007). They experimented with a variety of stemmers and morphological analyzers for Arabic to improve information retrieval scores. We produced a modified version of light10 for our use.
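Before turning to the light10-based normalizations, the Norm1 operations in Table 1 are simple enough to state as character rewrites. The sketch below is only an illustration of that idea, not the Perl script used in the evaluation: it assumes the Arabic has been transliterated to Buckwalter notation (the notation used for examples in this paper), and the specific mappings are our reading of Table 1.

```python
import re

# Illustrative sketch of Norm1-style orthographic normalization, assuming
# Buckwalter-transliterated Arabic input (an assumption of this sketch; the
# evaluation itself used a Perl script over Arabic text).

DIACRITICS = re.compile(r"[aiuo~FNK`]")  # short vowels, sukun, shadda, nunation, dagger alif

def norm1_token(token: str) -> str:
    token = DIACRITICS.sub("", token)       # delete vowel, shadda, and nunation diacritics
    token = re.sub(r"[><|{]", "A", token)   # alif with hamza/madda/wasla -> bare alif
    token = token.replace("Y", "y")         # alif maksura -> yaa
    token = token.replace("p", "h")         # taa marbuta -> haa
    token = token.replace("}", "y").replace("&", "w").replace("'", "")  # drop hamza, keep its seat
    return token

def norm1(text: str) -> str:
    return " ".join(norm1_token(t) for t in text.split())

if __name__ == "__main__":
    # The section 2 example: kitAb 'book' and kut~Ab 'writers' both reduce to
    # the undiacritized surface form ktAb.
    print(norm1("kitAb kut~Ab"))  # -> "ktAb ktAb"
```

Because references and hypotheses are normalized identically, rewrites of this kind can only merge variant spellings; they never introduce new mismatches.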

At the beginnings of words, light10 removes the conjunction wa (و), the definite article al (ال), the prepositions bi (ب), li (ل), and fi (ف), and the form ka (ك), which is used like English like or as but is grammatically like a noun. The forms are removed only if followed by the definite article al, which is removed only if the remainder of the word is at least 2 characters long. These constraints minimize the possibility of removing characters which are actually part of the word. The conjunction may be removed without a following al, but only if the remainder of the word is at least 3 characters long.

These forms are not prefixes in the sense of bound morphemes attached at the beginnings of words. They are independent words that are conventionally spelled as part of the following word. In contrast, all the suffixes removed by light10 are bound morphemes. The suffixes that are removed are listed in Table 2. Norm1 renders some of these forms indistinct before the normalizations based on light10 are applied.

Suffix | Arabic Morphological Features When Attached to Verbs (V) and Nouns (N)
ها (hA) | V: 3rd person singular feminine object; N: possessive pronominal
ان (An) | N: Dual number
ات (At) | N: Feminine plural
ون (wn) | N: Nominative masculine plural; V: subject agreement
ين (yn) | N: Oblique masculine plural
يه (yh) | V: 3rd person singular masculine object; N: possessive pronominal
ية (yp) | N: Feminine nisba adjective, attributive
ه (h) | V: 3rd person singular masculine object; N: possessive pronominal
ة (p) | N: feminine singular (or singular of mass/collective noun)
ي (y) | N: 1st person singular possessive pronoun, nisba adjective marker

Table 2: Suffixes Removed in Light10 Stemming (Buckwalter transliteration in parentheses)

The light10 stemmer is "light" because there is no attempt to remove other morphemes such as the prefixes that express aspect and subject agreement on verbs or the infixes that indicate plural nouns. Our primary concern is the prefixes because they are free morphemes with rigid word order, and separating them produces sequences that more closely resemble similar parts of speech in languages like English.

We produced two versions of normalizations based on light10: Norm2a separates the forms, but does not remove them, while Norm2b removes the separated forms.

Norm2a allows comparisons to reference translations using all of the forms that are present in the texts and handles the free morphemes like independent words, as they would be in a language like English. Norm2a has the effect of increasing the number of words that are scored, introducing a large number of unigrams (single words) that are likely to be scored as correct translations. This alone can increase scores from automated metrics. Scores from metrics such as BLEU that are based on n-gram co-occurrence statistics will also increase because Norm2a ensures that the order of prefix sequences such as wa + al + noun or bi + al + noun will match, thus increasing bigram and trigram matches.

Figure 4 presents the BLEU scores for English to Iraqi Arabic translation from 579 speech inputs before and after normalization. The normalization operations are cumulative: Norm2 is applied to the output of Norm1, while Norm2a and Norm2b are applied to the output of Norm2. The orthographic normalization in Norm1 led to slight increases in the BLEU scores for all systems (average .009). Norm2b increased BLEU scores an average of almost .04 above the Norm1 and Norm2 scores. Norm2a resulted in a large increase that averaged .148, boosted by the additional n-gram matches.

[Figure 4: July 2007 English to Iraqi Arabic BLEU Scores: Speech Inputs with Cumulative Normalization. BLEU scores (0–0.50) for speech translation systems A–D under Norm0, Norm1, Norm2, Norm2b, and Norm2a.]
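To make the Norm2a/Norm2b distinction concrete, the sketch below separates or strips the light10 clitics and suffixes from Buckwalter-transliterated tokens. The affix inventories and length constraints follow the description above and Table 2, but the code is our own illustration rather than the modified light10 stemmer used in the evaluation; in particular, treating the bound suffixes the same way as the proclitics under Norm2a is an assumption, and the demonstration tokens are hypothetical.

```python
# Illustrative Norm2a/Norm2b sketch over Buckwalter-transliterated tokens.
# Affix lists follow the text and Table 2; thresholds follow the constraints
# described above (>= 2 characters after removing Al, >= 3 after a bare w).

PREPOSITIONS = ("b", "l", "f", "k")                    # bi, li, fi, ka
SUFFIXES = ("hA", "An", "At", "wn", "yn", "yh", "yp",  # Table 2, longest first
            "h", "p", "y")

def split_prefixes(token):
    """Peel off w / preposition / Al as separate tokens, light10-style."""
    parts = []
    if token.startswith("w") and (token[1:].startswith("Al") or len(token) - 1 >= 3):
        parts.append("w")
        token = token[1:]
    if token[:1] in PREPOSITIONS and token[1:].startswith("Al") and len(token) - 3 >= 2:
        parts.append(token[0])
        token = token[1:]
    if token.startswith("Al") and len(token) - 2 >= 2:
        parts.append("Al")
        token = token[2:]
    return parts, token

def split_suffix(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)], suffix
    return token, ""

def norm2a(token):
    """Separate the clitics and suffixes, keeping them as free tokens."""
    prefixes, stem = split_prefixes(token)
    stem, suffix = split_suffix(stem)
    return prefixes + [stem] + ([suffix] if suffix else [])

def norm2b(token):
    """Remove the separated clitics and suffixes, keeping only the stem."""
    prefixes, stem = split_prefixes(token)
    stem, _ = split_suffix(stem)
    return [stem]

if __name__ == "__main__":
    # Hypothetical tokens: bAlbrnAmj 'in the program', brnAmjhA 'her program'.
    print(norm2a("bAlbrnAmj"), norm2b("bAlbrnAmj"))  # ['b', 'Al', 'brnAmj'] ['brnAmj']
    print(norm2a("brnAmjhA"), norm2b("brnAmjhA"))    # ['brnAmj', 'hA'] ['brnAmj']
```

Under this sketch, a hypothesis token and a reference token that differ only in an attached clitic can still contribute matching unigrams (Norm2a) or identical stems (Norm2b), which is exactly the effect on n-gram matching discussed above.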

These additional n-gram matches could be an important confounding factor in our assessments of the advantages of these normalizations. One way to evaluate this confounding factor is to compare the results of analyses for Norm2b, which removes the affixes. The other approach we took to examine the effects of the extra n-grams is that, in addition to applying BLEU in the standard way (computing the geometric average of matches on unigrams, bigrams, trigrams, and 4-grams), we also computed BLEU using just unigram matches (an option in the BLEU scoring software). Because unigram-only matching effectively gives no weight to fluency or word order, the unigram-only values for BLEU are more a measure of the semantic adequacy of the words in the machine translation output. We believe this is an advantage for languages like Arabic with freer word order. Because looking at only unigrams gives the translations no extra credit for additional n-grams, especially the extra n-grams that are present in Norm2a, these unigram-only values put the Arabic scoring on a more theoretically equal footing with English.

5 Normalization Results

We restrict our report to the BLEU metric in order to compare unigram scores, but results from other automated metrics such as METEOR are similar. The correlations are based on a subset of the recorded offline evaluation data consisting of 109 English utterances (1431 words) and 96 Iraqi Arabic utterances (1085 words) in excerpts from 13 dialogs, each including about 7 exchanges. A fragment of typical input follows (with translation of the Arabic in square brackets).

E: well we can do certain things for you at this time
E: but you still have to go through the M.O.I. or the Mili--Ministry of Interior for some of your requisitions
A: آي رح أحاول آو أروح للوزارة و أسألهم على -- على (()) %تنفس والله كريم
   [%AH I'll try to go to the Ministry and ask them about--about ((unintelligible)) %BREATH and we'll see (literally: God is generous)]
E: well we can do in that process we will assist you with that process and maybe speed up their end of the %AH of the dealings with this

Table 3 provides Pearson's correlations among all the measures we have discussed for the English to Iraqi Arabic translations. Each correlation is computed over 39 data points (scores from 3 systems on excerpts from 13 dialogs). Correlations to the word error rate (WER) from automated recognition of the English speech input are included in the first column. Next are correlations of Norm2, Norm2a, and Norm2b computed with BLEU_1 (BLEU with unigrams only) and with BLEU_4 (the more usual version with unigrams through 4-grams). Correlations with the two human-judgment metrics are in the right-hand two columns and bottom two rows: "Adj Prob Correct" refers to the adjusted probability correct score for transfer of content words described in section 3.

The highest correlation in Table 3 is between the two types of human judgments. Also, it appears that WER is a good predictor of translation quality for the TRANSTAC systems. There is a steady increase in correlation from Norm2 to Norm2a to Norm2b. Norm2b scores correlate with the human judgments considerably more strongly than is the case for the Norm2 and Norm2a scores. We believe this shows that human judges are more sensitive to errors on content words than to errors on the functional elements that are removed by Norm2b, but only separated in Norm2a.

Although Norm2b BLEU scores are more highly correlated to scores from human judgments than BLEU scores based on other normalizations, the highest correlation is achieved using BLEU_1 instead of BLEU_4. Correlations with the human-judgment metrics are always much lower for BLEU_4 than for BLEU_1. This result suggests our human judges were more tolerant of word order differences than the BLEU_4 metric expects. Both types of human judgments focus on semantic quality, which may reduce the effect of word order.
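To make the BLEU_1/BLEU_4 contrast concrete, here is a self-contained sketch of sentence-level BLEU with a selectable maximum n-gram order. It is not bleu_babylon.pl: the zero-match penalty from footnote 1 is implemented under one plausible reading (substituting log(0.99 / number of n-grams in the hypothesis) for the log precision of an order with no matches), and the toy strings in the example are ours.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU over whitespace tokens, up to max_n-grams.

    max_n=1 corresponds to the unigram-only BLEU_1 used in this analysis;
    max_n=4 is the usual BLEU_4.  When an order has no matching n-grams,
    log(0.99 / #n-grams in the hypothesis) stands in for that order's log
    precision -- one plausible reading of footnote 1, not necessarily the
    exact behavior of bleu_babylon.pl.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        total = sum(hyp_counts.values())
        if total == 0:                       # hypothesis shorter than n words
            return 0.0
        max_ref = Counter()                  # clipped counts over all references
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        matches = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        log_prec += math.log(matches / total) if matches else math.log(0.99 / total)
    log_prec /= max_n
    # brevity penalty against the closest reference length
    ref_len = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(hyp)), rl))
    brevity = 1.0 if len(hyp) >= ref_len else math.exp(1.0 - ref_len / len(hyp))
    return brevity * math.exp(log_prec)

if __name__ == "__main__":
    reference = ["the clinic is near the school"]   # toy example
    hypothesis = "the clinic is by the school"
    print(bleu(hypothesis, reference, max_n=1))     # unigram-only BLEU_1
    print(bleu(hypothesis, reference, max_n=4))     # standard BLEU_4
```

In the toy example, BLEU_1 credits five of the six words while BLEU_4 is pulled down by the missing higher-order matches, which is the kind of sensitivity to word order and longer n-grams discussed above.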

                     English  BLEU_1  BLEU_4  BLEU_1  BLEU_4  BLEU_1  BLEU_4  Likert   Content
                     input    Norm2   Norm2   Norm2a  Norm2a  Norm2b  Norm2b  Semantic Word
                     WER                                                      Adequacy AdjProbCor
English input WER     1
BLEU_1 Norm2         -0.23    1
BLEU_4 Norm2         -0.03    0.81    1
BLEU_1 Norm2a        -0.33    0.77    0.63    1
BLEU_4 Norm2a        -0.18    0.81    0.89    0.79    1
BLEU_1 Norm2b        -0.43    0.82    0.51    0.80    0.61    1
BLEU_4 Norm2b        -0.38    0.76    0.63    0.64    0.66    0.84    1
Likert Sem Adeq      -0.63    0.50    0.19    0.60    0.41    0.75    0.63    1
Adj Prob Correct     -0.67    0.35    0.07    0.59    0.30    0.67    0.48    0.86     1

Table 3: Pearson's R Correlations among the Metrics and Normalizations: June 2008 English to Iraqi Arabic
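For reference, the correlations in Table 3 are ordinary Pearson coefficients over the 39 system-by-dialog data points. A minimal sketch, with short hypothetical score lists standing in for the real per-dialog values:

```python
import math

def pearson(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical per-dialog scores for illustration only; the reported analysis
# uses 3 systems x 13 dialog excerpts = 39 points per correlation.
bleu1_norm2b = [0.42, 0.31, 0.55, 0.48, 0.29]
likert_semantic_adequacy = [1.8, 0.9, 2.4, 2.1, 1.1]
print(round(pearson(bleu1_norm2b, likert_semantic_adequacy), 2))
```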

6 Conclusion

Puzzling asymmetries between automated measures of English to Iraqi Arabic and Iraqi Arabic to English translations can be attributed to orthographic variation, inflectional morphology, and relatively free word order in Arabic. We demonstrate that the asymmetric effects of these linguistic differences can be mitigated by normalization processes that reduce orthographic variation and delete or separate affixes and function words.

Correlations to human judgments suggest that features with minimal effect on meaning, such as word order, have little impact on judgments that focus on semantic quality. This may be especially true for dialogs, where disfluencies and inference from context are the norm.

In demonstrating the advantages of using light stemming to improve the validity of automated measures of translation, we have drawn from Larkey et al.'s (2007) research, which demonstrated the advantages of using light stemming for information retrieval. It also appears that light stemming can provide benefits for training speech translation systems. Shen et al. (2007) obtained BLEU score increases by processing training data with a series of normalization operations similar to the ones we investigated.

In future work, we will explore different combinations of deleting vs. separating affixes, along with enhancements to take into account pronominal endings for the 2nd person (more common in spoken discourse) and other forms unique to Iraqi Arabic. Nevertheless, the first approximation presented here has been productive.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 65-73.

Niladri Chatterjee, Anish Johnson, and Madhav Krishna. 2007. Some Improvements over the BLEU Metric for Measuring Translation Quality for Hindi. Proceedings of the International Conference on Computing: Theory and Applications 2007, pages 485-490.

Sherri Condon, Jon Phillips, Christy Doran, John Aberdeen, Dan Parvaz, Beatrice Oshika, Greg Sanders, and Craig Schlenoff. 2008. Applying Automated Metrics to Speech Translation Dialogs. Proceedings of LREC 2008.

Christopher Culy and Susanne Riehemann. 2003. The Limits of N-gram Translation Evaluation Metrics. Proceedings of MT Summit IX, AMTA, pages 71-79.

Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, volume 38 of Text, Speech and Language Technology. Springer Verlag.

Leah S. Larkey, Lisa Ballesteros, and Margaret E. Connell. 2007. Light Stemming for Arabic Information Retrieval. In Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, volume 38 of Text, Speech and Language Technology. Springer Verlag.

Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. 2004. The Significance of Recall in Automatic Metrics for MT Evaluation. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004), pages 134-143.

Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Dependency-Based Automatic Evaluation for Machine Translation. Proceedings of the HLT-NAACL 2007 AMTA Workshop on Syntax and Structure in Statistical Translation, pages 80-87.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of ACL 2002, pages 311-318.

Greg Sanders, Sébastien Bronsart, Sherri Condon, and Craig Schlenoff. 2008. Odds of Successful Transfer of Low-level Concepts: A Key Metric for Bidirectional Speech-to-Speech Machine Translation in DARPA's TRANSTAC Program. Proceedings of LREC 2008.

Wade Shen, Brian Delaney, Tim Anderson, and Ray Slyh. 2007. The MIT-LL/AFRL IWSLT-2007 MT System. Proceedings of the 2007 International Workshop on Spoken Language Translation, Trento, Italy.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of AMTA 2006, pages 223-231.

Joseph Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003, pages 386-393.

Brian Weiss, Craig Schlenoff, Greg Sanders, Michelle Potts Steves, Sherri Condon, Jon Phillips, and Dan Parvaz. 2008. Performance Evaluation of Speech Translation Systems. Proceedings of LREC 2008.

© The MITRE Corporation. All rights reserved.