Word Error Rates: Decomposition over POS Classes and Applications for Error Analysis

Maja Popović and Hermann Ney
Lehrstuhl für Informatik 6, RWTH Aachen University
Aachen, Germany
[email protected] [email protected]

Abstract

Evaluation and error analysis of machine translation output are important but difficult tasks. In this work, we propose a novel method for obtaining more details about actual translation errors in the generated output by introducing the decomposition of Word Error Rate (WER) and Position independent word Error Rate (PER) over different Part-of-Speech (POS) classes. Furthermore, we investigate two possible aspects of the use of these decompositions for automatic error analysis: estimation of inflectional errors and distribution of missing words over POS classes. The obtained results are shown to correspond to the results of a human error analysis. The results obtained on the European Parliament Plenary Session corpus in Spanish and English give a better overview of the nature of translation errors as well as ideas of where to put efforts for possible improvements of the translation system.

1 Introduction

Evaluation of machine translation output is a very important but difficult task. Human evaluation is expensive and time consuming. Therefore a variety of automatic evaluation measures have been studied over the last years. The most widely used are the Word Error Rate (WER), the Position independent word Error Rate (PER), the BLEU score (Papineni et al., 2002) and the NIST score (Doddington, 2002). These measures have been shown to be valuable tools for comparing different systems as well as for evaluating improvements within one system. However, these measures do not give any details about the nature of translation errors. Therefore some more detailed analysis of the generated output is needed in order to identify the main problems and to focus the research efforts. A framework for human error analysis has been proposed in (Vilar et al., 2006), but like every human evaluation, it is a time consuming task.

This article presents a framework for calculating the decomposition of WER and PER over different POS classes, i.e. for estimating the contribution of each POS class to the overall word error rate. Although this work focuses on POS classes, the method can easily be extended to other types of linguistic information. In addition, two methods for error analysis using the WER and PER decompositions together with base forms are proposed: estimation of inflectional errors and distribution of missing words over POS classes. The translation corpus used for our error analysis was built in the framework of the TC-STAR project (tcs, 2005) and contains the transcriptions of the European Parliament Plenary Sessions (EPPS) in Spanish and English. The translation system used is the phrase-based statistical machine translation system described in (Vilar et al., 2005; Matusov et al., 2006).

2 Related Work

Automatic evaluation measures for machine translation output have been receiving more and more attention in recent years. The BLEU metric (Papineni et al., 2002) and the closely related NIST metric (Doddington, 2002), along with WER and PER, have been widely used by many machine translation researchers. An extended version of BLEU which uses n-grams weighted according to their frequency estimated from a monolingual corpus is proposed in (Babych and Hartley, 2004). (Leusch et al., 2005) investigate preprocessing and normalisation methods for improving the evaluation using the standard measures WER, PER, BLEU and NIST. The same set of measures is examined in (Matusov et al., 2005) in combination with automatic sentence segmentation in order to enable the evaluation of translation output without sentence boundaries (e.g. translation of speech recognition output). The METEOR metric (Banerjee and Lavie, 2005) uses stems and synonyms of the words. This measure counts the number of exact word matches between the output and the reference; in a second step, unmatched words are converted into stems or synonyms and then matched. The TER metric (Snover et al., 2006) measures the amount of editing that a human would have to perform to change the system output so that it exactly matches the reference. The CDER measure (Leusch et al., 2006) is based on edit distance, like the well-known WER, but allows the reordering of blocks. Nevertheless, none of these measures or extensions takes into account linguistic knowledge about actual translation errors, for example what the contribution of verbs to the overall error rate is, or how many full forms are wrong although their base forms are correct. A framework for human error analysis has been proposed in (Vilar et al., 2006) and a detailed analysis of the obtained results has been carried out. However, human error analysis, like any human evaluation, is a time consuming task.

Whereas the use of linguistic knowledge for improving the performance of a statistical machine translation system has been investigated in many publications for various language pairs (for example (Nießen and Ney, 2000) and (Goldwater and McClosky, 2005)), its use for the analysis of translation errors is still a rather unexplored area. Some automatic methods for error analysis using base forms and POS tags are proposed in (Popović et al., 2006; Popović and Ney, 2006). These measures are based on differences between WER and PER which are calculated separately for each POS class using subsets extracted from the original texts; the standard overall WER and PER of the original texts are not taken into account at all. In this work, the standard WER and PER themselves are decomposed and analysed.

3 Decomposition of WER and PER over POS classes

The standard procedure for evaluating machine translation output is to compare the hypothesis document hyp with given reference translations ref, each consisting of K sentences (or segments). The reference document ref contains R reference translations for each sentence. Let the length of the hypothesis sentence hyp_k be denoted as N_{hyp_k}, and the reference lengths of each sentence as N_{ref_{k,r}}. Then the total hypothesis length of the document is N_{hyp} = \sum_k N_{hyp_k}, and the total reference length is N^*_{ref} = \sum_k N^*_{ref_k}, where N^*_{ref_k} is defined as the length of the reference sentence with the lowest sentence-level error rate, which was shown to be optimal in (Leusch et al., 2005).

3.1 Standard word error rates (overview)

The word error rate (WER) is based on the Levenshtein distance (Levenshtein, 1966): the minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text hyp into the reference text ref. A shortcoming of the WER is the fact that it does not allow reorderings of words: the word order of the hypothesis can differ from the word order of the reference even though the hypothesis is a correct translation. In order to overcome this problem, the position independent word error rate (PER) compares the words in the two sentences without taking the word order into account. The PER is always lower than or equal to the WER. On the other hand, a shortcoming of the PER is the fact that the word order can be important in some cases. Therefore the best solution is to calculate both word error rates.

Calculation of WER: The WER of the hypothesis hyp with respect to the reference ref is calculated as:

    WER = \frac{1}{N^*_{ref}} \sum_{k=1}^{K} \min_r d_L(ref_{k,r}, hyp_k)

where d_L(ref_{k,r}, hyp_k) is the Levenshtein distance between the reference sentence ref_{k,r} and the hypothesis sentence hyp_k. The calculation of WER is performed using a dynamic programming algorithm.
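To make the computation concrete, the following is a minimal Python sketch of the WER calculation (our own illustration; the function names are not from the paper). Sentences are assumed to be tokenised into lists of words, and N^*_{ref} is approximated here by the length of the reference closest to the hypothesis in Levenshtein distance rather than the one with the lowest sentence-level error rate.

    def levenshtein(ref, hyp):
        """Minimum number of substitutions, deletions and insertions
        needed to convert hyp into ref (dynamic programming)."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                        # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                        # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,           # deletion
                              d[i][j - 1] + 1,           # insertion
                              d[i - 1][j - 1] + cost)    # substitution or match
        return d[len(ref)][len(hyp)]

    def wer(references, hypotheses):
        """Document-level WER. references: one list of alternative
        reference sentences per hypothesis sentence."""
        total_dist = total_ref_len = 0
        for refs_k, hyp_k in zip(references, hypotheses):
            best = min(refs_k, key=lambda r: levenshtein(r, hyp_k))
            total_dist += levenshtein(best, hyp_k)
            total_ref_len += len(best)         # contribution to N*_ref
        return total_dist / total_ref_len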

Calculation of PER: The PER can be calculated using the counts n(e, hyp_k) and n(e, ref_{k,r}) of a word e in the hypothesis sentence hyp_k and the reference sentence ref_{k,r} respectively:

    PER = \frac{1}{N^*_{ref}} \sum_{k=1}^{K} \min_r d_{PER}(ref_{k,r}, hyp_k)

where

    d_{PER}(ref_{k,r}, hyp_k) = \frac{1}{2} \left( |N_{ref_{k,r}} - N_{hyp_k}| + \sum_e |n(e, ref_{k,r}) - n(e, hyp_k)| \right)

    reference:  Mister#N Commissioner#N ,#PUN twenty-four#NUM hours#N
                sometimes#ADV can#V be#V too#ADV much#PRON time#N .#PUN

    hypothesis: Mrs#N Commissioner#N ,#PUN twenty-four#NUM hours#N is#V
                sometimes#ADV too#ADV much#PRON time#N .#PUN

Table 1: Example for the illustration of actual errors: a POS tagged reference sentence and a corresponding hypothesis sentence
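The distance d_PER itself reduces to a comparison of bag-of-word counts, as the following small sketch shows (again ours, not the implementation used in the paper), applied to the sentence pair of Table 1:

    from collections import Counter

    def d_per(ref, hyp):
        """d_PER = ( |N_ref - N_hyp| + sum_e |n(e,ref) - n(e,hyp)| ) / 2."""
        n_ref, n_hyp = Counter(ref), Counter(hyp)
        diff = sum(abs(n_ref[e] - n_hyp[e]) for e in n_ref.keys() | n_hyp.keys())
        return (abs(len(ref) - len(hyp)) + diff) / 2

    ref = ("Mister Commissioner , twenty-four hours sometimes "
           "can be too much time .").split()
    hyp = ("Mrs Commissioner , twenty-four hours is sometimes "
           "too much time .").split()
    print(d_per(ref, hyp))  # 3.0

The result of 3 errors for this sentence pair agrees with the count derived for the standard PER in Section 3.3.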

3.2 WER decomposition over POS classes

The dynamic programming algorithm for WER enables a simple and straightforward identification of each erroneous word which actually contributes to WER. Let err_k denote the set of erroneous words in sentence k with respect to the best reference, and let p be a POS class. Then n(p, err_k) is the number of errors in err_k produced by words with POS class p. It should be noted that for substitution errors, the POS class of the involved reference word is taken into account. POS tags of the reference words are also used for deletion errors, whereas for insertion errors the POS class of the hypothesis word is taken. The WER for the word class p can be calculated as:

    WER(p) = \frac{1}{N^*_{ref}} \sum_{k=1}^{K} n(p, err_k)

The sum over all classes is equal to the standard overall WER.

An example of a reference sentence and a hypothesis sentence along with the corresponding POS tags is shown in Table 1. The WER errors, i.e. the actual words participating in WER together with their POS classes, can be seen in Table 2. The reference words involved in WER are denoted as reference errors, and hypothesis errors refer to the hypothesis words participating in WER.

    reference errors    hypothesis errors    error type
    Mister#N            Mrs#N                substitution
    sometimes#ADV       is#V                 substitution
    can#V                                    deletion
    be#V                sometimes#ADV        substitution

Table 2: WER errors: actual words which participate in the word error rate and their corresponding POS classes

The standard WER of the whole sentence is equal to 4/12 = 33.3%. The contribution of nouns is WER(N) = 1/12 = 8.3%, of verbs WER(V) = 2/12 = 16.7%, and of adverbs WER(ADV) = 1/12 = 8.3%.
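A minimal Python sketch of this decomposition for a single sentence follows (our own illustration, assuming lists of (word, pos) pairs): the Levenshtein matrix is backtraced and each edit operation is tallied under a POS class. Since several alignments can be optimal, ties may distribute errors over the classes slightly differently than in Table 2.

    from collections import Counter

    def wer_errors_by_pos(ref, hyp):
        """n(p, err_k) for one sentence: substitutions and deletions are
        counted under the POS of the reference word, insertions under
        the POS of the hypothesis word."""
        I, J = len(ref), len(hyp)
        d = [[0] * (J + 1) for _ in range(I + 1)]
        for i in range(I + 1):
            d[i][0] = i
        for j in range(J + 1):
            d[0][j] = j
        for i in range(1, I + 1):
            for j in range(1, J + 1):
                cost = 0 if ref[i - 1][0] == hyp[j - 1][0] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
        errors = Counter()
        i, j = I, J
        while i > 0 or j > 0:                  # backtrace one optimal path
            if i > 0 and j > 0 and ref[i - 1][0] == hyp[j - 1][0] \
                    and d[i][j] == d[i - 1][j - 1]:
                i, j = i - 1, j - 1            # match, no error
            elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
                errors[ref[i - 1][1]] += 1     # substitution -> reference POS
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                errors[ref[i - 1][1]] += 1     # deletion -> reference POS
                i -= 1
            else:
                errors[hyp[j - 1][1]] += 1     # insertion -> hypothesis POS
                j -= 1
        return errors

Summing these counts over all sentences and normalising by N^*_{ref} gives WER(p); summing WER(p) over all classes recovers the overall WER.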

3.3 PER decomposition over POS classes

In contrast to WER, standard efficient algorithms for the calculation of PER do not give precise information about the contributing words. However, it is possible to identify all words in the hypothesis which do not have a counterpart in the reference, and vice versa. These words will be referred to as PER errors. An illustration of PER errors is given in Table 3.

    reference errors    hypothesis errors
    Mister#N            Mrs#N
    be#V                is#V
    can#V

Table 3: PER errors: actual words which participate in the position independent word error rate and their corresponding POS classes

The number of errors contributing to the standard PER according to the algorithm described in Section 3.1 is 3: there are two substitutions and one deletion. The problem with the standard PER is that it is not possible to detect which words are deletion errors, which are insertion errors, and which are substitution errors. Therefore we introduce an alternative PER-based measure which corresponds to the F-measure. Let herr_k refer to the set of words in the hypothesis sentence hyp_k which do not appear in the reference sentence ref_k (referred to as hypothesis errors). Analogously, let rerr_k denote the set of words in the reference sentence ref_k which do not appear in the hypothesis sentence hyp_k (referred to as reference errors). Then the following measures can be calculated:

• reference PER (RPER) (similar to recall):

    RPER(p) = \frac{1}{N^*_{ref}} \sum_{k=1}^{K} n(p, rerr_k)

• hypothesis PER (HPER) (similar to precision):

    HPER(p) = \frac{1}{N_{hyp}} \sum_{k=1}^{K} n(p, herr_k)

• F-based PER (FPER):

    FPER(p) = \frac{1}{N^*_{ref} + N_{hyp}} \sum_{k=1}^{K} \left( n(p, rerr_k) + n(p, herr_k) \right)

Since we are basically interested in all words without a counterpart, both in the reference and in the hypothesis, this work focuses on FPER. The sum of FPER over all POS classes is equal to the overall FPER, and the latter is always less than or equal to the standard PER.

For the example sentence pair presented in Table 1, the number of hypothesis errors (words in herr_k) is 2 and the number of reference errors (words in rerr_k) is 3. The number of errors contributing to the standard PER is 3, since |N_ref − N_hyp| = 1 and \sum_e |n(e, ref_k) − n(e, hyp_k)| = 5. The standard PER is normalised over the reference length N_ref = 12, thus being equal to 25%. The FPER is the sum of hypothesis and reference errors divided by the sum of the hypothesis and reference lengths: FPER = (2 + 3)/(11 + 12) = 5/23 = 21.7%. The contribution of nouns is FPER(N) = 2/23 = 8.7% and the contribution of verbs is FPER(V) = 3/23 = 13%.
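The FPER decomposition can be sketched as follows (our own illustration, with the same (word, pos) layout and a single reference per sentence). One detail is our assumption: when a word occurs several times, only the occurrences exceeding its count on the other side are treated as errors.

    from collections import Counter

    def fper_by_pos(sentence_pairs):
        """FPER(p) over a corpus of (ref, hyp) pairs, where each
        sentence is a list of (word, pos) pairs."""
        errors = Counter()
        total_len = 0                      # accumulates N*_ref + N_hyp
        for ref, hyp in sentence_pairs:
            total_len += len(ref) + len(hyp)
            ref_counts = Counter(w for w, _ in ref)
            hyp_counts = Counter(w for w, _ in hyp)
            seen = Counter()
            for w, p in ref:               # reference errors rerr_k
                seen[w] += 1
                if seen[w] > hyp_counts[w]:
                    errors[p] += 1
            seen = Counter()
            for w, p in hyp:               # hypothesis errors herr_k
                seen[w] += 1
                if seen[w] > ref_counts[w]:
                    errors[p] += 1
        return {p: n / total_len for p, n in errors.items()}

For the sentence pair of Table 1 this yields FPER(N) = 2/23 and FPER(V) = 3/23, matching the numbers above.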

4 Applications for error analysis

The decomposed error rates described in Section 3.2 and Section 3.3 contain more details than the standard error rates. However, for more precise information about certain phenomena, some further analysis is required. In this work, we investigate two possible aspects of error analysis:

• estimation of inflectional errors by the use of FPER errors and base forms

• extraction of the distribution of missing words over POS classes using WER errors, FPER errors and base forms.

4.1 Inflectional errors

Inflectional errors can be estimated using FPER errors and base forms. From each reference-hypothesis sentence pair, only erroneous words which share a common base form are taken into account. The inflectional error rate of each POS class is then calculated in the same way as FPER. For example, from the PER errors presented in Table 3, the words "is" and "be" are candidates for an inflectional error because they share the same base form "be". The inflectional error rate in this example is present only for the verbs and is calculated in the same way as FPER, i.e. IFPER(V) = 2/23 = 8.7%.
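A small sketch of this estimation follows (ours; the (full_form, base_form, pos) triples are an assumed data layout, with the annotation in practice coming from the taggers described in Section 6). Counting both members of a matched pair in the numerator is our reading of "calculated in the same way as FPER"; it reproduces the IFPER(V) = 2/23 figure above.

    from collections import Counter

    def inflectional_errors_by_pos(rerr, herr):
        """Count inflectional errors per POS class for one sentence pair.
        rerr / herr: reference / hypothesis PER errors, each given as a
        (full_form, base_form, pos) triple. Each reference error is paired
        with at most one hypothesis error sharing its base form."""
        errors = Counter()
        remaining = list(herr)
        for _, base_r, pos in rerr:
            for idx, (_, base_h, _) in enumerate(remaining):
                if base_r == base_h:       # same base form, wrong inflection
                    errors[pos] += 2       # reference word + hypothesis word
                    del remaining[idx]
                    break
        return errors

    rerr = [("Mister", "Mister", "N"), ("be", "be", "V"), ("can", "can", "V")]
    herr = [("Mrs", "Mrs", "N"), ("is", "be", "V")]
    print(inflectional_errors_by_pos(rerr, herr))  # Counter({'V': 2})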
4.2 Missing words

The distribution of missing words over POS classes can be extracted from the WER and FPER errors in the following way: the words considered missing are those which occur as deletions in the WER errors and at the same time occur only as reference PER errors, without sharing the base form with any hypothesis error. The use of both WER and PER errors is much more reliable than using only the WER deletion errors, because not all deletion errors are produced by missing words: a number of WER deletions appears due to reordering errors. The information about the base form is used in order to eliminate inflectional errors. The number of missing words is extracted for each word class and then normalised over the sum over all classes. For the example sentence pair presented in Table 1, from the WER errors in Table 2 and the PER errors in Table 3, the word "can" will be identified as missing.
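Using the same assumed triples as above, the two conditions (WER deletion, and reference PER error with an unmatched base form) can be combined as in the following sketch (ours):

    from collections import Counter

    def missing_words_by_pos(wer_deletions, rerr, herr):
        """Count missing words per POS class for one sentence pair.
        wer_deletions: words deleted in the WER alignment;
        rerr / herr: reference / hypothesis PER errors;
        all given as (full_form, base_form, pos) triples."""
        rerr_full_forms = {full for full, _, _ in rerr}
        herr_base_forms = {base for _, base, _ in herr}
        missing = Counter()
        for full, base, pos in wer_deletions:
            # a shared base form on the hypothesis side would indicate
            # an inflectional error rather than a missing word
            if full in rerr_full_forms and base not in herr_base_forms:
                missing[pos] += 1
        return missing

On the example of Tables 1-3, the WER deletion "can" passes both tests and is identified as missing, in line with the text.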
5 Experimental settings

5.1 Translation System

The machine translation system used in this work is based on the statistical approach. It is built as a log-linear combination of seven different statistical models: phrase-based models in both directions, IBM1 models at the phrase level in both directions, a target language model, a phrase penalty and a length penalty. A detailed description of the system can be found in (Vilar et al., 2005; Matusov et al., 2006).

5.2 Task and corpus

The corpus analysed in this work was built in the framework of the TC-STAR project. The training corpus contains more than one million sentences and about 35 million running words of the European Parliament Plenary Sessions (EPPS) in Spanish and English. The test corpus contains about 1 000 sentences and 28 000 running words. The OOV rates are low: about 0.5% of the running words for Spanish and 0.2% for English. The corpus statistics can be seen in Table 4. More details about the EPPS data can be found in (Vilar et al., 2005).

    TRAIN            Spanish       English
    Sentences             1 167 627
    Running words    35 320 646    33 945 468
    Vocabulary          159 080       110 636
    TEST
    Sentences               894         1 117
    Running words        28 591        28 492
    OOVs                  0.52%         0.25%

Table 4: Statistics of the training and test corpora of the TC-STAR EPPS Spanish-English task. The test corpus is provided with two references.

6 Error analysis

The translation is performed in both directions (Spanish to English and English to Spanish), and the error analysis is done on both the English and the Spanish output. Morpho-syntactic annotation of the English references and hypotheses is performed using the constraint grammar parser ENGCG (Voutilainen, 1995), and the Spanish texts are annotated using the FreeLing analyser (Carreras et al., 2004). In this way, all references and hypotheses are provided with POS tags and base forms. The decomposition of WER and FPER is done over the ten main POS classes: nouns (N), verbs (V), adjectives (A), adverbs (ADV), pronouns (PRON), determiners (DET), prepositions (PREP), conjunctions (CON), numerals (NUM) and punctuation marks (PUN). Inflectional error rates are also estimated for each POS class using FPER counts and base forms. Additionally, details about the verb tense and person inflections for both languages, as well as about the adjective gender and number inflections for the Spanish output, are extracted. Apart from that, the distribution of missing words over the ten POS classes is estimated using the WER and FPER errors.

6.1 WER and PER (FPER) decompositions

Figure 1 presents the decompositions of WER and FPER over the ten basic POS classes for both languages. The largest part of both word error rates comes from the two most important word classes, namely nouns and verbs, whereas the least critical classes are punctuation marks, conjunctions and numerals. Adjectives, determiners and prepositions are significantly worse in the Spanish output. This is partly due to the richer morphology of the Spanish language. Furthermore, the histograms indicate that the number of erroneous nouns and pronouns is higher in the English output. As for verbs, WER is higher for English and FPER for Spanish. This indicates that there are more problems with word order in the English output, and more problems with the correct verb or verb form in the Spanish output.

In addition, the decomposed error rates give an idea of where to put efforts for possible improvements of the system. For example, working on improvements of verb translations could reduce up to about 10% WER and 7% FPER, working on nouns up to 8% WER and 5% FPER, whereas there is no reason to put too much effort into e.g. adverbs, since this could lead only to about 2% of WER and FPER reduction.¹

¹ A reduction of FPER leads to a similar reduction of PER.

[Figure 1: Decomposition of WER and FPER [%] over the ten basic POS classes for English and Spanish output]

6.2 Inflectional errors

Inflectional error rates for the ten POS classes are presented in Figure 2. For the English language, these errors are significant only for two POS classes: nouns and verbs. The verbs are the most problematic category in both languages, with an error rate almost two times higher for Spanish than for English. This is due to the very rich morphology of Spanish verbs: one base form might have up to about forty different inflections.

[Figure 2: Inflectional error rates [%] for English and Spanish output]

Nouns have a higher error rate for English than for Spanish. The reason for this difference is not clear, since the noun morphology of neither language is particularly rich: there is only a distinction between singular and plural. One possible explanation might be the numerous occurrences of different variants of the same word, like for example "Mr" and "Mister".

In the Spanish output, two additional POS classes show a significant error rate: determiners and adjectives. This is due to the gender and number inflections of those classes, which do not exist in the English language: for each determiner or adjective, there are four variants in Spanish and only one in English. Working on the inflections of Spanish verbs might reduce the FPER by approximately 2%, and working on English verbs by about 1%. Improvements of Spanish determiners could lead to up to about 2% of improvement.

6.2.1 Comparison with human error analysis

The results obtained for inflectional errors are comparable with the results of the human error analysis carried out in (Vilar et al., 2006). Although it is difficult to compare all the numbers directly, the overall tendencies are the same: the largest number of translation errors is caused by Spanish verbs, followed by a considerably smaller but still large number of errors caused by English verbs. A much smaller but still significant number of errors is due to Spanish adjectives, and only a few errors of English adjectives are present.

Human analysis was also done for the tense and person of verbs, as well as for the number and gender of adjectives.

We use more detailed POS tags in order to extract this additional information and calculate inflectional error rates for such tags. It should be noted that, in contrast to all previous error rates, these error rates are not disjunct but overlapping: many words contribute to both.

The results are shown in Figure 3, and the tendencies are again the same as those reported in (Vilar et al., 2006). As for verbs, tense errors are much more frequent than person errors for both languages. Adjective inflections cause a certain amount of errors only in the Spanish output; the contributions of gender and of number are approximately equal.

[Figure 3: More details about inflections: verb tense and person error rates and adjective gender and number error rates [%]]

6.3 Missing words

Figure 4 presents the distribution of missing words over POS classes. This distribution shows the same behaviour as the one obtained by human error analysis. Most missing words for both languages are verbs. For English, the percentage of missing verbs is significantly higher than for Spanish; the same happens for pronouns. The probable reason for this is the nature of Spanish verbs: since person and tense are contained in the suffix, Spanish pronouns are often omitted, and auxiliary verbs do not exist for all tenses. This can be problematic for a translation system, because it has to process a single Spanish word which actually corresponds to two (or more) English words.

[Figure 4: Distribution of missing words over POS classes [%] for English and Spanish output]

Prepositions are more often missing in Spanish than in English, as are determiners. A probable reason is the disproportion in the number of occurrences of those classes between the two languages.

7 Conclusions

This work presents a framework for the extraction of linguistic details from the standard word error rates WER and PER and their use for automatic error analysis. We presented a method for the decomposition of the standard word error rates WER and PER over ten basic POS classes. We also carried out a detailed analysis of inflectional errors, which has shown that the results obtained by our method correspond to those obtained by a human error analysis. In addition, we proposed a method for analysing missing word errors.

We plan to extend the proposed methods in order to carry out a more detailed error analysis, for example examining different types of verb inflections. We also plan to examine other types of translation errors, for example errors caused by word order.

Acknowledgements

This work was partly funded by the European Union under the integrated project TC-STAR – Technology and Corpora for Speech to Speech Translation (IST-2002-FP6-506738).

References

Bogdan Babych and Anthony Hartley. 2004. Extending BLEU MT Evaluation Method with Frequency Weighting. In Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, Spain, July.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgements. In 43rd Annual Meeting of the Assoc. for Computational Linguistics: Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 65–72, Ann Arbor, MI, June.

Xavier Carreras, Isaac Chao, Lluís Padró, and Muntsa Padró. 2004. FreeLing: An Open-Source Suite of Language Analyzers. In Proc. 4th Int. Conf. on Language Resources and Evaluation (LREC), pages 239–242, Lisbon, Portugal, May.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proc. ARPA Workshop on Human Language Technology, pages 128–132, San Diego.

Sharon Goldwater and David McClosky. 2005. Improving statistical machine translation through morphological analysis. In Proc. of the Conf. on Empirical Methods in Natural Language Processing (EMNLP), Vancouver, Canada, October.

Gregor Leusch, Nicola Ueffing, David Vilar, and Hermann Ney. 2005. Preprocessing and Normalization for Automatic Evaluation of Machine Translation. In 43rd Annual Meeting of the Assoc. for Computational Linguistics: Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 17–24, Ann Arbor, MI, June.

Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2006. CDER: Efficient MT Evaluation Using Block Movements. In Proc. EACL 2006, pages 241–248, Trento, Italy, April.

Vladimir Iosifovich Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10(8):707–710, February.

Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating Machine Translation Output with Automatic Sentence Segmentation. In Proc. of the International Workshop on Spoken Language Translation (IWSLT), pages 148–154, Pittsburgh, PA, October.

Evgeny Matusov, Richard Zens, David Vilar, Arne Mauser, Maja Popović, and Hermann Ney. 2006. The RWTH Machine Translation System. In TC-Star Workshop on Speech-to-Speech Translation, pages 31–36, Barcelona, Spain, June.

Sonja Nießen and Hermann Ney. 2000. Improving SMT quality with morpho-syntactic analysis. In COLING '00: The 18th Int. Conf. on Computational Linguistics, pages 1081–1085, Saarbrücken, Germany, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, July.

Maja Popović and Hermann Ney. 2006. Error Analysis of Verb Inflections in Spanish Translation Output. In TC-Star Workshop on Speech-to-Speech Translation, pages 99–103, Barcelona, Spain, June.

Maja Popović, Adrià de Gispert, Deepa Gupta, Patrik Lambert, Hermann Ney, José B. Mariño, Marcello Federico, and Rafael Banchs. 2006. Morpho-syntactic Information for Automatic Error Analysis of Statistical Machine Translation Output. In Proc. of the HLT-NAACL Workshop on Statistical Machine Translation, pages 1–6, New York, NY, June.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proc. of the 7th Conf. of the Association for Machine Translation in the Americas (AMTA 06), pages 223–231, Boston, MA.

TC-STAR. 2005. TC-STAR – Technology and Corpora for Speech to Speech Translation. Integrated project TC-STAR (IST-2002-FP6-506738) funded by the European Commission. http://www.tc-star.org/.

David Vilar, Evgeny Matusov, Saša Hasan, Richard Zens, and Hermann Ney. 2005. Statistical Machine Translation of European Parliamentary Speeches. In Proc. MT Summit X, pages 259–266, Phuket, Thailand, September.

David Vilar, Jia Xu, Luis Fernando D'Haro, and Hermann Ney. 2006. Error Analysis of Statistical Machine Translation Output. In Proc. of the Fifth Int. Conf. on Language Resources and Evaluation (LREC), pages 697–702, Genoa, Italy, May.

Atro Voutilainen. 1995. ENGCG – Constraint Grammar Parser of English. http://www2.lingsoft.fi/doc/engcg/intro/.
