Why Word Error Rate Is Not a Good Metric for Speech Recognizer Training for the Speech Translation Task?

WHY WORD ERROR RATE IS NOT A GOOD METRIC FOR SPEECH RECOGNIZER TRAINING FOR THE SPEECH TRANSLATION TASK? Xiaodong He, Li Deng, and Alex Acero Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA ABSTRACT ASR, where the widely used metric is word error rate (WER), the translation accuracy is usually measured by the quantities Speech translation (ST) is an enabling technology for cross-lingual including BLEU (Bi-Lingual Evaluation Understudy), NIST-score, oral communication. A ST system consists of two major and Translation Edit Rate (TER) [12][14]. BLEU measures the n- components: an automatic speech recognizer (ASR) and a machine gram matches between the translation hypothesis and the translator (MT). Nowadays, most ASR systems are trained and reference(s). In [1][2], it was reported that translation accuracy tuned by minimizing word error rate (WER). However, WER degrades 8 to 10 BLEU points when the ASR output was used to counts word errors at the surface level. It does not consider the replace the verbatim ASR transcript (i.e., assuming no ASR error). contextual and syntactic roles of a word, which are often critical On the other hand, although WER is widely accepted as the de for MT. In the end-to-end ST scenarios, whether WER is a good facto metric for ASR, it only measures word errors at the surface metric for the ASR component of the full ST system is an open level. It takes no consideration of the contextual and syntactic roles issue and lacks systematic studies. In this paper, we report our of a word. In contrast, most modern MT systems rely on syntactic recent investigation on this issue, focusing on the interactions of and contextual information for translation. Therefore, despite the ASR and MT in a ST system. We show that BLEU-oriented global extreme example offered in [1], it is not clear whether WER is a optimization of ASR system parameters improves the translation good metric for ASR in the scenario of ST. Since the latest ASR quality by an absolute 1.5% BLEU score, while sacrificing WER systems are usually trained by discriminative techniques where over the conventional, WER-optimized ASR system. We also models are optimized by the criteria that are strongly correlated conducted an in-depth study on the impact of ASR errors on the with WER (see an overview paper in [4]), the answers to the final ST output. Our findings suggest that the speech recognizer question of whether WER is a good metric for training the ASR component of the full ST system should be optimized by component of a full ST system become particularly important. translation metrics instead of the traditional WER. The question is addressed in our recent investigation, where we use a log-linear model to integrate the ASR and MT modules Index Terms— Speech translation, speech recognition, for ST. In our approach, the ASR output is treated as a hidden machine translation, translation metric, word error rate, BLEU variable, and the posterior probability of a <ASR-output, MT- score optimization, log-linear model. output> pair given the speech signal is modeled through a regular log-linear model using the feature functions derived from hidden 1. INTRODUCTION Markov model (HMM)-based ASR and a hierarchical phrase-based MT [3]. These “features” include acoustic model (AM) score, source and target language model (LM) scores, phrase level and Speech translation (ST) is an important technology for cross- lexicon level translation model (TM) scores, etc. All the lingual (one-way or two-way) oral communication, whose societal parameters of the log-linear model are trained to directly optimize role is rapidly increasing in the modern global and interconnected the quality of the final translation output measured in BLEU. On a informational age. ST technology as key enabler of universal ST task of oral lecture translation from English to Chinese, our translation is one of the most promising and challenging future experimental results show that the log-linear model and global needs and wants in the coming decade [15]. optimization improve the translation quality by 1.5% in the BLEU A ST system consists of two major components: automatic score. Our investigation also provides insights to the relationship speech recognition (ASR) and machine translation (MT). Over the between the WER of the ASR output and the BLEU score of the past years, significant progress has been made in the integration of final ST output. The experimental results show a poor correlation these two components in the end-to-end ST task between the two, suggesting that WER is not a good metric for the [2][7][9][10][16][20]. In [10], a Bayes-rule-based integration of ASR component of the ST system. In particular, using real ASR and MT was proposed, in which the ASR output is treated as examples extracted from the test data, we further isolate two a hidden variable. In [19], a log-linear model was proposed to typical situations where ASR outputs with higher WER can lead to directly model the posterior probability of the translated output counter-intuitively better translations. These findings suggest that given the input speech signal, where the feature functions are the speech recognizer in a ST system should be trained directly by derived from the overall outputs of the ASR model, the translation the translation metric of the full system such as the BLEU score, model, and the Part-of-Speech language model. This set of work is instead of the local measure of WER. later extended with the use of the phrase-based MT component and a lattice/confusion-network based interface between ASR and MT [8][13]. 2. SPEECH TRANSLATION SYSTEMS Despite their importance, there have been relatively few studies on the impact of ASR errors on the MT quality. Unlike 978-1-4577-0539-7/11/$26.00 ©2011 IEEE 5632 ICASSP 2011 A general framework for ST is illustrated in Fig. 1. The input x Forward word translation feature: (, , ) = speech signal X is first fed into the ASR module. Then the ASR (|) = ∏∏ ∑ (,|,) , where , is module generates the recognition output set {F}, which is in the the m-th word of the k-th target phrase ̃, , is the n- source language. The recognition hypothesis set {F} is finally th word in the k-th source phrase , and (,|,) is passed to the MT module to obtain the translation sentence E in the the probability of translating word to word . target language. In our setup, an N-best list is used as the interface , , (This is also referred to as the lexical weighting feature.) between ASR and MT. In the following, we use F to represent an x Backward phrase translation feature: (,,) = ASR hypothesis in the N-best list. (|) = ∏ (|̃) , where ̃ and are {F} defined as above. X ASR MT E x Backward word translation feature: (, , ) = (|) = ∏∏∑ (,|,), where , and Fig. 1. Two components of a full speech translation system , are defined as above. |()| x Count of NULL translations: (,,) = 2.1. The unified log-linear model for ST is the exponential of the number of the source words that are not translated (i.e., translated to NULL word in the The optimal translation given the input speech signal X is target side). obtained via the decoding process according to x Count of phrases: (,,) = |{̃,,,…,}| is the exponential of the number of phrase pairs. =argmax(|) (1) || x Translation length: (, , ) = is the exponential of the word count in translation E. Based on law of total probability, we have, x Hierarchical phrase segmentation and reordering feature: (,,) =(|, ) is the probability of (|) =(,|) particular phrase segmentation and reordering S, given (2) the source and target sentence E and F [3]. x Target language model (LM) feature: (, , ) = Then we model the posterior probability of the (E, F) sentence pair (), which is the probability of E computed from an given X through a log-linear model: N-gram LM of the target language. Unlike the previous work [8][19], we used a hierarchical 1 phrase-based MT module [3]. It is based on probabilistic (,|) = (, , ) (3) synchronous context-free grammar (PSCFG) models that define a set of weighted transduction rules. These rules describe the translation and reordering operations between source and target where =∑, {∑ (, , )} is the normalization languages. In training, our MT module is learned from parallel denominator to ensure that the probabilities sum to one. In the log- training data; and in runtime, the decoder will choose the most linear model, {(,,)} are the feature functions empirically likely rules to parse the source language sentence while constructed from E, F, and X. The only free parameters of the log- synchronously generating the target language output. Compared linear model are the feature weights, i.e., = {}. Details of with the simple phrase-based MT [6], the hierarchical MT supports these features used in our experiments are provided in the next the translation of non-contiguous phrases with more complex section. segmentation and re-ordering, and it also gives better translation performance [3]. 2.2. Features in the ST model 2.3. Training of feature weights The full set of feature functions constructed and used in our ST system are derived from both the ASR and the MT [2][3] The free parameters of the log-linear model, i.e., the weights modules as listed below: (denoted by ) of these features, are trained by maximizing the x Acoustic model (AM) feature:(,,) = (|), BLEU score of the final translation on a dev set, i.e., which is the likelihood of speech signal X given a ∗ recognition hypothesis F, computed from the AM of the =argmax( ,(, )) (4) source language. x Source language model (LM) feature: (, , ) = where ∗ is the translation reference(s), and (, ) is the (), which is the probability of F computed from a translation output, which is obtained through the decoding process N-gram LM of the source language.

Why Word Error Rate Is Not a Good Metric for Speech Recognizer Training for the Speech Translation Task?

BLEU Might Be Guilty but References Are Not Innocent

A Visualisation Tool for Inspecting and Evaluating Metric Scores of Machine Translation Output

Machine Translation Evaluation

Arxiv:2006.14799V2 [Cs.CL] 18 May 2021 the Same User Input

A Survey of Evaluation Metrics Used for NLG Systems

English and Arabic Speech Translation

Automatic Evaluation Measures for Statistical Machine Translation System Optimization

Re-Evaluating the Role of BLEU in Machine Translation Research

Introduction to Natural Language Processing Computer Science 585—Fall 2009 University of Massachusetts Amherst

Minimum Error Rate Training in Statistical Machine Translation

LEPOR: an Augmented Machine Translation Evaluation Metric Lifeng

Bertscore: Evaluating Text Generation with Bert