
Citation: Doherty, S. (2019). Translation technology evaluation research. In M. O’Hagan (Ed.), The Routledge handbook of translation and technology. Abingdon, Oxon: Routledge.

Publisher link: https://www.routledge.com/The-Routledge-Handbook-of-Translation-and-Technology-1st-Edition/OHagan/p/book/9781138232846

20. Translation technology evaluation research

Stephen Doherty

Abstract

Central to the successful research, development and application of translation technology is its evaluation. Accordingly, there has been a growing demand to evaluate translation technologies based on the systematic and rigorous application of scientific methods to assess their design, development and outcomes. The diverse needs of different stakeholders have led to a proliferation of evaluation methods and measures involving human- and machine-based approaches that are used across academic and industry research contexts. The chapter provides a critical description of contemporary direct and indirect translation technology evaluation research in order to identify trends, emerging issues, and potential solutions. It traces advances made in the field to date and indicates remaining gaps calling for further work. The author anticipates that future research in the field will deliver more standardized and inclusive approaches towards more meaningful and nuanced evaluation indicators for the different stakeholders involved in the process and products of translation.

Keywords: technology evaluation, human-based evaluation, automatic evaluation metrics, translation quality

Introduction

Contemporary translation research and practice require fast and resource-efficient solutions that can scale to high volumes, a growing number of language pairs, an expanding diversity of genres and text types, and a multitude of stakeholder requirements. Translation technology, particularly machine translation (MT) and computer-assisted translation (CAT) tools, offers such a solution and has become central to the process of translation and, by extension, most aspects of interlingual communication and the global language services industry in which research and industry practice are intertwined (see O’Hagan 2013, Doherty 2016). Stakeholders wishing to evaluate translation technology range from researchers and developers to suppliers, providers and buyers of translation products and services, and end-users. The need to evaluate translation technology is common to all of these stakeholders: the primary aim remains the evaluation of the effectiveness and efficiency of a technology within the specified task that it is designed to perform, yet the context, purpose, design, and output of an evaluation often take on vastly different forms. This diversity of stakeholder needs has led to a proliferation of evaluation methods and measures involving human- and machine-based approaches across research and applied industry research settings.

The need to evaluate is of course not limited to translation technology; it is of primary concern for all of the technologies that we use in many aspects of our professional and personal lives. Evaluation in a universal sense is typically defined as a systematic, rigorous, and meticulous application of the scientific method to assess the design, development, and outcomes of a specified tool, process, etc. (e.g. Rossi, Lipsey and Freeman 2004)—this definition rings true for translation technology.

Building upon the detailed descriptions of translation technology and its usage provided in this volume, the current chapter will provide a critical description of contemporary direct and indirect translation technology evaluation research in order to identify trends, emerging issues, and potential solutions. In doing so, it draws a parallel between the evaluation of the products and the processes of translation technology and calls for a more systematic approach to evaluation research in which standardization and universals are foregrounded.

The development of translation technology evaluation research

The development of translation technology is arguably inseparable from its evaluation: after all, how can progress be made if it cannot be identified as such? Indeed, past and current research and development into MT and CAT tools has been carried out in tandem with evaluation, with the findings from the latter feeding directly and indirectly into the former. As translation technology becomes more sophisticated, so too do the methods and measures by which it is evaluated. I distinguish here between an evaluation research method and a measure, where a method is the overall approach to the examination of a specific phenomenon, e.g., the quality of MT output, while a measure is an individual instrument that ascertains a quantifiable or qualifiable constructed unit of that phenomenon, e.g., the notion of accuracy (see, for example, Saldanha and O’Brien 2013).

Early evaluation research of the 20th century took the form of manual human evaluation of the output of the respective translation technology, where the focus was, and continues to be, the examination of the quality of the output using criteria developed from the disciplines of Linguistics and Translation Studies, e.g., accuracy and fluency. A range of computational metrics emerged from the Computational Linguistics and Computer Science research communities to provide an alternative to manual human evaluations, e.g., recall and precision (see Drugan 2013, House 2015, Chunyu and Wong Tak-Ming 2014, Castilho, Doherty, Gaspari and Moorkens 2018). However, as a result of the booming localization industry in the 1990s, a shift to error-based evaluations became apparent in which evaluators count the number and nature of the errors in the output (e.g. the Localization Industry Standards Association Quality Assessment Model [i]). Since then, error-based metrics and Likert scales have become commonplace in evaluation research as they allow human evaluators to identify the strengths and weaknesses of a technology at varying degrees of granularity, from word-level to document- and system-level, and for a variety of different purposes and contexts (Castilho et al. 2018). Subsequent research in the early 2000s then moved to a more holistic approach which allowed for the combination of methods and the incorporation of process-oriented data from users, e.g., translators and post-editors, and end-users of the respective technology.
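
To make the mechanics of an error-based metric concrete, the following sketch computes a weighted error score per 1,000 words from human-annotated error counts. The severity categories, weights, and pass threshold are hypothetical illustrations rather than those of the LISA QA Model or any other published scheme.

```python
# Minimal sketch of an error-based evaluation score.
# The severity categories, weights, and pass threshold below are
# hypothetical and chosen purely for illustration.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def weighted_error_score(error_counts: dict, word_count: int) -> float:
    """Return weighted error points per 1,000 words of evaluated text."""
    points = sum(SEVERITY_WEIGHTS[severity] * count
                 for severity, count in error_counts.items())
    return 1000 * points / word_count

if __name__ == "__main__":
    counts = {"minor": 7, "major": 2, "critical": 0}   # counts supplied by a human evaluator
    score = weighted_error_score(counts, word_count=1250)
    print(f"{score:.1f} weighted error points per 1,000 words")
    print("PASS" if score <= 20 else "FAIL")           # illustrative acceptance threshold
```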

While more recent research in the 2010s has begun to investigate the underlying cognitive processes involved in using translation technology, the focus remains on the product, given the traditional role of translation as a purposeful (Nord 1997) and typically economic activity. Central to the evaluation of any translation technology is the evaluation of the quality of its output. While this chapter does not attempt to discuss translation quality in its own right [ii] (Doherty 2017, Castilho et al. 2018, also see Pym in this volume), it relates the evaluation of quality to the wider evaluation of translation technology given its dominance in research on the current topic (see Bowker in this volume).

In this context, there is an apparent divergence between academic evaluation research and its industry counterpart. In industry contexts, the aim of assessing quality as a means to evaluate a translation technology is to verify that a specific level of quality is reached in accordance with client specifications as well as sector and compliance requirements. In contrast, academic evaluation research typically gives focus to the identification and measurement of changes in a system and/or its output. Indeed, as Drugan (2013) points out, there are different questions and goals between academic and industry approaches in this context, which Lommel and colleagues see as limiting the comparisons that can be made between the two (Lommel, Uszkoreit and Burchardt 2014). On the other hand, several recent initiatives have pushed for a unified approach to find agreement between stakeholders and their evaluation methods and measures (Koby, Fields, Hague, Lommel and Melby 2014), including the Defense Advanced Research Projects Agency evaluations (White and O’Connell 1996), the Framework for the Evaluation of Machine Translation (Hovy et al. 2002), the Multidimensional Quality Metrics (MQM) framework proposed by Lommel and colleagues (2014) as part of the European Commission-funded QT21 [iii] project, the TAUS Dynamic Quality Framework [iv], and initiatives by the International Organization for Standardization (ISO) [v] (see Wright in this volume).

I support the spirit of these initiatives and contend that there are more similarities than differences in evaluation methods across academic and industry research and practice. While the purpose and context of the evaluation may indeed differ, the evaluation process is inherently the same in that the evaluator needs to align the purpose of the evaluation with the resources and methods available, and the desired format of the results of the evaluation.
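
As a purely illustrative sketch of this alignment step, the short Python fragment below records the parameters an evaluator might fix before selecting methods and measures; the field names and example values are hypothetical rather than a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationSpec:
    """Hypothetical record of the parameters fixed before an evaluation begins."""
    purpose: str                      # what the evaluation is meant to establish
    context: str                      # research, procurement, production workflow, etc.
    methods: list = field(default_factory=list)    # e.g. human linguistic evaluation, AEMs
    measures: list = field(default_factory=list)   # e.g. accuracy, fluency, error counts
    reporting_format: str = "summary scores with a full methodological description"

spec = EvaluationSpec(
    purpose="compare the output quality of two MT systems (declarative evaluation)",
    context="academic study, English-German news text",
    methods=["human linguistic evaluation", "automatic evaluation metrics"],
    measures=["accuracy", "fluency", "BLEU"],
)
print(spec)
```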

In both academic and industry contexts, evaluation methods typically employ human-based and/or machine-based linguistic evaluations. I will first concisely describe human-based linguistic methods in the combinations in which they are most commonly employed, then turn to evaluations concerning usability, and finally to the machine-based evaluation methods known as automatic evaluation metrics. Table 1 provides an overview of all of the methods, where, based on the description of MT evaluation by Humphreys and colleagues (1991) and Way (2018), I will categorize translation technology evaluation methods according to the following:

1. Typological evaluations that evaluate the linguistic phenomena processed by the technology;
2. Declarative evaluations that evaluate the performance of the technology with regard to the output it produces against a specific set of evaluation criteria;
3. Operational evaluations that evaluate the effectiveness of a technology within a specific process and/or workflow.

Linguistic evaluation (product-oriented) measures: accuracy; automatic evaluation metrics; comprehensibility; error-based metrics; fluency; readability; rubrics.

Performance-based evaluation (process-oriented) measures: audio, video, and screen recording; cognitive load (e.g. eye tracking, self-report); interviews (e.g. individual or group-based; structured, semi-structured, or unstructured); reception studies (e.g. product testing); resource-based metrics (e.g. cost, risk, computational power); task performance (e.g. time, efficiency); surveys or questionnaires (e.g. capturing attitudes and opinions using constructed items or existing psychometrics); usability-based metrics; verbalisation (e.g. think-aloud protocol).

Each measure is marked against two of the three evaluation types (typological, declarative, and operational).

Table 1: Overview of methods in translation technology evaluation research

Human-centric evaluation of translation technology

Accuracy (also called adequacy or fidelity) pertains to the extent to which the translation unit carries the meaning of the source into the target (e.g., Koehn 2009, Koehn 2010, Drugan 2013, House 2015, Doherty 2017, Castilho et al. 2018). Human evaluators typically rate the accuracy of a translation at sentence level and assign a score out of a prescribed range or rank different translations of the same sentence from best to worst on an ordinal scale. Accuracy is typically used in conjunction with fluency (also called intelligibility), a measure that focuses on the rendering of the translation according to the rules and norms of the target language (Arnold et al. 1994, Reeder 2004). Accuracy and fluency are the longest-standing measures of translation quality assessment and are therefore central to evaluation methods more generally (Doherty 2017).

Readability typically refers to the ease with which a defined segment of text can be read by a specified reader, e.g., according to age or education level. While readability measures are programmed to calculate a variety of linguistic features, e.g., word frequency and word and sentence length, and non-linguistic features, e.g., formatting and spacing, in one or multiple languages (e.g. Björnsson 1971, Kincaid et al. 1975, Graesser, McNamara, Louwerse and Cai 2004), self-report ordinal scales are also used wherein an evaluator has readers or users of a text provide numerical ratings based on a set of specified criteria for readability. In either form of measurement, readability is largely dependent on the text given its focus on linguistic content and its immediate environment (Doherty 2012). Readability measures are available in word processing software and in online tools, as well as being integrated into translation technology software (e.g. Stymne, Tiedemann, Hardmeier and Nivre 2013).
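
As a concrete example of a formula-based readability measure, the sketch below computes the Flesch-Kincaid grade level (Kincaid et al. 1975); the syllable counter is a crude heuristic used only for illustration, not the validated counting procedure.

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = "The system translates the text. The user reads the translated output."
print(round(flesch_kincaid_grade(sample), 1))
```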

Closely related to the theoretical construct and measurement of readability is comprehensibility, which focuses on reader-dependent features to measure the extent to which a reader has understood and retained the information contained in a text. Recall and cloze testing are often used to assess comprehension (e.g. Doherty 2012). Readability and comprehensibility measures can be used in tandem to evaluate both the linguistic and non-linguistic components of a text and the individual reader interacting with it.

Acceptability refers to the extent to which a system and its output meet the needs of the user (Nielsen 1993) and is typically employed in translation technology evaluation research by asking users the extent to which a translation, or part thereof, meets the requirements of the task at hand (e.g. Castilho 2016). This form of evaluation is a more passive counterpart to usability and reception studies, in which evaluators have end-users of translated and localized products, e.g., a video game, use the product to test its functionality and response in the target language market. While few studies evaluate usability in isolation (cf. Byrne 2006), this form of evaluation has become a mainstay of academic research over the past decade, not only to evaluate how users interact with a technology but also how they use its output. Usability studies (e.g. Gaspari 2004, Roturier 2006, Stymne et al. 2012, Castilho et al. 2014, Doherty and O’Brien 2014, Klerke et al. 2015, Castilho and O’Brien 2016, Moorkens and Way 2016) have investigated the usability or usefulness of a technology’s output using measures of efficiency, accuracy, and user satisfaction.

Moving to a more active form of evaluation, research into post-editing reveals valuable information about how the output from a translation technology is actually used. Post-editing is the human editing of raw MT output to achieve a specified level of quality through minor (light post-editing) and major (full post-editing) interventions (Allen 2003, also see Vieira in this volume). Krings’s (2001) division of post-editing effort into three discrete categories of investigation has been widely accepted: temporal, technical, and cognitive, where temporal effort is the time spent doing post-editing, technical effort is the number and form of edits made, and cognitive effort is measured using eye tracking, keyboard logging, pause-to-word ratios, or other similar methods.
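
The sketch below illustrates how these three categories of effort are commonly operationalized, using a hypothetical post-editing log: temporal effort as words post-edited per minute, technical effort as the word-level edit distance between the raw MT output and the post-edited version, and a cognitive proxy as a pause-to-word ratio. The one-second pause threshold and the example data are assumptions for illustration only.

```python
def word_edit_distance(mt_output: str, post_edited: str) -> int:
    """Technical effort: minimum number of word insertions, deletions and substitutions."""
    a, b = mt_output.split(), post_edited.split()
    dist = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dist[0] = dist[0], i
        for j, wb in enumerate(b, 1):
            cur = min(dist[j] + 1,           # deletion
                      dist[j - 1] + 1,       # insertion
                      prev + (wa != wb))     # substitution (0 cost if words match)
            prev, dist[j] = dist[j], cur
    return dist[-1]

def words_per_minute(word_count: int, seconds: float) -> float:
    """Temporal effort: throughput of the post-editing session."""
    return word_count / (seconds / 60)

def pause_to_word_ratio(keystroke_times: list, word_count: int, threshold: float = 1.0) -> float:
    """Cognitive proxy: pauses longer than `threshold` seconds per post-edited word."""
    pauses = sum(1 for t1, t2 in zip(keystroke_times, keystroke_times[1:]) if t2 - t1 > threshold)
    return pauses / word_count

mt = "the contract are signed by both party"
pe = "the contract is signed by both parties"
print("technical:", word_edit_distance(mt, pe), "edits")
print("temporal :", round(words_per_minute(7, 35), 1), "words/min")
print("cognitive:", round(pause_to_word_ratio([0.0, 0.4, 2.1, 2.3, 4.9], 7), 2), "pauses/word")
```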

Studies of post-editing (e.g. De Almeida and O’Brien 2010, Depraetere 2010, Plitt and Masselot 2010, Specia 2011, Koponen 2012, O’Brien et al. 2012, O’Brien et al. 2013, O’Brien et al. 2014, Lacruz and Shreve 2014, Guerberof 2014, Moorkens et al. 2015, Daems et al. 2015, Carl et al. 2015) are significant to translation technology evaluation research as they provide a comprehensive analysis of what the translator or post-editor is actually doing with the output of a technology, but also take into account the interface of the software and other ergonomic and task-based factors of performance (see Ehrensberger-Dow and Massey; Läubli and Green in this volume). As a consequence of the above studies, evaluation research involving post-editing has seen the greatest development in recent years and has led to significant advances in how translation technology is evaluated from the perspective of several stakeholders, including translators and end-users, in the context of integrated technologies such as MT and CAT rather than in isolation, and in combination with machine-based evaluation methods (see below).

Usability studies have developed to include the interface of the translation technology itself and the impact on the user’s cognitive processing of the translation and post-editing task. These studies include evaluations of the interfaces of CAT tools for translators (e.g. O’Brien 2008, O’Brien, O’Hagan and Flanagan 2010) and post-editing software (e.g. Moorkens, O’Brien and Vreeke 2016, Torres-Hostench, Moorkens, O’Brien and Vreeke 2017, Moorkens and O’Brien 2017), and explore the human factors associated with the human-computer interaction (see O’Brien 2012) of translation technologies, including ergonomics (e.g. Ehrensberger-Dow and O’Brien 2015, Ehrensberger-Dow and Heeb 2016) and cognitive load (Doherty et al. 2010, Doherty 2014, O’Brien 2011). These studies have also expanded upon the evaluation methods typically used in translation technology, namely the use of surveys (e.g., Ehrensberger-Dow and O’Brien 2015), video and screen recordings (Ehrensberger-Dow and Heeb 2016), interviews (e.g., Ehrensberger-Dow and Heeb 2016), and verbalizations in the form of Think Aloud Protocols (e.g. Doherty 2012).

Lastly, performance-based evaluations are employed to evaluate how users of translated content actually use a specific product or service. This approach contains subjective and objective measures, including: time, e.g. time spent on a webpage; browsing behaviour (e.g. visits and revisits to a webpage or window, number of likes and shares on social media); and asking users for their opinions on the content, e.g. checking whether the webpage was what the user was actually searching for, satisfaction ratings, and likelihood of recommendation.

These performance-based evaluation methods focus on the users of the final product or service and involve collecting usage data in a real-world user scenario. As such, they are more feasible and common in the industry and provide an indirect method for translation technology evaluation. Similarly, performance-based evaluations of translators and post-editors often take the form of identifying the number of words translated or edited relative to a given time and the number of errors made in that period.
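
A minimal sketch of how such usage data might be aggregated, assuming hypothetical session records with time on page, a task-success flag, and a satisfaction rating; the field names and figures are invented for illustration.

```python
from statistics import mean

# Hypothetical reception-study records for a machine-translated webpage.
sessions = [
    {"seconds_on_page": 42, "found_what_they_needed": True,  "satisfaction": 4},
    {"seconds_on_page": 95, "found_what_they_needed": False, "satisfaction": 2},
    {"seconds_on_page": 37, "found_what_they_needed": True,  "satisfaction": 5},
]

print("mean time on page (s):", round(mean(s["seconds_on_page"] for s in sessions), 1))
print("task success rate    :", round(sum(s["found_what_they_needed"] for s in sessions) / len(sessions), 2))
print("mean satisfaction    :", round(mean(s["satisfaction"] for s in sessions), 2))
```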

Machine-centric evaluations of translation technology

While it is evident that a range of human-based methods and measures are available to stakeholders to evaluate translation technology, they can be resource-expensive, inconsistent, and may not be feasible for the small-scale, incremental evaluations of a technology’s development (Bojar et al. 2011, Callison-Burch et al. 2011). Moreover, it is widely acknowledged in evaluation research that no single method or measure can address all possible purposes and use cases (e.g., Hovy et al. 2002). MT evaluation, in particular, requires an alternative to human-based methods so that MT researchers and developers can identify the immediate effect of frequent changes to their systems and workflows, e.g., adding corpora, changing models, modifying parameters, etc. As such, human-based linguistic evaluations of MT tend to be confined to large-scale evaluation campaigns (e.g., Callison-Burch et al. 2009, Callison-Burch et al. 2011) and industry contexts (e.g. Lommel et al. 2014), and MT researchers have developed a range of machine-based evaluation measures called automatic evaluation metrics (AEMs) to provide a simplified, resource-cheap and more consistent alternative to human evaluation.

Broadly speaking, the purpose of AEMs is to measure the similarity or difference between the output of an MT system and a reference translation or gold standard, typically a human translation, on a set of predefined linguistic criteria (Koehn 2010). Research and development of AEMs is growing at a rapid pace, with initiatives to improve existing metrics and create new ones (Castilho et al. 2018). Existing metrics, however, do not readily indicate the linguistic nature of the types of errors that an MT output contains (Uszkoreit and Lommel 2013).

Implicit in AEMs is the assumption that the performance of an MT system can be evaluated on the basis of the prescribed linguistic features matching at word, phrase, and sentence level. This approach contrasts with human-based linguistic evaluations using concepts of accuracy and fluency. As AEMs are not yet sophisticated enough to make such complex evaluations, most of them employ information retrieval concepts of error rate, edit distance, precision and recall, building them into metrics of varying complexity and operating usually on single words (unigrams) and, to a lesser extent, on longer phrases (n-grams). Edit distance refers to the minimum number of edits required to make the MT output match the human reference translation(s). Precision is typically defined as the relative number of words in the MT output that are correct as verified against the reference translation on the set of criteria, and recall is the relative number of words in the reference translation that are correctly reproduced in the MT output.

Many of the first AEMs used in MT evaluation research originated from speech recognition research, e.g., Word Error Rate (WER) by Nießen and colleagues (2000) (also see Ciobanu and Secară in this volume). Error-rate metrics compute the insertions, deletions and substitutions required for the MT output to match the reference translation, normalized by the length of the reference translation. Popular modifications of WER can be found in Translation Edit Rate (TER) and its variant Human-targeted TER (HTER) (Snover et al. 2006). Other popular AEMs include General Text Matcher (GTM) (Turian et al. 2003) and METEOR (Lavie and Agarwal 2007). Adopting an approach based on precision and recall, Bilingual Evaluation Understudy (BLEU) (Papineni et al. 2002) gained substantial popularity by showing consistently moderate correlations with human evaluation data, so much so that it became the official metric of the MT evaluation campaigns of the US National Institute of Standards and Technology (NIST) (Doddington 2002). BLEU remains the de facto standard for most research purposes, while industry applications have begun to show scores on several metrics, e.g., the TER and BLEU scores offered by KantanMT [vi], a popular MT provider. Over the last decade, however, a new generation of AEMs has emerged to outperform BLEU in their correlations with human-based linguistic evaluations, including TERp (Snover et al. 2006), MaxSim (Chan and Ng 2008), ULC (Gimenez and Marquez 2008), RTE (Padó et al. 2009), wpBleu (Popović and Ney 2009) and chrF (Popović 2015).
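
The sketch below implements the simplest versions of these building blocks against a single reference translation: clipped unigram precision and recall, and a WER-style word error rate based on edit distance. Real AEMs add n-gram matching, brevity penalties, weighting schemes, and support for multiple references; this is an illustrative reduction, not an implementation of any published metric.

```python
from collections import Counter

def unigram_precision_recall(hypothesis: str, reference: str):
    """Clipped unigram precision and recall of an MT hypothesis against one reference."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())        # shared words, clipped by their counts
    return overlap / sum(hyp.values()), overlap / sum(ref.values())

def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance normalized by reference length (WER-style)."""
    h, r = hypothesis.split(), reference.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(r) + 1)] for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + (h[i - 1] != r[j - 1])) # substitution
    return d[len(h)][len(r)] / len(r)

hyp = "the cat sat on mat"
ref = "the cat sat on the mat"
print(unigram_precision_recall(hyp, ref))   # (1.0, 0.833...): all output words match, one reference word is missed
print(round(word_error_rate(hyp, ref), 3))  # 0.167: one insertion needed, reference is six words long
```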

In addition, many of these newer AEMs offer advantages in terms of the complexity of linguistic features they can cover and their ability to work at the character level, e.g., the morpho-syntactic information captured by chrF (Popović 2015). AEMs have also provided an interesting nexus between human-based evaluations and machine-based evaluations in that many researchers seek to examine the correlations between them. The end result of such an approach is to establish a benchmark for an AEM against several human-based evaluations so that, once the correlation has been repeatedly substantiated, the metric can be used without human evaluation data in the future. This line of enquiry has yielded compelling results showing the complementarity of human- and machine-based evaluation methods, including correlations between AEMs and human-based linguistic evaluations (see O’Brien 2011) and between AEMs and post-editing data (e.g. Tatsumi 2009, Sousa et al. 2011, Specia 2011, Daems et al. 2015).

In this section, I have described each of the research methods employed in translation technology evaluation research. These methods are readily grouped into two areas, namely human- and machine-centric evaluation methods. Commonalities and links between the two areas are apparent in the focus on the evaluation of the output of a technology, chiefly that of MT, while a growing body of research in both areas also incorporates process-oriented data to show not only how users of a technology interact with the tool itself, but also how it impacts on the wider environment and performance of the user. Using an extension of the categorization of evaluation methods described by Humphreys and colleagues (1991) and Way (2018), the summary of methods in Table 1 shows how a single evaluation method may be used for more than one purpose and even combined with other methods to offer a more comprehensive evaluation that takes both process and product into account. The next section will move to discuss current limitations and identify emerging developments in translation technology evaluation research.
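
A minimal sketch of this benchmarking step, assuming hypothetical segment-level AEM scores and averaged human adequacy ratings; it uses scipy's Pearson and Spearman correlation functions.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical segment-level data: one AEM score and one averaged
# human adequacy rating (1-5 scale) per segment.
aem_scores     = [0.31, 0.42, 0.18, 0.57, 0.49, 0.25, 0.63, 0.38]
human_adequacy = [2.7,  3.4,  2.1,  4.2,  3.8,  2.5,  4.5,  3.0]

r, r_p = pearsonr(aem_scores, human_adequacy)        # linear association
rho, rho_p = spearmanr(aem_scores, human_adequacy)   # rank-based association

print(f"Pearson r    = {r:.2f} (p = {r_p:.4f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.4f})")
```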

Emerging developments in translation technology evaluation research

Building upon the descriptions of the previous section, I will now identify emerging developments in translation technology evaluation research in order to highlight current challenges, limitations, and potential solutions. I identify these developments under three overarching themes: universalism and standardization, methodological limitations, and education and training.

Universalism and standardization

An implicit recurring theme in translation technology evaluation research is that of universalism. Over the past century, research not only in translation theory, but also in translation technology, linguistics, and computational linguistics has wrestled with the theme, its conceptualization and its measurement. While this has resulted in a range of methods and metrics by which we can evaluate translation technology, principally MT systems, it is arguably a product of different disciplines and stakeholders trying to use their expertise and resources to solve a complex issue.

The idea of a universal method or measure is indeed attractive; however, I contend that it is simply not operationally realistic, given the diversity of stakeholder needs and contexts in translation technology research and industry application. In its place, I propose taking a step back from superficial operationalization and focussing on conceptualization, where a universal of evaluation can be found. That is to say that the universal is the fact that all stakeholders in evaluation research aim to evaluate a technology within a set of predefined parameters for a specific goal and in a specific context. While there may not be a one-size-fits-all method or measure available, as others have already well articulated (e.g., Lommel et al. 2014), the aim, purpose, and context of an evaluation are undeniable commonalities to all stakeholders. I return here to the description of evaluation that opened the current chapter, which emphasized the systematic, rigorous, and meticulous nature of all evaluation research (the universal) regardless of how it may be operationalized in individual contexts in line with individual needs (the local).

As mentioned in the second section of the current chapter, several initiatives are already underway, e.g., the Multidimensional Quality Metrics (MQM) framework and the TAUS DQF, to bring together the existing range of human- and machine-based methods of evaluation in order to standardize them in a meaningful way on a global scale.

Such standardization would add greatly to the value of evaluation research and may even allow us to explain previous research findings in a new light, e.g., inconsistency between human-based measures and AEMs. Turchi et al. (2014), for example, note widespread inconsistency between evaluators in MT evaluation and quality estimation of MT, and, on examination of the reported inter-rater reliability scores for the Workshop on Machine Translation evaluation campaigns (Bojar et al. 2014) from 2011 through 2014 and across all ten language pairs, one finds scores that fall very low against the widely agreed thresholds for inter-rater reliability (see Landis and Koch 1977).
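
As a worked example of such an inter-rater reliability check, the sketch below computes Cohen's kappa for two evaluators' sentence-level judgements and maps the result onto the interpretation bands of Landis and Koch (1977); the judgements themselves are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    chance = sum(freq_a[cat] * freq_b[cat] for cat in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

def landis_koch_band(kappa: float) -> str:
    """Landis and Koch (1977) interpretation bands for kappa values."""
    if kappa < 0:
        return "poor"
    for cutoff, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                          (0.80, "substantial"), (float("inf"), "almost perfect")]:
        if kappa <= cutoff:
            return label

rater_a = ["fluent", "fluent", "disfluent", "fluent", "disfluent", "fluent"]
rater_b = ["fluent", "disfluent", "disfluent", "fluent", "fluent", "fluent"]
k = cohens_kappa(rater_a, rater_b)
print(f"kappa = {k:.2f} ({landis_koch_band(k)} agreement)")
```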

Methodological limitations

A number of resolvable methodological challenges can be identified in the current literature which have also limited the consistency and impact of both human- and machine-based methods and measures, namely: the design and reporting of evaluation research methods, the contextualization of methods and their findings, and the use of evaluators. The methodological design of an evaluation is key to its success, and the clear reporting of methodologies is crucial to the overall success of evaluation research more generally. These methodological challenges must be addressed for us to reach a point where we have sufficient information to consistently replicate meaningful research findings across technologies, research groups, language pairs, and linguistic phenomena.

The use of a growing variety of evaluation methods and measures has led to limitations in our ability to conduct meta-analyses and chart the development of translation technology over time. While user activity data (e.g. Way 2018), evaluation campaigns (e.g. the Workshop on Machine Translation [vii]) and empirical studies (e.g. Callison-Burch et al. 2007, Graham et al. 2014) have begun to provide longitudinal data to show development over time (as we would logically expect to be the case), the absence of a common yardstick has limited our ability to compare findings across languages, domains, text types, and time periods. In turn, this significantly limits our ability to perform meta-analyses which can harness the statistical power of a multitude of studies in order to identify effects and correlations between variables of interest, e.g., between human- and machine-based metrics, as is the case in most other scientific disciplines.

Furthermore, as evaluation research is largely confined to one technology in isolation and its immediate output and usage scenario, a high-level approach to evaluation research would be a valuable avenue of future exploration. Such work would explore the effectiveness and efficiency of translation technology in the wider context of an entire workflow or business management process and, of course, tie into business decision-making on resources, e.g., through a cost-benefit analysis. While industry stakeholders are much more likely to engage in this type of research internally, few stakeholders have publicly shared findings of high-level evaluations due to their commercial sensitivity, although exceptions exist, e.g. Roturier (2009). As such, it can be difficult for academic researchers to understand industry contexts and needs, and vice versa. Collaborative projects between industry and academic stakeholders can be invaluable in addressing the decontextualized nature of evaluation research.

A further methodological challenge is the use of evaluators for human-based evaluation methods and measures, which can involve both professional and non-professional evaluators depending on the parameters of the evaluation and the resources available. While trained translators and linguists are more frequently involved in evaluation in industry research, the use of self-selecting amateur evaluators has become commonplace in academic research evaluations, particularly in MT, where resource constraints make it difficult for researchers and developers to access evaluators with the appropriate training in evaluation, language pair combination, and domain expertise. There is therefore a reliance on student and amateur evaluators in MT research in which voluntary evaluators possess an undisclosed proficiency in the languages involved, an unknown expertise with the domain and text type involved, and no training in the requirements of an evaluation nor any formal knowledge of linguistics or translation (see also Doherty 2017). A further complication is the use of crowdsourcing platforms to access large pools of amateur evaluators, in which averaging their scores on the prescribed evaluation criteria attempts to minimize bias and error, while the large number of evaluators, relative to designs using professional evaluators, attempts to compensate for the limited expertise of the evaluators (also see Jiménez-Crespo in this volume).
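
The sketch below shows this aggregation logic in its simplest form, assuming hypothetical 1-5 adequacy ratings from crowd evaluators for each segment: scores are averaged per segment, and a basic screening step discards ratings that deviate strongly from the initial mean. The screening rule is an illustrative assumption, not an established quality-control procedure.

```python
from statistics import mean

# Hypothetical crowd ratings: segment id -> 1-5 adequacy ratings from amateur evaluators.
crowd_ratings = {
    "segment 1": [4, 4, 5, 4, 1, 4],
    "segment 2": [2, 3, 2, 2, 5, 2],
}

def screened_mean(scores: list, max_deviation: float = 1.5) -> float:
    """Mean rating after discarding scores far from the first-pass mean (crude screening)."""
    first_pass = mean(scores)
    kept = [s for s in scores if abs(s - first_pass) <= max_deviation] or scores
    return mean(kept)

for segment, scores in crowd_ratings.items():
    print(segment, "raw mean:", round(mean(scores), 2),
          "screened mean:", round(screened_mean(scores), 2))
```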

While it is rare for evaluation research to engage with groups of professional evaluators due to the cost and time involved, it would clearly be invaluable for critical periods of development for technologies, particularly MT systems. As few studies provide details of the evaluators used (see Doherty 2017), it remains a challenge to encourage stakeholders to report such detail and to acknowledge the potential methodological issues involved with the choice of evaluators, be they unknown volunteers and amateurs or paid professionals.

Given the above methodological issues, it is unsurprising that there is still considerable inconsistency in the results of evaluations and in the correlation between human- and machine-based evaluation results (see, e.g. Labaka et al. 2014, Doherty 2017). Given that results from any evaluation method or measure may vary considerably depending on the chosen parameters, they should be analysed and interpreted with due diligence after a comprehensive description of the methodology employed. After all, inconsistencies in results are arguably at least partially due to the parameters used in evaluations, including formalized criteria, operationalized measures, choice of evaluators, etc. Doherty (2017) proposes a set of psychometric principles to improve the validity, reliability, and consistency of translation technology evaluation methods, which may help to address the above issues.

Education and training

Education and training are essential to the effective and efficient use of any technology and its evaluation. The growth in the number of evaluation methods, especially AEMs, can make it problematic for stakeholders to correctly understand and use the method(s) and measure(s) appropriate to their needs and contexts. With the growing development of translation technology evaluation research in general, and MT evaluation in particular, it behoves all stakeholders to be informed as to the strengths and weaknesses of evaluation methods and measures. While there are still limited educational and training opportunities in the form of online and face-to-face courses and workshops, research publications have become more accessible due to open-access policies and the continued work of institutional and public archives (e.g., the MT Archive [viii]) and academic research services and networks (e.g., Google Scholar, Academia.edu, ResearchGate).

Several associations and organisations also provide online resources (e.g., the Globalization and Localization Association [ix]) in addition to those found on YouTube. Lastly, numerous academic conferences involving translation technology now include researcher, developer, and user tracks and co-locate with other conferences, workshops, and events; these initiatives are an invaluable form of education, training, and interaction for all stakeholders interested in evaluation research.

Conclusion

Central to the successful research, development and application of translation technology is its evaluation. While not unique to translation technology, evaluation has been a substantial topic of research across the numerous disciplines concerned with it. Due to the diversity of contexts, needs, resources, and stakeholders involved in translation technology evaluation, it is unsurprising that there has been considerable development of a wide range of human-based and machine-based evaluation methods from which evaluators can freely choose. While these methods have drawn from established theoretical constructs and practices in Translation Studies, Computational Linguistics, and Computer Science, more recent forays into the investigation of the process of translation and its components, particularly post-editing, have highlighted the value of online measures in the evaluation of translation technology, particularly in the form of performance-based measures.

While significant growth in the length and breadth of translation technology evaluation research has yielded important insights into our understanding, development, and application of it, further opportunities exist if we are to harness the potential of the evaluation methods available to us by employing them in an informed, effective, and efficient manner to overcome the issues identified in the previous section, namely those of universalism and standardization, methodological limitations, and education and training.

I hypothesize that future evaluation research will continue to work towards addressing the above limitations by employing more standardized and inclusive approaches to bridge the gap between existing methods, e.g., automatic evaluation metrics and performance- and usability-based measures, to provide more meaningful and nuanced indicators to all stakeholders involved in the process and products of translation.

Further reading

Moorkens, J., S. Castilho, F. Gaspari and S. Doherty (eds) (2018) Translation quality assessment: From principles to practice, Cham, Switzerland: Springer. The book provides a detailed critical description of the strengths and limitations of contemporary translation technology across mainstream academic and industry contexts.

Chunyu, K. and B. Wong Tak-Ming (2014) ‘Evaluation in machine translation and computer-aided translation’, in S-W. Chan (ed.), The Routledge encyclopaedia of translation technology. London: Routledge, 213–236. This chapter provides a detailed historical overview of the evaluation of MT and CAT tools.

Doherty, S. (2017) ‘Issues in human and automatic translation quality assessment’, in D. Kenny (ed.) Human issues in translation technology, London: Routledge, 131–148. This chapter critically reviews contemporary translation quality assessment practices to identify methodological issues and demonstrate how these issues can be overcome by drawing upon psychometric principles.

Related topics

24: Translation technology research with eye-tracking
25: Future of Machine translation: Musings on Weaver’s memo
26: Quality
27: Fit-for-purpose translation


References

Allen, J. (2003) ‘Post-editing’, in H. Somers (ed.), Computers and translation: A translator’s guide, Amsterdam: John Benjamins Publishing Company, 297–317.
Arnold, D., L. Balkan, S. Meijer, R. Lee Humphreys and L. Sadler (1994) Machine translation: An introductory guide. Manchester, UK: Blackwell.
Bojar, O., M. Ercegovčević, M. Popel and O. Zaidan (2011) ‘A grain of salt for the WMT manual evaluation’, Proceedings of the 6th Workshop on Statistical Machine Translation, 30–31 July 2011, Edinburgh, 1–11.
Byrne, J. (2006) Technical translation: Usability strategies for translating technical documentation. Heidelberg: Springer.
Callison-Burch, C., C. Fordyce, P. Koehn, C. Monz and J. Schroeder (2007) ‘(Meta-)evaluation of machine translation’, Proceedings of the Second Workshop on Statistical Machine Translation, Prague, 136–158.
Callison-Burch, C., P. Koehn, C. Monz and J. Schroeder (2009) ‘Findings of the 2009 Workshop on Statistical Machine Translation’, Proceedings of the 4th EACL Workshop on Statistical Machine Translation, 30–31 March 2009, Athens, 1–28.
Callison-Burch, C., P. Koehn, C. Monz and O. Zaidan (2011) ‘Findings of the 2011 Workshop on Statistical Machine Translation’, Proceedings of the 6th Workshop on Statistical Machine Translation, 30–31 July 2011, Edinburgh, 22–64.
Carl, M., S. Gutermuth and S. Hansen-Schirra (2015) ‘Post-editing machine translation: A usability test for professional translation settings’, in A. Ferreira and J. Schwieter (eds) Psycholinguistic and cognitive inquiries into translation and interpreting, Amsterdam: John Benjamins Publishing Company, 145–174.
Castilho, S. (2016) Measuring acceptability of machine translated enterprise content, Doctoral Thesis, Dublin City University, Ireland.
Castilho, S., S. Doherty, F. Gaspari and J. Moorkens (2018) ‘Approaches to human and machine translation quality assessment’, in J. Moorkens, S. Castilho, F. Gaspari and S. Doherty (eds), Translation quality assessment: From principles to practice. Cham: Springer, 1–29.
Castilho, S. and S. O’Brien (2016) ‘Evaluating the impact of light post-editing on usability’, Proceedings of the Tenth International Conference on Language Resources and Evaluation, 23–28 May 2016, Portorož, 310–316.
Castilho, S., S. O’Brien, F. Alves and M. O’Brien (2014) ‘Does post-editing increase usability? A study with Brazilian Portuguese as target language’, Proceedings of the Seventeenth Annual Conference of the European Association for Machine Translation, 16–18 June 2014, Dubrovnik, 183–190.
Chan, Y. and H. Ng (2008) ‘MAXSIM: An automatic metric for machine translation evaluation based on maximum similarity’, Proceedings of the MetricsMATR Workshop of AMTA-2008, Honolulu, Hawaii, 55–62.
Daems, J., S. Vandepitte, R. Hartsuiker and L. Macken (2015) ‘The impact of machine translation error types on post-editing effort indicators’, Proceedings of the 4th Workshop on Post-Editing Technology and Practice, 3 November 2015, Miami, 31–45.
De Almeida, G. and S. O’Brien (2010) ‘Analysing post-editing performance: Correlations with years of translation experience’, Proceedings of the 14th Annual Conference of the European Association for Machine Translation, 27–28 May 2010, St. Raphaël.
Depraetere, I. (2010) ‘What counts as useful advice in a university post-editing training context? Report on a case study’, Proceedings of the 14th Annual Conference of the European Association for Machine Translation, 27–28 May 2010, St. Raphaël.
Doddington, G. (2002) ‘Automatic evaluation of machine translation quality using n-gram co-occurrence statistics’, Proceedings of the Second International Conference on Human Language Technology Research, San Diego, 138–145.
Doherty, S. (2012) Investigating the effects of controlled language on the reading and comprehension of machine-translated texts: A mixed-methods approach using eye tracking, Doctoral Thesis, Dublin City University, Ireland.
Doherty, S. (2014) ‘The design and evaluation of a Statistical Machine Translation syllabus for translation students’, Interpreter and Translator Trainer 8, 295–315.
Doherty, S. (2016) ‘The impact of translation technologies on the process and product of translation’, International Journal of Communication 10, 947–969.
Doherty, S. (2017) ‘Issues in human and automatic translation quality assessment’, in D. Kenny (ed.) Human issues in translation technology, London: Routledge, 131–148.
Doherty, S., S. O’Brien and M. Carl (2010) ‘Eye tracking as an MT evaluation technique’, Machine Translation 24, 1–13.
Drugan, J. (2013) Quality in professional translation: Assessment and improvement. London, UK: Bloomsbury.
Ehrensberger-Dow, M. and S. O’Brien (2015) ‘Ergonomics of the translation workplace’, Translation Spaces, 4(1), 98–118.
Ehrensberger-Dow, M. and A. H. Heeb (2016) ‘Investigating the ergonomics of the technologized translation workplace’, in R. Muñoz Martín (ed.) Reembedding translation process research, Amsterdam: John Benjamins Publishing Company, 69–88.
Gaspari, F. (2004) ‘Online MT services and real users’ needs: An empirical usability evaluation’, Proceedings of AMTA 2004: 6th Conference of the Association for Machine Translation in the Americas, Berlin, 74–85.
Graesser, A. C., D. S. McNamara, M. M. Louwerse and Z. Cai (2004) ‘Coh-Metrix: Analysis of text on cohesion and language’, Behavior Research Methods, 36(2), 193–202.
Graham, Y., T. Baldwin, A. Moffat and J. Zobel (2014) ‘Is machine translation getting better over time?’, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 443–451.
Guerberof, A. (2014) ‘Correlations between productivity and quality when post-editing in a professional context’, Machine Translation, 28(3–4), 165–186.
House, J. (2015) Translation quality assessment: Past and present. London, UK: Routledge.
Hovy, E., M. King and A. Popescu-Belis (2002) ‘Principles of context-based machine translation evaluation’, Machine Translation 17(1), 43–75.
Humphreys, L., M. Jäschke, A. Way, L. Balkan and S. Meyer (1991) ‘Operational evaluation of MT’, Working Papers in Language Processing 22. Essex, UK: University of Essex.
Kincaid, J., R. Fishburne, R. Rogers and B. Chissom (1975) Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for navy enlisted personnel (No. RBR-8-75), Naval Technical Training Command, Millington, TN: Research Branch.
Klerke, S., S. Castilho, M. Barrett and A. Søgaard (2015) ‘Reading metrics for estimating task efficiency with MT output’, Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning, 18 September 2015, Lisbon, 6–13.
Koby, G., P. Fields, D. Hague, A. Lommel and A. Melby (2014) ‘Defining translation quality’, Revista tradumàtica, 12, 413–420.
Koehn, P. (2009) Statistical machine translation. Cambridge, UK: Cambridge University Press.
Koehn, P. (2010) ‘Enabling monolingual translators: Post-editing vs. options’, Proceedings of the 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, 537–545.
Koponen, M. (2012) ‘Comparing human perceptions of post-editing effort with post-editing operations’, Proceedings of the Seventh Workshop on Statistical Machine Translation, 7–8 June 2012, Montréal, 181–190.
Krings, H. (2001) Repairing texts: Empirical investigations of machine translation post-editing processes. Kent, OH: Kent State University Press.
Lacruz, I. and G. Shreve (2014) ‘Pauses and cognitive effort in post-editing’, in S. O’Brien, L. Balling, M. Carl, M. Simard and L. Specia (eds), Post-editing of machine translation: Processes and applications, Newcastle-Upon-Tyne: Cambridge Scholars Publishing, 246–272.
Landis, J. and G. Koch (1977) ‘The measurement of observer agreement for categorical data’, Biometrics, 33(1), 159–174.
Lavie, A. and A. Agarwal (2007) ‘METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments’, Proceedings of the Workshop on Statistical Machine Translation, June, Prague, Czech Republic, 228–231.
Lommel, A., H. Uszkoreit and A. Burchardt (2014) ‘Multidimensional Quality Metrics (MQM): A framework for declaring and describing translation quality metrics’, Revista tradumàtica, 12, 455–463.
Moorkens, J., S. O’Brien, I. da Silva, N. de Lima Fonseca and F. Alves (2015) ‘Correlations of perceived post-editing effort with measurements of actual effort’, Machine Translation 29(3), 267–284.
Moorkens, J., S. O’Brien and J. Vreeke (2016) ‘Developing and testing Kanjingo: A mobile app for post-editing’, Revista tradumàtica, 14, 58–66.
Moorkens, J. and A. Way (2016) ‘Comparing translator acceptability of TM and SMT outputs’, Baltic Journal of Modern Computing, 4(2), 141–151.
Moorkens, J. and S. O’Brien (2017) ‘Assessing user interface needs of post-editors of machine translation’, in D. Kenny (ed.) Human issues in translation technology, London: Routledge, 109–130.
Nielsen, J. (1993) Usability engineering. Amsterdam: Morgan Kaufmann.
Nießen, S., F. Och, G. Leusch and H. Ney (2000) ‘An evaluation tool for machine translation: Fast evaluation for MT research’, Proceedings of the Second International Conference on Language Resources and Evaluation, 31 May–2 June 2000, Athens, 39–45.
Nord, C. (1997) Translation as a purposeful activity: Functionalist approaches explained. Manchester: St. Jerome Publishing.
O’Brien, S. (2008) ‘Processing fuzzy matches in translation memory tools: An eye-tracking analysis’, in S. Göpferich, I. Mees and A. Jakobsen (eds) Copenhagen studies in language 36, Copenhagen: Samfundslitteratur, 79–102.
O’Brien, S., M. O’Hagan and M. Flanagan (2010) ‘Keeping an eye on the UI design of Translation Memory: How do translators use the Concordance feature?’, Proceedings of the 28th Annual European Conference on Cognitive Ergonomics, ACM, 187–190.
O’Brien, S. (ed.) (2011) Cognitive explorations of translation. London: Bloomsbury.
O’Brien, S. (2012) ‘Translation as human-computer interaction’, Translation Spaces 1(1), 101–122.
O’Brien, S., M. Simard and L. Specia (eds) (2013) ‘Workshop on post-editing technology and practice’, Proceedings of Machine Translation Summit XIV, Nice.
O’Brien, S., L. Balling, M. Carl, M. Simard and L. Specia (eds) (2014) Post-editing of machine translation: Processes and applications, Newcastle-Upon-Tyne: Cambridge Scholars Publishing.
O’Hagan, M. (2013) ‘The impact of new technologies on translation studies: A technological turn’, in C. Millan-Varela and F. Bartrina (eds), The Routledge handbook of Translation Studies, London, UK: Routledge, 503–518.
Padó, S., D. Cer, M. Galley, D. Jurafsky and C. Manning (2009) ‘Measuring machine translation quality as semantic equivalence: A metric based on entailment features’, Machine Translation, 23(2–3), 181–193.
Papineni, K., S. Roukos, T. Ward and W. Zhu (2002) ‘BLEU: A method for automatic evaluation of machine translation’, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, 311–318.
Plitt, M. and F. Masselot (2010) ‘A productivity test of statistical machine translation post-editing in a typical localisation context’, The Prague Bulletin of Mathematical Linguistics, 93, 7–16.
Popović, M. (2015) ‘ChrF: Character n-gram F-score for automatic MT evaluation’, Proceedings of the 10th Workshop on Statistical Machine Translation, 17–18 September 2015, Lisbon, 392–395.
Popović, M. and H. Ney (2009) ‘Syntax-oriented evaluation measures for machine translation output’, Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, 29–32.
Reeder, F. (2004) ‘Investigation of intelligibility judgments’, Proceedings of the 6th Conference of the Association for MT in the Americas, AMTA 2004, Springer, Heidelberg, 227–235.
Rossi, P., M. Lipsey and H. Freeman (2004) Evaluation: A systematic approach, 7th ed., Thousand Oaks, CA: Sage.
Roturier, J. (2006) An investigation into the impact of controlled English rules on the comprehensibility, usefulness, and acceptability of machine-translated technical documentation for French and German users, Doctoral Thesis, Dublin City University, Ireland.
Roturier, J. (2009) ‘Deploying novel MT technology to raise the bar for quality: A review of key advantages and challenges’, Proceedings of the Twelfth Machine Translation Summit, Ottawa, Canada, August. International Association for Machine Translation.
Saldanha, G. and S. O’Brien (2014) Research methodologies in translation studies. London: Routledge.
Snover, M., B. Dorr, R. Schwartz, L. Micciulla and J. Makhoul (2006) ‘A study of translation edit rate with targeted human annotation’, Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, 8–12 August 2006, Cambridge, 223–231.
Specia, L. (2011) ‘Exploiting objective annotations for measuring translation post-editing effort’, Proceedings of the Fifteenth Annual Conference of the European Association for Machine Translation, 30–31 May, Leuven, 73–80.
Stymne, S., H. Danielsson, S. Bremin, H. Hu, J. Karlsson, A. Lillkull and M. Wester (2012) ‘Eye-tracking as a tool for machine translation error analysis’, Proceedings of the Eighth International Conference on Language Resources and Evaluation, 23–25 May 2012, Istanbul, 1121–1126.
Stymne, S., J. Tiedemann, C. Hardmeier and J. Nivre (2013) ‘Statistical machine translation with readability constraints’, Proceedings of the 19th Nordic Conference of Computational Linguistics, 22–24 May 2013, Oslo, 375–386.
Tatsumi, M. (2010) Post-editing machine translated text in a commercial setting: Observation and statistical analysis, Doctoral Thesis, Dublin City University, Ireland.
Torres-Hostench, O., J. Moorkens, S. O’Brien and J. Vreeke (2017) ‘Testing interaction with a mobile MT post-editing app’, Translation & Interpreting, 9(2), 138–150.
Turchi, M., M. Negri and M. Federico (2014) ‘Data-driven annotation of binary MT quality estimation corpora based on human post-editions’, Machine Translation 28(3–4), 281–308.
Turian, J., L. Shea and I. Melamed (2003) ‘Evaluation of machine translation and its evaluation’, Proceedings of MT Summit IX, New Orleans, 386–393.
Uszkoreit, H. and A. Lommel (2013) ‘Multidimensional Quality Metrics: A new unified paradigm for human and machine translation quality assessment’, paper presented at Localization World, London, 12–14 June 2013.
Way, A. (2018) ‘Quality expectations of machine translation’, in J. Moorkens, S. Castilho, F. Gaspari and S. Doherty (eds), Translation quality assessment: From principles to practice. Cham, Switzerland: Springer, 159–178.

[i] http://producthelp.sdl.com/SDL_TMS_2011/en/Creating_and_Maintaining_Organizations/Managing_QA_Models/LISA_QA_Model.htm
[ii] A detailed discussion of historical and current approaches to the evaluation of translation quality can also be found in the list of further reading.
[iii] http://www.qt21.eu/quality-metrics/
[iv] https://www.taus.net/evaluate/dqf-background
[v] https://www.iso.org/standard/59149.html
[vi] http://www.kantanmt.com/overview-measure.php
[vii] http://www.statmt.org/wmt18/
[viii] http://www.mt-archive.info/
[ix] https://www.gala-global.org/
