Citation: Doherty, S. (2019). Translation technology evaluation research. In M. O'Hagan (Ed.), The Routledge handbook of translation and technology. Abingdon, Oxon: Routledge. Publisher link: https://www.routledge.com/The-Routledge-Handbook-of-Translation-and-Technology-1st-Edition/OHagan/p/book/9781138232846

20. Translation technology evaluation research

Stephen Doherty

Abstract

Central to the successful research, development and application of translation technology is its evaluation. Accordingly, there has been a growing demand to evaluate translation technologies based on the systematic and rigorous application of scientific methods to assess their design, development and outcomes. The diverse needs of different stakeholders have led to a proliferation of evaluation methods and measures involving human- and machine-based approaches that are used across academia and industry research contexts. The chapter provides a critical description of contemporary direct and indirect translation technology evaluation research in order to identify trends, emerging issues, and potential solutions. It traces advances made in the field to date and indicates remaining gaps calling for further work. The author anticipates that future research in the field will deliver more standardized and inclusive approaches towards more meaningful and nuanced evaluation indicators for the different stakeholders involved in the process and products of translation.

Keywords: technology evaluation, human-based evaluation, automatic evaluation metrics, translation quality

Introduction

Contemporary translation research and practice require fast and resource-efficient solutions that can scale to high volumes, a growing number of language pairs, an expanding diversity of genres and text types, and a multitude of stakeholder requirements. Translation technology, particularly machine translation (MT) and computer-assisted translation (CAT) tools, offers such a solution and has become central to the process of translation and, by extension, to most aspects of interlingual communication and the global language services industry, in which research and industry practice are intertwined (see O'Hagan 2013, Doherty 2016).

Stakeholders wishing to evaluate translation technology range from researchers and developers to suppliers, providers and buyers of translation products and services, and end-users. The need to evaluate translation technology is common to all of these stakeholders: the primary aim remains the evaluation of the effectiveness and efficiency of a technology within the specified task that it is designed to perform, yet the context, purpose, design, and output of an evaluation often take on vastly different forms. This diversity of stakeholder needs has led to a proliferation of evaluation methods and measures involving human- and machine-based approaches across research and applied industry research settings.

The need to evaluate is, of course, not limited to translation technology; it is of primary concern to all technologies that we use in many aspects of our professional and personal lives. Evaluation in a universal sense is typically defined as a systematic, rigorous, and meticulous application of the scientific method to assess the design, development, and outcomes of a specified tool, process, etc. (e.g. Rossi, Lipsey and Freeman 2004), and this definition rings true for translation technology.
Building upon the detailed descriptions of translation technology and its usage provided in this volume, the current chapter will provide a critical description of contemporary direct and indirect translation technology evaluation research in order to identify trends, emerging issues, and potential solutions. In doing so, it draws a parallel between the evaluation of the products and the processes of translation technology and calls for a more systematic approach to evaluation research in which standardization and universals are foregrounded.

The development of translation technology evaluation research

The development of translation technology is arguably inseparable from its evaluation: after all, how can progress be made if it cannot be identified as such? Indeed, past and current research and development into MT and CAT tools has been carried out in tandem with evaluation, with the findings from the latter feeding directly and indirectly into the former. As translation technology becomes more sophisticated, so too do the methods and measures by which it is evaluated. I distinguish here between an evaluation research method and a measure, where a method is the overall approach to the examination of a specific phenomenon, e.g., the quality of MT output, while a measure is an individual instrument that ascertains a quantifiable or qualifiable constructed unit of that phenomenon, e.g., the notion of accuracy (see, for example, Saldanha and O'Brien 2013).

Early evaluation research of the 20th century took the form of manual human evaluation of the output of the respective translation technology, where the focus was, and continues to be, the examination of the quality of the output using criteria developed from the disciplines of Linguistics and Translation Studies, e.g., accuracy and fluency. In addition, a range of computational metrics emerged from the Computational Linguistics and Computer Science research communities to provide an alternative to manual human evaluations, e.g., recall and precision (see Drugan 2013, House 2015, Chunyu and Wong Tak-Ming 2014, Castilho, Doherty, Gaspari and Moorkens 2018). However, as a result of the booming localization industry in the 1990s, a shift to error-based evaluations became apparent, in which evaluators count the number and nature of the errors in the output (e.g. the Localization Industry Standards Association Quality Assessment Model). Since then, error-based metrics and Likert scales have become commonplace in evaluation research as they allow human evaluators to identify the strengths and weaknesses of a technology at varying degrees of granularity, from word-level to document- and system-level, and for a variety of different purposes and contexts (Castilho et al. 2018).
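To make the distinction between these families of measures concrete, the following minimal Python sketch illustrates, first, a simple word-level (unigram) precision and recall calculation of an MT hypothesis against a single reference translation and, second, a weighted error-density score of the general kind produced by error-count-based human evaluation. The function names, error categories, severity weights and per-1,000-word normalization are illustrative assumptions only; they do not reproduce BLEU, METEOR, the LISA QA Model or any other specific metric or framework.

from collections import Counter


def unigram_precision_recall(hypothesis: str, reference: str) -> tuple[float, float]:
    """Word-level precision and recall of an MT hypothesis against one reference.

    Counts are clipped so that a repeated hypothesis word is not rewarded
    more often than it appears in the reference. Simplified illustration only.
    """
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Overlap: each hypothesis word counts only up to its frequency in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


# Hypothetical severity weights for an error-count-based evaluation;
# real industry typologies define their own categories and weights.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def weighted_error_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """Weighted error density per 1,000 words.

    `errors` is a list of (category, severity) annotations produced by a
    human evaluator for a given text of `word_count` words.
    """
    penalty = sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)
    return 1000 * penalty / max(word_count, 1)


if __name__ == "__main__":
    p, r = unigram_precision_recall(
        "the contract was signed by both parties",
        "the agreement was signed by the two parties",
    )
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.71 recall=0.62

    annotations = [("terminology", "major"), ("fluency", "minor")]
    print(f"error score per 1,000 words: {weighted_error_score(annotations, word_count=250):.1f}")

In practice, automatic metrics operate over n-grams and often multiple references, and industry error typologies define their own categories, severities and pass thresholds, but the sketch suffices to show how machine-computed and human-annotated measures quantify output quality in different ways.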
Subsequent research in the early 2000s then moved to a more holistic approach, which allowed for the combination of methods and the incorporation of process-oriented data from users, e.g., translators and post-editors, and from end-users of the respective technology. While more recent research in the 2010s has begun to investigate the underlying cognitive processes involved in using translation technology, the focus remains on the product, given the traditional role of translation as a purposeful (Nord 1997) and typically economic activity.

Central to the evaluation of any translation technology is the evaluation of the quality of its output. While this chapter does not attempt to discuss translation quality in its own right (Doherty 2017, Castilho et al. 2018; see also Pym in this volume), it relates the evaluation of quality to the wider evaluation of translation technology, given its dominance in research on the current topic (see Bowker in this volume). In this context, there is an apparent divergence between academic evaluation research and its industry counterpart. In industry contexts, the aim of assessing quality as a means to evaluate a translation technology is to verify that a specific level of quality is reached in accordance with client specifications as well as sector and compliance requirements. In contrast, academic evaluation research typically gives focus to the identification and measurement of changes in a system and/or its output. Indeed, as Drugan (2013) points out, academic and industry approaches in this context pursue different questions and goals, which Lommel and colleagues see as limiting the comparisons that can be made between the two (Lommel, Uszkoreit and Burchardt 2014). On the other hand, several recent initiatives have pushed for a unified approach to find agreement between stakeholders and their evaluation methods and measures (Koby, Fields, Hague, Lommel and Melby 2014), including the Defense Advanced Research Projects Agency evaluations (White and O'Connell 1996), the Framework for the Evaluation of Machine Translation (Hovy et al. 2002), the Multidimensional Quality Metrics (MQM) framework proposed by Lommel and colleagues (2014) as part of the European Commission-funded QT21 project, the TAUS Dynamic Quality Framework, and initiatives by the International Organization for Standardization (ISO) (see Wright in this volume). I support the spirit of these initiatives and contend that there are more similarities than differences between the evaluation methods used in academic and industry research and practice. While the purpose and context of the evaluation may indeed differ, the evaluation process is inherently the same in that the evaluator needs to align the purpose of the evaluation with the available resources and methods, and with the desired format of the evaluation's results.

In both academic and industry contexts, evaluation methods typically employ human-based and/or machine-based linguistic evaluations. I will first concisely describe human-based linguistic methods in the combinations in which they are most commonly employed, then turn to evaluations concerning usability, before describing the machine-based evaluation methods of automatic evaluation metrics. Table 1 provides an overview of all of the methods, where, based on the description of MT evaluation by Humphreys and