Three Directions for the Design of Human-Centered Machine Translation

Samantha Robertson (University of California, Berkeley)
Wesley Hanwen Deng (University of California, Berkeley)
Timnit Gebru (Black in AI)
Margaret Mitchell
Daniel J. Liebling (Google)
Michal Lahav (Google)
Katherine Heller (Google)
Mark Díaz (Google)
Samy Bengio (Google)
Niloufar Salehi (University of California, Berkeley)

Abstract

As people all over the world adopt machine translation (MT) to communicate across languages, there is increased need for affordances that aid users in understanding when to rely on automated translations. Identifying the information and interactions that will most help users meet their translation needs is an open area of research at the intersection of Human-Computer Interaction (HCI) and Natural Language Processing (NLP). This paper advances work in this area by drawing on a survey of users’ strategies in assessing translations. We identify three directions for the design of translation systems that support more reliable and effective use of machine translation: helping users craft good inputs, helping users understand translations, and expanding interactivity and adaptivity. We describe how these can be introduced in current MT systems and highlight open questions for HCI and NLP research.

1 Introduction

Users often do not have enough information about the capabilities and limitations of machine learning models in order to use them effectively, safely, and reliably (Green and Chen, 2019; Pasquale, 2015). One approach to this problem is to provide additional information to users, such as explanations of model behavior (Liao et al., 2020; Lai and Tan, 2019; Rader et al., 2018). However, it is not clear what kinds of information or interactions would best support user needs (Doshi-Velez and Kim, 2017; Miller, 2019). For instance, users may be more invested in knowing when they can rely on an AI system and when it may be making a mistake, than they are in understanding its inner workings.

Most consumer-facing translation systems do not provide support for safe and reliable use of machine translation. Despite some evidence of corpus-level human parity, individual translations still vary substantially in quality (Toral et al., 2018). Further, it is especially difficult for users to assess the quality of a translation because they often do not understand the source language, target language, or both. Mistranslations are frustrating for users (Hara and Iqbal, 2015), contribute to social and economic isolation (Liebling et al., 2020), and have even prompted human rights violations, such as the wrongful arrest of a man whose social media post was mistranslated from “good morning” in Arabic to “attack them” in Hebrew (Berger, 2017). Nevertheless, people and organizations continue to rely on machine translation, even in high-stakes contexts like immigration (Torbati, 2019), and content moderation (Stecklow, 2018). To minimize the potential for harm from mistranslations, it is important that users are able to understand and account for the limitations of these systems.

In this paper, we explore what information and interactions can assist users of machine translation systems. Prior work has developed models to estimate the quality of a translation (Blatz et al., 2004; Specia et al., 2018). However, “translation quality” is a complex, nuanced, and context-dependent concept (Specia et al., 2013). It remains unclear when, how, and what kind of quality measures might be intelligible and actionable to users in different situations. We utilize HCI methods to understand what users need to know about translations in order to meet their goals. Our findings can inform the design of systems that support more effective use of machine translation.

First, we summarize key findings from a pilot survey of users’ practices for evaluating machine translations. Informed by the survey findings, we propose three directions for the design of machine translation systems: 1) help users craft good inputs; 2) help users understand translations; and 3) adapt to context and feedback. To begin exploring these directions for design, we utilize a chatbot prototype that supports interlingual conversation mediated by machine translation (Figure 1).

Figure 1: TranslatorBot mediates interlingual dialog using machine translation. The system provides extra support for users, for example, by suggesting simpler input text.

2 Related Work

Increasingly, people use machine translation (MT) systems, such as Google Translate, to communicate across languages, from interacting with social media posts (Lim and Fussell, 2017) to negotiating employment or medical care (Liebling et al., 2020). Here, we focus on MT systems used by consumers, rather than systems designed for professional translators. These systems promise ease of access and improving quality (King, 2019), but it remains unclear how effectively these systems meet users’ needs, particularly in transactional and conversational situations (Liebling et al., 2020).

Studies have shown that machine translation can improve communication in multilingual groups or teams (Wang et al., 2013; Lim and Yang, 2008; Calefato et al., 2016), but significant challenges remain. Poor quality translations can lead to frustration and conversational breakdowns (Yamashita and Ishida, 2006; Hara and Iqbal, 2015), and are detrimental to social and economic life for immigrants and people living and working in places where they do not know the dominant language (Liebling et al., 2020).

Researchers have found that people adapt to the limitations of translation models as they use a system, for example, by repeating, rephrasing, or supplementing information (Ogura et al., 2005; Hara and Iqbal, 2015). In some cases, users may treat a machine translation as a preliminary or gist translation, supplementing it with professional translations when needed (Nurminen and Papula, 2018). However, these strategies require users to assess the translation for errors and respond in a way that improves the translation output (Bowker and Ciro, 2019; King, 2019), both of which are challenging with affordances of existing systems.

One way to help users identify communication breakdowns and initiate repairs is to design systems that provide additional guidance and information. Researchers have proposed showing users alternative translations at the word or sentence level (Gao et al., 2015; Coppers et al., 2018), automatically providing back-translation (Shigenobu, 2007), displaying the estimated sentiment of a translation (Lim et al., 2018), or highlighting keywords (Gao et al., 2013). Experiments show that these approaches can improve message clarity, comprehension, and sense-making (Gao et al., 2013, 2015; Lim et al., 2018, 2019). Other research systems have attempted to detect cross-cultural translation issues using machine learning (Pituxcoosuvarn et al., 2020). Chatbots have also leveraged conversational content to translate (Hecht et al., 2012) or aid language learning (Cai et al., 2015). We build on this work and ask how we can guide users to identify when to rely on automated translations and when and how to make changes.

Researchers in NLP have developed methods to quantitatively evaluate MT models by comparing machine translations to reference translations (e.g. BLEU (Papineni et al., 2001)) or by Quality Estimation methods (QE). QE uses noisy parallel corpora annotated with quality information to estimate quality scores or classifications based on linguistic and, more recently, model features (Blatz et al., 2004; Specia et al., 2020; Callison-Burch et al., 2012; Specia et al., 2018). Because QE does not require reference translations, these techniques could provide run-time quality estimates to end users. One challenge in QE is that evaluating translations is extremely complex and task-dependent (Banerjee and Lavie, 2005; Denkowski and Lavie, 2010). Even a highly accurate confidence score or quality classification will never capture all the nuances of translation quality (King et al., 2003; Specia et al., 2013). One approach is to develop more complex QE models that combine multiple aspects of translation quality (Kane et al., 2020; Yuan and Sharoff, 2020). In this work we offer an alternative, human-centered approach: we begin by understanding users’ needs and goals, and then explore what kind of quality information might be intelligible and useful to them.

[...] participants worked in research or the technology sector. Our results provide insight into the MT usage of these participants, and these results are strengthened by the wide range of languages that participants knew. Future work is needed to understand the needs and practices of other users, such as those with less familiarity with technology and those who are not fluent in high resource languages.

3.1 Results: Strategies for Evaluating Translation Quality

Consistent with prior work (Hara and Iqbal, 2015; Liebling et al., 2020), we found that poor quality translations are a problem for users of machine translation: 93% of respondents experienced poor quality translations in one or more contexts. Problems were especially pronounced in online contexts such as social media. Participants identified a number of known weaknesses of MT, including
