Crowdsourced Monolingual Translation

CHANG HU, PHILIP RESNIK, and BENJAMIN B. BEDERSON, University of Maryland

An enormous potential exists for solving certain classes of computational problems through rich collaboration among crowds of humans supported by computers. Solutions to these problems used to involve human professionals, who are expensive to hire or difficult to find. Despite significant advances, fully automatic systems still have much room for improvement. Recent research has involved recruiting large crowds of skilled humans ("crowdsourcing"), but crowdsourcing solutions are still restricted by the availability of those skilled human participants. With translation, for example, professional translators incur a high cost and are not always available; machine translation systems have improved greatly in recent years but still provide only passable translation; and crowdsourced translation is limited by the availability of bilingual humans. This article describes crowdsourced monolingual translation, in which translation is performed by monolingual people. Crowdsourced monolingual translation is a collaborative form of translation performed by two crowds of people who speak the source or the target language, respectively, with machine translation as the mediating device. This article describes a general protocol for crowdsourced monolingual translation and analyzes three systems that implemented the protocol. These systems were studied in various settings and were found to provide significant improvements in quality over both machine translation and monolingual editing of machine translation output ("postediting").

Categories and Subject Descriptors: H.5.m [Human-Computer Interaction]: Miscellaneous

General Terms: Human Computer Interaction

Additional Key Words and Phrases: Crowdsourcing, translation, human computation

ACM Reference Format:
Chang Hu, Philip Resnik, and Benjamin B. Bederson. 2014. Crowdsourced monolingual translation. ACM Trans. Comput.-Hum. Interact. 21, 4, Article 22 (June 2014), 35 pages.
DOI: http://dx.doi.org/10.1145/2627751

1. INTRODUCTION

Translation is becoming more and more important with the growing diversity of languages in which information is being distributed around the world [Danet and Herring 2007]. Although recent advances in statistical machine translation have improved machine translation greatly [Lopez 2008], use of bilingual humans, especially professional translators, is still the norm whenever high-quality results, or text that represents all of the original meaning and is fluent in the target language, are required. However, solutions using bilingual humans often suffer from the bilinguals' relative scarcity. Although most of the world's population is bilingual [Edwards 1994], the number of language pairs spoken by bilingual people is still relatively small compared to all the possible pairs.

This research was supported by NSF contract #BCS0941455 and by a Research Award.
Authors' addresses: resnik/[email protected], UMIACS, University of Maryland, College Park, MD 20742; [email protected], Microsoft, Redmond, WA 98052.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2014 ACM 1073-0516/2014/06-ART22 $15.00
DOI: http://dx.doi.org/10.1145/2627751


This article presents crowdsourced monolingual translation, a new translation protocol that supports collaboration between two crowds of people who only speak the source or the target language, respectively. This protocol uses machine translation as an initial pass and creates a feedback loop between the source and the target language speakers, who iteratively add redundancy to the translation so machine translation errors become easier to correct. The feedback loop supports communication between the source and the target language speakers through two channels: a machine translation channel and an annotation channel. The iteration between these two crowds of monolingual people ends when there is evidence that the process has reached a sufficiently accurate and readable translation.

We explored various implementations of crowdsourced monolingual translation [Hu et al. 2010, 2011, 2012]. Implementing the protocol posed unique HCI challenges, such as supporting different modes of computer-supported interaction between groups of monolingual people [Hu et al. 2011], choosing a suitable task size for them [Hu et al. 2012], and assigning those tasks automatically [Hu et al. 2012]. In this article, we summarize the protocol behind these systems (Section 5), the evaluation of the systems' translation quality (Section 7), and studies on the protocol's effectiveness (Section 8). Section 7 is a revision of material previously published in Hu et al. [2010], Hu et al. [2011], and Hu et al. [2012]; the other sections contain materials not published before.

2. MOTIVATION

Professional translators can provide translation with the highest quality. However, professional translators usually incur a nontrivial cost, if they are available at all. On the other hand, machine translation provides a fast and low-cost alternative, but current machine translation systems are still insufficient for high-quality applications.

Crowdsourcing has become a common way for researchers to recruit a crowd of skilled humans as workers to solve computationally hard problems [Surowiecki 2005]. Crowdsourcing brings good, economical solutions to some problems, such as image recognition [Von Ahn and Dabbish 2004; Bigham et al. 2010] and automatic summarization [Bernstein et al. 2010]. However, most current crowdsourcing solutions are limited by their focus on fully skilled humans, in the sense of individual workers who can solve the problem all by themselves. While some crowdsourcing solutions do have different roles for workers [Bernstein et al. 2010; Little et al. 2010], they do not differentiate which worker takes which role (i.e., the workers are still presumed to have all the skills to take any role). For translation in particular, following some successful first attempts to crowdsource translation [Kittur 2010], current crowdsourced translation systems use bilingual humans as workers [Zaidan and Callison-Burch 2011]. Therefore, these crowdsourced translation systems are still limited by the availability of bilingual humans.

Our work is also motivated by our own International Children's Digital Library (ICDL), which has about 5,000 children's books in 61 languages.1 Part of the ICDL's mission is to "have every culture and language represented so that every child can know and appreciate the riches of children's literature from the world community." Prior to this work, a group of over 2,000 volunteer translators already existed on the ICDL. These volunteer translators were bilingual ICDL users who self-selected to translate children's books on the ICDL. Even with these volunteer translators, it is still difficult to translate between uncommon language pairs (e.g., between Croatian and Mongolian) or to translate most of the ICDL's books into most of the 61 languages.2

1 The ICDL is at http://childrenslibrary.org.
2 Needs for translation between these two languages do exist. In the International Children's Digital Library, there are books in Croatian, and the library has a sizable Mongolian-speaking user population.


Crowdsourced monolingual translation is a form of "human computation" as defined by Quinn and Bederson [2009]. It aims to solve these translation problems by expanding translation to monolingual people who would like to help but do not speak two languages.

3. SUMMARY OF CONTRIBUTIONS

Crowdsourced monolingual translation creates a feedback loop between the source language and the target language speakers. This feedback loop includes not only machine translation but also language-independent annotations that enhance communication between two monolingual crowds by decreasing the dependency on unreliable machine translation. Through the feedback loop, the source language speakers and the target language speakers are engaged in a kind of dialogue during which they collaboratively improve translation quality. Our research hypotheses are:

(1) H1: Crowdsourced monolingual translation performs better than machine translation and monolingual postediting in terms of translation quality.
(2) H2: The feedback loop improves translation quality compared to both machine translation and editing of machine translation output.
(3) H3: Overall, using annotations during the translation process improves translation quality; each type of annotation also improves translation quality.3

Later in this article, we offer evidence suggesting that H1 is true. A system (MonoTrans2) that implemented the crowdsourced monolingual translation protocol performed significantly better than monolingual postediting, with a 0.30-point average improvement (p < 0.001) on a 5-point accuracy scale (Section 7.2). Systems that implemented crowdsourced monolingual translation also performed significantly better than pure machine translation. In one experiment, the percentage of high-quality sentences improved from 10% to 68%.

Studies did not confirm H2 with statistical significance due to the difficulty of recovering the feedback process from data collected with systems that used the asynchronous interaction model.4 However, preliminary analysis (Section 8.2) still showed that the feedback loop improves translation quality if it forms a successful clarification dialogue. Regarding H3, our analysis (Section 8.3) confirmed that using annotations during the translation process improves translation quality.

Supporting monolingual humans in performing translation provides the research community with a new alternative to translation with bilingual humans or machine translation. It also opens up the opportunity to develop new crowdsourcing systems in which unskilled humans collaborate with each other through computer interfaces.

4. RELATED WORK

Translation is a complex and creative task that traditionally involves only professional translators or bilingual people who are specifically trained for the job. Although professional translators can provide the highest-quality translation, they are usually expensive to hire. For some language pairs, no professional translators or even amateur bilingual people are readily available.

There are attempts to extend translation beyond what is provided by professional translators, and the most successful ones involve statistical machine translation [Lopez 2008]. In the past decade, statistical machine translation systems have reached usable

3 More discussion about annotation types is in Section 5.
4 See Hu et al. [2011] for more details about the asynchronous interaction model.


Table I. Children’s Stories and Translation Cost

quality between many language pairs, and some general-purpose systems have become publicly available.

However, machine translation has yet to reach the high-quality output typically provided by professional translators or even amateur bilinguals. To bridge this gap, several other attempts have been made to either combine humans with machine translation [Kay 1997; Callison-Burch 2005] or, more recently, to use a crowd of amateur bilingual people to do translation [Zaidan and Callison-Burch 2011], as explained later.

4.1. Professional Translators

The highest-quality translation is usually provided by professional translators. Professional translators are not simply bilinguals, but typically have years of training and practice. As a result, professional translation usually requires payment of a substantial fee. In addition, professional translation services usually have a minimum order size. For example, translating from English into Spanish would cost $0.10 per word, but translating two six-word English sentences into Spanish costs $45, as a result of the minimum order size.5 An estimation of typical translation cost (at $0.10 per word) is given in Table I.6

Professional translation also takes time. A typical professional translator's speed is 1,000 to 2,500 words per day.7 Some translation companies provide online services with a shorter turnaround time.8 However, no professional translation service can be instantaneous.

4.2. Machine Translation

Researchers have been trying to extend translation beyond professional translators for a very long time. During the last 20 years, a revolution has taken place in computational research on translation: machine translation systems that used to rely on human knowledge about grammar and meaning provided by language experts have been replaced by systems that learn statistical models from large collections of translated text [Lopez 2008].

This change in approach has made it possible to translate unrestricted input from a far broader spectrum of languages. With statistical machine translation systems, usable translation quality can now be obtained between a number of language pairs, and there are online machine translation systems that offer fast and free general-purpose translation [Hutchins 2005; Hampshire and Porta Salvia 2010]. Machine translation engines can also be built relatively quickly between language pairs not previously available (e.g., between "surprise languages" [Oard 2003]).

However, machine translation has drawbacks too. While "usable" quality makes machine translation very helpful for casual translation needs (e.g., finding the correct

5 The quotes were taken from SDL's translation service, at http://www.click2translate.com/quote/quick_quote.asp.
6 Sources: Edwardes, M., Taylor, E., trans. (1905). Grimm's Fairy Tales. New York: Maynard, Merrill, & Co. Rowling, J. K. (1999). Harry Potter and the Sorcerer's Stone. New York: Scholastic.
7 This information was taken from Translia, at http://www.translia.com/translation_agencies/.
8 For example, myGengo claims to have a 4-hour turnaround time for texts shorter than 250 words; information taken from http://gengo.com.

link on a web page, getting the gist of a news story, etc.), it is not sufficient in situations where high-quality translation is preferred (e.g., literary translation, legal translation). Translation systems can be built quickly [Oard 2003; Lewis 2010]. However, for most language pairs not involving English, systems often still lack even the most basic ability to create comprehensible translations that preserve basic meaning.

4.3. Crowdsourced Translation

Recently, there has been some success in using a crowd of untrained bilinguals in place of professional translators [Zaidan and Callison-Burch 2011]. This approach is called crowdsourced translation. Crowdsourced translation organizes translation among a crowd of bilingual people by assigning each one of them a small piece of text. Compared to professional translators, crowdsourced translation is considerably less expensive, and it can be significantly faster with good translation quality [Zaidan and Callison-Burch 2011]. Crowdsourced translation was quickly adopted by some websites. For example, Facebook9 and Twitter10 both use crowdsourced translation to translate their sites from English to many languages.

4.4. Translation with Monolingual People

The cost of crowdsourced translation is lower than that of professional translators, but crowdsourced translation still uses bilingual humans. The biggest drawback to translation by bilingual humans is that bilinguals are relatively scarce compared to the vast number of possible source and target language pairs. Finding translators for common language pairs (e.g., Spanish-English) may be possible, but finding even an amateur translator becomes difficult when just one of the languages involved is less common. For example, an English-to-Haitian-Creole translator is not easy to find.11 Consider, then, how difficult it is to find bilingual people to translate between two less common languages, for example, between Croatian and Mongolian.

To avoid the bilingual availability problem, researchers have made multiple attempts at performing translation with monolingual people. Translating with monolingual people offers the potential to recruit from a much larger population. Just to illustrate the difference in scale between bilingual and monolingual users: while Wikipedia currently has about 1,600,000 active English-speaking contributors,12 there are fewer than 1,500 users who have self-identified as (even amateur) translators.13

Monolingual postediting of machine translation output is a natural extension to machine translation. Postediting is usually done with bilingual humans [Allen 2003]. The bilingual posteditors are given machine translation output, and they revise it with the original text as a reference. Monolingual postediting replaces the bilingual posteditors with monolingual posteditors. The difference is that monolingual posteditors do not have access to the source sentence and thus must infer the original meaning. Koehn showed that monolingual postediting alone, even without knowledge of the source language, can improve translation quality [Koehn 2010]. Compared to bilingual postediting, monolingual postediting does not require bilingual skills and is thus more scalable. It may also be less expensive because monolingual skills are less scarce.

9 http://developers.facebook.com/docs/internationalization/#translate.
10 https://translate.twitter.com/.
11 Drawing from our own experience, which is discussed in Hu et al. [2011].
12 http://stats.wikimedia.org/EN/TablesWikipediansContributors.htm.
13 http://en.wikipedia.org/wiki/Category:Available_translators_in_Wikipedia.


The Linear B system [Callison-Burch 2005] introduces tighter coupling between monolingual humans and machine translation. With the Linear B system, monolingual target language speakers are given segments in the target language proposed by the machine translation engine, and their task is to choose and concatenate the segments into a fluent target sentence. Experiments with the Linear B system showed that it improved machine translation quality without bilingual people.

Monolingual postediting and the Linear B system use only monolingual target language speakers. There are also systems that involve both target language speakers and source language speakers. Lemmatic machine translation [Soderland et al. 2009; Everitt et al. 2010] integrates machine translation with monolingual human editing in both rephrasing the source text (encoding) and inferring the translation (decoding). However, as its name suggests, source language speakers can only use word sequences that are already contained in the translation vocabulary. This design is consistent with the system's focus on using humans to help the machine obtain a passable translation. However, with this design, the system "will not yield fluent, grammatical sentences in the target language" [Soderland et al. 2009].

Morita and Ishida [2009, 2011] independently proposed the closest system to our approach. As part of the Language Grid project [Ishida 2006], their system is a monolingual translation system (referred to as "the Morita system" hereafter) that includes a monolingual source language speaker and a target language speaker. The Morita system uses a two-phase, bidirectional communication protocol between a monolingual source language speaker and a monolingual target language speaker. During the first phase, the target language speaker repeatedly requests a sentence-level paraphrase from the source language speaker (who knows the original sentence); when the target language speaker feels that he or she has a good understanding of the original meaning, the protocol then enters the second phase, where the target language speaker repeatedly rephrases the translation, which is in turn back-translated and given to the source language speaker for confirmation. Morita and Ishida [2011] showed that this type of collaboration between the monolingual source language speaker and the monolingual target language speaker improved translation quality. Our crowdsourced monolingual translation protocol shares some significant elements in common with the Morita system, but it is significantly different both in the type of information exchanged between the source and the target language speakers and in the synchronicity of the interaction between them (as is described in Section 5).14 The crowdsourced monolingual translation protocol is also designed to work with crowds from the beginning, whereas the Morita system did not utilize crowdsourcing [Morita and Ishida 2011].

In addition to the systems that use monolingual humans without any knowledge of the source language, an interesting approach worth mentioning is Duolingo, a commercial system that uses language learners to translate online content.15 Instead of performing translation with monolingual people, Duolingo provides just-in-time training, turning monolingual people into bilinguals. Duolingo's approach is to offer its users language courses for free, and to require them to perform translation (and translation-related tasks) as a form of language-learning practice.
In terms of crowdsourcing, Duolingo solves the crowd availability problem by targeting a specific group (language learners) and by giving them the bilingual skills necessary to do the work. The crowdsourced monolingual translation protocol, on the other hand, aims to solve the same problem using a different approach: combining monolingual crowds

14 Our crowdsourced monolingual translation protocol was proposed independently in 2010 [Hu et al. 2010].
15 At http://duolingo.com/. Our work on crowdsourced monolingual translation was first published in April 2009. Duolingo was not open to testing until November 2011.

with different existing skills. In terms of generating the final translation output, Duolingo uses voting to select the best translation candidate (similar to some systems in this article). While voting is a standard method for aggregating output from human workers, it can be vulnerable to spamming or cheating, especially when the human workers have an external incentive. Duolingo's postvoting mechanism for combining translations is, to our knowledge, unpublished.

5. THE CROWDSOURCED MONOLINGUAL TRANSLATION PROTOCOL

Our crowdsourced monolingual translation protocol is based on the computationally supported collaboration between two groups of monolingual people, each group speaking only one of the two languages involved. These two groups of people are therefore effectively monolingual (although they may speak other languages), as opposed to bilingual translators who speak both the source language and the target language.

This protocol uses machine translation as an initial pass. After that, it creates a feedback loop between the source and the target language speakers, who iteratively add redundancy to the translation so machine translation errors become easier to correct. The feedback loop supports communication between the source and the target language speakers through two channels: a machine translation channel and an annotation channel. This process ends when there is evidence that it has reached a sufficiently accurate and fluent translation.

5.1. A Synthetic Example

The essentials of crowdsourced monolingual translation can be illustrated with a simple synthetic example. In this example, two crowds of monolingual people engage in a dialogue mediated by an automatic system to correct machine translation errors.16

In the example (see Figure 1), a sentence is being translated from Spanish to English. The two crowds are the source language (Spanish) speakers and the target language (English) speakers. For simplicity, we assume that only one member from each crowd is involved in the translation.17

The original sentence is "Todo el mundo ha oído la historia de Cenicienta. (Everybody has heard the story of Cinderella.)" First, it is translated by machine translation into English: "Everybody has heard the history of Cinderella," in which the Spanish word "historia" ("story") is mistranslated into "history." (The Spanish word "historia" means "story" in this context but can also mean "history.") The English speaker reads this translation and indicates that there may be a translation error around the word "history" by marking it as problematic in the English translation. The marking (which is in effect an annotation) on the word "history" is then projected back onto the original sentence and onto the corresponding word "historia." The Spanish speaker now sees the sentence with the projected error mark: "Todo el mundo ha oído la historia de Cenicienta. (Everybody has heard the story of Cinderella.)" He tries to explain the marked word by attaching to it a picture about stories (in this case a picture of a lady telling a story). The system attaches the same picture to the corresponding English translation "history." Since "history" was marked as a translation error and now has a picture annotation, the English speaker is able to infer its meaning and to edit it into the correct word, "story."

Finally, the corrected English translation is back-translated into Spanish. Because the back-translation's meaning matches that of the original sentence, the Spanish speaker judges that the translation process has reached the satisfactory goal state.18 By definition of the English speaker's task, this is the best edit she can make, so the English speaker would also stop once the translation has been accepted by the Spanish speaker.

16 This example is constructed to illustrate several aspects of the crowdsourced monolingual translation protocol. For each aspect, there are many similar examples during the actual translation process.
17 In practice, each task at each step could be carried out by a different member from the crowd.


Fig. 1. An example of the crowdsourced monolingual translation process. A solid arrow indicates a pass through machine translation and/or annotation projection in the system; a dotted arrow indicates editing and/or annotating performed by the monolingual people. Literal translation of the original Spanish sentence is shown. Note that the annotations attached to the phrases are projected to the other side through annotation projection.

5.2. People, Redundancy, and Monolingual Translation

Support for translation with monolingual people in existing systems (Linear B [Callison-Burch 2005] and the Morita system [Morita and Ishida 2011]) can be seen as adding monolingual people's knowledge as redundant information to the machine translation process to increase the likelihood that machine translation errors can be corrected. The importance of redundancy in linguistic communication is well established. Redundancy can be characterized as the quantity of information (measured in bits) used in transmitting a message over and above the number of bits in the message itself [Shannon 1948]. Languages contain a variety of phonological, syntactic, semantic, and pragmatic mechanisms that help the listener narrow the hypothesis space for the intended message via redundancy. One common illustration of such constraints involves rmvng ll th vwls frm th wrds nd shwng tht th rdr cn stll ndrstnd th sntnc [Shannon et al. 1950]. Lossless data compression of natural language relies on the fact that linguistic redundancy exists [Shannon 1951]. The recoverability of information conveyed over an unreliable channel is improved by increasing redundancy [Shannon 1948]. In linguistic communication, people are very good at recovering information with redundancy [Clark and Marshall 2002]. If we look at people who manage to communicate successfully in challenging circumstances—whether they are in a noisy bar, using a poor-quality cell phone connection, playing with a young child, talking to someone who does not speak their language very well, or writing to a foreign relative—we find that people adapt to all of these situations through a combination of linguistic constraints, world knowledge, shared context, and clarification requests. Therefore, it is reasonable to hypothesize that adding humans to the translation process can help improve translation quality by helping to recover the original meaning from unreliable machine translation, even if they do not know both languages.

18In this example, the back-translation is the original sentence, but the protocol only requires that their meanings match. In practice, if the meanings do not match, the translation process may not stop. Our systems included a mechanism to force a stop when necessary.


Fig. 2. An overview of the crowdsourced monolingual translation protocol showing two crowds of monolingual people and the information flow between them. This protocol takes a source sentence as the input. With the protocol, the source language speakers and the target language speakers communicate and collaborate via a feedback loop formed by machine translation and annotation projection. Once the stopping condition is reached, the protocol outputs the best translation collaboratively generated by the two crowds.

5.3. The Protocol as a Closed-Loop System

In control theory, a system whose output is measured and compared against the controlling reference (the goal toward which the system is controlled) is called a closed-loop system [Nise 1995]. The feedback loop in a closed-loop system is the path through which the output is fed back for the comparison. Compared to a system without the feedback loop (an "open-loop" system), a closed-loop system is less sensitive to internal and/or external disturbances ("noise") and has greater accuracy because of its self-correcting nature.

Translation systems that simply output a translation without checking its correctness against the source sentence are open-loop systems. Machine translation systems by themselves often belong to this category. Monolingual postediting [Allen 2003] and the Linear B system [Callison-Burch 2005] are also open-loop systems because translation produced by monolingual target language speakers is not related back to the source sentence within the systems. For open-loop translation systems, although the specific translation mechanisms can be improved to obtain high translation quality, errors produced within the system cannot be corrected.

Because machine translation systems have intrinsic noise (machine translation errors), our crowdsourced monolingual translation protocol (Figure 2) is designed to produce high-quality translation more reliably using a feedback loop.


The feedback loop is made of two channels: the machine translation channel and the annotation channel. The machine translation channel translates text in both directions, forming the base text onto which annotations can be added; the annotation channel, on the other hand, enables monolingual people to communicate and add redundant information in a language-independent manner.

In the feedback loop, the source sentence is first translated into the target language using machine translation. After the initial machine translation, the source language speakers and the target language speakers take turns editing or annotating the translation (or the corresponding back-translation). In this way, the source language speakers and the target language speakers are engaged in a dialogue that aims to improve the translation by correcting errors and adding redundant information so errors become easier to correct.

At each iteration of this process, monolingual people change the sentence in two ways: by editing and by annotating. Editing changes the text being translated (without changing the meaning); annotating adds information to the text. Editing and annotating on both sides may look similar to each other, but they are quite different conceptually, as is discussed in Section 5.4. In addition to changing the translation (or the corresponding back-translation), monolingual people have one other important task: to halt the process once a high-quality translation is obtained.
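To make the iteration concrete, the sketch below models one round trip of the feedback loop in Python. It is a minimal illustration under our own simplifying assumptions, not code from any MonoTrans implementation; the names machine_translate, target_task, and source_task are hypothetical placeholders for the machine translation channel, the target-side crowd task, and the source-side crowd task described above.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    span: str          # phrase the annotation is attached to
    kind: str          # e.g., "error-mark", "picture", "rephrase", "question"
    payload: str = ""  # e.g., a picture URL or a paraphrase

@dataclass
class TranslationState:
    source: str
    target: str = ""
    annotations: list = field(default_factory=list)
    accepted: bool = False

def feedback_loop(source_sentence, machine_translate, target_task, source_task,
                  max_rounds=10):
    """One possible rendering of the crowdsourced monolingual translation loop.

    machine_translate(text, direction) -- the MT channel (hypothetical signature).
    target_task(state) -- target language speakers postedit the translation and
        request annotations; returns (edited_target, annotation_requests).
    source_task(state, back_translation) -- source language speakers answer
        requests, paraphrase the source, and judge the back-translation;
        returns (paraphrased_source, new_annotations, accepted).
    """
    state = TranslationState(source=source_sentence)
    state.target = machine_translate(state.source, "src->tgt")  # initial pass

    for _ in range(max_rounds):
        # Target side: postedit and mark problematic phrases (annotation requests).
        state.target, requests = target_task(state)
        state.annotations.extend(requests)   # projected onto the source side

        # Source side: check the back-translation, add redundancy, answer requests.
        back = machine_translate(state.target, "tgt->src")
        state.source, answers, state.accepted = source_task(state, back)
        state.annotations.extend(answers)    # projected onto the target side

        if state.accepted:                   # stopping condition reached
            break
        # Re-translate the (possibly paraphrased) source for the next round.
        state.target = machine_translate(state.source, "src->tgt")

    return state
```

In a crowdsourced deployment, each callback would of course be carried out by a different worker drawn from the relevant crowd rather than by a single function.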

5.4. Roles of Monolingual People in the Protocol

In the crowdsourced monolingual translation protocol, the target language speakers and the source language speakers have different roles. The target language speakers represent the consumers of the translation. At a high level, the target language speakers' goal is to recover the original meaning from the current translation, which may contain errors. A straightforward way to do so is monolingual postediting: editing the translation according to the inferred meaning and correcting any translation errors. This is very much like what any user of an online machine translation engine (such as Google Translate or Babel Fish) does when reading imperfect translation from an unfamiliar language. People are good at recovering the original message even when some information is missing [Shannon et al. 1950]. Monolingual people are able to do so because humans have a very rich body of knowledge about the language they speak as well as a full range of contextual knowledge about the world. (In effect, they are very good language models.) Therefore, monolingual postediting alone can recover the original meaning to some extent.

However, monolingual postediting alone forms an open-loop system. From the initial machine translation alone, the errors correctable by target language speakers can be limited. Therefore, the crowdsourced monolingual translation protocol introduces a feedback loop by enabling the target language speakers to request that more information be added to the original sentence. The target language speakers can make such requests by identifying phrases that need annotations, in some cases along with the types of annotation needed. For example, the target language speakers can mark a phrase to indicate a possible translation error or annotate the phrase with a predefined question ("Is this phrase the name of a person?"). Their requests will be answered by source language speakers, resulting in increased redundancy in the sentence being translated.

The source language speakers know the intended meaning because they have the original sentence. At a high level, their goal is to confirm that the target language speakers have produced the correct translation. If the current translation is not correct yet, the source language speakers also act as the source of redundancy by adding more information to the original sentence.


Redundancy can be added by minimally editing the back-translation so it matches the original meaning. This is in effect paraphrasing the original source sentence using the words in the back-translation. Every paraphrased source sentence is then translated using machine translation and given to the target language speakers. Paraphrasing the original sentence helps to add redundancy to the translation process, and using the words of the back-translation makes the paraphrase more likely to be translated correctly by the machine translation engine.19

In addition to providing textual redundancy, which is sent through machine translation, source language speakers also respond to target language speakers' requests for annotations through the annotation channel. The annotation channel explicitly supports increasing the level of redundancy available to the monolingual people by attaching annotations to phrases. This is because with increased redundancy, translation errors can become easier to correct.20

The annotation channel depends on the ability to link annotations attached to part of a sentence in one language to the corresponding part of its translation in the other language (and vice versa). For example, if the source language speakers annotate the phrase "the story of Cinderella," the system links the annotation to the piece of the sentence that phrase was translated into, in order to convey that information back to the target language speakers. Linking is done by a technique called annotation projection [Hwa et al. 2002; Yarowsky et al. 2001].

The Morita system [Morita and Ishida 2009] (developed independently and simultaneously) also creates a feedback loop between the source language speaker and the target language speaker. The feedback loop in their case consists of two channels: the back-translation and the yes/no questions about whether to end each phase. The major difference between this feedback loop and our protocol is that our protocol feeds back more information by introducing a channel using language-independent annotations. Compared to the one bit of yes/no information in the Morita system, our annotation channel can convey more information and can reduce translation errors at a more fine-grained level. Our annotation channel is designed to decouple the communication between monolingual people from the unreliable translation channel as much as possible. Unlike the translation channel, it does not translate text directly. Instead, it augments the text being translated by enabling users to annotate it.21
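Annotation projection, mentioned above, can be sketched as mapping an annotated token span through the word alignment produced during machine translation. The fragment below is a simplified illustration under the assumption that a token-level alignment is available; real systems must handle one-to-many and null alignments, which are glossed over here, and the example alignment is invented.

```python
def project_annotation(span, alignment):
    """Map a set of annotated token positions in one sentence to the
    corresponding token positions in its translation.

    span      -- indices of the annotated tokens in the annotated sentence
    alignment -- iterable of (annotated_index, other_index) pairs, e.g. taken
                 from the MT system's word alignment (assumed available here)
    """
    projected = {other for ann, other in alignment if ann in span}
    return sorted(projected)

# Toy example: "la historia" (source tokens 4-5) aligns to "the history"
# (target tokens 3-4), so a mark on those source tokens projects to [3, 4],
# and the reverse projection works the same way with the alignment flipped.
alignment = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6)]
print(project_annotation({4, 5}, alignment))  # -> [3, 4]
```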

6. BRIEF SUMMARY OF SYSTEMS

We have previously reported three systems that implemented the crowdsourced monolingual translation protocol. Each system was built on top of the previous system to solve a bigger problem. First, MonoTrans (Figure 3) established the feasibility of the protocol through a small-scale study [Hu et al. 2010]. MonoTrans2 (Figure 4) expanded the protocol's application not only to a larger group of participants [Hu et al. 2011] but also to participants who do not speak English, in a very different technological (and social) environment [Hu et al. 2011]. Finally, the MonoTrans Widgets system (Figure 5) was deployed "in the wild" to real-world casual web users [Hu et al. 2012].

The tasks performed by users evolved with each system. MonoTrans assumed that users had a clear understanding of the translation protocol; MonoTrans2 provided contextual instructions for each step of the process, but it still required an understanding of the protocol to some extent; with MonoTrans Widgets, this expectation was totally removed. With MonoTrans and MonoTrans2, users chose sentences to work on; with MonoTrans Widgets, they were assigned sentences to work on by the system.

19 Assuming machine translation engines in both directions were trained with similar data.
20 Consistent with machine translation terminology, a "phrase" here means a subsentential sequence of words that may or may not be a syntactic constituent.
21 Nonetheless, one could argue that the understanding of nontextual information (such as pictures) is still dependent on cultural context and is thus language dependent. However, such understanding is less dependent on language than language itself. For example, there are many cases where different cultures understand a picture in similar ways.


Fig. 3. Our first system, MonoTrans. A children's book, The Blue Sky, is being translated from English to Chinese.

Fig. 4. The second system, MonoTrans2. The interface consists of a list of sentences overlaid on top of the background picture (middle), the navigation buttons (top), and the contextual help (right).

6.1. Types of Annotation

Our previously reported systems include the following types of annotations [Hu et al. 2010, 2011, 2012]:


Fig. 5. The third system, MonoTrans Widgets, deployed on an ICDL page. The widget is placed in the center of the page, with the rest of the page dimmed. When the “show context” button (below the left text box) is clicked, the widget displays the sentences immediately before and after the current sentence. When users click on the “How am I helping?” link, an explanation of the monolingual translation protocol is shown. The interface language can be selected using the drop-down language selector in the lower-left region of the widget.


—Source language speakers:
  —Attach Wikipedia link: Attach a Wikipedia entry to explain the problematic phrase.
  —Attach web link: Attach a web URL to explain the problematic phrase.
  —Attach picture: Attach a picture to explain the problematic phrase.
  —Rephrase a phrase: Express the problematic phrase in a different way.
  —Give a yes/no answer: Respond "yes" or "no" to a question asked by the target language speakers.
  —Give an answer with templates: Use predefined templates to answer a question asked by the target language speakers. Templates include "This is a person" and "This is a place."
—Target language speakers:
  —Mark correct phrase: Indicate that the phrase does not need improvement.
  —Mark translation error: Indicate that the phrase appears problematic in the target language.
  —Ask a question with templates: Use predefined templates to ask a question about the phrase. Questions include "Is this a person?" or "Is this a place?"

Notice that this is the set of all possible user tasks; some of them overlap because only one of the overlapping tasks was implemented in each specific system.
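For concreteness, the user tasks above can be thought of as a small vocabulary of actions. The enumeration below is only an illustrative data model (the identifier names are ours, not from the MonoTrans code), showing how each crowd's actions from the list might be encoded in a system that logs submissions.

```python
from enum import Enum

class SourceAction(Enum):
    """Actions available to source language speakers."""
    ATTACH_WIKIPEDIA_LINK = "attach_wikipedia_link"
    ATTACH_WEB_LINK = "attach_web_link"
    ATTACH_PICTURE = "attach_picture"
    REPHRASE = "rephrase"
    ANSWER_YES_NO = "answer_yes_no"
    ANSWER_WITH_TEMPLATE = "answer_with_template"  # e.g., "This is a person."

class TargetAction(Enum):
    """Actions available to target language speakers."""
    MARK_CORRECT = "mark_correct"
    MARK_TRANSLATION_ERROR = "mark_translation_error"
    ASK_WITH_TEMPLATE = "ask_with_template"        # e.g., "Is this a person?"
```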

7. TRANSLATION QUALITY EVALUATION

We conducted two types of evaluation of the crowdsourced monolingual translation protocol. We evaluated the overall translation quality of the systems implemented (this section). We also studied the effectiveness of the protocol by detailed analyses of the feedback loop and the annotation channel, a comparison between the protocol and postediting, and an error category analysis (Section 8). In all the experiments conducted, the working language was English.

7.0.1. Evaluation Protocol. For the overall translation quality evaluation, human evaluators were hired to evaluate every sentence translated. All evaluators were native bilingual speakers of both the source and the target languages, and they were unfamiliar with the project. Every sentence translated was evaluated by all evaluators. The sentences were anonymized so the evaluators did not know which sentences were generated by which systems.

Evaluators were given the original source text as reference. Each evaluator gave two scores to every target sentence: fluency and accuracy. To prevent the evaluators' fluency judgment from being confounded by the accuracy judgment, the evaluators were asked to first read the translation alone, rate its fluency, and then rate the translation's accuracy by comparing it to the reference. The ground truth translations were generated by a professional translation firm.22

7.0.2. Translation Fluency. Fluency evaluation followed a standard scoring procedure [Dabbadie et al. 2002]. A sentence's fluency indicates how natural it is in the target language alone. In the experiments, two fluency scoring scales were used. The first experiment used a 4-point scale [Dabbadie et al. 2002] to evaluate fluency with MonoTrans (Section 7.1):

(1) Unintelligible: Nothing or almost nothing of the translation is comprehensible.
(2) Barely intelligible: Only a part of the translation (less than 50%) is understandable.

22 Lengua Translation, at http://lenguatransaltion.com. The translation firm was not hired to perform translation quality evaluation directly because their service for doing so was too expensive.


(3) Fairly intelligible: The major part of the translation passes.
(4) Very intelligible: All the content of the message is comprehensible, even if there are errors of style and/or spelling, and if certain words are missing or are badly translated but close to the target language.

We then introduced a 5-point scale, which provided a finer comparison between translations containing minor mistakes and those containing no mistakes:

(1) Unintelligible: Nothing or almost nothing of the translation is comprehensible.
(2) Barely intelligible: Only a part of the translation (less than 50%) is understandable.
(3) Fairly intelligible: The major part of the translation passes.
(4) Intelligible: All the content of the translation is comprehensible, but there are errors of style and/or of spelling, or certain words are missing.
(5) Very intelligible: All the content of the translation is comprehensible. There are no mistakes.

A fluent translation does not necessarily carry the original meaning. (For example, imagine a translation mechanism that translates every sentence into "John loves Mary.") Therefore, we also evaluated translation accuracy.

7.0.3. Translation Accuracy. Accuracy evaluation also followed the standard scoring procedure described in Dabbadie et al. [2002]. A translation's accuracy indicates how much of the meaning in the source text it carries. An accurate translation is not necessarily fluent. (For example, "Me no speak English" is an accurate translation of "No hablo inglés.") The human evaluators judged each sentence's accuracy by giving a score between 1 and 5:

(1) None of the meaning expressed in the original sentence is expressed in the translation.
(2) Little of the original sentence meaning is expressed in the translation.
(3) Much of the original sentence meaning is expressed in the translation.
(4) Most of the original sentence meaning is expressed in the translation.
(5) All meaning expressed in the original sentence appears in the translation.

7.1. Translation Quality Evaluation for MonoTrans

MonoTrans (Figure 3) is our first crowdsourced monolingual translation system [Hu et al. 2010]. MonoTrans was evaluated by translating a children's book from Russian to Chinese.23 Two Russian speakers and four Chinese speakers formed four pairs to use MonoTrans. (One Russian speaker participated three times with different content and partners.) The participants were effectively monolingual: they were all native speakers of one language and had no knowledge of the other. Each pair of participants spent an hour on the task, working synchronously. In total, participants worked on six pages (a total of 44 sentences) and finished translating 28 of the 44 sentences.24

After the experiment, the sentences translated by MonoTrans were evaluated for translation quality by a professional translator. In the results (Figure 6), 16 of the 28 sentences translated with MonoTrans were rated as fully fluent (fluency score 4 in Figure 6(a)), and 19 of the 28 sentences were rated as mostly or fully accurate (accuracy scores 4 or 5 in Figure 6(b)). There were also incomplete translations with very high quality, but only completed translations were included in the results.

The results showed an obvious shift in translation accuracy (Figure 6(b)). If the results are coarsely categorized so that those with accuracy scores 1 and 2 are considered inaccurate and those with scores 4 and 5 are considered highly accurate, then there is a drop in the number of inaccurate sentences from 12 to four out of 28, and an increase in highly accurate sentences from seven to 19 of 28—roughly a factor of 3 in each of the desirable directions. In addition, the number of completely inaccurate machine translation outputs (score 1, none of the meaning preserved) dropped from 6 to 0. This showed that the protocol was helping the target language speakers understand at least some of the meaning even when the original machine translation output quality was so low that the target language speakers had very little from which to infer the original meaning.

The improvements in fluency were to be expected given the target language speakers' editing task. The fact that the fluency scores were not perfect might seem unexpected given the instructions, but we believe this was due to natural variation in human judgment about fluency.

23 In March 2010.
24 In this experiment, the completion of translation had to be agreed upon by both participants. Therefore, although more sentences were worked on during the experiment, only sentences with mutually agreed upon completion were evaluated for translation quality.


Fig. 6. The distributions of fluency and accuracy scores (from one evaluator) for MonoTrans output compared with Google Translate output. MonoTrans output has more sentences with scores 4 and 5 than the output from Google Translate.

7.2. Translation Quality Evaluation for MonoTrans2

MonoTrans2 (Figure 4) introduces asynchronous interaction between monolingual users [Hu et al. 2011]. We evaluated MonoTrans2 by conducting a translation experiment in which monolingual people used MonoTrans2 to translate five children's books selected from the ICDL's collection.25 In the experiment, participants worked on translating four Spanish books into German and one German book into Spanish. All translations were from the language in which the book was originally published. Participants were recruited from a database of ICDL volunteer translators; people who spoke German or Spanish but not both were solicited. Sixty (60) fluent Spanish speakers and 22 fluent German speakers participated. In 4 days, participants worked on 162 sentences.26

After the experiment, two fluently bilingual people (who speak Spanish and German) unfamiliar with the project were recruited as paid evaluators to assess the translation quality on all five books. They evaluated the fully automatic output of Google Translate

25 Intended for readers ages 6–9.
26 The books contain 242 sentences in total. We did not optimize for sentence completion during the experiment.


Table II. Distribution of Fluency Scores (top) and Accuracy Scores (bottom) by Evaluator. Each column shows the number of sentences receiving that score. B1 and B2 are bilingual evaluators. (See also Figure 7.)

Fig. 7. The distributions of fluency and accuracy scores for MonoTrans2 output compared with Google Translate output (Error bars: standard error).

(as a baseline) and the output of MonoTrans2 (using Google Translate as the translation channel).

Table II and Figure 7 summarize the results. Unsurprisingly, MonoTrans2 produced large gains in fluency compared to machine translation alone. This improvement in fluency (Figure 7(a)) was to be expected given the instructions, as was the heavy skew toward top fluency. Indeed, anything except a top score in fluency would seem unexpected given the instructions, but (just as in Section 7.1) we do see natural variation in human judgment. The shift in accuracy was more notable (Figure 7(b)): using MonoTrans2, the peak of the accuracy distribution shifted from 3 to 5.

We performed four two-tailed paired t-tests between scores of MonoTrans2 and scores of Google Translate for both fluency and accuracy for each bilingual evaluator. For evaluator B1, the average fluency scores were Google 3.78, MonoTrans2 4.61, and the average accuracy scores were Google 3.63, MonoTrans2 4.72; for evaluator B2, the average fluency scores were Google 3.08, MonoTrans2 4.45, and the average accuracy scores were Google 3.07, MonoTrans2 4.45. In all cases, p < 10⁻¹⁵.

We also performed χ² tests on the scores (both for individual evaluators and for the evaluators in aggregate). The χ² values were as follows (DOF = 4): for evaluator B1, fluency 89.4, accuracy 85.1; for evaluator B2, fluency 115, accuracy 115; for both evaluators in aggregate, fluency 192, accuracy 192. In all cases, the p values were well under .001. These results showed that MonoTrans2, using only monolinguals, significantly improved translation fluency and accuracy over Google Translate.
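Both tests can be reproduced mechanically from per-sentence score pairs. The sketch below shows the shape of that analysis with SciPy; the score arrays are invented placeholders, not the study data, and the helper name score_counts is ours.

```python
import numpy as np
from scipy import stats

# Hypothetical per-sentence accuracy scores from one evaluator (1-5 scale),
# for the same sentences under both systems.
google_acc = np.array([3, 4, 2, 5, 3, 4, 3, 2, 1, 5])
mono2_acc  = np.array([5, 5, 4, 5, 4, 5, 5, 4, 3, 5])

# Two-tailed paired t-test on the paired per-sentence scores.
t_stat, p_value = stats.ttest_rel(mono2_acc, google_acc)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4g}")

# Chi-squared test on the two score distributions (counts of each score 1-5),
# giving a 2x5 contingency table with DOF = (2-1)*(5-1) = 4.
def score_counts(scores):
    return [int(np.sum(scores == s)) for s in range(1, 6)]

table = np.array([score_counts(google_acc), score_counts(mono2_acc)])
chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"chi-squared: chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4g}")
```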


Table III. Number of Sentences with Maximum Possible Fluency and Accuracy (N = 162).

Table III conveys this experiment's bottom-line results more strikingly. On the very conservative criterion that a translation output is considered high quality only if both bilingual evaluators rated it a 5 for both fluency and accuracy, Google Translate produced high-quality output for 10% of the sentences, while MonoTrans2 improved this to 68%. The most notable result was the comparison between machine translation and MonoTrans2: a dramatic improvement in the production of high-quality translations, without requiring any human bilingual expertise. These results suggested MonoTrans2 could potentially convert 68% of bilingual translators' time to validation rather than full translation, although the role of those bilinguals would remain necessary.

Overall, the experiment showed that MonoTrans2 can improve translation quality over machine translation alone. However, machine translation is a rather low baseline. In Section 8.1, we give a comparison between MonoTrans2 and postediting as a higher baseline.
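The strict high-quality percentage follows directly from the four ratings each sentence receives (fluency and accuracy from each of the two evaluators). A minimal sketch with an invented ratings structure:

```python
def high_quality_rate(ratings):
    """Fraction of sentences rated 5 for both fluency and accuracy by both evaluators.

    `ratings` maps a sentence id to {"B1": (fluency, accuracy), "B2": (fluency, accuracy)}.
    """
    strict = [
        sid for sid, by_eval in ratings.items()
        if all(flu == 5 and acc == 5 for flu, acc in by_eval.values())
    ]
    return len(strict) / len(ratings)

# Toy example with three sentences; only s1 meets the strict criterion.
toy = {
    "s1": {"B1": (5, 5), "B2": (5, 5)},
    "s2": {"B1": (5, 4), "B2": (5, 5)},
    "s3": {"B1": (3, 4), "B2": (4, 4)},
}
print(f"{high_quality_rate(toy):.0%}")  # -> 33%
```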

7.3. Quantitative Evaluation for MonoTrans Widgets

MonoTrans Widgets (Figure 5) is a system based on micro-tasks [Hu et al. 2012]. It was deployed on the ICDL website from August 2011 to May 2012. We analyzed the translation quality resulting from a trial deployment of MonoTrans Widgets during the 14-day period spanning September 5 to September 18, 2011. During this period, 27,858 users viewed the MonoTrans Widgets, and there were 6,358 widget task submissions.

Because English and Spanish are the two most common languages on the ICDL, we selected one English book (for translation into Spanish) and one Spanish book (for translation into English) for this quantitative evaluation. The English book contains 30 sentences, and the Spanish book contains 24 sentences. Both books were translated from the language in which they were originally published. Compared to other books (in other languages), these books had more edited sentences during the 14-day period.27 The two book translations received 3,678 user submissions (including edits, votes, error identifications, and explanations) from 739 workers. The average time spent per submission on either side was 126 seconds. On average, each sentence was edited 1.1 times by both the English speakers and the Spanish speakers.

Two native bilingual evaluators (who speak both English and Spanish) were recruited to assess translation quality for the fully automatic output of Google Translate (as the baseline) and for the output of MonoTrans Widgets (using Google Translate as the translation engine). To evaluate interevaluator agreement, we used Krippendorff's alpha with the interval distance metric. The scores are as follows: α(en, fluency) = 0.516, α(es, fluency) = 0.680, α(en, adequacy) = 0.432, α(es, adequacy) = 0.642.

The fluency and accuracy results (Figure 8) show that MonoTrans Widgets produced higher-quality translations compared to Google Translate. A paired t-test was conducted between fluency and accuracy ratings given to translations of the same sentence by the two systems. All the evaluators rated both the fluency and accuracy of the MonoTrans Widgets translations higher than the Google Translate translations, and all of these differences were statistically significant (p < 0.05): for evaluator B1, the average fluency scores were Google 4.57, MonoTrans Widgets 4.83, and the average accuracy scores were Google 4.11, MonoTrans Widgets 4.70; for evaluator B2, the average fluency scores were Google 4.19, MonoTrans Widgets 4.52, and the average accuracy scores were Google 3.61, MonoTrans Widgets 4.20.
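Krippendorff's alpha with the interval distance metric can be computed directly from the two evaluators' scores per sentence. Below is a compact, self-contained sketch of that computation; it assumes every sentence was scored by both raters, and the example scores are invented rather than taken from the study.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha with the interval distance metric.

    `ratings` is a list of units (sentences); each unit is the list of numeric
    scores the raters gave to that unit (no missing values handled here).
    """
    # Coincidence counts: how often each ordered pair of values co-occurs
    # within a unit, weighted by 1 / (m_u - 1).
    coincidences = Counter()
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # units with a single rating are not pairable
        for a, b in permutations(range(m), 2):
            coincidences[(unit[a], unit[b])] += 1.0 / (m - 1)

    n = sum(coincidences.values())       # total number of pairable values
    delta = lambda c, k: (c - k) ** 2    # interval distance

    # Observed disagreement.
    d_o = sum(count * delta(c, k) for (c, k), count in coincidences.items()) / n

    # Expected disagreement from the marginal value frequencies.
    marginals = Counter()
    for (c, _), count in coincidences.items():
        marginals[c] += count
    d_e = sum(
        marginals[c] * marginals[k] * delta(c, k)
        for c in marginals for k in marginals
    ) / (n * (n - 1))

    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Toy usage: each unit is one sentence with the two evaluators' scores.
toy_scores = [[5, 4], [3, 3], [5, 5], [2, 3], [4, 4], [5, 4]]
print(round(krippendorff_alpha_interval(toy_scores), 3))  # -> 0.748
```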

27These books are also intended for 6–9-year-old children. While more editing probably implied better translation quality, expected translation quality was not a factor in the selection of these books.


Fig. 8. The distributions of fluency and accuracy scores for MonoTrans Widgets output compared with Google Translate output. Notice that there are more sentences with the highest fluency and accuracy scores (5) among the MonoTrans Widgets output (error bars: standard error).

The fluency and accuracy results (Figure 8) show that MonoTrans Widgets produced higher-quality translations than Google Translate. A paired t-test was conducted between the fluency and accuracy ratings given to translations of the same sentence by the two systems. All the evaluators rated both the fluency and accuracy of the MonoTrans Widgets translations higher than the Google Translate translations, and all of these differences were statistically significant (p < 0.05): for evaluator B1, the average fluency scores were Google 4.57, MonoTrans Widgets 4.83, and the average accuracy scores were Google 4.11, MonoTrans Widgets 4.70; for evaluator B2, the average fluency scores were Google 4.19, MonoTrans Widgets 4.52, and the average accuracy scores were Google 3.61, MonoTrans Widgets 4.20.

As in Section 7.2, on the very conservative criterion that a translation output is considered high quality only if both bilingual evaluators rated it a 5 for both fluency and accuracy, Google Translate produced high-quality output for 31% of the sentences, and MonoTrans Widgets improved this percentage to 52%. The results show that MonoTrans Widgets significantly improved translation fluency and accuracy over machine translation alone, with only monolingual people involved.

In previous experiments (Section 7.2), MonoTrans2 improved the percentage of high-quality sentences from Google Translate's 10% to 68%. However, the comparison between MonoTrans2 and MonoTrans Widgets is not strict, because the systems translated different materials between different language pairs, and different users participated in the translation. The bottom-line message is that while both systems yielded quality improvements without the need for bilingual people, MonoTrans Widgets successfully engaged casual ICDL users in a way that was not possible with MonoTrans2.

7.3.1. Translation Speed. The previous translation quality evaluation did not include translation speed. A rough estimate of translation speed28 showed that MonoTrans2 translated roughly 800 words per day during the experiment, which is a third to a half of the speed of a professional translator.29

The speed difference between crowdsourced monolingual translation and professional translation is obvious. Together with the difference in translation quality, it may seem to imply that crowdsourced monolingual translation cannot compete with professional translation. However, the value of crowdsourced monolingual translation lies not in competing with or replacing professional translation, but in filling the gap between machine translation and professional translation. Crowdsourced monolingual translation enables translation where few bilinguals are available, and it provides higher-quality output than machine translation.
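The interevaluator agreement figures reported in Section 7.3 use Krippendorff's alpha with an interval distance metric. A minimal sketch of that computation with the Python krippendorff package is shown below; the rating matrix is hypothetical (np.nan marks sentences an evaluator did not rate), and the package is only one possible implementation, not necessarily the one the authors used:

import numpy as np
import krippendorff  # pip install krippendorff

# Hypothetical fluency ratings: one row per evaluator, one column per sentence.
ratings = np.array([
    [5, 4, 3, 5, 2, 4, np.nan, 5],  # evaluator B1
    [5, 4, 4, 4, 2, 5, 3,      5],  # evaluator B2
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha (interval) = {alpha:.3f}")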

28There were 20 words per typical sentence in the dataset, and the experiment lasted for 4 days. 29A typical professional translator’s speed is 1,000 to 2,500 words per day. This information was taken from Translia, at http://www.translia.com/translation_agencies/.


Fig. 9. Study comparing postediting and MonoTrans2. The five conditions are shown here.

8. PROTOCOL EFFECTIVENESS STUDIES

The previous section summarizes the prior translation quality evaluations, comparing three systems with the machine translation baseline. In addition to that overall comparison, we now attempt to answer the following, more specific questions about the crowdsourced monolingual translation protocol:
—How does crowdsourced monolingual translation compare to approaches other than (and better than) the machine translation baseline?
—How does the feedback loop improve quality?
—Does using annotations improve quality?
—What types of translation errors does the crowdsourced monolingual translation protocol produce?

8.1. Comparison to Postediting

In the previous section (Section 7), we compared the translation quality of the systems to that of machine translation. While machine translation certainly establishes a baseline for the crowdsourced monolingual translation protocol, quality improvement over this baseline is not very surprising, because all the systems used machine translation as an initial pass. A more informative comparison, therefore, is with approaches that also use human knowledge to improve machine translation. Here we present a comparative study between MonoTrans2 and postediting, which is “to edit, modify and/or correct pre-translated text that has been processed by a machine translation system” [Allen 2003].

We compared MonoTrans2 against both monolingual postediting and the more standard bilingual postediting. In addition, we also investigated the bilingual postediting effort needed to enhance MonoTrans2 output. The conditions compared were (Figure 9):
—Google Translate output, no postediting (Google)
—Monolingual postediting of Google Translate output (Google-M)
—MonoTrans2 output, no postediting (MonoTrans2)
—Bilingual postediting of Google Translate output (Google-B)
—Bilingual postediting of MonoTrans2 output (MonoTrans2-B)

In addition to MonoTrans2 itself and Google as the baseline, two postediting approaches, Google-B and Google-M, were included.

Finally, MonoTrans2-B was included to investigate the bilingual postediting effort needed to bring MonoTrans2 output to high quality.

The same set of children's books (Section 7.2) was used as the dataset. These books contained 242 sentences in total, and the output from both MonoTrans2 and Google Translate collected in the previous experiment (Section 7.2) was reused (each set contained 162 sentences; the other sentences were not worked on by MonoTrans2 users). Five monolingual people postedited the Google Translate output; three bilingual people postedited both the Google Translate output and the MonoTrans2 output. The monolingual posteditors were asked to edit the first-pass translation by Google Translate without the source sentence as reference; the bilingual posteditors were asked to make edits given the original sentences in the source language. In every condition, every sentence was edited by one person. Although this might have introduced between-editor variance, it was closest to how sentences had been edited in MonoTrans2 and therefore made the postedited sentences more comparable with MonoTrans2 output.

Two native bilingual evaluators (speakers of German and Spanish, different from the posteditors and different from the evaluators in Section 7.2) were then hired independently to evaluate the fluency and accuracy of the sentences in all five conditions (so that sentences in all conditions were evaluated by the same evaluators). The distributions of scores (aggregated for both evaluators) are shown in Figure 10. To further investigate the statistical significance of the differences shown, we performed one-way repeated measures analysis of variance (RM-ANOVA) tests for the scores (Tables IV and V). In these tests, we treated the sentences as “subjects” and the conditions (translation methods) as “treatments.” For every translated sentence, the average fluency and accuracy scores of the two evaluators were calculated, and an RM-ANOVA test was then performed on the fluency and accuracy scores, respectively, with the scores (matched by original sentence) as the dependent variable and the condition (translation method) as the independent variable. We then performed post hoc Tukey tests (Table V) for pairs of conditions that were adjacent in terms of fluency and accuracy (in Figures 10(a) and 10(b)).

These results show that (1) MonoTrans2 was significantly better than monolingual postediting at improving accuracy, and that (2) monolingual postediting did not significantly improve translation accuracy over Google's machine translation output. The accuracy results (Table V, bottom) show a statistically significant difference between the accuracy of MonoTrans2 output (MonoTrans2) and that of monolingual postediting output (Google-M). In contrast, there is no significant difference between the accuracy scores from monolingual postediting output (Google-M) and machine translation output (Google). This suggests an advantage of MonoTrans2 over monolingual postediting: while MonoTrans2 improved translation accuracy over machine translation output, monolingual postediting of machine translation output did not improve accuracy.

Regarding fluency (Figure 10(a) and Table V, top), MonoTrans2 and both bilingual postediting conditions (Google-B and MonoTrans2-B) had the most sentences rated as highly fluent (score 5 in Figure 10(a)).
These results are not surprising, because all three conditions involved editing by target language speakers, whether monolingual or bilingual. It is unexpected, however, that monolingual postediting of machine translation output (Google-M) did not generate as many fluent sentences as those three conditions. One possible explanation is that, because the monolingual posteditors were unable to infer the original meaning from poor machine translation, they were unable to postedit some sentences into fluent ones. Overall, these findings provide statistical confirmations that are consistent with our intuition about crowdsourced monolingual translation.


Fig. 10. Fluency and accuracy scores for the conditions in the comparative study. The error bars show the standard deviation. For the highest accuracy, MonoTrans2 and the two bilingual postediting conditions are close together.

Table IV. Result of RM-ANOVA

Compared to monolingual postediting, MonoTrans2 was significantly better at improving both translation accuracy and fluency over machine translation.


Table V. Result of the Post Hoc Tukey HSD Test between Pairs of Conditions
The average score improvement shows the improvement of condition 2 over condition 1. For example, MonoTrans2's average accuracy score is 0.30 points higher than Google-M's. (Significance codes: ‘***’ p < 0.001, ‘**’ p < 0.01, ‘*’ p < 0.05, ‘.’ p < 0.1; “NS” indicates that the effect was not significant.)

Table VI. Percentage of High-Quality Results (Fluency = Accuracy = 5 by Both Evaluators)

To understand the amount of bilingual postediting effort needed to turn MonoTrans2 output into “perfect” translation, we counted the number of sentences that were rated as highly fluent (score 5) and highly accurate (score 5) by both evaluators and normalized them by the total number of sentences processed by MonoTrans2 (Table VI). (This criterion is the same as in the last row of Table III, Section 7.2.) There was a difference of roughly 3% between MonoTrans2 and MonoTrans2-B. This result differs from (and is in a sense more optimistic than) the 68% shown in Table III (Section 7.2), because here both MonoTrans2 and MonoTrans2-B underwent evaluation, whereas in the previous evaluation only MonoTrans2 was evaluated. This result also implies that MonoTrans2 output (MonoTrans2) is in fact much closer to bilingually postedited sentences (MonoTrans2-B) than the previous evaluation implied.30
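For readers who wish to replicate this kind of analysis, a minimal sketch with statsmodels is given below. The per-sentence scores are synthetic stand-ins for the averaged evaluator ratings, and the plain Tukey HSD used here ignores the repeated-measures pairing, so it is only an approximation of the post hoc tests reported in Table V:

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
conditions = ["Google", "Google-M", "MonoTrans2", "Google-B", "MonoTrans2-B"]

# Synthetic long-format data: one mean accuracy score (averaged over the two
# evaluators) per sentence per condition.
rows = [{"sentence": s, "condition": c,
         "accuracy": float(np.clip(rng.normal(4.0, 0.8), 1, 5))}
        for s in range(162) for c in conditions]
scores = pd.DataFrame(rows)

# RM-ANOVA: sentences act as "subjects", conditions as "treatments".
print(AnovaRM(scores, depvar="accuracy", subject="sentence",
              within=["condition"]).fit())

# Post hoc pairwise comparisons between conditions (plain Tukey HSD).
print(pairwise_tukeyhsd(scores["accuracy"], scores["condition"],
                        alpha=0.05).summary())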

8.2. The Feedback Loop as a Clarification Dialogue

As discussed in Section 5.3, the feedback loop supports a kind of dialogue between the monolingual crowds. According to Gabsdil [2003] and Clark and Marshall [1981], clarification dialogues help to create common ground in human communication. Although research on clarification dialogues (especially in automatic question-answering systems) has focused on monolingual situations, in which both the requester and the responding machine use the same language [Gabsdil 2003], we use these concepts here to analyze some translation cases in the data collected through MonoTrans Widgets. Through the lens of clarification theory, the feedback loop can be seen as a clarification dialogue between the source language speakers and the target language speakers. In this dialogue, the source language speakers clarify phrases that the target language speakers highlighted; the source language speakers also confirm the best translation candidate that the target language speakers generated.

30It should be pointed out, however, that there is a difference between the percentages of high-quality sentences in the two evaluations (i.e., 68% in Section 7.2 vs. 41% here). Further investigation showed that this difference was due to a much stricter bilingual evaluator in this study compared to more lenient evaluators in the previous one. Future work will look more closely at evaluator variance.


Table VII. Example of Source-Side Rephrasing That Is Robust to Machine Translation Errors
Original sentence: ...metió debajo de la almohada los dientes postizos de la abuela y se durmió...
Ground truth: ...he put his grandmother's dentures under the pillow and fell asleep...
Machine translation: ...put it under your pillow the grandmother's dentures and fell asleep...
Edited (source side): ...se puso la dentadura postiza de su abuela debajo de la almohada y se durmió...
Machine translation: ...he put his grandmother's false teeth under the pillow and fell asleep...

Table VIII. Example of Target-Side Edits with Mismatched Context
Original sentence: ¡Un puente para ir al otro lado del océano y buscar más dientes!
Ground truth: A bridge to go across the ocean and look for more teeth!
Machine translation: A bridge to go across the ocean and look for more teeth!
Edited (target side): Let's make a bridge over the sea, so we can find more fish!

The data collected with MonoTrans Widgets contain both successful clarification dialogues and cases where clarification dialogues fail. In this section, we present some examples to illustrate the properties of successful clarification dialogues.

First of all, clarification from the source side should be robust against unreliable machine translation. In Table VII, the subject was omitted in the original sentence, and the resulting machine translation did not have a subject. The source language speakers added a third-person pronoun (“se”) to clarify, and the translated sentence contained the correct segment (“he put”). In fact, the segment “se puso” was the result of back-translation from an earlier translation candidate, and this suggests that using phrases generated by machine (back-)translation may lead to source sentences that are “easier to translate” for the machine translation engine and thus may result in better translations in the target language.

Even if correctly translated, clarification from the source side should also match the receivers' world knowledge to be accepted on the target side. Table VIII shows an example where a correct machine translation was rejected by the target language speakers. In this example, “teeth” in the machine translation was the correct translation of “dientes” in the source sentence, but it was changed by the target language speakers into “fish.” We conjecture that, with the widget design, some target language speakers did not have enough context about the story being translated (which is about a child's tooth) and therefore could only resort to their world knowledge, in which “ocean” (appearing earlier in the sentence) was more relevant to “fish” than to “teeth.” The mistranslation from “teeth” to “fish” is also consistent with the “undetectable” error type discussed in Hu et al. [2010]. When a translation error (such as this one) is consistent with the target language users' world knowledge, it is unlikely to be detected on the target side. To counterbalance monolingual speakers' world knowledge when it is inconsistent with the translation context, it may be helpful for the machine translation engine to indicate that translating “dientes” into “teeth” is much more likely than translating it into “fish,” i.e., to show the target language speakers a confidence score along with the translated phrase. This also implies that there is a tradeoff between the simplicity of the widgets and translation context, as discussed in Hu et al. [2012].

Finally, for the source language speakers to confirm that an error has indeed been corrected, target-side edits should be back-translated well. Table IX shows an example of unreliable back-translation.


Table IX. Example of Incorrect Back-translation
Original sentence: ...todos en casa estaban muy felices.
Ground truth: ...everyone at home was very happy.
Correct translation: ...everyone at home was very happy.
Back-translation: ...todo el mundo en casa estaba muy feliz.
Incorrect translation: ...everyone at home were very happy.
Back-translation: ...todos en casa estaban muy contentos.

Table X. The Distribution of Annotations Used in the ICDL Experiment

In this example, the correct translation in the target language (“everyone at home was”) resulted in an incorrect back-translation (“todo el mundo en casa estaba”), whereas the incorrect translation (“everyone at home were”) had a back-translation (“todos en casa estaban”) that better matched the original meaning.
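To make the confirmation step concrete, the sketch below checks how well a back-translation matches the original source sentence using a crude word-overlap score. The back-translations are stubbed with the strings from Table IX (a real system would obtain them from a machine translation engine), and the overlap score is only a stand-in for the judgment that source language speakers actually make:

# Back-translations taken from the Table IX example; a real system would
# call a machine translation service here.
BACK_TRANSLATIONS = {
    "...everyone at home was very happy.": "...todo el mundo en casa estaba muy feliz.",
    "...everyone at home were very happy.": "...todos en casa estaban muy contentos.",
}

def overlap(a: str, b: str) -> float:
    """Crude word-overlap score between two sentences (0..1)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

source = "...todos en casa estaban muy felices."
for candidate, back in BACK_TRANSLATIONS.items():
    print(f"{overlap(source, back):.2f}  {candidate}")
# The grammatically incorrect candidate scores higher here, mirroring the
# failure mode discussed above: a good target-side edit can look
# unconfirmable to the source side if its back-translation drifts.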

8.3. Effectiveness of Annotation Projection

With many annotation types defined, MonoTrans2 is a good platform for studying the effectiveness of the annotation channel in the crowdsourced monolingual translation protocol. In the experiment with MonoTrans2 (Section 7.2), participants generated 1,071 translation candidates. They generated fewer annotations than translation candidates: the target language speakers highlighted 284 phrases in the translation candidates, indicating that those phrases might be translation errors, and the source language speakers responded to 218 of the 284 phrases marked as possible translation errors, resulting in 218 annotations.

The distribution of annotations is given in Table X. The most frequently used annotation type was “paraphrasing.” One possible reason is that paraphrasing is the easiest type for the source and target language speakers to understand; another likely reason is that generating paraphrases requires the fewest actions.

These annotations provided a good dataset to answer the following questions:
—Did annotations improve quality?
—For every annotation type, did annotations of this type improve quality?

In MonoTrans2, all annotations related to the same sentence are visible to the users simultaneously. Therefore, once an annotation is generated, it may affect later translation candidates even if those candidates belong to other translation threads (of the same sentence). For this study, we defined the quality improvement introduced by an annotation as the quality difference between the best translation candidate generated before the annotation and the average quality among the translation candidates generated after the annotation.

Translation quality was measured with an automatic metric, TER (Translation Error Rate) [Snover et al. 2006]. TER is an error metric for machine translation that measures the number of edits required to change a system output into one of the references. TER scores are percentages, and lower TER scores represent better translation quality. Unlike in previous experiments, human judgments were not used, owing to the high volume of translation candidates and the time frame of this study.31

31 Another common metric, BLEU, was not used because it is used most appropriately to evaluate quality at the document level.


Table XI. Paired t-Test Results

To calculate TER, professional translators were used to generate ground truth reference translations (as described in Section 7.0.1). Overall, the quality improvement introduced by each annotation was calculated using Algorithm 1.

ALGORITHM 1: Calculating quality before and after annotation
function QUALITYBEFOREANDAFTER(a)
    t ← a.timeStamp
    select candidates c where c.timeStamp < t into list C1
    select from C1 the candidate cbest with the best (lowest) TER score
    qbefore ← TER(cbest)
    select candidates c where c.timeStamp ≥ t into list C2
    qafter ← mean(TER(c ∈ C2))
    return (qbefore, qafter)
end function
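A rough Python rendering of Algorithm 1 might look as follows. The ter() function here is a simplified word-level edit distance used only as a stand-in (real TER also allows shift operations and is usually computed with a dedicated implementation such as tercom), and the candidate records are made up for illustration:

from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    time_stamp: float

def ter(candidate: str, reference: str) -> float:
    """Stand-in for TER: word edit distance divided by reference length."""
    c, r = candidate.split(), reference.split()
    d = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        for j in range(len(r) + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j
            else:
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (c[i - 1] != r[j - 1]))
    return d[len(c)][len(r)] / max(len(r), 1)

def quality_before_and_after(annotation_time, candidates, reference):
    """Algorithm 1: best TER before the annotation vs. mean TER after it."""
    before = [c for c in candidates if c.time_stamp < annotation_time]
    after = [c for c in candidates if c.time_stamp >= annotation_time]
    q_before = min(ter(c.text, reference) for c in before)
    q_after = sum(ter(c.text, reference) for c in after) / len(after)
    return q_before, q_after

# Made-up example: two candidates and one annotation at time 2.0.
reference = "he put his grandmother's dentures under the pillow"
candidates = [
    Candidate("put it under your pillow the grandmother's dentures", 0.0),
    Candidate("he put the grandmother's false teeth under the pillow", 3.0),
]
print(quality_before_and_after(2.0, candidates, reference))

The paired t-test described next then compares the resulting lists of qbefore and qafter values across all annotations (e.g., with scipy.stats.ttest_rel).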

We then performed a paired t-test between the qbefore and qafter values (the TER scores before and after each annotation). We also conducted a paired t-test for each type of annotation. The results are shown in Table XI. For annotations as a whole, the statistically significant difference showed that annotations did improve translation quality. Paraphrase was the most effective type of annotation, consistent with the fact that it was also the most frequently used. Explanations with templates were not effective; a possible explanation is that the predefined templates did not give monolingual participants a chance to add new clarifying content.

This analysis is preliminary because of the interaction design of MonoTrans2: the improvement in quality might have resulted simply from participants being exposed to previous translation candidates. For a more precise study, a control condition in which no annotations are used would be needed to isolate the effect of annotations.

8.4. Error Category Analysis

As discussed in Section 5.4, the crowdsourced monolingual translation protocol reduces machine translation errors. To understand how the protocol corrected translation errors, we compared the translation errors in machine translation output and in MonoTrans Widgets output: the errors were classified into categories and compared within each category. The error category analysis used the categories defined in Vilar et al. [2006]; only the first-level categories and their immediate subcategories were used (see Figure 1 of Vilar et al. [2006]).

Two native English speakers (also proficient in Spanish) labeled the translation errors in both the Google Translate output and the MonoTrans Widgets output. For reference, they were shown both the Spanish original and an English ground-truth translation (produced by bilingual ICDL volunteer translators32). The labelers were given Vilar et al. [2006] as instructions. They were told to use only the first two levels of the categories and to follow the instructions as closely as possible.

32The bilingual translators did not use MonoTrans Widgets or MonoTrans2 to produce the translation. They translated the book as regular bilingual translators.


Table XII. Number of Errors in Google Translate (GT) Output and MonoTrans Widgets (MW) Output in Each Category
Categories in which MonoTrans Widgets produced far fewer errors are bolded. “Not categorized” indicates that the labeler labeled the error with only a first-level category.

Table XIII. Examples of Machine Translation Errors Corrected by MonoTrans Widgets These errors were labeled by both labelers. Only the segments of the sentences containing the errors are shown.

Using Vilar et al. [2006] as instructions produced highly subjective labels. In particular, one labeler did not assign second-level labels to many errors. Therefore, no statistical analysis was conducted. However, the analysis of the labels was still informative: it showed (see Table XII) that the MonoTrans Widgets output contained fewer words with incorrect sense or incorrect form.

Without reference to the original meaning (by definition of monolingual tasks on the target side), monolingual target language speakers sometimes “overcorrected” errors that did not exist. For example, they filled in many missing words, and in some cases they made excessive corrections, as there were more “extra words” errors in the MonoTrans Widgets output than in the Google Translate output (Table XII, Labeler 1).

A closer look at each individual error showed that many “incorrect sense” or “incorrect form” errors were indeed corrected by MonoTrans Widgets. A few examples are given in Table XIII.
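The per-category comparison in Table XII amounts to a cross-tabulation of the labels assigned to each error. A minimal sketch with pandas, using made-up labels rather than the study's annotations:

import pandas as pd

# Made-up error labels: one row per labeled error, with the system that
# produced the output and the assigned category.
labels = pd.DataFrame([
    {"system": "GT", "category": "incorrect sense"},
    {"system": "GT", "category": "missing words"},
    {"system": "MW", "category": "extra words"},
    {"system": "MW", "category": "incorrect sense"},
    {"system": "GT", "category": "incorrect form"},
])

# Counts per category and system, as in Table XII.
print(pd.crosstab(labels["category"], labels["system"]))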

8.5. A Selected Successful Translation Example

To further understand the translation process with MonoTrans Widgets, we present an example taken from the translation of the Spanish book33 into English by MonoTrans Widgets (Table XIV).

33What does the mouse want my tooth for?, Chévez and Chavarría, Fundación Libros para Niños, 2009.


Table XIV. A Selected Example of Successful Translation The original sentence and the initial machine translation are shown on the top. Each following row is a translation candidate generated by either source-side paraphrasing (“→”) or target-side editing (“←”), from earliest to latest. The shaded candidate is the one that received the most votes. It is also the final translation shown at the bottom.

The original sentence is “- afirmó julio.” It is a rather short sentence, but it is particularly interesting because it contained a typo (due to OCR errors): the person's name “Julio” was rendered as “julio,” which is also a legitimate Spanish word (“July”). This typo caused some confusion among the participants of crowdsourced monolingual translation, but it turned out to be a vivid example of the protocol's error-correcting power.

Not surprisingly, the initial machine translation was “- Said in July.” Although some source language speakers corrected the back-translation in ways that restored the correct meaning (rows #2 and #4), the target language speakers did not capture the correction (rows #5 to #11). It was not until later that some translation candidates with the correct meaning started to appear, but they were not voted up by enough people on both sides (rows #12, #17, and #20) to cause the process to terminate.

After a number of edits by the target language speakers, another source language speaker stepped in to provide another paraphrased source sentence (row #27), which might have triggered the new paraphrase on the target side in which the verb “explained” was first introduced (row #28). Finally, a translation with the verb “explained” was voted the best translation (row #30).34

This example illustrates how the interaction between the source language speakers and the target language speakers recovered the intended original meaning by producing paraphrases. Interestingly, it even corrected the typo in the original sentence.

8.6. A Random Example

This example was randomly selected and turned out to be an unsuccessful one. Unlike the first example, this example did not converge on a correct translation (see Table XV). The correct translation was in fact generated quite early (row #8), but because votes were concentrated on an even earlier candidate (row #5) that appeared to be completely fluent, the correct translation was not chosen as the final translation. This shows that the translation protocol may converge with undetectable errors, which is again consistent with the discussion in Hu et al. [2010].

9. SUMMARY OF FINDINGS

To explore supporting unskilled humans in crowdsourcing, this article presented crowdsourced monolingual translation, a new translation method that combines machine translation and crowds of people who speak only the source or the target language, but not both. In Section 1, we put forward three hypotheses:
(1) H1: Crowdsourced monolingual translation, a protocol supporting collaboration among monolingual people, performs better than machine translation and monolingual postediting in terms of translation quality.
(2) H2: The feedback loop improves translation quality.
(3) H3: Overall, using annotations during the translation process improves translation quality; each type of annotation also improves translation quality.

H1 was found to be true in Sections 7.2, 7.3, and 8.1. Section 8.1 showed that MonoTrans2 performed better than monolingual postediting when translating children's books: on a 5-point accuracy scale, there was a 0.30-point average improvement (p < 0.001). At the same time, the accuracy improvement obtained by monolingual postediting over machine translation was not significant.

Systems that implemented crowdsourced monolingual translation also performed better than machine translation alone. Section 7.2 showed that MonoTrans2 performed better than machine translation when translating children's books: with MonoTrans2, the percentage of high-quality sentences improved from 10% to 68%.35

In addition, the MonoTrans Widgets system was deployed on the ICDL website for 9 months. It translated 11 books into five languages. These books contain 3,600 words; had a translation service been hired at the minimum rate of $0.10 per word, translating the books into five common languages would have cost $1,800 (3,600 words × $0.10 per word × 5 languages). MonoTrans Widgets performed better than machine translation when translating children's books, improving the percentage of high-quality sentences from machine translation's 31% to 52% (Section 7.3).36

Studies did not confirm H2 with statistical significance, owing to the difficulty of recovering the feedback process from data collected with systems that used the asynchronous interaction model (Section 8.2).

34Nonetheless, the punctuation “-” was still not translated. 35The criterion was that a translation is high quality only when it is highly fluent and highly accurate. 36Same as previously, the criterion was that a translation is high quality only when it is highly fluent and highly accurate.

Table XV. A Random Example
The original sentence and the initial machine translation are shown at the top. Each following row is a translation candidate generated by either source-side paraphrasing (“→”) or target-side editing (“←”), from earliest to latest. The shaded candidate is the one that received the most votes. It is also the final translation shown at the bottom.

However, the preliminary analysis in Section 8.2 still showed that the feedback loop may improve translation quality if it forms a successful clarification dialogue.

Regarding H3, Section 8.3 confirmed that using annotations during the translation process improves translation quality. Among the annotation types analyzed, paraphrase and picture annotations improved translation quality. However, part of H3 was not confirmed: in particular, explanations with templates did not improve translation quality (Section 8.3). A possible explanation is that the predefined templates did not give monolingual participants a chance to add new clarifying content.

In addition to confirming the hypotheses, the analyses in Sections 8.2 and 8.4 also show that the protocol performs better when errors are detectable, and that its error-correcting ability decreases with nondetectable errors.

10. REFLECTIONS ON SYSTEM EFFECTIVENESS

As indicated by the experiments, crowdsourced monolingual translation improved machine translation quality and outperformed monolingual postediting. The feedback loop at the heart of the system is effective overall. Using annotations as a separate communication channel also seemed effective in the experiments discussed previously.

However, paraphrasing was much more frequently used than all other annotation types. We attribute this difference in usage partly to the user interface design. In MonoTrans Widgets, paraphrasing was the only annotation type provided. In both MonoTrans2 and MonoTrans, it takes longer to attach an image or use a predefined question template than it takes to rephrase parts of a sentence. It is also reasonable to assume that images for abstract concepts are not as easy to find as paraphrases are to produce. Paraphrasing is also more open-ended than using the predefined set of question/answer templates.

10.1. Applicability of System

Since crowdsourced monolingual translation relies on both machine translation and monolingual people's inference of the original meaning, it naturally performs better when either one improves.

If the initial machine translation is more accurate, it is easier for the target language speakers to recover the original meaning. For example, translating shorter text is easier than translating longer text, and translating descriptions of facts that match users' world knowledge is easier than translating sentences set in imaginative or figurative contexts; rescue requests, for instance, are easier to translate than fairy tales. On the other hand, if the machine translation output is of such high quality that it contains only minor errors, then the protocol is less efficient than simply performing monolingual postediting on the target side, since all translation errors could easily be corrected by target language speakers.

Anecdotally, translation works better for crowds with some exposure to the other language; even experience with a related language helps. During the Haitian Emergency Message experiment [Hu et al. 2011], some (monolingual) English speakers mentioned that their previous experience with French helped their understanding of English sentences translated from Haitian Creole, which is heavily influenced by French.

It is important to note that the goal of crowdsourced monolingual translation is not to replace professional translators, or even bilingual humans. As the previous experiments showed, convergence on an accurate translation can take many iterations and much time, and the translation quality (as indicated by accuracy and fluency scores) is usually not as high as that produced by bilingual humans.

However, crowdsourced monolingual translation is unique in its dependence only on monolingual human participants, and its value is not completely reflected in speed and quality comparisons. To reach a certain level of translation quality, crowdsourced monolingual translation does not need bilingual people, and at the same time its quality is better than that of machine translation and monolingual postediting. Instead of competing with professional translators or machine translation, we foresee crowdsourced monolingual translation coexisting with these translation technologies, occupying a significant but different portion of the solution space. Crowdsourced monolingual translation can be a reasonable alternative to translators when they are hard to find, when the translation material is easy, or when high translation accuracy is not necessary. It can also be used to generate preliminary translations to enhance professional translators' efficiency: assuming crowdsourced monolingual translation systems handle easy material relatively well, one could use crowdsourced monolingual translation to preprocess the source text, leaving only the difficult text for human translators. To test this hypothesis, we would suggest two kinds of experiments: (1) measuring translation quality by translation type and (2) evaluating the human effort needed to postedit crowdsourced monolingual translation output.

11. FUTURE WORK

There are many open questions and much room for improvement for the crowdsourced monolingual translation protocol. In this section, we discuss new methods to implement the protocol and the development of similar crowdsourcing systems in other domains.

11.1. Tighter Integration with Machine Translation

All the systems implemented used an off-the-shelf machine translation engine.37 The machine translation engine was convenient and could translate between many language pairs, but its output was limited to either the n-best list (the best n translation candidates given a source sentence) or the 1-best translation with word alignments. A more open machine translation engine that exposes more of its internal features would provide more ways to integrate machine translation into the protocol and save human effort. In future work, we plan to explore these ideas using the cdec framework [Dyer et al. 2010].

11.2. Extrinsic Motivators

In MonoTrans, MonoTrans2, and especially MonoTrans Widgets, great effort was taken to engage more casual users. However, the problem of engaging users may be solved independently and more directly by introducing extrinsic motivators. For example, online labor markets such as the Amazon Mechanical Turk could be used to recruit monolingual people, who are then directed to MonoTrans Widgets as workers.38

However, much care must be given to quality control with workers on Mechanical Turk. In our earlier experiment in which Mechanical Turk workers were recruited to give phrase-level paraphrases on the source side (“targeted paraphrases” [Buzek et al. 2010]), Mechanical Turk workers generated a considerably higher percentage of nonsense input than the ICDL users did. Therefore, a system that recruits online workers (e.g., Duolingo) may need extra verification steps, as suggested by Little et al. [2010].

11.3. New Designs for Crowdsourcing Systems

37The engine was Google Translate and its Research API. 38The Amazon Mechanical Turk is at http://mturk.com.

Although this article explored only translation, the motivating principle behind crowdsourced monolingual translation is that crowds can be linked together by a feedback loop, and that communication between the crowds can enable them to complete more complex tasks and to improve output quality. This general idea can readily be applied to domains other than translation. For example, in a system that employs sighted people to provide photo-to-text recognition for visually impaired people (such as VizWiz [Bigham et al. 2010]), simple feedback such as whether the text is well positioned in the photo can greatly improve the recognition results. In fact, in every crowdsourcing system in which there is a crowd of workers and a crowd of end-users, it is possible in principle to introduce such a feedback loop.

This new design also opens up an interesting opportunity for crowdsourcing systems: tasks are no longer dependent on a single worker crowd with a particular skill. On the other hand, it also poses new problems, such as coordinating between multiple crowds. To this end, MonoTrans Widgets represents an attempt to solve these problems through task prioritization based on crowd size [Hu et al. 2012].

12. CONCLUSIONS

This article presented crowdsourced monolingual translation, a translation protocol based on computationally supported collaboration between two crowds of monolingual people, each crowd speaking only the source or only the target language. With the three systems, we evaluated the protocol's overall translation quality, compared it with postediting, analyzed its components (i.e., the feedback loop and the annotations), and analyzed its translation errors. Our findings showed that the protocol is effective and that its improvement over monolingual postediting is statistically significant. Although there is still much to explore on the path to widespread application, such as refining the types of annotations in the feedback loop, integrating the protocol more tightly with machine translation, and motivating casual participants, crowdsourced monolingual translation suggests a new crowdsourcing approach for which skilled workers are not needed.

ACKNOWLEDGMENTS

This research was supported by NSF contract #BCS0941455 and by a Google Research Award. We also appreciate the contributions to various portions of this project by Olivia Buzek, Alexander J. Quinn, and Yakov Kronrod.

REFERENCES

J. Allen. 2003. Post-editing. In Harold Somers (Ed.), Benjamins Translation Library, Vol. 35. Amsterdam: John Benjamins, 297–317.
M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich. 2010. Soylent: A word processor with a crowd inside. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST'10). ACM, 313–322.
J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. 2010. VizWiz: Nearly real-time answers to visual questions. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST'10). ACM, 333–342. DOI: http://dx.doi.org/10.1145/1866029.1866080
O. Buzek, P. Resnik, and B. B. Bederson. 2010. Error driven paraphrase annotation using Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.
C. Callison-Burch. 2005. Linear B system description for the 2005 NIST MT evaluation exercise. Available at http://www.cs.jhu.edu/~ccb/publications/linear-b-system-description-for-nist-mt-eval-2005.pdf.
H. H. Clark and C. R. Marshall. 1981. Definite reference and mutual knowledge. In A. K. Joshi, B. Webber, and I. Sag (Eds.), Elements of Discourse Understanding. Cambridge: Cambridge University Press, 10–63.
M. Dabbadie, A. Hartley, M. King, K. J. Miller, W. M. El Hadi, A. Popescu-Belis, F. Reeder, and M. Vanni. 2002. A hands-on study of the reliability and coherence of evaluation metrics. In Proceedings of the 2002 Language Resources and Evaluation Conference (LREC'02). Available at http://mt-archive.info/LREC-2002-Dabbadie-2.pdf.


D. Danet and S. C. Herring (Eds.). 2007. The Multilingual Internet: Language, Culture, and Communication Online. Oxford University Press, New York.
C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 7–12.
J. Edwards. 1994. Multilingualism. Routledge.
K. Everitt, C. Lim, O. Etzioni, J. Pool, S. Colowick, and S. Soderland. 2010. Evaluating lemmatic communication. Journal of Translation and Technical Communication Research 3 (2010), 70–84.
M. Gabsdil. 2003. Clarification in spoken dialogue systems. In Proceedings of the 2003 AAAI Spring Symposium Workshop on Natural Language Generation in Spoken and Written Dialogue. Available at https://www.aaai.org/Papers/Symposia/Spring/2003/SS-03-06/SS03-06-006.pdf.
S. Hampshire and C. Porta Salvia. 2010. Translation and the Internet: Evaluating the quality of free online machine translators. Quaderns: Revista de traducció 17 (2010), 197–209.
C. Hu, B. B. Bederson, and P. Resnik. 2010. Translation by iterative collaboration between monolingual users. In Proceedings of Graphics Interface 2010. Canadian Information Processing Society, Ottawa, Ontario, Canada, 39–46.
C. Hu, B. B. Bederson, P. Resnik, and Y. Kronrod. 2011. MonoTrans2: A new human computation system to support monolingual translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1133–1136. DOI: http://dx.doi.org/10.1145/1978942.1979111
C. Hu, P. Resnik, Y. Kronrod, and B. Bederson. 2012. Deploying MonoTrans widgets in the wild. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems (CHI'12). ACM, New York, NY, 2935. DOI: http://dx.doi.org/10.1145/2207676.2208700
C. Hu, P. Resnik, Y. Kronrod, V. Eidelman, O. Buzek, and B. B. Bederson. 2011. The value of monolingual crowdsourcing in a real-world translation scenario: Simulation using Haitian Creole emergency SMS messages. Association for Computational Linguistics, 399–404. Available at http://www.aclweb.org/anthology/W11-2148.
J. Hutchins. 2005. Current commercial machine translation systems and computer-based translation tools: System types and their uses. International Journal of Translation 17, 1–2 (2005), 5–38.
R. Hwa, P. Resnik, A. Weinberg, and O. Kolak. 2002. Evaluating translational correspondence using annotation projection. In Proceedings of the Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, PA, 392–399.
T. Ishida. 2006. Language Grid: An infrastructure for intercultural collaboration. In Proceedings of the International Symposium on Applications and the Internet (SAINT'06). 96–100. DOI: http://dx.doi.org/10.1109/SAINT.2006.40
M. Kay. 1997. The proper place of men and machines in language translation. Machine Translation 12, 1/2 (1997), 3–23.
A. Kittur. 2010. Crowdsourcing, collaboration and creativity. XRDS: Crossroads, The ACM Magazine for Students 17, 2 (2010), 22–26.
P. Koehn. 2010. Enabling monolingual translators: Post-editing vs. options. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL. 537–545.
W. Lewis. 2010. Haitian Creole: How to build and ship an MT engine from scratch in 4 days, 17 hours, & 30 minutes. In EAMT 2010: Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT'10). Available at http://research.microsoft.com/pubs/145627/eamt-05.pdf.
G. Little, L. B. Chilton, M. Goldman, and R. C. Miller. 2010. Exploring iterative and parallel human computation processes. In Proceedings of the ACM CHI 2010 Conference on Human Factors in Computing Systems. ACM, 4309–4314. DOI: http://dx.doi.org/10.1145/1837885.1837907
A. Lopez. 2008. Statistical machine translation. ACM Computing Surveys 40, 3 (2008), 1–49. DOI: http://dx.doi.org/10.1145/1380584.1380586
D. Morita and T. Ishida. 2009. Collaborative translation by monolinguals with machine translators. In Proceedings of the 14th International Conference on Intelligent User Interfaces. ACM, 361–366. DOI: http://dx.doi.org/10.1145/1502650.1502701
D. Morita and T. Ishida. 2011. Collaborative translation protocols. In The Language Grid, T. Ishida (Ed.). Springer, Berlin, 215–230. DOI: http://dx.doi.org/10.1007/978-3-642-21178-2_14
N. S. Nise. 1995. Control Systems Engineering (2nd ed.). Addison-Wesley, 11–13.
D. W. Oard. 2003. The surprise language exercises. ACM Transactions on Asian Language Information Processing 2, 2 (2003), 79–84. DOI: http://dx.doi.org/10.1145/974740.974741


A. J. Quinn and B. B. Bederson. 2009. A Taxonomy of Distributed Human Computation. Human-Computer Interaction Lab Technical Report, University of Maryland. Available at http://hcil.cs.umd.edu/trs/2009-23/2009-23.pdf.
C. E. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal 27, October (1948), 623–656.
C. E. Shannon. 1951. Prediction and entropy of printed English. Bell System Technical Journal 30, 1 (1951), 50–64.
C. E. Shannon. 1950. The redundancy of English. In Proceedings of the 7th Conference on Cybernetics, Circular Causal and Feedback Mechanisms in Biological and Social Systems. 248–272.
M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas. 223–231.
S. Soderland, C. Lim, Mausam, B. Qin, O. Etzioni, and J. Pool. 2009. Lemmatic machine translation. In Proceedings of the 12th Machine Translation Summit (MT Summit XII). 128–135.
J. Surowiecki. 2005. The Wisdom of Crowds. Anchor.
D. Vilar, J. Xu, L. F. D'Haro, and H. Ney. 2006. Error analysis of statistical machine translation output. In Proceedings of the Language Resources and Evaluation Conference (LREC'06). 697–702.
L. Von Ahn and L. Dabbish. 2004. Labeling images with a computer game. In Proceedings of the 2004 Conference on Human Factors in Computing Systems (CHI'04). 319–326. DOI: http://dx.doi.org/10.1145/985692.985733
D. Yarowsky, G. Ngai, and R. Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the 1st International Conference on Human Language Technology Research. Association for Computational Linguistics, 1–8.
O. Zaidan and C. Callison-Burch. 2011. Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1220–1229.

Received April 2013; revised January 2014; accepted March 2014
