ASSESSING THE COMPREHENSIBILITY AND PERCEPTION OF MACHINE TRANSLATIONS

A PILOT STUDY

Word count: 16,804

Iris Ghyselen
Student number: 01400320

Supervisor: Prof. dr. Lieve Macken

Master's dissertation submitted to obtain the degree of Master in Translation, in the field of Applied Linguistics

Academic year: 2017 – 2018


DECLARATION REGARDING COPYRIGHT

The author and the supervisor(s) give permission to make this study available for consultation for personal use. Any other use is subject to the restrictions of copyright law, in particular with regard to the obligation to explicitly cite the source when quoting data from this study.


ACKNOWLEDGMENTS

There are several people who deserve my profound gratitude for their help and support while I was writing this dissertation.

In the first place I would like to give thanks to all the respondents who filled in my questionnaire. That small act of kindness meant a great deal to me. The questionnaire required some time and attention to fill in and was distributed in a period when people were being overwhelmed on social media with all kinds of questionnaires for other theses. Therefore, I am very grateful to each and every one of them who helped me.

Secondly, I would like to express my deep sense of gratitude to my supervisor, Prof. Dr. Lieve Macken, who has helped me with everything I could ask for and who far exceeded my expectations of any supervisor. This dissertation could not have been completed without her support.

I want to thank Céline Van De Walle as well, for taking the time to help me with the language, style and structure of my text by providing some very useful feedback.

My parents’ support has also been instrumental for me while writing this dissertation. They were prepared to listen to me at all times and helped me to find respondents. Moreover, their love and support have helped me throughout my entire education and I am very grateful for the chance to pursue my studies and by extension my dreams.

Fourthly, I would like to extend these thanks to my entire family for supporting me in every way that they could. From visiting me during my Erasmus exchange to sending me good luck cards and texts during stressful exam periods, they have really been there for me.

Lastly, I want to thank my friends for being there whenever I needed them. The library sessions with them made writing this dissertation so much more fun.


ABSTRACT

This dissertation presents the results of reading comprehension tests and perception questions on both human-translated and raw (unedited) machine-translated texts. These translations are based on three source texts from the Machine Translation Evaluation version (CREG-MT-eval) of the Corpus of Reading Comprehension Exercises in German (CREG). The human translations were produced by the author of this dissertation; the neural machine translation engines used are DeepL and Google Translate. The experiment was conducted via a SurveyMonkey questionnaire, which 99 respondents filled in. The questionnaire contained five reading comprehension questions as well as five perception questions. The translations were shown to respondents before they answered the questions, but not while they were answering them. The results show that respondents can tell whether a translation is a human or a machine translation and that the human translations receive the best clarity scores. The mistakes that bother readers most have to do with grammar, sentence length, level of idiomaticity and incoherence. Comprehension is rated best for human translations when respondents are asked directly, but the comprehension questions show that the human translation only performs best for one text, with DeepL scoring better for the other two. As for the machine translations, there is no definite answer as to which machine translation tool performs better.


TABLE OF CONTENTS

1 Introduction ...... 1
2 Literature study ...... 4
2.1 Comprehensibility ...... 4
2.2 Quality ...... 5
2.2.1 General text quality ...... 5
2.2.2 Quality of translated texts ...... 6
2.2.3 Quality evaluation ...... 8
2.2.3.1 Human evaluation ...... 8
A. Scoring ...... 10
B. Reading comprehension ...... 10
2.2.3.2 Automatic evaluation ...... 11
2.3 Error typology ...... 11
2.4 MT approaches ...... 13
2.4.1 RBMT ...... 13
2.4.2 SMT ...... 14
2.4.3 NMT ...... 14
2.4.3.1 Google Translate ...... 14
2.4.3.2 DeepL ...... 15
2.4.3.3 Typical errors of NMT ...... 15
3 Methodology ...... 17
4 Applied error typology ...... 22
5 Results ...... 25
5.1 General results ...... 25
5.2 Text-specific questions ...... 27
5.2.1 Human or machine translation ...... 27
5.2.1.1 Human translation ...... 28
A. HT labelled as MT ...... 28
B. HT labelled as HT ...... 29
5.2.1.2 Google Translate ...... 31
A. GT labelled as HT ...... 31
B. GT labelled as MT ...... 32
5.2.1.3 DeepL ...... 34
A. DL labelled as HT ...... 34


B. DL labelled as MT ...... 36
5.2.1.4 Summary ...... 38
5.2.2 Clarity score ...... 39
5.2.3 Comprehension ...... 42
5.2.4 Notable mistakes ...... 48
5.3 General results for comprehension questions ...... 53
5.4 Comprehension questions text 1 ...... 55
5.5 Comprehension questions text 2 ...... 57
5.6 Comprehension questions text 3 ...... 59
5.7 Linguists versus non-linguists ...... 61
6 Conclusion and discussion ...... 63
Bibliography ...... 65
Appendix ...... 69
Appendix I: Instructions questionnaire ...... 69
Appendix II: Translations ...... 69
Appendix III: Questions questionnaire ...... 74
Appendix IV: Comprehensive discussion of comprehension questions text 1 ...... 75
Appendix V: Applied error typology ...... 79


LIST OF ABBREVIATIONS

Abbreviation   In full
MT             Machine translation
HT             Human translation
GT             Google Translate
DL             DeepL
T1_HT          Text 1 human translation
T1_GT          Text 1 Google Translate translation
T1_DL          Text 1 DeepL translation
T2_HT          Text 2 human translation
T2_GT          Text 2 Google Translate translation
T2_DL          Text 2 DeepL translation
T3_HT          Text 3 human translation
T3_GT          Text 3 Google Translate translation
T3_DL          Text 3 DeepL translation
Q1             Question 1
Q2             Question 2
Q3             Question 3
Q4             Question 4
Q5             Question 5
L              Linguists
NL             Non-linguists

LIST OF GRAPHS

Graph 1: Reasons why respondents labelled HT incorrectly as MT ...... 29
Graph 2: Reasons why respondents labelled HT correctly as HT ...... 30
Graph 3: Reasons why respondents labelled GT incorrectly as HT ...... 31
Graph 4: Reasons why respondents labelled GT correctly as MT ...... 33
Graph 5: Reasons why respondents labelled DL incorrectly as HT ...... 35
Graph 6: Reasons why respondents labelled DL correctly as MT ...... 37
Graph 7: Clarity scores text 1 ...... 39
Graph 8: Clarity scores text 2 ...... 40
Graph 9: Clarity scores text 3 ...... 41
Graph 10: Comprehension text 1 ...... 42
Graph 11: Comprehension text 2 ...... 44
Graph 12: Comprehension text 3 ...... 46

LIST OF FIGURES

Figure 1: “Translation quality cline in terms of human to machine translation including crowdsourcing” Jiménez-Crespo (2017) ...... 7
Figure 2: Error categorisation acceptability (Daems & Macken, 2013) ...... 12


Figure 3: Error categorisation adequacy (Daems & Macken, 2013) ...... 13
Figure 4: Print screen of the Google Spreadsheet setup. (The email addresses are replaced to ensure the respondents’ anonymity.) ...... 19
Figure 5: Screenshot of Google Translate: harmonica (19/06/2018) ...... 24
Figure 6: Screenshot of Google Translate: guitar (19/06/2018) ...... 24
Figure 7: Screenshot of DeepL: harmonica (19/06/2018) ...... 24
Figure 8: Screenshot of DeepL: guitar (19/06/2018) ...... 24

LIST OF TABLES

Table 1: Overview of the different experiments used in the questionnaire ...... 20
Table 2: Error typology mistakes ...... 22
Table 3: Number of mistakes per translation method per subcategory ...... 23
Table 4: Questionnaire data ...... 25
Table 5: Number of respondents per text ...... 26
Table 6: Answer of respondents when asked if the text is a human or a machine translation ...... 28
Table 7: Average clarity score linked to the number of different mistakes (error typology) ...... 42
Table 8: Average sentence length for incomprehensible sentences ...... 48
Table 9: Average comprehension score per text and translation method ...... 54
Table 10: Comprehension questions text 1 ...... 57
Table 11: Comprehension questions text 2 ...... 59
Table 12: Comprehension questions text 3 ...... 61
Table 13: Linguists' and non-linguists' responses ...... 62
Table 14: Correct answers T1 Atlas Q1 ...... 75
Table 15: Correct answers T1 Atlas Q2 ...... 76
Table 16: Correct answers T1 Atlas Q3 ...... 77
Table 17: Correct answers T1 Atlas Q4 ...... 77
Table 18: Correct answers T1 Atlas Q5 ...... 78
Table 19: Correct answers to first question T1 Atlas Q5 ...... 78
Table 20: Correct answers to second question T1 Atlas Q5 ...... 79
Table 21: Error typology T1_GT ...... 80
Table 22: Error typology T1_DL ...... 81
Table 23: Error typology T2_GT ...... 82
Table 24: Error typology T2_DL ...... 83
Table 25: Error typology T3_GT ...... 84
Table 26: Error typology T3_DL ...... 86



1 INTRODUCTION

The society we are currently living in is characterised by ever-increasing globalisation, which has repercussions on several aspects of our lives. Cultures are being brought closer to each other, businesses are thriving in many different countries and branches are being opened halfway across the world. This means that there is an increase in communication between people who do not share the same mother tongue or the same cultural values.

One way to bridge this language gap is to employ language specialists, who provide translation, interpreting and localisation services. According to an article published by Common Sense Advisory (CSA) (DePalma, Pielmeier, Lommel, & Stewart, 2017), the language industry is worth 43 billion dollars worldwide. Another report by CSA (Henderson, 2016) demonstrates that the worldwide annual growth of the translation industry amounts to 5.52%. These figures show the rising economic importance of translation, interpreting and localisation. As globalisation continues, these figures will undoubtedly increase year after year.

As translations gain importance, machine translation tools (MT tools) are used more frequently, especially given the quality improvements associated with Neural Machine Translation (NMT). These days, both professional translators and general audiences rely on the output of sites such as Google Translate and DeepL, although the two groups have different goals. Forcada (2017) states that MT is chiefly used for two purposes: on the one hand “dissemination”, where professional translators post-edit the MT output and use it as the basis for their final translation, and on the other hand “assimilation” or “gisting”, where anyone can use MT to grasp the basic meaning of a text in another language. In this dissertation we will focus on dissemination, as we want to see how machine translations are perceived compared to human translations as a full-fledged substitute.

The topic of MT is becoming more important in society: MT has considerable economic relevance, interest in using it is growing, as is its actual use, and its quality keeps increasing. That is why this dissertation focuses on the comprehensibility and perception of MT. Among other things, we will investigate what hinders readers' ability to understand the text and how the quality of MT compares with that of human translations.

The introduction of MT in the language industry has had several consequences. MT creates new tasks, such as pre-editing and post-editing, and expert translators are needed to develop, evaluate and improve MT (Garcia, 2015). Still, not everyone is ready to accept the use of MT. A survey from 2011 (Piróth) showed that several professional translators refused to work with MT systems, claiming that their output was more often than not too low in quality and that using it as a basis would imply a lot more work than simply translating a text without the help of MT. We should, however, bear in mind that this survey dates back to the period before NMT and that MT has improved a lot since.

The low quality these translators complained about is often caused by the following disadvantages of MT output (Doherty & Gaspari, 2013): the lack of real-world knowledge and the inability of MT tools to look beyond the context of a single sentence when translating. The latter causes coherence problems or inconsistency in the translation of certain words. Doherty & Gaspari (2013) also point out that post-editors often need to check and adapt punctuation and capitalisation. However, as mentioned before, the quality of MT output has increased greatly over the past few years.

This is exactly why it is important to keep studying these improvements. Scarton & Specia (2016), for example, studied the comprehensibility of several Statistical Machine Translation (SMT) and Rule-based Machine Translation (RBMT) systems. However, not many studies have been conducted yet on NMT because of its novelty. With this dissertation, we want to bridge this gap.

Some of the most important and best-known MT sites are Google Translate, DeepL, Bing and Systranet. DeepL in particular has been gaining a lot of attention lately. The NMT site has been online since August 2017 and achieves very high-quality translations compared to other automatic translation sites. DeepL is quite a recent tool, which is why it is currently the subject of much formal and informal discussion, for example in groups such as GentVertaalt, where professional translators come together.

In the following sections, we will first elaborate on related research in our literature study (section 2), where we will discuss the term ‘comprehensibility’. The section will deal with text and translation quality and several quality evaluation methods. Following this, we will discuss an error typology and give an overview of the different types of MT. In section 3 we will present the methodology employed in this dissertation and argue why we chose the particular setup of the questionnaire. The section is completed by our research questions. In section 4 we apply the error typology to the texts used in this dissertation.


Section 5 deals with the results of the research, and the conclusion and discussion can be found in section 6.


2 LITERATURE STUDY

Before studying the quality and different evaluation methods of texts and MT types, a note on terminology is in order.

2.1 Comprehensibility

The term ‘comprehensible’ is defined by Merriam-Webster as “capable of being comprehended, intelligible”1. The Oxford English Dictionary defines ‘comprehensibility’ as the “quality of being comprehensible” (‘Home’, n.d.). On Synonyms.net we find that ‘comprehensibility’ is a synonym of ‘understandability’, with the definition of both being “the quality of comprehensible language or thought” (‘Synonyms.net’, n.d.). When we then look up ‘understandability’ in the Oxford English Dictionary, we find a small thesaurus entry with a sentence that links ‘understandability’ to “lack of ambiguity”. From this link we can deduce that a comprehensible text is at least characterised by a lack of ambiguity.

‘Comprehensibility’ is not entirely the same as ‘readability’, although the two are sometimes mentioned together. For example, Tan, Gabrilovich, & Pang (2012) use ‘comprehensibility’ to refer to one sense of ‘readability’ that is used in the literature (“the degree of difficulty of text as judged by average sentence length and vocabulary size”, p. 233). Merriam-Webster defines ‘readable’ as “able to be read easily: such as a) legible b) interesting to read” and the Oxford English Dictionary describes ‘readability’ as “The ease with which a text may be scanned or read; the quality in a book, etc., of being easy to understand and enjoyable to read”. These definitions seem to imply some form of subjective value judgement, which does not show up in the definitions of ‘comprehensibility’.

A study by Lesgold, Roth, & Curtis (1979) found that sentences in a text take more time to understand if the information that the sentence refers to is no longer actively in mind when the sentence is read. One of the characteristics of a comprehensible text would therefore be that it is sufficiently coherent. Schwarz & Flammer (1981) concluded that if a text is coherent, the reader should be able to “construct a sense of the total text” (p. 65). They also found that comprehensibility increases if there is enough time to process the text and that, if the reader understands even the smallest part of a text, this may be enough to convince them that the text is not entirely incomprehensible. Moreover, they remark that some texts may require prior knowledge and are rendered incomprehensible if that knowledge is absent.

1 (‘Dictionary by Merriam-Webster: America’s most-trusted online dictionary’, n.d.)


Tan et al. (2012) agree on this last point and use the term “user-specific” to refer to these texts. They add that the comprehensibility of texts is also “topic-specific”: the same people might sometimes comprehend very difficult texts, but at other times fail to understand a rather easy or average text. This all depends on the topic of the texts and the reader’s interests, educational background or profession.

Göpferich (2009) combines the Hamburg comprehensibility concept (Langer, Schulz von Thun & Tausch, 1993) and the comprehensibility concept from Heidelberg (Groeben, 1982) and arrives at six dimensions that define the comprehensibility of a text, two of which she coined herself. This categorisation is known as the Karlsruhe comprehensibility concept. The first dimension is ‘concision’ and relates to a maximum economy of words. This is followed by ‘correctness’. Next is ‘motivation’, which implies that a text must attract and retain the reader’s attention. The fourth dimension is ‘structure’ and operates at both global and local text level. ‘Simplicity’ is another dimension and concerns lexis and syntax; this category mostly refers to lexical simplicity, grammatical simplicity, ambiguity and consistency. The last dimension is ‘perceptibility’ and determines whether layout, fonts, etc. are easily processed.

Comprehensibility can be tested with several methods, including ranking, readability formulas, comprehension questions, cloze tests and Likert scores. More on evaluation will follow in section 2.2.3.

2.2 Quality

2.2.1 General text quality

Comprehensibility is also an important factor in general text quality. This can be linked to the CCC model for text evaluation by Renkema (2001). This model claims that when the three C’s are met, a text has a sufficient level of quality2. The three C’s stand for correspondence, consistency and correctness. The first C, correspondence, means that in order for a text to be of good quality, the goal of the text needs to be achieved and the reader’s needs should be met. The second C is consistency, which means that the writer should maintain uniformity, and the third is correctness, which translates into a mistake-free text. Based on this model, comprehensibility might be classified under the first C: after all, the goal of a text cannot be achieved if the reader does not comprehend it.

2 ‘dossier taalverzorging renkema ccc-model | Genootschap Onze Taal’, 2015

Text readability can be determined by employing statistical methods to calculate readability indices (Tan et al., 2012). Examples of these indices or readability formulas are the Flesch-Kincaid Index and the Gunning FOG Index. The indices take, among other things, word length and sentence length into account to measure the complexity of a text. The formulas are quick and cheap to apply and are therefore used frequently by organisations (Schriver, 1989). However, as will be discussed in section 2.2.3.1, they do not take into account everything that makes a text comprehensible.
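To make the idea of such formulas concrete, the sketch below computes the Flesch-Kincaid grade level for a short text. The regular-expression tokenisation and the vowel-group syllable counter are simplifying assumptions of ours; only the coefficients in the final line are the standard Flesch-Kincaid values.

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels (an assumption, not an exact syllable count).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Split into sentences and words with simple regexes (a simplification).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula:
    # 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

print(flesch_kincaid_grade("We use paper everywhere. Recycled paper saves energy."))
```

As the formula shows, only sentence length and word length (via syllables) enter into the score, which is exactly the limitation discussed above.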

Another way to measure human readability can be found in the paper by Jones et al. (2005), where the researchers tested the readability by (1) studying the accuracy of respondents when answering content questions about the text, (2) timing the respondents’ answers and (3) asking the respondents to score the text. They found that texts with errors slow respondents down and yield worse results for the comprehension questions.

2.2.2 Quality of translated texts

General text quality is inherently linked with the quality of translated texts, since translations need to have good general text quality and need to be transferred well into the other language. This means that translations need to be comprehensible and fluent as well as adequate. Fluency refers to the well-formedness of the translation itself and adequacy to how well the meaning of the source text is preserved in the target text (Snover, Madnani, Dorr, & Schwartz, 2009).

The quality of texts is one of the most controversial topics in Translation Studies. Jiménez-Crespo (2017) writes that the language industry “has long ago acknowledged and accepted that it is impossible to provide top quality in all situations due to economic, resource and time restrictions” (p. 478). He goes on to say (as cited in Drugan, 2013, p. 180) that sometimes it is better to have a translation, even if it is of low quality, than none at all. The quality that is deemed sufficient depends on many factors. Jiménez-Crespo mentions the end users and their needs, the context, the sentimental value, geographic locations and of course the time available and the price. He gives the example of humanitarian crises, in which life or death can depend on information being available in several languages and in which not much time is available. In these situations, money may not be at hand either, whether temporarily or permanently, which is why unpaid crowdsourcing has been used in the past (as cited in Munro, 2013). The preceding idea is formulated as follows by Göroj (2014b, p. 389):


“The only way to offer large amounts of information and goods in multiple languages fast while staying within reasonable budgets is by making a compromise and provide content with different levels of quality using new translation channels and translation technology.”

Text type, along with the purpose of a text, is one of the most important factors for text quality. Jiménez-Crespo (2017) applies this general notion of text quality to the quality of translations and presents this in a translation quality triangle (see figure 1). He lists several types of texts and categorises them according to the quality level of translation that they require (high, medium or low). He then determines how the text should be translated to obtain the quality required for the different text types.

Figure 1: “Translation quality cline in terms of human to machine translation including crowdsourcing” Jiménez-Crespo (2017).

The shift from expecting top quality in all translations to the acceptance that quality is subject to text type, purpose, time and money has brought about an evolution in the quality evaluation of translated texts. According to Jiménez-Crespo (2017), these evaluations now take into account which content is presented in the text and how the translation was created, namely whether it was made by a professional translator, post-edited from MT output or obtained via crowdsourcing. The importance of the content type is exemplified further (as cited in O’Brien, 2012) by questioning the usefulness of high quality when translating internal e-mails or tweets from famous people.


This change in the expected and required quality of translations is also reflected in regulations. Jiménez-Crespo (2017) refers to the ISO 9000 quality standard3 and to EN 15038. The first is a standard to ensure the quality of businesses in general and the second does the same, but applies specifically to the translation business. He also gives examples of new translation quality models, such as the Dynamic Framework of the Translation Automation User Society (TAUS) (as cited in Göroj, 2014b; O’Brien, 2012) and the Multidimensional Quality Metrics (MQM) framework (as cited in Lommel et al., 2014). The latter will be further discussed in section 2.3.

We have now discussed quality expectations for all kinds of translations: professional human translations, crowdsourcing and machine translations. Concerning the quality of machine-translated texts specifically, Bentivogli et al. (2016) and Toral and Sánchez-Cartagena (2017) (as cited in Van Brussel, Tezcan, & Macken, 2018) write that translation quality declines as sentence length increases and that this is a particular disadvantage of NMT approaches. However, Van Brussel et al. (2018) found no such effect in their paper.

2.2.3 Quality evaluation

There are several methods to determine the quality of a text. A first distinction between these methods is whether the evaluation is carried out by a human or automatically. If a human carries out the evaluation, further distinctions can be made, as will be discussed in section 2.2.3.1. Within these distinctions and within automatic evaluation there are several ways to evaluate translation quality. We will discuss the options for human evaluation in the following sections, along with their advantages and disadvantages, and give a brief overview of automatic evaluation and readability evaluation.

2.2.3.1 Human evaluation

Human evaluation can generally be performed by two different groups of people. The evaluators are either language experts or a regular crowd. De Clercq et al. (2014) found that, for their evaluation method, a crowdsourcing setup yielded results similar to having experts evaluate the same texts. This fits the belief of the language industry that everyone involved in translating should be able to decide when the quality of a text is good enough for the purpose of that text, even though academics share the opinion that a theoretical basis is needed to assess quality (Jiménez-Crespo, 2017).

3 The author refers to the ISO 900 quality, but seeing as this is nowhere to be found, we assumed here that the text contained a typo.

A distinction between different human evaluation methods to determine text comprehensibility can be made based on the focus of the method (Schriver, 1989; as cited in Göpferich, 2009). There are three categories: “text-focused methods”, “expert-judgement-focused methods” and “reader-focused methods”. Göpferich (2009) goes on to give an example for every method. She links the text-focused method to readability formulas, the expert-judgement-focused method to the Karlsruhe comprehensibility concept (mentioned earlier in section 2.1) and the reader-focused (or target-group-focused) method to usability testing for instructive texts and to having a crowd answer questions on texts to determine their comprehensibility.

In order to choose a method based on these categories, researchers need to take a closer look at the advantages and disadvantages of the approaches. According to Göpferich (2009), readability formulas are easily and quickly applied, but they only consider certain aspects of lexicon, syntax and style. This means that they do not reflect all the reasons why a text is comprehensible or not. On the subject of the Karlsruhe comprehensibility concept, she adds that it is a good instrument for improving non-instructive texts with obvious comprehensibility problems, but that empirical research involving readers will provide better results. She is of the opinion that reader-focused or target-group-focused methods yield the most reliable results, because it is ultimately the readers themselves who determine whether a text is comprehensible. However, these methods have one big disadvantage: conducting such investigations requires a lot of time.

Within the different categories of human quality evaluation, there are some methods which make use of labels. These labels traditionally focus on lexicon, syntax and style, but new methods nowadays use different types of labels, depending on the purpose. These new models are often the result of changes in translation practices and the increasing use of MT. Jiménez-Crespo (2017) lists some of these quality evaluations with non-traditional labels. TAUS (the Translation Automation User Society) introduces two levels for post-editing MT: good enough and publishable quality. He also refers to a study by Gouadec (2010) with the following labels for delivery or broadcast in the Canadian government: rough cut, fit for delivery, fit for broadcast and fit for revision, and to Allen (2003) with the labels no post-editing, minor post-editing intended for gisting purposes and full post-editing. The label ‘minor post-editing intended for gisting purposes’ refers to the term “assimilation” (Forcada, 2017), which was mentioned earlier in the introduction. The label ‘full post-editing’ would then refer to “dissemination”.

We will now give a brief summary of the human quality evaluation methods used later on in this dissertation.

A. Scoring

Scoring is, together with ranking, the most commonly used human evaluation method (Macken, L., 2017). This method relies on human evaluators who are asked to give Likert scores to machine-translated texts, usually on a scale from 1 to 5. The score can give an indication of several features, such as fluency and adequacy (Sun, 2010; as cited in Moerman, 2017). The feature used in this dissertation is comprehensibility. A disadvantage of scoring is that it may lead to biased results (Vilar et al., 2007; as cited in Moerman, 2017).

B. Reading comprehension

In their paper, Scarton & Specia (2016) hypothesize that if respondents can correctly answer reading comprehension questions about a translated text without having read or having access to the source text, the text is a good translation; if not, it is a bad translation. Reading comprehension is therefore tested by asking questions on the content of a translation to determine its quality. They refer to a study by Jones et al. (2005a) which showed that respondents answer more questions correctly if they have read a human translation rather than a machine translation. However, in their own study Scarton & Specia found that respondents did not perform better on the human-translated documents.

The reading comprehension questions used in the paper by Scarton & Specia (2016) are mostly open questions. These questions were classified in the following classes:

- Question forms
  o Yes/no questions
  o Alternative questions
  o True/false questions
  o Wh-questions
- Comprehension types:
  o Literal questions
  o Reorganisation questions
  o Inference questions

For further explanation of the terms, we refer you to the paper concerned.


Göpferich (2009) remarks that target-group-focused empirical methods using questions that probe the comprehensibility of translated texts, such as cloze procedures, measure either whether one aspect of the text is comprehensible or whether the entire text is comprehensible, but the latter only in an overly superficial and insufficient manner. Thus, she claims that it is impossible to fully determine both using these methods. Moreover, she emphasizes that comprehensibility is not the same as retainability.

2.2.3.2 Automatic evaluation

A less time-consuming evaluation method is the use of automatic metrics. This is mostly done in the development phase of MT systems to compare, among other things, different versions of a system (Macken, L., 2017). In this case, automatic metrics are preferred in order to cut down on costs. Automatic metrics test how much the machine translation resembles one or several human reference translations by using formulas to calculate several variables. However, all of these automatic evaluation metrics base their judgements on quality in general, not specifically on comprehensibility.

Since this dissertation uses human evaluation, we will only mention a few examples of the most common automatic metrics: precision and recall, WER (Word Error Rate), BLEU (Papineni et al., 2002), TER (Translation Error Rate), HTER (Human-targeted Translation Error Rate) and METEOR (Banerjee & Lavie, 2005).
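To illustrate how such metrics compare an MT output with a reference translation, the sketch below computes WER as the word-level edit distance between hypothesis and reference divided by the reference length. The example sentences are invented and the implementation is a minimal illustration of ours, not the tooling actually used in MT development.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("de stad is de koploper", "de stad is koploper"))  # 0.2
```

Metrics such as BLEU and METEOR follow the same basic principle of comparing the hypothesis against one or more references, but use n-gram overlap rather than edit distance.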

2.3 Error typology

O’Brien defines error typology as an evaluation “by a qualified linguist who flags errors, applies penalties and establishes whether the content meets a pass threshold” (2012, p. 66). Error typologies are sometimes preferable to holistic human evaluation, in the sense that by analysing errors, researchers can say how translations differ and what the typical problems of an MT system are, and they can give an objective score to translations (Macken, L., 2017). These scores are calculated based on the weight that evaluators attribute to certain errors. These weights range from potential or minor problems (where the text can still be understood), over major problems (where comprehensibility is hindered, but the text is still considered usable), to critical problems (where the text is translated so inaccurately or is so hard to understand that it cannot be used) (Daems & Macken, 2013; Lommel, 2015).
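As a minimal illustration of such weighted scoring, the sketch below converts annotated error severities into a penalty per 1,000 words and checks it against a pass threshold. The weights and the threshold are assumed values of ours, not those prescribed by MQM or by Daems & Macken (2013).

```python
# Hypothetical penalty weights per severity level (assumed values for illustration).
WEIGHTS = {"minor": 1, "major": 3, "critical": 5}

def quality_score(errors, word_count, threshold=10.0):
    """Compute a penalty per 1,000 words and check it against a pass threshold."""
    penalty = sum(WEIGHTS[severity] for severity in errors)
    penalty_per_1000 = penalty / word_count * 1000
    return penalty_per_1000, penalty_per_1000 <= threshold

# Example: a 250-word translation with two minor errors and one major error.
score, passed = quality_score(["minor", "minor", "major"], word_count=250)
print(round(score, 1), passed)  # 20.0 False
```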

As O’Brien (2012) states, error typology, along with the scoring and the error weights, is one of the favourite ways to judge quality in the translation industry. However, she adds that the typologies need to be adapted to changes in the industry, such as social media. Other challenges that need to be taken into account, according to Daems & Macken (2013), are the translation situation, subjective assessments and overly complex categories. Error typologies such as the Multidimensional Quality Metrics (MQM) (Lommel, 2015) and the Annotation Guidelines for English-Dutch Translation Quality Assessment (Daems & Macken, 2013) try to address these challenges by, among other things, creating open categories that can be customized. The former typology also does this by including localisation and transcreation. Additionally, the latter points out that translation problems are not always errors, depending on the text type.

In this dissertation we will focus on the error typology by Daems & Macken (2013). The typology is divided into two main categories: acceptability and adequacy. These categories contain subcategories, but the lists are not exhaustive: subcategories can be added if necessary, which is indicated in the tables by categories labelled ‘other’ or ‘other meaning shift’.

The main subcategories under acceptability are: grammar & syntax, lexicon, spelling & typos, style & register and coherence. All of these are in turn divided into further subcategories (see figure 2 below).

Figure 2: Error categorisation acceptability (Daems & Macken, 2013)

The error categorisation relating to adequacy is subdivided as well, but only once, and looks as follows:


Figure 3: Error categorisation adequacy (Daems & Macken, 2013)

An application of this error typology can be found in section 4, where it is implemented on the machine translated texts used in this dissertation.

2.4 MT approaches

Systran, a provider of online machine translations, gives the following definition of MT (‘What is Machine Translation?’, n.d.): “Machine translation (MT) is automated translation. It is the process by which computer software is used to translate a text from one natural language (such as English) to another (such as Spanish).”

To obtain these machine translations, there are essentially three different approaches (Macken, L., 2017). The first approach is Rule-based Machine Translation, or RBMT, which is the only approach that relies on dictionaries and grammar rules. Statistical Machine Translation (SMT) uses data consisting of corpora, as does Neural Machine Translation (NMT), which is based on artificial neural networks. Since this last approach is used by both Google Translate and DeepL, the two translation sites used in this dissertation, we will focus more on NMT and give only a brief overview of RBMT and SMT.

2.4.1 RBMT

There are three types of Rule-based Machine Translation approaches: direct systems, transfer systems and interlingual systems (Macken, L., 2017). These types differ in the depth of their grammatical analysis. Direct systems employ word-for-word translations and do not apply any further analysis. Transfer systems carry out a grammatical analysis, consisting of an analysis, transfer and generation phase, but they are very labour-intensive and need to be adapted for each new language pair. The interlingual system would remove that last disadvantage, but it does not exist yet.

2.4.2 SMT

A Statistical Machine Translation system is based on large data sets consisting of monolingual and bilingual information (Macken, L., 2017). The system is built up out of three components: a translation model, a language model and a decoder. The translation model applies statistical word alignment to take context into account. The language model contains the monolingual information. In both the translation model and the language model, the probability of phrases is calculated. This information is used by the decoder to obtain the best translation possible.
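The interplay between the translation model, the language model and the decoder is commonly summarised by the standard noisy-channel formulation below, in which the decoder searches for the target sentence e that maximises the product of the translation-model probability and the language-model probability for a given source sentence f (standard SMT notation, added here for illustration, not taken from the cited course material):

```latex
\hat{e} \;=\; \arg\max_{e} \, P(e \mid f) \;=\; \arg\max_{e} \, \underbrace{P(f \mid e)}_{\text{translation model}} \; \underbrace{P(e)}_{\text{language model}}
```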

2.4.3 NMT

The Neural Machine Translation approach uses an artificial neural network and large data sets from which the network can learn (Macken, L., 2017). This approach is inspired by studies of the human brain and usually consists of an encoder and a decoder which are trained together. The network is typically built up of three layers: the input layer, the hidden layer and the output layer. The connections made between words in every layer receive a probability weight during the training of the network, so as to represent the sequences that are most likely. An artificial neural network with one or several hidden layers is also referred to as deep learning.
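The toy sketch below illustrates this input-hidden-output structure for a single prediction step: a source word embedding is passed through a hidden layer and a softmax output layer, yielding a probability distribution over a tiny target vocabulary. The vocabularies, dimensions and random (untrained) weights are purely illustrative and say nothing about how Google Translate or DeepL are actually built.

```python
import numpy as np

rng = np.random.default_rng(0)

source_vocab = ["we", "use", "paper"]          # toy source vocabulary (assumption)
target_vocab = ["we", "gebruiken", "papier"]   # toy target vocabulary (assumption)

embedding_dim, hidden_dim = 8, 16

# Randomly initialised weights; in a real NMT system these are learned from large parallel corpora.
embeddings = rng.normal(size=(len(source_vocab), embedding_dim))   # input layer
W_hidden = rng.normal(size=(embedding_dim, hidden_dim))            # hidden layer
W_output = rng.normal(size=(hidden_dim, len(target_vocab)))        # output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_target_word(source_word):
    x = embeddings[source_vocab.index(source_word)]  # input layer: word embedding
    h = np.tanh(x @ W_hidden)                        # hidden layer with non-linearity
    return softmax(h @ W_output)                     # output layer: distribution over target words

probs = predict_target_word("paper")
print(dict(zip(target_vocab, probs.round(3))))
```

Training adjusts the weight matrices so that the probability mass concentrates on the most likely target sequences, which is the probability weighting described above.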

Artificial neural networks have existed for a while, but the rising interest in them has to do with digitalisation and the subsequent increase in available data, improved algorithms, faster computers and more capable processors. Because of all this, the quality of NMT has increased enormously, since the system can take the context into account.

2.4.3.1 Google Translate

Google Translate (‘Google Translate’, n.d.) used to be a Statistical Machine Translation site, but it switched to Neural Machine Translation in 2016 (Macken, L., 2017). The machine translation site also has an application and can be used offline (‘Google Translate - Apps on Google Play’, n.d.). According to the site, it translates between up to 103 languages, depending on how it is used, and can translate pictures of text as well as speech.


2.4.3.2 DeepL

The DeepL Translator (‘DeepL Translator’, n.d.) is a very recent addition to the field of MT and uses NMT. The free website is made by the same team that created Linguee and currently supports seven languages. The site was launched in August 2017 (‘Persinformatie’, n.d.) and advertises itself as “the best translation machine in the world” (‘DeepL — Vertaalkwaliteit’, n.d.). This claim is supported by a study they conducted in which professional translators were asked to rate sentences without knowing which MT system had produced them (DeepL Translator, Google Translate, Microsoft Translator or Facebook) and in which the best scores were attributed to DeepL. Several other sources also mention that DeepL is considered better than, for instance, Google Translate, although it still has its faults (Peleman, 2017; ‘Why DeepL Got into Machine Translation and How It Plans to Make Money’, 2017).

2.4.3.3 Typical errors of NMT

The paper by Van Brussel, Tezcan, & Macken (2018) gives an overview of the typical errors found in NMT output compared to RBMT and PBMT (Phrase-Based Machine Translation, a form of SMT) for the language pair English-Dutch. They found that NMT translations contain fewer mistakes overall and that one third of the sentences were translated completely correctly, a much higher percentage than for the other two approaches. The errors were divided into two categories: accuracy errors and fluency errors. These were in turn subdivided: (1) accuracy was made up of mistranslation, do-not-translate (DNT), untranslated, addition, omission and mechanical errors, and (2) fluency of grammar, lexicon, orthography, multiple errors and other. For further subcategories and examples, see Van Brussel et al. (2018).

In general, NMT outperformed RBMT and PBMT on both accuracy and fluency. The most notable results where NMT scored better than either RBMT or PBMT, or both, were found in the following subcategories: mistranslations in general, number of omission errors and number of superfluous words. However, NMT is outperformed by one or both of the other approaches when it comes to the subcategories DNT (do-not-translate, for instance proper names), words left out per omission, semantically unrelated mistranslations, other mistranslations, lexicon (lexical choice errors, or mistakes in the choice between content words and function words) and repetition (redundant repetitions of words or phrases). Referring to the relative invisibility of the omission errors, Van Brussel et al. (2018) remark that this is certainly not an advantage, since it impairs the quality of NMT output and hinders gisting.


Further along in this dissertation, we may therefore expect to find more errors belonging to the subcategories where NMT scored worse than RBMT or PBMT.


3 METHODOLOGY

The purpose of this dissertation is to investigate the comprehensibility, acceptability and perception of translations generated by two NMT systems, both on their own and in comparison with human translations of the same texts. This is tested in cases where the reader cannot fall back on the source text and where he or she may be given either a human translation or a machine translation. In the course of the investigation, we will attempt to answer the following research questions:

I. Can readers tell the difference between human and machine translations?
II. Which translation is preferred by readers: human or machine?
III. Are human translations more comprehensible than machine translations?
IV. Are machine translations comprehensible for the chosen language pair (English-Dutch) and using the chosen machine translation tools (Google Translate and DeepL)?
V. If the machine translations mentioned in question IV are comprehensible, are there still other elements which complicate comprehension (at sentence level, e.g. grammatical errors, or at text level, e.g. coherence)?
VI. If machine translations are not comprehensible for English-Dutch using Google Translate and/or DeepL, which types of mistakes hamper comprehension (grammatical, lexical, …)?
VII. Which mistakes in machine translations bother readers most?
VIII. Which method of translation generates the best results on comprehension questions: human translation, Google Translate or DeepL?
IX. Which machine translation tool is the better of the two: Google Translate or DeepL (in terms of number of errors, clarity scores and comprehension rates)?
X. Do the comprehension problems correspond with the mistakes in the applied error typology?
XI. Are language specialists more severe on translated texts than non-specialists?
XII. How do these results correspond with the research by Scarton & Specia (2016)?

Two of the research questions immediately suggest a hypothesis, based on our literature study. For research question XI, building on the finding by De Clercq et al. (2014) that language experts gave very similar evaluations to non-experts, our hypothesis is that language specialists are not more severe on translated texts than non-specialists. For research question IX, our hypothesis is based on Peleman (2017): DeepL is the better machine translation tool of the two.


The setup of the investigation is a questionnaire; the research is therefore based on human evaluation. Since De Clercq et al. (2014) successfully used crowdsourcing and Göpferich (2009) attributes the best results to reader-focused empirical research, as mentioned before in section 2.2.3.1, we opted for a setup appealing to both expert and non-expert respondents.

The questionnaire involved three different texts, translated either by ourselves (a human translation), by Google Translate or by DeepL. The source texts were selected from the Machine Translation Evaluation version (CREG-MT-eval) (Scarton & Specia, 2016) of the Corpus of Reading Comprehension Exercises in German (CREG)4 and were obtained from the website https://github.com/carolscarton/CREG-MT-eval. The corpus contains texts that were translated from German into English. The English human translations of these texts were used for this dissertation and serve as the source texts. This corpus was used because of its similar setup to measure comprehensibility, namely via reading comprehension tests. Another reason to opt for this corpus was that the texts included in it are very generic and not very difficult. That way, we could ensure that the texts were not “user-specific” or “topic-specific” (Tan et al., 2012), as mentioned in section 2.1.

Our corpus incorporated questions to measure reading comprehension. Those questions were translated from German into English by a human translator, and some of these human-translated questions were also used in the questionnaire. However, we adapted some questions in order to investigate the influence of certain errors in the machine translations. Furthermore, the corpus uses different question forms. We chose to omit some of these question forms because we were convinced that they would not yield interesting results. The forms that were left out are: yes/no questions, alternative questions and true/false questions. Inference questions were also excluded from our questionnaire because we would not be able to tell whether a reader failed to answer them because they did not comprehend the text or because they lacked the required world knowledge. The question forms that were used in the questionnaire are the following: wh-questions, literal questions and reorganisation questions. An example of a wh-question used is ‘What is needed to produce paper?’ and a literal question would be, for instance, ‘How much energy is saved when three sheets of recycled paper are used?’ A reorganisation question obliges the respondent to look for the answer in several parts of the text; ‘Which city is the front-runner according to the atlas? Why?’ is such a question.

4Machine Translation Evaluation version (CREG-MT-eval) (Scarton & Specia, 2016)


The texts that were selected for the questionnaire were first screened for possible errors in the machine translations in order to obtain interesting results. Some texts did not contain many mistakes, but others showed some interesting ones. We chose to use three texts because we wanted enough responses per text per method of translation, and this way we could include two machine translations and a human translation.

The questionnaire was built up as follows. Respondents were invited to open a link to a Google Spreadsheet where they could find the link to one of the experiments. To ensure that every experiment was equally represented, they were asked to write their email address in the first available box in column C next to their personal questionnaire link (see figure 4). On the first page of the questionnaire we gave some explanation about the experiment, after which the respondents were able to read the first translated text. On the following page they found the content questions, which they were asked to fill in. After this, the participants were presented with text-specific questions about the translated text. Here, the translation was shown again so participants had the chance to quote from the text. This process was repeated with a second translation, after which the respondents were asked to fill in some profile questions. The instructions given to the readers, the questionnaire texts and the questions can be found in appendices I, II and III.

Figure 4: Print screen of the Google Spreadsheet setup. (The email addresses are replaced to ensure the respondents’ anonymity.)


The questionnaire was distributed through e-mail and Facebook and conducted through SurveyMonkey. The latter proved to be the best possible platform because we wanted to make sure people could not edit their responses by going back in the questionnaire. This requirement followed from our decision to let the respondents read the translation carefully before answering the questions and to hide it while they filled in their answers to the content questions. SurveyMonkey also enables researchers to extract most of the information automatically from the website. The information gained was used here to discuss the results and was represented in tables produced in Excel.

Since we asked every respondent to evaluate two texts, we came up with a categorisation that would allow an equal number of completed evaluations per text per method of translation. This categorisation resulted in nine different experiments, as shown in the table below. We made sure that there were never two human translations in one experiment, but since two thirds of the translations were machine translations, we had to combine two machine translations in some of the experiments. Readers might not expect this and might participate in the questionnaire in the conviction that they would get to see both a human translation and a machine translation. This may influence the results somewhat and we will research this in section 4.

Experiment     Text 1   Text 2
Experiment 1   T1_HT    T3_DL
Experiment 2   T1_GT    T2_HT
Experiment 3   T1_DL    T3_HT
Experiment 4   T2_DL    T3_GT
Experiment 5   T2_GT    T3_DL
Experiment 6   T3_HT    T1_GT
Experiment 7   T2_GT    T1_DL
Experiment 8   T2_DL    T1_HT
Experiment 9   T2_HT    T3_GT

Table 1: Overview of the different experiments used in the questionnaire.
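As a quick sanity check on this design, the snippet below counts how often each translation version occurs across the nine experiments of Table 1; every version appears exactly twice, so each text-method combination receives a comparable number of evaluations.

```python
from collections import Counter

# The nine experiments from Table 1: (first text shown, second text shown).
experiments = [
    ("T1_HT", "T3_DL"), ("T1_GT", "T2_HT"), ("T1_DL", "T3_HT"),
    ("T2_DL", "T3_GT"), ("T2_GT", "T3_DL"), ("T3_HT", "T1_GT"),
    ("T2_GT", "T1_DL"), ("T2_DL", "T1_HT"), ("T2_HT", "T3_GT"),
]

counts = Counter(version for pair in experiments for version in pair)
print(counts)  # each of the nine translation versions appears exactly twice
```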

All of the content questions were obligatory, as were all of the profile questions. The text-specific questions were mostly obligatory too, except when readers were asked to copy incomprehensible parts of the text or mistakes that bothered them. These last questions were not obligatory because we expected that readers would not always encounter problems with the translations, especially with the human translations.


The text-specific questions about the quality of the translations were asked for both the human translations and the machine translations. This was done to maintain uniformity of the questionnaire, to prevent people from inferring from the questions that certain texts were human translations, and to demonstrate any difference between the evaluations of human and machine translations.

To maintain uniformity, we also selected texts of equal length. Text 3 was a lot longer than texts 1 and 2, but we shortened it for this purpose. We further manipulated this text by leaving out most of the free indirect speech to make it easier to read. The goal of the questionnaire was, after all, to discover how comprehensible the translations are, not the source texts.

In our literature study we mentioned that Schwarz & Flammer (1981) found that comprehensibility increases if there is enough time to process the text. This was one of the reasons why we decided that the respondents could read the translations for as long as they wanted. Another reason was that answering the content questions was already sufficiently difficult because the text was not visible while they were being answered. A time limit would therefore have made answering the questions too difficult and would consequently have resulted in an unsatisfactory outcome.

For the comprehension questions, we gave scores of 1, 0.5 and 0 to answers, depending on their level of correctness. A completely correct answer received 1, a partially correct one 0.5 and an incorrect one 0. To receive a score of 0.5, the answer had to contain at least one element of the gold standard answer. It was sometimes difficult to determine the difference between 1 and 0.5. For cases of doubt, we kept the gold standard answers in mind and fixed on several elements that were indispensable for the answer to be correct. For example, for the question ‘What kind of atlas is presented in Berlin?’ for text 1, the word ‘paper’ was necessary for the answer to be correct or partially correct. Even if one aspect was mentioned, which would normally lead to a 0.5, the answer received a 0 if paper was omitted. This same question also dealt with the machine translation error ‘papieren atlas’ instead of ‘papieratlas’. The former answer was rejected because it has a different meaning (an atlas made of paper instead of an atlas about paper). So even though the respondents answered correctly according to the text they had read, the meaning of their answer was wrong because of the incorrect translation.
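A minimal sketch of this scoring rule is given below: an answer receives 1 if it contains all gold-standard elements, 0.5 if it contains at least one but not all, and 0 if it contains none or omits an element that was treated as indispensable. The gold-standard elements in the example are our own simplification of the actual answer key.

```python
def score_answer(answer, required_elements, mandatory=None):
    """Score an answer 1, 0.5 or 0 against gold-standard elements.

    If a mandatory element is given and missing, the answer scores 0
    regardless of how many other elements it contains.
    """
    answer = answer.lower()
    if mandatory and mandatory.lower() not in answer:
        return 0.0
    hits = sum(1 for element in required_elements if element.lower() in answer)
    if hits == len(required_elements):
        return 1.0
    return 0.5 if hits > 0 else 0.0

# Hypothetical gold-standard elements for 'What kind of atlas is presented in Berlin?';
# 'papier' is treated as indispensable, as described above.
gold = ["papier", "recycling"]
print(score_answer("Een atlas over papier en recycling", gold, mandatory="papier"))  # 1.0
print(score_answer("Een atlas over recycling", gold, mandatory="papier"))            # 0.0
```

In practice the scoring was done manually; the sketch merely makes the decision rule explicit.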


4 APPLIED ERROR TYPOLOGY

As mentioned earlier in section 2.3, we applied the error typology by Daems & Macken (2013) to the machine-translated texts that were used in the questionnaire. The fully applied error typology can be found in appendix V. The number of different mistakes per text is presented in the table below. DeepL has the lowest number of mistakes for the second and third text, but more mistakes than Google Translate for the first text. Based on this, we expect better results for DeepL than for Google Translate. The numbers also indicate that both translations of the first text and the DeepL translation of the second text are of the best quality, followed by T2_GT, with both translations of the third text coming last. The typology in general also shows that the machine translations often contain the same mistakes for the same texts and that they frequently fail to translate the same passages correctly.

Text     Number of different mistakes
T1_GT    10
T1_DL    12
T2_GT    15
T2_DL    12
T3_GT    19
T3_DL    16

Table 2: Error typology mistakes

However, it should be noted that one type of mistake was sometimes counted twice if several examples were found in the text, for instance when the narrator is switched or when two named entities share a component that is translated incorrectly both times. If one incorrect word is repeated several times, it was only counted once, but an error sometimes belonged to several categories, in which case it was counted several times, for instance ‘gerecycled’. Some mistakes were also caused by the source text, such as capitalisation (‘DDR: Een’). In these cases, they were still annotated as errors. Nevertheless, as the MT tools seem to be hindered by the same elements, this is usually the case for both machine translations of the same text, and the numbers of mistakes are therefore still adequate for comparison.

We also calculated how many mistakes Google Translate and DeepL made per subcategory over the three texts (see table 3). These results show a fairly even number per subcategory: Google Translate and DeepL both made several similar or identical mistakes when translating the texts. Most of the errors belong to the category acceptability. The subcategories ‘grammar & syntax’, ‘lexicon’ and ‘style & register’ in particular contain a lot of mistakes.

Category       Subcategory                                                      GT   DL
Acceptability  Grammar & syntax                                                  7   10
               Lexicon                                                           9    7
               Spelling & typos                                                  4    3
               Style & register                                                 10    9
               Coherence                                                         3    2
Adequacy       Word sense disambiguation                                         4    2
               Part of Speech                                                    1    1
               Meaning shift caused by incorrect translation of function word    1    1
               Meaning shift caused by misplaced word                            2    2
               Deletion                                                          1    2
               Meaning shift caused by other                                     2    1

Table 3: Number of mistakes per translation method per subcategory

If we compare the texts separately, we find that in the first text ‘gerecycled papier’ causes problems. It is a mistake that can be classified under different categories and therefore returns a few times. The second and third texts contain a few wrong collocations. As these are rather obvious mistakes, we expect the respondents to notice them as well and indicate them as mistakes in the questionnaire (see the results section). The third text contains the most adequacy errors. In addition, both the first and the third text contain untranslated words: ‘Westfalia’, ‘gerecycled’, ‘notebooks’ and ‘hijack’.

For text 2 we found an article error that made us wonder whether the sentence would have been translated incorrectly as well had we made a small alteration to it. The phrase ‘I also like to play the harmonica’, translated as ‘Ik speel ook graag de mondharmonica’, is not that common, but the phrase ‘I also like to play the guitar’ is much more so. Indeed, for the second, more common phrase, neither Google Translate nor DeepL produces the redundant article caused by interference. This is a clear example of the disadvantages of a machine translation approach based on large data sets (see section 2.4).


Figure 5: Screenshot of Google Translate: harmonica (19/06/2018)

Figure 6: Screenshot of Google Translate: guitar (19/06/2018)

Figure 7: Screenshot of DeepL: harmonica (19/06/2018)

Figure 8: Screenshot of DeepL: guitar (19/06/2018)

Lastly, some of the mistakes encountered in the texts can be linked to the typical errors of NMT as discussed in section 2.4.3.3. In T3_GT, for instance, the named entities Berlin-Schönefeld and Berlin-Tempelhof, in which Berlin was translated into Dutch, match the subcategory DNT. Part of the translation in T1_DL also features a redundant repetition: 'We gebruiken overal papier… we gebruiken papier'.


5 RESULTS

The results section is divided into several subsections. The first subsection discusses the general results of the questionnaire, including a short account of the profile questions. Following that are the results of the text-specific questions of the three texts together. After that, we lay out the results of the comprehension questions per text.

5.1 General results

We will start this section by providing some general data about the questionnaire. Below we give an overview of the number of respondents, the percentage of completion and the average time it took respondents to complete, all per experiment. These numbers were all obtained through SurveyMonkey.

               Respondents   % of completion   Time to complete
Experiment 1       12              92%             15 mins
Experiment 2       11              91%             17 mins
Experiment 3       12              92%             15 mins
Experiment 4       11              91%             14 mins
Experiment 5       11             100%             11 mins
Experiment 6       10              90%             19 mins
Experiment 7       11              82%             15 mins
Experiment 8       10              90%             14 mins
Experiment 9       11              73%             14 mins
Total number of respondents: 99

Table 4: Questionnaire data

The total number of respondents for the questionnaire is 99. Ideally, every experiment would have been filled in 11 times, but some respondents may have selected a different link to the questionnaire because of the way Google Spreadsheets formats hyperlinks. However, the table still shows a fairly even distribution of respondents over the experiments. The experiments are therefore representative enough to allow us to compare results.

The percentage of completion is generally high, with only experiments 7 and 9 showing a lower percentage. The 82% of experiment 7 can be explained by the fact that 2 respondents did not answer any profile questions and that 1 of these respondents had already quit the questionnaire after the content questions of the second text, leaving the text-specific questions and the profile questions blank. This cannot account for the 73%, however, so for experiment 9 a broader perspective must be considered. SurveyMonkey indicates in the 'individual reactions' section that three of the responses are incomplete. Yet, when we take a closer look at the answers, nearly all of them are filled in, apart from the non-obligatory ones. Although some of the answers merely contained punctuation marks ('.', '?' or '/'), the explanation more likely has to do with people not clicking through to the last page after the profile questions. As this does not influence the results, it is not a matter of concern.

The time needed to complete the questionnaire is largely the same overall, with experiments 5 and 6 standing out. Experiment 5 was filled in rather quickly (11 minutes on average) and experiment 6 rather slowly (19 minutes on average). A possible explanation might be that some respondents merely took their time to complete the questionnaire, as opposed to others who went through it more quickly. Perhaps some of the respondents read the translated text several times, while others read it only once.

Table 5 gives an overview of the number of respondents per text. As with the number of respondents that filled in the different experiments, this table also shows a very similar distribution.

Text                    Number of respondents
Text 1: Atlas                    66
  HT                             22
  GT                             21
  DL                             23
Text 2: Greifswald               64
  HT                             21
  GT                             22
  DL                             21
Text 3: Escape GDR               66
  HT                             22
  GT                             22
  DL                             22

Table 5: Number of respondents per text


What follows are some statistics about the respondents, based on the answers obtained in the profile questions. In total, 4 people did not fill in the profile questions; as a result, these statistics are based on 95 respondents. The average age of the respondents was 27. Of these respondents, 61 were female, 33 male and 1 selected the category 'other'. No fewer than 43 of the respondents indicated that their current education or degree was related to the field of languages, and only 8 respondents are not currently pursuing a higher education and have not obtained a degree in higher education. In section 2.2.3.1 we mentioned that De Clercq et al. (2014) appealed to both a regular crowd and an expert crowd and that both generated similar results. Therefore, we expect that the relation to the field of languages will not influence the results here. We will verify this in section 5.3. Furthermore, 4 people said that they have never used a machine translation service. Of the remaining 91, 74 were positive in their observations, but 52 added a remark, usually that the translation should always be checked afterwards or that they only use it in certain cases. Lastly, 20 respondents indicate that they mostly use machine translations for the purpose of information gisting and 8 even state that they prefer DeepL over Google Translate.

5.2 Text-specific questions

5.2.1 Human or machine translation

The respondents were asked if the text they had just read was a human or a machine translation. The table below represents their answers. Most of the respondents think that the human translation is in fact translated by a human translator and that the texts translated by Google Translate and DeepL are machine translations.

                     Respondents think   Respondents think   Correct in %
                     'human'             'machine'
Text 1: Atlas
  HT                       15                   7                 68%
  GT                        5                  16                 76%
  DL                        5                  17                 77%
Text 2: Greifswald
  HT                       14                   7                 67%
  GT                        4                  18                 82%
  DL                        1                  20                 95%
Text 3: Escape GDR
  HT                       18                   4                 82%
  GT                        9                  13                 59%
  DL                        9                  13                 59%

Table 6: Answers of respondents when asked if the text is a human or a machine translation

For text 1 it is striking that the percentage of correct answers is nearly the same for both machine translations, but lower for the human translated text. The second text has about the same percentage of respondents who correctly labelled the human translation as such, but considerably more respondents were correct in assuming the machine translations were in fact machine translated. Especially DeepL obtains a high result here. This seems to imply that the machine translations of the first text were of better quality than those of the second, since it was less obvious that they were machine translations. For the third text, the percentage of correct human translation labels is notably higher than for the first two texts. Moreover, the machine translations both got the same score, which is lower than for the previous texts. These numbers suggest that the third text was translated better overall, by the human translator as well as by Google Translate and DeepL.

These numbers all show that the majority of respondents were able to tell when the text was translated by a human or a machine. In about half the cases over 75% of the respondents were correct in their assumptions. The respondents were also asked why they thought the text was either a human or a machine translation. These results will be subdivided per translation method and discussed below.

5.2.1.1 Human translation

A. HT labelled as MT

We will first discuss the reasons why respondents labelled the human translations as machine translations. These reasons are presented in the bar graph below. The horizontal axis shows the reasons and the vertical axis shows the number of participants who listed that reason. The category 'incoherence' includes remarks about the presence or lack of discourse markers. This also holds for the following graphs.

The graph clearly shows that incoherence is the greatest challenge for the human translations, since several respondents indicated this for all three texts. The short, simple sentences also posed a challenge for text 2. These challenges might result from translating the source text too literally.


Graph 1: Reasons why respondents labelled HT incorrectly as MT (bar graph; number of respondents per text T1–T3 for the categories: incoherence, disfluency, sentence structure, long sentences, short simple sentences ('staccatostijl'), inflection, repetition, use of commas, static text)

For the first text, 2 respondents mention explicitly that they were not sure of their choice. The repetition for the second text refers to 'dat vind ik heel leuk' and the inflection to 'student' and 'docent', since both forms lack an additional '-e' when referring specifically to a female. The plural of 'museum' is also mentioned by a respondent ('museums' instead of 'musea'), but since the word 'museum' officially has two plural forms in Dutch, this is not a mistake and it was therefore not listed in the graph. The disfluency for text 3 referred to the phrase 'tot de verbeelding sprekende'.

B. HT labelled as HT

The graph below presents the reasons why respondents labelled the human translation correctly as a human translation. The first four categories ‘coherence’, ‘lack of mistakes’, ‘fluency’ and ‘idiomaticity’ were all mentioned for all the texts and they seem to have the highest number of respondents per category.


Graph 2: Reasons why respondents labelled HT correctly as HT (bar graph; number of respondents per text T1–T3 for the categories: coherence, lack of mistakes, fluency, idiomaticity, text structure, sentence structure, long or compound sentences, correct word order in long complex sentences, short sentences, clarity, punctuation marks, word choice (adjectives, terms), anaphora, dynamic style of the text, free translation)

The phrase 'een pot koffie zetten' from text 1 was mentioned by 2 respondents and was placed in the category 'idiomaticity'. An additional remark from a respondent for text 2 was that he or she was not certain and was rather guessing. The reason for this was that the text contained very short and clear sentences, something the respondent acknowledged may have been caused by the source text. Another respondent remarked here that the text was fairly simple and that a machine translation would also translate it very adequately. The lack of mistakes for text 3 was specified by the respondents and refers to content, word order, conjugations and sentence structure. For this text, 2 respondents explicitly mentioned that they chose a human translation because it was better than the previous text, which they had labelled as a machine translation. These respondents had indeed filled in experiment 3, in which T1_DL was the first text to be judged and T3_HT the second.


5.2.1.2 Google Translate

A. GT labelled as HT

The following graph lists all the reasons why respondents thought that the Google Translate texts were human translations. Contrary to the graphs for the human translation, there is no category mentioned for all three texts here. Only the categories ‘lack of mistakes’, ‘idiomaticity’, ‘sentence structure’ and ‘clarity’ are mentioned more than once.

Graph 3: Reasons why respondents labelled GT incorrectly as HT (bar graph; number of respondents per text T1–T3 for the categories: lack of mistakes, cohesion, fluency, idiomaticity, text structure, sentence structure, short simple sentences, clarity, abbreviation 'uni', different expectations of mistakes with an MT)

The idiomaticity mentioned for text 1 is restricted: the respondent excludes 'gerecycled papier', since he or she does not find this very idiomatic. Here, 1 of the respondents was hesitating and chose a human translation, despite the simple and very short sentences that would, according to the respondent, normally indicate a machine translation.

For the third text, 2 respondents added a remark to their answer: one mentioned the presence of a few mistakes that seemed to make the text more idiomatic, and another indicated the lack of mistakes, except for sentences starting with 'and' or 'but'. In the answers of 2 of the respondents who filled in T3_GT, we observe a comparison with the previous text they had read and rated. They had labelled their first text as a machine translation and now thought this was a human translation because of the better quality. After further research we found that the first text was indeed a machine translation (T2_DL, experiment 4). We also noticed a difference between the respondents who had filled in experiment 4 and those who had received experiment 9. The first consisted of T2_DL followed by T3_GT and the second of T2_HT followed by T3_GT. For experiment 4 we saw that no fewer than 8 out of 11 respondents thought that the text was a human translation. However, in the experiment with a human translation as the first text only 1 respondent incorrectly believed that this was the case.

B. GT labelled as MT

The graph below shows the reasons why respondents thought that the Google Translate text was a machine translation. Three of the categories are mentioned for all three texts: 'incoherence', 'absence of idiomaticity' and 'word order'. The previous graphs did not show great differences between the three texts, but it is striking here that every text has several categories indicated by respondents that do not recur in the other texts. For example, no fewer than 4 respondents remark on the incorrect sentences for text 3, while no respondents do so for the other 2 texts.


Graph 4: Reasons why respondents labelled GT correctly as MT (bar graph; number of respondents per text T1–T3 for the categories: comprehensibility, incoherence, disfluency, absence of idiomaticity, word order, short sentences, incomplete sentences, incorrect sentences, strange illogical constructions, punctuation, inconsistency, repetition, anaphora, (grammatical) mistakes, complexity, incorrect translations, literal translations, rhythm)

Many respondents comment on the inconsistency in text 1. In the text, 'u' and 'je' and 'gerecycled papier' and 'gerecycleerd papier' are used interchangeably and there is also a change in narrator perspective. The repetition refers both to 'gerecycled papier' and to sentence structure, and the incorrect translation to 'Noordrijn-Westfalia', which should be 'Noordrijn-Westfalen'. Several of the remarks given here were also noted when discussing the error typology of T1_GT: the inconsistency, the repetition and the untranslated 'Westfalia'.

The absence of idiomaticity mentioned for text 2 includes the abbreviation 'uni'. This is used a lot in German and English, but not in Dutch. The punctuation concerns the fragment 'zuster. Agneta' and the strange, illogical constructions concern the phrase 'Het huis van mijn kamergenoot woont in Berlijn'. An example of a grammatical mistake here is 'ik speel ook graag de mondharmonica'. Nearly all the mistakes mentioned for text 2 can also be found in the error typology, especially the ones for which examples were just given. However, not all the mistakes from the error typology are mentioned here. This may be because respondents described the errors in more general terms while in fact referring to those mistakes, or because these errors were rather unimportant, especially for text comprehension.

The phrase ‘met een plastic geweer’ causes the word order problems for text 3: the current position seems to imply that the plastic gun was used to land the plane, instead of to force the pilot to do so. The incomplete and incorrect sentences are the following:

 'Hijack vlak voor het bereiken van de luchthaven van bestemming.'
 'Maar zo was het niet de bedoeling.'
 'Een speelgoedgeweer gebruiken.'

Text 3 also has inconsistency in voice (protagonist perspective vs. narrator), just like the first text. As with the other texts, many of the mistakes mentioned here were also noted in the error typology, but many others were not. They may have been replaced with more generic formulations or may return later in section 5.2.4.

5.2.1.3 DeepL

A. DL labelled as HT

The last translation method in the questionnaire is DeepL. What follows is an overview of the reasons why respondents regarded the DeepL translation as a human translation. No category here is mentioned for more than two texts.


Graph 5: Reasons why respondents labelled DL incorrectly as HT (bar graph; number of respondents per text T1–T3 for the categories: coherence, lack of mistakes, fluency, clarity, (complicated) sentence structure, long sentences, collocations)

The main reason given for text 1 was the lack of mistakes, as shown in the graph above, although 1 respondent remarks that the text could be more fluent. Another respondent appears to have been influenced by the experiment setup and compares this text with the previous text he or she had judged. Since this text was clearer than the previous one, he or she supposed it to be a human translation. This particular respondent had completed experiment 7, which consisted of T2_GT and T1_DL, both machine translations. It is possible that the respondent did not know that both texts could be machine translations and that this influenced his or her decision. To avoid this, it might have been better to mention this explicitly at the beginning of the experiments.

The only respondent who was convinced that the second text was a human translation thought so because the mistakes made were mostly collocations that got confused. The respondent reasoned that these are more common mistakes with language learners than with machine translations.


Of the 9 respondents who thought that the third text was a human translation, 4 were clearly comparing it with the previous text they had seen, as in both experiments this was the second text. These 4 respondents favour this text because it is more fluent and clear, there is more coherence (in the form of discourse markers) and the sentence structure is better, especially at the end. However, 1 of these respondents remarks that it is rather strange that there are sentences in the text without a verb. Another respondent thinks that machine translations make more mistakes with long sentences and therefore presumes this is a human translation. The clarity of the text is further illustrated by 1 respondent, who says that one reading of the text was enough to comprehend it fully, down to the very last detail.

B. DL labelled as MT

Finally, the graph below presents the reasons why respondents labelled the DeepL translated texts correctly as machine translations. The categories 'incoherence', 'sentence structure', 'short sentences' and '(grammatical) mistakes' are mentioned for all three texts. Text 2 receives many remarks in the '(grammatical) mistakes' and 'absence of idiomaticity' categories and many respondents mention 'repetition' for text 1.


Graph 6: Reasons why respondents labelled DL correctly as MT (bar graph; number of respondents per text T1–T3 for the categories: incoherence, disfluency, absence of idiomaticity, text structure, sentence structure, short sentences, (grammatical) mistakes, illogical content, repetition, literal translations, punctuation, incomplete sentences, anglicisms, unnecessary words)

A lot of the respondents for text 1 indicated that the repetition made them choose a machine translation. They also give examples of this repetition: 'we gebruiken overal papier… we gebruiken papier', 'gerecycled papier van oud papier' (pleonasm), 'papier', sentence structure and phrasing. Examples given by the 2 respondents who mentioned anglicisms are 'een rol van toiletpapier' and 'gerecycled papier'. Another respondent refers to the questionnaire setup by writing that it is difficult to remember certain things since the same words are often repeated. If the text had been shown during the content questions, perhaps the respondent would not have minded the repetition so much.

Once more, some of the mistakes mentioned are similar to the mistakes noted in the error typology, although some categories are different. The repetition of 'we gebruiken overal papier… we gebruiken papier', for example, is listed as a grammar & syntax error in the typology. Other similarities are found between register ('gerecycled papier') and untranslated on the one hand and repetition and anglicisms on the other.

The mistakes referred to for text 2 are usually grammar mistakes such as 'ik kookt', 'ik speel de harmonica', etc., but 1 respondent also mentions spelling errors. Some of the respondents had a few extra remarks. One of them said that the many non-idiomatic structures were clearly translations influenced directly by German, and another observed that 'docent Duitser' must be more common in corpora than 'docent Duits' as the translation of 'Deutsch'. A last remark was that if the text was translated by a human translator, this person would presumably be a non-native speaker.

Again, almost all of the mistakes here can be traced back to the error typology. This time it was also the other way around: just about all the mistakes from the error typology were brought up here as well. Moreover, one of the few mistakes that were not mentioned was impossible to notice, since the source text was needed to establish the deletion.

According to 1 respondent, text 3 has some incomplete sentences: ‘Een speelgoedpistool gebruiken.’ and ‘Hijack net voor het bereiken van de luchthaven van bestemming.’. The punctuation here refers to a couple of full stops in front of ‘maar’ instead of a comma. For this text, only a few mistakes are mentioned that return in the error typology and numerous mistakes were not discussed. Perhaps the other mistakes will still be mentioned in section 5.2.4.

5.2.1.4 Summary

The category 'incoherence' is mentioned in nearly all of the graphs above when the respondents chose a machine translation. It is striking that the respondents in many cases base their decisions on that category, especially since it was not included very often in the applied error typology. When it was included, it was mostly to point out the inconsistency of the terms used. A possible explanation here is that the respondents did not have the source texts to fall back on and that this incoherence was already present in the source texts. The importance of coherence for comprehensibility was already mentioned as a dimension of the Karlsruhe comprehensibility concept (Göpferich, 2009) in section 2.1. Another of these dimensions that can be mentioned here is 'structure', since text and sentence structure recur a few times as well. Consistency was also discussed earlier when describing the CCC model by Renkema (2015) in section 2.2.1.


Other categories that were mentioned often are ‘fluency’, ‘idiomaticity’, ‘lack of mistakes’ and ‘clarity’ in cases where respondents thought that the text was a human translation and ‘disfluency’, ‘absence of idiomaticity’, ‘short sentences’ and ‘repetitions’ in cases where they thought it was a machine translation. This repetition coincides with the dimension ‘concision’ of the Karlsruhe comprehensibility concept by Göpferich (2009) discussed in section 2.1.

The remarks about the long complex sentences refer to a problem found mostly in machine translations and mentioned in section 2.2.2. The respondents seem to be aware to some extent of the declining quality of MT output for longer sentences (Van Brussel et al., 2018). Thus, when respondents find that long sentences do not contain any mistakes, they assume the text was produced by a human translator.

5.2.2 Clarity score

The respondents were then asked to give the text a score based on its clarity. A score of 1 means the text was judged to be completely incomprehensible and a score of 5 marks the text as perfectly comprehensible. As can be seen below for text 1, the human translation was rated better than either machine translation, since it received the fewest scores from 1 to 3 and the most scores of 5. If we compare the machine translations, we see that Google Translate gets a more positive review: DeepL notably has more low scores than Google Translate. This is logical, since the applied error typology showed that the Google Translate text contained fewer errors than the DeepL text (see section 4).

Graph 7: Clarity scores text 1 (bar graph; number of participants per clarity score 1–5 for the human translation, Google Translate and DeepL)


For text 2 the human translation is again rated best, since it receives the most 5/5 scores for clarity and the fewest 2 and 3 scores. For the second time, Google Translate scores better than DeepL. This result is rather surprising, since the text produced by DeepL contains fewer mistakes in the error typology. This may indicate that not all mistakes have the same impact on comprehensibility.

Graph 8: Clarity scores text 2 (bar graph; number of participants per clarity score 1–5 for the human translation, Google Translate and DeepL)

As was the case for the previous texts, the human translation of text 3 does not receive a single 1/5. However, both machine translations are rated 1/5 once, just like the machine translations of text 1. The human translation is ranked best once more, and this time DeepL receives the highest score most often and has the fewest low scores of the two machine translations. Since DeepL also had fewer mistakes in the error typology, this result was to be expected.


Graph 9: Clarity scores text 3 (bar graph; number of participants per clarity score 1–5 for the human translation, Google Translate and DeepL)

We also calculated the average clarity score per text and per translation method. The table below shows that the human translations receive a steady score overall and that both machine translations perform worse. For texts 1 and 2, Google Translate receives the higher score of the two with a constant 3.5/5, while DeepL achieves the higher score for the third text with that same score. In general, the average scores per text are very similar. However, text 2 still ranks first and text 3 ends up last, right after text 1.

In the third column of the table we look back at the error typology in section 4 (table 2). Based on the number of different mistakes, we would expect DeepL to score best for the second and third text, while Google Translate trumps DeepL for the first text. This is indeed the case for texts 1 and 3, but not for the second text.

                      Average clarity score   Number of different mistakes
                                              (error typology)
Text 1: Atlas                  3.6
  HT                           4.1
  GT                           3.5                       10
  DL                           3.2                       12
Text 2: Greifswald             3.7
  HT                           4.1
  GT                           3.5                       15
  DL                           3.4                       12
Text 3: Escape GDR             3.5
  HT                           4.0
  GT                           3.1                       19
  DL                           3.5                       16

Table 7: Average clarity score linked to the number of different mistakes (error typology)
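For clarity on how such an average can be obtained from the rating distributions in graphs 7–9, the snippet below shows one possible calculation; this is an assumption about the procedure (a weighted mean of the 1–5 ratings), not the original spreadsheet, and the example distribution is hypothetical.

def average_clarity(score_counts):
    """Weighted mean of clarity ratings; keys are the scores 1-5, values the counts."""
    total_ratings = sum(score_counts.values())
    weighted_sum = sum(score * count for score, count in score_counts.items())
    return weighted_sum / total_ratings

# Hypothetical distribution for one text/translation method (22 respondents).
example = {1: 0, 2: 1, 3: 3, 4: 10, 5: 8}
print(round(average_clarity(example), 1))  # 4.1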

5.2.3 Comprehension

When asked if they had been able to comprehend everything in text 1, the 22 respondents who were given the human translation all answered 'yes'. This was not the case with the machine translations, where 6 people in total did not comprehend the text fully. The respondents' answers as represented in the graph below indicate that Google Translate has a higher comprehension rate than DeepL, with the former having only 1 respondent with comprehension difficulties and the latter having 5. These findings seem to confirm our hypothesis that human translations are more comprehensible than machine translations.

Graph 10: Comprehension text 1 (bar graph; number of 'yes' and 'no' answers for Google Translate and DeepL)

The respondents who answered ‘no’ to the previous question were given the chance to comment on this and to mark the passage that they found especially challenging. Although only 1 respondent did not comprehend the Google Translate text completely, 2 respondents wrote down the passages they regarded as difficult:


 'Gerecycled papier bespaart energie. Milieuactivisten eisen ook dat we alleen gerecycled papier gebruiken. Gerecycled papier van oud papier is beter voor het milieu.'
 'Slechts drie stukjes gerecycled papier besparen genoeg energie om een pot koffie te maken.'

For the DeepL text, 4 of the 5 respondents mention the first sentence from the enumeration below. A sentence with a redundant repetition is referred to by 2 of the 5 and is listed as the second passage below. Furthermore, 2 respondents list sentences with a lot of repetition of 'gerecycled papier', and several other issues had already been mentioned when respondents motivated their choice for a human or a machine translation. If a passage is cited by more than 1 respondent, this is indicated at the end of the line with the number in brackets; the same convention is used throughout this dissertation.

 'Slechts drie stuks gerecycled papier besparen genoeg energie om een pot koffie te maken.' (4)
 'we gebruiken overal papier... we gebruiken papier' (2)
 'Gerecycled papier van oud papier is beter voor het milieu. U heeft minder hout nodig om gerecycled papier te produceren en ook minder water en energie.' (2)
 'Om het aantal bomen dat wordt gekapt zo laag mogelijk te houden, moeten we papier spaarzaam gebruiken. Gerecycled papier bespaart energie. Ook milieubeschermers eisen dat we alleen gerecycled papier gebruiken. Gerecycled papier van oud papier is beter voor het milieu. U heeft minder hout nodig om gerecycled papier te produceren en ook minder water en energie.'
 'De stad Aken in Noordrijn-Westfalen bespaart veel energie. Dit wordt aangegeven in een nieuwe papieren atlas die vandaag in Berlijn is geïntroduceerd.'
 'Op school schrijf je in je notebooks, de leraar drukt veel hand-outs af en er hangt een rol van toiletpapier in je badkamer'
 'Gerecycled papier van oud papier'
 'een nieuwe papieren atlas'

Both sentences that were mentioned for the Google Translate text return for the DeepL text. The second sentence was even mentioned as many as 4 times for T1_DL. Thus, the machine translations appear to have had difficulties with the same sections of the first text.

The results found here can be linked to the number of mistakes noted in the error typology (section 4). According to those numbers, Google Translate scores better than DeepL for text 1. These numbers correspond with the comprehension rates found here. The comprehension rates can also be further connected with the average clarity score, for which Google Translate receives a higher score.

The respondents who had received the human translation of the second text again all responded that they had comprehended everything in the text (although 1 of the 21 did not fill in this question). However, some respondents had difficulties comprehending the machine translations (see the graph below). DeepL scores better than Google Translate: only 1 respondent answered 'no' to this question, whereas 4 people did so for Google Translate.

Graph 11: Comprehension text 2 (bar graph; number of 'yes' and 'no' answers for Google Translate and DeepL)

All of the respondents who answered ‘no’ for the Google Translate text indicated only one passage that was incomprehensible to them:

 ‘Het huis van mijn kamergenoot woont in Berlijn (en soms nemen we de trein naar Berlijn om haar familie te bezoeken.)’

The respondent who felt that the DeepL translation was not entirely comprehensible quoted the following passage:

 'Mijn HUISGENOOT woont in Berlijn en soms nemen we de trein naar Berlijn om HAAR familie te bezoeken.'

The respondent then further elaborated on this by asking if by 'huisgenoot' they were referring to a male and if that meant that the family in question is his or 'hers'.


If we compare text 2 with text 1, we see that the machine translations of text 2 each contained only 1 passage that was incomprehensible, whereas the first text had several. Given that the number of respondents who found the text incomprehensible was nearly the same (5 for text 2 versus 6 for text 1), this may indicate that text 2 was generally translated better and merely contains 1 sentence that was translated inadequately. This sentence was moreover quoted for both machine translations, which shows again that Google Translate and DeepL struggle with the same sentences. Both quoted passages also appeared in the error typology, where they were classified as adequacy errors.

The number of different mistakes in the error typology (section 4) showed that Google Translate came out worse here than DeepL. It is therefore not surprising that more people indicated that T2_GT was incomprehensible and fewer respondents did so for T2_DL. However, this is not reflected in the average clarity score, where Google Translate obtains a slightly better mark than DeepL.

Text 3 is the only text in which all translations caused problems for the comprehensibility. Even though the human translation was not entirely comprehensible to 2 respondents, it still scored considerably better than both machine translations. The texts translated by Google Translate and DeepL both had 22 responses, but Google Translate was rated less comprehensible than DeepL for text 3, with 10 respondents finding the Google Translate text incomprehensible and 8 finding the DeepL text so.


Graph 12: Comprehension text 3 (bar graph; number of 'yes' and 'no' answers for the human translation, Google Translate and DeepL)

When asked why they had not comprehended the text, both respondents cited the same passage for the human translation:

 ‘Maar zo was het niet gepland. Op een gegeven moment kwamen ze op het idee om een vliegtuig met bestemming Berlin-Schönefeld in Oost-Berlijn te kapen en de piloten te dwingen om te landen in West-Berlijn op de luchthaven Berlin-Tempelhof.’

One respondent who had comprehended the text remarked that everything was comprehensible provided that the passage was read several times, because the long sentences created confusion if the reader only skimmed the text.

For Google Translate, several respondents cited the same 2 passages: 2 respondents mentioned the first passage in the enumeration and 4 quoted the second. There was 1 person who mentioned both of the passages and he or she added the remark that it was explained in a complicated manner. The other passages that were mentioned are also listed below.

 'Mijn dochter moest het pistool in het vliegtuig dragen, herinnert Ingrid Maron zich. En een douanebeambte vond het pistool natuurlijk tijdens een veiligheidscontrole. Ze gaf het terug aan mijn dochter, legt Ingrid Maron uit.' (2)
 'Hijack vlak voor het bereiken van de luchthaven van bestemming.' (4)
 'Maar zo was het niet de bedoeling.'
 'een jonge man dwingt de piloot van een geplande vlucht naar Oost-Berlijn om met een plastic geweer in West-Berlijn te landen.'


 'De drie voortvluchtigen wachtten tevergeefs op de overeengekomen ontmoetingsplaats voor Horst Fischer.'
 'Op een gegeven moment kwamen we op het idee om een vliegtuig te kapen dat verondersteld werd te landen op de Oost-Duitse luchthaven Berlijn-Schönefeld en het te laten landen op de luchthaven Berlijn-Tempelhof.'

A few respondents did not cite anything, but gave an explanation for their negative answer: 1 respondent commented that the text at times seemed like two stories mixed up and that some sentences lacked correct sentence structure and coherence in words, and another respondent claimed not to have understood the whole story (names, purpose, motives, …).

Concerning incomprehensibility in the DeepL text, again several respondents quoted the same passages as indicated below in the enumeration. As for the second sentence, 1 of the respondents asked whether it did happen like that in the end. Other respondents added a few more incomprehensible passages.

 'Een jongeman dwingt de piloot van een geplande vlucht naar Oost-Berlijn om met een plastic pistool in West-Berlijn te landen.' (3)
 'Maar zo moest het niet gaan.' (2)
 'Hijack net voor het bereiken van de luchthaven van bestemming.' (4)
 'Een speelgoedpistool gebruiken.' (2)
 'Mijn dochter moest het pistool op het vliegtuig dragen, herinnert Ingrid Maron zich.'
 'vliegtuig te kapen dat op de Oost-Duitse luchthaven Berlin-Schönefeld zou moeten landen en het te laten landen op de West-Berlijnse luchthaven Berlin-Tempelhof.'

Several respondents again pointed out more general problems with the text. For 1 respondent the text was not put together well content-wise and another considered the translated text too chaotic regarding the characters, function and places.

More respondents have comprehension problems with T3_GT than with T3_DL. Since the former contains more errors according to the typology in section 4, this was to be expected. It also matches the average clarity scores for the machine translated texts.

Here, too, the sentences that cause problems according to the readers are mostly the same. Only the passages 'De drie voortvluchtigen wachtten tevergeefs op de overeengekomen ontmoetingsplaats voor Horst Fischer.' and 'Een speelgoedpistool gebruiken.' are mentioned for just one of the texts. Since the machine translations of all three texts contain errors or difficulties in the same text parts, we may assume that both systems still struggle with the same fundamental challenges.

To verify whether sentence length had anything to do with the incomprehensibility of the sentences, we calculated the average sentence length (see table 8) of the quoted incomprehensible sentences. As we can see, the total average is not very high, but as not all of the sentences were copied in full, the actual average may be higher. For text 2 the average is considerably higher. Still, the incomprehensibility is most likely not due to sentence length; it probably has more to do with the finding by Jones et al. (2005). As mentioned in section 2.2.3.3, they found that texts with errors yield worse results for comprehension questions.

Text      Average sentence length
Text 1            10.2
Text 2            19.5
Text 3            14.4
Total             12.5

Table 8: Average sentence length for incomprehensible sentences
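The sketch below illustrates one way this average could be computed; it is an assumption about the procedure (word counts over sentence splits), not the original calculation, and the two passages are merely illustrative examples taken from the quoted sentences above.

import re

# Two of the passages quoted as incomprehensible; illustrative subset only.
passages = [
    "Slechts drie stuks gerecycled papier besparen genoeg energie om een pot koffie te maken.",
    "Hijack net voor het bereiken van de luchthaven van bestemming.",
]

# Split each passage into sentences on end punctuation and count words per sentence.
sentences = [s.strip() for p in passages for s in re.split(r"[.!?]+", p) if s.strip()]
lengths = [len(s.split()) for s in sentences]

average_length = sum(lengths) / len(lengths)
print(round(average_length, 1))  # average number of words per sentence (here: 12.0)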

5.2.4 Notable mistakes

The last text-specific question of the questionnaire explores any mistakes that may have bothered the respondents while reading the text. Here we expect a repetition of some of the complaints mentioned earlier in sections 5.2.1 and 5.2.3.

For the human translation of text 1, 6 respondents skipped this question altogether, which leads us to think that they did not encounter any bothersome mistakes. Another 8 respondents explicitly stated that they did not find anything irritating, although 1 person would have replaced 'hand-out' by 'dia'. However, he or she writes that this is a matter of style. The other 8 respondents mention the following aspects:

 the ellipsis,
 the sentence structure,
 the short sentences,
 the repetition of 'papier',
 the incoherence: lack of discourse markers,
 the insufficient cohesion,
 the lack of a comma between two verbs.

As for the first text produced by Google Translate, 7 respondents skipped this question and probably did not notice any errors. Next, 2 respondents wrote that they had not found any mistakes. The remaining 12 have the following remarks:

 the repetition:
o 'gerecycleerd papier',
o 'gerecycled papier',
o the same subject,
 the wrong translation of 'Noordrijn-Westfalen',
 the personifications: 'Het laat zien hoeveel…',
 the inconsistency: 'gerecycled papier' and 'gerecycleerd papier',
 the shortage of anaphora,
 the incoherence: lack of discourse markers,
 the absence of idiomaticity: 'gerecycled papier',
 the use of capitals: 'We gebruiken'.

Concerning the DeepL translation of text 1, 6 respondents skipped this question, but 1 of them had not filled in any of the text-specific questions for this text. We therefore cannot assume that this person did not find any bothersome mistakes; we can only assume this for the other 5 respondents who left this field blank. According to 2 respondents, the translation contained no notable mistakes, or no mistakes at all. The biggest concerns of the other 15 respondents were:
 the repetition:
o 'we gebruiken overal papier... we gebruiken papier',
o sentence structure,
o anaphora,
 the incoherence,
 the absence of idiomaticity:
o sentence structure,
o 'schrijf je in je notebooks',
o 'rol van toiletpapier',
 the wrong conjugation of 'gerecycleerd', namely 'gerecycled',
 the anglicisms: 'hand-outs', 'notebooks',
 the lack of capitals in a few sentences.


The fact that the respondents still had remarks about the human translation of text 1 may suggest that, already knowing the machine translations, we translated the texts too literally. An independent translator might have been a better choice to prevent this knowledge from influencing the results.

In total, 10 respondents skipped this question for the human translation of text 2, and 4 other respondents explicitly stated that they had not noticed any mistakes. Another person said that there had not been any mistakes, but that the number of very short sentences, often without discourse markers, bothered him or her a lot. Yet another respondent mentioned that he or she would adapt the order of the sentences, but that this was a personal matter. Concerns of the other respondents were:

 the comma before the linking word 'en',
 the construction 'Ik zou graag later',
 the lack of capitalisation: 'Universiteit', 'Noordoosten',
 the name of the field of study: 'Duits en onderwijs',
 the incoherence: lack of discourse markers,
 the sequence of short sentences.

Only 5 respondents skipped this particular question for the Google Translate text, which indicates a lower quality level. One sentence of this translated text clearly contains the largest mistakes, since no fewer than 12 respondents mentioned it. It is listed first in the enumeration. Several other mistakes were stated a couple of times as well: the second and third enumerated elements were brought up twice. A remark from 1 respondent is that the text in general clearly reads as a translation. The other 17 respondents found the following aspects bothersome:

 the sentence 'Het huis van mijn kamergenoot woont in Berlijn' (12),
 the incorrect punctuation marks: 'zuster. Agneta' (2),
 the incorrect articles: 'ik speel ook graag de mondharmonica' (2),
 the sequence of short sentences,
 the sentence structure,
 the incorrect anaphora: 'de uni… het is vrij klein' (subject does not correspond),
 the abbreviation of 'uni',
 the errors in style: the repetition of sentence structures beginning with 'ik',
 the constructions:
o 'slechts ongeveer',
o 'ik ben een student',
o 'ik studeer Duits en onderwijs',
o 'ik zou graag een Duitse leraar worden',
 the gender error: 'leraar' instead of 'lerares'.

There were 3 respondents who skipped this question concerning the DeepL translation. As with the previous text, this low number suggests that the quality of the text is rather inadequate. Once more, some of the complaints were repeated by several respondents: 10 respondents were irritated by the incorrect conjugation of verbs, 5 by incorrect phrases, 7 by incorrect grammatical constructions, 2 by repetition, 3 by abbreviations and 2 by wrong anaphora. This last element could, according to the explanation of the respondent, lead to a completely different meaning.
 the incorrect conjugation of verbs: 'ik kookt' (10),
 the phrase 'docent Duitser' (5),
 the incorrect grammatical constructions (7):
o 'Dat is veel plezier voor mij.',
o 'Dat bevalt me erg leuk.',
 the repetition in 'slechts ongeveer' (2),
 the abbreviation of 'uni' (3),
 the wrong anaphora used with (2):
o 'uni',
o 'Mijn huisgenoot woont in Berlijn en soms nemen we de trein naar Berlijn om haar familie te bezoeken.',
 the sequence of short sentences,
 the gender error: 'leraar' instead of 'lerares',
 the insufficient cohesion,
 the interference of the source language.

This question for the human translation of text 3 was skipped by 10 respondents. Of the 12 that filled it in, 4 respondents indicated that they had not encountered any notable mistakes. There were two items that were mentioned twice by the respondents who had noticed some: the long sentences and the phrase ‘plastic pistool’. The other aspects specified by the respondents were:

 the incoherence and text structure,
 the phrase 'de luchthaven van bestemming',
 the wrong anaphora: 'dit' used instead of 'dat' to refer back to something,
 the phrase 'vlak voor het bereiken van...',
 the overload of information in sentences,
 the incorrect separation of phrases: 'tangconstructie',
 the excessive use of 'en'.

The remarks were sometimes nuanced: 1 respondent remarked that readers had to read further along in the text to comprehend something mentioned earlier, and another suggested that the overload of information and the division of phrases that belong together ('tangconstructie') might be caused by the writing style of the original author.

As for the text produced by Google Translate, 6 respondents skipped the question and 2 wrote that they had found no notable mistakes. In this case multiple respondents mentioned the same mistakes as well. The mistakes indicated by the remaining 14 respondents were:

 the strange and illogical sentence structure (4),
 the incorrect tense (2): 'het plan is mislukt',
 the changing perspective of the story (3),
 the phrase 'een speelgoedpistool gebruiken',
 the incomplete sentences and absence of verbs (2),
 the sentences that started out with 'en' or 'maar' (2),
 the punctuation marks:
o too many commas,
o missing quotation marks,
 the word 'hijack',
 the very short sentences, which reduce readability,
 the very long sentences.

Even though 1 respondent had remarked on the sentence structure, he or she claimed that the context made the story clear enough, albeit with a second read.

The question for the DeepL translation of text 3 was skipped 6 times and there were 2 respondents who believed that there were no notable mistakes in the text. As with a lot of the texts before, various respondents were bothered by the same mistakes, mentioned in the enumeration below:

 the incomplete and grammatically incorrect sentences (5):
o 'Een speelgoedpistool gebruiken' (5),
o 'Hijack net voor het bereiken van de luchthaven van bestemming.' (3),
 the attributive error in 'Een jongeman dwingt de piloot van een geplande vlucht naar Oost-Berlijn om met een plastic pistool in West-Berlijn te landen' (2),
 the strange constructions that do not make sense (4),
 the sentence length (4):
o too short sentences,
o too long sentences,
 the unclear sentences (2),
 the incoherence and lack of cohesion (4):
o 'Een speelgoedpistool gebruiken',
o 'En het plan was succesvol. Hijack net voor het bereiken van de luchthaven van bestemming.',
 the incorrect tenses (2): 'Maar het plan is mislukt.',
 the changing perspective of the story,
 the absence of idiomaticity: 'Maar zo moest het niet gaan.',
 the incorrect use of discourse markers: '... en ... en ...' instead of '..., ... en ...'.

As anticipated in section 5.2.1.3, some of the mistakes that had not been mentioned there but that were listed in the error typology appear here. Many of the mistakes already mentioned in section 5.2.1 also return here.

5.3 General results for comprehension questions

The questionnaire also contained some comprehension questions, where respondents had to answer content questions, thus examining how much of the text respondents comprehended. As mentioned before in the methodology section, the translated texts received a score based on the correctness of the respondents’ answers. This score is calculated as follows:

(Correct answers × 1) + (Semi-correct answers × 0.5) + (Incorrect answers × 0) = total score

The total score was then divided by the number of respondents who answered the questions and rescaled to a score out of 5, since that is the maximum score a respondent can obtain if he or she answers all the questions correctly. The average comprehension scores for all the texts and translation methods can be found in the table below. In sections 5.4, 5.5 and 5.6 we discuss the scores per text in more detail.


                 Average comprehension score out of 5
Text 1
  HT                            3.40
  GT                            3.00
  DL                            2.40
Text 2
  HT                            2.35
  GT                            1.60
  DL                            2.60
Text 3
  HT                            3.10
  GT                            3.30
  DL                            3.50

Table 9: Average comprehension score per text and translation method

Against our expectations, the human translation only receives the highest average comprehension score once, for the first text. It does score better than Google Translate for the second text, but it ends up last for the third. As for the best of the two machine translations, DeepL comes first for the second and third text. This matches our hypothesis (see section 3).

It is noteworthy that the results for text 2 are a great deal lower than those for the other two texts. Perhaps that text was simply judged more harshly than the others, but it is more plausible that the second text was not translated very well in general. This was already suggested in section 5.2.1, since more respondents correctly labelled the machine translations as such for the second text than for the other translations. The third text shows the best results, something which was also suggested in the same section, because fewer respondents thought the texts by Google Translate and DeepL were machine translations and more respondents were correct in assuming the human translation was in fact made by a human translator.

A possible explanation for the results of the human translation might be the experiment setup. By not showing the text during the comprehension questions, more attention is paid to the respondents' retention than to the comprehensibility of the texts. It is also possible that more effort was required to read the texts containing mistakes, which is why the respondents remembered their content better.


We also studied whether the question forms used influence the results. The questions for text 1 can be categorised as follows: questions 1, 2 and 4 are wh-questions, question 3 is a literal question and question 5 a reorganisation question. Questions 2 and 4 have very similar scores, but question 1 has much lower results. Questions 3 and 5 are answered best for the human translation and worst for the DeepL translation and are generally answered rather well; the human translation and the Google Translate text receive the highest scores here. Text 2 only contains two types of questions: questions 1, 2, 3 and 5 are wh-questions and the fourth question is a literal question. The fourth question is answered better than the others: it receives the highest score out of the 5 for the human and DeepL translations. For text 3 only wh-questions are used. All this seems to imply that literal questions and reorganisation questions were answered better than wh-questions. However, as there were fewer questions of the first two categories in the questionnaire, this conclusion may simply be a coincidence.

5.4 Comprehension questions text 1

In this section we discuss the results of the comprehension questions for text 1. A more comprehensive discussion can be found in appendix IV, where we give a detailed overview of the comprehension questions for text 1 as an example of how we also analysed the other two texts.

The formula mentioned in section 5.3 is used here as well:

(Correct answers × 1) + (Semi-correct answers × 0.5) + (Incorrect answers × 0) = total score

The totals are calculated per question and per translation method. The maximum score equals the total number of respondents per question and per translation method. Since this differs between questions and translation methods, and a score out of 22, for example, is difficult to compare, the score was then rescaled to a score out of 5. All of this is presented in table 10 below. The counts are presented in the same order as the formula, showing the number of respondents who gave correct, semi-correct and incorrect answers. The same method is used in sections 5.5 and 5.6.
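As a worked illustration, the short sketch below applies this normalisation. It is a minimal sketch of the calculation described above, not the original analysis script; the example counts are taken from the row for question 3 of the human translation in table 10 further below.

def comprehension_score(correct, semi_correct, incorrect):
    """Weighted total (1 / 0.5 / 0 points) divided by the number of respondents,
    rescaled to a score out of 5."""
    respondents = correct + semi_correct + incorrect
    total = correct * 1 + semi_correct * 0.5 + incorrect * 0
    return total / respondents * 5

# Counts reported for question 3, human translation of text 1:
# 14 correct, 8 semi-correct, 0 incorrect out of 22 respondents.
print(round(comprehension_score(14, 8, 0), 2))  # about 4.09, i.e. the 4.10 in table 10 after rounding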

The questions and gold standard answers for the comprehension questions of text 1 are the following:

1. Question: What kind of atlas is presented in Berlin?
   Answer: A paper atlas (in Dutch: 'een papieratlas')
2. Question: What is needed to produce paper?
   Answer: Wood, water and energy
3. Question: How much energy is saved when three sheets of recycled paper are used?
   Answer: Enough to make a pot of coffee with
4. Question: Which three examples of paper products that are used at home and at school were mentioned in the text?
   Answer: Notebook, handouts, toilet paper
5. Question: Which city is the front-runner according to the atlas? Why?
   Answer: Aachen. This city uses only environmentally friendly paper

For all of the questions the human translation receives the highest score, except for question 2, where Google Translate gets a slightly better score. The human translation is followed by Google Translate and DeepL respectively. The first question has notably lower scores than the other questions. This might be partly attributed to the translation of 'paper atlas', which was wrongly translated as 'papieren atlas' by both machine translations instead of the correct 'papieratlas'. It should be noted that questions 2 and 4 may test the respondents' retention more than the comprehensibility of the texts themselves, since the answers to these questions are both enumerations. In section 2.2.3.1 Göpferich (2009) already warned us against this by saying that comprehensibility is not the same as retainability. For question 5, the better results for the human translation might have to do with the translation of the last sentence in the text. The source text uses ', and' as a linking word, which both machine translations rendered as 'en'. For the human translation, however, we chose to add more cohesion to the text and used 'want' (because) instead.

Question   Correct       Semi-correct     Incorrect      Total          Average score
           answer: 1     answer: 0.5      answer: 0      respondents    out of 5
Q1
  HT           2             16               4              22             2.25
  GT           4              9               8              21             2.00
  DL           6              6              11              23             1.95
Q2
  HT           7             15               0              22             3.30
  GT           7             14               0              21             3.35
  DL           4             18               1              23             2.85
Q3
  HT          14              8               0              22             4.10
  GT          11              7               3              21             3.45
  DL           6              9               8              23             2.30
Q4
  HT          10             11               1              22             3.50
  GT           7             14               0              21             3.35
  DL           7             13               3              23             2.95
Q5
  HT          11             11               0              22             3.75
  GT           4             15               2              21             2.75
  DL           2             14               7              23             1.95

Table 10: Comprehension questions text 1

The overall conclusion (based on the average scores for the five questions) that the human translation is ranked highest and DeepL lowest is not that surprising, since this ranking can also be seen in the clarity score (section 5.2.2) and in the number of respondents who found the text incomprehensible (section 5.2.3).

5.5 Comprehension questions text 2

These are the questions and gold standard answers for the comprehension questions of text 2:

1. Question: What does Heike like so much about Stralsund?
   Answer: She likes that everyone knows each other because it's so small.
2. Question: Who lives in Berlin?
   Answer: The family of Heike's roommate.
3. Question: What do Heike's parents do for a living?
   Answer: Her father is a banker and her mother is a teacher.
4. Question: Name two things that Heike likes to do in her spare time.
   Answer: Singing in the church choir, playing the harmonica, listening to rock music, shopping, cooking for friends
5. Question: What does Heike want to do for a job later and where?
   Answer: She wants to become a German teacher at an American university.

Unlike with the comprehension questions for text 1, the human translation does not yield the best results here: it only obtains the highest score for the second question. DeepL clearly scores better, with the highest score for all of the other questions. However, the human translation is rated lowest only once, for the third question. Google Translate ends up last for the other questions, at times with large differences: especially the second question gets a bad score, and the fourth and fifth questions have very low scores compared with those of the other translation methods.

A possible explanation for the low scores for question 2 is that the text contained several mentions of family members: a large number of respondents gave answers involving aunts, uncles, siblings or parents. If the text had been shown during the comprehension questions, the scores might have been a lot higher. Moreover, the poor results for the machine translations stem from the errors in translation (see section 4 and appendix V). As for question 3, this may be another one of those questions where the focus is more on the retainability of the respondents than on the actual comprehensibility of the text. A lot of the respondents only remembered the profession of one of her parents, and some attributed one parent’s profession to the other. Question 4 asks about Heike’s hobbies in her everyday life, but some respondents listed activities that she likes to do while travelling to Berlin. Again, this question might have been answered better if the text had been shown. For question 5 we anticipated that a lot of people would answer the question incorrectly because of the errors in the machine translations: “German teacher” was translated as “Duitse leraar” by Google Translate and as “docent Duitser” by DeepL, instead of “leraar Duits”. We expected that especially the translation by Google Translate would cause problems. Since we wanted to examine the influence of these errors, answers that only contained the aspect of “teacher” were labelled as incorrect. The Google Translate text indeed received a low score, but the translation error apparently did little or no harm to the DeepL text, since it obtained the best results.

Question   Correct (1)   Semi-correct (0.5)   Incorrect (0)   Total respondents   Average score out of 5
Q1   HT    5    10    6     21    2.38
     GT    4    10    8     22    2.05
     DL    7    11    3     21    2.98
Q2   HT    6    3     12    21    1.79
     GT    1    0     21    22    0.23
     DL    3    0     18    21    0.72
Q3   HT    5    6     10    21    1.91
     GT    7    9     6     22    2.62
     DL    8    8     5     21    2.86
Q4   HT    10   7     4     21    3.22
     GT    3    8     11    22    1.59
     DL    10   9     2     21    3.45
Q5   HT    7    6     8     21    2.38
     GT    6    1     15    22    1.48
     DL    10   5     6     21    2.98

Table 11: Comprehension questions text 2

DeepL scoring best here is rather surprising, since it received the lowest clarity score of all in section 5.2.2. The human translation received the highest score there, followed by Google Translate. However, DeepL showed fewer errors in the typology than Google Translate (section 4) and fewer respondents found its text incomprehensible than the Google Translate text, which lessened the surprise somewhat. Still, since the human translation is supposed to contain no errors at all and no respondents labelled it as incomprehensible, we had expected the human translation to perform best here.

5.6 Comprehension questions text 3

Below are the gold standard answers to the comprehension questions of text 3:

1. Question: Where did the escape of the three East Germans start? Answer: Gdansk, Poland.
2. Question: What was the original escape plan? Answer: Escape through Gdansk with a ferry while carrying fake documents.
3. Question: What weapon was used for the hijacking? Answer: A toy gun.
4. Question: How did the weapon get on board? Answer: Sabine, Ingrid Maron’s daughter, carried it with her as a toy.
5. Question: Who was Horst Fischer? Answer: Ingrid Maron's boyfriend at the time.

Again, DeepL has the best results for the comprehension questions, obtaining the highest score for three questions, while Google Translate and the human translation each achieve this only once. Surprisingly, the human translation scores lowest here, ending alone in last place three times. For the first question, the human translation shares last place with Google Translate, which also has one other question for which it receives the lowest score.

The formulation of question 1 was sometimes confusing to respondents, since they were not sure which escape was meant: the first, failed attempt with the ferry or the attempt with the plane. One respondent mentioned this explicitly and indicated both places; others answered with East Berlin. The high number of respondents whose answers were semi-correct for question 2 is due to the mention of false identification. However, these respondents did not mention the ferry, which was an essential part of the answer, and therefore their answers could not be fully approved. The DeepL translation rendered “Ingrid Maron's boyfriend at the time” as “toenmalige vriend van Ingrid Maron”, which may change the meaning of the sentence. In the Google Translate text and the human translation the definite article “de” is added to prevent this. Again, we expected this to influence the results for comprehension question 5 and thus did not fully approve answers that did not involve this aspect. Indeed, DeepL receives a lower score than the human translation; however, this is not the case with Google Translate.

Question   Correct (1)   Semi-correct (0.5)   Incorrect (0)   Total respondents   Average score out of 5
Q1   HT    12   0     10    22    2.73
     GT    12   0     10    22    2.73
     DL    14   0     8     22    3.18
Q2   HT    6    11    5     22    2.62
     GT    11   5     6     22    3.07
     DL    15   3     4     22    3.75
Q3   HT    18   0     4     22    4.09
     GT    21   0     1     22    4.78
     DL    18   1     3     22    4.21
Q4   HT    16   2     4     22    3.87
     GT    17   1     4     22    3.98
     DL    19   1     2     22    4.43
Q5   HT    6    7     9     22    2.16
     GT    6    5     11    22    1.93
     DL    2    14    6     22    2.05

Table 12: Comprehension questions text 3

For the clarity scores in section 5.2.2, we saw that the human translation scored best, followed by DeepL and Google Translate respectively. The comprehension question in section 5.2.3 showed that the human translation came out best there as well; DeepL was second and Google Translate was found most incomprehensible. DeepL scoring best on the comprehension questions and the human translation coming last is therefore quite surprising.

5.7 Linguists versus non-linguists

We tested the hypothesis ‘Language specialists are not more severe on translated texts than non-specialists’ (see section 3) by comparing how many linguists and non-linguists were able to recognise whether the text given was a human or a machine translation, what the average clarity score given by linguists and by non-linguists was, and whether approximately the same percentage of linguists and non-linguists comprehended everything in the texts. These percentages and scores are presented in table 13.

As mentioned before in section 5.1, four people did not fill out the profile questions; consequently, these respondents were not taken into account here. The results sometimes need to be nuanced: the percentages occasionally show large differences because each of them was calculated on approximately 11 respondents, so one diverging answer can already cause a large shift in a percentage.
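To illustrate this sensitivity with a hypothetical example: with roughly 11 respondents in a group, a single respondent corresponds to about 1/11 ≈ 9 percentage points, so 7 correct identifications instead of 6 already shifts a figure from roughly 55% to 64%. These numbers only illustrate the granularity of the percentages and are not values taken from table 13.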

We can see that the percentages of correct judgements of whether a text was a human or a machine translation do not show noticeable differences, except for T3_DL. If there were any differences, it would have been logical for the linguists to be stricter; for T3_DL, however, the linguists were not more often correct, and in fact many of them thought that the text was a human translation.

The average clarity scores do not differ much either: the scores given by linguists and non-linguists are largely the same. The differences are only large for T2_HT, T2_DL and T3_DL. For T2_HT and T2_DL the non-linguists award a higher average clarity score than the linguists; the opposite is true for T3_DL, where the linguists give a higher score than the non-linguists.

The percentages of respondents who comprehended everything do not demonstrate major differences either, with only T1_DL, T2_GT and T3_DL showing somewhat larger gaps. These gaps remain limited, and bearing in mind the low number of respondents on which the percentages are calculated, we can conclude that linguists and non-linguists rate the texts nearly the same for this parameter as well.

Text                   Human or machine translation: correct   Average clarity score   Comprehended everything
                       L        NL                              L      NL               L       NL
Text 1: Atlas
   HT                  57%      73%                             4.1    4.1              100%    100%
   GT                  82%      70%                             3.5    3.6              91%     100%
   DL                  70%      83%                             3.3    3.1              90%     67%
Text 2: Greifswald
   HT                  64%      70%                             3.9    4.4              100%    100%
   GT                  82%      78%                             3.5    3.4              91%     67%
   DL                  91%      100%                            3.1    3.8              91%     100%
Text 3: Escape GDR
   HT                  78%      85%                             3.9    4.0              100%    85%
   GT                  64%      55%                             3.2    3.1              55%     55%
   DL                  29%      73%                             4.0    3.2              86%     53%

Table 13: Linguists' (L) and non-linguists' (NL) responses


6 CONCLUSION AND DISCUSSION

In this dissertation we investigated the comprehensibility and perception of machine translations compared with human translations by means of a survey. Our approach was based on the research by Scarton & Specia (2016). For the survey we used three different texts that we translated ourselves and had translated by DeepL and Google Translate. Based on the content and text-specific questions, we obtained the results described below.

We found that the majority of respondents could correctly discern a human translation from a machine translation and that they usually relied on coherence, fluency, idiomaticity, clarity, sentence length and repetition to make that decision. The clarity scores show that the human translations consistently obtain the best scores. Respondents indicate that comprehension is best for the human translations and worst for Google Translate. The mistakes that bother readers most have to do with grammar, sentence length, level of idiomaticity and incoherence. Furthermore, the passages that caused comprehension problems for the respondents correspond with the mistakes in the applied error typology, since all of these passages contained mistakes listed there. Lastly, we found that language specialists are not more severe than non-specialists.

As for the comprehension questions, our results showed that the human translation was only rated best once, while DeepL proved to be the best for two of the three texts. This latter finding shows that the machine translations are comprehensible for the language pair English-Dutch and the machine translation tools Google Translate and DeepL, although respondents quoted certain passages that hindered comprehension. These results are similar to those found in Scarton & Specia (2016), where respondents did not perform better on the human-translated documents either.

There is no definite answer as to which machine translation tool performs better. The two tools score best in different areas: Google Translate receives a better clarity score for two of the three texts, while DeepL scores best for comprehension. However, the applied error typology showed that DeepL contained fewer errors for two of the three translations.

In hindsight, there were some aspects of this dissertation that could have been improved and that would be interesting to address in further research. Respondents’ answers showed that not all of them were aware that both of the translations they received could be machine translations. By leaving this possibility unmentioned, we may have slightly influenced the results. It might therefore have been better to mention this explicitly at the beginning of the questionnaire.


At times it was also difficult to approve or reject certain answers to the comprehension questions. The system in which an answer was given a score of 1, 0.5 or 0 sometimes seems inadequate, since there are more gradations in the correctness of an answer. The answers that received 0.5 were sometimes very different: one answer would be almost correct, while another was barely semi-correct. A system with more fine-grained scores, such as the one used in the research by Scarton & Specia (2016), might have been a better option.

It would also have been a good idea to have an independent, unbiased human translator translate the source texts. Since I had to select the texts that we were going to use for the questionnaire, I had unavoidably studied them alongside the machine translations before translating them myself. This may have somewhat influenced the translation style of the human translations. Appealing to a translator who had never seen the machine translations would have avoided this.


BIBLIOGRAPHY

Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72).

Daems, J., & Macken, L. (2013). Annotation Guidelines for English-Dutch Translation Quality Assessment, version 1.0. LT3 Technical Report-LT3 13.02. Retrieved from http://users.ugent.be/~jvdaems/TQA_guidelines_2.0.html

De Clercq, O., Hoste, V., Desmet, B., Van Oosten, P., De Cock, M., & Macken, L. (2014). Using the crowd for readability prediction. Natural Language Engineering, 20(3), 293–325.

DeepL — Vertaalkwaliteit. (n.d.). Retrieved 19 June 2018, from https://www.deepl.com/quality.html

DeepL Translator. (n.d.). Retrieved 21 December 2017, from https://www.deepl.com/translator

DePalma, D.A., Pielmeier, H., Lommel, A., & Stewart, R.G. (2017, June 30). Who’s Who in Language Services and Technology: 2017 Rankings. Retrieved 4 April 2018, from http://www.commonsenseadvisory.com/AbstractView/tabid/74/ArticleID/39815/Title/TheLanguageServicesMarket2017/Default.aspx

Dictionary by Merriam-Webster: America’s most-trusted online dictionary. (n.d.). Retrieved 11 April 2018, from https://www.merriam-webster.com/

Doherty, S., & Gaspari, F. (2013). Effective Post-Editing in Human & Machine Translation Workflows: Critical Knowledge & Techniques. Centre for Next Generation Localisation.

dossier taalverzorging renkema ccc-model | Genootschap Onze Taal. (2015, April). Retrieved 6 June 2018, from nieuws-en-dossiers/dossiers/taalverzorging/het-ccc-model

Forcada, M. L. (2017, February). Is machine translation research running around in circles? Leuven.

Garcia, I. (2015). Translators and social media: communicating in a connected world. Proceedings of the 23rd NZSTI National Conference, New Zealand Society of Translators and Interpreters: Communicating in a Connected World, Auckland, 21–22 June 2014, 1–9.

Google Translate. (n.d.). Retrieved 3 July 2018, from https://translate.google.com/

Google Translate - Apps on Google Play. (n.d.). Retrieved 19 June 2018, from https://play.google.com/store/apps/details?id=com.google.android.apps.translate&hl=en_US

Göpferich, S. (2009). Comprehensibility assessment using the Karlsruhe comprehensibility concept. The Journal of Specialised Translation, 11(2009), 31–52.

Görög, A. (2014b). Translation and quality: Editorial. Tradumatica, (12), 388–391.

Henderson, S. (2016, August 3). The Top 100 Language Service Providers: 2016. Retrieved 5 April 2018, from http://www.commonsenseadvisory.com/AbstractView/tabid/74/ArticleID/36544/Title/TheTop100LanguageServiceProviders2016/Default.aspx

Home : Oxford English Dictionary. (n.d.). Retrieved 11 April 2018, from http://www.oed.com/

Jiménez-Crespo, M. A. (2017). How much would you like to pay? Reframing and expanding the notion of translation quality through crowdsourcing and volunteer approaches. Perspectives, 25(3), 478–491. https://doi.org/10.1080/0907676X.2017.1285948

Jones, D., Gibson, E., Shen, W., Granoien, N., Herzog, M., Reynolds, D., & Weinstein, C. (2005). Measuring human readability of machine generated text: three case studies in speech recognition and machine translation. In Acoustics, Speech, and Signal Processing, 2005. Proceedings (ICASSP’05). IEEE International Conference on (Vol. 5, p. v–1009). IEEE.

Lesgold, A. M., Roth, S. F., & Curtis, M. E. (1979). Foregrounding effects in discourse comprehension. Journal of Verbal Learning and Verbal Behavior, 18(3), 291–308. https://doi.org/10.1016/S0022-5371(79)90164-6

Lommel, A. (2015). Multidimensional quality metrics (MQM) definition. Retrieved 5 August 2018, from http://www.qt21.eu/mqm-definition/definition-2015-12-30.html

Macken, L. (2017). Inleiding tot de vertaaltechnologie [Syllabus]. Ghent: Department of Translation, Interpreting and Communication, Ghent University.

Moerman. (2017). The understandability of machine-translated texts. Universiteit Gent.

O’Brien, S. (2012). Towards a dynamic quality evaluation model for translation. The Journal of Specialised Translation, 17, 55–77.

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (pp. 311–318). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from https://doi.org/10.3115/1073083.1073135

Peleman, E. (2017, November 6). De opvolger van Google Translate: de schrik van elke menselijke vertaler? Retrieved 19 June 2018, from https://epvertalingen.eu/opvolger-google-translate-deepl

Persinformatie. (n.d.). Retrieved 19 June 2018, from https://www.deepl.com/press.html

Piróth, A. (2011). Translation automation survey among translators. IAPTI (International Association of Professional Translators and Interpreters) Report.

Renkema, J. (2001). Undercover research into text quality as a tool for communication management. In Reading and Writing Public Documents: Problems, Solutions and Characteristics (pp. 37–57). Amsterdam: John Benjamins Publishing Company.

Scarton, C., & Specia, L. (2016). A Reading Comprehension Corpus for Machine Translation Evaluation. In LREC.

Schriver, K. A. (1989). Evaluating text quality: The continuum from text-focused to reader-focused methods. IEEE Transactions on Professional Communication, 32(4), 238–255.

Schwarz, M. N. K., & Flammer, A. (1981). Text structure and title—effects on comprehension and recall. Journal of Verbal Learning and Verbal Behavior, 20(1), 61–66. https://doi.org/10.1016/S0022-5371(81)90301-7

Snover, M., Madnani, N., Dorr, B. J., & Schwartz, R. (2009). Fluency, Adequacy, or HTER?: Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation (pp. 259–268). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1626431.1626480

Synonyms.net. (n.d.). Retrieved 11 April 2018, from https://www.synonyms.net/

Tan, C., Gabrilovich, E., & Pang, B. (2012). To Each His Own: Personalized Content Selection Based on Text Comprehensibility. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 233–242). New York, NY, USA: ACM. https://doi.org/10.1145/2124295.2124325

Van Brussel, L., Tezcan, A., & Macken, L. (2018). A Fine-grained Error Analysis of NMT, PBMT and RBMT Output for English-to-Dutch. In Eleventh International Conference on Language Resources and Evaluation (pp. 3799–3804). European Language Resources Association (ELRA).

What is Machine Translation? Rule Based vs. Statistical | SYSTRAN. (n.d.). Retrieved 9 June 2018, from http://www.systransoft.com/systran/translation-technology/what-is-machine-translation/

Why DeepL Got into Machine Translation and How It Plans to Make Money. (2017, October 19). Retrieved 19 June 2018, from https://slator.com/technology/deepl-got-machine-translation-plans-make-money/

APPENDIX

Appendix I: Instructions questionnaire

Beste deelnemer

Zo meteen krijg je twee korte teksten te zien. Deze teksten zijn ofwel machinevertalingen ofwel menselijke vertalingen.

Lees de teksten zorgvuldig en vul dan de vragen in. Bij de eerste reeks vragen (inhoudsvragen) worden de teksten niet meer getoond. Lees daarom de teksten zo aandachtig mogelijk.

Aan het eind van de vragenlijst volgen nog enkele vragen om een profiel van jou als respondent te kunnen schetsen.

De vragen worden anoniem verwerkt, en het invullen van de vragenlijst zal maximum een kwartier in beslag nemen. Vragen gemarkeerd met een asterisk (*) moeten worden beantwoord.

Alvast bedankt!

Iris Ghyselen

Mocht je vragen hebben met betrekking tot deze vragenlijst of meer willen weten over de resultaten van dit onderzoek naar automatische vertalingen, dan kan je een mail sturen naar [email protected] (masterstudente Vertalen UGent).

Appendix II: Translations

T1_HT

Een nieuwe atlas toont welke steden milieuvriendelijk papier gebruiken. Op school schrijf je in schriften, leerkrachten drukken een heleboel hand-outs af en in je toilet hangt een rol wc- papier … Overal gebruiken we papier. Er wordt onder andere hout gebruikt om papier te produceren. Als we het aantal bomen dat we omhakken om papier te produceren zo laag mogelijk willen houden moeten we spaarzaam omgaan met papier. Bovendien bespaart gerecycleerd papier energie. Milieuactivisten eisen dan ook dat we enkel dat soort papier gebruiken. Gerecycleerd papier gemaakt van oud papier is beter voor het milieu. Je hebt minder hout nodig om het te produceren en ook minder water en energie. Drie vellen gerecycleerd papier alleen al besparen genoeg energie om een pot koffie mee te zetten. De stad Aken in Noordrijn-Westfalen bespaart heel wat energie. Dat wordt aangegeven in een nieuwe papieratlas die vandaag in Berlijn wordt voorgesteld. Die atlas laat zien hoeveel gerecycleerd

papier er wordt gebruikt in de grotere steden in Duitsland. Aken staat bovenaan de lijst, want daar gebruiken ze enkel milieuvriendelijk papier.

T1_GT

Een nieuwe atlas laat zien welke steden milieuvriendelijk papier gebruiken. Op school schrijf je in je notitieboekjes, de leraar drukt veel hand-outs af en een rol wc-papier hangt in je badkamer ... we gebruiken overal papier. Hout wordt gebruikt om papier te maken. Om het aantal bomen dat wordt gekapt zo laag mogelijk te houden, moeten we spaarzaam met papier omgaan. Gerecycled papier bespaart energie. Milieuactivisten eisen ook dat we alleen gerecycled papier gebruiken. Gerecycled papier van oud papier is beter voor het milieu. U hebt minder hout nodig om gerecycleerd papier te produceren en ook minder water en energie. Slechts drie stukjes gerecycled papier besparen genoeg energie om een pot koffie te maken. De stad Aken in Noordrijn-Westfalia bespaart veel energie. Dit wordt aangegeven in een nieuwe papieren atlas die vandaag in Berlijn werd geïntroduceerd. Het laat zien hoeveel gerecycleerd papier wordt gebruikt in de grotere steden in Duitsland. Aken staat bovenaan de lijst en gebruikt alleen milieuvriendelijk papier.

T1_DL

Een nieuwe atlas laat zien welke steden milieuvriendelijk papier gebruiken. Op school schrijf je in je notebooks, de leraar drukt veel hand-outs af en er hangt een rol van toiletpapier in je badkamer.... we gebruiken overal papier... we gebruiken papier. Hout wordt gebruikt voor de productie van papier. Om het aantal bomen dat wordt gekapt zo laag mogelijk te houden, moeten we papier spaarzaam gebruiken. Gerecycled papier bespaart energie. Ook milieubeschermers eisen dat we alleen gerecycled papier gebruiken. Gerecycled papier van oud papier is beter voor het milieu. U heeft minder hout nodig om gerecycled papier te produceren en ook minder water en energie. Slechts drie stuks gerecycled papier besparen genoeg energie om een pot koffie te maken. De stad Aken in Noordrijn-Westfalen bespaart veel energie. Dit wordt aangegeven in een nieuwe papieren atlas die vandaag in Berlijn is geïntroduceerd. Het laat zien hoeveel gerecycled papier er in de grotere steden in Duitsland wordt gebruikt. Aken staat bovenaan de lijst en gebruikt alleen milieuvriendelijk papier.

T2_HT

Hallo! Mijn naam is Heike Kron en ik ben student. Ik studeer Duits en onderwijs aan de Universiteit van Greifswald. Ik zou graag later als docent Duits aan de slag gaan aan een


Amerikaanse universiteit. Greifswald is een leuke stad in het noorden van Duitsland. De universiteit is redelijk klein. Er zijn slechts 12.000 studenten. In de stad staan er veel oude huizen. De Baltische Zee is heel dichtbij en er is een riviertje genaamd de Ryk. In mijn vrije tijd ga ik graag winkelen in het mooie stadscentrum van Greifswald. Dat vind ik heel leuk. Ik speel ook graag harmonica en vind het fijn om de hele dag naar rockmuziek te luisteren. Daarnaast zing ik in het kerkkoor ’Dicke Marie’. Ik kook graag voor mijn vrienden. Mijn familie woont in Stralsund, een stadje in het noordoosten van Duitsland aan de Baltische kust. Iedereen kent elkaar in kleine steden. Dat vind ik erg leuk. Mijn vader werkt in een bank en mijn moeder is leerkracht. Ik heb een broer, Hans-August, en een zus, Agneta. De familie van mijn kamergenoot woont in Berlijn, en soms nemen we de trein naar daar om haar familie te bezoeken. In Berlijn gaan we dan naar museums of restaurants of gaan we zelfs dansen en winkelen. Soms breng ik mijn zomervakantie door in Eutin. Ik bezoek er dan mijn oom en tante die daar wonen. Verder heb ik geen andere familieleden.

T2_GT

Hallo! Mijn naam is Heike Kron en ik ben een student. Ik studeer Duits en onderwijs aan de universiteit van Greifswald. Ik zou graag een Duitse leraar worden aan een Amerikaanse universiteit. Greifswald is een leuke stad in het noorden van Duitsland. De uni is vrij klein. Het heeft slechts ongeveer 12.000 studenten. De stad heeft veel oude huizen. De Oostzee is heel dichtbij en er is een kleine rivier genaamd de Ryk. In mijn vrije tijd ga ik graag winkelen in het mooie centrum van Greifswald. Dat is erg leuk voor mij. Ik speel ook graag de mondharmonica en luister de hele dag naar rockmuziek. Ik zing ook in het kerkkoor "Dicke Marie". Ik kook graag voor mijn vrienden. Mijn familie woont in Stralsund, een klein stadje in het noordoosten van Duitsland aan de Baltische kust. Iedereen kent elkaar in kleine steden. Ik vind dat erg leuk. Mijn vader werkt voor een bank en mijn moeder is een leraar. Ik heb een broer, Hans-augustus en een zuster. Agneta. Het huis van mijn kamergenoot woont in Berlijn en soms nemen we de trein naar Berlijn om haar familie te bezoeken. In Berlijn gaan we naar musea of restaurants of zelfs dansen en winkelen. Soms breng ik mijn zomervakantie door in Eutin. Ik bezoek mijn tante en oom die daar wonen. Ik heb geen andere familieleden anders.

T2_DL

Hallo! Mijn naam is Heike Kron en ik ben student. Ik studeer Duits en onderwijs aan de Universiteit van Greifswald. Ik zou graag docent Duitser willen worden aan een Amerikaanse universiteit. Greifswald is een leuke stad in Noord-Duitsland. De uni is vrij klein. Het heeft

slechts ongeveer 12.000 studenten. De stad heeft veel oude huizen. De Oostzee is zeer dichtbij en er is een riviertje genaamd de Ryk. In mijn vrije tijd ga ik graag winkelen in het mooie centrum van Greifswald. Dat is veel plezier voor mij. Ik speel ook graag de harmonica en luister de hele dag naar rockmuziek. Ik zing ook in het kerkkoor "Dicke Marie". Ik kookt graag voor mijn vrienden. Mijn familie woont in Stralsund, een klein stadje in het noordoosten van Duitsland aan de Baltische kust. Iedereen kent elkaar in kleine stadjes. Dat bevalt me erg leuk. Mijn vader werkt voor een bank en mijn moeder is leraar. Ik heb één broer, Hans-August, en één zus. Agneta. Mijn huisgenoot woont in Berlijn en soms nemen we de trein naar Berlijn om haar familie te bezoeken. In Berlijn gaan we naar musea of restaurants of zelfs dansen en winkelen. Soms breng ik mijn zomervakantie door in Eutin. Ik bezoek mijn oom en tante die er wonen. Ik heb geen andere familieleden.

T3_HT

Het is waarschijnlijk de meest tot de verbeelding sprekende ontsnapping in de geschiedenis van de DDR: een jonge man die met een plastic pistool een piloot van een lijnvlucht naar Oost- Berlijn dwingt om in West-Berlijn te landen. Maar zo was het niet gepland. Toen Detlef Alexander Tiede en zijn beste vriendin Ingrid Maron en haar dochter Sabine in augustus 1978 naar Gdansk in Polen reden, hadden ze maar één doel voor ogen: van daar ontsnappen naar het westen met de veerboot. Horst Fischer, de toenmalige vriend van Ingrid Maron die in West- Berlijn woonde, wilde hen helpen. Hij zorgde voor valse papieren en probeerde de documenten naar Gdansk te smokkelen. Maar het plan mislukte. De drie voortvluchtigen wachtten tevergeefs op Horst Fischer op de afgesproken ontmoetingsplaats: hij was onderweg betrapt met de valse identiteitskaarten en gearresteerd door de staatspolitie. Tiede, Maron en dochter Sabine zaten vast in Gdansk. Op een gegeven moment kwamen ze op het idee om een vliegtuig met bestemming Berlin-Schönefeld in Oost-Berlijn te kapen en de piloten te dwingen om te landen in West-Berlijn op de luchthaven Berlin-Tempelhof. En dit met behulp van een speelgoedpistool. Ingrid Maron herinnert zich nog hoe haar dochter het pistool mee aan boord moest nemen. De douanebeambte vond het pistool natuurlijk tijdens een veiligheidscontrole, maar gaf het terug aan de dochter. En het plan lukte. Het vliegtuig werd gekaapt vlak voor het bereiken van de luchthaven van bestemming.

T3_GT

Het is hoogstwaarschijnlijk de meest ongewone ontsnapping in de geschiedenis van de DDR: een jonge man dwingt de piloot van een geplande vlucht naar Oost-Berlijn om met een plastic

geweer in West-Berlijn te landen. Maar zo was het niet de bedoeling. Toen Detlef Alexander Tiede en zijn beste vriendin Ingrid Maron en haar dochter Sabine in augustus 1978 naar Gdansk, Polen, reden, hadden ze maar één doel: om met de veerboot naar het westen te vluchten. Horst Fischer, de toenmalige vriend van Ingrid Maron die in West-Berlijn woonde, wilde hen helpen. Hij verkreeg valse identificatie en probeerde de documenten naar Gdansk te smokkelen. Maar het plan is mislukt. De drie voortvluchtigen wachtten tevergeefs op de overeengekomen ontmoetingsplaats voor Horst Fischer: hij werd onderweg betrapt met de valse identiteitskaarten en gearresteerd door de staatspolitie. Tiede, Maron en dochter Sabine zaten vast in Gdansk. Op een gegeven moment kwamen we op het idee om een vliegtuig te kapen dat verondersteld werd te landen op de Oost-Duitse luchthaven Berlijn-Schönefeld en het te laten landen op de luchthaven Berlijn-Tempelhof. Een speelgoedpistool gebruiken. Mijn dochter moest het pistool in het vliegtuig dragen, herinnert Ingrid Maron zich. En een douanebeambte vond het pistool natuurlijk tijdens een veiligheidscontrole. Ze gaf het terug aan mijn dochter, legt Ingrid Maron uit. En het plan was succesvol. Hijack vlak voor het bereiken van de luchthaven van bestemming.

T3_DL

Het is waarschijnlijk de meest ongebruikelijke ontsnapping in de geschiedenis van de DDR: Een jongeman dwingt de piloot van een geplande vlucht naar Oost-Berlijn om met een plastic pistool in West-Berlijn te landen. Maar zo moest het niet gaan. Toen Detlef Alexander Tiede en zijn beste vriendin Ingrid Maron en haar dochter Sabine in augustus 1978 naar Gdansk in Polen reden, hadden ze maar één doel: van daaruit per veerboot naar het westen te ontsnappen. Horst Fischer, toenmalige vriend van Ingrid Maron, die in West-Berlijn woonde, wilde hen helpen. Hij verkreeg valse identificatie en probeerde de documenten naar Gdansk te smokkelen. Maar het plan is mislukt. De drie voortvluchtigen wachtten tevergeefs op de afgesproken ontmoetingsplek voor Horst Fischer: hij werd onderweg met de valse identiteitskaarten opgepakt en door de staatspolitie gearresteerd. Tiede, Maron en dochter Sabine zaten vast in Gdansk. Op een gegeven moment kwamen we met het idee om een vliegtuig te kapen dat op de Oost-Duitse luchthaven Berlin-Schönefeld zou moeten landen en het te laten landen op de West-Berlijnse luchthaven Berlin-Tempelhof. Een speelgoedpistool gebruiken. Mijn dochter moest het pistool op het vliegtuig dragen, herinnert Ingrid Maron zich. En natuurlijk vond een douanebeambte het wapen tijdens een veiligheidscontrole. Ze gaf het terug aan mijn dochter, legt Ingrid Maron uit. En het plan was succesvol. Hijack net voor het bereiken van de luchthaven van bestemming.


Appendix III: Questions questionnaire

Content questions

T1

1. What kind of atlas is presented in Berlin?
2. What is needed to produce paper?
3. How much energy is saved when three sheets of recycled paper are used?
4. Which three examples of paper products that are used at home and at school were mentioned in the text?
5. Which city is the front-runner according to the atlas? Why?

T2

1. What does Heike like so much about Stralsund?
2. Who lives in Berlin?
3. What do Heike’s parents do for a living?
4. Name two things that Heike likes to do in her spare time.
5. What does Heike want to do for a job later and where?

T3

1. Where did the escape of the three East Germans start?
2. What was the original escape plan?
3. What weapon was used for the hijacking?
4. How did the weapon get on board?
5. Who was Horst Fischer?

Text-specific questions

1. In your opinion, is this a machine translation or a human translation and why?
2. Give a score out of 5 for this text, 1 being very unclear and 5 very clear.
3. Did you understand everything from the text?
4. If you answered ‘no’ to the previous question, indicate which passages were incomprehensible to you by copying the sentence (or phrase).
5. Which errors were most bothering to you, if any?

Profile questions


1. Age
2. Gender identity which you identify with most
3. Highest obtained degree or current studies
4. Have you ever used machine translations before yourself (Google Translate, DeepL, …)? What did you think of this experience?

Appendix IV: Comprehensive discussion of comprehension questions text 1

Q1: What kind of atlas is presented in Berlin?

For this question, answers including all of the following aspects were completely approved:

- Recycled paper
- Consumption
- Concerning several (German) cities

If one aspect was mentioned, but the word ‘paper’ was left out, the answer was not approved at all, even though it would normally receive a 0.5. This was done because ‘paper’ is a very essential part of the correct answer. If ‘milieuvriendelijk papier’ (environmentally friendly paper) was used instead of ‘gerecycleerd papier’, this was seen as correct. ‘Papieratlas’ was approved, but ‘papieren atlas’ was rejected because this has a different meaning (atlas made of paper instead of atlas describing paper). The second answer was only encountered once with the human translation, but it was much more common with the machine translations, as a consequence of the incorrect translations by Google Translate and DeepL.

          Correct (1)          Semi-correct (0.5)     Incorrect (0)
          Number    %          Number    %            Number    %         Total
HT        2         9%         16        73%          4         18%       22
GT        4         19%        9         43%          8         38%       21
DL        6         26%        6         26%          11        48%       23

Table 14: Correct answers T1 Atlas Q1

In the table above, the numbers and percentages of correct, semi-correct and wrong answers per translation can be found. Surprisingly, the human translation scores worst on the completely correct answers, bested by both Google Translate and DeepL. However, it has the highest number of semi-correct answers and the lowest number of wrong answers. If we combine correct and semi-correct answers, moreover, the human translation scores better than the machine translations. If we compare the machine translations, we see that DeepL has the higher number and percentage of completely correct answers of the two, but it has fewer semi-correct answers and more incorrect answers. Furthermore, with the combined 1 and 0.5 scores, Google Translate scores better than DeepL.
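As a worked check of those combined scores: adding correct and semi-correct answers gives the human translation 18 out of 22 (roughly 82%), Google Translate 13 out of 21 (roughly 62%) and DeepL 12 out of 23 (roughly 52%), which matches the ranking described above.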

Q2: What is needed to produce paper?

The answer to this question contains an enumeration of three aspects: wood, water and energy. A score of 1 is attributed to answers with all three aspects, 0.5 to answers with one or two and 0 to incorrect answers.

          Correct (1)          Semi-correct (0.5)     Incorrect (0)
          Number    %          Number    %            Number    %         Total
HT        7         32%        15        68%          0         0%        22
GT        7         33%        14        67%          0         0%        21
DL        4         17%        18        78%          1         4%        23

Table 15: Correct answers T1 Atlas Q2

This question was answered better than the first one, with only one person not getting at least one aspect right (see table 15). The scores for the human translation and the one produced by Google Translate are nearly the same. DeepL achieves the highest number of semi-correct answers, but scores lower on the completely correct answers and even has one respondent with an incorrect answer. This matches the results found earlier for the clarity score and comprehension: there, DeepL also obtained the worst scores, with an average clarity score of 3.2 compared to 4.1 and 3.5 for the human translation and Google Translate respectively, and 5 people not comprehending everything as opposed to 0 and 1. However, as the answer to this question is an enumeration, it may test the retention capability of respondents more than the comprehensibility of the text itself.

Q3: How much energy is saved when three sheets of recycled paper are used?

The answer to this question is ‘enough to make a pot of coffee with’. Answers that included ‘cup of coffee’ instead of ‘pot of coffee’ were marked as semi-correct and received a 0.5.

          Correct (1)          Semi-correct (0.5)     Incorrect (0)
          Number    %          Number    %            Number    %         Total
HT        14        64%        8         36%          0         0%        22
GT        11        52%        7         33%          3         14%       21
DL        6         26%        9         39%          8         35%       23

Table 16: Correct answers T1 Atlas Q3

As table 16 shows, the human translation is the only one with exclusively correct or semi-correct answers. Both machine translations have incorrect answers, with DeepL scoring lower than Google Translate with 8 incorrect answers to 3. The number of semi-correct answers is roughly the same, with DeepL having the largest number and Google Translate the lowest. The human translation obtains the highest number of respondents with a correct answer, followed by Google Translate; the latter has almost twice as many correct answers as DeepL. This is not that surprising, since the respondents rated the machine translation by Google Translate better than the one by DeepL and 5 respondents found the DeepL translation incomprehensible at times, while only 1 respondent thought that was the case for Google Translate (see sections 5.2.2 and 5.2.3).

Q4: Which three examples of paper products that are used at home and at school were mentioned in the text?

The answer here is again an enumeration. The following elements were necessary to form a correct answer:

- Notebooks
- Hand-outs
- Roll of toilet paper

          Correct (1)          Semi-correct (0.5)     Incorrect (0)
          Number    %          Number    %            Number    %         Total
HT        10        45%        11        50%          1         5%        22
GT        7         33%        14        66%          0         0%        21
DL        7         30%        13        57%          3         13%       23

Table 17: Correct answers T1 Atlas Q4

Table 17 shows that the human translation has the highest number of respondents with a completely correct answer. The number of completely correct answers is the same for the two machine translations, although the percentage for Google Translate is higher. However, the human translation scores lower than Google Translate for the incorrect answers, with one respondent as opposed to none, and DeepL in turn scores lower than the human translation, with three incorrect answers. The semi-correct answers are led by Google Translate, followed by DeepL and the human translation respectively.


Q5: Which city is the front-runner according to the atlas? Why?

This question is twofold and should receive two different answers. The first answer is simply ‘Aken’ (Aachen) and the second should contain the following elements:

- The city uses
- Only
- Recycled paper

Since ‘only’ is an essential part of the correct answer, it was not sufficient here to mention simply one aspect to obtain a 0.5 score. If ‘only’ was not present, the answer was seen as incorrect.

          Correct (1)          Semi-correct (0.5)     Incorrect (0)
          Number    %          Number    %            Number    %         Total
HT        11        50%        11        50%          0         0%        22
GT        4         19%        15        71%          2         10%       21
DL        2         9%         14        61%          7         30%       23

Table 18: Correct answers T1 Atlas Q5

The human translation definitely obtains the best answers to this question: half of the respondents gave a partially correct answer and the other half a completely correct answer. Again, Google Translate scores better than DeepL, with 2 more correct answers, 1 more semi-correct answer and only 2 incorrect answers, as opposed to the 7 for DeepL.

If we simply look at the first answer, we get a different table:

          Correct               Incorrect
          Number    %           Number    %         Total
HT        22        100%        0         0%        22
GT        19        90%         2         10%       21
DL        16        70%         7         30%       23

Table 19: Correct answers to first question T1 Atlas Q5

We can see that the column of incorrect answers has remained unchanged. The first column shows that the human translation has only correct answers, followed by Google Translate with 19 correct answers and DeepL with 16 correct answers.


After studying the second answer some more, we came up with the table below. If one of the three aspects was mentioned here, the answer was judged as semi-correct.

- The city uses
- Only
- Recycled paper

          Correct (1)          Semi-correct (0.5)     Incorrect (0)
          Number    %          Number    %            Number    %         Total
HT        11        50%        2         9%           9         41%       22
GT        4         19%        2         10%          15        71%       21
DL        2         9%         4         17%          17        74%       23

Table 20: Correct answers to second question T1 Atlas Q5

The first column is the same as in table 18. DeepL has the highest number of semi-correct answers, with 4 as opposed to 2 for both the human translation and Google Translate. However, the human translation scores better overall, with the highest number of correct answers and the lowest number of incorrect answers, and Google Translate has a lower number of incorrect answers than DeepL.

The better results with the human translation might be attributed to the translation of the last sentence in the text. The source text uses ‘, and’ as a linking word, which both machine translations interpreted as ‘en’. For the human translation, however, we chose to add more cohesion to the text and used ‘want’ (because) instead.

Appendix V: Applied error typology

T1_GT

Een nieuwe atlas laat zien welke steden milieuvriendelijk papier gebruiken. Op school schrijf je in je notitieboekjes, de leraar drukt veel hand-outs af en een rol wc-papier hangt in je badkamer ... we gebruiken overal papier. Hout wordt gebruikt om papier te maken. Om het aantal bomen dat wordt gekapt zo laag mogelijk te houden, moeten we spaarzaam met papier omgaan. Gerecycled papier bespaart energie. Milieuactivisten eisen ook dat we alleen gerecycled papier gebruiken. Gerecycled papier van oud papier is beter voor het milieu. U hebt minder hout nodig om gerecycleerd papier te produceren en ook minder water en energie. Slechts drie stukjes gerecycled papier besparen genoeg energie om een pot koffie te maken. De stad Aken in Noordrijn-Westfalia bespaart veel energie. Dit wordt aangegeven in een nieuwe

papieren atlas die vandaag in Berlijn werd geïntroduceerd. Het laat zien hoeveel gerecycleerd papier wordt gebruikt in de grotere steden in Duitsland. Aken staat bovenaan de lijst en gebruikt alleen milieuvriendelijk papier.

Acceptability
  Spelling & typos
    Capitalization: Badkamer … we
  Style & register
    Register (BE-NL): Gerecycled papier (5)
    Untranslated: Westfalia; Gerecycled
    Repetition: Papier
  Coherence
    Inconsistency: Gerecycled / gerecycleerd; U/je
Adequacy
  Word sense disambiguation: Geïntroduceerd <-> voorgesteld
  Part of Speech: Papieren atlas
  Meaning shift caused by other: Notitieboekjes

Table 21: Error typology T1_GT

T1_DL

Een nieuwe atlas laat zien welke steden milieuvriendelijk papier gebruiken. Op school schrijf je in je notebooks, de leraar drukt veel hand-outs af en er hangt een rol van toiletpapier in je badkamer.... we gebruiken overal papier... we gebruiken papier. Hout wordt gebruikt voor de productie van papier. Om het aantal bomen dat wordt gekapt zo laag mogelijk te houden, moeten we papier spaarzaam gebruiken. Gerecycled papier bespaart energie. Ook milieubeschermers eisen dat we alleen gerecycled papier gebruiken. Gerecycled papier van oud papier is beter voor het milieu. U heeft minder hout nodig om gerecycled papier te produceren en ook minder water en energie. Slechts drie stuks gerecycled papier besparen genoeg energie om een pot koffie te maken. De stad Aken in Noordrijn-Westfalen bespaart veel energie. Dit wordt aangegeven in een nieuwe papieren atlas die vandaag in Berlijn is geïntroduceerd. Het laat zien hoeveel gerecycled papier er in de grotere steden in Duitsland wordt gebruikt. Aken staat bovenaan de lijst en gebruikt alleen milieuvriendelijk papier.

Acceptability
  Grammar & syntax
    Superfluous word/constituent: Rol van toiletpapier; We gebruiken overal papier… we gebruiken papier
  Lexicon
    Wrong collocation: Stuks <-> vellen
  Spelling & typos
    Capitalization: badkamer… : we
  Style & register
    Register (BE-NL): Gerecycled papier
    Untranslated: Notebooks
    Repetition: Gerecycled papier
    Other: Nominalisation: Voor de productie van papier
  Coherence
    Inconsistency: U/je
Adequacy
  Word sense disambiguation: Geïntroduceerd <-> voorgesteld
  Part of Speech: Papieren atlas
  Meaning shift caused by misplaced word: Ook milieubeschermers (< Environmentalists also demand that)

Table 22: Error typology T1_DL

(5) See https://onzetaal.nl/taaladvies/recyclen-recycleren

T2_GT

Hallo! Mijn naam is Heike Kron en ik ben een student. Ik studeer Duits en onderwijs aan de universiteit van Greifswald. Ik zou graag een Duitse leraar worden aan een Amerikaanse universiteit. Greifswald is een leuke stad in het noorden van Duitsland. De uni is vrij klein. Het heeft slechts ongeveer 12.000 studenten. De stad heeft veel oude huizen. De Oostzee is heel dichtbij en er is een kleine rivier genaamd de Ryk. In mijn vrije tijd ga ik graag winkelen in het mooie centrum van Greifswald. Dat is erg leuk voor mij. Ik speel ook graag de mondharmonica en luister de hele dag naar rockmuziek. Ik zing ook in het kerkkoor "Dicke Marie". Ik kook graag voor mijn vrienden. Mijn familie woont in Stralsund, een klein stadje in het noordoosten van Duitsland aan de Baltische kust. Iedereen kent elkaar in kleine steden. Ik vind dat erg leuk. Mijn vader werkt voor een bank en mijn moeder is een leraar. Ik heb een broer, Hans-augustus en een zuster. Agneta. Het huis van mijn kamergenoot woont in Berlijn en soms nemen we de trein naar Berlijn om haar familie te bezoeken. In Berlijn gaan we naar musea of restaurants of zelfs dansen en winkelen. Soms breng ik mijn zomervakantie door in Eutin. Ik bezoek mijn tante en oom die daar wonen. Ik heb geen andere familieleden anders.

Acceptability
  Grammar & syntax
    Article: Ik ben een student; Ik speel ook graag de mondharmonica; Mijn moeder is een leraar
    Article-noun agreement: De uni… Het <-> ze
    Structure: Gaan we naar musea of restaurants of zelfs dansen en winkelen (no contraction of verbs with adpositional phrase and verb phrase)
  Lexicon
    Wrong collocation: Dat is erg leuk voor mij.
    Word non-existent: Uni
  Spelling & typos
    Capitalization: universiteit van Greifswald <-> Universiteit
    Punctuation: Zuster. Agneta.
  Style & register
    Repetition: Geen andere familieleden anders
    Other: Personification: De stad heeft
Adequacy
  Word sense disambiguation: Hans-augustus
  Meaning shift caused by misplaced word: Een Duitse leraar
  Deletion: Luister (< I also like to play the harmonica and listen to rock music all day.)
  Meaning shift caused by other: Het huis van mijn kamergenoot woont in Berlijn

Table 23: Error typology T2_GT

T2_DL

Hallo! Mijn naam is Heike Kron en ik ben student. Ik studeer Duits en onderwijs aan de Universiteit van Greifswald. Ik zou graag docent Duitser willen worden aan een Amerikaanse universiteit. Greifswald is een leuke stad in Noord-Duitsland. De uni is vrij klein. Het heeft slechts ongeveer 12.000 studenten. De stad heeft veel oude huizen. De Oostzee is zeer dichtbij en er is een riviertje genaamd de Ryk. In mijn vrije tijd ga ik graag winkelen in het mooie centrum van Greifswald. Dat is veel plezier voor mij. Ik speel ook graag de harmonica en luister de hele dag naar rockmuziek. Ik zing ook in het kerkkoor "Dicke Marie". Ik kookt graag voor mijn vrienden. Mijn familie woont in Stralsund, een klein stadje in het noordoosten van Duitsland aan de Baltische kust. Iedereen kent elkaar in kleine stadjes. Dat bevalt me erg leuk. Mijn vader werkt voor een bank en mijn moeder is leraar. Ik heb één broer, Hans-August, en één zus. Agneta. Mijn huisgenoot woont in Berlijn en soms nemen we de trein naar Berlijn om haar familie te bezoeken. In Berlijn gaan we naar musea of restaurants of zelfs dansen en winkelen. Soms breng ik mijn zomervakantie door in Eutin. Ik bezoek mijn oom en tante die er wonen. Ik heb geen andere familieleden.

Acceptability
  Grammar & syntax
    Article: Ik speel ook graag de harmonica
    Verb form: Ik kookt graag
    Article-noun agreement: De uni… Het <-> ze
    Structure: Gaan we naar musea of restaurants of zelfs dansen en winkelen (no contraction of verbs with adpositional phrase and verb phrase)
  Lexicon
    Wrong collocation: Dat is veel plezier voor mij; Dat bevalt me erg leuk (‘contaminatie’)
    Word non-existent: Uni
  Spelling & typos
    Punctuation: Zus. Agneta.
  Style & register
    Other: Personification: De stad heeft
Adequacy
  Deletion: Mijn huisgenoot woont in Berlijn (‘familie’ left out); Luister (< I also like to play the harmonica and listen to rock music all day.)
  Meaning shift caused by other: docent Duitser

Table 24: Error typology T2_DL

T3_GT

Het is hoogstwaarschijnlijk de meest ongewone ontsnapping in de geschiedenis van de DDR: een jonge man dwingt de piloot van een geplande vlucht naar Oost-Berlijn om met een plastic geweer in West-Berlijn te landen. Maar zo was het niet de bedoeling. Toen Detlef Alexander Tiede en zijn beste vriendin Ingrid Maron en haar dochter Sabine in augustus 1978 naar Gdansk, Polen, reden, hadden ze maar één doel: om met de veerboot naar het westen te vluchten. Horst Fischer, de toenmalige vriend van Ingrid Maron die in West-Berlijn woonde, wilde hen helpen. Hij verkreeg valse identificatie en probeerde de documenten naar Gdansk te smokkelen. Maar het plan is mislukt. De drie voortvluchtigen wachtten tevergeefs op de overeengekomen ontmoetingsplaats voor Horst Fischer: hij werd onderweg betrapt met de valse identiteitskaarten en gearresteerd door de staatspolitie. Tiede, Maron en dochter Sabine zaten vast in Gdansk. Op een gegeven moment kwamen we op het idee om een vliegtuig te kapen dat verondersteld werd te landen op de Oost-Duitse luchthaven Berlijn-Schönefeld en het te laten landen op de luchthaven Berlijn-Tempelhof. Een speelgoedpistool gebruiken. Mijn dochter moest het pistool in het vliegtuig dragen, herinnert Ingrid Maron zich. En een douanebeambte vond het pistool natuurlijk tijdens een veiligheidscontrole. Ze gaf het terug aan mijn dochter, legt Ingrid Maron uit. En het plan was succesvol. Hijack vlak voor het bereiken van de luchthaven van bestemming.

Acceptability
  Grammar & syntax
    Verb form: Is mislukt <-> mislukte
    Structure: Een speelgoedpistool gebruiken <-> met behulp van een speelgoedpistool (< using a toy gun)
  Lexicon
    Wrong collocation: Maar zo was het niet de bedoeling.; Valse identificatie / identiteitskaarten <-> valse papieren; Wachtten […] voor Horst Fischer; Dat verondersteld werd <-> dat zou landen; Overeengekomen ontmoetingsplaats <-> afgesproken
    Named entity: Luchthaven Berlijn-Schönefeld <-> Berlin-Schönefeld; Luchthaven Berlijn-Tempelhof <-> Berlin-Tempelhof
  Spelling & typos
    Punctuation: , Polen, reden,
  Style & register
    Untranslated: Hijack <-> Het vliegtuig kapen
    Disfluent sentence / construction: Op een gegeven moment kwamen we op het idee…; herinnert Ingrid Maron zich.; legt Ingrid Maron uit
  Coherence
    Inconsistency: Valse identificatie / identiteitskaarten
Adequacy
  Word sense disambiguation: Geweer <-> pistool; Moest het pistool op het vliegtuig dragen <-> mee aan boord nemen
  Meaning shift caused by incorrect translation of function word: Geplande vlucht <-> lijnvlucht
  Meaning shift caused by misplaced word: Dwingt […] om met een plastic geweer in West-Berlijn te landen (Wrong position of adpositional phrase)

Table 25: Error typology T3_GT

T3_DL

Het is waarschijnlijk de meest ongebruikelijke ontsnapping in de geschiedenis van de DDR: Een jongeman dwingt de piloot van een geplande vlucht naar Oost-Berlijn om met een plastic pistool in West-Berlijn te landen. Maar zo moest het niet gaan. Toen Detlef Alexander Tiede en zijn beste vriendin Ingrid Maron en haar dochter Sabine in augustus 1978 naar Gdansk in Polen reden, hadden ze maar één doel: van daaruit per veerboot naar het westen te ontsnappen. Horst Fischer, toenmalige vriend van Ingrid Maron, die in West-Berlijn woonde, wilde hen helpen. Hij verkreeg valse identificatie en probeerde de documenten naar Gdansk te smokkelen.


Maar het plan is mislukt. De drie voortvluchtigen wachtten tevergeefs op de afgesproken ontmoetingsplek voor Horst Fischer: hij werd onderweg met de valse identiteitskaarten opgepakt en door de staatspolitie gearresteerd. Tiede, Maron en dochter Sabine zaten vast in Gdansk. Op een gegeven moment kwamen we met het idee om een vliegtuig te kapen dat op de Oost-Duitse luchthaven Berlin-Schönefeld zou moeten landen en het te laten landen op de West-Berlijnse luchthaven Berlin-Tempelhof. Een speelgoedpistool gebruiken. Mijn dochter moest het pistool op het vliegtuig dragen, herinnert Ingrid Maron zich. En natuurlijk vond een douanebeambte het wapen tijdens een veiligheidscontrole. Ze gaf het terug aan mijn dochter, legt Ingrid Maron uit. En het plan was succesvol. Hijack net voor het bereiken van de luchthaven van bestemming.

Acceptability
  Grammar & syntax
    Article: Toenmalige vriend (‘de’ missing)
    Verb form: Is mislukt <-> mislukte
    Superfluous word/constituent: Naar het westen te ontsnappen
    Structure: Een speelgoedpistool gebruiken <-> met behulp van een speelgoedpistool (< using a toy gun)
  Lexicon
    Wrong collocation: Valse identificatie / identiteitskaarten <-> valse papieren; Wachtten […] voor Horst Fischer
  Spelling & typos
    Capitalization: DDR: Een
  Style & register
    Untranslated: Hijack <-> Het vliegtuig kapen
    Disfluent sentence / construction: Op een gegeven moment kwamen we met het idee…; herinnert Ingrid Maron zich.; legt Ingrid Maron uit
  Coherence
    Inconsistency: Valse identificatie / identiteitskaarten
Adequacy
  Word sense disambiguation: Moest het pistool op het vliegtuig dragen <-> mee aan boord nemen
  Meaning shift caused by incorrect translation of function word: Geplande vlucht <-> lijnvlucht
  Meaning shift caused by misplaced word: Dwingt […] om met een plastic pistool in West-Berlijn te landen (Wrong position of adpositional phrase)

Table 26: Error typology T3_DL

