
Best practices for the human evaluation of automatically generated text

Chris van der Lee (Tilburg University), Albert Gatt (University of Malta), Emiel van Miltenburg (Tilburg University),
Sander Wubben (Tilburg University), Emiel Krahmer (Tilburg University)

Abstract

Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated, with a particularly high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature. With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.

1 Introduction

Even though automatic text generation has a long tradition, going back at least to Peter (1677) (see also Swift, 1774; Rodgers, 2017), human evaluation is still an understudied aspect. Such an evaluation is crucial for the development of Natural Language Generation (NLG) systems. With a well-executed evaluation it is possible to assess the quality of a system and its properties, and to demonstrate the progress that has been made on a task, but it can also help us to get a better understanding of the current state of the field (Mellish and Dale, 1998; Gkatzia and Mahamood, 2015; van der Lee et al., 2018). The importance of evaluation for NLG is itself uncontentious; what is perhaps more contentious is the way in which evaluation should be conducted. This paper provides an overview of current practices in human evaluation, showing that there is no consensus as to how NLG systems should be evaluated. As a result, it is hard to compare the results published by different groups, and it is difficult for newcomers to the field to identify which approach to take for evaluation. This paper addresses these issues by providing a set of best practices for human evaluation in NLG. A further motivation for this paper's focus on human evaluation is the recent discussion on the (un)suitability of automatic measures for the evaluation of NLG (see Ananthakrishnan et al., 2007; Novikova et al., 2017; Sulem et al., 2018; Reiter, 2018, and the discussion in Section 2).

Previous studies have also provided overviews of evaluation methods. Gkatzia and Mahamood (2015) focused on NLG papers from 2005-2014; Amidei et al. (2018a) provided a 2013-2018 overview of evaluation in question generation; and Gatt and Krahmer (2018) provided a more general overview of the state of the art in NLG. However, the aim of these papers was to give a structured overview of existing methods, rather than discuss shortcomings and best practices. Moreover, they did not focus on human evaluation.

Following Gkatzia and Mahamood (2015), Section 3 provides an overview of current evaluation practices, based on papers from INLG and ACL in 2018. Apart from the broad range of methods used, we also observe that evaluation practices have changed since 2015: for example, there is a significant decrease in the number of papers featuring extrinsic evaluation. This may be caused by the current focus on smaller, decontextualized tasks, which do not take users into account.

Building on findings from NLG, but also statistics and the behavioral sciences, Section 4 provides a set of recommendations and best practices for human evaluation in NLG. We hope that our recommendations can serve as a guide for newcomers in the field, and can otherwise help NLG research by standardizing the way human evaluation is carried out.

2 Automatic versus human evaluation

Automatic metrics such as BLEU, METEOR, and ROUGE are increasingly popular; Gkatzia and Mahamood's (2015) survey of NLG papers from 2005-2014 found that 38.2% used automatic metrics, while our own survey (described more fully in Section 3) shows that 80% of the empirical papers presented at the ACL track on NLG or at the INLG conference in 2018 reported on automatic metrics.

However, the use of these metrics for the assessment of a system's quality is controversial, and has been criticized for a variety of reasons. The two main points of criticism are:

Automatic metrics are uninterpretable. Text generation can go wrong in different ways while still receiving the same scores on automated metrics. Furthermore, low scores can be caused by correct, but unexpected verbalizations (Ananthakrishnan et al., 2007). Identifying what can be improved therefore requires an error analysis. Automatic metric scores can also be hard to interpret because it is unclear how stable the reported scores are. With BLEU, for instance, libraries often have their own BLEU score implementation, which may differ from one another, thus affecting the scores (this was recently addressed by Post, 2018). Reporting the scores accompanied by confidence intervals, calculated using bootstrap resampling (Koehn, 2004), may increase the stability and therefore the interpretability of the results. However, such statistical tests are not straightforward to perform.
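As a minimal illustration of such bootstrap-resampled confidence intervals, consider the sketch below. The function name and the idea of passing in an arbitrary corpus-level `metric_fn` (for example, a wrapper around a BLEU implementation) are assumptions made here for exposition, not a prescribed implementation.

```python
import random

def bootstrap_ci(hyps, refs, metric_fn, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for a corpus-level metric.

    metric_fn takes two parallel lists (hypotheses, references) and returns a
    single score, e.g. a wrapper around a BLEU implementation.
    """
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_resamples):
        # Resample (hypothesis, reference) pairs with replacement and re-score.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric_fn([hyps[i] for i in idx], [refs[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return metric_fn(hyps, refs), (lo, hi)
```

Reporting the resulting interval alongside the point estimate makes it easier to judge whether a difference between systems exceeds the noise introduced by the particular test set.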
Automatic metrics do not correlate with human evaluations. This has been repeatedly observed (e.g. Belz and Reiter, 2006; Reiter and Belz, 2009; Novikova et al., 2017). [Footnote 1: In theory this correlation might increase when more reference texts are used, since this allows for more variety in the generated texts. However, in contrast to what this theory would predict, both Doddington (2002) and Turian et al. (2003) report that correlations between metrics and human judgments in machine translation do not improve substantially as the number of reference texts increases. Similarly, Choshen and Abend (2018) found that reliability issues of reference-based evaluation due to low-coverage reference sets cannot be overcome by attainably increasing references.] In light of this criticism, it has been argued that automated metrics are not suitable to assess linguistic properties (Scott and Moore, 2007), and Reiter (2018) discouraged the use of automatic metrics as a (primary) evaluation metric. The alternative is to perform a human evaluation.

There are arguably still good reasons to use automatic metrics: they are a cheap, quick and repeatable way to approximate text quality (Reiter and Belz, 2009), and they can be useful for error analysis and system development (Novikova et al., 2017). We would not recommend using human evaluation for every step of the development process, since this would be costly and time-consuming. Furthermore, there may be automatic metrics that reliably capture some qualitative aspects of NLG output, such as fluency or stylistic compatibility with reference texts. But for a general assessment of overall system quality, human evaluation remains the gold standard.

3 Overview of current work

This section provides an overview of current human evaluation practices, based on the papers published at INLG (N=51) and ACL (N=38) in 2018. [Footnote 2: For the ACL papers, we focused on the following tracks: Machine Translation, Summarization, Question Answering, and Generation. See the Supplementary Materials for a detailed overview of the investigated papers and their evaluation characteristics.] We did not observe noticeable differences in evaluation practices between INLG and ACL, which is why they are merged for the discussion of the bibliometric study.

3.1 Intrinsic and extrinsic evaluation

Human evaluation of natural language generation systems can be done using intrinsic and extrinsic methods (Sparck Jones and Galliers, 1996; Belz and Reiter, 2006). Intrinsic approaches aim to evaluate properties of the system's output, for instance by asking participants about the fluency of the system's output in a questionnaire. Extrinsic approaches aim to evaluate the impact of the system, by investigating to what degree the system achieves the overarching task for which it was developed. While extrinsic evaluation has been argued to be more useful (Reiter and Belz, 2009), it is also rare. Only three papers (3%) in the sample of INLG and ACL papers presented an extrinsic evaluation. This is a notable decrease from Gkatzia and Mahamood (2015), who found that nearly 25% of studies contained an extrinsic evaluation.

Of course, extrinsic evaluation is the most time- and cost-intensive of all possible evaluations (Gatt and Krahmer, 2018), which might explain the rarity, but does not explain the decline in (relative) frequency. That might be because of the set-up of the tasks we see nowadays. Extrinsic evaluations require that the system is embedded in its target use context (or a suitable simulation thereof), which in turn requires that the system addresses a specific purpose.

Criterion                 Total
Fluency                   13
Naturalness               8
Quality                   5
Meaning preservation      5
Relevance                 5
Grammaticality            5
Overall quality           4
Readability               4
Clarity                   3
Manipulation check        3
Informativeness           3
Correctness               3
Syntactic correctness     2
Qualitative analysis      2
Appropriateness           2
Non-redundancy            2
Semantic adequacy         2
Other criteria            25

Table 1: Criteria used for human evaluation from all papers. Separate counts for ACL and INLG 2018 are in the appendix.

Scale                              Count
Likert (5-point)                   14
Preference                         10
Likert (2-point)                   6
Likert (3-point)                   5
Other Likert (4-, 7-, 10-point)    5
Rank-based Magnitude Estimation    5
Free text comments                 1

Table 2: Types of scales used for human evaluation.

In practice, this often means the system follows the 'traditional' NLG pipeline (Reiter and Dale, 2000), encompassing many of the pipeline sub-tasks to go from input to complete output texts (Mellish et al., 2006; Gatt and Krahmer, 2018). Such systems were a mainstay of the NLG literature until recently (e.g., Harris, 2008; Gatt and Portet, 2010; Reiter et al., 2003), but the field has shifted towards focusing on only one or a few of the sub-tasks from the NLG pipeline (e.g. text planning, surface realization, referring expression generation), with a concomitant focus on text output quality, for which an intrinsic evaluation may be sufficient. However, we are starting to see a swing back towards a full pipeline approach with separate neural modules handling sub-tasks (Castro Ferreira et al., 2019), which may also cause a resurgence of extrinsic evaluation.

3.2 Properties of text quality

Many studies take some notion of 'text quality' as their primary evaluation measure, but this is not easy to assess, since text quality criteria differ across tasks (see Section 4.1 for further discussion). This variety, suggesting a lack of agreement, is clear from Table 1. Except for fluency, and for naturalness and quality, which were used for a shared task, most criteria are infrequent; the numerous 'other criteria' are those which are used only once. At the same time, there is probably significant overlap. For instance, naturalness is sometimes linked to fluency, and informativeness to adequacy (Novikova et al., 2018). In short, there is no standard evaluation model for NLG. Furthermore, there is significant variety in naming conventions.

3.3 Sample size and demographics

When looking at sample size, it is possible to distinguish between expert-focused and reader-focused evaluation. 14 papers (28%) used an expert-focused approach, meaning that between 1 and 4 expert annotators evaluated system output. 13 papers (26%) employed a larger-scale reader-focused method in which 10 to 60 readers judged the generated output. We found a median of 4 annotators. However, these numbers might not reflect reality: only 55% of papers specified the number of participants, and an even smaller number (18%) reported the demographics of their sample. Only 12.5% of the papers with a human evaluation reported inter-annotator agreement, using Krippendorff's α, Fleiss' κ, weighted κ or Cohen's κ. Agreement in most cases ranged from 0.3 to 0.5, but given the variety of metrics and the thresholds used to determine acceptable agreement, this range should be treated with caution.

3.4 Design

Apart from participant sample size, an important issue that impacts statistical power is the number of items (e.g. generated sentences) used in an evaluation. Among papers that reported these numbers, we observed a median of 100 items used for human evaluation in INLG and ACL papers. The number of items, however, ranged between 2 and 5,400, illustrating a sizable discrepancy. In 83% of papers that reported these figures, all annotators saw all examples. Only 12.5% of papers reported other aspects of evaluation study design, such as the order in which items were presented, the randomisation and counterbalancing methods used (e.g. a Latin square design), or whether criteria were measured at the same time or separately.

3.5 Number of questions and types of scales

In addition to the diversity in criteria used to measure text quality (see Section 3.2), there is a wide range of rating methods that are used to measure those criteria. Do note that Likert and rating scales are treated indistinctly here (for a distinction, see Amidei et al., 2019). The 5-point Likert scale is the most popular option, but preference ratings are a close second (see Table 2).

Other types of rating methods are much less common. Rank-based Magnitude Estimation, a continuous metric, was only found among shared task papers, and only one paper reported using free-text comments.

We also investigated the number of ratings used to measure a single criterion (e.g. a paper may use two ratings to measure two different aspects of fluency). Only 34% of papers with a human evaluation reported the number of ratings used to measure a criterion. These numbers ranged from 1 to 4 ratings per criterion, with 1 rating being the most common.

3.6 Statistics and data analysis

A minority (33%) of papers report one or more statistical analyses for their human evaluation to investigate whether findings are statistically significant. The types of statistical analyses vary greatly: there is not one single test that is the most common. Examples of tests found are Student's t test, the Mann-Whitney U test, and McNemar's test. Theoretically, such statistical tests should be performed to test a specific hypothesis (Navarro, 2019). However, not all papers using a statistical test report their hypotheses. And conversely, some papers reporting hypotheses do not perform a statistical test. 19% of all papers explicitly state their hypotheses or research questions.
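As a small, self-contained illustration of one of the tests listed above, the sketch below runs a two-sided Mann-Whitney U test on invented ratings for two systems; the data, names and chosen test are purely illustrative, not a recommendation for any particular study.

```python
from scipy import stats

# Hypothetical 7-point fluency ratings collected for two systems.
ratings_a = [6, 5, 7, 4, 6, 5, 6, 7, 5, 6]
ratings_b = [4, 5, 3, 4, 5, 4, 6, 3, 4, 5]

# Two-sided Mann-Whitney U test: suitable for ordinal (Likert-style) data
# from two independent groups of annotators.
u_stat, p_value = stats.mannwhitneyu(ratings_a, ratings_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

Whatever test is chosen, the hypothesis it addresses should be stated explicitly, as argued below in Section 4.5.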
4 Best practices

This section provides best practices for carrying out and reporting human evaluation in NLG. We (mostly) restrict ourselves to intrinsic evaluation.

4.1 Text quality and criteria

Renkema (2012, p. 37) defines text quality in terms of whether the writer (or: NLG system) succeeds in conveying their intentions to the reader. He outlines three requirements for this to be achieved: (i) the writer needs to achieve their goal while meeting the reader's expectations; (ii) linguistic choices need to match the goal; and (iii) the text needs to be free of errors.

If successfully conveying communicative intention is taken to be the main overarching criterion for quality, then two possibilities arise. One could treat quality as a primitive, as it were, evaluating it directly with users. Alternatively—and more in line with current NLG evaluation practices—one could take text quality to be contingent on individual dimensions or criteria (for various studies of such criteria, see Dell'Orletta et al., 2011; Falkenjack et al., 2013; Nenkova et al., 2010; Pitler and Nenkova, 2008, inter alia).

The choice between these two options turns out to be a point of contention. Highly correlated scores on different quality criteria suggest that human annotators find them hard to distinguish (Novikova et al., 2017). For this reason, some researchers directly measure the overall quality of a text. However, Hastie and Belz (2014) note that an overall communicative goal is often too abstract a construct to measure directly. They argue against this practice and in favour of identifying separate criteria, weighted according to their importance in contributing to the overall goal.

The position taken by Hastie and Belz (2014) implies that, to the extent that valid and agreed-upon definitions exist for specific quality criteria, these should be systematically related to overall communicative success. Yet, this relationship need not be monotonic or linear. For example, two texts might convey the underlying intention (including the intention to inform) equally successfully, while varying in fluency, perhaps as long as some minimal level of fluency is satisfied by both. In that case, the relationship would not be monotonic (higher fluency may not guarantee success beyond a point). A further question is how the various criteria interact. For instance, it is conceivable that under certain conditions (e.g. summarising high-volume, heterogeneous data in a short span of text), readability and adequacy are mutually conflicting beyond a certain point (e.g. because adequately conveying all the information will result in more convoluted text which is harder to understand).

Ultimately, the criteria to be considered will depend on the task. For example, in style transfer, manipulation checks are important to determine whether the style has been transferred correctly, while also ensuring meaning preservation. These criteria are not necessarily important for a system that generates weather reports from numerical data, where accuracy, fluency, coherence and genre compatibility might be more prominent concerns. By contrast, coherence and fluency would not be important criteria for the PARRY chatbot (Colby et al., 1971), which attempts to simulate the speech of a person with paranoid schizophrenia.

As we have shown, the criteria used for NLG evaluation are usually treated as subjective (as in the case of judgments of fluency, adequacy and the like). It is also conceivable that these criteria can be assessed using more objective measures, similar to existing readability measures (e.g., Ambati et al., 2016; Kincaid et al., 1975; Pitler and Nenkova, 2008; Vajjala and Meurers, 2014), where objective text metrics (e.g. average word length, average parse tree height, average number of nouns) are used in a formula, or as features in a regression model, to obtain a score for a text criterion. Similarly, it may be possible to use separate subjective criteria as features in a regression model to calculate overall text quality scores. This would also provide information about the importance of the subjective criteria for overall text quality judgments. However, such research on the relationship between subjective criteria and objective measures is currently lacking for NLG.

One obstacle to addressing the difficulties identified in this section is the lack of a standardised nomenclature for different text quality criteria. This presents a practical problem, in that it is hard to compare evaluation results to previously reported work; but it also presents a theoretical problem, in that different criteria may overlap or be inter-definable. As Gatt and Belz (2010) and Hastie and Belz (2014) suggest, common and shared evaluation guidelines should be developed for each task, and efforts should be made to standardise criteria and naming conventions. In the absence of such guidelines, care should be taken to explicitly define the criteria measured and to highlight possible overlaps between them.

4.2 Sample size, demographics and agreement

Expert- versus reader-focused. Section 3.3 made a distinction between expert-focused and reader-focused evaluation. With an expert-focused design, a small number of expert annotators is recruited to judge aspects of the NLG system. A reader-focused design entails a typically larger sample of (non-expert) participants. Lentz and De Jong (1997) found that these two methods can be complementary: expert problem detection may highlight textual problems that are missed by general readers. However, this strength is mostly applicable when a more qualitative analysis is used, whereas most expert-focused evaluations in our sample of papers used closed-ended questions with Likert scales.

Evidence suggests that expert readers approach evaluation differently from general readers, injecting their own opinions and biases (Amidei et al., 2018b). This might be troublesome if a system is meant for the general population, as expert opinions and biases might not be representative of those of non-experts. This is corroborated by Lentz and De Jong (1997), who found that expert judgments only predict the outcomes of reader-focused evaluation to a limited extent. Experts are also susceptible to considerable variance, so much so that automatic metrics are sometimes more reliable (Belz and Reiter, 2006). Thus, the conclusion of Belz and Reiter (2006) in favour of large-scale reader-focused studies, rather than expert-focused ones, seems well-taken.

An additional factor to consider is the types of 'general' or 'expert' populations that are accessible to NLG researchers. It is not untypical for evaluations to be carried out with students, or with fellow researchers (recruited, for instance, via SIGGEN or other mailing lists). This may introduce sampling biases of the kind that have been critiqued in psychology in recent years, where experimental results based on samples of WEIRD (Western, Educated, Industrialised, Rich and Developed) populations may well have given rise to biased models (see, for example, Henrich et al., 2010).

Evaluator agreement. The varying opinions of judges are also reflected in low Inter-Annotator Agreement (IAA), where adequate thresholds also tend to be open to interpretation (Artstein and Poesio, 2008). Amidei et al. (2018b) argue that, given the variable nature of natural language, it is undesirable to use restrictive thresholds, since an ostensibly low IAA score could be due to a host of factors, including personal preferences. The authors therefore suggest reporting IAA statistics with confidence intervals. However, narrower confidence intervals (suggesting a more precise IAA score) would normally be expected with large samples (e.g., 1,000 or more comparisons; McHugh, 2012), which are well beyond most sizes reported in our overview (Section 3.4).
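A minimal sketch of this style of reporting is given below: percentage agreement plus Cohen's kappa with a percentile-bootstrap confidence interval, for two annotators labelling the same items. The function name and data are invented for exposition, and the scikit-learn library is assumed to be available; the same pattern applies to other agreement coefficients.

```python
import random
from sklearn.metrics import cohen_kappa_score

def agreement_report(labels_a, labels_b, n_boot=1000, seed=0):
    """Percentage agreement and Cohen's kappa with a bootstrap confidence interval
    for two annotators who labelled the same items (e.g. 'fluent' vs. 'not fluent')."""
    n = len(labels_a)
    perc = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    kappa = cohen_kappa_score(labels_a, labels_b)
    rng = random.Random(seed)
    boot = []
    for _ in range(n_boot):
        # Resample items with replacement and recompute kappa each time.
        idx = [rng.randrange(n) for _ in range(n)]
        boot.append(cohen_kappa_score([labels_a[i] for i in idx],
                                      [labels_b[i] for i in idx]))
    boot.sort()
    ci = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])
    return perc, kappa, ci
```

Reporting both the chance-corrected coefficient and plain percentage agreement, together with the interval, gives readers a more complete picture than a single point estimate.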

When the goal of an evaluation is to identify potential problems with output texts, a low IAA, indicating variety among annotators, can be highly informative (Amidei et al., 2018b). On the other hand, low IAA in evaluations of text quality can also suggest that results should not be extrapolated to a broader reader population. An additional consideration is that some statistics (such as κ; see McHugh, 2012) make overly restrictive assumptions, though they have the advantage of accounting for chance agreement. Thus, apart from reporting such statistics, it is advisable to also report percentage agreement, which is easily interpretable (McHugh, 2012).

Sample size. For expert-focused evaluations, good advice is provided by Van Enschot et al. (2017): difficult coding tasks (which most NLG evaluations are) require three or more annotators (though preferably more; see Potter and Levine-Donnerstein, 1999), while more straightforward tasks can do with two to three. In the case of large-scale studies, Brysbaert (2019) recently stated that most studies with fewer than 50 participants are underpowered and that for most designs and analyses 100 or more participants are needed. With the introduction of crowdsourcing, such numbers are obtainable, at least for widely spoken languages (though see van Miltenburg et al. 2017 for a counterexample). Furthermore, the number of participants necessary can be decreased by having multiple observations per condition per participant (i.e., having participants perform more judgments).

Whatever the sample size, a minimum good practice guideline is to always report participant numbers, with relevant demographic data (i.e., gender, nationality, age, fluency in the target language, academic background, etc.), in order to enhance replicability and enable readers to gauge the meaningfulness of the results.

4.3 Number of questions and types of scales

As shown in Section 3.5, Likert scales are the prevalent rating method for NLG evaluation, 5-point scales being the most popular, followed by 2-point and 3-point scales. While the most appropriate number of response points may depend on the task itself, 7-point scales (with clear verbal anchoring) seem best for most tasks. Most findings in the experimental literature indicate that 7-point scales maximise reliability, validity and discriminative power (for instance, Miller, 1956; Green and Rao, 1970; Jones, 1968; Cicchetti et al., 1985; Lissitz and Green, 1975; Preston and Colman, 2000). These studies discourage smaller scales, and according to these studies adding more than 7 response points also does not increase reliability.

While Likert scales are the most popular scale within the NLG domain (and probably in many other domains), the use of this scale has been receiving more and more criticism. Recent studies have found that participant ratings are more reliable, more consistent, and less prone to order effects when they involve ranking rather than Likert scales (Martinez et al., 2014; Yannakakis and Martínez, 2015; Yannakakis and Hallam, 2011). Similarly, for the development of an automatic metric for NLG, Chaganty et al. (2018) found that annotator variance decreased significantly when using post-edits as a metric instead of a Likert scale survey. Finally, Novikova et al. (2018) compared Likert scales for NLG system evaluation to two continuous scales: a vanilla magnitude estimation measure and a rank-based magnitude estimation measure. The researchers found that both magnitude estimation scales delivered more reliable and consistent text evaluation scores.

All these studies seem to suggest that ranking-based methods (combined with continuous scales) are the preferred method. However, there are two critical remarks to be made here. Firstly, a drawback of ranking-based methods is that the number of judgments increases substantially as more systems are compared. To mitigate this, Novikova et al. (2018) illustrated that the TrueSkill™ algorithm (Herbrich et al., 2007) can be implemented. This algorithm uses binary comparisons to reliably rank systems, which greatly reduces the amount of data needed for multiple-system comparisons.
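A minimal sketch of this ranking approach, assuming the third-party trueskill Python package (an implementation of the TrueSkill™ algorithm), is shown below; the system names and judgments are hypothetical.

```python
import trueskill  # third-party package, assumed installed (pip install trueskill)

def rank_systems(pairwise_wins, system_names):
    """Rank NLG systems from binary preference judgments.

    pairwise_wins is a list of (winner, loser) system names, one pair per human
    judgment in which one output was preferred over another.
    """
    ratings = {name: trueskill.Rating() for name in system_names}
    for winner, loser in pairwise_wins:
        ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])
    # Sort by a conservative skill estimate (mean minus three standard deviations).
    return sorted(system_names, key=lambda s: ratings[s].mu - 3 * ratings[s].sigma, reverse=True)

# Example: three judgments over two hypothetical systems and a baseline.
print(rank_systems([("ours", "baseline"), ("ours", "sota"), ("sota", "baseline")],
                   ["ours", "sota", "baseline"]))
```

Because the skill estimates are updated after every comparison, far fewer judgments are needed than exhaustive pairwise ranking of all outputs would require.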

Another point of criticism is that studies comparing Likert scales to other research instruments mostly look at single-rating constructs, that is, experiments where a single judgment is elicited on a given criterion. While constructs measured with one rating are also the most common in NLG research, this practice has been criticized. It is unlikely that a complex concept (e.g. fluency or adequacy) can be captured in a single rating (McIver and Carmines, 1981). Furthermore, a single Likert scale often does not provide enough points of discrimination: a single 7-point Likert question has only 7 points to discriminate on, while five 7-point Likert questions have 5 * 7 = 35 points of discrimination. A practical objection against single-item scales is that no reliability measure for internal consistency (e.g., Cronbach's alpha) can be calculated for a single item; at least two items are necessary for this. In light of these concerns, Diamantopoulos et al. (2012) advocate great caution in the use of single-item scales, unless the construct in question is very simple, clear and one-dimensional. Under most conditions, multi-item scales have much higher predictive validity. Using multiple items may well put the reliability of Likert scales on a par with that of ranking tasks; this, however, has not been empirically tested. Also, do note that the use of multiple-item scales versus single-item scales affects the type of statistical testing needed (for an overview and explanation, see Amidei et al., 2019).
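Since Cronbach's alpha is the internal-consistency check mentioned above for multi-item scales, a small self-contained sketch of its computation follows. The formula is standard; the function name and the ratings are invented for exposition.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a multi-item scale.

    scores: 2-D array-like, one row per participant, one column per scale item
    (e.g. three 7-point Likert items that are all meant to measure fluency).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings from five participants on a three-item fluency scale.
print(cronbach_alpha([[6, 5, 6], [4, 4, 5], [7, 6, 6], [3, 4, 3], [5, 5, 6]]))
```

Values close to 1 indicate that the items behave as a coherent scale; low values suggest that the items do not measure the same underlying construct.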

In sum, we advise using either multiple-item 7-point Likert scales, or a (continuous) ranking task. The latter should be used in combination with TrueSkill™ when multiple systems are compared. As Aroyo and Welty (2014) note, disagreement in the responses can be due to three factors: the item, the worker, and the task. Therefore, it is necessary to pilot the rating task before deploying it more widely, and to analyze disagreement at the annotator level, to see whether particular annotators are causing discrepancies in the ratings for different items.

Alternative evaluation instruments should not be ruled out either. Ever since a pilot in 2016 (Bojar et al., 2016a), recent editions of the Conference on Machine Translation (WMT) have used Direct Assessment, whereby participants compare an output to a reference text on a continuous (0-100) scale (Graham et al., 2017; Bojar et al., 2016b), similar to Magnitude Estimation (Bard et al., 1996). Zarrieß et al. (2015) used a mouse-contingent reading paradigm in an evaluation study of generated text, finding that features recorded using this paradigm (e.g. reading time) provided valuable information to gauge text quality levels. It should also be noted that most metrics used in NLG are reader-focused. However, in many real-world scenarios, especially 'creative' NLG applications, NLG systems and human writers work alongside each other in some way (see Maher, 2012; Manjavacas et al., 2017). With such a scenario in mind, it makes sense to also investigate writer-focused methods. Having participants edit generated texts, and then processing these edits using post-editing distance measures like Translation Edit Rate (Snover et al., 2006), might be a viable method to investigate the time and cost associated with using a system. While more commonly seen in Machine Translation, authors have explored the use of such metrics in NLG (Bernhard et al., 2012; Han et al., 2017; Sripada et al., 2005).

Finally, some remarks on qualitative evaluation methods are in order. Reiter and Belz (2009) note that free-text comments can be beneficial to diagnose potential problems of an NLG system. Furthermore, Sambaraju et al. (2011) argue for the added value of content analysis and discourse analysis for evaluation. Such qualitative analyses can find potential blind spots of quantitative analyses. At the same time, the subjectivity that is often inherent in studies based on discourse analysis, such as Sambaraju et al. (2011), would need to be offset by data from larger-scale, quantitative studies.

4.4 Design

Few papers report exact details of the design of their human evaluation experiments, although most indicate that multiple systems were compared and annotators were shown all examples. This suggests that within-subjects designs are a common practice.

Within-subjects designs are susceptible to order effects: over the course of an experiment, annotators can change their responses due to fatigue, practice, carryover effects or other (external) factors. If the order in which the outputs of systems are presented is fixed, differences found between systems may be due to order effects rather than differences in the output itself. To mitigate this, researchers can implement measures in the task design. Practice effects can be reduced with a practice trial in which examples of both very good (fluent, accurate, grammatical) and very bad (disfluent, inaccurate, ungrammatical) outputs are provided before the actual rating task. This allows the participants to calibrate their responses before starting with the actual task. Carryover effects can be reduced by increasing the amount of time between presenting different conditions (Shaughnessy et al., 2006). Fatigue effects can be reduced by shortening the task, although this also means that more participants are necessary, since fewer observations per condition per participant means less statistical power (Brysbaert, 2019). Another way to tackle fatigue effects sometimes seen in research is to remove all entries with missing data, or to remove participants that failed 'attention checks' (or related checks, e.g. instructional manipulation checks or trap questions) from the sample. However, the use of attention checks is subject to debate, with some researchers pointing out that after such elimination procedures, the remaining cases may be a biased subsample of the total sample, thus biasing the results (Anduiza and Galais, 2016; Bennett, 2001; Berinsky et al., 2016). Experiments show that excluding participants that failed attention checks introduces a demographic bias, and attention checks can actually induce low-effort responses or socially desirable responses (Clifford and Jerit, 2015; Vannette, 2016).

Order effects can also be reduced by presenting the conditions in a systematically varied order. Counterbalancing is one such measure. With counterbalancing, all examples are presented in every possible order. While such a design is the best way to reduce order effects, it quickly becomes expensive: when annotators judge 4 examples, 4! = 24 different orders should be investigated (this, however, can be partially mitigated by grouping items randomly into sets, and counterbalancing the order of sets rather than individual items). In most cases, randomising the order of examples should be sufficient. Another possibility is to use a between-subjects design, in which the subjects only judge the (randomly ordered) outputs of one system. When order effects are expected and a large number of conditions are investigated, such a design is preferable (Shaughnessy et al., 2006).

Novikova et al. (2018) found that the presentation of questions matters. When evaluating text criteria, answers to questions about different criteria tend to correlate when they are presented simultaneously for a given item. When participants are shown an item multiple times and questioned about each text criterion separately, this correlation is reduced.

4.5 Statistics and data analysis

Within the behavioral sciences, it is standard to evaluate hypotheses based on whether findings are statistically significant or not (typically, in published papers, they are), although a majority of NLG papers do not report statistical tests (see Section 3.6). However, there is a growing awareness that statistical tests are often conducted incorrectly, both in NLP (Dror et al., 2018) and in the behavioral sciences more generally (e.g., Wagenmakers et al., 2011). Moreover, one may wonder whether standard null-hypothesis significance testing (NHST) is applicable or helpful in human NLG evaluation.

In a common scenario, NLG researchers may want to compare various versions of their own novel system (e.g. with or without output variation, or relying on different word embedding models, to give just two more or less random examples) to each other, to some other ('state-of-the-art') systems, and/or to one or more baselines. Notice that this quickly gives rise to a rather complex statistical design with multiple factors and multiple levels. Ironically, with every system or baseline that is added to the evaluation, the comparison becomes more interesting but the statistical model becomes more complex, and power issues become more pressing (Cohen, 1988; Button et al., 2013). However, statistical power—the probability that the statistical test will reject the null hypothesis (H0) when the alternative hypothesis (H1, e.g., that your new NLG system is the best) is true—is seldom (if ever) discussed in the NLG literature.

A related issue is that clear hypotheses are often not stated (see Section 3.6). Of course, researchers generally assume that their system will be rated higher than the comparison systems. But they will not necessarily assume that they will perform better on all dependent variables. Moreover, they may have no specific hypotheses about which variant of their own system will perform best.

In fact, in the scenario sketched above there may be multiple (implicit) hypotheses: new system better than state of the art, new system better than baseline, etcetera. When testing multiple hypotheses, the probability of making at least one false claim (incorrectly rejecting a H0) increases (such errors are known as false positives or Type I errors). Various remedies for this particular problem exist, one being an application of the simple Bonferroni correction, which amounts to lowering the significance threshold α—commonly .05, but see for example Benjamin et al. (2018) and Lakens et al. (2018)—to α/m, where m is the number of hypotheses tested. This procedure is not systematically applied in NLG, although awareness of the issues with multiple comparisons is increasing.
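A minimal sketch of the Bonferroni correction just described; the p-values and the three comparisons are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each hypothesis as significant only if its p-value is below alpha / m,
    where m is the number of hypotheses tested."""
    threshold = alpha / len(p_values)
    return [(p, p < threshold) for p in p_values]

# Hypothetical p-values for: ours vs. state of the art, ours vs. baseline, variant A vs. variant B.
print(bonferroni([0.012, 0.030, 0.20]))  # with m = 3, only p < 0.0167 counts as significant
```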

Topic          Best practice
General        Always conduct a human evaluation (if possible).
Criteria       Use separate criteria rather than an overall quality assessment.
               Properly define the criteria that are used in the evaluation.
Participants   Preferably use a (large-scale) reader-focused design rather than a (small-scale) expert-focused design.
               Always recruit sufficiently many participants.
               Report (and motivate) the sample size and the demographics.
Annotation     For a qualitative analysis, recruit multiple annotators (at least 2; more is better).
               Report the Inter-Annotator Agreement score with confidence intervals, plus a percentage agreement.
Measurement    For a quantitative study, use multiple-item (preferably 7-point) Likert scales, or (continuous) ranking.
Design         Reduce order and learning effects by counterbalancing/random ordering, and properly report this.
Statistics     If the evaluation study is exploratory, only report exploratory data analysis.
               If the study is confirmatory, consider preregistering it and conduct appropriate statistical analyses.

Table 3: List of best practices for human evaluation of automatically generated text.

Finally, statistical tests are associated with assumptions about their applicability. One is the independence assumption (especially relevant for t-tests and ANOVAs, for example), which amounts to assuming that the value of one observation obtained in the experiment is unaffected by the value of other observations. This assumption is difficult to guarantee in NLP research (Dror et al., 2018), if only because different systems may rely on the same training data. In view of these issues, some have argued that NHST should be abandoned (Koplenig, 2017; McShane et al., 2019).

In our opinion, the distinction between exploratory and confirmatory (hypothesis) testing should be taken more seriously within NLG. Much human evaluation of NLG could better be approached from an exploratory perspective, and instead of full-fledged hypothesis testing it would be more appropriate to analyse findings with exploratory data analysis techniques (Tukey, 1980; Cumming, 2013). When researchers do have clear hypotheses, statistical significance testing can be a powerful tool (assuming it is applied correctly). In these cases, we recommend preregistering the hypotheses and analysis plans before conducting the actual evaluation. [Footnote 3: For example at osf.io or aspredicted.org.]

Preregistration is still uncommon in NLG and other fields of AI (with a few notable exceptions, like for instance Vogt et al., 2019), but it addresses an important issue with human evaluations. Conducting and analysing a human experiment is like entering a garden of forking paths (Gelman and Loken, 2013): along the way researchers have many choices to make, and even though each choice may be small and seemingly innocuous, collectively they can have a substantial effect on the outcome of the statistical analyses, to the extent that it becomes possible to present virtually every finding as statistically significant (Simmons et al., 2011; Wicherts et al., 2016). In human NLG evaluation, choices may include, for instance, termination criteria (when does the data collection stop?), exclusion criteria (when is a participant removed from the analysis?), reporting of variables (which dependent variables are reported?), etc. By being explicit beforehand (i.e., by preregistering), any flexibility in the analysis (be it intentional or not) is removed. Preregistration is increasingly common in medical and psychological science, and even though it is not perfect (Claesen et al., 2019), at least it has made research more transparent and controllable, which has a positive impact on the possibilities to replicate earlier findings.

Finally, alternative statistical models deserve more attention within NLG. For example, within psycholinguistics it is common to look at both participant and item effects (Clark, 1973). This would make a lot of sense in human NLG evaluations as well, because it might well be that a new NLG system works well for one kind of generated item (short active sentences, say) and less well for another kind (complex sentences with relative clauses). Mixed effects models capture such potential item effects very well (e.g., Barr et al., 2013), and deserve more attention in NLG. Finally, Bayesian models are worth exploring, because they are less sensitive to the aforementioned problems with NHST (e.g., Gelman et al., 2006; Wagenmakers, 2007).
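As a hedged illustration of such a mixed-effects analysis, the sketch below uses the statsmodels formula API with random intercepts for participants and a variance component for items; the data are invented, and this treatment is only an approximation of fully crossed random effects (which are handled more naturally in lme4/R).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format ratings: one row per judgment, recording the system
# that produced the item, the participant who rated it, and the item identifier.
df = pd.DataFrame({
    "participant": ["p1"] * 6 + ["p2"] * 6 + ["p3"] * 6,
    "item":        ["i1", "i2", "i3", "i4", "i5", "i6"] * 3,
    "system":      (["ours"] * 3 + ["base"] * 3) * 3,
    "rating":      [6, 5, 7, 4, 5, 4,
                    5, 6, 6, 3, 4, 5,
                    7, 5, 6, 4, 4, 3],
})

# Random intercepts per participant plus an item variance component, so that an
# apparent system effect is not driven by a few lenient raters or easy items.
model = smf.mixedlm("rating ~ system", df,
                    groups=df["participant"],
                    re_formula="1",
                    vc_formula={"item": "0 + C(item)"})
print(model.fit().summary())
```

With real evaluation data (far more participants and items than in this toy frame), the fixed effect for system then estimates the quality difference while accounting for both sources of variation.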
5 Conclusion

We have provided an overview of the current state of human evaluation in NLG, and presented a set of best practices, summarized in Table 3. This is a broad topic, and for reasons of space we were not able to cover all aspects of human evaluation in detail. Nevertheless, we hope that this overview will serve as a useful reference for NLG practitioners, and in future work we aim to provide a more extensive set of best practices for carrying out human evaluations in Natural Language Generation.

363 Acknowledgements Anja Belz and Ehud Reiter. 2006. Comparing auto- matic and human evaluation of NLG systems. In We received support from RAAK-PRO SIA 11th Conference of the European Chapter of the As- (2014-01-51PRO) and The Netherlands Organiza- sociation for Computational Linguistics, pages 313– tion for Scientific Research (NWO 360-89-050), 320. Association for Computational Linguistics. which is gratefully acknowledged. We would also Daniel J Benjamin, James O Berger, Magnus Johan- like to thank the anonymous reviewers for their nesson, Brian A Nosek, E-J Wagenmakers, Richard valuable and insightful comments. Berk, Kenneth A Bollen, Bjorn¨ Brembs, Lawrence Brown, Colin Camerer, et al. 2018. Redefine statis- tical significance. Nature Human Behaviour, 2(1):6.

References Derrick A Bennett. 2001. How can I deal with miss- Bharat Ram Ambati, Siva Reddy, and Mark Steedman. ing data in my study? Australian and New Zealand 2016. Assessing relative sentence complexity us- Journal of Public Health, 25(5):464–469. ing an incremental CCG parser. In Proceedings of Adam J Berinsky, Michele F Margolis, and Michael W the 2016 Conference of the North American Chap- Sances. 2016. Can we turn shirkers into workers? ter of the Association for Computational Linguistics, Journal of Experimental Social Psychology, 66:20– pages 1051–1057, San Diego, California, USA. As- 28. sociation for Computational Linguistics. Delphine Bernhard, Louis De Viron, Veronique´ Jacopo Amidei, Paul Piwek, and Alistair Willis. 2018a. Moriceau, and Xavier Tannier. 2012. Question gen- Evaluation in Automatic Question eration for french: Collating parsers and paraphras- Generation 2013-2018. INLG 2018, page 307. ing questions. Dialogue & Discourse, 3(2):43–74. Jacopo Amidei, Paul Piwek, and Alistair Willis. 2018b. Ondrejˇ Bojar, Rajen Chatterjee, Christian Federmann, Rethinking the agreement in human evaluation Yvette Graham, Barry Haddow, Matthias Huck, An- tasks. In Proceedings of the 27th International Con- tonio Jimeno Yepes, Philipp Koehn, Varvara Lo- ference on Computational Linguistics, pages 3318– gacheva, Christof Monz, Matteo Negri, Aurelie´ 3329. Nev´ eol,´ Mariana Neves, Martin Popel, Matt Post, Jacopo Amidei, Paul Piwek, and Alistair Willis. 2019. Raphael Rubino, Carolina Scarton, Lucia Spe- The use of rating and Likert scales in Natural Lan- cia, Marco Turchi, Karin Verspoor, and Marcos guage Generation human evaluation tasks: A review Zampieri. 2016a. Findings of the 2016 confer- and some recommendations. In Proceedings of the ence on machine translation. In Proceedings of the 12th International Conference on Natural Language First Conference on Machine Translation: Volume Generation, Tokyo, Japan. Association for Compu- 2, Shared Task Papers, pages 131–198, Berlin, Ger- tational Linguistics. many. Association for Computational Linguistics.

R Ananthakrishnan, Pushpak Bhattacharyya, Ondrej Bojar, Christian Federmann, Barry Haddow, M Sasikumar, and Ritesh M Shah. 2007. Some Philipp Koehn, Matt Post, and Lucia Specia. 2016b. issues in automatic evaluation of English-Hindi Ten years of WMT evaluation campaigns: Lessons MT: More blues for BLEU. ICON. learnt. In Proceedings of the LREC 2016 Workshop Translation Evaluation From Fragmented Tools and Eva Anduiza and Carol Galais. 2016. Answering with- Data Sets to an Integrated Ecosystem, pages 27–34. out reading: IMCs and strong satisficing in online surveys. International Journal of Public Opinion Marc Brysbaert. 2019. How many participants do we Research, 29(3):497–519. have to include in properly powered experiments? A tutorial of power analysis with reference tables. Lora Aroyo and Chris Welty. 2014. The three sides of Journal of , 2(1):1–38. crowdtruth. Journal of Human Computation, 1:31– 34. Katherine S Button, John PA Ioannidis, Claire Mokrysz, Brian A Nosek, Jonathan Flint, Emma SJ Ron Artstein and Massimo Poesio. 2008. Inter-coder Robinson, and Marcus R Munafo.` 2013. Power fail- agreement for computational linguistics. Computa- ure: Why small sample size undermines the reliabil- tional Linguistics, 34(4):555–596. ity of neuroscience. Nature Reviews Neuroscience, 14(5):365. Ellen Gurman Bard, Dan Robertson, and Antonella So- race. 1996. Magnitude estimation of linguistic ac- Thiago Castro Ferreira, Chris van der Lee, Emiel van ceptability. Language, pages 32–68. Miltenburg, and Emiel Krahmer. 2019. Neural data- to-text generation: A comparison between pipeline Dale J Barr, Roger Levy, Christoph Scheepers, and and end-to-end architectures. In Proceedings of the Harry J Tily. 2013. Random effects structure for 2019 Conference on Empirical Methods in Natural confirmatory hypothesis testing: Keep it maximal. Language Processing, Hong Kong, SAR. Associa- Journal of Memory and Language, 68(3):255–278. tion for Computational Linguistics.

364 Arun Chaganty, Stephen Mussmann, and Percy Liang. guage Technology Research, pages 138–145. Mor- 2018. The price of debiasing automatic metrics in gan Kaufmann Publishers Inc. natural language evaluation. In Proceedings of the 56th Annual Meeting of the Association for Compu- Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Re- tational Linguistics (Volume 1: Long Papers), pages ichart. 2018. The hitchhikers guide to testing statis- 643–653. tical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the As- Leshem Choshen and Omeri Abend. 2018. Inherent sociation for Computational Linguistics (Volume 1: biases in reference-based evaluation for grammati- Long Papers), pages 1383–1392. cal error correction and text simplification. In Pro- ceedings of 56th Annual Meeting of the Association Johan Falkenjack, Katarina Heimann Muhlenbock,¨ and for Computational Linguistics, pages 632–642, Mel- Arne Jonsson.¨ 2013. Features indicating readability bourne, Australia. Association for Computational in Swedish text. In Proceedings of the 19th Nordic Linguistics. Conference of Computational Linguistics (NODAL- IDA 2013), pages 27–40. Domenic V Cicchetti, Donald Shoinralter, and Peter J Albert Gatt and Anja Belz. 2010. Introducing shared Tyrer. 1985. The effect of number of rating scale tasks to NLG: The TUNA shared task evaluation categories on levels of interrater reliability: A Monte challenges. In Empirical Methods in Natural Lan- Carlo investigation. Applied Psychological Mea- guage Generation, pages 264–293. Springer. surement, 9(1):31–36. Albert Gatt and Emiel Krahmer. 2018. Survey of the Aline Claesen, Sara Lucia Brazuna Tavares Gomes, state of the art in natural language generation: Core Francis Tuerlinckx, et al. 2019. Preregistration: tasks, applications and evaluation. Journal of Artifi- PsyArXiv Comparing dream to reality. . cial Intelligence Research, 61:65–170. Herbert H Clark. 1973. The language-as-fixed-effect Albert Gatt and Franc¸ois Portet. 2010. Textual prop- fallacy: A critique of language statistics in psycho- erties and task based evaluation: Investigating the logical research. Journal of verbal learning and ver- role of surface properties, structure and content. In bal behavior, 12(4):335–359. Proceedings of the 6th International Natural Lan- guage Generation Conference, pages 57–65. Asso- Scott Clifford and Jennifer Jerit. 2015. Do attempts to ciation for Computational Linguistics. improve respondent attention increase social desir- ability bias? Public Opinion Quarterly, 79(3):790– Andrew Gelman and Eric Loken. 2013. The garden of 802. forking paths: Why multiple comparisons can be a problem, even when there is no fishing expedition or Jacob Cohen. 1988. Statistical power analysis for the p-hacking and the research hypothesis was posited behavioral sciences. Routledge. ahead of time. Unpublished Manuscript. Kenneth Mark Colby, Sylvia Weber, and Franklin Den- Andrew Gelman et al. 2006. Prior distributions for nis Hilf. 1971. Artificial paranoia. Artificial Intelli- variance parameters in hierarchical models (com- gence, 2(1):1–25. ment on article by Browne and Draper). Bayesian analysis Geoff Cumming. 2013. Understanding the new statis- , 1(3):515–534. tics: Effect sizes, confidence intervals, and meta- Dimitra Gkatzia and Saad Mahamood. 2015. A snap- analysis. Routledge. shot of NLG evaluation practices 2005-2014. 
In Proceedings of the 15th European Workshop on Nat- Felice Dell’Orletta, Simonetta Montemagni, and Giu- ural Language Generation (ENLG), pages 57–60. lia Venturi. 2011. READ-IT: Assessing readability Association for Computational Linguistics. of Italian texts with a view to text simplification. In Proceedings of the Second Workshop on Speech Yvette Graham, Timothy Baldwin, Alistair Moffast, and Language Processing for Assistive Technolo- and Justin Zobel. 2017. Can machine translation gies, pages 73–83. Association for Computational systems be evaluated by the crowd alone. Natural Linguistics. Language Engineering, 23(1):330. Adamantios Diamantopoulos, Marko Sarstedt, Paul E Green and Vithala R Rao. 1970. Rating scales Christoph Fuchs, Petra Wilczynski, and Sebastian and information recovery: How many scales and re- Kaiser. 2012. Guidelines for choosing between sponse categories to use? Journal of Marketing, multi-item and single-item scales for construct 34(3):33–39. measurement: A predictive validity perspective. Journal of the Academy of Marketing Science, Bo Han, Will Radford, Ana¨ıs Cadilhac, Art Harol, An- 40(3):434–449. drew Chisholm, and Ben Hachey. 2017. Post-edit analysis of collective biography generation. In Pro- George Doddington. 2002. Automatic evaluation ceedings of the 26th International Conference on of machine translation quality using n-gram co- World Wide Web Companion, pages 791–792, Perth, occurrence statistics. In Proceedings of the Sec- Australia. International World Wide Web Confer- ond International Conference on Human Lan- ences Steering Committee.

365 Mary Dee Harris. 2008. Building a large-scale com- Robert W Lissitz and Samuel B Green. 1975. Effect of mercial NLG system for an EMR. In Proceedings the number of scale points on reliability: A Monte of the Fifth International Natural Language Gener- Carlo approach. Journal of Applied Psychology, ation Conference (INLG ’08), pages 157–160, Mor- 60(1):10. ristown, NJ, USA. Association for Computational Linguistics. Mary Lou Maher. 2012. Computational and collective creativity: Who’s being creative? In Proceedings of Helen Hastie and Anja Belz. 2014. A comparative the 3rd International Conference on Computer Cre- evaluation for NLG in interactive sys- ativity, pages 67–71, Dublin, Ireland. Association tems. In Proceedings of the Ninth International for Computational Linguistics. Conference on Language Resources and Evaluation (LREC-2014). Enrique Manjavacas, Folgert Karsdorp, Ben Burten- shaw, and Mike Kestemont. 2017. Synthetic liter- Joseph Henrich, Steven J Heine, and Ara Norenzayan. ature: Writing science fiction in a co-creative pro- 2010. The weirdest people in the world? The Be- cess. In Proceedings of the Workshop on Compu- havioral and Brain Sciences, 23:61–83; discussion tational Creativity in Natural Language Generation 83–135. (CC-NLG 2017), pages 29–37, Santiago de Com- postela, Spain. Association for Computational Lin- Ralf Herbrich, Tom Minka, and Thore Graepel. 2007. guistics. TrueSkillTM: A Bayesian skill rating system. In Ad- vances in Neural Information Processing Systems, Hector P Martinez, Georgios N Yannakakis, and John pages 569–576. Hallam. 2014. Don’t classify ratings of affect; rank them! IEEE transactions on affective computing, Richard R Jones. 1968. Differences in response consis- 5(3):314–326. tency and subjects preferences for three personality inventory response formats. In Proceedings of the Mary L McHugh. 2012. Interrater reliability: The 76th Annual Convention of the American Psycholog- kappa statistic. Biochemia Medica, 22(3):276–282. ical Association, volume 3, pages 247–248. Ameri- can Psychological Association Washington, DC. John McIver and Edward G Carmines. 1981. Unidi- mensional scaling. 24. Sage. J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation Blakeley B McShane, David Gal, Andrew Gelman, of new readability formulas (Automated Readabil- Christian Robert, and Jennifer L Tackett. 2019. ity Index, FOG count and Flesch reading ease for- Abandon statistical significance. The American mula) for navy enlisted personnel. Research Branch Statistician, 73(sup1):235–245. Report, 8(75). Chris Mellish and Robert Dale. 1998. Evaluation in the Philipp Koehn. 2004. Statistical significance tests context of natural language generation. Computer for machine translation evaluation. In Proceed- Speech & Language, 12(4):349–373. ings of the 2004 Conference on Empirical Meth- ods in Natural Language Processing, pages 388– Chris Mellish, Donia Scott, Lynne Cahill, Daniel Paiva, 395, Barcelona, Spain. Association for Computa- Roger Evans, and Mike Reape. 2006. A refer- tional Linguistics, Association for Computational ence architecture for natural language generation Linguistics. systems. Natural Language Engineering, 12(01):1– 34. Alexander Koplenig. 2017. Against statistical signifi- cance testing in corpus linguistics. Corpus Linguis- George A Miller. 1956. The magical number seven, tics and Linguistic Theory. plus or minus two: Some limits on our capacity for processing information. 
Psychological Review, Daniel Lakens, Federico G Adolfi, Casper J Al- 63(2):81. bers, Farid Anvari, Matthew AJ Apps, Shlomo E Argamon, Thom Baguley, Raymond B Becker, Emiel van Miltenburg, Desmond Elliott, and Piek Stephen D Benning, Daniel E Bradford, et al. 2018. Vossen. 2017. Cross-linguistic differences and sim- Justify your alpha. Nature Human Behaviour, ilarities in image descriptions. In Proceedings of the 2(3):168. 10th International Conference on Natural Language Generation, pages 21–30, Santiago de Compostela, Chris van der Lee, Bart Verduijn, Emiel Krahmer, and Spain. Association for Computational Linguistics. Sander Wubben. 2018. Evaluating the text quality, human likeness and tailoring component of PASS: Daniel Navarro. 2019. Learning statistics with R: A tu- A Dutch data-to-text system for soccer. In Proceed- torial for psychology students and other beginners: ings of the 27th International Conference on Com- Version 0.6.1. University of Adelaide. putational Linguistics, pages 962–972. Ani Nenkova, Jieun Chae, Annie Louis, and Emily Leo Lentz and Menno De Jong. 1997. The evaluation Pitler. 2010. Structural features for predicting the of text quality: Expert-focused and reader-focused linguistic quality of text. In Empirical Methods methods compared. IEEE transactions on profes- in Natural Language Generation, pages 222–241. sional communication, 40(3):224–234. Springer.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for Natural Language Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78.

John Peter. 1677. Artificial Versifying, or the School-boys Recreation. John Sims, London, UK.

Emily Pitler and Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 186–195. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

W James Potter and Deborah Levine-Donnerstein. 1999. Rethinking validity and reliability in content analysis. Journal of Applied Communication Research, 27:258–284.

Carolyn C Preston and Andrew M Colman. 2000. Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104(1):1–15.

Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, pages 1–12.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge, UK.

Ehud Reiter, Roma Robertson, and Liesl M Osman. 2003. Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144(1-2):41–58.

Jan Renkema. 2012. Schrijfwijzer, 5th edition. SDU Uitgevers, Den Haag, The Netherlands.

Johannah Rodgers. 2017. The genealogy of an image, or, what does literature (not) have to do with the history of computing?: Tracing the sources and reception of Gulliver's 'Knowledge Engine'. Humanities, 6(4):85.

Rahul Sambaraju, Ehud Reiter, Robert Logie, Andy McKinlay, Chris McVittie, Albert Gatt, and Cindy Sykes. 2011. What is in a text and what does it do: Qualitative evaluations of an NLG system – the BT-Nurse – using content analysis and discourse analysis. In Proceedings of the 13th European Workshop on Natural Language Generation, pages 22–31. Association for Computational Linguistics.

Donia Scott and Johanna Moore. 2007. An NLG evaluation competition? Eight reasons to be cautious. In Proceedings of the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation, pages 22–23.

JJ Shaughnessy, EB Zechmeister, and JS Zechmeister. 2006. Research Methods in Psychology. McGraw-Hill.

Joseph P Simmons, Leif D Nelson, and Uri Simonsohn. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, MA, USA. Association for Machine Translation in the Americas.

Karen Sparck Jones and Julia R. Galliers. 1996. Evaluating Natural Language Processing Systems: An Analysis and Review. Springer, Berlin and Heidelberg.

Somayajulu Sripada, Ehud Reiter, and Lezan Hawizy. 2005. Evaluation of an NLG system using post-edit data: Lessons learnt. In Proceedings of the 10th European Workshop on Natural Language Generation, pages 133–139, Aberdeen, Scotland. Association for Computational Linguistics.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018. BLEU is not suitable for the evaluation of text simplification. arXiv preprint arXiv:1810.05995. Accepted for publication as a short paper at EMNLP 2018.

Jonathan Swift. 1774. Travels Into Several Remote Nations of the World: In Four Parts. By Lemuel Gulliver. First a Surgeon, and Then a Captain of Several Ships..., volume 1. Benjamin Motte, London, UK.

John W Tukey. 1980. We need both exploratory and confirmatory. The American Statistician, 34(1):23–25.

Joseph P Turian, Luke Shen, and I Dan Melamed. 2003. Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX.

Sowmya Vajjala and Detmar Meurers. 2014. Assessing the relative reading level of sentence pairs for text simplification. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 288–297, Gothenburg, Sweden. Association for Computational Linguistics.

Renske Van Enschot, Wilbert Spooren, Antal van den Bosch, Christian Burgers, Liesbeth Degand, Jacqueline Evers-Vermeul, Florian Kunneman, Christine Liebrecht, Yvette Linders, and Alfons Maes. 2017. Taming our wild data: On intercoder reliability in discourse research. Unpublished Manuscript, pages 1–18.

David L Vannette. 2016. Testing the effects of differ- ent types of attention interventions on data quality in web surveys. Experimental evidence from a 14 country study. In 71st Annual Conference of the American Association for Public Opinion Research.

Paul Vogt, Rianne van den Berghe, Mirjam de Haas, Laura Hoffman, Junko Kanero, Ezgi Mamus, Jean-Marc Montanier, Cansu Oranc, Ora Oudgenoeg-Paz, Daniel Hernandez Garcia, Fotios Papadopoulos, Thorsten Schodde, Josje Verhagen, Christopher Wallbridge, Bram Willemsen, Jan de Wit, Tony Belpaeme, Tilbe Göksun, Stefan Kopp, Emiel Krahmer, Aylin Küntay, Paul Leseman, and Amit Kumar Pandey. 2019. Second language tutoring using social robots: A large-scale study. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE.

Eric-Jan Wagenmakers. 2007. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804.

Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, and Han van der Maas. 2011. Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3):426.

Jelte M Wicherts, Coosje LS Veldkamp, Hilde EM Augusteijn, Marjan Bakker, Robbie Van Aert, and Marcel ALM Van Assen. 2016. Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7:1832.

Georgios N Yannakakis and John Hallam. 2011. Rank- ing vs. preference: A comparative study of self- reporting. In International Conference on Affective Computing and Intelligent Interaction, pages 437– 446. Springer.

Georgios N Yannakakis and Héctor P Martínez. 2015. Ratings are overrated! Frontiers in ICT, 2:13.

Sina Zarrieß, Sebastian Loth, and David Schlangen. 2015. Reading times predict the quality of generated text above and beyond human ratings. In Proceedings of the 15th European Workshop on Natural Language Generation, pages 38–47, Brighton, UK. Association for Computational Linguistics.
