Semantic Web 0 (0) 1
IOS Press

Using Natural Language Generation to Bootstrap Empty Articles: A Human-centric Perspective

Lucie-Aimée Kaffee*, Pavlos Vougiouklis and Elena Simperl
School of Electronics and Computer Science, University of Southampton, UK
E-mails: [email protected], [email protected], [email protected]
*Corresponding author. E-mail: [email protected].

Abstract. Nowadays natural language generation (NLG) is used in everything from news reporting and chatbots to social media management. Recent advances in machine learning have made it possible to train NLG systems to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-served Wikipedias, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate text from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the text snippets score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles. It also hints at several potential implications of using NLG solutions in Wikipedia at large, including content quality, trust in technology, and algorithmic transparency.

Keywords: Wikipedia, Wikidata, ArticlePlaceholder, Under-resourced languages, Natural Language Generation, Neural networks

1. Introduction

Wikipedia is available in 301 languages, but its content is unevenly distributed [1]. Under-resourced language versions face multiple challenges: fewer editors mean fewer articles and less quality control, making that particular Wikipedia less attractive for readers in that language, which in turn makes it more difficult to recruit new editors from among the readers.

Wikidata, the structured-data backbone of Wikipedia [2], offers some help. It contains information about more than 55 million entities, for example, people, places or events, edited by an active international community of volunteers [3]. More importantly, it is multilingual by design and each aspect of the data can be translated and rendered to the user in their preferred language [4]. This makes it the tool of choice for a variety of content integration affordances in Wikipedia, including links to articles in other languages and infoboxes. An example can be seen in Figure 1: in the French Wikipedia, the infobox shown in the article about cheese (right) automatically draws in data from Wikidata (left) and displays it in French.

In previous work of ours, we proposed the ArticlePlaceholder, a tool that takes advantage of Wikidata's multilingual capabilities to increase the coverage of under-served Wikipedias [5]. When someone looks for a topic that is not yet covered by Wikipedia in their language, the ArticlePlaceholder tries to match the topic with an entity in Wikidata. If successful, it then redirects the search to an automatically generated placeholder page that displays the relevant information, for example the name of the entity and its main properties, in their language. The ArticlePlaceholder is currently used in 14 Wikipedias (see Section 3.1).


Fig. 1. Representation of Wikidata triples and their inclusion in a Wikipedia infobox. Wikidata triples (left) are used to fill out the fields of the infobox in the article about fromage in the French Wikipedia.

In this paper, we propose an iteration of the ArticlePlaceholder to improve the representation of the data on the placeholder page. The original version of the tool pulled the raw data from Wikidata (available as triples with labels in different languages) and displayed it in tabular form (see Figure 3 in Section 3). In the current version, we use Natural Language Generation (NLG) techniques to automatically produce a text snippet from the triples instead. Presenting structured data as text rather than tables helps people uninitiated with the involved technologies to make sense of it [6]. This is particularly useful in contexts where one cannot make any assumptions about the levels of data literacy of the audience, as is the case for a large share of the Wikipedia readership.

Our NLG solution builds upon the general encoder-decoder framework for neural networks, which is credited with promising results in similar text-centric tasks, such as machine translation [7, 8] and question generation [9–11]. We extend this framework to meet the needs of different Wikipedia language communities in terms of text fluency, appropriateness to Wikipedia, and reuse during article editing. Given an entity that was matched by the ArticlePlaceholder, our system uses its triples to generate a short Wikipedia-style summary. Many existing NLG techniques produce sentences with limited usability in user-facing systems; one of the most common problems is handling rare words [12, 13], which are words that the model does not meet frequently enough during training, such as localisations of names in different languages. We introduce a mechanism called property placeholder [14] to tackle this problem, learning multiple verbalisations of the same entity in the text [6].

In building the system we aimed to pursue the following research questions:

RQ1 Can we train a neural network to generate text from triples in a low-resource setting? To answer this question we first evaluated the system using a series of predefined metrics and baselines. In addition, we undertook a quantitative study with participants from two under-served Wikipedia language communities (Arabic and Esperanto), who were asked to assess, from a reader's perspective, whether the text is fluent and appropriate for Wikipedia.

RQ2 How do editors perceive the generated text on the ArticlePlaceholder page? To add depth to the quantitative findings of the first study, we undertook a second, mixed-methods study within six Wikipedia language communities (Arabic, Swedish, Hebrew, Persian, Indonesian, and Ukrainian).

We carried out semi-structured interviews, in which we asked editors to comment on their experience with reading the summaries generated through our approach, and we identified common themes in their answers. Among others, we were interested to understand how editors perceive text that is the result of an artificial intelligence (AI) algorithm rather than being manually crafted, and how they deal with the special rare-entity tokens in the sentences. Those tokens represent realisations of infrequent entities in the text that data-driven approaches generally struggle to verbalise [12].

RQ3 How do editors use the text snippets in their work? As part of the second study, we also asked participants to edit the placeholder page, starting from the automatically generated text or removing it completely. We assessed text reuse both quantitatively, using a string-matching metric, and qualitatively through the interviews. Just like in RQ2, we were also interested to understand whether summaries with rare-entity tokens, which point to limitations in the algorithm, would be used when editing and how the editors would work around the tokens.

The evaluation helps us build a better understanding of the tools and experience we need to help nurture under-served Wikipedias. Our quantitative analysis of the reading experience showed that participants rank the text snippets close to the expected quality standards in Wikipedia, and are likely to consider them as part of Wikipedia. This was confirmed by the interviews with editors, which suggested that people believe the summaries to come from a Wikimedia-internal source. According to the editors, the new format of the ArticlePlaceholder enhances the reading experience: people tend to look for specific bits of information when accessing a Wikipedia page and the compact nature of the generated text supports that. In addition, the text seems to be a useful starting point for further editing, and editors reuse a large portion of it even when it includes rare-entity tokens.

We believe the two studies could also help advance the state of the art in two other areas: together, they propose a user-centred methodology to evaluate NLG, which complements automatic approaches based on standard metrics and baselines, which are the norm in most papers; at the same time, they also shed light on the emerging area of human-AI interaction in the context of NLG. While the editors worked their way around the tokens both during reading and writing, they did not check the text for correctness, nor did they query where the text came from and what the tokens meant. This suggests that we need more research into how to communicate the provenance of content in Wikipedia, especially in the context of automatic content generation and deep fakes [15], as well as algorithmic transparency.

Structure of the paper. The remainder of the paper is organised as follows. We start with some background for our work and related papers in Section 2. Next, we introduce our approach to bootstrapping empty Wikipedia articles, which includes the ArticlePlaceholder tool and its NLG extension (Section 3). In Section 4 we provide details on the evaluation methodology, whose findings we present in Section 5. We then discuss the main themes emerging from the findings, their implications and the limitations of our work in Section 6, before concluding with a summary of contributions and planned future work in Section 8.

Previous submissions. A preliminary version of this work was published in [14, 16]. In the current paper, we have carried out a comprehensive evaluation of the approach, including a new qualitative study and a task-based evaluation with editors from six language communities. By comparison, the previous publications covered only a metric-based corpus evaluation, which was complemented by a small quantitative study of text fluency and appropriateness in the second one. The neural network architecture has been presented in detail in [14].

2. Background and related work

We divide this section into three areas. First we provide some background on Wikipedia and Wikidata, with a focus on multilingual aspects. Then we summarise the literature on text generation and conclude with methodologies to evaluate NLG systems.

2.1. Multilingual Wikipedia and Wikidata

Wikipedia is a community-built encyclopedia and one of the most visited websites in the world. There are currently 301 language versions of Wikipedia, though coverage is unevenly distributed. Previous studies have discussed several biases, including the gender of the editors [17], and topics, for instance a general lack of information on the Global South [18].

Language coverage tells a similar story. [19] noted that only 67% of the world's population has access to encyclopedic knowledge in their first or second language. Another significant problem is caused by the extreme differences in content coverage between language versions. As early as 2005, Voss [20] found huge gaps in the development and growth of Wikipedias, which make it more difficult for smaller communities to catch up with the larger ones. Alignment cannot be simply achieved through translation: putting aside the fact that each Wikipedia needs to reflect the interests and points of view of its local community rather than iterate over content transferred from elsewhere, studies have shown that the existing content does not overlap, discarding the so-called English-as-Superset conjecture for as many as 25 language versions [1, 21]. To help tackle these imbalances, Bao et al. [22] built a system to give users easier access to the language diversity of Wikipedia.

Our work is motivated by and complements previous studies and frameworks that argue that the language of global projects such as Wikipedia (Hecht [21]) should express cultural reality (Kramsch and Widdowson [23]). Instead of using content from one Wikipedia version to bootstrap another, we take structured data labelled in the relevant language and create a more accessible representation of it as text, keeping those cultural expressions unimpaired.

To do so, we leverage Wikidata [2]. Wikidata was originally created to support Wikipedia's language connections, for instance links to articles in other languages or infoboxes (see the example from Section 1); however, it soon evolved into becoming a critical source of data for many other applications. Wikidata contains statements on general knowledge, e.g. about people, places, events and other entities of interest. The knowledge base is created and maintained collaboratively by a community of editors, assisted by automated tools called bots [3, 24]. Bots take on repetitive tasks such as ingesting data from a different source, or simple quality checks.

The basic building blocks of Wikidata are items and properties. Both have identifiers, which are language-independent, and labels in one or more languages. They form triples, which link items to their attributes, or to other items. Figure 2 shows a triple linking a place to its country, where both items and the property between them have labels in three languages.

Fig. 2. Example of labeling in Wikidata. Each entity can be labeled in multiple languages using the labeling property rdfs:label (not displayed in the figure).

We analysed the distribution of label languages in Wikidata in [4] and noted that while there is a bias to English, the language distribution is more balanced than on the web at large. This was the starting point for our work on the ArticlePlaceholder (see Section 3), which leverages the multilingual support of Wikidata to bootstrap empty articles in under-served Wikipedias.

2.2. Text generation

Our work is directly related to approaches that produce text from triples expressed as RDF (Resource Description Framework) or similar. Many of them rely on templates, which are either based on linguistic features, e.g. grammatical rules [25], or are hand-crafted [26]. An example is Reasonator, a tool for lexicalising Wikidata triples with templates translated by users.1 Such approaches face many challenges: they cannot be easily transferred to different languages or scale to broader domains, as templates need to be adjusted to any new language or domain they are ported to. This makes them unsuitable for under-resourced Wikipedias, which rely on small numbers of contributors.

To tackle this limitation, [27, 28] introduced a distant-supervised method to verbalise triples, which learns templates from existing Wikipedia articles. While this makes the approach more suitable for language-independent tasks, the templates assume that entities will always have the relevant triples to fill in the slots. This assumption is not always true. In our work, we propose a template-learning baseline and show that adapting to the varying triples available can achieve better performance.

Sauper and Barzilay [29] and Pochampally et al. [30] generate Wikipedia summaries by harvesting sentences from the Internet. Wikipedia articles are used to automatically derive templates for the topic structure of the summaries and the templates are afterwards filled using web content. Both systems work best on specific domains and for languages like English, for which suitable web content is readily available [31].

1 https://tools.wmflabs.org/reasonator/

A different line of research aims to explore knowledge bases as a resource for NLG [6, 27, 32, 33]. In all these examples, linguistic information from the knowledge base is used to build a parallel corpus containing triples and equivalent text sentences from Wikipedia, which is then used to train the NLG algorithm. Directly relevant to the model we propose are the proposals by Lebret et al. [34], Chisholm et al. [32], and Vougiouklis et al. [6], which extend the general encoder-decoder neural network framework from [7, 8]. They use structured data from Wikidata and DBpedia as input and generate one-sentence summaries that match the style of the English Wikipedia in a single domain. While this is a rather narrow task compared to other generative tasks such as translation, Chisholm et al. [32] discuss its challenges in detail and show that it is far from being solved. Compared to these previous works, the NLG algorithm presented in this paper is designed for open-domain tasks and multiple languages.

2.3. Evaluating text generation systems

Related literature suggests three ways of determining how well an NLG system achieves its goals. The first, which is commonly referred to as metric-based corpus evaluation [35], uses text-similarity metrics such as BLEU [36], ROUGE [37] and METEOR [38]. These metrics essentially compare how similar the generated texts are to texts from the corpus. The other two involve people and are either task-based or judgement/rating-based [35]. Task-based evaluations aim to assess how an NLG solution assists participants in undertaking a particular task, for instance learning about a topic or writing a text. Judgement-based evaluations rely on a set of criteria against which participants are asked to rate the quality of the automatically generated text [35].

Metric-based corpus evaluations are widely used as they offer an affordable, reproducible way to automatically assess the linguistic quality of the generated texts [14, 32, 34, 35, 39, 40]. However, they do not always correlate with manually curated quality ratings [35].

Task-based studies are considered most useful, as they allow system designers to explore the impact of the NLG solution on end-users [35, 41]. However, they can be resource-intensive: previous studies [42, 43] cite five-figure sums, including data analysis and planning [44]. The system in [43] was evaluated for the accuracy of the generated literacy and numeracy assessments by a sample of 230 participants, which cost as much as 25 thousand UK pounds. Reiter et al. [42] described a clinical trial with over 2000 smokers costing 75 thousand pounds, assumed to be the most costly NLG evaluation at this point. All of the smokers completed a smoking questionnaire in the first stage of the experiment, in order to find what portion of those who received the automatically generated letters from STOP had managed to quit.

Given these challenges, most research systems tend to use judgement-based rather than task-based evaluations [27, 28, 32, 35, 39, 40, 45, 46]. Besides their limited scope, most studies in this category do not recruit from the relevant user population, relying on more accessible options such as online crowdsourcing. Sauper and Barzilay [29] is a rare exception. In their paper, the authors describe the generation of Wikipedia articles using a content-selection algorithm that extracts information from online sources. They test the results by publishing 15 articles about diseases on Wikipedia and measuring how the articles change (including links, formatting and grammar). Their evaluation approach is not easy to follow, as the Wikipedia community tends to disagree with conducting research on their platform.2

The methodology we follow draws from all these three areas: we start with an automatic, metric-based corpus evaluation to get a basic understanding of the quality of the text the system produces and compare it with relevant baselines. We then use quantitative analysis based on human judgements to assess how useful the text snippets are for core tasks such as reading and editing. To add context to these figures, we run a mixed-methods study with an interview and a task-based component, where we learn more about the user experience with NLG and measure the extent to which NLG text is reused by editors, using a metric inspired by journalism [47] and plagiarism detection [48].

3. Bootstrapping empty Wikipedia articles

The overall aim of our system is to give editors access to information that is not yet covered in Wikipedia, but is available, in the relevant language, in Wikidata. The system is built on the ArticlePlaceholder, which displays Wikidata triples dynamically on under-served Wikipedias.

2 https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not#Wikipedia_is_not_a_laboratory

In this paper, we extend the ArticlePlaceholder with an NLG component that generates an introductory sentence on each ArticlePlaceholder page in the target language from Wikidata triples.

3.1. ArticlePlaceholder

As discussed earlier, some Wikipedias suffer from a lack of content, which means fewer readers and, in turn, fewer potential editors. The idea of the ArticlePlaceholder is to use Wikidata, which contains information about 55 million entities (by comparison, the English Wikipedia covers around 5 million topics), often in different languages, to bootstrap articles in under-served language versions. An initial version of this tool was presented in [5].

ArticlePlaceholders are pages on Wikipedia that are dynamically drawn from Wikidata triples. When the information in Wikidata changes, the ArticlePlaceholder pages are automatically updated. In the original release, the pages display the triples in a tabular way, purposely not reusing the design of a standard Wikipedia page to make the reader aware that the page was automatically generated and requires further attention. An example of the interface can be seen in Figure 3.

The ArticlePlaceholder is deployed on 14 Wikipedias with a median of 69,623.5 articles, between 253,539 (Esperanto) and 7,464 (Northern Sami).

3.2. Text generation

We use a data-driven approach that allows us to append a short text to ArticlePlaceholder pages.

3.2.1. Neural architecture

We reuse the encoder-decoder architecture introduced in previous work of ours (Vougiouklis et al. [6]), which was focused on a closed-domain text generation task for English. The model consists of a feed-forward architecture, the triple encoder, which encodes an input set of triples into a vector of fixed dimensionality, and a Recurrent Neural Network (RNN) based on Gated Recurrent Units (GRUs) [7], which generates a text snippet by conditioning the output on the encoded vector.

The model is displayed in Figure 4. The ArticlePlaceholder provides a set of triples about the Wikidata item of Nigragorĝa (i.e., Q1586267 (Nigragorĝa) is either the subject or the object of the triples in the set). Figure 4 shows how the model generates a summary from those triples, f1, f2 to fn. A vector representation h_f1, h_f2 to h_fn for each of the input triples is computed by processing their subject, predicate and object. Such triple-level vector representations help compute a vector representation for the whole input set, h_FE. h_FE, along with a special start-of-summary token, initialises the decoder, which predicts tokens ("[[Q1586267, Nigragorĝa]]", "estas", "rimarkinda" etc.).

Surface form tuples. A summary such as "Nigragorĝa estas rimarkinda ..." consists of regular words (e.g. "estas" and "rimarkinda") and mentions of entities in the text ("[[Q1586267, Nigragorĝa]]"). However, an entity can be expressed in a number of different ways in the summary. For example, "actor" and "actress" refer to the same concept in the knowledge graph. In order to be able to learn an arbitrary number of different lexicalisations of the same entity in the summary (e.g. "aktorino", "aktoro"), we adapt the concept of surface form tuples [6]. "[[Q1586267, Nigragorĝa]]" is an example of a surface form tuple whose first element is the entity in the knowledge graph and whose second is its realisation in the context of this summary.

Property placeholders. The model from [6], which was the starting point for the component presented here, leverages instance-type-related information from DBpedia in order to generate text that covers rare or unseen entities. We broadened its scope and adapted it to Wikidata without using external information from other knowledge bases. In the current setup, surface form tuples in the text that correspond to rare entities that are found to participate in relations in the input triple set are replaced by the token of the property of the matched relationship. We refer to those placeholder tokens [10, 13] as property placeholders. These tokens are appended to the target dictionary of the generated summaries. We used the distant-supervision assumption for relation extraction [49] for the property placeholders. After identifying the rare entities that participate in relations with the main entity of the article, they are replaced in the introductory sentence with their corresponding property placeholder tag (e.g. [[P17]] in Table 1). During testing, any property placeholder token that is generated by our system is replaced by the label of the entity of the relevant triple (i.e. the triple with the same property as the generated token). In Table 1, [[P17]] in the processed summary is an example of a property placeholder. In case it is generated by our model, it is replaced with the label of the object of the triple with which it shares the same property (i.e. Q490900 (Floridia) P17 (ŝtato) Q38 (Italio)).
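For illustration only, the following is a minimal PyTorch-style sketch of an architecture of the kind described above: a feed-forward triple encoder whose per-triple vectors are concatenated to initialise a GRU decoder. It is not the authors' implementation; all class names, dimensions and tensor layouts are assumptions made for the example.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Feed-forward encoder: one vector h_f per (subject, predicate, object) triple."""
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Linear(3 * emb_dim, hidden_dim)

    def forward(self, triples):            # triples: (batch, n_triples, 3) token ids
        e = self.embed(triples)            # (batch, n_triples, 3, emb_dim)
        e = e.flatten(start_dim=2)         # concatenate subject, predicate, object
        return torch.tanh(self.ff(e))      # (batch, n_triples, hidden_dim)

class SummaryDecoder(nn.Module):
    """GRU decoder conditioned on the concatenated triple vectors (h_FE in Figure 4)."""
    def __init__(self, target_vocab_size, emb_dim, hidden_dim, n_triples):
        super().__init__()
        self.embed = nn.Embedding(target_vocab_size, emb_dim)
        self.init_state = nn.Linear(n_triples * hidden_dim, hidden_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, target_vocab_size)

    def forward(self, h_triples, summary_ids):
        # h_FE: concatenation of the per-triple vectors
        h_fe = h_triples.flatten(start_dim=1)
        h0 = torch.tanh(self.init_state(h_fe)).unsqueeze(0)
        e = self.embed(summary_ids)        # summary begins with a start-of-summary token
        output, _ = self.gru(e, h0)
        # scores over regular words, surface form tuples and
        # property placeholders such as [[P17]]
        return self.out(output)
```

The target vocabulary in this sketch would contain words, surface form tuples and property placeholder tokens alike, mirroring the extended dictionary described above.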

Fig. 3. Example page of the ArticlePlaceholder as deployed now on 14 Wikipedias. This example contains information from Wikidata on Triceratops in Haitian Creole.

Fig. 4. Representation of the neural network architecture. The triple encoder computes a vector representation for each one of the three input triples from the ArticlePlaceholder, h_f1, h_f2 and h_f3. Subsequently, the decoder is initialised using the concatenation of the three vectors, [h_f1; h_f2; h_f3]. The purple boxes represent the tokens of the generated text. Each snippet starts and ends with special tokens: start-of-summary and end-of-summary.

Table 1
The ArticlePlaceholder provides our system with a set of triples about Floridia, in which the item of Floridia appears as either the subject or the object. Subsequently, our system summarises the input set of triples as text. We train our model using the summary with the extended vocabulary (i.e. "Summary w/ Property placeholders").

ArticlePlaceholder triples:
f1: Q490900 (Floridia) P17 (ŝtato) Q38 (Italio)
f2: Q490900 (Floridia) P31 (estas) Q747074 (komunumo de Italio)
f3: Q30025755 (Floridia) P1376 (ĉefurbo de) Q490900 (Floridia)

Textual summary: Floridia estas komunumo de Italio.
Summary w/ property placeholders: [[Q490900, Floridia]] estas komunumo de [[P17]].
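To make the preprocessing of Table 1 concrete, the sketch below replaces rare entity mentions with the property placeholder of the triple they participate in. The data structures, the "[[Q...]]" token format and the fallback token name are simplifications assumed for this example, not the authors' code.

```python
def add_property_placeholders(summary_tokens, aligned_triples, frequent_entities):
    """Replace rare entity mentions (surface form tuples) with property placeholders.

    summary_tokens    -- e.g. ["[[Q490900, Floridia]]", "estas", "komunumo", "de", "[[Q38, Italio]]", "."]
    aligned_triples   -- list of (subject, property, object) Wikidata identifiers
    frequent_entities -- entity mentions that stay in the target vocabulary
    """
    processed = []
    for token in summary_tokens:
        if token.startswith("[[Q") and token not in frequent_entities:
            entity_id = token[2:].split(",")[0]          # e.g. "Q38"
            # distant-supervision assumption: the rare entity takes part in one of
            # the input triples; reuse that triple's property as the placeholder
            prop = next((p for s, p, o in aligned_triples
                         if entity_id in (s, o)), None)
            # "<rare>" is a stand-in name for the paper's special rare-entity token
            processed.append(f"[[{prop}]]" if prop else "<rare>")
        else:
            processed.append(token)
    return processed

# With the triples of Table 1, the summary about Floridia would keep the main entity
# and turn the mention of Q38 (Italio) into [[P17]], as in the last row of the table.
```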

3.2.2. Training dataset

In order to train and evaluate our system, we created a dataset for text generation from knowledge base triples in two languages. We used two language versions of Wikipedia (we provide further details about how we prepared the summaries for the rest of the languages in Section 4.3), which differ in terms of size (see Table 2) and language support in Wikidata [4]. The dataset aligns Wikidata triples about an item with the first sentence of the Wikipedia article about that entity.

For each Wikipedia article, we extracted and tokenized the first sentence using a multilingual Regex tokenizer from the NLTK toolkit [50]. Afterwards, we retrieved the Wikidata item corresponding to the article and queried all triples where the item appeared as a subject or an object in the Wikidata truthy dump.3 In doing so, we relied on keyword matching against labels from Wikidata in the corresponding language, due to the lack of reliable entity linking tools for under-served languages.

We use property placeholders (as described in the previous section) to mitigate the lack of vocabulary typical for under-resourced languages. An example of a summary which is used for the training of the neural architecture is: "Floridia estas komunumo de [[P17]].", where [[P17]] is the property placeholder for Italio (see Table 1). In case a rare entity in the text is not matched to any of the input triples, its realisation is replaced by the special rare-entity token.

4. Evaluation methodology

We followed a mixed-methods approach to investigate the three questions discussed in the introduction (Table 3). To answer RQ1, we used an automatic metric-based corpus evaluation, as well as a judgement-based quantitative evaluation with readers of two language Wikipedias, who are native (or fluent, in the case of Esperanto) speakers of the language. We showed them text generated through our approach, as well as genuine Wikipedia sentences and news snippets of similar length, and asked them to rate their fluency and appropriateness for Wikipedia on a scale. To answer RQ2 and RQ3, we carried out an interview study with editors of six Wikipedias, with qualitative and quantitative components. We instructed editors to complete a reading and an editing task and to describe their experiences along a series of questions. We used thematic analysis to identify common themes in the answers. For the editing task, we also used a quantitative element in the form of a text reuse metric, which is described below.

4.1. RQ1 - Metric-based corpus evaluation

We evaluated the generated summaries against two baselines on their original counterparts from Wikipedia. We used a set of evaluation metrics for text generation: BLEU 2, BLEU 3, BLEU 4, METEOR and ROUGE_L. BLEU calculates n-gram precision multiplied by a brevity penalty, which penalizes short sentences to account for word recall. METEOR is based on the combination of uni-gram precision and recall, with recall weighted over precision. It extends BLEU by including stemming, synonyms and paraphrasing. ROUGE_L is a recall-based metric which calculates the length of the longest common subsequence between the generated summary and the reference.

4.1.1. Data

Both the Arabic and the Esperanto corpus are split into training, validation and test sets, with respective portions of 85%, 10%, and 5%. We used a fixed target vocabulary that consisted of the 15,000 and 25,000 most frequent tokens (i.e. either words or surface form tuples of entities) of the summaries in Arabic and Esperanto respectively.

We replaced any rare entities in the text that participate in relations in the aligned triple set with the corresponding property placeholder of the upheld relations. We include all property placeholders that occur at least 20 times in each training dataset. Subsequently, the dictionaries of the Esperanto and Arabic summaries are expanded by 80 and 113 property placeholders respectively. In case the rare entity is not matched to any subject or object of the set of corresponding triples, it is replaced by the special rare-entity token. Each summary in the corpora is augmented with the respective start-of-summary and end-of-summary tokens. The former acts as a signal with which the decoder initialises the text generation process, whereas the latter is outputted when the summary has been generated completely [8, 51].

4.1.2. Baselines

Due to the variety of approaches for text generation, we demonstrate the effectiveness of our system by comparing it against two baselines of different nature.

3 https://dumps.wikimedia.org/wikidatawiki/entities/
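As a rough illustration of the alignment step described in Sections 3.2.2 and 4.1.1, the sketch below tokenises an article's first sentence with an NLTK regular-expression tokenizer and marks entities by keyword matching against Wikidata labels. The tokenizer pattern, the data structures and the function name are illustrative assumptions, not the exact pipeline used for the paper.

```python
from nltk.tokenize import RegexpTokenizer

# crude multilingual word/punctuation tokenizer, similar in spirit to the one used
tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")

def align_sentence_with_triples(first_sentence, triples, labels, language):
    """Find which triple entities are mentioned in the article's first sentence.

    triples -- (subject, property, object) id triples where the item is subject or object
    labels  -- dict: entity id -> {language code: label}
    Returns the sentence tokens and the entity ids whose labels occur in the sentence.
    """
    tokens = tokenizer.tokenize(first_sentence.lower())
    matched = []
    for s, p, o in triples:
        for entity in (s, o):
            label = labels.get(entity, {}).get(language)
            if not label:
                continue
            label_tokens = tokenizer.tokenize(label.lower())
            # naive keyword matching: the label's tokens appear as a contiguous span
            for i in range(len(tokens) - len(label_tokens) + 1):
                if tokens[i:i + len(label_tokens)] == label_tokens:
                    matched.append(entity)
                    break
    return tokens, matched
```

Keyword matching of this kind is deliberately simple; as noted above, it is used because reliable entity linking tools are not available for most under-served languages.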

Table 2
Recent page statistics and number of unique words (vocabulary size) of the Esperanto, Arabic and English Wikipedias in comparison with Wikidata. Retrieved 27 September 2017. Active users are registered users that have performed an action in the last 30 days.

Page stats | Esperanto | Arabic | English | Wikidata
Articles | 241,901 | 541,166 | 5,483,928 | 37,703,807
Average number of edits/page | 11.48 | 8.94 | 21.11 | 14.66
Active users | 2,849 | 7,818 | 129,237 | 17,583
Vocabulary size | 1.5M | 2.2M | 2M | –

Table 3
Evaluation methodology

 | Data | Method | Participants
RQ1 | Metrics and survey answers | Metric-based evaluation and judgement-based evaluation, quantitative | Readers of two Wikipedias
RQ2 | Interview answers | Task-based evaluation, qualitative (thematic analysis) | Editors of six Wikipedias
RQ3 | Interview answers and text reuse metrics | Task-based evaluation, qualitative (thematic analysis) and quantitative | Editors of six Wikipedias

Machine Translation (MT). For the MT baseline, we used Google Translate on English Wikipedia summaries. Those translations are compared to the actual target language's Wikipedia entry. This limits us to articles that exist in both English and the target language. In our dataset, the concepts in Esperanto and Arabic that are not covered by the English Wikipedia account for 4.3% and 30.5% respectively. This indicates the content coverage gap between different Wikipedia languages [1].

Template Retrieval (TP). Similar to template-based approaches for text generation [28, 52], we build a template-based baseline that retrieves an output summary from the training data based on the input triples. First, the baseline encodes the list of input triples that corresponds to each summary in the training/test sets into a sparse vector of TF-IDF weights [53]. Afterwards, it performs LSA [54] to reduce the dimensionality of that vector. Finally, for each item in the test set, we employ the K-nearest neighbours algorithm to retrieve the vector from the training set that is the closest to this item. The summary that corresponds to the retrieved vector is used as the output summary for this item in the test set. We provide two versions of this baseline. The first one (TP) retrieves the raw summaries from the training dataset. The second one (TPext) retrieves summaries with the special tokens for vocabulary extension. A summary can act as a template after replacing its entities with their corresponding property placeholders (see Table 1).

KN. The KN baseline is a 5-gram Kneser-Ney (KN) [55] language model. KN has been used before as a baseline for text generation from structured data [34] and provided competitive results on a single domain in English. We also introduce a second KN model (KNext), which is trained on summaries with the special tokens for property placeholders. During test time, we use beam search of size 10 to sample from the learned language model.

4.2. RQ1 - Judgement-based evaluation

We defined quality in terms of text fluency and appropriateness, where fluency refers to how understandable and grammatically correct a text is, and appropriateness captures how well the text 'feels like' Wikipedia content. We asked two sets of participants from two different language Wikipedias to assess the text snippets on a scale according to these two metrics.

Participants were asked to fill out a survey combining fluency and appropriateness questions. An example for a question can be found in Figure 5.

Fig. 5. Example of a question to the editors about quality and appropriateness for Wikipedia of the generated summaries in Arabic. They see this page after the instructions are displayed. First, the user is asked to evaluate the quality from 0 to 6 (Question 69), then they are asked whether the sentence could be part of Wikipedia (Question 70).

4.2.1. Recruitment

Our study targets any speaker of Arabic and Esperanto who reads that particular Wikipedia, independent of their contributions to Wikipedia. We wanted to reach fluent speakers of each language who use Wikipedia and are familiar with it, even if they do not edit it frequently. For Arabic, we reached out to Arabic-speaking researchers from research groups working on Wikipedia-related topics. For Esperanto, as there are fewer speakers and they are harder to reach, we promoted the study on social media such as Twitter and Reddit using the researchers' accounts.4 The survey instructions and announcements were translated to Arabic and Esperanto.5 The survey was open for 15 days.

4.2.2. Participants

We recruited a total of 54 participants (see Table 4). Coincidentally, 27 of them were from each language community.

4.2.3. Ethics

The research was approved by the Ethics Committee of the University of Southampton under ERGO Number 30452.

4.2.4. Data

For both languages, we created a corpus consisting of 60 summaries, of which 30 are generated through our approach, 15 are from news, and 15 are Wikipedia sentences used to train the neural network model. For news in Esperanto, we chose introductory sentences of articles in the Esperanto version of Le Monde Diplomatique.6 For news in Arabic, we did the same and used the RSS feed of BBC Arabic.7

4 https://www.reddit.com/r/Esperanto/comments/75rytb/help_in_a_study_using_ai_to_create_esperanto/
5 https://github.com/luciekaffee/Announcements
6 http://eo.mondediplo.com/, accessed on the 28th of September, 2017
7 http://feeds.bbci.co.uk/arabic/middleeast/rss.xml, accessed on the 28th of September, 2017

Table 4
Judgement-based evaluation: total number of participants (P), total number of sentences (S), number of participants who evaluated at least 50% of the sentences, average number of sentences evaluated per participant, and number of total annotations.

 | #P | #S | #P: S>50% | Avg #S/P | All Ann.
Arabic, Fluency | 27 | 60 | 5 | 15.03 | 406
Arabic, Appropriateness | 27 | 60 | 5 | 14.78 | 399
Esperanto, Fluency | 27 | 60 | 3 | 8.7 | 235
Esperanto, Appropriateness | 27 | 60 | 3 | 8.63 | 233

Table 5
Number of Wikipedia articles, active editors on Wikipedia (editors that performed at least one edit in the last 30 days), and number of native speakers in millions.

Language | # Articles | Active Editors | Native Speakers
Arabic | 541,166 | 5,398 | 280
Swedish | 3,763,584 | 2,669 | 8.7
Hebrew | 231,815 | 2,752 | 8
Persian | 643,635 | 4,409 | 45
Indonesian | 440,948 | 2,462 | 42.8
Ukrainian | 830,941 | 2,755 | 30

4.2.5. Metrics

Let S = {s_1, s_2, ..., s_n} be a set of sentences to be assessed and N = {N_1, N_2, ..., N_n} be the set of the total numbers of judgements for the sentences s_1, ..., s_n. Each participant was asked to assess the fluency of 60 sentences on a scale from 0 to 6 as follows:

(6) No grammatical flaws and the content can be understood with ease
(5) Comprehensible and grammatically correct summary that reads a bit artificial
(4) Comprehensible summary with minor grammatical errors
(3) Understandable, but has grammatical issues
(2) Barely understandable summary with significant grammatical errors
(1) Incomprehensible summary, but a general theme can be understood
(0) Incomprehensible summary

For each sentence, we calculated the mean fluency given by all participants, and then averaged over all summaries of each category. We defined F_{s_i} = {f_1^{s_i}, f_2^{s_i}, ..., f_{N_i}^{s_i}} s.t. f_j^{s_i} ∈ [0, 6] ∀ j ∈ [1, N_i] as the set of all the N_i fluency ratings that the sentence s_i has received. We computed the overall fluency as follows:

(1/n) Σ_{i=1}^{n} ( (1/N_i) Σ_{j=1}^{N_i} f_j^{s_i} )    (1)

To assess the appropriateness, participants were asked to assess whether the displayed sentence could be part of a Wikipedia article. We tested whether a reader can tell from just one sentence whether a text is appropriate for Wikipedia, using the news sentences as a baseline. This gave us an insight into whether the text produced by the neural network "feels" like Wikipedia text. Participants were asked not to use any external tools for this task and had to give a binary answer.

Given the same set of sentences, S, and the total number of annotations that each sentence in S has received, we defined A_{s_i} = {a_1^{s_i}, a_2^{s_i}, ..., a_{N_i}^{s_i}} s.t. a_j^{s_i} ∈ {0, 1} ∀ j ∈ [1, N_i] as the set of all the N_i appropriateness ratings that the sentence s_i has received. The average appropriateness score for the summaries of each method was calculated as follows:

(1/n) Σ_{i=1}^{n} ( (1/N_i) Σ_{j=1}^{N_i} a_j^{s_i} )    (2)
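Both scores in Equations (1) and (2) are macro-averages: each sentence's ratings are averaged first, and the per-sentence means are then averaged over all sentences of a category. A small sketch of the computation, with hypothetical input values, is shown below.

```python
def macro_average(ratings_per_sentence):
    """Average the per-sentence mean ratings, as in Equations (1) and (2).

    ratings_per_sentence -- list of rating lists; fluency ratings lie in [0, 6],
    appropriateness ratings are binary (0 or 1).
    """
    per_sentence_means = [sum(r) / len(r) for r in ratings_per_sentence if r]
    return sum(per_sentence_means) / len(per_sentence_means)

# hypothetical example: three sentences rated by different numbers of participants
fluency = macro_average([[6, 5, 5], [4, 4], [3, 5, 6, 2]])
appropriateness = macro_average([[1, 1, 0], [1, 0], [1, 1, 1, 0]])
```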

4.3. RQ2 and RQ3 - Task-based evaluation

We ran a series of semi-structured interviews with editors of six Wikipedias to get an in-depth understanding of their experience with reading and using the automatically generated text. Each interview started with general questions about the experience of the participant with Wikipedia and Wikidata, and their understanding of different aspects of these projects. The participants were then asked to open and read an ArticlePlaceholder page including text generated through the NLG algorithm, as shown in Figure 6. Finally, participants were asked to edit the content of a page, which contained the same information as the one they had read earlier, but was displayed as plain text in the Wikipedia edit field. The editing field can be seen in Figure 7.

Fig. 6. Screenshot of the reading task. The page is stored as a subpage of the author's userpage on Wikidata, therefore the layout copies the original layout of any Wikipedia. The layout of the information displayed mirrors the ArticlePlaceholder setup. The participant sees the sentence to evaluate alongside information included from the Wikidata triples (such as the image and statements) in their native language (Arabic in this example).

4.3.1. Recruitment

The goal was to recruit a set of editors from different language backgrounds to have a comprehensive understanding of different language communities. We reached out to editors on different Wikipedia editor mailing lists8 and tweeted a call for contribution using the lead author's account.9

We were in contact with 18 editors from different Wikipedia communities. We allowed all editors to participate, but had to exclude editors who edit only the English Wikipedia (as it is outside our use case) and editors who did not speak a sufficient level of English, which made conducting the interview impossible.

4.3.2. Participants

Our sample consists of n = 10 Wikipedia editors of different under-resourced languages (measured in their number of articles). We originally conducted interviews with 11 editors from seven different language communities, but had to remove one interview with an editor of the Breton Wikipedia, as we were not able to generate the text for the reading and editing tasks because of a lack of training data.

Among the participants, 4 were from the Arabic Wikipedia and participated in the judgement-based evaluation from RQ1. The remaining 6 were from under-resourced language communities: Persian, Indonesian, Hebrew, Ukrainian (one per language) and Swedish (two participants).

8 Mailing lists contacted: [email protected] (Arabic), [email protected] (Esperanto), [email protected] (Persian), Wikidata mailing list, Wikimedia Research mailing list
9 https://twitter.com/frimelle/status/1031953683263750144

Fig. 7. Screenshot of the editing task. The page is stored on a subpage of the author's userpage on Wikidata, therefore the layout is equivalent to the current MediaWiki installations on Wikipedia. The participant sees the sentence that they saw before in the reading task and the triples from Wikidata in their native language (Arabic in this example).

While Swedish is officially the third largest Wikipedia in terms of number of articles,10 most of the articles are not manually edited. In 2013, the Swedish Wikipedia passed one million articles, thanks to a bot called lsjbot, which at that point in time had created almost half of the articles on the Swedish Wikipedia.11 Such bot-generated articles are commonly limited, both in terms of information content and length. The high activity of a single bot is also reflected in the small number of active editors compared to the large number of articles (see Table 5).

The participants were experienced Wikipedia editors, with average tenures of 9.3 years in Wikipedia (between 3 and 14 years, median 10). All of them have contributed to at least one language besides their main language, and to the English Wikipedia. Further, n = 4 editors worked in at least two other languages beside their main language, while n = 2 editors were active in as many as 4 other languages. All participants knew about Wikidata, but had varying levels of experience with the project. n = 4 participants have been active on Wikidata for over 3 years (with n = 2 editors being involved since the start of the project in 2013), n = 5 editors had some experience with editing Wikidata, and one editor had never edited Wikidata but knew the project.

Table 6 displays the sentence used for the interviews in different languages. The Arabic sentence is generated by the network based on Wikidata triples, while the other sentences are synthetically created as described below.

4.3.3. Ethics

The research was approved by the Ethics Committee of the University of Southampton under ERGO Number 44971, and written consent was obtained from each participant ahead of the interviews.

4.3.4. Data

For Arabic, we reused a text snippet from the RQ1 evaluation. For the other language communities, we emulated the sentences the network would produce. First, we picked an entity to test with the editors. Then, we looked at the output produced by the network for the same concept in Arabic and Esperanto to understand possible mistakes and tokens produced by the network.

10 https://meta.wikimedia.org/wiki/List_of_Wikipedias
11 https://blog.wikimedia.org/2013/06/17/swedish-wikipedia-1-million-articles/

Table 6
Overview of sentences and number of participants in each language.

We chose the city of Marrakesh as the concept editors would work on. Marrakesh is a good starting point, as it is a topic possibly highly relevant to readers and is widely covered, but falls into the category of topics that are potentially under-represented in Wikipedia due to its geographic location [18].

We collected the introductory sentences for Marrakesh in the editors' languages from Wikipedia. Those are the sentences the network would be trained on and tries to reproduce. We ran the keyword matcher that was used for the preparation of the dataset on the original Wikipedia sentences. It marked the words the network would pick up or that would be replaced by property placeholders. Therefore, these words could not be removed.

As we were particularly interested in how editors would interact with the missing-word tokens the network can produce, we removed up to two words in each sentence: the word for the concept in its native language (e.g. Morocco in Arabic for non-Arabic sentences), as we saw that the network does the same, and the word for a concept that is not connected to the main entity of the sentence on Wikidata (e.g. Atlas Mountains, which is not linked to Marrakesh). An overview of all sentences used in the interviews can be found in Table 6.

4.3.5. Task

The interview started with an opening, explaining that the researcher will observe the reading and editing of the participant in their language Wikipedia. Until both reading and editing were finished, the participant did not know about the provenance of the text. To start the interview, we asked demographic questions about the participants' contributions to Wikipedia and Wikidata, and to test their knowledge of the existing ArticlePlaceholder. Before reading, they were introduced to the idea of displaying content from Wikidata on Wikipedia. Then, they saw the mocked-up page of the ArticlePlaceholder, as can be seen in Figure 6 in Arabic. Each participant saw the page in their respective language. As the interviews were conducted remotely, the interviewer asked them to share the screen so they could point out details with the mouse cursor. Questions were asked to let them describe their impression of the page while they were looking at it. Then, they were asked to open a new page, which can be seen in Figure 7. Again, this page would contain information in their language. They were asked to edit the page and to describe freely what they were doing at the same time. We asked them not to edit a whole page but only to write two to three sentences as the introduction to a Wikipedia article on the topic, with as much of the information given as needed. After the editing was finished, they were asked questions about their experience. Only then was the provenance of the sentences revealed.

Given this new information, we asked them about the predicted impact on their editing experience. Finally, we left them time to discuss open questions of the participants. The interviews were scheduled to last between 20 minutes and one hour.

4.3.6. Analysing the interviews

We interviewed a total of 11 editors of seven different language Wikipedias. The interviews took place in September 2018. We used thematic analysis to evaluate the results of the interviews. The interviews were coded by two researchers independently, in the form of inductive coding based on the research questions. After comparing and merging all themes, both researchers independently applied these common themes on the text again.

4.3.7. Editors' reuse metric

Editors were asked to complete a writing task. We assessed how they used the automatically generated text snippets in their work by measuring the amount of text reuse. We based the assessment on the editors' resultant summaries after the interviews were finished.

To quantify the amount of reuse in text we use the Greedy String-Tiling (GST) algorithm [56]. GST is a substring matching algorithm that computes the degree of reuse or copy from a source text in a dependent one. GST is able to deal with cases when a whole block is transposed, unlike other algorithms such as the Levenshtein distance, which counts a transposition as a sequence of single insertions or deletions rather than a single block move. Adler and de Alfaro [57] introduce the concept of edit distance in the context of vandalism detection on Wikipedia. They measure the trustworthiness of a piece of text by measuring how much it has been changed over time. However, their algorithm penalises copying of the text, as they measure every edit to the original text. In comparison, we want to measure how much of the text is reused, therefore GST is appropriate for the task at hand.

Given a generated summary S = s_1, s_2, ... and an edited one D = d_1, d_2, ..., each consisting of a sequence of tokens, GST identifies a set of disjoint longest sequences of tokens in the edited text that exist in the source text (called tiles), T = {t_1, t_2, ...}. It is expected that there will be common stop words appearing in both the source and the edited text. However, we were rather interested in knowing how much of the real structure of the generated summary is being copied. Thus, we set a minimum match length factor mml = 3 when calculating the tiles, s.t. ∀ t_i ∈ T : t_i ⊆ S ∧ t_i ⊆ D ∧ |t_i| ≥ mml, and ∀ t_i, t_j ∈ T, i ≠ j : t_i ∩ t_j = ∅. This means that copied sequences of single or double words will not count in the calculation of reuse. We calculated a reuse score gstscore by counting the lengths of the detected tiles, normalized by the length of the generated summary:

gstscore(S, D) = ( Σ_{t_i ∈ T} |t_i| ) / |S|    (3)

We classified each of the edits into three groups according to the gstscore, as proposed by [47]: 1) Wholly Derived (WD): the text snippet has been fully reused in the composition of the editor's text (gstscore > 0.66); 2) Partially Derived (PD): the text snippet has been partially used (0.66 ≥ gstscore > 0.33); 3) Non Derived (ND): the text has been changed completely (0.33 ≥ gstscore).
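To make the reuse measurement concrete, the following is a simplified greedy string-tiling sketch together with the gstscore of Equation (3) and the three reuse classes. It is an illustrative approximation written for this explanation, not the exact algorithm of [56] or the scripts used in the study.

```python
def greedy_string_tiling(source, edited, mml=3):
    """Greedily collect disjoint tiles of length >= mml shared by both token lists."""
    tiles, used_src, used_edit = [], set(), set()
    while True:
        best = None  # (length, start in source, start in edited)
        for i in range(len(source)):
            for j in range(len(edited)):
                k = 0
                while (i + k < len(source) and j + k < len(edited)
                       and source[i + k] == edited[j + k]
                       and i + k not in used_src and j + k not in used_edit):
                    k += 1
                if k >= mml and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            break
        k, i, j = best
        tiles.append(source[i:i + k])
        used_src.update(range(i, i + k))
        used_edit.update(range(j, j + k))
    return tiles

def gst_score(source, edited, mml=3):
    """gstscore(S, D) = sum of tile lengths / |S|, cf. Equation (3)."""
    tiles = greedy_string_tiling(source, edited, mml)
    return sum(len(t) for t in tiles) / len(source)

def reuse_class(score):
    if score > 0.66:
        return "WD"   # wholly derived
    if score > 0.33:
        return "PD"   # partially derived
    return "ND"       # non derived
```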
5. Results

5.1. RQ1 - Metric-based corpus evaluation

As noted earlier, the evaluation used standard metrics in this space and data from the Arabic and Esperanto Wikipedias. We compared against five baselines: one in machine translation (MT), two based on templates (TP and TPext), the other main class of techniques in NLG, and two language models (KN and KNext). As displayed in Table 7, our model showed a significant enhancement compared to all baselines across the majority of the evaluation metrics in both languages. We achieved a 3.01 and 5.11 enhancement in BLEU 4 score in Arabic and Esperanto respectively over TPext, the strongest baseline.

The tests also hinted at the limitations of using machine translation for this task. We attributed this result to the different writing styles across language versions of Wikipedia. The data confirms that generating language from labels of conceptual structures such as Wikidata is a much more suitable approach.

Around 1% of the generated text snippets in Arabic and 2% of the Esperanto ones did not end in an end-of-summary token and were hence incomplete. We assumed the difference between the two was due to the size of the Esperanto dataset, which makes it harder for the trained models (i.e., with or without property placeholders) to generalise on unseen data (i.e. summaries for which the special end-of-summary token is generated). These snippets were excluded from the evaluation.

Table 7
Automatic evaluation of our model against all other baselines using BLEU 2-4, ROUGE-L and METEOR on the Arabic and Esperanto validation and test sets (Valid. / Test).

Arabic
  Model     BLEU 2          BLEU 3          BLEU 4          ROUGE-L         METEOR
  KN        2.28 / 2.40     0.95 / 1.04     0.54 / 0.61     17.08 / 17.09   29.04 / 29.02
  KNext     21.21 / 21.16   16.78 / 16.76   13.42 / 13.42   28.57 / 28.52   30.47 / 30.43
  MT        19.31 / 21.12   12.69 / 13.89   8.49 / 9.11     31.05 / 30.10   29.96 / 30.51
  TP        34.18 / 34.58   29.36 / 29.72   25.68 / 25.98   43.26 / 43.58   32.99 / 33.33
  TPext     42.44 / 41.50   37.29 / 36.41   33.27 / 32.51   51.66 / 50.57   34.39 / 34.25
  Ours      47.38 / 48.05   42.65 / 43.32   38.52 / 39.20   64.27 / 64.64   45.89 / 45.99
  w/ PrP    47.96 / 48.27   43.27 / 43.60   39.17 / 39.51   64.60 / 64.69   46.09 / 46.17

Esperanto
  Model     BLEU 2          BLEU 3          BLEU 4          ROUGE-L         METEOR
  KN        6.91 / 6.64     4.18 / 4.00     2.90 / 2.79     37.48 / 36.90   31.05 / 30.74
  KNext     16.44 / 16.30   11.99 / 11.92   8.77 / 8.79     44.93 / 44.77   33.77 / 33.71
  MT        1.62 / 1.62     0.59 / 0.56     0.26 / 0.23     0.66 / 0.68     4.67 / 4.79
  TP        33.67 / 33.46   28.16 / 28.07   24.35 / 24.30   46.75 / 45.92   20.71 / 20.46
  TPext     43.57 / 42.53   37.53 / 36.54   33.35 / 32.41   58.15 / 57.62   31.21 / 31.04
  Ours      42.83 / 42.95   38.28 / 38.45   34.66 / 34.85   66.43 / 67.02   40.62 / 41.13
  w/ PrP    43.57 / 43.19   38.93 / 38.62   35.27 / 34.95   66.73 / 66.61   40.80 / 40.74

The introduction of the property placeholders to our encoder-decoder architecture enhances our performance further, by 0.61–1.10 BLEU points (using BLEU 4). In general, our property placeholder mechanism benefits the performance of all the competitive systems.

Generalisation Across Domains. To investigate how well different models can generalise across multiple domains, we categorise each generated summary into one of 50 categories according to its main entity instance type (e.g. village, company, football player). We examine the distribution of BLEU-4 scores per category to measure how well the model generalises across domains (Figure 8).

Fig. 8. A box plot showing the distribution of BLEU 4 scores of all systems for each category of generated summaries. [Figure not reproduced; panels: Arabic and Esperanto; systems compared: Ours w/ PrP, TPext, KNext; y-axis: BLEU 4, 0-80.]

We show that i) the high performance of our system is not skewed towards some domains at the expense of others, and that ii) our model generalises well across domains, better than any other baseline. For instance, although TPext achieves the highest recorded performance in a small number of domains (plotted as the few TPext outliers in Figure 8), its performance is much lower on average across all domains. The strong generalisation of our model across domains is mainly due to the language model in the decoder layer of our model, which is more flexible than rigid templates and can adapt more easily to multiple domains. Although the Kneser-Ney template-based baseline (KNext) has exhibited competitive performance in a single-domain context [34], it fails to generalise in our multi-domain text generation scenario. Unlike our approach, KNext does not incorporate the input triples directly when generating the output summary, but only uses them to replace the special tokens after a summary has been generated. This might yield acceptable performance in a single domain, where most of the summaries share a very similar pattern, but it struggles to generate a different pattern for each input set of triples in multi-domain summary generation.
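The per-domain analysis behind Figure 8 reduces to grouping sentence-level BLEU-4 scores by the instance-type category of each summary's main entity and summarising each group. The sketch below illustrates only this grouping step; the record format and the sentence_bleu4 callable are hypothetical, not the actual analysis code.

```python
from collections import defaultdict
from statistics import quantiles  # Python 3.8+

def bleu4_by_category(records, sentence_bleu4):
    """Group sentence-level BLEU-4 scores by instance-type category.

    records: iterable of (category, reference_tokens, hypothesis_tokens) tuples (hypothetical format).
    sentence_bleu4: callable returning a sentence-level BLEU-4 score.
    """
    grouped = defaultdict(list)
    for category, reference, hypothesis in records:
        grouped[category].append(sentence_bleu4(reference, hypothesis))
    return grouped

def box_plot_stats(grouped):
    """Median and interquartile range per category, i.e. the statistics a box plot
    such as Figure 8 displays (assumes at least two scores per category)."""
    stats = {}
    for category, scores in grouped.items():
        q1, median, q3 = quantiles(scores, n=4)
        stats[category] = {"median": median, "q1": q1, "q3": q3, "n": len(scores)}
    return stats
```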

Table 8
Results for fluency (mean rating out of 6 and SD) and appropriateness (percentage of snippets judged to be part of Wikipedia).

                        Fluency           Appropriateness
                        Mean    SD        Part of Wikipedia
Arabic      Ours        4.7     1.2       77%
            Wikipedia   4.6     0.9       74%
            News        5.3     0.4       35%
Esperanto   Ours        4.5     1.5       69%
            Wikipedia   4.9     1.2       84%
            News        4.2     1.2       52%

5.2. RQ1 - Judgement-based evaluation

5.2.1. Fluency
As shown in Table 8, overall, the quality of the generated text is high (4.7 points out of 6 on average in Arabic and 4.5 in Esperanto). In Arabic, 63.3% of the summaries scored as much as 5 in fluency on average. In Esperanto, which had the smaller training corpus, the participants nevertheless gave as many as half of the snippets an average of 5, with 33% of them rated 6 by all participants. We concluded that in most cases the text we generated was very understandable and grammatically correct. In addition, the results were perceived to match the quality of writing in Wikipedia and news reporting.

5.2.2. Appropriateness
The results for appropriateness are summarised in Table 8. A majority of the snippets were considered to be part of Wikipedia (77% for Arabic and 69% for Esperanto). The participants thus seemed to identify a certain style and manner of writing with Wikipedia; by comparison, only 35% of the Arabic news snippets and 52% of the Esperanto ones could have passed as Wikipedia content in the study. Genuine Wikipedia text was recognised as such (in 77% and 84% of the cases, respectively).

Our model was able to generate text that is not only accurate from a writing point of view but that, in a high number of cases, felt like Wikipedia and could blend in with other Wikipedia content.

5.3. RQ2 - Task-based evaluation

As part of our interview study, we asked editors to read an ArticlePlaceholder page with the included NLG text and asked them to comment on a series of issues. We grouped their answers into several general themes around: their use of the snippets, their opinions on text provenance, the ideal length of the text, the importance of the text for the ArticlePlaceholder, and limitations of the algorithm.

Use. The participants appreciated the text snippets:

"I think it would be a great opportunity for general Wikipedia readers to help improve their experience, while reading Wikipedia" (P7).

Some of them noted that the text gave them a useful overview and quick introduction to the topic of the page, particularly for people trained in one language or non-English speakers:

"I think that if I saw such an article in Ukrainian, I would probably then go to English anyway, because I know English, but I think it would be a huge help for those who don't" (P10).

Provenance. As part of the reading task, we asked the editors what they believed was the provenance of the information displayed on the page. This gave us more context to the promising fluency and appropriateness scores achieved in the quantitative study. The editors made general comments about the page and tended to assume that the text was from other Wikipedia language versions:

[The text was] "taken from Wikipedia, from Wikipedia projects in different languages." (P1)

Editors more familiar with Wikidata suggested the information might be derived from Wikidata's descriptions:

"it should be possible to be imported from Wikidata" (P9).

Only one editor could spot a difference between the text and regular Wikidata triples:

"I think it's taken from the outside sources, the text, the first text here, anything else I don't think it has been taken from anywhere else, as far as I can tell" (P2).

Overall, the answers supported our assumption that NLG, trained on Wikidata labelled triples, could be naturally added to Wikipedia pages without changing the reading experience. At the same time, the task revealed questions around algorithmic complexity and capturing provenance. Both are relevant to ensure transparency and accountability and help flag quality issues.

Length. We were interested in understanding how we could iterate over the NLG capabilities of our system to produce text of appropriate length. While the model generated just one sentence, the editors thought it to be a helpful starting point:

"Actually I would feel pretty good learning the basics. What I saw is the basic information of the city so it will be fine, almost like a stub" (P4).

While generating larger pieces of text could arguably be more useful, reducing the need for manual editing even further, the fact that the placeholder page contained just one sentence made it clear to the editors that the page still requires work. In this context, another editor referred to a 'magic threshold' for an automatically generated text to be useful (see also Section 5.4). Their expectations for an NLG output were clear:

"So the definition has to be concise, a little bit not very long, very complex, to understand the topic, is it the right topic you're looking for or". (P1)

We noted that whatever the length of the snippet, it needs to match reading practice. Editors tend to skim articles rather than reading them in detail:

"[...] most of the time I don't read the whole article, it's just some specific, for instance a news piece or some detail about I don't know, a program in languages or something like that and after that, I just try to do something with the knowledge that I learned, in order for me to acquire it and remember it" (P6)

"I'm getting more and more convinced that I just skim" (P1)

"I should also mention that very often, I don't read the whole article and very often I just search for a particular fact" (P3)

"I can't say that I read a lot or reading articles from the beginning to the end, mostly it's getting, reading through the topic, 'Oh, what this weird word means,' or something" (P10)

When engaging with content, people commonly go straight to the part of the page that contains what they need. If they are after an introduction to a topic, having a summary at the top of the page, for example in the form of an automatically generated text snippet, could make a real difference in matching their information needs.

Importance. When reading the ArticlePlaceholder page, people looked first at our text:

"The introduction of this line, that's the first thing I see" (P4).

This is their way to confirm that they landed on the right page and that the topic matches what they were looking for:

"Yeah, it does help because that's how you know if you're on the right article and not a synonym or some other article" (P1).

This makes the text critical for the engagement with the ArticlePlaceholder page, where most of the information is expressed as Wikidata triples. Natural language can add context to structured data representations:

"Well that first line was, it's really important because I would say that it [the ArticlePlaceholder page] doesn't really make a lot of sense [without it] ... it's just like a soup of words, like there should be one string of words next to each other so this all ties in the thing together. This is the most important thing I would say." (P8)

Tags. To understand how people react to a fault in the NLG algorithm, we chose to leave the <rare> tags in the text snippets the participants saw during the reading task. As mentioned earlier, such tags refer to entities in the triples that the algorithm was unsure about and could not verbalise. We did not explain the meaning of the tokens to the participants beforehand, and none of them mentioned them during the interviews. We believe this was mainly because they were not familiar with what the tokens meant, and not because they were unable to spot errors overall. For example, in the case of Arabic, the participants pointed to a Wikidata property with an incorrect label, which our NLG algorithm reused. They also picked up on a missing label in the native language for a city. However, the <rare> tokens were not noticed in any of the 10 reading tasks until explicitly mentioned by the interviewer. The name of the city Marrakesh in one of the native languages (Berber) was realised using the <rare> token (see the Arabic sentence in Table 6). One editor explained that the fact that they are not familiar with this language (Berber), and can therefore not evaluate the correctness of the statement, is the main reason that they overlooked the token:

1 "the language is specific, it says that this is a lan- Editing experience While GST scores gave us an 1 2 guage that is spoken mainly in Morocco and Alge- idea about the extent to which the automatically gen- 2 3 ria, the language, I don’t even know the symbols erated text is reused by the participants, the interviews 3 4 for the alphabets [...] I don’t know if this is correct, helped us understand how they experienced the text. 4 5 I don’t know the language itself so for me, it will Overall, the text snippets were widely praised as help- 5 6 go unnoticed. But if somebody from that area who ful, especially for newcomers: 6 7 knows anything about this language, I think they 7 "Especially for new editors, they can be a good 8 might think twice" (P8). 8 starting help: "I think it would be good at least to 9 9 make it easier for me as a reader to build on the 10 10 5.4. RQ3 - Task-based evaluation page if I’m a new volunteer or a first time edit, 11 11 it would make adding to the entry more appealing 12 12 Our third research questions focused on how people (...) I think adding a new article is a barrier. For me 13 13 work with automatically generated text. The overall I started only very few articles, I always built on 14 14 aim of adding NLG to the ArticlePlaceholder is to help other contribution. So I think adding a new page 15 15 editors from under-served Wikipedia language com- is a barrier that Wikidata can remove. I think that 16 16 munities bootstrap missing articles without disrupting would be the main difference." (P4) 17 their editing practices. 17 18 As noted earlier, we carried out a task-based evalu- All participants were experienced editors. Just like 18 19 ation, in which the participants were presented with an in the reading task, they thought having a shorter text 19 20 ArticlePlaceholder page that included the text snippet to start editing had advantages: 20 21 21 and triples from Wikidata relevant to the topic. We car- "It wasn’t too distracting because this was so short. 22 22 ried out a quantitative analysis of the editing activities If it was longer, it would be (...) There is this magi- 23 23 via the GST score, as well as a qualitative, thematic cal threshold up to which you think that it would be 24 24 analysis of the interviews, in which the participants ex- easier to write from scratch, it wasn’t here there." 25 25 plained how and they changed the text. In the follow- (P10) 26 ing we will first present the GST scores, and then dis- 26 27 cuss the themes that emerged from the interviews. The length of the text is also related to the ability 27 28 of the editors to work around and fix errors, such as 28 29 Reusing the text As shown in Table 9, the snippets the tokens discussed earlier. When editing, 29 30 are heavily used and all participants reused them at participants were able to grasp what information was 30 31 least partially. 8 of them were wholly derived (WD) missing and revise the text accordingly: 31 32 and the other 2 were partially derived (PD) from the 32 "So I have this first sentence, which is a pretty good 33 text we provided, which means an average GST score 33 sentence and then this is missing in Hebrew and 34 of 0.77. These results are in line with a previous edit- 34 well, since it’s missing and I do have this name 35 ing study of ours, described in [16], with a sample of 35 here, I guess I could quickly copy it here so now 36 54 editors from two language communities, 79% and 36 it’s not missing any longer." 
(P3) 37 93% of the snippets were either wholly (WD) or par- 37 38 tially (PD) reused. The same participant checked the Wikidata triples 38 39 We manually inspected all edits and compared them listed on the same ArticlePlaceholder page to find the 39 40 to the ’originals’ - as explained in Section 4, we had 10 missing information, which was not available there ei- 40 41 participants from 6 language communities, who edited ther, and then looked it up on a different language ver- 41 42 6 language versions of the same article. In the 8 cases sion of Wikipedia. They commented: 42 43 where editors reused more of the text, they tended to 43 "This first sentence at the top, was it was written, 44 copy it with minimal modifications, as illustrated in 44 it was great except the pieces of information were 45 sequences A and B in Table 9. Three editors did not 45 missing, I could quite easily find them, I opened 46 change the text snippet at all, but only added to it based 46 the different Wikipedia article and I pasted them, 47 on the triples shown on the ArticlePlaceholder page. 47 that was really nice" (P3). 48 One of the common things that hampers the full 48 49 reusability are tokens. This can lead editors Other participants mentioned a similar approach, 49 50 to rewrite the sentence completely, as in the PD exam- though some decided to delete the entire snippet be- 50 51 ple in Table 9. cause of the token and start from scratch. 51 20 Kaffee et al. / Using Natural Language Generation to Bootstrap Empty Wikipedia Articles

Table 9
Number of snippets in each category of reuse. A generated snippet (top) and its edited version (bottom). Solid lines represent reused tiles, while dashed lines represent overlapping sub-sequences not contributing to the gstscore. The first two examples were created in the studies; the last one (ND) is from a previous experiment. [Example snippets not reproduced.]

Category   #
WD         8
PD         2
ND         0

However, the text they added turned out to be very close to what the algorithm generated.

"I know it, I have it here, I have the name in Arabic, so I can just copy and paste it here." (P10)

"[deleted the whole sentence] mainly because of the missing tokens, otherwise it would have been fine" (P5)

One participant commented at length on the presence of the <rare> tokens:

"I didn't know what rare [is], I thought it was some kind of tag used in machine learning because I've seen other tags before but it didn't make any sense because I didn't know what use I had, how can I use it, what's the use of it and I would say it would be distracting, if it's like in other parts of the page here. So that would require you to understand first what rare does there, what is it for and that would take away the interest I guess, or the attention span so it would be better just to, I don't know if it's for instance, if the word is not, it's rare, this part right here which is, it shouldn't be there, it should be more, it would be better if it's like the input box or something". (P1)

Overall, the editing task and the follow-up interviews showed that the text snippets were a useful starting point for editing the page. Missing information, presented in the form of <rare> markup, did not hinder participants from editing and did not make them consider the snippets less useful. While they were unsure about what the tokens meant, they intuitively replaced them with the information they felt was missing, either by consulting the Wikidata triples that had not been summarised in the text, or by trying to find that information elsewhere on Wikipedia.

6. Discussion

Our first research question focuses on how well an NLG algorithm can generate summaries from the Wikipedia reader's perspective. In most cases, the text is considered to be from the Wikimedia environment: readers do not clearly differentiate between the generated summary and an original Wikipedia sentence. While this indicates the high quality of the generated textual content, it is problematic with respect to trust in Wikipedia. Trust in Wikipedia, and how humans evaluate the trustworthiness of a certain article, has been investigated using both quantitative and qualitative methods. Adler et al. [58] and Adler and de Alfaro [57] develop a quantitative framework based on Wikipedia's history. Lucassen and Schraagen [59] use a qualitative methodology to code readers' opinions on the aspects that indicate the trustworthiness of an article. However, none of these approaches take the automatic creation of text by non-human algorithms into account. Therefore, a high-quality generated summary that is not distinguishable from a human-written Wikipedia summary can be a double-edged sword. While conducting the interviews, we realised that the generated Arabic summary contained a factual mistake. We could show in previous work [6] that such factual mistakes are relatively rare; however, they are a known drawback of neural text generation. In our case, the Arabic sentence stated that Marrakesh was located in the north, while it is actually in the center of the country. One of the participants lives in Marrakesh. Curiously, they did not notice this mistake, even while translating the sentence aloud during the interview:

"Yes, so I think, so we have here country, Moroccan city in the north, I would say established and in year (. . . )" (P1)

As we did not introduce the sentence as automatically generated, we assume this is due to the trust Wikipedians have in their platform and to the quality of the sentence:

"This sentence was so well written that I didn't even bother to verify if it's actually a desert city" (P3)

In order not to misinform readers and eventually disrupt this trust, future work will have to investigate ways of dealing with such wrongly generated statements. On a technical level, that will mean exploring ways to exclude sentences with factual mistakes. On an HCI level, ways of communicating the problems of neural text generation to the reader will need to be investigated. One possible solution is visualisations, as these can influence the trust given to a particular article [60], such as WikiDashboard [61]. A similar tool could be used for the provenance of text.

Supporting the previous results from the readers, editors also have a positive perception of the summaries. It is the first thing they read when they arrive at a page, and it helps them to quickly verify that the page is about the topic they are looking for.

When creating the summaries, we assumed their relatively short length might be a point for improvement from an editors' perspective, particularly as research suggests that the length of an article indicates its quality – basically, the longer, the better [62]. However, from the interviews with editors, we found that they mostly skim articles when reading them. This seems to be the more natural way of browsing the information in an article and is supported by the short summary, which gives an overview of the topic.

All editors we worked with are part of the multilingual Wikipedia community, editing in at least two Wikipedias. Hale [63, 64] highlights that users of this community are particularly active compared to their monolingual counterparts and confident in editing across different languages. However, taking potential newcomers into account, they suggest that the ArticlePlaceholder might be helpful to lower the barrier to starting to edit. Recruiting more editors has been a long-standing objective, with initiatives such as the Tea House [65] aiming at welcoming and comforting new editors; the Wikipedia Adventure employs a similar approach, using a tool with gamification features [66]. The ArticlePlaceholder, and in particular the provided summaries in natural language, can have an impact on how people start editing. In comparison to the Wikipedia Adventure, readers are exposed to the ArticlePlaceholder pages directly and, thus, it could lower their reservations about editing by offering a more natural start.

Lastly, we asked how editors use the textual summaries in their workflow. Generally, we can show that the text is highly reused. One of the editors mentioned a 'magic threshold' that makes the summary acceptable as a starting point for editing. This seems similar to post-editing in machine translation (or rather monolingual machine translation [67], where a user only speaks the target or source language). Translators have been found to oppose machine translation and post-editing, as they perceive it as more time consuming and as restricting their freedom in the translation (e.g. sentence structure) [68]. Nonetheless, it has been shown that a different interface can not only lead to reduced time and enhanced quality, but can also convince a user of the improved quality of the machine translation [69]. This underlines the importance of the right integration of machine-generated sentences, as we aim for in the ArticlePlaceholder.

The core of this work is to understand the perception of a community, such as the Wikipedians, of the integration of a state-of-the-art machine learning technique for NLG in their platform. We can show that such an integration can work and be supported. This finding aligns with other projects already deployed on Wikipedia. For instance, bots (short for robots) that monitor Wikipedia have become a trusted tool for fighting vandalism [70], so much so that they can even revert edits made by humans if they believe them to be malicious. The cooperative work between humans and machines on Wikipedia has also been theorised for machine translation. Alegria et al. [71] argue for the integration of machine translation in Wikipedia that learns from the post-editing of the editors. Such a human-in-the-loop approach is also applicable to our NLG work, where an algorithm could learn from the humans' contributions.

There is a need to investigate this direction further, as NLG algorithms will not achieve the same quality as humans. Especially in a low-resource setting, such as the one observed in this work, human support is needed. However, automated tools can be a great way of allocating the limited human resources to the tasks where they are most needed.

Post-editing the summaries can serve a purely data-driven approach such as ours with additional data that can be used to further improve the quality of the automatically generated content. To make such an approach feasible for the ArticlePlaceholder, we need an interface that encourages the editing of the summaries. The less effort this editing requires, the more we can ensure an easy collaboration of human and machine.

7. Limitations

We interviewed ten editors with different levels of Wikipedia experience. As all of them were already Wikipedia editors, the conclusions we can draw for new editors are limited. We focus on experienced editors, as we expect them to be the first to adopt the ArticlePlaceholder in their workflow. Typically, new editors will follow the existing guidelines and standards of the experienced editors; therefore, the focus on experienced editors gives us an understanding of how editors will accept and interact with the new tool. Further, it is difficult to sample from new editors, as there is a variety of factors that determine whether a contributor develops into a long-term editor or not [72].

The distribution of languages favours Arabic, as that community was the most responsive, which can be assumed to be due to previous collaborations. While we cover different languages, it is only a small part of the different language communities that Wikipedia covers in total. Most studies of Wikipedia editors currently focus on English Wikipedia [73]. Even the few studies that observe editors of multiple language Wikipedias do not include the span of insights from different languages that we provide in this study. In our study we treat the different editors as members of a unified community of under-served Wikipedia languages. This is supported by the fact that their answers and themes were consistent across different languages. Therefore, adding more editors of the same languages would not have brought additional benefit.

We aimed our evaluation at two populations: readers and editors. While the main focus was on the editors and their interaction with the new information, we wanted to include the readers' perspective. In the readers' evaluation we focus on the quality of the text (in terms of fluency and appropriateness), as this will be the most influential factor for their experience on Wikipedia. Readers, while usually overlooked, are an important part of the Wikipedia community [74]. Together, those two groups form the Wikipedia community, as new editors are recruited from the existing pool of readers and readers contribute in essential ways to Wikipedia, as shown by Antin and Cheshire [75].

Besides the Arabic sentence, the sentences the editors worked on were synthesised, i.e. not generated but created by the authors of this study. While those sentences were not created by natural language generation, they were created and discussed with other researchers in the field. We therefore focused on the most common problem in text-generation tasks similar to ours: the <rare> tokens. Other problems of neural language generation, such as factually wrong sentences or hallucinations [76], were excluded from the study, as they are not a common problem for short summaries such as ours. Further, our study setup focused on the integration in the ArticlePlaceholder, i.e. the quality of the sentences rather than their factual correctness.

8. Conclusion

We conducted a quantitative study with members of the Arabic and Esperanto Wikipedia communities and semi-structured interviews with members of six different Wikipedia communities to understand the communities' understanding and acceptance of generated text in Wikipedia. To understand the impact of automatically generated text for a community such as the Wikipedia editors of under-resourced languages, we surveyed their perception of generated summaries in terms of fluency and appropriateness. To deepen this understanding, we conducted 10 semi-structured interviews with experienced Wikipedia editors from 6 different language communities and measured their reuse of the original summaries.

The addition of the summaries seems to be natural for readers: we could show that Wikipedia editors rank our text close to the expected quality standards of Wikipedia, and are likely to consider the generated text as part of Wikipedia. The language the neural network produces integrates well with the existing content of Wikipedia, and readers appreciate the summary on the ArticlePlaceholder, as it is the most helpful element on the page to them.

We could show that editors would assume the text was part of Wikipedia and that the summary improves the ArticlePlaceholder page fundamentally. In particular, the summary being short supported the usual workflow of skimming the page for the information needed.

The missing-word (<rare>) token, which we included to gain an understanding of how users interact with faulty generated text, hindered neither the reading experience nor the editing experience. Editors are likely to reuse a large portion of the generated summaries. Additionally, participants mentioned that the summary can be a good starting point for new editors.

9. Acknowledgment

This research was supported by the EPSRC-funded project Data Stories under grant agreement number EP/P025676/1; by DSTL under grant number DSTLX-1000094186; and by the EU H2020 Programme under project No. 727658 (IASIS). We would like to express our gratitude to the Wikipedia communities involved in this study for embracing our research, and particularly to the participants of the interviews.

References

[1] B. Hecht and D. Gergle, The Tower of Babel Meets Web 2.0: User-generated Content and Its Applications in a Multilingual Context, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2010, pp. 291–300.
[2] D. Vrandečić and M. Krötzsch, Wikidata: A Free Collaborative Knowledgebase, Communications of the ACM 57(10) (2014), 78–85.
[3] L.-A. Kaffee and E. Simperl, Analysis of Editors' Languages in Wikidata, in: Proceedings of the 14th International Symposium on Open Collaboration (OpenSym 2018), 2018, pp. 21:1–21:5. doi:10.1145/3233391.3233965.
[4] L.-A. Kaffee, A. Piscopo, P. Vougiouklis, E. Simperl, L. Carr and L. Pintscher, A Glimpse into Babel: An Analysis of Multilinguality in Wikidata, in: Proceedings of the 13th International Symposium on Open Collaboration (OpenSym 2017), ACM, 2017, p. 14.
[5] L.-A. Kaffee, Generating ArticlePlaceholders from Wikidata for Wikipedia: Increasing Access to Free and Open Knowledge, 2016.
[6] P. Vougiouklis, H. Elsahar, L.-A. Kaffee, C. Gravier, F. Laforest, J. Hare and E. Simperl, Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples, Journal of Web Semantics (2018). doi:10.1016/j.websem.2018.07.002.
[7] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk and Y. Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, CoRR abs/1406.1078 (2014).
[8] I. Sutskever, O. Vinyals and Q.V. Le, Sequence to Sequence Learning with Neural Networks, in: Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 3104–3112.
[9] H. ElSahar, C. Gravier and F. Laforest, Zero-Shot Question Generation from Knowledge Graphs for Unseen Predicates and Entity Types, in: Proceedings of NAACL-HLT 2018 (Long Papers), 2018, pp. 218–228.
[10] I.V. Serban, A. García-Durán, Ç. Gülçehre, S. Ahn, S. Chandar, A.C. Courville and Y. Bengio, Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus, in: Proceedings of ACL 2016 (Long Papers), 2016.
[11] X. Du, J. Shao and C. Cardie, Learning to Ask: Neural Question Generation for Reading Comprehension, in: Proceedings of ACL 2017 (Long Papers), 2017, pp. 1342–1352. doi:10.18653/v1/P17-1123.
[12] T. Luong, I. Sutskever, Q.V. Le, O. Vinyals and W. Zaremba, Addressing the Rare Word Problem in Neural Machine Translation, in: Proceedings of ACL-IJCNLP 2015 (Long Papers), 2015, pp. 11–19.
[13] L. Dong and M. Lapata, Language to Logical Form with Neural Attention, in: Proceedings of ACL 2016 (Long Papers), 2016.
[14] L. Kaffee, H. ElSahar, P. Vougiouklis, C. Gravier, F. Laforest, J.S. Hare and E. Simperl, Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata, in: Proceedings of NAACL-HLT 2018 (Short Papers), 2018, pp. 640–645.
[15] P. Isola, J. Zhu, T. Zhou and A.A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, in: Proceedings of CVPR 2017, 2017, pp. 5967–5976. doi:10.1109/CVPR.2017.632.
[16] L. Kaffee, H. ElSahar, P. Vougiouklis, C. Gravier, F. Laforest, J.S. Hare and E. Simperl, Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders, 2018, pp. 319–334. doi:10.1007/978-3-319-93417-4_21.
[17] B. Collier and J. Bear, Conflict, criticism, or confidence: an empirical examination of the gender gap in Wikipedia contributions, in: Proceedings of CSCW 2012, 2012, pp. 383–392. doi:10.1145/2145204.2145265.
[18] M. Graham, B. Hogan, R.K. Straumann and A. Medhat, Uneven Geographies of User-generated Information: Patterns of Increasing Informational Poverty, Annals of the Association of American Geographers 104(4) (2014), 746–764.
[19] M.J. Pat Wu, State of Connectivity 2015: A Report on Global Internet Access, 2016. http://newsroom.fb.com/news/2016/02/state-of-connectivity-2015-a-report-on-global-internet-access/.
[20] J. Voss, Measuring Wikipedia, 2005.
[21] B.J. Hecht, The Mining and Application of Diverse Cultural Perspectives in User-generated Content, PhD thesis, Northwestern University, 2013.
[22] P. Bao, B. Hecht, S. Carton, M. Quaderi, M. Horn and D. Gergle, Omnipedia: Bridging the Wikipedia Language Gap, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2012, pp. 1075–1084.
[23] C. Kramsch and H. Widdowson, Language and Culture, Oxford University Press, 1998.
[24] T. Steiner, Bots vs. Wikipedians, Anons vs. Logged-Ins (Redux): A Global Study of Edit Activity on Wikipedia and Wikidata, in: Proceedings of the International Symposium on Open Collaboration (OpenSym '14), ACM, 2014, pp. 25:1–25:7. doi:10.1145/2641580.2641613.
[25] L. Wanner, B. Bohnet, N. Bouayad-Agha, F. Lareau and D. Nicklaß, Marquis: Generation of User-Tailored Multilingual Air Quality Bulletins, Applied Artificial Intelligence 24(10) (2010), 914–952. doi:10.1080/08839514.2010.529258.
[26] D. Galanis and I. Androutsopoulos, Generating Multilingual Descriptions from Linguistically Annotated OWL Ontologies: The NaturalOWL System, in: Proceedings of the Eleventh European Workshop on Natural Language Generation, ACL, 2007, pp. 143–146.
[27] D. Duma and E. Klein, Generating Natural Language from Linked Data: Unsupervised Template Extraction, in: Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, 2013, pp. 83–94.
[28] B. Ell and A. Harth, A Language-independent Method for the Extraction of RDF Verbalization Templates, in: Proceedings of the Eighth International Natural Language Generation Conference (INLG 2014), 2014, pp. 26–34.
[29] C. Sauper and R. Barzilay, Automatically Generating Wikipedia Articles: A Structure-aware Approach, in: Proceedings of ACL-IJCNLP 2009, 2009, pp. 208–216.
[30] Y. Pochampally, K. Karlapalem and N. Yarrabelly, Semi-Supervised Automatic Generation of Wikipedia Articles for Named Entities, in: Wiki@ICWSM, 2016.
[31] W.D. Lewis and P. Yang, Building MT for a Severely Under-Resourced Language: White Hmong, Association for Machine Translation in the Americas, 2012.
[32] A. Chisholm, W. Radford and B. Hachey, Learning to Generate One-sentence Biographies from Wikidata, in: Proceedings of EACL 2017 (Long Papers), 2017, pp. 633–642.
[33] Y. Mrabet, P. Vougiouklis, H. Kilicoglu, C. Gardent, D. Demner-Fushman, J. Hare and E. Simperl, Aligning Texts and Knowledge Bases with Semantic Sentence Simplification, Association for Computational Linguistics, 2016, pp. 29–36.
[34] R. Lebret, D. Grangier and M. Auli, Neural Text Generation from Structured Data with Application to the Biography Domain, in: Proceedings of EMNLP 2016, 2016, pp. 1203–1213.
[35] E. Reiter and A. Belz, An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems, Computational Linguistics 35(4) (2009), 529–558. doi:10.1162/coli.2009.35.4.35405.
[36] K. Papineni, S. Roukos, T. Ward and W. Zhu, BLEU: A Method for Automatic Evaluation of Machine Translation, in: Proceedings of ACL 2002, 2002, pp. 311–318.
[37] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of Summaries, in: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, 2004.
[38] A. Lavie and A. Agarwal, METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation (StatMT '07), 2007, pp. 228–231.
[39] G. Angeli, P. Liang and D. Klein, A Simple Domain-independent Probabilistic Approach to Generation, in: Proceedings of EMNLP 2010, 2010, pp. 502–512.
[40] I. Konstas and M. Lapata, A Global Model for Concept-to-text Generation, Journal of Artificial Intelligence Research 48(1) (2013), 305–346.
[41] C. Mellish and R. Dale, Evaluation in the Context of Natural Language Generation, Computer Speech & Language 12(4) (1998), 349–373. doi:10.1006/csla.1998.0106.
[42] E. Reiter, R. Robertson and S. Sripada, Acquiring Correct Knowledge for Natural Language Generation, Journal of Artificial Intelligence Research 18 (2003), 491–516. doi:10.1613/jair.1176.
[43] S. Williams and E. Reiter, Generating Basic Skills Reports for Low-skilled Readers, Natural Language Engineering 14(4) (2008), 495–525. doi:10.1017/S1351324908004725.
[44] E. Reiter, Natural Language Generation, in: The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010, pp. 574–598. doi:10.1002/9781444324044.ch20.
[45] X. Sun and C. Mellish, An Experiment on Free Generation from Single RDF Triples, in: Proceedings of the Eleventh European Workshop on Natural Language Generation, ACL, 2007, pp. 105–108.
[46] A.-C. Ngonga Ngomo, L. Bühmann, C. Unger, J. Lehmann and D. Gerber, Sorry, I Don't Speak SPARQL – Translating SPARQL Queries into Natural Language, in: Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 977–988.
[47] P.D. Clough, R.J. Gaizauskas, S.S.L. Piao and Y. Wilks, METER: MEasuring TExt Reuse, in: Proceedings of ACL 2002, 2002, pp. 152–159.
[48] M. Potthast, T. Gollub, M. Hagen, J. Kiesel, M. Michel, A. Oberländer, M. Tippmann, A. Barrón-Cedeño, P. Gupta, P. Rosso and B. Stein, Overview of the 4th International Competition on Plagiarism Detection, in: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes, 2012.
[49] M. Mintz, S. Bills, R. Snow and D. Jurafsky, Distant Supervision for Relation Extraction without Labeled Data, in: Proceedings of ACL-IJCNLP 2009, 2009, pp. 1003–1011.
[50] S. Bird, NLTK: The Natural Language Toolkit, in: Proceedings of COLING/ACL 2006, Sydney, Australia, 2006.
[51] T. Luong, H. Pham and C.D. Manning, Effective Approaches to Attention-based Neural Machine Translation, in: Proceedings of EMNLP 2015, 2015, pp. 1412–1421.
[52] A.M. Rush, S. Chopra and J. Weston, A Neural Attention Model for Abstractive Sentence Summarization, in: Proceedings of EMNLP 2015, 2015, pp. 379–389.
[53] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, in: Proceedings of ICML 1997, 1997, pp. 143–151.
[54] N. Halko, P. Martinsson and J.A. Tropp, Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, SIAM Review 53(2) (2011), 217–288. doi:10.1137/090771806.
[55] K. Heafield, I. Pouzyrevsky, J.H. Clark and P. Koehn, Scalable Modified Kneser-Ney Language Model Estimation, in: Proceedings of ACL 2013 (Short Papers), 2013, pp. 690–696.
[56] M.J. Wise, YAP3: Improved Detection of Similarities in Computer Program and Other Texts, ACM SIGCSE Bulletin 28(1) (1996), 130–134.
[57] B.T. Adler and L. de Alfaro, A Content-driven Reputation System for the Wikipedia, in: Proceedings of the 16th International Conference on World Wide Web (WWW 2007), 2007, pp. 261–270. doi:10.1145/1242572.1242608.
[58] B.T. Adler, K. Chatterjee, L. de Alfaro, M. Faella, I. Pye and V. Raman, Assigning Trust to Wikipedia Content, in: Proceedings of the 2008 International Symposium on Wikis, Porto, Portugal, 2008. doi:10.1145/1822258.1822293.
[59] T. Lucassen and J.M. Schraagen, Trust in Wikipedia: How Users Trust Information from an Unknown Source, in: Proceedings of the 4th ACM Workshop on Information Credibility on the Web (WICOW 2010), 2010, pp. 19–26. doi:10.1145/1772938.1772944.
[60] A. Kittur, B. Suh and E.H. Chi, Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia, in: Proceedings of CSCW 2008, 2008, pp. 477–480. doi:10.1145/1460563.1460639.
[61] P. Pirolli, E. Wollny and B. Suh, So You Know You're Getting the Best Possible Information: A Tool that Increases Wikipedia Credibility, in: Proceedings of CHI 2009, 2009, pp. 1505–1508. doi:10.1145/1518701.1518929.
[62] J.E. Blumenstock, Size Matters: Word Count as a Measure of Quality on Wikipedia, in: Proceedings of WWW 2008, 2008, pp. 1095–1096. doi:10.1145/1367497.1367673.
[63] S.A. Hale, Multilinguals and Wikipedia Editing, in: Proceedings of the ACM Web Science Conference (WebSci '14), 2014, pp. 99–108. doi:10.1145/2615569.2615684.
[64] S.A. Hale, Cross-language Wikipedia Editing of Okinawa, Japan, in: Proceedings of CHI 2015, 2015, pp. 183–192. doi:10.1145/2702123.2702346.
[65] J.T. Morgan, S. Bouterse, H. Walls and S. Stierch, Tea and Sympathy: Crafting Positive New User Experiences on Wikipedia, in: Proceedings of CSCW 2013, 2013, pp. 839–848. doi:10.1145/2441776.2441871.
[66] S. Narayan, J. Orlowitz, J.T. Morgan, B.M. Hill and A.D. Shaw, The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users, in: Proceedings of CSCW 2017, 2017, pp. 1785–1799.
[67] C. Hu, B.B. Bederson, P. Resnik and Y. Kronrod, MonoTrans2: A New Human Computation System to Support Monolingual Translation, in: Proceedings of CHI 2011, 2011, pp. 1133–1136. doi:10.1145/1978942.1979111.
[68] E. Lagoudaki, Translation Editing Environments, in: MT Summit XII: Workshop on Beyond Translation Memories, 2009.
[69] S. Green, J. Heer and C.D. Manning, The Efficacy of Human Post-editing for Language Translation, in: Proceedings of CHI '13, 2013, pp. 439–448. doi:10.1145/2470654.2470718.
[70] R.S. Geiger and D. Ribes, The Work of Sustaining Order in Wikipedia: The Banning of a Vandal, in: Proceedings of CSCW 2010, 2010, pp. 117–126. doi:10.1145/1718918.1718941.
[71] I. Alegria, U. Cabezón, U.F. de Betoño, G. Labaka, A. Mayor, K. Sarasola and A. Zubiaga, Reciprocal Enrichment Between Basque Wikipedia and Machine Translation, in: The People's Web Meets NLP: Collaboratively Constructed Language Resources, 2013, pp. 101–118. doi:10.1007/978-3-642-35085-6_4.
[72] S. Kuznetsov, Motivations of Contributors to Wikipedia, SIGCAS Computers and Society 36(2) (2006). doi:10.1145/1215942.1215943.
[73] K.A. Panciera, A. Halfaker and L.G. Terveen, Wikipedians Are Born, Not Made: A Study of Power Editors on Wikipedia, in: Proceedings of GROUP 2009, 2009, pp. 51–60. doi:10.1145/1531674.1531682.
[74] F. Lemmerich, D. Sáez-Trumper, R. West and L. Zia, Why the World Reads Wikipedia: Beyond English Speakers, in: Proceedings of WSDM 2019, 2019, pp. 618–626. doi:10.1145/3289600.3291021.
[75] J. Antin and C. Cheshire, Readers Are Not Free-riders: Reading as a Form of Participation on Wikipedia, in: Proceedings of CSCW 2010, 2010, pp. 127–130. doi:10.1145/1718918.1718942.
[76] A. Rohrbach, L.A. Hendricks, K. Burns, T. Darrell and K. Saenko, Object Hallucination in Image Captioning, in: Proceedings of EMNLP 2018, 2018, pp. 4035–4045.