Semantic Web 0 (0) 1
IOS Press
Using Natural Language Generation to Bootstrap Empty Wikipedia Articles: A Human-centric Perspective

Lucie-Aimée Kaffee*, Pavlos Vougiouklis and Elena Simperl
School of Electronics and Computer Science, University of Southampton, UK
E-mails: [email protected], [email protected], [email protected]

Abstract. Nowadays natural language generation (NLG) is used in everything from news reporting and chatbots to social media management. Recent advances in machine learning have made it possible to train NLG systems to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-served Wikipedias, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate text from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the text snippets score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles. It also hints at several potential implications of using NLG solutions in Wikipedia at large, including content quality, trust in technology, and algorithmic transparency.

Keywords: Wikipedia, Wikidata, ArticlePlaceholder, Under-resourced languages, Natural Language Generation, Neural networks

1. Introduction

Wikipedia is available in 301 languages, but its content is unevenly distributed [1]. Under-resourced language versions face multiple challenges: fewer editors means fewer articles and less quality control, making that particular Wikipedia less attractive for readers in that language, which in turn makes it more difficult to recruit new editors from among the readers.

Wikidata, the structured-data backbone of Wikipedia [2], offers some help. It contains information about more than 55 million entities, for example, people, places or events, edited by an active international community of volunteers [3]. More importantly, it is multilingual by design and each aspect of the data can be translated and rendered to the user in their preferred language [4]. This makes it the tool of choice for a variety of content integration affordances in Wikipedia, including links to articles in other languages and infoboxes. An example can be seen in Figure 1: in the French Wikipedia, the infobox shown in the article about cheese (right) automatically draws in data from Wikidata (left) and displays it in French.

In previous work of ours, we proposed the ArticlePlaceholder, a tool that takes advantage of Wikidata's multilingual capabilities to increase the coverage of under-resourced Wikipedias [5]. When someone looks for a topic that is not yet covered by Wikipedia in their language, the ArticlePlaceholder tries to match the topic with an entity in Wikidata. If successful, it then redirects the search to an automatically generated placeholder page that displays the relevant information, for example the name of the entity and its main properties, in their language. The ArticlePlaceholder is currently used in 14 Wikipedias (see Section 3.1).

*Corresponding author. E-mail: [email protected].
1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved.
Fig. 1. Representation of Wikidata triples and their inclusion in a Wikipedia infobox. Wikidata triples (left) are used to fill out the fields of the infobox in the article about fromage (cheese) from the French Wikipedia.

In this paper, we propose an iteration of the ArticlePlaceholder to improve the representation of the data on the placeholder page. The original version of the tool pulled the raw data from Wikidata (available as triples with labels in different languages) and displayed it in tabular form (see Figure 3 in Section 3). In the current version, we use Natural Language Generation (NLG) techniques to automatically produce a text snippet from the triples instead. Presenting structured data as text rather than tables helps people uninitiated with the involved technologies to make sense of it [6]. This is particularly useful in contexts where one cannot make any assumptions about the levels of data literacy of the audience, as is the case for a large share of the Wikipedia community.

Our NLG solution builds upon the general encoder-decoder framework for neural networks, which is credited with promising results in similar text-centric tasks, such as machine translation [7, 8] and question generation [9–11]. We extend this framework to meet the needs of different Wikipedia language communities in terms of text fluency, appropriateness to Wikipedia, and reuse during article editing. Given an entity that was matched by the ArticlePlaceholder, our system uses its triples to generate a short Wikipedia-style summary. Many existing NLG techniques produce sentences with limited usability in user-facing systems; one of the most common problems is their inability to handle rare words [12, 13], which are words that the model does not meet frequently enough during training, such as localisations of names in different languages. We introduce a mechanism called property placeholder [14] to tackle this problem, learning multiple verbalisations of the same entity in the text [6].

In building the system we aimed to pursue the following research questions:

RQ1 Can we train a neural network to generate text from triples in a low-resource setting? To answer this question we first evaluated the system using a series of predefined metrics and baselines. In addition, we undertook a quantitative study with participants from two under-served Wikipedia language communities (Arabic and Esperanto), who were asked to assess, from a reader's perspective, whether the text is fluent and appropriate for Wikipedia.

RQ2 How do editors perceive the generated text on the ArticlePlaceholder page? To add depth to the quantitative findings of the first study, we undertook a second, mixed-methods study within six Wikipedia language communities (Arabic, Swedish, Hebrew, Persian, Indonesian, and Ukrainian).
Language coverage tells a similar story. [19] noted that only 67% of the world's population has access to encyclopedic knowledge in their first or second language. Another significant problem is caused by the extreme differences in content coverage between language versions. As early as 2005, Voss [20] found huge gaps in the development and growth of Wikipedias, which make it more difficult for smaller communities to catch up with the larger ones. Alignment cannot be achieved simply through translation: putting aside the fact that each Wikipedia needs to reflect the interests and points of view of its local community rather than iterate over content transferred from elsewhere, studies have shown that the existing content does not overlap, discarding the so-called English-as-Superset conjecture for as many as 25 language versions [1, 21]. To help tackle these imbalances, Bao et al. [22] built a system to give users easier access to the language diversity of Wikipedia.

Our work is motivated by and complements previous studies and frameworks which argue that the language of global projects such as Wikipedia should express cultural reality (Hecht [21], Kramsch and Widdowson [23]). Instead of using content from one Wikipedia version to bootstrap another, we take structured data labelled in the relevant language and create a more accessible representation of it as text, keeping those cultural expressions unimpaired.

To do so, we leverage Wikidata [2]. Wikidata was originally created to support Wikipedia's language connections (for instance links to articles in other languages or infoboxes, see the example from Section 1); however, it soon evolved into becoming a critical source of data for many other applications.

Wikidata contains statements on general knowledge, e.g. about people, places, events and other entities of interest. The knowledge base is created and maintained collaboratively by a community of editors, assisted by automated tools called bots [3, 24]. Bots take on repetitive tasks such as ingesting data from a different source, or simple quality checks.

The basic building blocks of Wikidata are items and properties. Both have identifiers, which are language-independent, and labels in one or more languages. They form triples, which link items to their attributes, or to other items. Figure 2 shows a triple linking a place to its country, where both items and the property between them have labels in three languages.

Fig. 2. Example of labeling in Wikidata. Each entity can be labeled in multiple languages using the labeling property rdfs:label (not displayed in the figure).

We analysed the distribution of label languages in Wikidata in [4] and noted that while there is a bias to English, the language distribution is more balanced than on the web at large. This was the starting point for our work on the ArticlePlaceholder (see Section 3), which leverages the multilingual support of Wikidata to bootstrap empty articles in under-served Wikipedias.

2.2. Text generation

Our work is directly related to approaches that produce text from triples expressed as RDF (Resource Description Framework) or similar. Many of them rely on templates, which are either based on linguistic features, e.g. grammatical rules [25], or are hand-crafted [26]. An example is Reasonator, a tool for lexicalising Wikidata triples with templates translated by users.1 Such approaches face many challenges: they cannot be easily transferred to different languages or scale to broader domains, as templates need to be adjusted to any new language or domain they are ported to. This makes them unsuitable for under-resourced Wikipedias, which rely on small numbers of contributors.

To tackle this limitation, [27, 28] introduced a distant-supervised method to verbalise triples, which learns templates from existing Wikipedia articles. While this makes the approach more suitable for language-independent tasks, such templates assume that entities will always have the relevant triples to fill in the slots. This assumption is not always true. In our work, we propose a template-learning baseline and show that adapting to the varying triples available can achieve better performance.

Sauper and Barzilay [29] and Pochampally et al. [30] generate Wikipedia summaries by harvesting sentences from the Internet. Wikipedia articles are used to automatically derive templates for the topic structure of the summaries, and the templates are afterwards filled using web content. Both systems work best on

1 https://tools.wmflabs.org/reasonator/
specific domains and for languages like English, for which suitable web content is readily available [31].

A different line of research aims to explore knowledge bases as a resource for NLG [6, 27, 32, 33]. In all these examples, linguistic information from the knowledge base is used to build a parallel corpus containing triples and equivalent text sentences from Wikipedia, which is then used to train the NLG algorithm. Directly relevant to the model we propose are the proposals by Lebret et al. [34], Chisholm et al. [32], and Vougiouklis et al. [6], which extend the general encoder-decoder neural network framework from [7, 8]. They use structured data from Wikidata and DBpedia as input and generate one-sentence summaries that match the style of the English Wikipedia in a single domain. While this is a rather narrow task compared to other generative tasks such as translation, Chisholm et al. [32] discuss its challenges in detail and show that it is far from being solved. Compared to these previous works, the NLG algorithm presented in this paper targets open-domain tasks and multiple languages.

2.3. Evaluating text generation systems

Related literature suggests three ways of determining how well an NLG system achieves its goals. The first, commonly referred to as metric-based corpus evaluation [35], uses text-similarity metrics such as BLEU [36], ROUGE [37] and METEOR [38]. These metrics essentially compare how similar the generated texts are to texts from the corpus. The other two involve people and are either task-based or judgement/rating-based [35]. Task-based evaluations aim to assess how an NLG solution assists participants in undertaking a particular task, for instance learning about a topic or writing a text. Judgement-based evaluations rely on a set of criteria against which participants are asked to rate the quality of the automatically generated text [35].

Metric-based corpus evaluations are widely used, as they offer an affordable, reproducible way to automatically assess the linguistic quality of the generated texts [14, 32, 34, 35, 39, 40]. However, they do not always correlate with manually curated quality ratings [35].

Task-based studies are considered most useful, as they allow system designers to explore the impact of the NLG solution on end-users [35, 41]. However, they can be resource-intensive: previous studies [42, 43] cite five-figure sums, including data analysis and planning [44]. The system in [43] was evaluated for the accuracy of the generated literacy and numeracy assessments by a sample of 230 participants, which cost as much as 25 thousand UK pounds. Reiter et al. [42] described a clinical trial with over 2000 smokers costing 75 thousand pounds, assumed to be the most costly NLG evaluation at this point. All of the smokers completed a smoking questionnaire in the first stage of the experiment, in order to find what portion of those who received the automatically generated letters from STOP had managed to quit.

Given these challenges, most research systems tend to use judgement-based rather than task-based evaluations [27, 28, 32, 35, 39, 40, 45, 46]. Besides their limited scope, most studies in this category do not recruit from the relevant user population, relying on more accessible options such as online crowdsourcing. Sauper and Barzilay [29] is a rare exception. In their paper, the authors describe the generation of Wikipedia articles using a content-selection algorithm that extracts information from online sources. They test the results by publishing 15 articles about diseases on Wikipedia and measuring how the articles change (including links, formatting and grammar). Their evaluation approach is not easy to follow, as the Wikipedia community tends to disagree with conducting research on their platform.2

The methodology we follow draws from all three areas: we start with an automatic, metric-based corpus evaluation to get a basic understanding of the quality of the text the system produces and compare it with relevant baselines. We then use quantitative analysis based on human judgements to assess how useful the text snippets are for core tasks such as reading and editing. To add context to these figures, we run a mixed-methods study with an interview and a task-based component, where we learn more about the user experience with NLG and measure the extent to which NLG text is reused by editors, using a metric inspired by journalism [47] and plagiarism detection [48].

2 https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not#Wikipedia_is_not_a_laboratory

3. Bootstrapping empty Wikipedia articles

The overall aim of our system is to give editors access to information that is not yet covered in Wikipedia, but is available, in the relevant language, in Wikidata. The system is built on the ArticlePlaceholder, which displays Wikidata triples dynamically on
under-served Wikipedias. In this paper, we extend the ArticlePlaceholder with an NLG component that generates an introductory sentence on each ArticlePlaceholder page in the target language from Wikidata triples.

3.1. ArticlePlaceholder

Fig. 3. Example page of the ArticlePlaceholder as deployed now on 14 Wikipedias. This example contains information from Wikidata on Triceratops in Haitian-Creole.

Figure 4 shows how the model generates a summary from those triples, f_1, f_2 to f_n. A vector representation h_{f_1}, h_{f_2} to h_{f_n} for each of the input triples is computed by processing their subject, predicate and object. Such triple-level vector representations help compute a vector representation for the whole input set, h_{FE}. h_{FE}, along with a special start-of-summary token, is then used to initialise the decoder.

Fig. 4. Representation of the neural network architecture. The triple encoder computes a vector representation for each one of the three input triples from the ArticlePlaceholder, h_{f_1}, h_{f_2} and h_{f_3}. Subsequently, the decoder is initialised using the concatenation of the three vectors, [h_{f_1}; h_{f_2}; h_{f_3}]. The purple boxes represent the tokens of the generated text. Each snippet starts and ends with special tokens: start-of-summary and end-of-summary.
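The encoding step sketched in Figure 4 can be illustrated as follows. This is a toy stand-in, not the authors' implementation: the real model learns its embeddings and encoder end-to-end, whereas here the embeddings are deterministic hash-derived vectors, and the second triple (with the item Q123456) is hypothetical.

```python
# Toy sketch of the triple encoder: each triple (subject, predicate, object)
# is mapped to a vector h_f, and the decoder state is initialised from the
# concatenation [h_f1; ...; h_fn]. The pseudo-embeddings below are
# deterministic stand-ins for the learned parameters of the real model.
import hashlib

DIM = 8  # toy embedding size


def embed(token):
    # Deterministic pseudo-embedding derived from a hash
    # (stand-in for a learned embedding lookup).
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return [(b - 128) / 128.0 for b in digest[:DIM]]


def encode_triple(triple):
    # h_f: combine subject, predicate and object embeddings (here: average).
    vecs = [embed(part) for part in triple]
    return [sum(xs) / len(xs) for xs in zip(*vecs)]


def init_decoder_state(triples):
    # h_FE: concatenation of the per-triple vectors.
    state = []
    for t in triples:
        state.extend(encode_triple(t))
    return state


# The triple from the paper (Floridia, P17/country, Italio) plus a second,
# invented triple purely for illustration.
triples = [("Q490900", "P17", "Q38"), ("Q490900", "P31", "Q123456")]
h_FE = init_decoder_state(triples)
print(len(h_FE))  # 16: one DIM-sized vector per triple, concatenated
```

The averaging inside `encode_triple` is one simple choice of composition; the paper's encoder processes subject, predicate and object with learned weights instead.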
same property (i.e. Q490900 (Floridia) P17 (ŝtato) Q38 (Italio)).

3.2.2. Training dataset
In order to train and evaluate our system, we created a dataset for text generation from knowledge base triples in two languages. We used two language versions of Wikipedia (we provide further details about how we prepared the summaries for the rest of the languages in Section 4.3), which differ in terms of size (see Table 2) and language support in Wikidata [4]. The dataset aligns Wikidata triples about an item with the first sentence of the Wikipedia article about that entity.

For each Wikipedia article, we extracted and tokenized the first sentence using a multilingual Regex tokenizer from the NLTK toolkit [50]. Afterwards, we retrieved the Wikidata item corresponding to the article and queried all triples where the item appeared as a subject or an object in the Wikidata truthy dump.3 In doing so, we relied on keyword matching against labels from Wikidata in the corresponding language, due to the lack of reliable entity linking tools for under-served languages.

We use property placeholders (as described in the previous section) to counter the lack of vocabulary typical for under-resourced languages. An example of a summary which is used for the training of the neural architecture is: "Floridia estas komunumo de [[P17]].", where [[P17]] is the property placeholder for Italio (see Table 1). In case a rare entity in the text is not matched to any of the input triples, its

and quantitative components. We instructed editors to complete a reading and an editing task and to describe their experiences along a series of questions. We used thematic analysis to identify common themes in the answers. For the editing task, we also used a quantitative element in the form of a text reuse metric, which is described below.

4.1. RQ1 - Metric-based corpus evaluation

We evaluated the generated summaries against two baselines, using their original counterparts from Wikipedia as references. We used a set of evaluation metrics for text generation: BLEU 2, BLEU 3, BLEU 4, METEOR and ROUGE_L. BLEU calculates n-gram precision multiplied by a brevity penalty, which penalizes short sentences to account for word recall. METEOR is based on the combination of uni-gram precision and recall, with recall weighted over precision. It extends BLEU by including stemming, synonyms and paraphrasing. ROUGE_L is a recall-based metric which calculates the length of the longest common subsequence between the generated summary and the reference.

4.1.1. Data
Both the Arabic and Esperanto corpora are split into training, validation and test sets, with respective portions of 85%, 10%, and 5%. We used a fixed target vocabulary that consisted of the 15,000 and 25,000 most frequent tokens (i.e. either words or surface form tuples of entities) of the summaries in Arabic and Esperanto
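The property placeholder mechanism from the example above can be sketched in a few lines. This is a simplified illustration of the idea, not the authors' implementation: the training summary has its rare object-entity labels replaced by the property of the triple that links them to the item, and the tokens are filled back in from the triples at realisation time.

```python
# Minimal sketch of property placeholders: object labels in a summary are
# replaced with [[property]] tokens for training, and substituted back with
# the label from the matching triple when a summary is realised.

def to_template(summary, triples, labels):
    """Replace object labels occurring in the summary with [[property]] tokens."""
    for subj, prop, obj in triples:
        obj_label = labels[obj]
        if obj_label in summary:
            summary = summary.replace(obj_label, f"[[{prop}]]")
    return summary


def realise(template, triples, labels):
    """Fill [[property]] tokens back in from the triples of the current item."""
    for subj, prop, obj in triples:
        template = template.replace(f"[[{prop}]]", labels[obj])
    return template


labels = {"Q490900": "Floridia", "Q38": "Italio"}  # Esperanto labels from the paper
triples = [("Q490900", "P17", "Q38")]              # Floridia -- ŝtato (country) -- Italio

template = to_template("Floridia estas komunumo de Italio.", triples, labels)
print(template)                             # Floridia estas komunumo de [[P17]].
print(realise(template, triples, labels))   # Floridia estas komunumo de Italio.
```

A different item whose P17 triple points elsewhere would realise the same template with its own country label, which is what lets the model reuse one verbalisation across many entities.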
Table 2
Recent page statistics and number of unique words (vocabulary size) of the Esperanto, Arabic and English Wikipedias in comparison with Wikidata. Retrieved 27 September 2017. Active users are registered users that have performed an action in the last 30 days.

Page stats                     Esperanto   Arabic    English     Wikidata
Articles                       241,901     541,166   5,483,928   37,703,807
Average number of edits/page   11.48       8.94      21.11       14.66
Active users                   2,849       7,818     129,237     17,583
Vocabulary size                1.5M        2.2M      2M          –

Table 3
Evaluation methodology

      Data                            Method                                             Participants
RQ1   Metrics and survey answers     Metric-based evaluation and judgement-based        Readers of two Wikipedias
                                     evaluation, quantitative
RQ2   Interview answers              Task-based evaluation, qualitative                 Editors of six Wikipedias
                                     (thematic analysis)
RQ3   Interview answers and text     Task-based evaluation, qualitative                 Editors of six Wikipedias
      reuse metrics                  (thematic analysis) and quantitative

by comparing it against two baselines of different nature.

Machine Translation (MT) For the MT baseline, we used Google Translate on English Wikipedia summaries. Those translations are compared to the actual target language's Wikipedia entry. This limits us to articles that exist in both English and the target language. In our dataset, the concepts in Esperanto and Arabic that are not covered by English Wikipedia account for 4.3% and 30.5% respectively. This indicates the content coverage gap between different Wikipedia languages [1].

Template Retrieval (TP) Similar to template-based approaches for text generation [28, 52], we build a template-based baseline that retrieves an output summary from the training data based on the input triples. First, the baseline encodes the list of input triples that corresponds to each summary in the training/test sets into a sparse vector of TF-IDF weights [53]. Afterwards, it performs LSA [54] to reduce the dimensionality of that vector. Finally, for each item in the test set, we employ the K-nearest neighbours algorithm to retrieve the vector from the training set that is closest to this item. The summary that corresponds to the retrieved vector is used as the output summary for this item in the test set. We provide two versions of this baseline. The first one (TP) retrieves the raw summaries from the training dataset. The second one (TPext) retrieves summaries with the special tokens for vocabulary extension. A summary can act as a template after replacing its entities with their corresponding property placeholders (see Table 1).

KN The KN baseline is a 5-gram Kneser-Ney (KN) [55] language model. KN has been used before as a baseline for text generation from structured data [34] and provided competitive results on a single domain in English. We also introduce a second KN model (KNext), which is trained on summaries with the special tokens for property placeholders. During test time, we use beam search of size 10 to sample from the learned language model.

4.2. RQ1 - Judgement-based evaluation

We defined quality in terms of text fluency and appropriateness, where fluency refers to how understandable and grammatically correct a text is, and appropriateness captures how well the text 'feels like' Wikipedia content. We asked two sets of participants from two different language Wikipedias to assess the same text snippets on a scale according to these two metrics.

Participants were asked to fill out a survey combining fluency and appropriateness questions. An example of a question can be found in Figure 5.
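The retrieval step of the TP baseline described in Section 4.1 can be sketched as below. This is an illustrative reimplementation under simplifying assumptions, not the authors' code: the LSA dimensionality-reduction step is omitted, and the triple tokens and summaries are invented examples.

```python
# Sketch of the TP retrieval baseline: encode each item's triples as a bag of
# property:object tokens with TF-IDF weights, then return the training summary
# of the nearest neighbour by cosine similarity. LSA is omitted for brevity.
import math
from collections import Counter


def tfidf(tokens, corpus):
    """TF-IDF weights for a token bag against a corpus of token sets (smoothed IDF)."""
    n = len(corpus)
    tf = Counter(tokens)
    return {t: c * math.log((1 + n) / (1 + sum(1 for d in corpus if t in d)))
            for t, c in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


train = [  # (triple tokens, summary) pairs -- hypothetical training items
    (["P31:Q515", "P17:Q38"], "X estas urbo en Italio."),
    (["P31:Q747074", "P17:Q38"], "X estas komunumo de Italio."),
]
corpus = [set(toks) for toks, _ in train]


def retrieve(query_tokens):
    """Return the summary of the training item closest to the query triples."""
    qv = tfidf(query_tokens, corpus)
    best = max(train, key=lambda pair: cosine(qv, tfidf(pair[0], corpus)))
    return best[1]


print(retrieve(["P31:Q747074", "P17:Q38"]))  # X estas komunumo de Italio.
```

For the TPext variant, the stored summaries would carry property placeholder tokens instead of raw entity labels, so the retrieved template can be filled from the query item's own triples.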
Fig. 5. Example of a question to the editors about quality and appropriateness for Wikipedia of the generated summaries in Arabic. They see this page after the instructions are displayed. First, the user is asked to evaluate the quality from 0 to 6 (Question 69); then they are asked whether the sentence could be part of Wikipedia (Question 70).

4.2.1. Recruitment
Our study targets any speaker of Arabic and Esperanto who reads that particular Wikipedia, independent of their contributions to Wikipedia. We wanted to reach fluent speakers of each language who use Wikipedia and are familiar with it even if they do not edit it frequently. For Arabic, we reached out to Arabic-speaking researchers from research groups working on Wikipedia-related topics. For Esperanto, as there are fewer speakers and they are harder to reach, we promoted the study on social media such as Twitter and Reddit using the researchers' accounts.4 The survey instructions and announcements were translated to Arabic and Esperanto.5 The survey was open for 15 days.

4.2.2. Participants
We recruited a total of 54 participants (see Table 4). Coincidentally, 27 of them were from each language community.

4.2.3. Ethics
The research was approved by the Ethics Committee of the University of Southampton under ERGO Number 30452.

4.2.4. Data
For both languages, we created a corpus consisting of 60 summaries, of which 30 are generated through our approach, 15 are from news, and 15 from Wikipedia sentences used to train the neural network model. For news in Esperanto, we chose introductory sentences of articles in the Esperanto version of Le

4 https://www.reddit.com/r/Esperanto/comments/75rytb/help_in_a_study_using_ai_to_create_esperanto/
5 https://github.com/luciekaffee/Announcements
Table 4
Judgement-based evaluation: total number of participants (P), total number of sentences (S), number of participants who evaluated at least 50% of the sentences, average number of sentences evaluated per participant, and number of total annotations

                          #P   #S   #P: S>50%   Avg #S/P   All Ann.
Arab.   Fluency           27   60   5           15.03      406
        Appropriateness   27   60   5           14.78      399
Esper.  Fluency           27   60   3           8.7        235
        Appropriateness   27   60   3           8.63       233
31 (0) Incomprehensible summary 32 Given the same sets of sentences, S , and total num- 32 33 ber of annotations that each sentence in S has re- 33 For each sentence, we calculated the mean flu- s s s s 34 ceived, we defined Asi = {a i , a i ,..., a i } s.t. a i ∈ 34 ency given by all participants and then averaging over 1 2 Ni j 35 {0, 1}∀ j ∈ [1, N ] as the set of all the N appropriate- 35 all summaries of each category. We defined F si = i i 36 si si si si ness ratings that the sentence si has received. The av- 36 { f , f ,..., f } s.t. f ∈ [0, 6]∀ j ∈ [1, Ni] as the set 37 1 2 Ni j erage appropriateness score for the summaries of each 37 of all the Ni fluency ratings that the sentence si has 38 38 received. We computed the overall fluency as follows: method was calculated as follows: 39 39 40 40 41 n Ni 41 n Ni 1 1 1 1 X X si 42 X X si a j . (2) 42 f j . (1) n Ni 43 n Ni i=1 j=1 43 i=1 j=1 44 44 45 To assess the appropriateness, participants were 45 46 asked to assess whether the displayed sentence could 4.3. RQ2 and RQ3 - Task-based evaluation 46 47 47 48 We ran a series of semi-structured interviews with 48 6http://eo.mondediplo.com/, accessed on the 28th of September, 49 49 2017 editors of six Wikipedias to get an in-depth under- 50 7http://feeds.bbci.co.uk/arabic/middleeast/rss.xml, accessed on standing of their experience with reading and using the 50 51 the 28th of September, 2017 automatically generated text. Each interview started 51 12 Kaffee et al. / Using Natural Language Generation to Bootstrap Empty Wikipedia Articles
Fig. 6. Screenshot of the reading task. The page is stored as a subpage of the author's userpage on Wikidata, therefore the layout copies the original layout of any Wikipedia. The layout of the information displayed mirrors the ArticlePlaceholder setup. The participant sees the sentence to evaluate alongside information included from the Wikidata triples (such as the image and statements) in their native language (Arabic in this example).

with general questions about the experience of the participant with Wikipedia and Wikidata, and their understanding of different aspects of these projects. The participants were then asked to open and read an ArticlePlaceholder page including text generated through the NLG algorithm, as shown in Figure 6. Finally, participants were asked to edit the content of a page, which contained the same information as the one they had to read earlier, but was displayed as plain text in the Wikipedia edit field. The editing field can be seen in Figure 7.

4.3.1. Recruitment
The goal was to recruit a set of editors from different language backgrounds to have a comprehensive understanding of different language communities. We reached out to editors on different Wikipedia editor mailing lists8 and tweeted a call for contribution using the lead author's account.9

We were in contact with 18 editors from different Wikipedia communities. We allowed all editors to participate but had to exclude editors who edit only the English Wikipedia (as it is outside our use case) and editors who did not speak a sufficient level of English, which made conducting the interview impossible.

4.3.2. Participants
Our sample consists of n = 10 Wikipedia editors of different under-resourced languages (measured in their number of articles). We originally conducted interviews with 11 editors from seven different language communities, but had to remove one interview with an editor of the Breton Wikipedia, as we were not able to generate the text for the reading and editing tasks because of a lack of training data.

Among the participants, 4 were from the Arabic Wikipedia and participated in the judgement-based evaluation from RQ1. The remaining 6 were from under-resourced language communities: Persian, Indonesian, Hebrew, Ukrainian (one per language) and Swedish (two participants).

8 Mailing lists contacted: [email protected] (Arabic), [email protected] (Esperanto), [email protected] (Persian), Wikidata mailing list, Wikimedia Research mailing list
9 https://twitter.com/frimelle/status/1031953683263750144
Fig. 7. Screenshot of the editing task. The page is stored on a subpage of the author's userpage on Wikidata; therefore the layout is equivalent to the current MediaWiki installations on Wikipedia. The participant sees the sentence that they saw before in the reading task, and the triples from Wikidata in their native language (Arabic in this example).

While Swedish is officially the third largest Wikipedia in terms of number of articles[10], most of the articles are not manually edited. In 2013, the Swedish Wikipedia passed one million articles thanks to a bot called lsjbot, which at that point in time had created almost half of the articles on the Swedish Wikipedia[11]. Such bot-generated articles are commonly limited, both in terms of information content and length. The high activity of a single bot is also reflected in the small number of active editors compared to the large number of articles (see Table 5).

[10] https://meta.wikimedia.org/wiki/List_of_Wikipedias
[11] https://blog.wikimedia.org/2013/06/17/swedish-wikipedia-1-million-articles/

The participants were experienced Wikipedia editors, with average tenures of 9.3 years in Wikipedia (between 3 and 14 years, median 10). All of them have contributed to at least one language besides their main language, and to the English Wikipedia. Further, n = 4 editors worked in at least two other languages beside their main language, while n = 2 editors were active in as many as 4 other languages. All participants knew about Wikidata, but had varying levels of experience with the project. n = 4 participants have been active on Wikidata for over 3 years (with n = 2 editors being involved since the start of the project in 2013), n = 5 editors had some experience with editing Wikidata, and one editor had never edited Wikidata but knew the project.

Table 6 displays the sentence used for the interviews in different languages. The Arabic sentence is generated by the network based on Wikidata triples, while the other sentences are synthetically created as described below.

4.3.3. Ethics

The research was approved by the Ethics Committee of the University of Southampton under ERGO Number 44971, and written consent was obtained from each participant ahead of the interviews.

4.3.4. Data

For Arabic, we reused a text snippet from the RQ1 evaluation. For the other language communities, we emulated the sentences the network would produce. First, we picked an entity to test with the editors. Then, we looked at the output produced by the network for the same concept in Arabic and Esperanto to under-
stand possible mistakes and tokens produced by the network. We chose the city of Marrakesh as the concept editors would work on. Marrakesh is a good starting point, as it is a topic possibly highly relevant to readers and is widely covered, but falls into the category of topics that are potentially under-represented in Wikipedia due to its geographic location [18].

Table 6. Overview of sentences and number of participants in each language.

We collected the introductory sentences for Marrakesh in the editors' languages from Wikipedia. Those are the sentences the network would be trained on and tries to reproduce. We ran the keyword matcher that was used for the preparation of the dataset on the original Wikipedia sentences. It marked the words that the network would pick up or that would be replaced by property placeholders. Therefore, these words could not be removed.

As we were particularly interested in how editors would interact with the missing-word tokens the network can produce, we removed up to two words in each sentence: the word for the concept in its native language (e.g. Morocco in Arabic for non-Arabic sentences), as we saw that the network does the same, and the word for a concept that is not connected to the main entity of the sentence on Wikidata (e.g. Atlas Mountains, which is not linked to Marrakesh). An overview of all sentences used in the interviews can be found in Table 6.

4.3.5. Task

The interview started with an opening, explaining that the researcher would observe the reading and editing of the participant in their language Wikipedia. Until both reading and editing were finished, the participant did not know about the provenance of the text. To start the interview, we asked demographic questions about the participants' contributions to Wikipedia and Wikidata, and tested their knowledge of the existing ArticlePlaceholder. Before reading, they were introduced to the idea of displaying content from Wikidata on Wikipedia. Then, they saw the mocked-up page of the ArticlePlaceholder (shown in Arabic in Figure 6). Each participant saw the page in their respective language. As the interviews were conducted remotely, the interviewer asked participants to share their screen so they could point out details with the mouse cursor.
Questions were asked to let participants describe their impression of the page while they were looking at it. Then, they were asked to open a new page, which can be seen in Figure 7. Again, this page would contain information in their language. They were asked to edit the page and to freely describe what they were doing at the same time. We asked them not to edit a whole page, but only to write two to three sentences as the introduction to a Wikipedia article on the topic, with as much of the information given as needed. After the editing was finished, they were asked questions about their experience. Only then, the provenance of the sentences was
revealed. Given this new information, we asked them about the predicted impact on their editing experience. Finally, we left time to discuss the participants' open questions. The interviews were scheduled to last between 20 minutes and one hour.

4.3.6. Analysing the interviews

We interviewed a total of 11 editors of seven different language Wikipedias. The interviews took place in September 2018. We used thematic analysis to evaluate the results of the interviews. The interviews were coded by two researchers independently, in the form of inductive coding based on the research questions. After comparing and merging all themes, both researchers independently applied these common themes to the text again.

4.3.7. Editors' reuse metric

Editors were asked to complete a writing task. We assessed how they used the automatically generated text snippets in their work by measuring the amount of text reuse. We based the assessment on the editors' resultant summaries after the interviews were finished. To quantify the amount of reuse in text, we use the Greedy String-Tiling (GST) algorithm [56]. GST is a substring matching algorithm that computes the degree of reuse or copy between a source text and a dependent one. GST is able to deal with cases where a whole block is transposed, unlike other algorithms such as the Levenshtein distance, which treats a transposition as a sequence of single insertions or deletions rather than a single block move. Adler and de Alfaro [57] introduce the concept of edit distance in the context of vandalism detection on Wikipedia. They measure the trustworthiness of a piece of text by measuring how much it has been changed over time. However, their algorithm punishes the copying of text, as they measure every edit to the original text. In comparison, we want to measure how much of the text is reused; therefore GST is appropriate for the task at hand.

Given a generated summary $S = s_1, s_2, \ldots$ and an edited one $D = d_1, d_2, \ldots$, each consisting of a sequence of tokens, GST identifies a set of disjoint longest sequences of tokens in the edited text that exist in the source text (called tiles), $T = \{t_1, t_2, \ldots\}$. Tiles are expected to have a minimum length, which means that copied sequences of single or double words will not count in the calculation of reuse. We calculated a reuse score $gstscore$ by counting the lengths of the detected tiles, normalised by the length of the generated summary:

$$gstscore(S, D) = \frac{\sum_{t_i \in T} |t_i|}{|S|} \tag{3}$$

We classified each of the edits into three groups according to the $gstscore$, as proposed by [47]: 1) Wholly Derived (WD): the text snippet has been fully reused in the composition of the editor's text ($gstscore > 0.66$); 2) Partially Derived (PD): the text snippet has been partially used ($0.66 \geq gstscore > 0.33$); 3) Non Derived (ND): the text has been changed completely ($0.33 \geq gstscore$).

5. Results

5.1. RQ1 - Metric-based corpus evaluation

As noted earlier, the evaluation used standard metrics in this space and data from the Arabic and Esperanto Wikipedias. We compared against five baselines: one based on machine translation (MT), two based on templates (TP and TPext), the other main class of techniques in NLG, and two language models (KN and KNext). As displayed in Table 7, our model showed a significant enhancement compared to all baselines across the majority of the evaluation metrics in both languages. We achieved a 3.01 and 5.11 enhancement in BLEU 4 score in Arabic and Esperanto respectively over TPext, the strongest baseline.

The tests also hinted at the limitations of using machine translation for this task. We attributed this result to the different writing styles across language versions of Wikipedia. The data confirms that generating language from labels of conceptual structures such as Wikidata is a much more suitable approach. Around 61% of the generated text snippets in Arabic and 2% of the Esperanto ones did not end in
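As an illustration of the reuse metric in Section 4.3.7 (Eq. 3), here is a minimal sketch, not the authors' implementation. It assumes a minimum tile length of three tokens, so that copied one- or two-word sequences do not count, and the example sentences are invented.

```python
# Simplified Greedy String-Tiling: repeatedly take the longest unmarked
# common token run between the generated summary S and the edit D,
# then gstscore(S, D) = sum(|tile|) / |S|  (Eq. 3).

def gst_tiles(source, derived, min_len=3):
    """Greedily mark the longest common non-overlapping token runs."""
    s_used = [False] * len(source)
    d_used = [False] * len(derived)
    tiles = []
    while True:
        best = None  # (length, source index, derived index)
        for i in range(len(source)):
            for j in range(len(derived)):
                k = 0
                while (i + k < len(source) and j + k < len(derived)
                       and source[i + k] == derived[j + k]
                       and not s_used[i + k] and not d_used[j + k]):
                    k += 1
                if k >= min_len and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            break
        k, i, j = best
        for off in range(k):  # mark the tile so it cannot be reused
            s_used[i + off] = True
            d_used[j + off] = True
        tiles.append(source[i:i + k])
    return tiles

def gst_score(source, derived):
    return sum(len(t) for t in gst_tiles(source, derived)) / len(source)

def reuse_category(score):
    # Thresholds as in the paper: WD > 0.66, PD in (0.33, 0.66], else ND
    if score > 0.66:
        return "WD"
    if score > 0.33:
        return "PD"
    return "ND"

generated = "marrakesh is a city in morocco".split()
edited = "marrakesh is a city in the south of morocco".split()
score = gst_score(generated, edited)
print(round(score, 2), reuse_category(score))  # prints: 0.83 WD
```

In the example, the five-token tile "marrakesh is a city in" is counted, while the lone reused token "morocco" falls below the minimum tile length and is ignored, giving 5/6 of the generated summary as reused.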
Table 7. Automatic evaluation of our model against all other baselines, using BLEU 2-4, ROUGE_L and METEOR, on both the Arabic and Esperanto validation and test sets.

            BLEU 2          BLEU 3          BLEU 4          ROUGE_L         METEOR
  Model     Valid.  Test    Valid.  Test    Valid.  Test    Valid.  Test    Valid.  Test
Arabic
  KN         2.28    2.4     0.95    1.04    0.54    0.61   17.08   17.09   29.04   29.02
  KNext     21.21   21.16   16.78   16.76   13.42   13.42   28.57   28.52   30.47   30.43
  MT        19.31   21.12   12.69   13.89    8.49    9.11   31.05   30.1    29.96   30.51
  TP        34.18   34.58   29.36   29.72   25.68   25.98   43.26   43.58   32.99   33.33
  TPext     42.44   41.5    37.29   36.41   33.27   32.51   51.66   50.57   34.39   34.25
  Ours      47.38   48.05   42.65   43.32   38.52   39.20   64.27   64.64   45.89   45.99
  w/ PrP    47.96   48.27   43.27   43.60   39.17   39.51   64.60   64.69   46.09   46.17
Esperanto
  KN         6.91    6.64    4.18    4.0     2.9     2.79   37.48   36.9    31.05   30.74
  KNext     16.44   16.3    11.99   11.92    8.77    8.79   44.93   44.77   33.77   33.71
  MT         1.62    1.62    0.59    0.56    0.26    0.23    0.66    0.68    4.67    4.79
  TP        33.67   33.46   28.16   28.07   24.35   24.3    46.75   45.92   20.71   20.46
  TPext     43.57   42.53   37.53   36.54   33.35   32.41   58.15   57.62   31.21   31.04
  Ours      42.83   42.95   38.28   38.45   34.66   34.85   66.43   67.02   40.62   41.13
  w/ PrP    43.57   43.19   38.93   38.62   35.27   34.95   66.73   66.61   40.80   40.74

Fig. 8. A box plot showing the distribution of BLEU 4 scores of all systems (Ours w/ PrP, TPext and KNext) for each category of generated summaries, for Arabic and Esperanto.

The introduction of the property placeholders to our encoder-decoder architecture enhances our performance further by 0.61-1.10 BLEU (using BLEU 4). In general, our property placeholder mechanism benefits the performance of all the competitive systems.

Generalisation Across Domains  To investigate how well different models can generalise across multiple domains, we categorise each generated summary into one of 50 categories according to its main entity instance type (e.g. village, company, football player). We examine the distribution of BLEU-4 scores per category to measure how well the model generalises across domains (Figure 8). We show that i) the high performance of our system is not skewed towards some domains at the expense of others, and that ii) our model has a good generalisation across domains, better than any other baseline.

Despite the fact that TPext achieves the highest recorded performance in a few domains (i.e. the TPext outliers in Figure 8), its performance is much lower on average across all the domains. The valuable generalisation of our model across domains is mainly due to the language model in the decoder layer of our model, which is more flexible than rigid templates and can adapt more easily to multiple domains. Despite the fact that the Kneser-Ney template-based baseline (KNext) has exhibited competitive performance in a single-domain context [34], it fails to generalise in our multi-domain text generation scenario. Unlike our approach, KNext does not incorporate the input triples directly for generating the output summary, but rather only uses them to replace the special tokens after a summary has been generated. This might yield acceptable performance in a single domain, where most of the summaries share a very similar pattern. However, it strug-
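The per-category analysis behind Figure 8 can be sketched as follows; the scores and category names below are invented for illustration (in the paper, they come from the automatic evaluation), and the quantities computed are the ones a box plot summarises.

```python
# Sketch of the cross-domain generalisation analysis: group per-summary
# BLEU-4 scores by the main entity's instance type, then summarise each
# category's distribution (median and interquartile range).
from statistics import median, quantiles

scores_by_category = {
    "village": [38.1, 41.0, 39.4, 40.2],
    "company": [35.7, 33.9, 36.8, 34.4],
    "football player": [42.3, 40.8, 44.0, 41.5],
}

for category, scores in scores_by_category.items():
    q1, q2, q3 = quantiles(scores, n=4)  # quartiles of the BLEU-4 scores
    print(f"{category}: median={median(scores):.1f}, IQR=[{q1:.1f}, {q3:.1f}]")
```

A model that generalises well shows consistently high medians and narrow interquartile ranges across categories, rather than a few outlier categories carrying the average.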
gles to generate a different pattern for each input set of triples in multi-domain summary generation.

Table 8. Results for fluency and appropriateness.

                         Fluency          Appropriateness
                         Mean     SD      Part of Wikipedia
Arabic     Ours          4.7      1.2     77%
           Wikipedia     4.6      0.9     74%
           News          5.3      0.4     35%
Esperanto  Ours          4.5      1.5     69%
           Wikipedia     4.9      1.2     84%
           News          4.2      1.2     52%

5.2. RQ1 - Judgement-based evaluation

5.2.1. Fluency

As shown in Table 8, overall, the quality of the generated text is high (4.7 points out of 6 on average in Arabic and 4.5 in Esperanto). In Arabic, 63.3% of the summaries scored as much as 5 in fluency on average. In Esperanto, which had the smaller training corpus, the participants nevertheless gave as many as half of the snippets an average of 5, with 33% of them rated a 6 by all participants. We concluded that in most cases, the text we generated was very understandable and grammatically correct. In addition, the results were perceived to match the quality of writing in Wikipedia and news reporting.

5.2.2. Appropriateness

The results for appropriateness are summarised in Table 8. A majority of the snippets were considered to be part of Wikipedia (77% for Arabic and 69% for Esperanto). The participants seemed to identify a certain style and manner of writing with Wikipedia; by comparison, only 35% of the Arabic news snippets and 52% of the Esperanto ones could have passed as Wikipedia content in the study. Genuine Wikipedia text was recognised as such (in 74% and 84% of the cases, respectively). Our model was able to generate text that is not only accurate from a writing point of view, but in a high number of cases felt like Wikipedia and could blend in with other Wikipedia content.

5.3. RQ2 - Task-based evaluation

As part of our interview study, we asked editors to read an ArticlePlaceholder page with included NLG text and asked them to comment on a series of issues. We grouped their answers into several general themes around: their use of the snippets, their opinions on text provenance, the ideal length of the text, the importance of the text for the ArticlePlaceholder, and limitations of the algorithm.

Use  The participants appreciated the text snippets:

"I think it would be a great opportunity for general Wikipedia readers to help improve their experience, while reading Wikipedia" (P7).

Some of them noted that the text gave them a useful overview and quick introduction to the topic of the page, particularly for people trained in one language or non-English speakers:

"I think that if I saw such an article in Ukrainian, I would probably then go to English anyway, because I know English, but I think it would be a huge help for those who don't" (P10).

Provenance  As part of the reading task, we asked the editors what they believed was the provenance of the information displayed on the page. This gave us more context to the promising fluency and appropriateness scores achieved in the quantitative study. The editors made general comments about the page and tended to assume that the text was from other Wikipedia language versions:

[The text was] "taken from Wikipedia, from Wikipedia projects in different languages." (P1)

Editors more familiar with Wikidata suggested the information might be derived from Wikidata's descriptions:

"it should be possible to be imported from Wikidata" (P9).

Only one editor could spot a difference between the text and regular Wikidata triples:

"I think it's taken from the outside sources, the text, the first text here, anything else I don't think it has been taken from anywhere else, as far as I can tell" (P2).

Overall, the answers supported our assumption that NLG, trained on Wikidata labelled triples, could be naturally added to Wikipedia pages without changing the reading experience. At the same time, the task revealed questions around algorithmic complexity and capturing provenance. Both are relevant to ensure
transparency and accountability, and help flag quality issues.

Length  We were interested in understanding how we could iterate over the NLG capabilities of our system to produce text of appropriate length. While the model generated just one sentence, the editors thought it to be a helpful starting point:

"Actually I would feel pretty good learning the basics. What I saw is the basic information of the city so it will be fine, almost like a stub" (P4).

While generating larger pieces of text could arguably be more useful, reducing the need for manual editing even further, the fact that the placeholder page contained just one sentence made it clear to the editors that the page still required work. In this context, another editor referred to a 'magic threshold' for an automatically generated text to be useful (see also Section 5.4). Their expectations for an NLG output were clear:

"So the definition has to be concise, a little bit not very long, very complex, to understand the topic, is it the right topic you're looking for or". (P1)

We noted that whatever the length of the snippet, it needs to match reading practice. Editors tend to skim

Importance  When reading the ArticlePlaceholder page, people looked first at our text:

"The introduction of this line, that's the first thing I see" (P4).

This is their way to confirm that they landed on the right page and that the topic matches what they were looking for:

"Yeah, it does help because that's how you know if you're on the right article and not a synonym or some other article" (P1).

This makes the text critical for the engagement with the ArticlePlaceholder page, where most of the information is expressed as Wikidata triples. Natural language can add context to structured data representations:

"Well that first line was, it's really important because I would say that it [the ArticlePlaceholder page] doesn't really make a lot of sense [without it] ... it's just like a soup of words, like there should be one string of words next to each other so this all ties in the thing together. This is the most important thing I would say." (P8)
"the language is specific, it says that this is a language that is spoken mainly in Morocco and Algeria, the language, I don't even know the symbols for the alphabets [...] I don't know if this is correct, I don't know the language itself so for me, it will go unnoticed. But if somebody from that area who knows anything about this language, I think they might think twice" (P8).

5.4. RQ3 - Task-based evaluation

Our third research question focused on how people work with automatically generated text. The overall aim of adding NLG to the ArticlePlaceholder is to help editors from under-served Wikipedia language communities bootstrap missing articles without disrupting their editing practices. As noted earlier, we carried out a task-based evaluation, in which the participants were presented with an ArticlePlaceholder page that included the text snippet and triples from Wikidata relevant to the topic. We carried out a quantitative analysis of the editing activities via the GST score, as well as a qualitative, thematic analysis of the interviews, in which the participants explained how and why they changed the text. In the following, we first present the GST scores, and then discuss the themes that emerged from the interviews.

Reusing the text  As shown in Table 9, the snippets

Editing experience  While GST scores gave us an idea about the extent to which the automatically generated text is reused by the participants, the interviews helped us understand how they experienced the text. Overall, the text snippets were widely praised as helpful, especially for newcomers, as they can be a good starting help:

"I think it would be good at least to make it easier for me as a reader to build on the page if I'm a new volunteer or a first time edit, it would make adding to the entry more appealing (...) I think adding a new article is a barrier. For me I started only very few articles, I always built on other contribution. So I think adding a new page is a barrier that Wikidata can remove. I think that would be the main difference." (P4)

All participants were experienced editors. Just like in the reading task, they thought having a shorter text to start editing had advantages:

"It wasn't too distracting because this was so short. If it was longer, it would be (...) There is this magical threshold up to which you think that it would be easier to write from scratch, it wasn't here there." (P10)

The length of the text is also related to the ability of the editors to work around and fix errors, such as the
Table 9. Number of snippets in each category of reuse. A generated snippet (top) and its edited version (bottom). Solid lines represent reused tiles, while dashed lines represent overlapping sub-sequences not contributing to the gstscore. The first two examples are created in the studies, the last one (ND) is from a previous experiment.

Category   #
WD         8
PD         2
ND         0

However, the text they added turned out to be very close to what the algorithm generated.

"I know it, I have it here, I have the name in Arabic, so I can just copy and paste it here." (P10)

"[deleted the whole sentence] mainly because of the missing tokens, otherwise it would have been fine" (P5)

One participant commented at length on the presence of the tokens:

"I didn't know what rare [is], I thought it was some kind of tag used in machine learning because I've seen other tags before but it didn't make any sense because I didn't know what use I had, how can I use it, what's the use of it and I would say it would be distracting, if it's like in other parts of the page here. So that would require you to understand first what rare does there, what is it for and that would take away the interest I guess, or the attention span so it would be better just to, I don't know if it's for instance, if the word is not, it's rare, this part right here which is, it shouldn't be there, it should be more, it would be better if it's like the input box or something". (P1)

Overall, the editing task and the follow-up interviews showed that the text snippets were a useful starting point for editing the page. Missing information, presented in the form of

not been summarised in the text, or by trying to find that information elsewhere on Wikipedia.

6. Discussion

Our first research question focuses on how well an NLG algorithm can generate summaries from the Wikipedia reader's perspective. In most of the cases, the text is considered to be from the Wikimedia environment. Readers do not clearly differentiate between the generated summary and an original Wikipedia sentence. While this indicates the high quality of the generated textual content, it is problematic with respect to trust in Wikipedia. Trust in Wikipedia, and how humans evaluate the trustworthiness of a certain article, has been investigated using both quantitative and qualitative methods. Adler et al. [58] and Adler and de Alfaro [57] develop a quantitative framework based on Wikipedia's history. Lucassen and Schraagen [59] use a qualitative methodology to code readers' opinions on the aspects that indicate the trustworthiness of an article. However, none of these approaches take the automatic creation of text by non-human algorithms into account. Therefore, a high-quality summary that is not distinguishable from a human-written Wikipedia summary can be a double-edged sword. While conducting the interviews, we realized that the generated Arabic summary has a factual mistake. We could show in previous work [6]
take, even while translating the sentence to the interviewee:

"Yes, so I think, so we have here country, Moroccan city in the north, I would say established and in year (...)" (P1)

As we did not introduce the sentence as automatically generated, we assume this is due to the trust Wikipedians have in their platform and the quality of the sentence:

"This sentence was so well written that I didn't even bother to verify if it's actually a desert city" (P3)

To not misinform and eventually disrupt this trust, future work will have to investigate ways of dealing with those wrongly generated statements. On a technical level, that will mean exploring ways to exclude sentences with factual mistakes. On an HCI level, ways of communicating the problems of neural text generation to the reader will need to be investigated. One possible solution to this problem can be visualisations such as WikiDashboard [61], as these can influence the trust given to a particular article [60]. A similar tool could be used for the provenance of text.

Supporting the previous results from the readers, editors also have a positive perception of the summaries. It is the first thing they read when they arrive at a page, and it helps them to quickly verify that the page is about the topic they are looking for.

When creating the summaries, we assumed their relatively short length might be a point for improvement from an editors' perspective, in particular as research suggests that the length of an article indicates its quality: basically, the longer, the better [62]. However, from the interviews with editors, we found that they mostly skim articles when reading them. This seems to be the more natural way of browsing the information in an article, and it is supported by the short summary giving an overview of the topic.

All editors we worked with are part of the multilingual Wikipedia community, editing in at least two Wikipedias. Hale [63, 64] highlights that users of this community are particularly active compared to their monolingual counterparts and confident in editing across different languages. However, taking potential newcomers into account, we suggest that the ArticlePlaceholder might be helpful to lower the barrier of starting to edit. Recruiting more editors has been a long-standing objective, with initiatives such as the Tea House [65] aiming at welcoming and comforting new editors; the Wikipedia Adventure employs a similar approach, however using a tool with gamification features [66]. The ArticlePlaceholder, and in particular the provided summaries in natural language, can have an impact on how people start editing. In comparison to the Wikipedia Adventure, readers are exposed to the ArticlePlaceholder pages and, thus, it could lower their reservation to edit by offering a more natural start of editing.

Lastly, we asked the research question of how editors use the textual summaries in their workflow. Generally, we can show that the text is highly reused. One of the editors mentioned a magic threshold that makes the summary acceptable as a starting point for editing. This seems similar to post-editing in machine translation (or rather monolingual machine translation [67], where a user only speaks the target or source language). Translators have been found to oppose machine translation and post-editing, as they perceive it as more time consuming and restricting with respect to their freedom in the translation (e.g. sentence structure) [68]. Nonetheless, it has been shown that a different interface can not only lead to reduced time and enhanced quality, but also convince a user to believe in the improved quality of the machine translation [69]. This underlines the importance of the right integration of machine-generated sentences, as we aim for in the ArticlePlaceholder.

The core of this work is to understand the perception of a community, such as Wikipedians, of the integration of a state-of-the-art machine learning technique for NLG in their platform. We can show that such an integration can work and be supported. This finding aligns with other projects already deployed on Wikipedia. For instance, bots (short for robots) that monitor Wikipedia have become a trusted tool for vandalism fighting [70]; so much so that they can even revert edits made by humans if they believe them to be malicious. The cooperative work between humans and machines on Wikipedia has also been theorized in machine translation. Alegria et al. [71] argue for the integration of machine translation in Wikipedia that learns from the post-editing of the editors. Such a human-in-the-loop approach is also applicable to our NLG work, where an algorithm could learn from the humans' contributions.

There is a need to investigate this direction further, as NLG algorithms will not achieve the same quality as humans. Especially in a low-resource setting, such as the one observed in this work, human support is needed. However, automated tools can be a great way of allocating the limited human resources to the tasks
1 that are mostly needed. Post-editing the summaries can gether, those two groups form the Wikipedia commu- 1 2 serve a purely data-driven approach such as ours with nity as new editors are recruited from the existing pool 2 3 additional data that can be used to further improve of readers and readers contribute in essential ways to 3 4 the quality of the automatically generated content. To Wikipedia as shown by Antin and Cheshire [75]. 4 5 make such an approach feasible for the ArticlePlace- Beside the Arabic sentence, the editors worked on 5 6 holder, we need an interface that encourages the edit- are synthesized, i.e. not generated but created by the 6 7 ing of the summaries. The less effort this editing re- authors of this study. While those sentences were not 7 8 quires, the more we can ensure an easy collaboration created by natural language generation, they were cre- 8 9 of human and machine. ated and discussed with other researchers in the field. 9 10 We therefore focused on the most common problem in 10 11 text generative tasks similar to ours: the
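One measurable signal for such a post-editing feedback loop is how much of a generated summary an editor keeps in their revision. The sketch below is our own illustrative example, not part of the study: the `reuse_ratio` helper and the sample summaries are hypothetical, using token-level, order-preserving matching from Python's standard library.

```python
# Illustrative sketch (not from the paper): estimate how much of a
# machine-generated summary survives an editor's post-edit, as a signal
# for collecting retraining data from human contributions.
from difflib import SequenceMatcher

def reuse_ratio(generated: str, edited: str) -> float:
    """Fraction of generated tokens that reappear, in order, in the edit."""
    gen_tokens = generated.split()
    edit_tokens = edited.split()
    if not gen_tokens:
        return 0.0
    matcher = SequenceMatcher(a=gen_tokens, b=edit_tokens, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(gen_tokens)

# A post-edit that only adds detail keeps every generated token:
generated = "Marrakesh is a city in Morocco ."
edited = "Marrakesh is a major city in western Morocco ."
print(reuse_ratio(generated, edited))  # → 1.0
```

A high ratio suggests the generated text was acceptable as a draft; a low ratio flags summaries whose (summary, post-edit) pairs would be the most informative additional training data.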
…gain an understanding of how users interact with faulty produced text, did not hinder the reading experience nor the editing experience. Editors are likely to reuse a large portion of the generated summaries. Additionally, participants mentioned that the summary can be a good starting point for new editors.

9. Acknowledgment

This research was supported by the EPSRC-funded project Data Stories under grant agreement number EP/P025676/1; by DSTL under grant number DSTLX-1000094186; and by the EU H2020 Programme under Project No. 727658 (IASIS). We would like to express our gratitude to the Wikipedia communities involved in this study for embracing our research, and particularly to the participants of the interviews.

References

[1] B. Hecht and D. Gergle, The Tower of Babel Meets Web 2.0: User-generated Content and Its Applications in a Multilingual Context, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2010, pp. 291–300.
[2] D. Vrandečić and M. Krötzsch, Wikidata: A Free Collaborative Knowledgebase, Communications of the ACM 57(10) (2014), 78–85.
[3] L.-A. Kaffee and E. Simperl, Analysis of Editors' Languages in Wikidata, in: Proceedings of the 14th International Symposium on Open Collaboration, OpenSym 2018, Paris, France, August 22-24, 2018, 2018, pp. 21:1–21:5. doi:10.1145/3233391.3233965.
[4] L.-A. Kaffee, A. Piscopo, P. Vougiouklis, E. Simperl, L. Carr and L. Pintscher, A Glimpse into Babel: An Analysis of Multilinguality in Wikidata, in: Proceedings of the 13th International Symposium on Open Collaboration, ACM, 2017, p. 14.
[5] L.-A. Kaffee, Generating ArticlePlaceholders from Wikidata for Wikipedia: Increasing Access to Free and Open Knowledge, 2016.
[6] P. Vougiouklis, H. Elsahar, L.-A. Kaffee, C. Gravier, F. Laforest, J. Hare and E. Simperl, Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples, Journal of Web Semantics (2018). doi:10.1016/j.websem.2018.07.002. http://www.sciencedirect.com/science/article/pii/S1570826818300313.
[7] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk and Y. Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, CoRR abs/1406.1078 (2014).
[8] I. Sutskever, O. Vinyals and Q.V. Le, Sequence to Sequence Learning with Neural Networks, in: Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger, eds, Curran Associates, Inc., 2014, pp. 3104–3112.
[9] H. ElSahar, C. Gravier and F. Laforest, Zero-Shot Question Generation from Knowledge Graphs for Unseen Predicates and Entity Types, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), 2018, pp. 218–228. https://aclanthology.info/papers/N18-1020/n18-1020.
[10] I.V. Serban, A. García-Durán, Ç. Gülçehre, S. Ahn, S. Chandar, A.C. Courville and Y. Bengio, Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
[11] X. Du, J. Shao and C. Cardie, Learning to Ask: Neural Question Generation for Reading Comprehension, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 2017, pp. 1342–1352. doi:10.18653/v1/P17-1123.
[12] T. Luong, I. Sutskever, Q.V. Le, O. Vinyals and W. Zaremba, Addressing the Rare Word Problem in Neural Machine Translation, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 2015, pp. 11–19.
[13] L. Dong and M. Lapata, Language to Logical Form with Neural Attention, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016. http://aclweb.org/anthology/P/P16/P16-1004.pdf.
[14] L.-A. Kaffee, H. ElSahar, P. Vougiouklis, C. Gravier, F. Laforest, J.S. Hare and E. Simperl, Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), 2018, pp. 640–645. https://aclanthology.info/papers/N18-2101/n18-2101.
[15] P. Isola, J. Zhu, T. Zhou and A.A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 5967–5976. doi:10.1109/CVPR.2017.632.
[16] L.-A. Kaffee, H. ElSahar, P. Vougiouklis, C. Gravier, F. Laforest, J.S. Hare and E. Simperl, Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders (2018), 319–334. doi:10.1007/978-3-319-93417-4_21.
[17] B. Collier and J. Bear, Conflict, criticism, or confidence: an empirical examination of the gender gap in wikipedia contributions, in: CSCW '12 Computer Supported Cooperative Work, Seattle, WA, USA, February 11-15, 2012, 2012, pp. 383–392. doi:10.1145/2145204.2145265.
[18] M. Graham, B. Hogan, R.K. Straumann and A. Medhat, Uneven Geographies of User-generated Information: Patterns of Increasing Informational Poverty, Annals of the Association of American Geographers 104(4) (2014), 746–764.
[19] M.J. Pat Wu, State of connectivity 2015: A report on global internet access, 2016. http://newsroom.fb.com/news/2016/02/state-of-connectivity-2015-a-report-on-global-internet-access/.
[20] J. Voss, Measuring Wikipedia (2005).
[21] B.J. Hecht, The Mining and Application of Diverse Cultural Perspectives in User-generated Content, PhD thesis, Northwestern University, 2013.
[22] P. Bao, B. Hecht, S. Carton, M. Quaderi, M. Horn and D. Gergle, Omnipedia: Bridging the Wikipedia Language Gap, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2012, pp. 1075–1084.
[23] C. Kramsch and H. Widdowson, Language and Culture, Oxford University Press, 1998.
[24] T. Steiner, Bots vs. Wikipedians, Anons vs. Logged-Ins (Redux): A Global Study of Edit Activity on Wikipedia and Wikidata, in: Proceedings of The International Symposium on Open Collaboration, OpenSym '14, ACM, New York, NY, USA, 2014, pp. 25:1–25:7. ISBN 978-1-4503-3016-9. doi:10.1145/2641580.2641613.
[25] L. Wanner, B. Bohnet, N. Bouayad-Agha, F. Lareau and D. Nicklaß, Marquis: Generation of User-Tailored Multilingual Air Quality Bulletins, Applied Artificial Intelligence 24(10) (2010), 914–952. doi:10.1080/08839514.2010.529258.
[26] D. Galanis and I. Androutsopoulos, Generating Multilingual Descriptions from Linguistically Annotated OWL Ontologies: The NaturalOWL System, in: Proceedings of the Eleventh European Workshop on Natural Language Generation, Association for Computational Linguistics, 2007, pp. 143–146.
[27] D. Duma and E. Klein, Generating Natural Language from Linked Data: Unsupervised template extraction, in: Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, Association for Computational Linguistics, Potsdam, Germany, 2013, pp. 83–94. http://www.aclweb.org/anthology/W13-0108.
[28] B. Ell and A. Harth, A language-independent method for the extraction of RDF verbalization templates, in: INLG 2014 - Proceedings of the Eighth International Natural Language Generation Conference, Including Proceedings of the INLG and SIGDIAL 2014 Joint Session, 19-21 June 2014, Philadelphia, PA, USA, 2014, pp. 26–34.
[29] C. Sauper and R. Barzilay, Automatically Generating Wikipedia Articles: A Structure-aware Approach, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, Association for Computational Linguistics, 2009, pp. 208–216.
[30] Y. Pochampally, K. Karlapalem and N. Yarrabelly, Semi-Supervised Automatic Generation of Wikipedia Articles for Named Entities, in: Wiki@ICWSM, 2016.
[31] W.D. Lewis and P. Yang, Building MT for a Severely Under-Resourced Language: White Hmong, Association for Machine Translation in the Americas, October (2012).
[32] A. Chisholm, W. Radford and B. Hachey, Learning to generate one-sentence biographies from Wikidata, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 633–642.
[33] Y. Mrabet, P. Vougiouklis, H. Kilicoglu, C. Gardent, D. Demner-Fushman, J. Hare and E. Simperl, Aligning Texts and Knowledge Bases with Semantic Sentence Simplification, Association for Computational Linguistics, 2016, pp. 29–36.
[34] R. Lebret, D. Grangier and M. Auli, Neural Text Generation from Structured Data with Application to the Biography Domain, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 1203–1213.
[35] E. Reiter and A. Belz, An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems, Computational Linguistics 35(4) (2009), 529–558. doi:10.1162/coli.2009.35.4.35405.
[36] K. Papineni, S. Roukos, T. Ward and W. Zhu, Bleu: a Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, 2002, pp. 311–318.
[37] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Vol. 8, Barcelona, Spain, 2004.
[38] A. Lavie and A. Agarwal, METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, Association for Computational Linguistics, Stroudsburg, PA, USA, 2007, pp. 228–231.
[39] G. Angeli, P. Liang and D. Klein, A Simple Domain-independent Probabilistic Approach to Generation, in: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, Association for Computational Linguistics, Stroudsburg, PA, USA, 2010, pp. 502–512. http://dl.acm.org/citation.cfm?id=1870658.1870707.
[40] I. Konstas and M. Lapata, A Global Model for Concept-to-text Generation, Journal of Artificial Intelligence Research 48(1) (2013), 305–346. http://dl.acm.org/citation.cfm?id=2591248.2591256.
[41] C. Mellish and R. Dale, Evaluation in the context of natural language generation, Computer Speech & Language 12(4) (1998), 349–373. doi:10.1006/csla.1998.0106.
[42] E. Reiter, R. Robertson and S. Sripada, Acquiring Correct Knowledge for Natural Language Generation, Journal of Artificial Intelligence Research 18 (2003), 491–516. doi:10.1613/jair.1176. https://jair.org/index.php/jair/article/view/10332.
[43] S. Williams and E. Reiter, Generating basic skills reports for low-skilled readers, Natural Language Engineering 14(4) (2008), 495–525. doi:10.1017/S1351324908004725.
[44] E. Reiter, Natural Language Generation, in: The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010, pp. 574–598, Chapter 20. ISBN 9781444324044. doi:10.1002/9781444324044.ch20.
[45] X. Sun and C. Mellish, An experiment on free generation from single RDF triples, in: Proceedings of the Eleventh European Workshop on Natural Language Generation, Association for Computational Linguistics, 2007, pp. 105–108.
[46] A.-C. Ngonga Ngomo, L. Bühmann, C. Unger, J. Lehmann and D. Gerber, Sorry, I don't speak SPARQL – Translating SPARQL Queries into Natural Language, in: Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 977–988.
[47] P.D. Clough, R.J. Gaizauskas, S.S.L. Piao and Y. Wilks, METER: MEasuring TExt Reuse, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, 2002, pp. 152–159.
[48] M. Potthast, T. Gollub, M. Hagen, J. Kiesel, M. Michel, A. Oberländer, M. Tippmann, A. Barrón-Cedeño, P. Gupta, P. Rosso and B. Stein, Overview of the 4th International Competition on Plagiarism Detection, in: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes, Rome, Italy, September 17-20, 2012, 2012.
[49] M. Mintz, S. Bills, R. Snow and D. Jurafsky, Distant supervision for relation extraction without labeled data, in: ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore, 2009, pp. 1003–1011.
[50] S. Bird, NLTK: The Natural Language Toolkit, in: ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006, 2006.
[51] T. Luong, H. Pham and C.D. Manning, Effective Approaches to Attention-based Neural Machine Translation, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 1412–1421. http://aclweb.org/anthology/D15-1166.
[52] A.M. Rush, S. Chopra and J. Weston, A Neural Attention Model for Abstractive Sentence Summarization, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 2015, pp. 379–389.
[53] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, in: Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12, 1997, 1997, pp. 143–151.
[54] N. Halko, P. Martinsson and J.A. Tropp, Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, SIAM Review 53(2) (2011), 217–288. doi:10.1137/090771806.
[55] K. Heafield, I. Pouzyrevsky, J.H. Clark and P. Koehn, Scalable Modified Kneser-Ney Language Model Estimation, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers, 2013, pp. 690–696.
[56] M.J. Wise, YAP3: Improved detection of similarities in computer program and other texts, ACM SIGCSE Bulletin 28(1) (1996), 130–134.
[57] B.T. Adler and L. de Alfaro, A Content-driven Reputation System for the Wikipedia, in: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, 2007, pp. 261–270. doi:10.1145/1242572.1242608.
[58] B.T. Adler, K. Chatterjee, L. de Alfaro, M. Faella, I. Pye and V. Raman, Assigning Trust to Wikipedia Content, in: Proceedings of the 2008 International Symposium on Wikis, Porto, Portugal, September 8-10, 2008, 2008. doi:10.1145/1822258.1822293.
[59] T. Lucassen and J.M. Schraagen, Trust in Wikipedia: How Users Trust Information from an Unknown Source, in: Proceedings of the 4th ACM Workshop on Information Credibility on the Web, WICOW 2010, Raleigh, North Carolina, USA, April 27, 2010, 2010, pp. 19–26. doi:10.1145/1772938.1772944.
[60] A. Kittur, B. Suh and E.H. Chi, Can You Ever Trust a Wiki?: Impacting Perceived Trustworthiness in Wikipedia, in: Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, CSCW 2008, San Diego, CA, USA, November 8-12, 2008, 2008, pp. 477–480. doi:10.1145/1460563.1460639.
[61] P. Pirolli, E. Wollny and B. Suh, So You Know You're Getting the Best Possible Information: a Tool that Increases Wikipedia Credibility, in: Proceedings of the 27th International Conference on Human Factors in Computing Systems, CHI 2009, Boston, MA, USA, April 4-9, 2009, 2009, pp. 1505–1508. doi:10.1145/1518701.1518929.
[62] J.E. Blumenstock, Size Matters: Word Count as a Measure of Quality on Wikipedia, in: Proceedings of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21-25, 2008, 2008, pp. 1095–1096. doi:10.1145/1367497.1367673.
[63] S.A. Hale, Multilinguals and Wikipedia editing, in: ACM Web Science Conference, WebSci '14, Bloomington, IN, USA, June 23-26, 2014, 2014, pp. 99–108. doi:10.1145/2615569.2615684.
[64] S.A. Hale, Cross-language Wikipedia Editing of Okinawa, Japan, in: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI 2015, Seoul, Republic of Korea, April 18-23, 2015, 2015, pp. 183–192. doi:10.1145/2702123.2702346.
[65] J.T. Morgan, S. Bouterse, H. Walls and S. Stierch, Tea and Sympathy: Crafting Positive New User Experiences on Wikipedia, in: Computer Supported Cooperative Work, CSCW 2013, San Antonio, TX, USA, February 23-27, 2013, 2013, pp. 839–848. doi:10.1145/2441776.2441871.
[66] S. Narayan, J. Orlowitz, J.T. Morgan, B.M. Hill and A.D. Shaw, The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users, in: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW 2017, Portland, OR, USA, February 25 - March 1, 2017, 2017, pp. 1785–1799. http://dl.acm.org/citation.cfm?id=2998307.
[67] C. Hu, B.B. Bederson, P. Resnik and Y. Kronrod, MonoTrans2: A New Human Computation System to Support Monolingual Translation, in: Proceedings of the International Conference on Human Factors in Computing Systems, CHI 2011, Vancouver, BC, Canada, May 7-12, 2011, 2011, pp. 1133–1136. doi:10.1145/1978942.1979111.
[68] E. Lagoudaki, Translation Editing Environments, in: MT Summit XII: Workshop on Beyond Translation Memories, 2009.
[69] S. Green, J. Heer and C.D. Manning, The Efficacy of Human Post-editing for Language Translation, in: 2013 ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '13, Paris, France, April 27 - May 2, 2013, 2013, pp. 439–448. doi:10.1145/2470654.2470718.
[70] R.S. Geiger and D. Ribes, The Work of Sustaining Order in Wikipedia: The Banning of a Vandal, in: Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work, CSCW 2010, Savannah, Georgia, USA, February 6-10, 2010, 2010, pp. 117–126. doi:10.1145/1718918.1718941.
[71] I. Alegria, U. Cabezón, U.F. de Betoño, G. Labaka, A. Mayor, K. Sarasola and A. Zubiaga, Reciprocal Enrichment Between Basque Wikipedia and Machine Translation, in: The People's Web Meets NLP, Collaboratively Constructed Language Resources, 2013, pp. 101–118. doi:10.1007/978-3-642-35085-6_4.
[72] S. Kuznetsov, Motivations of Contributors to Wikipedia, SIGCAS Computers and Society 36(2) (2006), 1. doi:10.1145/1215942.1215943.
[73] K.A. Panciera, A. Halfaker and L.G. Terveen, Wikipedians are born, not made: a study of power editors on Wikipedia, in: Proceedings of the 2009 International ACM SIGGROUP Conference on Supporting Group Work, GROUP 2009, Sanibel Island, Florida, USA, May 10-13, 2009, 2009, pp. 51–60. doi:10.1145/1531674.1531682.
[74] F. Lemmerich, D. Sáez-Trumper, R. West and L. Zia, Why the World Reads Wikipedia: Beyond English Speakers, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, 2019, pp. 618–626. doi:10.1145/3289600.3291021.
[75] J. Antin and C. Cheshire, Readers are not free-riders: reading as a form of participation on wikipedia, in: Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work, CSCW 2010, Savannah, Georgia, USA, February 6-10, 2010, 2010, pp. 127–130. doi:10.1145/1718918.1718942.
[76] A. Rohrbach, L.A. Hendricks, K. Burns, T. Darrell and K. Saenko, Object Hallucination in Image Captioning, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 4035–4045. https://aclanthology.info/papers/D18-1437/d18-1437.