Using Natural Language Generation to Bootstrap Empty Wikipedia Articles


Semantic Web 0 (0) 1. IOS Press.

Using Natural Language Generation to Bootstrap Empty Wikipedia Articles: A Human-centric Perspective

Lucie-Aimée Kaffee*, Pavlos Vougiouklis and Elena Simperl
School of Electronics and Computer Science, University of Southampton, UK
E-mails: [email protected], [email protected], [email protected]
*Corresponding author. E-mail: [email protected].

Abstract. Nowadays natural language generation (NLG) is used in everything from news reporting and chatbots to social media management. Recent advances in machine learning have made it possible to train NLG systems to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-served Wikipedias, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate text from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the text snippets score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles. It also hints at several potential implications of using NLG solutions in Wikipedia at large, including content quality, trust in technology, and algorithmic transparency.

Keywords: Wikipedia, Wikidata, ArticlePlaceholder, Under-resourced languages, Natural Language Generation, Neural networks

1. Introduction

Wikipedia is available in 301 languages, but its content is unevenly distributed [1]. Under-resourced language versions face multiple challenges: fewer editors means fewer articles and less quality control, making that particular Wikipedia less attractive for readers in that language, which in turn makes it more difficult to recruit new editors from among the readers.

Wikidata, the structured-data backbone of Wikipedia [2], offers some help. It contains information about more than 55 million entities, for example, people, places or events, edited by an active international community of volunteers [3]. More importantly, it is multilingual by design, and each aspect of the data can be translated and rendered to the user in their preferred language [4]. This makes it the tool of choice for a variety of content integration affordances in Wikipedia, including links to articles in other languages and infoboxes. An example can be seen in Figure 1: in the French Wikipedia, the infobox shown in the article about cheese (right) automatically draws in data from Wikidata (left) and displays it in French.

In previous work of ours, we proposed the ArticlePlaceholder, a tool that takes advantage of Wikidata's multilingual capabilities to increase the coverage of under-resourced Wikipedias [5]. When someone looks for a topic that is not yet covered by Wikipedia in their language, the ArticlePlaceholder tries to match the topic with an entity in Wikidata. If successful, it then redirects the search to an automatically generated placeholder page that displays the relevant information, for example the name of the entity and its main properties, in their language. The ArticlePlaceholder is currently used in 14 Wikipedias (see Section 3.1).
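The lookup flow just described (match a search term to a Wikidata entity and, on success, render a placeholder page from its data in the reader's language) can be sketched roughly as follows. This is a toy illustration with hypothetical data and function names, not the tool's actual implementation:

```python
# Toy stand-ins for the Wikidata search index and entity store; the real
# ArticlePlaceholder queries Wikidata itself.
TOY_INDEX = {
    "fromage": "Q10943",  # hypothetical mapping: search term -> Wikidata item
}

TOY_ENTITIES = {
    "Q10943": {
        "label": {"fr": "fromage", "en": "cheese"},
        "claims": [("P279", "produit laitier"), ("P527", "lait")],
    }
}


def placeholder_page(search_term: str, lang: str):
    """Match a search term to a Wikidata entity; if found, return the data
    needed to render a placeholder page in the reader's language."""
    qid = TOY_INDEX.get(search_term.lower())
    if qid is None:
        return None  # no match: fall through to the usual "no article" page
    entity = TOY_ENTITIES[qid]
    # Label fallback: prefer the reader's language, else English.
    title = entity["label"].get(lang) or entity["label"]["en"]
    return {"qid": qid, "title": title, "triples": entity["claims"]}


page = placeholder_page("Fromage", "fr")
```

The key design point is the fallback chain on labels: because Wikidata stores labels per language, a placeholder page can be rendered even when only some labels exist in the reader's language.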
1570-0844/0-1900/$35.00 © IOS Press and the authors. All rights reserved.

Fig. 1. Representation of Wikidata triples and their inclusion in a Wikipedia infobox. Wikidata triples (left) are used to fill out the fields of the infobox in the article about fromage from the French Wikipedia.

In this paper, we propose an iteration of the ArticlePlaceholder to improve the representation of the data on the placeholder page. The original version of the tool pulled the raw data from Wikidata (available as triples with labels in different languages) and displayed it in tabular form (see Figure 3 in Section 3). In the current version, we use Natural Language Generation (NLG) techniques to automatically produce a text snippet from the triples instead. Presenting structured data as text rather than tables helps people uninitiated with the involved technologies to make sense of it [6]. This is particularly useful in contexts where one cannot make any assumptions about the levels of data literacy of the audience, as is the case for a large share of the Wikipedia community.

Our NLG solution builds upon the general encoder-decoder framework for neural networks, which is credited with promising results in similar text-centric tasks, such as machine translation [7, 8] and question generation [9–11]. We extend this framework to meet the needs of different Wikipedia language communities in terms of text fluency, appropriateness to Wikipedia, and reuse during article editing. Given an entity that was matched by the ArticlePlaceholder, our system uses its triples to generate a short Wikipedia-style summary. Many existing NLG techniques produce sentences with limited usability in user-facing systems; one of the most common problems is their inability to handle rare words [12, 13], which are words that the model does not meet frequently enough during training, such as localisations of names in different languages. We introduce a mechanism called property placeholder [14] to tackle this problem, learning multiple verbalisations of the same entity in the text [6].

In building the system we aimed to pursue the following research questions:

RQ1 Can we train a neural network to generate text from triples in a low-resource setting? To answer this question we first evaluated the system using a series of predefined metrics and baselines. In addition, we undertook a quantitative study with participants from two under-served Wikipedia language communities (Arabic and Esperanto), who were asked to assess, from a reader's perspective, whether the text is fluent and appropriate for Wikipedia.

RQ2 How do editors perceive the generated text on the ArticlePlaceholder page? To add depth to the quantitative findings of the first study, we undertook a second, mixed-methods study within six Wikipedia language communities (Arabic, Swedish, Hebrew, Persian, Indonesian, and Ukrainian). We carried out semi-structured interviews, in which we asked editors to comment on their experience with reading the summaries generated through our approach, and we identified common themes in their answers. Among others, we were interested to understand how editors perceive text that is the result of an artificial intelligence (AI) algorithm rather than being manually crafted, and how they deal with so-called <rare> tokens in the sentences. Those tokens represent realisations of infrequent entities in the text that data-driven approaches generally struggle to verbalise [12].

RQ3 How do editors use the text snippets in their work? As part of the second study, we also asked participants to edit the placeholder page, starting from the automatically generated text or removing it completely. We assessed text reuse both quantitatively, using a string-matching metric, and qualitatively through the interviews. Just like in RQ2, we were also interested to understand whether summaries with <rare> tokens, which point to limitations in the algorithm, would […]

[…] around the <rare> tokens both during reading and writing, they did not check the text for correctness, nor queried where the text came from and what the tokens meant. This suggests that we need more research into how to communicate the provenance of content in Wikipedia, especially in the context of automatic content generation and deep fakes [15], as well as algorithmic transparency.

Structure of the paper. The remainder of the paper is organised as follows. We start with some background for our work and related papers in Section 2. Next, we introduce our approach to bootstrapping empty Wikipedia articles, which includes the ArticlePlaceholder tool and its NLG extension (Section 3). In Section 4 we provide details on the evaluation methodology, whose findings we present in Section 5. We then discuss the main themes emerging from the findings, their implications, and the limitations of our work in Section 6, before concluding with a summary of contributions and planned future work in Section 8.

Previous submissions. A preliminary version of this work was published in [14, 16].
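The property-placeholder idea referenced above can be illustrated with a small sketch: instead of forcing the model to verbalise a rare entity name directly, the output vocabulary contains a token per Wikidata property (e.g. <P19> for place of birth), and a post-processing step swaps each emitted token for the label of the matching triple's object. The function name, token format, and triples below are hypothetical stand-ins, not the system's exact interface:

```python
import re


def realise(generated: str, triples: dict) -> str:
    """Replace property-placeholder tokens such as <P19> with the object
    label of the matching triple; placeholders with no matching triple
    fall back to a generic <rare> token."""
    def substitute(match: re.Match) -> str:
        prop = match.group(1)  # e.g. "P19" (place of birth)
        return triples.get(prop, "<rare>")
    return re.sub(r"<(P\d+)>", substitute, generated)


# Hypothetical object labels keyed by property, taken from the entity's triples.
triples = {"P19": "Lyon", "P27": "France"}
out = realise("she was born in <P19> , <P27> .", triples)
# -> "she was born in Lyon , France ."
```

Because the label is copied from the input triples rather than generated token by token, the same trained model can surface entity names it never saw frequently enough to learn, which is what makes the approach attractive in a low-resource setting.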