Classifying Articles in English and German Wikipedia

Total Page:16

File Type:pdf, Size:1020Kb

Classifying Articles in English and German Wikipedia Classifying articles in English and German Wikipedia Nicky Ringland and Joel Nothman and Tara Murphy and James R. Curran School of Information Technologies University of Sydney NSW 2006, Australia {nicky,joel,tm,james}@it.usyd.edu.au Abstract German NER is especially challenging since various features used successfully in English Named Entity (NE) information is criti- NER, including proper noun capitalisation, do cal for Information Extraction (IE) tasks. However, the cost of manually annotating not apply to German language data, making NEs sufficient data for training purposes, espe- harder to detect and classify (Nothman et al., cially for multiple languages, is prohibitive, 2008). Furthermore, German has partially free meaning automated methods for develop- word order which affects the reliability of con- ing resources are crucial. We investigate textual evidence, such as previous and next word the automatic generation of NE annotated features, for NE detection. data in German from Wikipedia. By incor- porating structural features of Wikipedia, Nothman et al. (2008) devised a novel method we can develop a German corpus which of automatically generating English NE train- accurately classifies Wikipedia articles into ing data by utilising Wikipedia’s internal struc- NE categories to within 1% F -score of the ture. The approach involves classifying all arti- state-of-the-art process in English. cles in Wikipedia into classes using a features- based bootstrapping algorithm, and then creating 1 Introduction a corpus of sentences containing links to articles identified and classified based on the link’s target. Machine Learning methods in Natural Language Processing (NLP) often require large annotated We extend the features used in Nothman et al. training corpora. Wikipedia can be used to au- (2008) for use with German Wikipedia by cre- tomatically generate robust annotated corpora for ating new heuristics for classification. We en- tasks like Named Entity Recognition (NER), com- deavour to make these as language-independent petitive with manual annotation (Nothman et al., as possible, and evaluate on English and German. 2009). The CoNLL-2002 shared task defined Our experiments show that we can accurately NER as the task of identifying and classifying classify German Wikipedia articles at an F - the names of people (PER), organisations (ORG), score of 88%, and 91% for entity classes only, places (LOC) and other entities (MISC) within achieving results very close to the state-of-the-art text (Tjong Kim Sang, 2002). There has been method for English data by Nothman et al. (2008) extensive research into recognising NEs in news- who reported 89% on all and 92% on entities only. paper text and domain-specific corpora, however Nothman et al.’s (2009) NER training corpus cre- most of this has been in English. The cost of pro- ated from these entity classifications outperforms ducing sufficient NER annotated data required for the best cross-corpus results with gold standard training makes manual annotation unfeasible, and training data by up to 12% F -score using CoNLL- the generation of this data is even more impor- 2003-style evaluation. Thus, we show that it is tant for languages other than English, where gold- possible to create free, high-coverage NE anno- standard corpora are harder to obtain. tated German-language corpora from Wikipedia. 2 Background of word forms which must be considered as po- tential NEs is much larger than for languages such The area of NER has developed considerably as English. German’s partially free word order from the Message Understanding Conferences also means that surface cues, such as PER enti- (MUC) of the 1990s where the task first emerged. ties often preceding verbs of communication, are MET, the Multilingual Entity Task associated much weaker. with MUC introduced NER in languages other A final consideration is gender. The name Mark than English (Merchant et al., 1996) which had is likely to be on any list of German person names, previously made up the majority of research in but also makes up part of Germany’s old currency, the area. The CoNLL evaluations of 2002 and the Deutsche Mark, also known as D-Mark or 2003 shifted the focus to Machine Language, and just Mark (gender: female; die), and also has the further multilingual NER research incorporated meaning ‘marrow’ (gender: neuter; das). Whilst language-independent NER. CoNLL-2002 evalu- gender can sometimes disambiguate word senses, ating on Spanish and Dutch (Tjong Kim Sang, in more complicated sentence construction, gen- 2002) and CoNLL-2003 on English and German der distinctions reflected on articles and adjec- (Tjong Kim Sang and De Meulder, 2003). tives can change or be lost when a noun is used The results of the CoNLL-2002 shared task in different cases. showed that whilst choosing an appropriate ma- chine learning technique affected performance, feature choice was also vital. All of the top 8 2.2 Cross-language Wikipedia systems at CoNLL-2003 used lexical features, POS The cross-lingual link structure of Wikipedia rep- tags, affix information, previously predicted NE resents a valuable resource which can be ex- tags and orthographic features. ploited for inter-language NLP applications. Sorg The best-performing CoNLL-2003 system and Cimiano (2008) developed a method to au- achieved an F -score of 88.8% on English and tomatically induce new inter-language links by 72.4% on German (Florian et al., 2003). It com- classifying pairs of articles of two different lan- bined Maximum Entropy Models and robust risk gauges as connected by a inter-language link. minimisation with the use of external knowledge They use a classifier utilising various text and in the form of a small gazetteer of names. This graph-based features including edit distance be- was collected by manually browsing web pages tween the title of articles and link patterns. They for about two hours and was composed of 4500 find that since the fraction of bidirectional links first and last names, 4800 locations in Germany (cases where the English article e.g. Dog is linked and 190 countries. Gazetteers are very costly to to the German article Hund which is linked to the create and maintain, and so considerable research original English article) is around 95% for Ger- has gone into their automatic generation from man and English, they can be used in a bootstrap- online sources including Wikipedia (Toral and ping manner to find new inter-language links. The Munoz,˜ 2006). consistency and accuracy of the links was also The CoNLL-2003 results for German were con- found to vary, with roughly 50% of German lan- siderably lower than for English, up to 25% guage articles being linked to their English equiv- difference in F -score (Tjong Kim Sang and alents, and only 14% from English to German. De Meulder, 2003). The top performing systems Richman and Schone (2008) proposed a sys- all achieved F -scores on English more than 15 tem in which English Wikipedia article classi- higher than on German. fications are used to produce NE-annotated cor- pora in other languages, achieving an F -score of 2.1 German NER up to 84.7% on French language data, evaluated German is a very challenging language for NER, against human-annotated corpora with the MUC because various features used in English do not evaluation metric. So far there has been very little apply. There is no distinction in the capitalisa- research into directly classifying articles in non- tion of common and proper nouns so the number English Wikipedias. 2.3 Learning NER Rank Article Pageviews 1 2008 Summer Olympics 4 437 251 Machine learning approaches to NER are flexible 2 Wiki 4 030 068 due to their statistical data-driven approach, but 3 Sarah Palin 4 004 853 training data is key to their performance (Noth- 4 Michael Phelps 3 476 803 man et al., 2009). The size and topical coverage 5 YouTube 2 685 316 of Wikipedia makes its text appropriate for train- 6 Bernie Mac 2 013 775 ing general NLP systems. 7 Olympic Games 2 003 678 The method of Nothman et al. (2008) for trans- 8 Joe Biden 1 966 877 9 Georgia (country) 1 757 967 forming Wikipedia into an NE annotated corpus 10 The Dark Knight (film) 1 427 277 relies on the fact that links between Wikipedia ar- ticles often correspond to NEs. By using structural Table 1: Most frequently viewed Wikipedia articles features to classify an article, links to it can be la- from August 2008, retrieved from http://stats.grok.se belled with an NE class. The process of deriving a corpus of NE anno- Rank Title Inlinks tated sentences from Wikipedia consists of two 1 United States 543 995 2 Australia 344 969 main sub-tasks: (1) selecting sentences to include 3 Wikipedia 272 073 in the corpus; and (2) classifying articles linked 4 Association Football 241 514 in those sentences into NE classes. By relying on 5 France 227 464 redundancy, articles that are difficult to classify with confidence may simply be discarded. Table 2: Most linked-to articles of English Wikipedia. This method of processing Wikipedia enables the creation of free, much larger NE-annotated tions. Both of these consist of randomly se- corpora than have previously been available, with lected articles, Dakka and Cucerzan’s consisting wider domain applicability and up-to-date, copy- of a random set of 800 pages, expanded by list right free text. We focus on the first phase of this co-occurrence. Nothman et al.’s data set ini- process: accurate classification of articles. tially consisted of 1100 randomly sampled ar- NLP tasks in languages other than English are ticles from among all Wikipedia articles. This disadvantaged by the lack of available data for biased the sample towards entity types that are training and testing.
Recommended publications
  • A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages
    Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2373–2380 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages Dwaipayan Roy, Sumit Bhatia, Prateek Jain GESIS - Cologne, IBM Research - Delhi, IIIT - Delhi [email protected], [email protected], [email protected] Abstract Wikipedia is the largest web-based open encyclopedia covering more than three hundred languages. However, different language editions of Wikipedia differ significantly in terms of their information coverage. We present a systematic comparison of information coverage in English Wikipedia (most exhaustive) and Wikipedias in eight other widely spoken languages (Arabic, German, Hindi, Korean, Portuguese, Russian, Spanish and Turkish). We analyze the content present in the respective Wikipedias in terms of the coverage of topics as well as the depth of coverage of topics included in these Wikipedias. Our analysis quantifies and provides useful insights about the information gap that exists between different language editions of Wikipedia and offers a roadmap for the Information Retrieval (IR) community to bridge this gap. Keywords: Wikipedia, Knowledge base, Information gap 1. Introduction other with respect to the coverage of topics as well as Wikipedia is the largest web-based encyclopedia covering the amount of information about overlapping topics.
    [Show full text]
  • Modeling Popularity and Reliability of Sources in Multilingual Wikipedia
    information Article Modeling Popularity and Reliability of Sources in Multilingual Wikipedia Włodzimierz Lewoniewski * , Krzysztof W˛ecel and Witold Abramowicz Department of Information Systems, Pozna´nUniversity of Economics and Business, 61-875 Pozna´n,Poland; [email protected] (K.W.); [email protected] (W.A.) * Correspondence: [email protected] Received: 31 March 2020; Accepted: 7 May 2020; Published: 13 May 2020 Abstract: One of the most important factors impacting quality of content in Wikipedia is presence of reliable sources. By following references, readers can verify facts or find more details about described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users, therefore information about the same topic may be inconsistent. This also applies to use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes of popularity and reliability in time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different languages versions of Wikipedia.
    [Show full text]
  • Omnipedia: Bridging the Wikipedia Language
    Omnipedia: Bridging the Wikipedia Language Gap Patti Bao*†, Brent Hecht†, Samuel Carton†, Mahmood Quaderi†, Michael Horn†§, Darren Gergle*† *Communication Studies, †Electrical Engineering & Computer Science, §Learning Sciences Northwestern University {patti,brent,sam.carton,quaderi}@u.northwestern.edu, {michael-horn,dgergle}@northwestern.edu ABSTRACT language edition contains its own cultural viewpoints on a We present Omnipedia, a system that allows Wikipedia large number of topics [7, 14, 15, 27]. On the other hand, readers to gain insight from up to 25 language editions of the language barrier serves to silo knowledge [2, 4, 33], Wikipedia simultaneously. Omnipedia highlights the slowing the transfer of less culturally imbued information similarities and differences that exist among Wikipedia between language editions and preventing Wikipedia’s 422 language editions, and makes salient information that is million monthly visitors [12] from accessing most of the unique to each language as well as that which is shared information on the site. more widely. We detail solutions to numerous front-end and algorithmic challenges inherent to providing users with In this paper, we present Omnipedia, a system that attempts a multilingual Wikipedia experience. These include to remedy this situation at a large scale. It reduces the silo visualizing content in a language-neutral way and aligning effect by providing users with structured access in their data in the face of diverse information organization native language to over 7.5 million concepts from up to 25 strategies. We present a study of Omnipedia that language editions of Wikipedia. At the same time, it characterizes how people interact with information using a highlights similarities and differences between each of the multilingual lens.
    [Show full text]
  • Texts from De.Wikipedia.Org, De.Wikisource.Org
    Texts from http://www.gutenberg.org/files/13635/13635-h/13635-h.htm, de.wikipedia.org, de.wikisource.org BRITISH AND GERMAN AUTHORS AND Contents: INTELLECTUALS CONFRONT EACH OTHER IN 1914 1. British authors defend England’s war (17 September) The material here collected consists of altercations between British and German men of letters and professors in the first 2. Appeal of the 93 German professors “to the world of months of the First World War. Remarkably they present culture” (4 October 1914) themselves as collective bodies and as an authoritative voice on — 2a. German version behalf of their nation. — 2b. English version The English-language materials given here were published in, 3. Declaration of the German university professors (16 and have been quoted from, The New York Times Current October 1914) History: A Monthly Magazine ("The European War", vol. 1: — 3a. German version "From the beginning to March 1915"; New York: The New — 3b. English version York Times Company 1915), online on Project Gutenberg (www.gutenberg.org). That same source also contains many 4. Reply to the German professors, by British scholars interventions written à titre personnel by individuals, including (21 October 1914) G.B. Shaw, H.G. Wells, Arnold Bennett, John Galsworthy, Jerome K. Jerome, Rudyard Kipling, G.K. Chesterton, H. Rider Haggard, Robert Bridges, Arthur Conan Doyle, Maurice Maeterlinck, Henri Bergson, Romain Rolland, Gerhart Hauptmann and Adolf von Harnack (whose open letter "to Americans in Germany" provoked a response signed by 11 British theologians). The German texts given here here can be found, with backgrounds, further references and more precise datings, in the German wikipedia article "Manifest der 93" and the German wikisource article “Erklärung der Hochschullehrer des Deutschen Reiches” (in a version dated 23 October 1914, with French parallel translation, along with the names of all 3000 signatories).
    [Show full text]
  • Does Wikipedia Matter? the Effect of Wikipedia on Tourist Choices Marit Hinnosaar, Toomas Hinnosaar, Michael Kummer, and Olga Slivko Discus­­ Si­­ On­­ Paper No
    Dis cus si on Paper No. 15-089 Does Wikipedia Matter? The Effect of Wikipedia on Tourist Choices Marit Hinnosaar, Toomas Hinnosaar, Michael Kummer, and Olga Slivko Dis cus si on Paper No. 15-089 Does Wikipedia Matter? The Effect of Wikipedia on Tourist Choices Marit Hinnosaar, Toomas Hinnosaar, Michael Kummer, and Olga Slivko First version: December 2015 This version: September 2017 Download this ZEW Discussion Paper from our ftp server: http://ftp.zew.de/pub/zew-docs/dp/dp15089.pdf Die Dis cus si on Pape rs die nen einer mög lichst schnel len Ver brei tung von neue ren For schungs arbei ten des ZEW. Die Bei trä ge lie gen in allei ni ger Ver ant wor tung der Auto ren und stel len nicht not wen di ger wei se die Mei nung des ZEW dar. Dis cus si on Papers are inten ded to make results of ZEW research prompt ly avai la ble to other eco no mists in order to encou ra ge dis cus si on and sug gesti ons for revi si ons. The aut hors are sole ly respon si ble for the con tents which do not neces sa ri ly repre sent the opi ni on of the ZEW. Does Wikipedia Matter? The Effect of Wikipedia on Tourist Choices ∗ Marit Hinnosaar† Toomas Hinnosaar‡ Michael Kummer§ Olga Slivko¶ First version: December 2015 This version: September 2017 September 25, 2017 Abstract We document a causal influence of online user-generated information on real- world economic outcomes. In particular, we conduct a randomized field experiment to test whether additional information on Wikipedia about cities affects tourists’ choices of overnight visits.
    [Show full text]
  • Peer Effects in Collaborative Content Generation: the Evidence from German Wikipedia Olga Slivko Discus­­ Si­­ On­­ Paper No
    Dis cus si on Paper No. 14-128 Peer Effects in Collaborative Content Generation: The Evidence from German Wikipedia Olga Slivko Dis cus si on Paper No. 14-128 Peer Effects in Collaborative Content Generation: The Evidence from German Wikipedia Olga Slivko Previous version: December 22, 2014 This version: March 3, 2015 Download this ZEW Discussion Paper from our ftp server: http://ftp.zew.de/pub/zew-docs/dp/dp14128.pdf Die Dis cus si on Pape rs die nen einer mög lichst schnel len Ver brei tung von neue ren For schungs arbei ten des ZEW. Die Bei trä ge lie gen in allei ni ger Ver ant wor tung der Auto ren und stel len nicht not wen di ger wei se die Mei nung des ZEW dar. Dis cus si on Papers are inten ded to make results of ZEW research prompt ly avai la ble to other eco no mists in order to encou ra ge dis cus si on and sug gesti ons for revi si ons. The aut hors are sole ly respon si ble for the con tents which do not neces sa ri ly repre sent the opi ni on of the ZEW. Peer E↵ects in Collaborative Content Generation: The Evidence from German Wikipedia Olga Slivko ⇤ ICT Research Department, Centre for European Economic Research (ZEW) Previous version: December 22, 2014 This version: March 3, 2015 Abstract On Wikipedia, the largest online encyclopedia, editors who contribute to the same articles and exchange comments on articles’ talk pages work in collaborative manner sometimes discussing their work.
    [Show full text]
  • Visual Narratives and Collective Memory Across Peer-Produced Accounts of Contested Sociopolitical Events
    XX Visual Narratives and Collective Memory across Peer-Produced Accounts of Contested Sociopolitical Events EMILY PORTER, University of Washington, USA P.M. KRAFFT∗, University of Oxford, UK BRIAN KEEGAN, University of Colorado, USA Studying cultural variation in recollections of sociopolitical events is crucial for achieving diverse understand- ings of such events. To date, most studies in this area have focused on analyzing variation in texts describing events. Here we analyze variation in image usage across Wikipedia language editions to understand if like text, visual narratives reflect distinct perspectives in articles about culturally-tethered events. We focus on articles about coup d’états as an example of highly contextual sociopolitical events likely to display such variation. The key challenge to examining variation in images is that there is no existing framework to use as a basis for comparison. To address this challenge, we use an iterative inductive coding process to arrive at a 46-item typology for categorizing the content of images relating to contested sociopolitical events, and a typology of network motifs that characterizes structural patterns of image use. We apply these typologies in a large-scale quantitative analysis that establishes clusters of image themes, two detailed qualitative case studies comparing Wikipedia articles on coup d’états in Soviet Russia and Egypt, and four quantitative analyses clustering image themes by language usage at the article level. These analyses document variation in imagery around particular events and variation in tendencies across cultures. We find substantial cultural variation in both content and network structure. This study presents a novel methodological framework for uncovering culturally divergent perspective of political crises through imagery on Wikipedia.
    [Show full text]
  • Complete Issue
    Culture Unbound: Journal of Current Cultural Research Thematic Section: Changing Orders of Knowledge? Encyclopedias in Transition Edited by Jutta Haider & Olof Sundin Extraction from Volume 6, 2014 Linköping University Electronic Press Culture Unbound: ISSN 2000-1525 (online) URL: http://www.cultureunbound.ep.liu.se/ © 2014 The Authors. Culture Unbound, Extraction from Volume 6, 2014 Thematic Section: Changing Orders of Knowledge? Encyclopedias in Transition Jutta Haider & Olof Sundin Introduction: Changing Orders of Knowledge? Encyclopaedias in Transition ................................ 475 Katharine Schopflin What do we Think an Encyclopaedia is? ........................................................................................... 483 Seth Rudy Knowledge and the Systematic Reader: The Past and Present of Encyclopedic Reading .............................................................................................................................................. 505 Siv Frøydis Berg & Tore Rem Knowledge for Sale: Norwegian Encyclopaedias in the Marketplace .............................................. 527 Vanessa Aliniaina Rasoamampianina Reviewing Encyclopaedia Authority .................................................................................................. 547 Ulrike Spree How readers Shape the Content of an Encyclopedia: A Case Study Comparing the German Meyers Konversationslexikon (1885-1890) with Wikipedia (2002-2013) ........................... 569 Kim Osman The Free Encyclopaedia that Anyone can Edit: The
    [Show full text]
  • Genre Analysis of Online Encyclopedias. the Case of Wikipedia
    Genre analysis online encycloped The case of Wikipedia AnnaTereszkiewicz Genre analysis of online encyclopedias The case of Wikipedia Wydawnictwo Uniwersytetu Jagiellońskiego Publikacja dofi nansowana przez Wydział Filologiczny Uniwersytetu Jagiellońskiego ze środków wydziałowej rezerwy badań własnych oraz Instytutu Filologii Angielskiej PROJEKT OKŁADKI Bartłomiej Drosdziok Zdjęcie na okładce: Łukasz Stawarski © Copyright by Anna Tereszkiewicz & Wydawnictwo Uniwersytetu Jagiellońskiego Wydanie I, Kraków 2010 All rights reserved Książka, ani żaden jej fragment nie może być przedrukowywana bez pisemnej zgody Wydawcy. W sprawie zezwoleń na przedruk należy zwracać się do Wydawnictwa Uniwersytetu Jagiellońskiego. ISBN 978-83-233-2813-1 www.wuj.pl Wydawnictwo Uniwersytetu Jagiellońskiego Redakcja: ul. Michałowskiego 9/2, 31-126 Kraków tel. 12-631-18-81, 12-631-18-82, fax 12-631-18-83 Dystrybucja: tel. 12-631-01-97, tel./fax 12-631-01-98 tel. kom. 0506-006-674, e-mail: [email protected] Konto: PEKAO SA, nr 80 1240 4722 1111 0000 4856 3325 Table of Contents Acknowledgements ........................................................................................................................ 9 Introduction .................................................................................................................................... 11 Materials and Methods .................................................................................................................. 14 1. Genology as a study ..................................................................................................................
    [Show full text]
  • Critical Point of View: a Wikipedia Reader
    w ikipedia pedai p edia p Wiki CRITICAL POINT OF VIEW A Wikipedia Reader 2 CRITICAL POINT OF VIEW A Wikipedia Reader CRITICAL POINT OF VIEW 3 Critical Point of View: A Wikipedia Reader Editors: Geert Lovink and Nathaniel Tkacz Editorial Assistance: Ivy Roberts, Morgan Currie Copy-Editing: Cielo Lutino CRITICAL Design: Katja van Stiphout Cover Image: Ayumi Higuchi POINT OF VIEW Printer: Ten Klei Groep, Amsterdam Publisher: Institute of Network Cultures, Amsterdam 2011 A Wikipedia ISBN: 978-90-78146-13-1 Reader EDITED BY Contact GEERT LOVINK AND Institute of Network Cultures NATHANIEL TKACZ phone: +3120 5951866 INC READER #7 fax: +3120 5951840 email: [email protected] web: http://www.networkcultures.org Order a copy of this book by sending an email to: [email protected] A pdf of this publication can be downloaded freely at: http://www.networkcultures.org/publications Join the Critical Point of View mailing list at: http://www.listcultures.org Supported by: The School for Communication and Design at the Amsterdam University of Applied Sciences (Hogeschool van Amsterdam DMCI), the Centre for Internet and Society (CIS) in Bangalore and the Kusuma Trust. Thanks to Johanna Niesyto (University of Siegen), Nishant Shah and Sunil Abraham (CIS Bangalore) Sabine Niederer and Margreet Riphagen (INC Amsterdam) for their valuable input and editorial support. Thanks to Foundation Democracy and Media, Mondriaan Foundation and the Public Library Amsterdam (Openbare Bibliotheek Amsterdam) for supporting the CPOV events in Bangalore, Amsterdam and Leipzig. (http://networkcultures.org/wpmu/cpov/) Special thanks to all the authors for their contributions and to Cielo Lutino, Morgan Currie and Ivy Roberts for their careful copy-editing.
    [Show full text]
  • Assessing the Accuracy and Quality of Wikipedia Entries Compared to Popular Online Encyclopaedias
    Assessing the accuracy and quality of Wikipedia entries compared to popular online encyclopaedias A preliminary comparative study across disciplines in English, Spanish and Arabic ˃ Imogen Casebourne ˃ Dr. Chris Davies ˃ Dr. Michelle Fernandes ˃ Dr. Naomi Norman Casebourne, I., Davies, C., Fernandes, M., Norman, N. (2012) Assessing the accuracy and quality of Wikipedia entries compared to popular online encyclopaedias: A comparative preliminary study across disciplines in English, Spanish and Arabic. Epic, Brighton, UK. Retrieved from: http://commons.wikimedia.org/wiki/File:EPIC_Oxford_report.pdf This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported Licence http://creativecommons.org/licences/by-sa/3.0/ Supplementary information Additional information, datasets and supplementary materials from this study can be found at: http://meta.wikimedia.org/wiki/Research:Accuracy_and_quality_of_Wikipedia_entries/Data 2 Table of Contents Table of Contents .......................................................................................................................................... 2 Executive Summary ....................................................................................................................................... 5 1. Introduction .......................................................................................................................................... 5 2. Aims and Objectives .............................................................................................................................
    [Show full text]
  • Wikipedia Matters∗
    Wikipedia Matters∗ Marit Hinnosaar† Toomas Hinnosaar‡ Michael Kummer§ Olga Slivko¶ July 14, 2019 Abstract We document a causal impact of online user-generated information on real-world economic outcomes. In particular, we conduct a randomized field experiment to test whether additional content on Wikipedia pages about cities affects tourists’ choices of overnight visits. Our treatment of adding information to Wikipedia in- creases overnight stays in treated cities compared to non-treated cities. The impact is largely driven by improvements to shorter and relatively incomplete pages on Wikipedia. Our findings highlight the value of content in digital public goods for informing individual choices. JEL: C93, H41, L17, L82, L83, L86 Keywords: field experiment, user-generated content, Wikipedia, tourism industry ∗We are grateful to Irene Bertschek, Avi Goldfarb, Shane Greenstein, Tobias Kretschmer, Michael Luca, Thomas Niebel, Marianne Saam, Greg Veramendi, Joel Waldfogel, and Michael Zhang as well as seminar audiences at the Economics of Network Industries conference (Paris), ZEW Conference on the Economics of ICT (Mannheim), Advances with Field Experiments 2017 Conference (University of Chicago), 15th Annual Media Economics Workshop (Barcelona), Conference on Digital Experimentation (MIT), and Digital Economics Conference (Toulouse) for valuable comments. Ruetger Egolf, David Neseer, and Andrii Pogorielov provided outstanding research assistance. Financial support from SEEK 2014 is gratefully acknowledged. †Collegio Carlo Alberto and CEPR, [email protected] ‡Collegio Carlo Alberto, [email protected] §Georgia Institute of Technology & University of East Anglia, [email protected] ¶ZEW, [email protected] 1 1 Introduction Asymmetric information can hinder efficient economic activity (Akerlof, 1970). In recent decades, the Internet and new media have enabled greater access to information than ever before.
    [Show full text]