Language-Agnostic Topic Classification for Wikipedia

Isaac Johnson, Martin Gerlach, Diego Sáez-Trumper
Wikimedia Foundation, United States

ABSTRACT
A major challenge for many analyses of Wikipedia dynamics (e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion) is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia’s category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage: typically, only a small subset of articles can be classified, or the method cannot be applied across (the more than 300) languages on Wikipedia. In this paper, we propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics that can be easily applied to (almost) any language and article on Wikipedia. We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.

CCS CONCEPTS
• Human-centered computing → Empirical studies in collaborative and social computing.

KEYWORDS
Wikipedia, language-agnostic, topic classification

ACM Reference Format:
Isaac Johnson, Martin Gerlach, and Diego Sáez-Trumper. 2021. Language-agnostic Topic Classification for Wikipedia. In Companion Proceedings of the Web Conference 2021 (WWW ’21 Companion), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3442442.3452347

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’21 Companion, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8313-4/21/04.
https://doi.org/10.1145/3442442.3452347

1 INTRODUCTION
As of January 2021, Wikipedia has over 300 language editions with 55.7 million articles¹ about 20.4 million distinct entities² and an additional 250 thousand articles created every month.³ These articles cover a very wide range of content and it can be difficult to track and understand these dynamics (e.g., what types of content are attracting the interest of editors or readers?), especially across the different language editions.

Wikipedia itself has a number of editor-curated annotation systems that bring some order to all of this content. Most directly perhaps is the category network, but content can also be categorized based on properties stored in Wikidata, tagging by WikiProjects (groups of editors who focus on a specific topic), or inclusion of templates such as infoboxes. While these annotation systems are quite powerful and extensive, they ultimately are human-generated and semi-structured and thus have many edge cases and under-coverage in languages or communities that do not have editors who can maintain these annotations (see [10]). Researchers have developed many approaches to improve these annotation systems by using Wikipedia’s category network [17, 22], WikiProjects [2, 30], depending on DBPedia’s manually-curated taxonomy that then assigns topics based on infobox templates [17], or throwing out the editor-based annotation systems completely and learning a set number of topics through unsupervised techniques such as topic modeling [18, 20, 23, 27].

However, these approaches generally suffer from two limitations related to coverage in terms of language and article. First, not all language communities have the editor base or need to maintain these annotations to the same degree. For example, Arabic Wikipedia has the most categories per article at 25.5, and English Wikipedia with its approximately 40,000 monthly active editors,⁴ has 1.5M categories that are collectively applied 66M times across its 6.2 million articles.⁵ Wu Chinese Wikipedia, however, with approximately 20 active editors and 41,231 articles only has 6,990 categories that are applied 26,396 times (leaving many articles with no categories at all). Similar variation is seen with other systems as well: linkage to Wikidata is higher but still as low as 87.5% in Cebuano Wikipedia with its 5.5M articles⁶, only 92 Wikipedia languages have a page describing WikiProjects,⁷ and many articles lack infoboxes.⁸ Second, approaches that seek to expand article coverage by also predicting topics for non-annotated articles depend on hand-labeling of topics (which requires language expertise) [20, 23, 27] or language modeling that does not easily scale to all languages on Wikipedia [2].

In this paper, we make the following contributions:
• We present an approach to automatically labeling (almost) all Wikipedia articles across every language of Wikipedia with a consistent set of topics. Specifically, we build on work by Asthana & Halfaker [2] that uses 64 topics derived from WikiProject tags and extend their language-dependent approach to all Wikipedia languages. The main innovation of our approach is to represent articles in a language-agnostic way using article links that have been mapped to Wikidata items (similar to Piccardi and West [23]).
• We demonstrate through quantitative and qualitative evaluations that our language-agnostic approach performs equally well or better than alternative approaches.
• We release the code and trained model, a dataset of every Wikipedia article and its predicted topics, and APIs for interacting with the models.

¹ https://wikistats.wmcloud.org/display.php?t=wp
² Personal calculation based on 4 January 2021 Wikidata JSON dump: https://dumps.wikimedia.org/wikidatawiki/entities/20210104/
³ https://stats.wikimedia.org/#/all-wikipedia-projects/contributing/new-pages/normal|bar|2-year|~total|monthly
⁴ https://stats.wikimedia.org/#/en.wikipedia.org/contributing/active-editors/normal|line|2-year|(page_type)~content*non-content|monthly
⁵ Personal calculations using the December 2020 dumps and categorylinks table (https://www.mediawiki.org/wiki/Manual:Categorylinks_table) filtered by page table to articles in namespace 0 (https://www.mediawiki.org/wiki/Manual:Page_table)
⁶ https://wikidata-analytics.wmcloud.org/app/WD_percentUsageDashboard
⁷ https://www.wikidata.org/wiki/Q4234303
⁸ Only about one-third of English Wikipedia articles have infoboxes per DBPedia’s statistics: https://wiki.dbpedia.org/services-resources/ontology

2 RELATED WORK
Three general approaches have been taken to classifying Wikipedia articles into a consistent and coherent set of topics: 1) directly apply existing editor-generated annotations on Wikipedia, 2) linking Wikipedia articles to an external taxonomy, and, 3) learning unsupervised topics and manually labeling them. This work pulls most directly from Section 2.1 with additional modeling similar to Section 2.3.

2.1 Annotations on Wikipedia
The most common and simplest strategy for classifying Wikipedia articles by topic is using existing annotations that editors have added to articles. For instance, Wikipedia has a category network that roughly forms a tree with approximately 40 root topics.⁹ Not all language communities of Wikipedia, however, use the same set of high-level categories or label articles with categories to the same extent [17]. The category network also is messy, requiring careful rules and pruning [...]

[...] other ontologies, coverage is limited by how many infobox templates are present and mapped to DBPedia’s ontology.

Wikidata offers another ontology that is much more closely-linked to Wikipedia. Wikidata items often either have an instance-of property (P31) or subclass-of property (P279), the network of which can be used to categorize Wikidata items (and their corresponding Wikipedia articles) into a set of high-level topics [24]. Wikidata’s ontology contains loops, dead-ends, and other inconsistencies that limit its usage, however, in applying coherent topics to articles [4, 24]. It also is a step removed from Wikipedia articles, which removes the direct connection and feedback loop between the topics applied to an article and what content is included in the article.

2.3 Unsupervised Approaches
Some researchers have also avoided these existing ontologies in favor of unsupervised learning of topics and post-hoc labeling. These generally are learned via topic models, most notably Latent Dirichlet Allocation (LDA), with article text as input [18, 20, 27]. These unsupervised approaches have the benefit of generating continuous topic vectors that can be valuable for modeling and having high coverage because they do not rely on annotations. However, there are limitations of these approaches for topic labeling of articles systematically. First, the identified latent topics cannot always easily be interpreted in terms of its content. Second, text-based approaches usually require custom adaptations when being applied across all languages due to issues arising from parsing and pre-processing different scripts.

The approach by Piccardi and West [23], which learns a topic model over articles as represented by their links mapped to the language-agnostic Wikidata vocabulary, is very similar to the ap-
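The idea of representing an article by its links mapped to Wikidata items can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SITELINKS` table, the helper names `links_to_qids` and `jaccard`, the page titles, and the Q-numbers are all hypothetical placeholders chosen for illustration. The sketch only shows why such a representation is language-agnostic: links written in different languages collapse to the same item identifiers.

```python
from collections import Counter

# Hypothetical toy lookup: (wiki, page title) -> Wikidata item ID.
# The Q-numbers are placeholders, not real Wikidata identifiers.
SITELINKS = {
    ("enwiki", "Mammal"): "Q1001",
    ("enwiki", "Pet"): "Q1002",
    ("dewiki", "Säugetiere"): "Q1001",
    ("dewiki", "Haustier"): "Q1002",
    ("dewiki", "Europa"): "Q1003",
}


def links_to_qids(wiki, link_titles, sitelinks=SITELINKS):
    """Return a bag (Counter) of item IDs for one article's outgoing links.

    Links whose target has no known item ID are dropped.
    """
    qids = (sitelinks.get((wiki, title)) for title in link_titles)
    return Counter(q for q in qids if q is not None)


def jaccard(a, b):
    """Jaccard overlap between the sets of item IDs in two bags."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Two versions of the "same" article in different languages.
en = links_to_qids("enwiki", ["Mammal", "Pet", "Unlinked page"])
de = links_to_qids("dewiki", ["Säugetiere", "Haustier", "Europa"])

# Both versions share Q1001 and Q1002; only the German one has Q1003.
print(sorted(en), sorted(de), round(jaccard(en, de), 3))
```

Because the features are item IDs and not words, a classifier trained on such bags could in principle be applied to any language edition without language-specific tokenization, which is the property the text emphasizes.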