WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Holger Schwenk (Facebook AI), Vishrav Chaudhary (Facebook AI), Shuo Sun (Johns Hopkins University), Hongyu Gong (University of Illinois Urbana-Champaign), Francisco Guzmán (Facebook AI)

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 1351–1361, April 19–23, 2021. ©2021 Association for Computational Linguistics

Abstract

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus is freely available.[1] To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting for training MT systems between distant languages without the need to pivot through English.

1 Introduction

Most of the current approaches in Natural Language Processing are data-driven. The size of the resources used for training is often the primary concern, but the quality and a large variety of topics may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, i.e. sentences which are mutual translations, are more limited, in particular when the two languages do not involve English. An important source of parallel texts are international organizations like the European Parliament (Koehn, 2005) or the United Nations (Ziemski et al., 2016). Several projects rely on volunteers to provide translations for public texts, e.g. news commentary (Tiedemann, 2012), OpenSubtitles (Lison and Tiedemann, 2016) or the TED corpus (Qi et al., 2018).

Wikipedia is probably the largest free multilingual resource on the Internet. The content of Wikipedia is very diverse and covers many topics. Articles exist in more than 300 languages. Some content on Wikipedia was human translated from an existing article into another language, not necessarily from or into English. The translated articles may later have been edited independently and are then no longer parallel. Wikipedia strongly discourages the use of unedited machine translation,[2] but the existence of such articles cannot be totally excluded. Many articles have been written independently, but may nevertheless contain sentences which are mutual translations. This makes Wikipedia a very appropriate resource to mine for parallel texts for a large number of language pairs. To the best of our knowledge, this is the first work to process the entire Wikipedia and systematically mine for parallel sentences in all language pairs.

In this work, we build on a recent approach to mine parallel texts based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018; Artetxe and Schwenk, 2018a), and a freely available encoder for 93 languages. We address the computational challenge of mining in almost six hundred million sentences by using fast indexing and similarity search algorithms.

The paper is organized as follows. In the next section, we first discuss related work. We then summarize the underlying mining approach. Section 4 describes in detail how we applied this approach to extract parallel sentences from Wikipedia in 1620 language pairs. In Section 5, we assess the quality of the extracted bitexts by training NMT systems for a subset of language pairs and evaluating them on the TED corpus (Qi et al., 2018) for 45 languages. The paper concludes with a discussion of future research directions.

[1] https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
[2] https://en.wikipedia.org/wiki/Wikipedia:Translation
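The paper itself contains no code; as a rough sketch of the kind of fast indexing and similarity search the introduction refers to, the snippet below builds a nearest-neighbor index over sentence embeddings with the FAISS library. The use of FAISS, the exact index type, and the random stand-in data are assumptions for illustration, not the authors' actual pipeline.

```python
import faiss
import numpy as np

# Stand-in data: in WikiMatrix-style mining these would be multilingual
# sentence embeddings (e.g. 1024-dim LASER vectors) of Wikipedia sentences
# in two languages; random vectors are used here so the sketch runs on its own.
rng = np.random.default_rng(0)
dim = 1024
src_emb = rng.standard_normal((1000, dim), dtype=np.float32)
tgt_emb = rng.standard_normal((2000, dim), dtype=np.float32)

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(src_emb)
faiss.normalize_L2(tgt_emb)

# Exact inner-product index; for hundreds of millions of sentences an
# approximate, compressed index (e.g. IVF/PQ) would be trained instead.
index = faiss.IndexFlatIP(dim)
index.add(tgt_emb)

# Retrieve the k nearest target-language sentences for every source sentence.
k = 4
sims, neighbors = index.search(src_emb, k)
print(sims.shape, neighbors.shape)  # (1000, 4) (1000, 4)
```

In the mining setting, such nearest-neighbor lists are not taken as-is but are scored and thresholded, as discussed in Section 3.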
2 Related work

There is a large body of research on mining parallel sentences in monolingual text collections, usually called "comparable corpora". Initial approaches to bitext mining relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment (Buck and Koehn, 2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using noisy multilingual web crawls such as ParaCrawl[3] for filtering good-quality sentence pairs has been explored in the shared tasks for high-resource (Koehn et al., 2018) and low-resource (Koehn et al., 2019) languages.
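As a toy illustration of the Jaccard-based scoring used by the STACC method mentioned above, a candidate sentence pair can be scored by the overlap between the expected target-side translations of the source words and the words actually observed. The tiny lexicon and the one-directional simplification below are assumptions for illustration only; STACC induces its lexicons from IBM alignments, applies set expansion, and scores in both directions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical seed lexicon; in STACC it is induced from IBM word alignments.
lexicon = {"haus": {"house"}, "ist": {"is"}, "gross": {"big", "large"}}

def score_pair(src_tokens, tgt_tokens):
    # Expand the source sentence into the set of plausible target words,
    # then measure its overlap with the words observed on the target side.
    expected = set().union(*(lexicon.get(tok, set()) for tok in src_tokens))
    return jaccard(expected, set(tgt_tokens))

print(score_pair(["das", "haus", "ist", "gross"],
                 ["the", "house", "is", "big"]))   # 0.6
```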
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in (Schwenk, 2018; Artetxe and Schwenk, 2018a,b). This approach has also proven to perform best in a low-resource scenario (Chaudhary et al., 2019; Koehn et al., 2019). Closest to this approach is the research described in España-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages, in particular when considering all possible language pairs in Wikipedia. Finally, related ideas have also been proposed in Bouamor and Sajjad (2018) or Grégoire and Langlais (2017). However, in those works, mining is not solely based on multilingual sentence embeddings, but is part of a larger system. To the best of our knowledge, this work is the first one that applies the same mining approach to all combinations of many different languages, written in more than twenty different scripts. In follow-up work, the same underlying mining approach was applied to a huge collection of Common Crawl texts (Schwenk et al., 2019). Hierarchical mining in Common Crawl texts was performed by El-Kishky et al. (2020).[4]

Wikipedia is arguably the largest comparable corpus. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006): an MT system was used to translate Dutch sentences into English and to compare them with the English texts, yielding several hundred Dutch/English parallel sentences. Later, a similar technique was applied to Persian/English (Mohammadi and GhasemAghaee, 2010). Structural information in Wikipedia, such as the topic categories of documents, was used in the alignment of multilingual corpora (Otero and López, 2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages (Smith et al., 2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by the translation equivalents of three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features such as Wikipedia entities to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine bitexts from Wikipedia textual content, but only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages. Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposes an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles has also been extracted for twenty-three languages.[5] We are not aware of other attempts to systematically mine for parallel sentences in the textual content of Wikipedia for a large number of languages.

[3] http://www.paracrawl.eu/
[4] http://www.statmt.org/cc-aligned/
[5] https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

3 Distance-based mining approach

The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding. The distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results (Schwenk, 2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. (Guo et al., 2018). This is particularly true when mining bitexts for many different language pairs.

3.2 Multilingual sentence embeddings

Distance-based bitext mining requires a joint sentence embedding for all the considered languages. One may be tempted to train a bilingual embedding for each language pair, e.g. (España-Bonet et al., 2017; Hassan et al., 2018; Guo et al., 2018; Yang et al., 2019), but this is difficult to scale to the thousands of language pairs present in Wikipedia. Instead, we chose to use one single massively multilingual sentence embedding for all languages, namely the one proposed by the LASER toolkit (Artetxe and Schwenk, 2018b).
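The excerpt ends before the scoring function itself is spelled out, but the margin-based mining cited above (Artetxe and Schwenk, 2018a) scores a candidate pair by its cosine similarity normalized by the average similarity of each sentence to its k nearest neighbors in the other language. A brute-force sketch of that ratio margin is shown below; the value of k, the threshold, and the random stand-in embeddings are illustrative assumptions rather than the settings used for WikiMatrix.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores between all source and target sentence embeddings.

    Both inputs are L2-normalized matrices of shape (n, dim) and (m, dim).
    Returns an (n, m) matrix; higher scores indicate likelier translations.
    Brute force for clarity -- large-scale mining would use a k-NN index.
    """
    sim = src_emb @ tgt_emb.T                             # cosine similarities
    # Average similarity of each sentence to its k nearest neighbors
    # in the other language, in both directions.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # shape (n,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # shape (m,)
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2.0)

# Toy usage with random "embeddings"; in WikiMatrix the inputs would be
# LASER sentence embeddings of Wikipedia sentences in two languages.
rng = np.random.default_rng(0)
src = rng.standard_normal((5, 16))
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.standard_normal((6, 16))
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

scores = margin_scores(src, tgt)
best = scores.argmax(axis=1)
keep = scores[np.arange(len(src)), best] > 1.04   # illustrative margin threshold
print(list(zip(range(len(src)), best, keep)))
```

Dividing by the local neighborhood similarity is what makes the score comparable across sentences and language pairs, addressing the inconsistency of an absolute cosine threshold noted above.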