Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model

Per E Kummervold ([email protected])
Javier de la Rosa ([email protected])
Freddy Wetjen ([email protected])
Svein Arne Brygfjeld ([email protected])

The National Library of Norway, Mo i Rana, Norway

Abstract

In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus, such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. We therefore show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.

1 Introduction

Modern natural language processing (NLP) models pose a challenge due to the massive size of the training data they require to perform well. For resource-rich languages such as Chinese, English, French, and Spanish, collections of texts from open sources such as Wikipedia (2021a), variations of Common Crawl data (2021), and other open-source corpora such as the BooksCorpus (Zhu et al., 2015) are generally used. When researchers at Google released their Bidirectional Encoder Representations from Transformers (BERT) model, they trained it on a huge corpus of 16GB of uncompressed text (3,300M words) (Devlin et al., 2019). Later research has shown that even this corpus size might have been too small: when Facebook released its Robustly Optimized BERT (RoBERTa), it showed a considerable gain in performance by increasing the corpus to 160GB (Liu et al., 2019).

Norwegian is spoken by just 5 million people worldwide. The reference publication Ethnologue lists the 200 most commonly spoken native languages, and it places Norwegian as number 171. The Norwegian language has two different varieties, both equally recognized as written languages: Bokmål and Nynorsk. The number of Wikipedia pages written in a certain language is often used to measure its prevalence, and in this regard, Norwegian Bokmål ranks as number 23 and Nynorsk as number 55. However, there exist more than 100 times as many English Wikipedia pages as there are Norwegian Wikipedia pages (2021b). When it comes to building large text corpora, Norwegian is considered a minor language, with scarce textual resources. So far, it has been hard to train well-performing transformer-based models for such languages.

As a governmental entity, the National Library of Norway (NLN) established a mass digitization program for its collections in 2006. The Language Bank, an organizational unit within the NLN, provides text collections and curated corpora to the scholarly community (Språkbanken, 2021). Due to copyright restrictions, the publicly available Norwegian corpus consists mainly of Wikipedia pages and online newspapers, and it is around 5GB (818M words) in size (see Table 1). However, in this study, by adding multiple sources only accessible from the NLN, we were able to increase that size to 109GB (18,438M words) of raw, deduplicated text. While such initiatives may produce textual data that can be used for the large-scale pre-training of transformer-based models, relying on text derived from optical character recognition (OCR)–based pipelines introduces new challenges related to the format, scale, and quality of the necessary data. On these grounds, this work describes the effort to build a pre-training corpus and to use it to train a BERT-based language model for Norwegian.
1.1 Previous Work

Before the advent of transformer-based models, non-contextual word and document embeddings were the most prominent technology used to approach general NLP tasks. In the Nordic region, the Language Technology Group at the University of Oslo, as part of the joint Nordic Language Processing Laboratory, collected a series of monolingual resources for many languages, with a special emphasis on Norwegian (Kutuzov et al., 2017). Based on these resources, they trained and released collections of dense vectors using word2vec and fastText (both with continuous skip-gram and continuous bag-of-words architectures) (Mikolov et al., 2013; Bojanowski et al., 2017), and even an Embeddings from Language Models (ELMo)–based model with contextual capabilities (Peters et al., 2018). Shortly thereafter, Devlin et al. (2019) introduced the foundational work on the monolingual English BERT model, which would later be extended to support 104 different languages, including Norwegian Bokmål, Norwegian Nynorsk, Swedish, and Danish. The main data source used was Wikipedia (2021a). In terms of Norwegian, this amounted to around 0.9GB of uncompressed text (140M words) for Bokmål and 0.2GB (32M words) for Nynorsk (2021b). While it is generally agreed that language models acquire better language capabilities by pre-training with multiple languages (Pires et al., 2019; Wu and Dredze, 2020), there is a strong indication that this amount of data might have been insufficient for the multilingual BERT (mBERT) model to learn high-quality representations of Norwegian at a level comparable to, for instance, monolingual English models (Pires et al., 2019).

In the area of monolingual models, the Danish company BotXO trained BERT-based models for a few of the Nordic languages using corpora of various sizes. Their repository (BotXO Ltd., 2021) lists models trained mainly on Common Crawl data for Norwegian (5GB), Danish (9.5GB), and Swedish (24.7GB). Unfortunately, we were unable to make the Norwegian models work, as they seem to be no longer updated. Similarly, the KBLab at the National Library of Sweden trained and released a BERT-based model and an A Lite BERT (ALBERT) model, both trained on approximately 20GB of raw text from a variety of sources such as books, news articles, government publications, Swedish Wikipedia, and internet forums (Malmsten et al., 2020). They claimed significantly better performance than both mBERT and the Swedish model by BotXO for the tasks they evaluated.

At the same time as the release of our model, the Language Technology Group at the University of Oslo released a monolingual BERT-based model for Norwegian named NorBERT. It was trained on around 5GB of data from Wikipedia and the Norsk aviskorpus (2019). We were unable to get sensible results when fine-tuning version 1.0 of their model. However, they released a second version (1.1) shortly thereafter, fixing some errors (Language Technology Group at the University of Oslo, 2021a). We have therefore included the evaluation results of this second version in our benchmarking. They have also evaluated both their model and ours themselves (Kutuzov et al., 2021), with consistent results.

2 Building a Colossal Norwegian Corpus

As the main Norwegian memory institution, the NLN has the obligation to preserve and give access to all published information in Norway. A large amount of the traditional collection is now available in digital format. As part of the current legal deposit, many born-digital documents are also available as digital documents in the collection. The texts in the NLN collection span hundreds of years and exhibit the varied uses of text in society. All kinds of historical written materials can be found in the collections, although we found that the most relevant resources for building an appropriate corpus for NLP were books, magazines, journals, and newspapers (see Table 1). As a consequence, the resulting corpus reflects the variation in the use of the Norwegian written language, both historically and socially.

Texts in the NLN have been subject to a large digitization operation in which digital copies were created for long-term preservation. The NLN employs METS/ALTO as the preferred format for storing digital copies. As the digitized part of the collection conforms to standard preservation library practices, the format in which the texts are stored is not suitable for direct text processing; thus, the texts needed to be pre-processed and manipulated for use as plain text. One major challenge was the OCR quality, which varied both over time and across the types of material digitized. This limited the number of usable resources and introduced some artifacts that affected the correctness of the textual data. The basic inclusion criterion for our corpus was that as long as it was possible for a human to infer the meaning from the text, it should be included.

Figure 1: The general corpus-building process. (Library sources are unpacked from METS/ALTO files and converted to text and metadata files; these, together with non-OCR and external sources, are cleaned, deduplicated, and sampled into the training corpus.)
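The excerpt does not include the NLN's actual conversion code, but the "unpack METS/ALTO files, create text and meta files" step in Figure 1 can be illustrated with a minimal sketch. In ALTO XML, recognized words are stored as String elements (the word itself in a CONTENT attribute) grouped into TextLine elements; the sketch below simply joins them back into plain text, and is an assumption about the general shape of such a step rather than the authors' pipeline.

```python
import xml.etree.ElementTree as ET

def alto_to_text(path: str) -> str:
    """Rebuild plain text from a single ALTO XML file, one output line per
    TextLine element. A minimal sketch: a production pipeline would also
    handle hyphenation (SUBS_CONTENT), OCR word confidence (WC), and the
    accompanying METS metadata. The '{*}' namespace wildcard needs
    Python 3.8+.
    """
    lines = []
    for text_line in ET.parse(path).iter("{*}TextLine"):
        words = [s.get("CONTENT", "") for s in text_line.iter("{*}String")]
        lines.append(" ".join(w for w in words if w))
    return "\n".join(lines)

# Hypothetical file name, for illustration only:
# print(alto_to_text("digibok_page_0001.xml"))
```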
We estimated the language distribution of the corpus by examining the language tags in the collection and counting the frequency of words of certain types (e.g., personal pronouns). Our estimate is that 83% of the text is in Norwegian Bokmål and 12% is in Nynorsk. Close to 4% of the texts are written in English, and the remaining 1% is a mixture of Sami, Danish, Swedish, and a few traces of other languages.

The aforementioned process was carefully orchestrated, with data moving from preservation storage, through error correction and quality assessment, and ending up as text in the corpus. As shown in Figure 1, after filtering, OCR-scanned documents were added to the other digital sources. After this step, the data went through the cleaning
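As an illustration of the marker-word counting behind the language estimate above: the paper names personal pronouns as one signal but does not list the exact words used, so the marker sets in this sketch are common Bokmål/Nynorsk function-word contrasts chosen for illustration only.

```python
import re
from collections import Counter

# Illustrative marker words only; the excerpt does not publish the
# actual lists used for the 83%/12% estimate.
BOKMAL_MARKERS = {"jeg", "ikke", "hun", "noen", "mye", "hvordan"}
NYNORSK_MARKERS = {"eg", "ikkje", "ho", "nokon", "mykje", "korleis"}

def marker_counts(text: str) -> Counter:
    """Count Bokmål vs. Nynorsk marker words in a document."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(
        nb=sum(t in BOKMAL_MARKERS for t in tokens),
        nn=sum(t in NYNORSK_MARKERS for t in tokens),
    )

def guess_variety(text: str) -> str:
    """Label a document by its dominant marker counts ('unknown' on ties)."""
    c = marker_counts(text)
    if c["nb"] == c["nn"]:
        return "unknown"
    return "nb" if c["nb"] > c["nn"] else "nn"
```

Aggregating such per-document labels, weighted by document length and combined with collection metadata, would yield a corpus-level distribution of the kind reported above.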