Content Categorization for Contextual Advertising Using Wikipedia

Total Page:16

File Type:pdf, Size:1020Kb

Content Categorization for Contextual Advertising Using Wikipedia Content Categorization for Contextual Advertising Using Wikipedia Ingrid Grønlie Guren August 2, 2015 Content Categorization for Contextual Advertising Using Wikipedia Ingrid Grønlie Guren August 2, 2015 ii Abstract Automatic categorization of content is an important functionality in online ad- vertising and automated content recommendations, both for ensuring contextual relevancy of placements and for building up behavioral profiles for users that consume the content. Within the advertising domain, the taxonomy tree that content is classified into is defined with some commercial application in mind to somehow reflect the advertising platform’s ad inventory. The nature of the ad inventory and the language of the content might vary across brokers (i.e., the operator of the advertising platform), so it was of interest to develop a system that can easily bootstrap the development of a well-working classifier. We developed a dictionary-based classifier based on titles from Wikipedia articles where the titles represent entries in the dictionary. The idea of the dictionary-based classifier is so simple that it can be understood by users of the program, also those who lack technical experience. Further, it has the ad- vantage that its users easily can expand the dictionary with desirable words for specific advertisement purposes. The process of creating the classifier includes a processing of all Wikipedia article titles to a form more likely to occur in docu- ments, before each entry is graded to their most describing Wikipedia category path. The Wikipedia category paths are further mapped to categories based on the taxonomy of Interactive Advertising Bureau (IAB), which are categories relevant for advertising. The results of this process is a dictionary with entries connected to categories from the taxonomy, and forms the base of our classi- fier. Finally, we explored the possibilities of using Wikipedia’s internal links to translate the English classifier’s dictionary to a Norwegian dictionary. The evaluation of the classifier was performed on rappler.com for the En- glish classifier and adressa.no for the Norwegian classifier. The results of the classifiers were compared with a class tag within the url structure of published articles, and we could see that the classifiers were able to correctly categorize most articles. However, there is room for further improvement of the classifier in order to achieve higher evaluation scores. This is partly because our dictionary- based classifier is a one-to-many classifier, while we compare the results to a one-to-one classification. Overall, we found that we are able to create a varied and thorough dictionary by just exploring the titles of Wikipedia articles, and that the classifier gives a good indication of the content of articles. iii iv Acknowledgements This study has been a collaboration project between the Department of Infor- matics at the University of Oslo and Cxense (http://www.cxense.com). It was started in the Spring 2014 and finished in August 2015. First, I would like to express my gratitude to my supervisor Professor Alek- sander Øhrn for all his incredibly important feedback and for all the advice he has given me through the process. His ideas for this project have been in- credible valuable, and his comments have helped me every time I needed help with a problem. I am very grateful for his thorough scrutiny of the thesis. I would also like to thank his colleague Gisle Ytrestøl for all his help, including all the quick responses on emails, long discussions about implementations and his never-ending support and optimism about the project. A special thanks goes to my study friends, especially Sindre, Frida and Karine, for constantly reminding me how little time I had left, but still support- ing me with discussions and good company at the University. I would also like to thank my friend Elisabeth for motivating me every day and for never asking me to stop talking about my thesis. Your support has been beyond words. Finally I wish to thank my family for all help, comments, discussions, feed- back on what I’ve written, and most important; for supporting me every day. I could never have done this without any of you. Ingrid Grønlie Guren August 2, 2015 i ii Contents 1 Introduction1 1.1 Motivation..............................1 1.2 The Project..............................1 1.3 An Overview of Challenges.....................3 1.4 Thesis Outline............................5 2 Background Materials7 2.1 Automatic Content Analysis.....................7 2.1.1 What is Content Analysis?.................7 2.1.2 Content Analysis in Advertising..............8 2.2 Categorization............................9 2.3 Wikipedia............................... 11 2.3.1 Structure of Wikipedia.................... 11 2.3.2 Accessing Information from Wikipedia........... 12 2.4 Interactive Advertising Bureau (IAB)............... 13 2.5 Cxense................................. 16 3 Related Works 19 3.1 Similar Projects............................ 19 3.2 Wikipedia’s Category Structure................... 20 3.3 Wikipedia as Encyclopedic Knowledge............... 21 3.4 Classifiers Based on Wikipedia................... 22 3.4.1 Evaluation of the Classifiers................. 23 3.5 Disambiguation............................ 24 4 Methods 27 4.1 Finding the meaning of Wikipedia Articles............. 27 4.1.1 Representing the underlying structure........... 27 4.2 Grading Categories.......................... 29 4.2.1 Grading based on Inlinks and Outlinks........... 29 4.2.2 Normalized Grading based on Inlink and Outlink Numbers 30 4.2.3 Deciding Relevant Paths................... 30 4.3 Evaluation............................... 31 4.3.1 Evaluation of the Classifier................. 31 4.3.2 Optimizing the Classifier.................. 32 iii iv CONTENTS 5 Implementation 35 5.1 Finding Full Path of Articles.................... 35 5.1.1 Creating the Underlying Category Structure........ 35 5.1.2 Representing the Underlying Structure........... 39 5.1.3 Following Links Between Categories............ 40 5.1.4 Redirects........................... 41 5.2 Id Mapping.............................. 44 5.3 Grading of Categories........................ 45 5.3.1 Grading based on Inlink and Outlink Numbers...... 45 5.3.2 Normalized Grading Based on Inlinks and Outlinks.... 49 5.4 Mapping to Desirable Output Categories.............. 50 5.4.1 Mapping based on Wikipedia Categories.......... 50 5.4.2 Mapping based on Wikipedia Path Excerpts........ 51 5.4.3 Automatic Mapping..................... 51 5.4.4 Processing Titles....................... 52 5.5 Dictionary-based Classifier for Other Languages.......... 54 5.5.1 Creating a Norwegian Dictionary-based Classifier..... 54 5.5.2 Deploying the Results.................... 57 6 Results and Discussion 59 6.1 Evaluation of Category Mapping.................. 59 6.1.1 Mapping from Path Excerpts to Output Categories.... 60 6.2 Versions of the Dictionary-Based Classifier............. 68 6.2.1 IAB Dictionary-1 (iab-1).................. 68 6.2.2 IAB Dictionary-2 (iab-2).................. 71 6.2.3 IAB Dictionary-3 (iab-3).................. 71 6.2.4 IAB Dictionary-4 (iab-4).................. 71 6.2.5 IAB Dictionary-5 (iab-5).................. 71 6.2.6 IAB Dictionary-6 (iab-6).................. 72 6.2.7 Variation of Categories and Number of Entries for the Different Dictionary Versions................ 72 6.3 Results from the Classifier...................... 74 6.3.1 Retrieving Results from Cxense............... 75 6.3.2 Weight for Classification................... 76 6.3.3 Results for Sports...................... 79 6.3.4 Results for Arts & Entertainment.............. 80 6.3.5 Results for Technology & Computing............ 81 6.3.6 Global Evaluation of the Classifier............. 82 6.3.7 Comparison with Another Classifier............ 83 6.4 Evaluation of the Norwegian Classifier............... 84 6.5 Discussion of the Results....................... 85 7 Conclusion and Further Works 89 7.1 Conclusion.............................. 89 7.2 Further Works............................ 90 References 93 List of Figures 1.1 Illustration of the mapping between keywords and categories...2 1.2 Simplified illustration of the categorization process.........3 1.3 Example of disambiguation in Wikipedia..............5 2.1 Retargeting within Interest-based Advertising...........9 2.2 Categorization of keywords to Wikipedia categories........ 10 2.3 Categorzation process of the keywords............... 10 2.4 Subcategories of the category Astrid Lindgren........... 12 2.5 Categories for an Wikipedia article................. 12 2.6 Categories of the IAB Taxonomy.................. 15 2.7 The categorization process of the keywords............ 16 2.8 Illustration of entry matching process............... 16 4.1 Simplified illustration of the underlying structure of Wikipedia.. 27 4.2 The structure where each category knows its subcategories.... 28 4.3 The structure where each category knows the title of its articles. 28 4.4 Example of inlink number and outlink number for a category.. 29 4.5 Illustration of a perfect classifier................... 33 5.1 INSERT statement entry in enwiki-latest-categorylinks.sql.gz 36 5.2 Excerpt from enwiki-latest-categorylinks.sql.gz ...... 37 5.3 Insert statement for hidden category................ 38 5.4 Example path with hidden category................ 38 5.5 Example path without hidden category............... 38 5.6 INSERT statement with newline................... 39 5.7 Example of sortkey in Wikipedia.................
Recommended publications
  • Complete Issue
    Culture Unbound: Journal of Current Cultural Research Thematic Section: Changing Orders of Knowledge? Encyclopedias in Transition Edited by Jutta Haider & Olof Sundin Extraction from Volume 6, 2014 Linköping University Electronic Press Culture Unbound: ISSN 2000-1525 (online) URL: http://www.cultureunbound.ep.liu.se/ © 2014 The Authors. Culture Unbound, Extraction from Volume 6, 2014 Thematic Section: Changing Orders of Knowledge? Encyclopedias in Transition Jutta Haider & Olof Sundin Introduction: Changing Orders of Knowledge? Encyclopaedias in Transition ................................ 475 Katharine Schopflin What do we Think an Encyclopaedia is? ........................................................................................... 483 Seth Rudy Knowledge and the Systematic Reader: The Past and Present of Encyclopedic Reading .............................................................................................................................................. 505 Siv Frøydis Berg & Tore Rem Knowledge for Sale: Norwegian Encyclopaedias in the Marketplace .............................................. 527 Vanessa Aliniaina Rasoamampianina Reviewing Encyclopaedia Authority .................................................................................................. 547 Ulrike Spree How readers Shape the Content of an Encyclopedia: A Case Study Comparing the German Meyers Konversationslexikon (1885-1890) with Wikipedia (2002-2013) ........................... 569 Kim Osman The Free Encyclopaedia that Anyone can Edit: The
    [Show full text]
  • Strengthening and Unifying the Visual Identity of Wikimedia Projects: a Step Towards Maturity
    Strengthening and unifying the visual identity of Wikimedia projects: a step towards maturity Guillaume Paumier∗ Elisabeth Bauer [[m:User:guillom]] [[m:User:Elian]] Abstract In January 2007, the Wikimedian community celebrated the sixth birthday of Wikipedia. Six years of constant evolution have now led to Wikipedia being one of the most visited websites in the world. Other projects developing free content and supported by the Wikimedia Foundation have been expanding rapidly too. The Foundation and its projects are now facing some communication issues due to the difference of scale between the human and financial resources of the Foundation and the success of its projects. In this paper, we identify critical issues in terms of visual identity and marketing. We evaluate the situation and propose several changes, including a redesign of the default website interface. Introduction The first Wikipedia project was created in January 2001. In these days, the technical infrastructure was provided by Bomis, a dot-com company. In June 2003, Jimmy Wales, founder of Wikipedia and owner of Bomis, created the Wikimedia Foundation [1] to provide a long-term administrative and technical structure dedicated to free content. Since these days, both the projects and the Foundation have been evolving. New projects have been created. All have grown at different rates. Some have got more fame than the others. New financial, technical and communication challenges have risen. In this paper, we will first identify some of these challenges and issues in terms of global visual identity. We will then analyse logos, website layouts, projects names, trademarks so as to provide some hindsight.
    [Show full text]
  • Wikipedia Data Analysis
    Wikipedia data analysis. Introduction and practical examples Felipe Ortega June 29, 2012 Abstract This document offers a gentle and accessible introduction to conduct quantitative analysis with Wikipedia data. The large size of many Wikipedia communities, the fact that (for many languages) the already account for more than a decade of activity and the precise level of detail of records accounting for this activity represent an unparalleled opportunity for researchers to conduct interesting studies for a wide variety of scientific disciplines. The focus of this introductory guide is on data retrieval and data preparation. Many re- search works have explained in detail numerous examples of Wikipedia data analysis, includ- ing numerical and graphical results. However, in many cases very little attention is paid to explain the data sources that were used in these studies, how these data were prepared before the analysis and additional information that is also available to extend the analysis in future studies. This is rather unfortunate since, in general, data retrieval and data preparation usu- ally consumes no less than 75% of the whole time devoted to quantitative analysis, specially in very large datasets. As a result, the main goal of this document is to fill in this gap, providing detailed de- scriptions of available data sources in Wikipedia, and practical methods and tools to prepare these data for the analysis. This is the focus of the first part, dealing with data retrieval and preparation. It also presents an overview of useful existing tools and frameworks to facilitate this process. The second part includes a description of a general methodology to undertake quantitative analysis with Wikipedia data, and open source tools that can serve as building blocks for researchers to implement their own analysis process.
    [Show full text]
  • 1 Wikipedia As an Arena and Source for the Public. a Scandinavian
    Wikipedia as an arena and source for the public. A Scandinavian Comparison of “Islam” Hallvard Moe Department of Information Science and Media Studies University of Bergen [email protected] Abstract This article compares Wikipedia as an arena and source for the public through analysis of articles on “Islam” across the three Scandinavian languages. Findings show that the Swedish article is continuously revised and adjusted by a fairly high number of contributors, with comparatively low concentration to a small group of top users. The Norwegian article is static, more basic, but still serves as a matter-of-factly presentation of Islam as religion to a stable amount of views. In contrast, the Danish article is at once more dynamic through more changes up until recently, it portrays Islam differently with a distinct focus on identity issues, and it is read less often. The analysis illustrates how studying Wikipedia can bring light to the receiving end of what goes on in the public sphere. The analysis also illustrates how our understanding of the online realm profits from “groundedness”, and how comparison of similar sites in different languages can yield insights into cultural as well as political differences, and their implications. Keywords Wikipedia, public sphere, freedom of information, comparative, digital methods Introduction The online encyclopedia Wikipedia is heralded as a non-commercial, user generated source of information. It is also a space for debate over controversial issues. Wikipedia, therefore, stands out from other online media more commonly analyzed in studies of public debate: on the one hand, mainstream media such as online newspapers are typically deemed interesting since they (are thought to) reach a wide audience with curated or edited content.
    [Show full text]
  • Operationalizing a National Digital Library: the Case for a Norwegian Transformer Model Per E Kummervold Javier De La Rosa [email protected] [email protected]
    Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model Per E Kummervold Javier de la Rosa [email protected] [email protected] Freddy Wetjen Svein Arne Brygfjeld [email protected] [email protected] The National Library of Norway Mo i Rana, Norway Abstract (Devlin et al., 2019). Later research has shown that the corpus size might have even been too In this work, we show the process of build- small, and when Facebook released its Robustly ing a large-scale training set from digi- Optimized BERT (RoBERTa), it showed a consid- tal and digitized collections at a national erable gain in performance by increasing the cor- library. The resulting Bidirectional En- pus to 160GB (Liu et al., 2019). coder Representations from Transformers (BERT)-based language model for Nor- Norwegian is spoken by just 5 million peo- wegian outperforms multilingual BERT ple worldwide. The reference publication Ethno- (mBERT) models in several token and se- logue lists the 200 most commonly spoken na- quence classification tasks for both Nor- tive languages, and it places Norwegian as num- wegian Bokmal˚ and Norwegian Nynorsk. ber 171. The Norwegian language has two differ- Our model also improves the mBERT per- ent varieties, both equally recognized as written formance for other languages present in languages: Bokmal˚ and Nynorsk. The number of the corpus such as English, Swedish, and Wikipedia pages written in a certain language is Danish. For languages not included in the often used to measure its prevalence, and in this corpus, the weights degrade moderately regard, Norwegian Bokmal˚ ranges as number 23 while keeping strong multilingual prop- and Nynorsk as number 55.
    [Show full text]
  • Bootstrap Quantification of Estimation Uncertainties in Network Degree Distributions Received: 15 February 2017 Yulia R
    www.nature.com/scientificreports OPEN Bootstrap quantification of estimation uncertainties in network degree distributions Received: 15 February 2017 Yulia R. Gel1, Vyacheslav Lyubchich2 & L. Leticia Ramirez Ramirez3 Accepted: 5 June 2017 We propose a new method of nonparametric bootstrap to quantify estimation uncertainties in functions Published: xx xx xxxx of network degree distribution in large ultra sparse networks. Both network degree distribution and network order are assumed to be unknown. The key idea is based on adaptation of the “blocking” argument, developed for bootstrapping of time series and re-tiling of spatial data, to random networks. We first sample a set of multiple ego networks of varying orders that form a patch, or a network block analogue, and then resample the data within patches. To select an optimal patch size, we develop a new computationally efficient and data-driven cross-validation algorithm. The proposed fast patchwork bootstrap (FPB) methodology further extends the ideas for a case of network mean degree, to inference on a degree distribution. In addition, the FPB is substantially less computationally expensive, requires less information on a graph, and is free from nuisance parameters. In our simulation study, we show that the new bootstrap method outperforms competing approaches by providing sharper and better-calibrated confidence intervals for functions of a network degree distribution than other available approaches, including the cases of networks in an ultra sparse regime. We illustrate the FPB in application to collaboration networks in statistics and computer science and to Wikipedia networks. Motivated by a plethora of modern large network applications and rapid advances in computing technologies, the area of network modeling is undergoing a vigorous developmental boom, spreading over numerous disciplines, from computer science to engineering to social and health sciences.
    [Show full text]
  • Global Gender Differences in Wikipedia Readership
    Global gender differences in Wikipedia readership Isaac Johnson1, Florian Lemmerich2, Diego Saez-Trumper´ 1, Robert West3, Markus Strohmaier2,4, and Leila Zia1 1Wikimedia Foundation; fi[email protected] 2RWTH Aachen University; fi[email protected] 3EPFL; fi[email protected] 4GESIS - Leibniz Institute for the Social Sciences; fi[email protected] ABSTRACT Wikipedia represents the largest and most popular source of encyclopedic knowledge in the world today, aiming to provide equal access to information worldwide. From a global online survey of 65,031 readers of Wikipedia and their corresponding reading logs, we present novel evidence of gender differences in Wikipedia readership and how they manifest in records of user behavior. More specifically we report that (1) women are underrepresented among readers of Wikipedia, (2) women view fewer pages per reading session than men do, (3) men and women visit Wikipedia for similar reasons, and (4) men and women exhibit specific topical preferences. Our findings lay the foundation for identifying pathways toward knowledge equity in the usage of online encyclopedic knowledge. Equal access to encyclopedic knowledge represents a critical prerequisite for promoting equality more broadly in society and for an open and informed public at large. With almost 54 million articles written by roughly 500,000 monthly editors across more than 160 actively edited languages, Wikipedia is the most important source of encyclopedic knowledge for a global audience, and one of the most important knowledge resources available on the internet. Every month, Wikipedia attracts users on more than 1.5 billion unique devices from across the globe, for a total of more than 15 billion monthly pageviews.1 Data about who is represented in this readership provides unique insight into the accessibility of encyclopedic knowledge worldwide.
    [Show full text]
  • Wikipedia – Free and Reliable? Aspects of a Collaboratively Shaped Encyclopaedia*
    Nordicom Review 30 (2009) 1, pp. 183-199 Wikipedia – Free and Reliable? Aspects of a Collaboratively Shaped Encyclopaedia* MARIA MATTUS Abstract Wikipedia is a multilingual, Internet-based, free, wiki-encyclopaedia that is created by its own users. The aim of the present article is to let users’ descriptions of their impressions and ex- periences of Wikipedia increase our understanding of the function and dynamics of this collaboratively shaped wiki-encyclopaedia. Qualitative, structured interviews concerning users, and the creation and use of Wikipe- dia were carried out via e-mail with six male respondents – administrators at the Swedish Wikipedia – during September and October, 2006. The results focus on the following themes: I. Passive and active users; II. Formal and informal tasks; III. Common and personal visions; IV. Working together; V. The origins and creation of articles; VI. Contents and quality; VII. Decisions and interventions; VIII. Encyclopaedic ambitions. The discussion deals with the approach of this encyclopaedic phenomenon, focusing on its “unfinishedness”, its development in different directions, and the social regulation that is implied and involved. Wikipedia is a product of our time, having a powerful vision and engagement, and it should therefore be interpreted and considered on its own terms. Keywords: Wikipedia, encyclopaedia, wiki, cooperation, collaboration, social regulation Introduction Wikipedia is an international, multilingual1, Internet-based, free-content encyclopaedia. Consisting of about 8 million2 articles (27 million pages), it has become the world’s largest encyclopaedic wiki. Wikipedia uses the wiki technique and hypertext environ- ment, and besides its encyclopaedic content, it contains meta-level information and communication channels to enable further interaction.
    [Show full text]
  • Wikipedia – Free and Reliable? Aspects of a Collaboratively Shaped Encyclopaedia*
    10.1515/nor-2017-0146 Nordicom Review 30 (2009) 1, pp. 183-199 Wikipedia – Free and Reliable? Aspects of a Collaboratively Shaped Encyclopaedia* MARIA MATTUS Abstract Wikipedia is a multilingual, Internet-based, free, wiki-encyclopaedia that is created by its own users. The aim of the present article is to let users’ descriptions of their impressions and ex- periences of Wikipedia increase our understanding of the function and dynamics of this collaboratively shaped wiki-encyclopaedia. Qualitative, structured interviews concerning users, and the creation and use of Wikipe- dia were carried out via e-mail with six male respondents – administrators at the Swedish Wikipedia – during September and October, 2006. The results focus on the following themes: I. Passive and active users; II. Formal and informal tasks; III. Common and personal visions; IV. Working together; V. The origins and creation of articles; VI. Contents and quality; VII. Decisions and interventions; VIII. Encyclopaedic ambitions. The discussion deals with the approach of this encyclopaedic phenomenon, focusing on its “unfinishedness”, its development in different directions, and the social regulation that is implied and involved. Wikipedia is a product of our time, having a powerful vision and engagement, and it should therefore be interpreted and considered on its own terms. Keywords: Wikipedia, encyclopaedia, wiki, cooperation, collaboration, social regulation Introduction Wikipedia is an international, multilingual1, Internet-based, free-content encyclopaedia. Consisting of about 8 million2 articles (27 million pages), it has become the world’s largest encyclopaedic wiki. Wikipedia uses the wiki technique and hypertext environ- ment, and besides its encyclopaedic content, it contains meta-level information and communication channels to enable further interaction.
    [Show full text]
  • Wikipedia Research and Tools: Review and Comments
    Wikipedia research and tools: Review and comments Finn Arup˚ Nielsen January 24, 2019 Abstract research is to find the answer to the question \why does it work at all?" I here give an overview of Wikipedia and wiki re- Research using information from Wikipedia will search and tools. Well over 1,000 reports have been typically hope for the correctness and perhaps com- published in the field and there exist dedicated sci- pleteness of the information in Wikipedia. Many entific meetings for Wikipedia research. It is not aspects of Wikipedia can be used, not just the raw possible to give a complete review of all material text, but the links between components: Language published. This overview serves to describe some links, categories and information in templates pro- i key areas of research. vide structured content that can be used in a vari- ety of other applications, such as natural language processing and translation tools. Large-scale efforts 1 Introduction extract structured information from Wikipedia and link the data and connect it as Linked Data. The Wikipedia has attracted researchers from a wide 2 range of diciplines| phycisist, computer scientists, papers with description of DBpedia. represent ex- librarians, etc.|examining the online encyclopedia amples on this line of research. from a several different perspectives and using it in It is odd to write a Science 1.0 article about a variety of contexts. a Web 2.0 phenomenon: An interested researcher Broadly, Wikipedia research falls into four cate- may already find good collaborative written arti- gories: cles about Wikipedia research on Wikipedia itself, see Table1.
    [Show full text]
  • Wikipedia and Its Tensions with Paid Labour
    tripleC 14(1): 78-98, 2016 http://www.triple-c.at Monetary Materialities of Peer-Produced Knowledge: The Case of Wikipedia and Its Tensions with Paid Labour Arwid Lund* and Juhana Venäläinen** *Uppsala University, Uppsala, Sweden, [email protected] **University of Eastern Finland, Joensuu, Finland, [email protected] Abstract: This article contributes to the debate on the possibilities and limits of expanding the sphere of peer production within and beyond capitalism. As a case in point, it discusses the explicit and tacit monetary dependencies of Wikipedia, which are not only ascribable to the need to sustain the techno- logical structures that render the collaboration possible, but also about the money-mediated suste- nance of the peer producers themselves. In Wikipedia, the “bright line” principle for avoiding conflicts of interest has been that no one should be paid for directly editing an article. By examining the after- math of the Wiki-PR scandal, where a consulting firm was allegedly involved in helping more than 12,000 clients to edit Wikipedia articles until 2014, the goal of the analysis is to shed light on the para- doxical situation where the institution for supporting the peer production (Wikimedia Foundation) found itself taking a more strict perspective vis-à-vis commercial alliances than the unpaid community edi- tors. Keywords: Wikipedia, Wiki-PR, Materialities, Peer Production, Commons 1. Introduction Peer production has often been portrayed as a potential complement or a radical alternative to capitalism (e.g. Moore 2011; Rigi 2013). As an early and prominent figure in the debate, Yochai Benkler (2002; 2006; 2011) depicted the regime of commons-based peer production as a revolutionary social form that will eventually transform the ways of organizing production in the contemporary economy.
    [Show full text]
  • The Case for a Norwegian Transformer Model Per E Kummervold Javier De La Rosa [email protected] [email protected]
    Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model Per E Kummervold Javier de la Rosa [email protected] [email protected] Freddy Wetjen Svein Arne Brygfjeld [email protected] [email protected] The National Library of Norway Mo i Rana, Norway Abstract (Devlin et al., 2019). Later research has shown that the corpus size might have even been too In this work, we show the process of build- small, and when Facebook released its Robustly ing a large-scale training set from digi- Optimized BERT (RoBERTa), it showed a consid- tal and digitized collections at a national erable gain in performance by increasing the cor- library. The resulting Bidirectional En- pus to 160GB (Liu et al., 2019). coder Representations from Transformers (BERT)-based language model for Nor- Norwegian is spoken by just 5 million peo- wegian outperforms multilingual BERT ple worldwide. The reference publication Ethno- (mBERT) models in several token and se- logue lists the 200 most commonly spoken na- quence classification tasks for both Nor- tive languages, and it places Norwegian as num- wegian Bokmal˚ and Norwegian Nynorsk. ber 171. The Norwegian language has two differ- Our model also improves the mBERT per- ent varieties, both equally recognized as written formance for other languages present in languages: Bokmal˚ and Nynorsk. The number of the corpus such as English, Swedish, and Wikipedia pages written in a certain language is Danish. For languages not included in the often used to measure its prevalence, and in this corpus, the weights degrade moderately regard, Norwegian Bokmal˚ ranges as number 23 while keeping strong multilingual prop- and Nynorsk as number 55.
    [Show full text]