The Dao of Wikipedia: Extracting Knowledge from the Structure of Wikilinks
Cristian Consonni

UNIVERSITÀ DEGLI STUDI DI TRENTO
Department of Information Engineering and Computer Science
ICT International Doctoral School, Cycle XXX

The Dao of Wikipedia: Extracting Knowledge from the Structure of Wikilinks

Cristian Consonni

Advisor: Alberto Montresor, University of Trento, Trento
Co-advisor: Yannis Velegrakis, University of Trento, Trento

2019

Cristian Consonni: The Dao of Wikipedia, Extracting Knowledge from the Structure of Wikilinks, © 2019– Creative Commons Attribution-ShareAlike Licence 4.0 (CC BY-SA 4.0)

The copyright of this thesis rests with the author. Unless otherwise indicated, its contents are licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence. Under this licence, you may copy and redistribute the material in any medium or format for both commercial and non-commercial purposes. You may also create and distribute modified versions of the work. This is on the condition that you credit the author and share any derivative works under the same licence.

When reusing or sharing this work, ensure you make the licence terms clear to others by naming the licence and linking to the licence text. Where a work has been adapted, you should indicate that the work has been changed and describe those changes. Please seek permission from the copyright holder for uses of this work that are not included in this licence or permitted under Copyright Law.

For more information, read the CC BY-SA 4.0 deed. For the full text of the licence, visit the CC BY-SA 4.0 legal code.

To Virginia, for having been close to me.

ABSTRACT

Wikipedia is a multilingual encyclopedia written collaboratively by volunteers online, and it is now the largest, most visited encyclopedia in existence.
Wikipedia has arisen through the self-organized collaboration of contributors. Since its launch in January 2001, its potential as a research resource has become apparent to scientists: its appeal lies in the fact that it strikes a middle ground between accurate, manually created, limited-coverage resources and noisy knowledge mined from the web. For this reason, Wikipedia's content has been exploited for a variety of applications: to build knowledge bases, to study interactions between users on the Internet, and to investigate social and cultural issues such as gender bias in history or the spreading of information.

As happened for the Web at large, a structure has emerged from the collaborative creation of Wikipedia: its articles contain hundreds of millions of links. In Wikipedia parlance, these internal links are called wikilinks. These connections explain the topics being covered in articles and provide a way to navigate between different subjects, contextualizing the information and making additional information available.

In this thesis, we argue that the information contained in the link structure of Wikipedia can be harnessed to gain useful insights by extracting it with dedicated algorithms. More prosaically, in this thesis, we explore the link structure of Wikipedia with new methods.

In the first part, we discuss in depth the characteristics of Wikipedia, and we describe the process we followed and the challenges we faced in extracting the network of links. Since Wikipedia is available in several language editions and its entire edit history is publicly available, we have extracted the wikilink network at various points in time, and we have performed data integration to improve its quality.

In the second part, we show that the wikilink network can be effectively used to find the most relevant pages related to an article provided by the user.
We introduce a novel algorithm, called CycleRank, that takes advantage of the link structure of Wikipedia by considering cycles of links, thus giving weight to both incoming and outgoing connections, to produce a ranking of articles with respect to an article chosen by the user.

In the last part, we explore applications of CycleRank. First, we describe the Engineroom EU project, where we faced the challenge of finding the most relevant Wikipedia pages connected to the Wikipedia article about the Internet. Finally, we present another contribution that uses Wikipedia article accesses to estimate how information about diseases propagates.

In conclusion, with this thesis, we wanted to show that browsing Wikipedia's wikilinks is not only fascinating and serendipitous¹, but also an effective way to extract useful information that is latent in the user-generated encyclopedia.

¹ https://xkcd.com/214/

PUBLICATIONS

This thesis is based on the following papers:

[1] Cristian Consonni, David Laniado, and Alberto Montresor. WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 598–607, 2019.

[2] Cristian Consonni, David Laniado, and Alberto Montresor. Discovering Topical Contexts from Links in Wikipedia. 2019.

[3] Cristian Consonni, David Laniado, and Alberto Montresor. CycleRank, or There and Back Again: personalized relevance scores from cyclic paths on graphs. Submitted to VLDB 2020, 2020.

[4] Paolo Bosetti, Piero Poletti, Cristian Consonni, Bruno Lepri, David Lazer, Stefano Merler, and Alessandro Vespignani. Disentangling social contagion and media drivers in the emergence of health threats awareness. Science Advances, 2019. Under review at Science Advances.

This Ph.D.
was also instrumental in studying other topics, which I chose not to include in this manuscript:

[5] Cristian Consonni, Paolo Sottovia, Alberto Montresor, and Yannis Velegrakis. Discovering Order Dependencies through Order Compatibility. In International Conference on Extending Database Technology, 2019.

[6] Riccardo Pasi, Cristian Consonni, and Maurizio Napolitano. Open Community Data & Official Public Data in flood risk management: a comparison based on InaSAFE. In FOSS4G-Europe 2015, the 2nd European Conference for Free and Open Source Software for Geospatial, 2015.

[7] Marco Cè, Cristian Consonni, Georg P. Engel, and Leonardo Giusti. Non-Gaussianities in the topological charge distribution of the SU(3) Yang-Mills theory. Physical Review D, 92(7):074502, 2015.

CONTENTS

1 Introduction

I Graphs from Wikipedia
2 WikiLinkGraphs: A Complete, Longitudinal and Multi-Language Dataset of the Wikipedia Link Networks
  2.1 The WikiLinkGraphs Dataset
    2.1.1 Data Processing
    2.1.2 Dataset Description
  2.2 Analysis and Use Cases
    2.2.1 Comparison with Wikimedia's pagelinks Database Dump
    2.2.2 Cross-language Comparison of PageRank Scores
  2.3 Research Opportunities using the WikiLinkGraphs Dataset
    2.3.1 Graph Streaming
    2.3.2 Link Recommendation
    2.3.3 Link Addition and Link Removal
    2.3.4 Anomaly Detection
    2.3.5 Controversy mapping
    2.3.6 Cross-cultural studies
  2.4 Conclusions

II Relevance on a Graph
3 CycleRank, or There and Back Again: Personalized Relevance Scores from Cyclic Paths on Directed Graphs
  3.1 Problem Statement
  3.2 Background
  3.3 Related Work
  3.4 The CycleRank Algorithm
    3.4.1 Preliminary filtering
    3.4.2 Cycle enumeration
    3.4.3 Score computation
  3.5 Experimental Evaluation
    3.5.1 Dataset Description
    3.5.2 Alternative Approaches
    3.5.3 Implementation and Reproducibility
    3.5.4 Qualitative Comparison
    3.5.5 Quantitative Comparison
    3.5.6 Performance Analysis
  3.6 Conclusions

III Applications
4 Next Generation Internet - Engineroom
  4.1 Keyword Selection
  4.2 Cross-language keyword mapping
  4.3 Network visualization
  4.4 Internet governance
    4.4.1 Longitudinal analysis
    4.4.2 Cross-language analysis
  4.5 Conclusions
5 Disentangling Social Contagion and Media Drivers in the Emergence of Health Threats Awareness
  5.1 Results and Discussion
  5.2 Conclusions
  5.3 Material and Methods
  5.4 Tables and figures
6 Conclusions

IV Appendix
A The Engineroom EU Project
  A.1 Algorithmic bias
  A.2 Cyberbullying
    A.2.1 Longitudinal analysis
    A.2.2 Cross-language analysis
  A.3 Computer security
    A.3.1 Longitudinal analysis
    A.3.2 Cross-language analysis
  A.4 Green computing
    A.4.1 Longitudinal analysis
    A.4.2 Cross-language analysis
  A.5 Internet privacy
    A.5.1 Longitudinal analysis
    A.5.2 Cross-language analysis
  A.6 Net neutrality
    A.6.1 Longitudinal analysis
    A.6.2 Cross-language analysis
  A.7 Online identity
    A.7.1 Longitudinal analysis
    A.7.2 Cross-language analysis
  A.8 Open-source model
    A.8.1 Longitudinal analysis
    A.8.2 Cross-language analysis
  A.9 Right to be forgotten
    A.9.1 Longitudinal analysis
    A.9.2 Cross-language analysis
  A.10 General Data Protection Regulation (GDPR)
    A.10.1 Longitudinal analysis
    A.10.2 Cross-language analysis

Bibliography

1 INTRODUCTION

At first glance, the brain, a knowledge base, and the Garden of Eden do not seem to have anything in common. However, it can be argued that in all these metaphorical places, knowledge is encoded in the structure of a graph. A graph is a structure composed of a set of objects, in which some pairs of objects possess a given property.
The objects correspond to abstractions called nodes, vertices, or points, and each of the related pairs of vertices is called an edge, arc, or line. For the brain, the concept of a neural network has been well known since the late 19th century, and it has been used as a practical tool in computer science since the 1980s [5].
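
As a minimal illustration of the definition above, a directed graph can be stored as a mapping from each node to the set of nodes it points to. The sketch below is written in Python. The function name build_graph and the three node labels are hypothetical, chosen only for illustration; they are not taken from the thesis or its dataset.

```python
def build_graph(edges):
    """Build adjacency sets for a directed graph from (source, target) pairs."""
    adjacency = {}
    for source, target in edges:
        # Record the edge source -> target.
        adjacency.setdefault(source, set()).add(target)
        # Make sure nodes with no outgoing edges still appear as keys.
        adjacency.setdefault(target, set())
    return adjacency


# Hypothetical edges between three made-up node labels (illustrative only).
edges = [("Brain", "Neuron"), ("Neuron", "Brain"), ("Neuron", "Graph")]
graph = build_graph(edges)

# Nodes are the keys; edges are the (key, member) pairs.
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Storing neighbours in sets keeps repeated links between the same pair of nodes from being counted twice, which is one possible design choice when the edges represent links between pages.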
