Wikipedia Graph Mining: Dynamic Structure of Collective Memory
Total Page:16
File Type:pdf, Size:1020Kb
Wikipedia graph mining: dynamic structure of collective memory Volodymyr Miz, Kirell Benzi, Benjamin Ricaud, and Pierre Vandergheynst Ecole Polytechnique Fédérale de Lausanne Lausanne, Switzerland [email protected] ABSTRACT services, and knowledge bases generate a massive amount of logs, Wikipedia is the biggest encyclopedia ever created and the fifth containing traces of global online activity on the Web. A large-scale most visited website in the world. Tens of millions of people surf example of such publicly available information is the Wikipedia it every day, seeking answers to various questions. Collective user knowledge base and its history of visitors activity. This data is a activity on its pages leaves publicly available footprints of human great source for collective human behavior analysis at scale. Due to behavior, making Wikipedia an excellent source for analysis of this reason, the analysis of the Wikipedia in this area has become collective behavior. popular over the recent years [27], [14], [21]. In this work, we propose a new method to analyze and retrieve Collective memory [17] is an interesting social phenomenon of collective memories, the way social groups remember and recall human behavior. Studying this concept is a way to enhance our the past. We use the Hopfield network model as an artificial mem- understanding of a common view of events in social groups and ory abstraction to build a macroscopic collective memory model. identify the events that influence remembering of the past. Early To reveal memory patterns, we analyze the dynamics of visitors research on collective memory relied on interviews and self-reports activity on Wikipedia and its Web network structure. Each pattern that led to a limited number of subjects and biased results [26]. in the Hopfield network is a cluster of Wikipedia pages sharing a The availability of the Web activity data opened new opportunities common topic and describing an event that triggered human cu- toward systematic studies at a much larger scale [10], [16], [14]. riosity during a finite period of time. We show that these memories Nonetheless, the general nature of collective memory formation can be remembered with good precision using a memory network. and its modeling remain open questions. Can we model collective The presented approach is scalable and we provide a distributed and individual memory formation similarly? Is it possible to find implementation of the algorithm. collective memories and behavior patterns inside a collaborative knowledge base? In this work, we adopt a data-driven approach to CCS CONCEPTS shed some light on these questions. An essential part of the Web visitors activity data is the under- • Information systems → Wrappers (data mining); Web log lying human-made graph structure that was initially introduced analysis; • Networks → Network dynamics; • Human-centered to facilitate navigation. The combination of the activity dynamics computing → Wikis; and structure of Web graphs inspires an idea of their similarity KEYWORDS to biological neural networks. A good example of such network is the human brain. Numerous neurons in the brain constitute a Collective memory, Graph Algorithm, Hopfield Network, Wikipedia, biological dynamic neural network, where dynamics are expressed Web Logs Analysis in terms of neural spikes. This network is in charge of perception, ACM Reference Format: decision making, storing memories, and learning. Volodymyr Miz, Kirell Benzi, Benjamin Ricaud, and Pierre Vandergheynst. During learning, neurons in our brain self-organize and form 2018. Wikipedia graph mining: dynamic structure of collective memory. In strongly connected groups called neural assemblies [1]. These Proceedings of (Submitted to KDD’18). , 9 pages. https://doi.org/10.475/123_4 groups express similar activation patterns in response to a spe- arXiv:1710.00398v5 [cs.IR] 14 Feb 2018 cific stimuli. When learning is completed, and the stimuli applied 1 INTRODUCTION once again, reactions of the assemblies correspond to consistent Over recent years, the Web has significantly affected the way people dynamic activity patterns, i.e. memories. Synaptic plasticity mecha- learn, interact in social groups, store and share information. Apart nisms govern this self-organization process. from being an essential part of modern life, social networks, online Hebbian learning theory [18] proposes an explanation of this self-organization and describes the basic rules that guide the net- Permission to make digital or hard copies of all or part of this work for personal or work design. The theory implies that simultaneous activation of a classroom use is granted without fee provided that copies are not made or distributed pair of neurons leads to an increase in the strength of their connec- for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM tion. In general,the Hebbian theory implies that the neural activity must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, transforms brain networks. This assumption leads to an interesting to post on servers or to redistribute to lists, requires prior specific permission and/or a question. Can temporal dynamics cause a self-organization process fee. Request permissions from [email protected]. Submitted to KDD’18, August 19–23, London, United Kingdom in the Wikipedia Web network, similar to the one, driven by neural © 2018 Association for Computing Machinery. spikes in the brain? ACM ISBN 123-4567-24-567/08/06...$15.00 https://doi.org/10.475/123_4 Submitted to KDD’18, August 19–23, London, United Kingdom V. Miz etal. To answer this question, we introduce a collective memory mod- collective memories using LDA [7], applied to a collection of news eling approach inspired by the theory of learning in the brain, articles. in particular, a content-addressable memory system, the Hopfield Despite the fact that Wikipedia is the largest ever created en- network [19]. In our experiments, we use a dynamic network of cyclopedia of public knowledge and the fifth most visited web- Wikipedia articles, where the dynamics comes from the visitors site in the world, the studies on collective memory considered the activity on each page, i.e. the number of visits per page per hour. We Wikipedia visitor activity data only recently. The idea of regard- assume that the Wikipedia network can self-organize similar to a ing Wikipedia as a global memory space was first introduced by Hopfield network with a modified Hebbian learning rule and learn Pentzold in 2009 [25]. Then, it was followed by a range of collective collective memory patterns under the influence of visitors activity memory studies based on various Wikipedia data archives. similar to neurons in the brain. The results of our experiments Analyzing Wikipedia page views, Kanhabua et al. [21] proposed demonstrate that memorized patterns correspond to groups of col- a collective memory model, investigating 5500 events from 11 cat- lective memories, containing clusters of linked pages that have a egories. The authors proposed a remembering score based on a closely related meaning. A topic of a cluster corresponds to a real combination of time-series analysis and location information. The world event that triggered the interest of Wikipedia visitors during focus of the work is on the four types of events: aviation accidents, a finite period of time. In addition, the collective memory gives earthquakes, hurricanes, and terrorist attacks. The work presents access to the time lapse when the event occured. We also show that extensive experimental results, however, it limited to particular our collective memory model is able to recall an event recorded in types of collective memories. the collective memory, given only a part of the event cluster. Here Traumatic events such as attacks and bombings have also been the term recall means that we recover a missing fraction of the investigated in [11], [10] based on the Wikipedia edit activity data. visitor activity signal in a memory cluster. The authors investigate the difference between traumatic and non- Contributions. Our contributions are as follows: traumatic events using natural language processing techniques. • We propose a novel collective memory learning framework, The study is limited to a certain type of events. inspired by artificial models of individual memory –the Another case study [14] focuses on memory-triggering patterns Hebbian learning theory and the Hopfield network model. to understand collective memories. The collective memory model • We formalize our findings into an content-addressable model is inferred from the Wikipedia visitors activity. The work considers of collective memories. So far, collective memory [17] has only a case of aircraft incidents reported in English Wikipedia. The been considered just as a concept that lacks a general model authors try to build a general mathematical model and explain of memory formation. In the experiments we demonstrate the phenomenon of collective memory, extracted from Wikipedia, that given a modified Hebbian learning rule, Wikipedia hy- based on that single topic. perlinks network can self-organize and gain properties rem- Popularity and celebrities represent another focus point of public iniscent of an associative memory. interest. The Wikipedia hourly visits on the pages of celebrities • We develop a graph algorithm for collective memory ex- was used to investigate fame levels of tennis players [29]. The traction. Computations are local on the graph, allowing to authors, though, did not tackle collective memories and aimed to build an efficient implementation. We provide a distributed quantify the relationship between performance and popularity of implementation of the algorithm to show that it can handle the athletes. dense graphs and massive amounts of time-series data. To the best of our knowledge, we are the first to apply a content- • We present graph visualizations as an interactive tool for addressable memory model to the Wikipedia viewership data for collective memory studies. collective memory analysis.