Death and Change Tracking: Wikipedia Edit Bursts
Miles Lincoln
ABSTRACT

Bursts have been used in prior informetrics studies to predict the emergence of new fields and trends in research by looking at the occurrence of common terms throughout a body of knowledge, such as the full text of a journal. If a sudden flurry of activity in a field can be observed, it may be possible to quantify it and determine to what extent this new area will take off. Having identified a new variety of burst observed in the activity on a Wikipedia page following a newsworthy event, this study took an interest in what could be predicted by visualizing these graphs, as they represent a granular, instantly updated resource that can be observed and analyzed. Additionally, beyond prediction, can the analysis of this data correlate to other aspects of the subject of the Wikipedia page? For example, does the popularity of an actor correlate to the size of the edit burst on their Wikipedia page following their death? This study develops a workflow and appends scripts capable of processing data from Wikipedia that can be applied to any sampling of pages to visualize bursts in activity, from which one could look for correlations and make their own predictions. For this experiment, the chosen sample created some constraints, and although some correlation was observed between an actor’s financial success and the number of edits to their Wikipedia page made in the burst following their death, there were too few instances where this was the case and too much discrepancy between individual actors to come to a concrete conclusion.

1. INTRODUCTION

As more and more knowledge enters the digital realm, becoming a machine-readable unit of information, the ability to analyze knowledge quantifiably expands. Every day, we see trending topics on Twitter, popular queries on our favorite search engine and countless other instances of recurring phrases linking together to gain momentum. These bursts, or recurring instances of similar content (like a trending topic) or activity (like thousands of people Liking a Facebook post), are everywhere.

In the past decade, Wikipedia has grown to be a reliable source for information on an endless variety of subjects. From its roots as a niche website with the reputation of dubious or unsubstantiated knowledge, Wikipedia has evolved into a constantly updated information resource [1] with a dedicated group of volunteer curators with rigid standards for ensuring quality knowledge. In an attempt to visualize some of the activity on a Wikipedia page, this study focuses on what it identifies as edit bursts, or spikes represented by a flurry of edits occurring within a short period of time (which, when graphed, display a spike).

[Figure: Burst, or spike, beginning at x=3]

These bursts are typically associated with some event related to the subject of the page, such as a newsworthy event. The news reaches the general public, then one editor first updates the page to reflect this news [2], followed by a number of other editors with the intention to either improve the quality of the first edit, tweak the wording or content in some way, or revert the change altogether. This study predicts that the length and volume of this cycle of editing, re-editing and undoing is closely related to the popularity of the subject of the page. In theory, this is how Wikipedia should work. However, it is worth noting that an edit spike may be caused by something unrelated to the content of the page, and instead have to do with the editors themselves, who may be spurred to edit a page for no other reason than to improve the quality, or maybe even for devious reasons, such as wiki-vandalism, in which edits are made to degrade the quality of the knowledge, sometimes surreptitiously [3]. Such acts of vandalism, and the reactionary edits and undoings that follow, could also create an edit spike. Wikipedia allows anyone to edit almost any page (Wikipedia has instituted processes to screen edits applied to living people [4]), and it also provides snapshots of every edit ever made to a page, along with who made the edit and when. Given the quantifiable nature of dates, it is the ‘when’ that this paper is most interested in and capable of visualizing. This particular study examines the activity on the Wikipedia pages of a range of actors surrounding their time of death.

Bursts have previously been used in informetrics to predict trends in research and emerging fields. Guo et al. found that word bursts frequently preceded an emerging area, which, in turn, attracts new authors [5], making these bursts useful indicators of future success. Because the sample in this study is not large enough and does not stretch over a long enough term, it is difficult to use it for purposes of prediction, or to model the Wikipedia data after the research in Guo’s paper, but there are still similarities worth considering. Setting itself apart further, the Wikipedia data is contained: unlike the studies of trend bursts and emerging fields, which span the full text of a body of scientific literature, each wiki page in this study is relevant only to the actor it describes. Also, the spikes are brief and difficult to extend into a method for prediction. Because our contained data set is relevant only to a single actor now deceased, the value of a prediction may be low. It would appear that the dynamic time in an actor’s life story has passed, but regardless of what the future holds, we have a series of snapshots immediately surrounding the time of death. Although no further news is expected of a dead celebrity, perhaps previously unrecognized relationships can be discovered in these community reactions to celebrity mortality.

As a sample, a group of male actors was picked who had passed away in the first decade of Wikipedia’s existence (2001-2009), roughly one burst/actor death per calendar year. The study chose actors by browsing the ‘Deaths in (year)’ pages on Wikipedia and selecting the most recognizable name in acting. The nine actors chosen are listed in a table below, along with their date of death. While macabre, this sampling provided a set of Wikipedia pages with (almost) guaranteed bursts around a predictable date attached to a controlled event (death of actor). Though efforts were made to select the most popular (a subjective quality) actor each year, this was often difficult, and some years have a notable lack of star power. By picking based on fame, the data admitted a diverse group of actors who have little in common with regard to age, cause of death, and other factors which would affect the newsworthiness of their passing, and the likelihood that a large number of tech-savvy Wikipedia editors would jump into action. By pulling in one actor who should exhibit a spike in edit activity for each year that Wikipedia has existed, we will also observe how the edit spike evolves over the long-term lifespan of Wikipedia.

Actor             Date of death   Age at death
Jack Lemmon       6/27/2001       76
James Coburn      11/18/2002      74
Gregory Peck      6/12/2003       87
Marlon Brando     7/1/2004        80
Ossie Davis       2/4/2005        87
Jack Palance      11/10/2006      87
Robert Goulet     10/30/2007      73
Heath Ledger      1/22/2008       28
Patrick Swayze    9/14/2009       57

Actors used in study

2. METHODS

This study employed a script written in Perl (a programming language capable of parsing text, such as dates, using regular expressions) to scrape Wikipedia page revision histories for the dates of all edits. Wikipedia shows only 500 of these at a time, so it was necessary to copy/paste them 500 at a time into a text document that could then be fed to the script. The script produced a list of sequential dates in DD/MM/YYYY format with one entry for each edit made per day for a particular page.

[Figure: Brad Pitt’s edit frequency in red, trendline is the moving average with a period of 200]

[Figure: Sample output of Perl script]

With the data in this new format, it was inserted into a pivot table in Excel which looked for dates occurring multiple times, to see which days had a high frequency of edits, or a burst. These data were then graphed using several different parameters in Excel. This process was repeated for the nine different sample Wikipedia pages exhibiting edit bursts and combined into a single spreadsheet. Unifying all the data made it simple to chart different comparisons and compare any facet on a whim.

3. RESULTS

The first thing to look at was the evolution of a burst over time. How did an edit burst change from 2002 to 2009? This was first done by aligning all of the bursts on the same graph and comparing their activity on the same timeline.

[Figure: Deaths 2002-09 aligned on x=10 on a log scale]

Here, we can see the more recent deaths at the upper end, exhibiting the largest edit bursts. Also interesting is the activity that continues after the initial burst in many of the actors studied, and the lack of activity that appears to fall around x=30. Note that Jack Lemmon’s death occurred prior to Wikipedia’s current editing numeration scheme, and his page had so little activity as to be inconsequential, so he has been left off of some graphs in order to keep clutter to a minimum.

In order to get a different view, the growth was compared by observing the trend in number of max edits per day over the years.

[Figure: Change in initial (day of death) burst size 2002-09]

Both of these approaches produced interesting graphs.
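The processing steps described in Methods and Results (extracting edit dates from a pasted revision history, counting edits per day, aligning a burst on a fixed x position, and smoothing with a moving average) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the study's Perl script or Excel workbook: the timestamp pattern is an assumption about how pasted revision-history lines look, and every function name and the sample text are hypothetical.

```python
import re
from collections import Counter
from datetime import date, datetime, timedelta

# Assumed shape of a timestamp in pasted history text, e.g. "23:58, 22 January 2008".
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}, (\d{1,2} [A-Z][a-z]+ \d{4})\b")


def extract_edit_dates(history_text):
    """Return one date object per edit timestamp found in the text."""
    found = []
    for match in TIMESTAMP.finditer(history_text):
        try:
            found.append(datetime.strptime(match.group(1), "%d %B %Y").date())
        except ValueError:
            continue  # looked like a date but was not a valid one
    return found


def as_ddmmyyyy(edit_dates):
    """Format dates as DD/MM/YYYY, the output format named in Methods."""
    return [d.strftime("%d/%m/%Y") for d in edit_dates]


def edits_per_day(edit_dates):
    """Count edits per calendar day (the role the pivot table plays)."""
    return Counter(edit_dates)


def align_on_event(counts, event_date, anchor=10, length=40):
    """Return daily counts positioned so that event_date sits at x == anchor."""
    start = event_date - timedelta(days=anchor)
    return [counts.get(start + timedelta(days=x), 0) for x in range(length)]


def moving_average(values, period):
    """Trailing moving average; None until `period` values are available."""
    averaged = []
    running = 0
    for i, value in enumerate(values):
        running += value
        if i >= period:
            running -= values[i - period]  # drop the value leaving the window
        averaged.append(running / period if i >= period - 1 else None)
    return averaged


if __name__ == "__main__":
    # Hypothetical pasted history lines; user names are placeholders.
    sample = (
        "(cur | prev) 23:58, 22 January 2008 ExampleUser\n"
        "(cur | prev) 23:41, 22 January 2008 AnotherUser\n"
        "(cur | prev) 09:15, 20 January 2008 ThirdUser\n"
    )
    counts = edits_per_day(extract_edit_dates(sample))
    print(align_on_event(counts, date(2008, 1, 22), anchor=10, length=15))
```

Aligning every page on the same anchor position is what lets bursts from different years share one x axis; a day with no edits is filled with 0, which would need special handling (for example, masking) before plotting on a log scale.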