
and Change Tracking: Wikipedia Edit Bursts

Miles Lincoln

ABSTRACT

Bursts have been used in prior informetrics studies to predict the emergence of new fields and trends in research by looking at the occurrence of common terms throughout a body of knowledge, such as the full text of a journal. If a sudden flurry of activity in a field can be observed, it may be possible to quantify and determine to what extent this new area will take off. Having identified a new variety of burst observed in the activity on a Wikipedia page following a newsworthy event, this study took an interest in what could be predicted by visualizing these graphs, as they represent a granular, instantly updated resource that can be observed and analyzed. Additionally, beyond prediction, can the analysis of this data correlate to other aspects of the subject of the Wikipedia page? For example, does the popularity of an actor correlate to the size of the edit burst on their Wikipedia page following their death? This study develops a workflow and appends scripts capable of processing data from Wikipedia that can be applied to any sampling of pages to visualize bursts in activity from which one could look for correlations and make their own predictions. For this experiment, the chosen sample created some constraints, and although some correlation was observed between an actor’s financial success and the number of edits to their Wikipedia page made in the burst following their death, there were too few instances where this was the case and too much discrepancy between individual actors to come to a concrete conclusion.

1. INTRODUCTION

As more and more knowledge enters the digital realm, becoming a machine-readable unit of information, the ability to analyze knowledge quantifiably expands. Every day, we see trending topics on Twitter, popular queries on our favorite search engine and countless other instances of recurring phrases linking together to gain momentum. These bursts, or recurring instances of similar content (like a trending topic) or activity (like thousands of people Liking a Facebook post), are everywhere.

In the past decade, Wikipedia has grown to be a reliable source for information on an endless variety of subjects. From its roots as a niche website with the reputation of dubious or unsubstantiated knowledge, Wikipedia has evolved into a constantly updated information resource [1] with a dedicated group of volunteer curators with rigid standards for ensuring quality knowledge. In an attempt to visualize some of the activity on a Wikipedia page, this study focuses on what it identifies as edit bursts, or spikes represented by a flurry of edits occurring within a short period of time (which, when graphed, display a spike).

[Figure: Burst, or spike, beginning at x=3]

These bursts are typically associated with some event related to the subject of the page, such as a newsworthy event. The news reaches the general public, then one editor first updates the page to reflect this news [2], followed by a number of other editors with the intention to either improve the quality of the first edit, tweak the wording or content in some way, or revert the change altogether.

This study predicts that the length and volume of this cycle of editing, re-editing and undoing is closely related to the popularity of the subject of the page. In theory, this is how Wikipedia should work. However, it is worthwhile to note that an edit spike may be caused by something unrelated to the content of the page, and instead have to do with the editors themselves, who may be spurred to edit a page for no other reason than to improve the quality, or maybe even for devious reasons, such as wiki-vandalism, in which edits are made to degrade the quality of the knowledge, sometimes surreptitiously [3]. Such acts of vandalism, and reactionary edits and undoings, could also create an edit spike.

Wikipedia allows anyone to edit almost any page (Wikipedia has instituted processes to screen edits applied to living people [4]), and it also provides snapshots of every edit ever made to a page, along with who made the edit and when. Given the quantifiable nature of dates, it is the ‘when’ that this paper is most interested in and capable of visualizing. This particular study will be examining the activity on the Wikipedia pages of a range of actors surrounding their time of death.

Bursts have previously been used in informetrics to predict trends in research and emerging fields. Guo et al. found that word bursts frequently preceded an emerging area, which, in turn, attracts new authors [5], making these bursts useful indicators of future success. Because the sample in this study is not large enough and does not stretch over a long enough term, it is difficult to use it for purposes of prediction, or to model the Wikipedia data after the research in Guo’s paper, but still there are similarities worth considering.

Setting itself apart further, the Wikipedia data is contained: unlike the studies of trend bursts and emerging fields, which span the full text of a body of scientific literature, each wiki page in this study is relevant only to the actor it describes. Also, the spikes are brief and difficult to extend into a method for prediction. Because our contained data set is relevant only to a single actor now deceased, the value of a prediction may be low. It would appear that the dynamic time in an actor’s life story has passed, but regardless of what the future holds, we have a series of snapshots immediately surrounding the time of death. Although no further news is expected of a dead celebrity, perhaps previously unrecognized relationships can be discovered in these community reactions to celebrity mortality.
As a sample, a group of male actors was picked who had passed away in the first decade of Wikipedia’s existence (2001-2009), roughly one burst/actor death per calendar year. The study chose actors by browsing the ‘Deaths in (year)’ pages on Wikipedia and selecting the most recognizable name in acting. The nine actors chosen are listed in the table below, along with their date of death. While macabre, this sampling provided a set of Wikipedia pages with (almost) guaranteed bursts around a predictable date attached to a controlled event (death of actor). Though efforts were made to select the most popular (a subjective quality) actor each year, this was often difficult, and some years have a notable lack of star power. By picking based on fame, the data admitted a diverse group of actors who have little in common with regard to age, cause of death, and other factors which would affect the newsworthiness of their passing, and the likelihood that a large number of tech-savvy Wikipedia editors would jump into action. By pulling in one actor who should exhibit a spike in edit activity for each year that Wikipedia has existed, we will also observe how the edit spike evolves over the long-term lifespan of Wikipedia.

Actor             Date of death   Age at death
Jack Lemmon       6/27/2001       76
[name missing]    11/18/2002      74
[name missing]    6/12/2003       87
[name missing]    7/1/2004        80
Ossie Davis       2/4/2005        87
[name missing]    11/10/2006      87
Robert Goulet     10/30/2007      73
Heath Ledger      1/22/2008       28
Patrick Swayze    9/14/2009       57

Actors used in study

2. METHODS

This study employed a Perl script (Perl is a programming language capable of parsing text, such as dates, using regular expressions) to scrape Wikipedia page revision histories for the dates of all edits. Wikipedia only shows these 500 at a time, so it was necessary to copy and paste the revisions 500 at a time into a text document that could then be fed to the script. The script produced a list of sequential dates in DD/MM/YYYY format with one entry for each edit made to a particular page, so a date appears once for every edit made that day.

[Figure: Sample output of Perl script]

With the data in this new format, it was inserted into a pivot table in Excel which looked for dates occurring multiple times, to see which days had a high frequency of edits, or a burst. These data were then graphed using several different parameters in Excel. This process was repeated for the nine different sample Wikipedia pages exhibiting edit bursts and combined into a single spreadsheet. Unifying all the data made it simple to chart different comparisons and compare any facet on a whim.

3. RESULTS

The first thing to look at was the evolution of a burst over time. How did an edit burst change from 2002 to 2009? This was first done by aligning all of the bursts on the same graph and comparing their activity on the same timeline.

[Figure: Deaths 2002-09 aligned on x=10 on a log scale]

Here, we can see the more recent deaths at the upper end, exhibiting the largest edit bursts. Also interesting is the activity that continues after the initial burst in many of the actors studied, and the lack of activity that appears to fall around x=30. Note that Jack Lemmon’s death occurred prior to Wikipedia’s current editing numeration scheme, and his page had so small an amount of activity as to be inconsequential, so he has been left off of some graphs in order to keep clutter to a minimum.

In order to get a different view, the growth was compared by observing the trend in the maximum number of edits per day over the years.
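The date-extraction and per-day counting steps described in the Methods can be sketched roughly as follows. This is an illustration in Python rather than the Perl script the study used, and it is not the author's code. It assumes the pasted revision history contains timestamps shaped like "21:05, 22 January 2008"; the function names, the sample text and the user names in it are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: each edit in the pasted history text carries a timestamp such as
# "21:05, 22 January 2008". Only the date part is captured.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}, (\d{1,2} [A-Z][a-z]+ \d{4})\b")


def extract_edit_dates(history_text):
    """Return one DD/MM/YYYY string per edit found in the pasted history text."""
    dates = []
    for match in TIMESTAMP.finditer(history_text):
        try:
            day = datetime.strptime(match.group(1), "%d %B %Y")
        except ValueError:
            continue  # looked like a date but is not a real one; skip it
        dates.append(day.strftime("%d/%m/%Y"))
    return dates


def edits_per_day(dates):
    """Count how often each date occurs (the role the pivot table plays)."""
    return Counter(dates)


def peak_day(counts):
    """Return the (date, edit_count) pair with the most edits."""
    return max(counts.items(), key=lambda item: item[1])


# Hypothetical sample of pasted revision-history text.
SAMPLE_HISTORY = """\
(cur | prev) 21:05, 22 January 2008 ExampleUserA (talk | contribs) (12,345 bytes)
(cur | prev) 21:07, 22 January 2008 ExampleUserB (talk | contribs) (12,360 bytes)
(cur | prev) 09:15, 23 January 2008 ExampleUserC (talk | contribs) (12,400 bytes)
"""

if __name__ == "__main__":
    counts = edits_per_day(extract_edit_dates(SAMPLE_HISTORY))
    for date, n in counts.items():
        print(date, n)
    print("peak:", peak_day(counts))
```

Counting repeated dates with `Counter` stands in for the Excel pivot table, and `peak_day` gives the per-page maximum that the max-edits-per-day comparison relies on. Month names are parsed with `%B`, which assumes an English locale.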

[Figure: Change in initial (day of death) burst size 2002-09]

Both of these approaches produced interesting graphs. As you can see in the second graph, there is a trend toward an increasing number of edits, but not at a steady rate. Looking at this graph, it is also possible to explain the areas where the graph does not increase by the unfamiliarity of the actor’s name.

When all of the data is combined, it becomes apparent that tracking data over the entire lifespan of Wikipedia is going to show more variation than desired. For example, it is near impossible to compare Jack Lemmon’s 2001 death, with a spike of 1 edit, with Heath Ledger’s spike of 352 edits almost a decade later. A larger sampling of similar subjects in a shorter time period would be preferable.

Focusing on what we can work with, when looking specifically at the two most recent edit bursts (Ledger and Swayze), if one adjusts them to take into consideration one metric for judging an actor’s popularity, the spikes get close, but not quite identical. Using the average box office gross of each actor ($77,639,600 for Ledger, $49,751,800 for Swayze) adjusted for price inflation [6,7], we see the difference in size of edit burst shrink significantly. Ledger’s edits are untouched; Swayze’s edits have been multiplied by a factor of 1.56 ($49m * 1.56 ≈ $77m).

[Figure: Comparison of Heath Ledger and Patrick Swayze (adjusted). Ledger: blue, Swayze: red]

Another of the (undesired) variables inherent in our data is the cause of death/type of death. Our two clearest examples of edit bursts (Ledger and Swayze) have very different circumstances surrounding their deaths. Ledger’s death was unexpected, while Swayze’s health was known to be in decline. Each actor shows a large initial spike near the day of their death, but Ledger also shows a series of spikes following, as more details became known to the public. The public knew most circumstances of Swayze’s passing at the beginning, so edit activity is minimal after the initial spike; there are no aftershocks.

A larger sampling with more spikes occurring in the past two years would be helpful in confirming a correlation between the size and characteristics of an edit spike. Also required is a metric for comparing whatever this new sample is. In this study, the sample is constrained to actors who are recently deceased. Perhaps a future study could compare less decisive events, such as the debut of a film.

Finally, having looked at the revision history activity of so many actors who have died, it is worth looking at an actor still working to get a basis for comparison.

[Figure: [name missing]’s edit frequency in red, trendline is the moving average with a period of 200]

Pitt is one of the best-known actors, but his largest edit burst is just barely over 50, only 1/7th the size of Ledger’s.

[Figure: Moving average with period=200, graphed on a log scale. Blue: Pitt, red: Ledger]

Here is the living actor (Pitt in red) compared to Ledger (blue) on the same graph (log scale). Again we see how this popular actor does not experience turbulent edit spikes in their everyday (celebrity) life.

4. DISCUSSION

While requiring some manual input from the user, this study establishes a workflow and provides the resources needed for it to be repeated across any number of data sets. In order to move forward, one needs only a sample set and a quality metric that can be applied to the entire data set in order to look for a correlation that will level the playing field.

Because only Ledger and Swayze had noteworthy acting careers recent enough to be featured on a website listing their inflation-adjusted average gross, they were the only two eligible to be correlated to their popularity in a measurable way within the scope of this study. Because the sample size consists of two, it would not be wise to claim an established correlation between the volume of Wikipedia response to your death and your popularity as measured by the average gross of your films, but this study establishes the groundwork to easily analyze similar relations.

Furthermore, this study has also circumstantially displayed instances wherein the spike surrounding an actor’s death is far greater than any spikes surrounding an actor’s life. We have also visualized some of the different types of reaction spikes, which could tell the viewer something about how the news reached the public (all at once, or in waves).

5. FUTURE STUDY

When using a sample of resources about humans created by other humans, there are an infinite number of variables to control. Future research might refine this study to focus on controllable aspects in order to have an experiment that is not vulnerable to the many differences that make the reaction to Heath Ledger’s unexpected death very different from the reaction to an actor’s passing of old age, which in turn differs from the reaction that plays out as a celebrity’s ailing health declines in the public eye. Future studies may want to seek a sampling with more in common, such as bursts that occur in the same year, so that the user base of Wikipedia at the time and other temporal aspects are not a variable.

This study looked at edits made only in a quantitative light. The quality of edits made was out of the scope of this study. Future study may make attempts to classify different types of edits, maybe even automatically, as Wikipedia categorizes some types of edits, such as “minor edits.” This way, it would be possible to generate a more dimensional picture of the edits occurring.

6. REFERENCES

[1] Mackey, Robert. 2009. Wikipedia’s Rapid Reaction to Outburst during Obama Speech. The New York Times.

[2] Onion, The. 2008. Area Man Honored To Be The One Who Added Death Date To Heath Ledger’s Wikipedia Page. http://www.theonion.com/articles/area-man-honored-to-be-one-who-added-death-date-to,2386/

[3] Wikipedia. 2011. Vandalism on Wikipedia. http://en.wikipedia.org/wiki/Vandalism_on_Wikipedia

[4] Cohen, Noam. 2009. Wikipedia to Limit Changes to Articles on People.

[5] Guo, Hanning, Scott Weingart, Katy Börner. 2011. Mixed-indicators model for identifying emerging research areas. Scientometrics. 89 (1), 421-435.

[6] Box Office Mojo. 2011. Heath Ledger Movie Box Office Results. http://boxofficemojo.com/people/chart/?id=heathledger.htm

[7] Box Office Mojo. 2011. Patrick Swayze Movie Box Office Results. http://boxofficemojo.com/people/chart/?id=patrickswayze.htm