RecentWikiTrends – A Storm-based tracker to observe trends in Wikipedia article alteration

Gustaf Helgesson (gjrh2) Clare Hall April 4, 2014

1 Introduction

It is widely known that Twitter trends correlate strongly with major recent events around the globe, such as the Arab Spring and the death of pop star Michael Jackson, sometimes even breaking the news [6]. Aside from Twitter’s in-house trending pages, several frameworks have been developed to extract news from the social media site, such as [7]. Here we introduce RecentWikiTrends, an Apache Storm based application which extracts and computes trending Wikipedia articles based on article changes for a given set of languages, in real time. The aim is to use the program to observe correlations between frequent article changes and recent events, similar to what is done in the Twitter systems mentioned earlier. RecentWikiTrends observes trends on both a per-language and a cross-language basis, the latter accomplished by translating the given Wikipedia articles to a common language.

This report discusses Storm, the framework used for the application (Section 2), and the more detailed system design of RecentWikiTrends (Section 3). A test scenario is based on Wikipedia updates following the disappearance of Malaysia Airlines Flight 370 (Section 4). Observed correlations and findings are then discussed (Section 5) and future directions identified (Section 7).

2 Storm

Storm is a data stream processing engine developed at BackType, which was later acquired by Twitter, after which Storm was released as open source [8]. Like other data stream processing systems such as IBM’s SPADE and Apache Samza, Storm attempts to solve the problem of large-scale real-time data processing, something traditionally not handled well by batch-based systems such as Hadoop [10]. Storm was one of the first such frameworks to be released as open source and is used extensively by companies such as Groupon, Alibaba and Twitter [2].

When evaluating streaming frameworks I compared Storm against Samza and S4. I chose Storm over S4 for the additional control it offers; Storm does not, for example, try to dynamically change the cluster setup as S4 does. Apache Samza looked comparable but is rather immature at this stage, and at the time of writing some fault-tolerance mechanisms have yet to be implemented [2].

Storm applications are written explicitly in terms of streams, where the programmer forms a graph of system input components, spouts, and system transformation/output components, bolts. Storm is written in the JVM-based Clojure language and Storm applications can be written either directly in a JVM language or by using Storm’s streaming interface [2]. When developing RecentWikiTrends I chose Java to get more control over the application’s integration with Storm and to allow it to make full use of the built-in libraries.

Figure 1: RecentWikiTrends’s Web interface.

3 System Design

RecentWikiTrends subscribes to a given set of Wikipedia language editions and outputs the top n edited articles over the last m hours using sliding windows. Rankings are made on a per-language and a global basis, where all non-English edits are translated to English before their counts are added. The rankings are kept up to date in a MySQL database which is exposed to the user through the web interface seen in Fig. 1.

3.1 Topology

The Storm topology for RecentWikiTrends is outlined in Fig. 3. Each language has its own pipeline consisting of article extraction, sliding window based counting and a local ranking. To obtain global rankings, each language article counter sends pageids and counts to a Wikipedia translator bolt which translates the pages and emits English article names and counts. The following sections cover individual component design.
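As a rough illustration of this wiring, the topology can be written down as a directed graph of components. This is a plain Python sketch rather than Storm code (the application itself was written in Java); the component names and the `build_topology` helper are hypothetical, and the MySQL output of the rankers is left out.

```python
from collections import defaultdict


def build_topology(languages, common="en"):
    """Return a component graph {source: [destinations]} for the given languages.

    Component names are illustrative, not the application's real identifiers.
    """
    edges = defaultdict(list)
    for lang in languages:
        spout, counter = f"spout-{lang}", f"counter-{lang}"
        edges[spout].append(counter)                    # article changes
        edges[counter].append(f"local-ranker-{lang}")   # per-language ranking
        if lang == common:
            # The common-language counter feeds the global ranker directly.
            edges[counter].append("global-ranker")
        else:
            # Other languages are translated before global ranking.
            translator = f"translator-{lang}"
            edges[counter].append(translator)
            edges[translator].append("global-ranker")
    return dict(edges)


topology = build_topology(["en", "sv"])
```

Adding a language only adds one more spout, counter, local ranker and translator; the global ranker stays a single shared component.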

Figure 2: The diff-based sliding window data transfers. As slots are shifted, the counter emits the new and old slot counts to the ranker. The ranker then adds the new slot and subtracts the old slot from its global map before computing the new rankings.

WikiSpout

A WikiSpout obtains article changes for a given language in JSON format using the Wikipedia public API. It polls the site every 10 seconds and sends the article names, pageids, counts and query time to the sliding window based article counter. Each Wikipedia language edition runs as a separate MediaWiki installation, which means each edition needs to be queried separately.
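A minimal sketch of what one polling step might look like, assuming the MediaWiki `list=recentchanges` query is used. The helper names, the choice of `rcprop` fields, the namespace filter and the sample response (with made-up pageids) are assumptions for illustration; no network request is made here.

```python
import json
from collections import Counter
from urllib.parse import urlencode


def recent_changes_url(lang, limit=500):
    """Build a recent-changes query URL for one Wikipedia language edition."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rcnamespace": 0,   # assumption: only article-namespace edits matter
        "rclimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def count_changes(response_text):
    """Count edits per (pageid, title) in one JSON response."""
    changes = json.loads(response_text)["query"]["recentchanges"]
    return Counter((c["pageid"], c["title"]) for c in changes)


# Hand-written sample in the shape returned by list=recentchanges.
SAMPLE = json.dumps({"query": {"recentchanges": [
    {"type": "edit", "pageid": 1, "title": "Boeing 777",
     "timestamp": "2014-03-08T01:20:00Z"},
    {"type": "edit", "pageid": 1, "title": "Boeing 777",
     "timestamp": "2014-03-08T01:21:00Z"},
    {"type": "edit", "pageid": 2, "title": "2014 Asia Cup",
     "timestamp": "2014-03-08T01:22:00Z"},
]}})

counts = count_changes(SAMPLE)
```

Because each edition is a separate installation, the language code only changes the host name in the URL; the query parameters stay the same.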

Article Counting Bolt

The Article Counting Bolt performs sliding window based counting using a fixed window length and slot length, which together determine the number of slots and the update granularity. Once a slot’s time has passed, the bolt outputs diffs in the form of addition and subtraction counts to its attached ranker(s). Additions are the counts from the head of the window and subtractions are the counts from the old tail slot that has dropped out of the new window. The rankers then update their local copies of the current counts as seen in Fig. 2. A language can make use of multiple Article Counting Bolts if necessary.
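The diff-based counting can be sketched as follows. This is a simplified, single-process Python illustration, not the Java bolt itself; the class and method names are hypothetical, and slot shifts are triggered by an explicit `advance()` call instead of a timer.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts per-article edits in fixed-length slots and emits diffs."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = deque()        # completed slots, oldest first
        self.current = Counter()    # slot currently being filled

    def add(self, article, count=1):
        self.current[article] += count

    def advance(self):
        """Close the current slot and return (additions, subtractions)."""
        additions = self.current
        self.slots.append(additions)
        self.current = Counter()
        subtractions = Counter()
        if len(self.slots) > self.num_slots:
            # The oldest slot has dropped out of the window.
            subtractions = self.slots.popleft()
        return additions, subtractions

    def full_window(self):
        """Total counts over every slot still in the window."""
        total = Counter()
        for slot in self.slots:
            total.update(slot)
        return total


counter = SlidingWindowCounter(num_slots=2)
counter.add("A")
counter.add("A")
d1 = counter.advance()   # window not yet full: nothing to subtract
counter.add("B")
d2 = counter.advance()
counter.add("A")
d3 = counter.advance()   # the first slot ({"A": 2}) drops out
```

Only the newest and the expired slot travel downstream on each shift, which is what keeps the ranker traffic small.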

Local Ranking Bolt

The Local Ranking Bolt maintains state over the current sliding window and updates it with the diffs given by the Article Counting Bolt. After processing a diff it recomputes the top 20 rankings for the current window and writes them to a MySQL database, making them available to the web interface. Only one Local Ranking Bolt is attached to the Article Counting Bolts of a language, so that it has a global view of the counts for that language.
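A minimal Python sketch of a ranker that applies such diffs and recomputes a top-n list is given below. The class name and the tie-breaking by title are assumptions, and the database write is omitted.

```python
from collections import Counter


class DiffRanker:
    """Keeps window totals up to date from diffs and ranks the top articles."""

    def __init__(self, top_n=20):
        self.top_n = top_n
        self.counts = Counter()

    def apply_diff(self, additions, subtractions):
        self.counts.update(additions)
        self.counts.subtract(subtractions)
        # Drop articles whose count has fallen to zero so the map stays small.
        for article in [a for a, c in self.counts.items() if c <= 0]:
            del self.counts[article]
        return self.rankings()

    def rankings(self):
        # Sort by count (descending), then by title for a stable order.
        ordered = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ordered[: self.top_n]


ranker = DiffRanker(top_n=2)
ranker.apply_diff({"A": 2, "B": 1}, {})
top = ranker.apply_diff({"C": 3}, {"A": 2})
```

After the second diff, article A has left the window entirely and no longer occupies space in the ranker's map.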

Figure 3: RecentWikiTrends’s Storm topology. In this example the topology consists of the English and one non-English Wikipedia source.

Translator Bolt

Like the Local Ranking Bolt, the Translator Bolt receives article count diffs from the sliding window counter. The bolt then translates article names to English using Wikipedia’s inter-language link information as exposed in its public API. To reduce the load on Wikipedia’s servers the lookups are performed in batches and a local cache of translations is maintained. After translation the Translator Bolt emits the translated article names and counts to the Global Ranking Bolt. Any article not linked to an English Wikipedia page is omitted from the global rankings.
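The batching and caching in the Translator Bolt can be sketched as below. The lookup itself is injected as a function so that the example runs without network access; in practice it might wrap a MediaWiki `prop=langlinks` query with `lllang=en`, but that, the batch size of 50 and all names here are assumptions.

```python
class TitleTranslator:
    """Maps page ids to English titles with batching and a local cache."""

    def __init__(self, fetch_batch, batch_size=50):
        self.fetch_batch = fetch_batch   # callable: [pageid] -> {pageid: title}
        self.batch_size = batch_size
        self.cache = {}                  # pageid -> English title, None if unlinked

    def translate(self, counts):
        """Turn {pageid: count} into {english_title: count}, dropping unlinked pages."""
        missing = [p for p in counts if p not in self.cache]
        for i in range(0, len(missing), self.batch_size):
            batch = missing[i : i + self.batch_size]
            found = self.fetch_batch(batch)
            for pageid in batch:
                self.cache[pageid] = found.get(pageid)
        translated = {}
        for pageid, count in counts.items():
            title = self.cache[pageid]
            if title is not None:
                translated[title] = translated.get(title, 0) + count
        return translated


calls = []


def fake_fetch(pageids):
    """Stand-in for the real lookup; page ids 10 and 11 are made up."""
    calls.append(list(pageids))
    links = {10: "Malaysia Airlines Flight 370"}
    return {p: links[p] for p in pageids if p in links}


translator = TitleTranslator(fake_fetch)
first = translator.translate({10: 4, 11: 2})
second = translator.translate({10: 1})   # served from the cache
```

Caching the negative result for unlinked pages avoids asking the API about the same missing link on every diff.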

Global Ranking Bolt

The Global Ranking Bolt receives tuple diffs from all Translator Bolts and the English Article Counting Bolt. To ensure fair input from all sources it stores a list of diff messages per language and does not output new rankings until each language has had its diff delivered. This is necessary because the English Wikipedia recent changes feed updates more frequently than other editions, normally at more than 10x the rate of the Swedish one, for example. In our evaluations each article carries equal weight regardless of language.
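One way to express this "wait for every language" rule is a small barrier that queues diffs per language and releases a merged diff only when each queue is non-empty. The sketch below is an assumption about how this could be structured, not the report's actual implementation; all names are hypothetical.

```python
from collections import Counter, deque


class GlobalDiffBarrier:
    """Queues diffs per language and releases one round once all are present."""

    def __init__(self, languages):
        self.pending = {lang: deque() for lang in languages}

    def offer(self, lang, additions, subtractions):
        """Queue a diff; return a merged (additions, subtractions) or None."""
        self.pending[lang].append((additions, subtractions))
        if not all(self.pending.values()):
            return None   # at least one language has not reported yet
        merged_add, merged_sub = Counter(), Counter()
        for queue in self.pending.values():
            add, sub = queue.popleft()
            merged_add.update(add)
            merged_sub.update(sub)
        return merged_add, merged_sub


barrier = GlobalDiffBarrier(["en", "sv"])
r1 = barrier.offer("en", {"Boeing 777": 3}, {})
r2 = barrier.offer("en", {"Boeing 777": 1}, {})   # English runs ahead
r3 = barrier.offer("sv", {"Boeing 777": 2}, {})   # first complete round
```

The second English diff stays queued until the next Swedish diff arrives, so a fast source cannot advance the global window on its own.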

3.2 Fault-Tolerance

Storm’s message passing guarantees are implemented by building a tuple dependency tree, where a tuple is acknowledged and deleted only once it has been processed throughout the system [3]. This is, however, poorly suited to a sliding window environment, as tuples either remain dependencies for the entire length of the window, in our case 6 hours, or need to be constantly replayed from the spout until their window has passed. In addition to the sliding window, the diff-based counter update solution puts additional strain on the system in the event of a fault, as large amounts of state need to be preserved for each of the system bolts.

To solve this I added my own fault-tolerance to the Storm components. On a ranker failure the newly started ranker can retrieve lost sliding window information by querying the sliding window counter for the entire dataset (see Fig. 4). The handling of sliding counter failures is more complicated and is outlined in Fig. 5. Translation nodes are, caching aside, stateless and can simply be restarted by the Storm cluster as necessary. After a fault, a newly started Global Ranker can request translator nodes to re-emit the full sliding window, similar to what is done on Local Ranker failure.

Figure 4: Illustration of how ranker failures are handled in RecentWikiTrends. The ranker normally only sees the differences in counts between sliding windows before computing new ranks. Upon failure this information is lost and needs to be retrieved. This is done by a feedback loop from the ranker to the sliding window counter, where the ranker asks for the full sliding window using a special message.
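The ranker recovery path can be sketched in a few lines of Python. The function names are hypothetical and the "special message" is reduced to passing the counter's slots directly; the point is only that summing all live slots restores the state the diffs would have produced.

```python
from collections import Counter


def recover_ranker(window_slots):
    """Rebuild a ranker's counts from every slot still held by the counter."""
    counts = Counter()
    for slot in window_slots:      # reply to the "full counts" request
        counts.update(slot)
    return counts


def apply_diff(counts, additions, subtractions):
    """Apply one diff in place and return a copy without empty entries."""
    counts.update(additions)
    counts.subtract(subtractions)
    return +counts                 # unary plus drops zero and negative entries


# State held by the sliding window counter: three slots, oldest first.
slots = [Counter({"A": 2}), Counter({"B": 1}), Counter({"A": 1, "C": 4})]

restarted = recover_ranker(slots)
# Next shift: a new slot arrives and the oldest slot ({"A": 2}) drops out.
after_shift = apply_diff(restarted, {"B": 2}, slots[0])
```

Once the full window has been added, the restarted ranker can go back to consuming ordinary diffs.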

Figure 5: In the event of a sliding counter failure the new sliding counter is unaware of the subtraction part of the diff needed to update the ranker. Without further action this will lead to the ranker keeping counts of lost data indefinitely, which will likely result in incorrect rankings. By keeping external state of the last processed counter in Zookeeper, the newly started Article Counting Bolt can compute the time of its missing tail slot. The bolt can then retrieve the recent changes for this time period directly from Wikipedia and add them to the subtraction diff for its ranker, which can carry on as if nothing happened.
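The tail-slot computation might look like the following sketch. It assumes the externally stored timestamp marks the end of the newest slot emitted before the failure, and it uses a 6 hour window with 5 minute slots purely as example values. The query parameters follow the MediaWiki `list=recentchanges` module; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def missing_tail_slot(last_output, window, slot):
    """Return (start, end) of the slot that leaves the window at the next shift.

    last_output: end time of the newest slot the failed counter had emitted,
    as read back from external state.
    """
    tail_start = last_output - window
    return tail_start, tail_start + slot


def tail_query_params(start, end):
    """Recent-changes parameters covering [start, end], oldest first."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "action": "query",
        "list": "recentchanges",
        "rcdir": "newer",            # enumerate from rcstart forward in time
        "rcstart": start.strftime(fmt),
        "rcend": end.strftime(fmt),
        "rcprop": "title|ids|timestamp",
        "format": "json",
    }


last_output = datetime(2014, 3, 8, 12, 0, tzinfo=timezone.utc)
start, end = missing_tail_slot(last_output, timedelta(hours=6), timedelta(minutes=5))
params = tail_query_params(start, end)
```

The counts fetched for this interval become the subtraction half of the next diff, so the ranker never sees that the counter was restarted.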

4 Evaluation

To evaluate the use of Wikipedia as a real-time information source I simulated a 24 hour run on March 8th 2014, the date Malaysia Airlines Flight MH370 went missing. Reasons for choosing this event include that it happened within the last 30 days (the maximum history for recent changes exposed by Wikipedia [9]), that it was a major news event, and that it had a well-defined release time, as it was announced in a press release by Malaysia Airlines. The expectation is that the article on the event will become frequently edited shortly after the event, and that related articles will see frequent modifications across all language editions, given the media attention and reader interest in the Wikipedia article. I chose the English, Spanish, German and Swedish Wikipedia editions for evaluation because of their different popularities and time zones and my own language abilities. In addition to evaluating Wikipedia changes as an event tracker I also evaluate the use of Storm as a framework for the application and the performance of the application.

Hardware

The original plan was to run the experiment on a cluster of machines on Amazon EC2, but running the application locally on my personal laptop turned out to be sufficient for the testing scenario. The machine has a 3.2GHz dual-core Intel i5 with 4GB of RAM and runs Arch Linux.

5 Results

5.1 Wikipedia recent changes as an event tracker

The disappearance of flight MH370 was announced at 23:24 UTC on March 7th by Malaysia Airlines [1]. The English Wikipedia article was created only an hour and 13 minutes later, at 00:37 [5]. RecentWikiTrends first observes it as a top 20 trending article in the English Wikipedia at 01:20, and it propagates to the global top 20 by 01:32, as can be seen in Fig. 6. The other language editions observed were significantly slower at establishing an article, with Spanish being the second at 02:54 UTC. The next article to be established was the German one at 08:35, followed by the Swedish one at 11:25 [5]. One possible explanation for these differences is time zones: native speakers of English and Spanish are spread across the globe, while German and Swedish speaking countries observe Central European Time (CET) and the announcement of the disappearing flight was not made until 01:37 local time.

The tables below show all articles ranked 1st and 2nd in the sliding window counts throughout March 8th for the per-language as well as the global rankings. Many of them relate to the MH370 flight, such as the Boeing 777 article (the type of the aircraft [1]). Other recent events found in these recent changes are the 2014 Crimean crisis, the 2014 Asia Cup (this day was the last day of the tournament) and the Swedish national Eurovision Song Contest selection, Melodifestivalen 2014, which had its final on this date.

Figure 6: Trending article observations of Malaysia Flight 370 on March 8, following the Malaysia Airlines announcement at 23:24 UTC on March 7. The plot, titled “Recent change rankings for Malaysia Flight 370”, shows rank (1 to 20) against time on March 8 2014 (UTC) for the Spanish, Swedish, English, German and global rankings.

Global: Boeing 777; 2014 Crimean crisis; United States men’s national soccer team; St. Mary’s City, Maryland; Malaysia Airlines Flight 370

English: Asia Cup; Right Sector; 2014 Asia Cup; Raw image format; 1963 Chualar bus crash; 1954 transfer of Crimea; St. Mary’s City, Maryland; Malaysia Airlines Flight 370

German: Mobilmachung; Rooftop Concert; Berlin Alexanderplatz (Roman); Malaysia-Airlines-Flug 370; Robert Lehmann (Chemiker); Kennedyallee (Bonn); Genietruppe; Krimkrise 2014; Skeptikerbewegung; Boeing 777; Heiliggeistkirche (Speyer); Abd ar-Razzaq Muhyi ad-Din; Wok-WM

Spanish: Selección de fútbol de los Estados Unidos; Joel Robles; Tony Carrasco; Vuelo 370 de Malaysia Airlines; Juegos Suramericanos de 2014; Liga de Fútbol Norte de Santa Cruz 2014; Los Imaginadores; Crisis de Crimea de 2014; Fundación Albacete Nexus Energía; Baldrick II; Ruizanglada; Wolbodo de Lieja; Torneo Clausura 2014 (México); Adilia Castillo

Swedish: Qin (stat); Emilia Elisabeth av Pfalz; Adhamiyah; Kawa Garmeyani; Malaysia Airlines Flight 370; Anchiphiloscia pallida; Valettiopsis dentatus; Moldaviska SSR; Vitryska SSR; Morsealfabetet; Social rörlighet; Rödingar; Pi; Maria Pietilä Holmner; Serhiy Kvit; Melodifestivalen 2014

Based on our data, a trending article appears likely to relate to a big, recent event. The question then arises whether the converse is true, that is, will big recent events end up trending? The Trial of failed to show up despite being among the top stories in news sources such as CNN and regarding witnesses from the fifth day of the trial [4]. The Wikipedia article appears to have numerous edits throughout the day, but the changes are larger and more spread out than in the MH370 case. In this case RecentWikiTrends failed to pick up the news item.

Global Rankings

We observe a strong correlation between articles trending in the global rankings and the English Wikipedia trends. In our setup each article carries equal weight, and given that the English Wikipedia has more changes than other editions this is expected. For the purpose of detecting recent events this is not a drawback, however. For example, the MH370 flight trended in the English edition more than 5 hours before trending in the other language editions considered.

One issue with the global rankings arises when there is no link from a non-English article to its English version. Because these links serve as the translation function, any non-linked non-English articles are discarded from the global counts. In particular I observed multiple cases where trending Spanish pages had an English equivalent without being explicitly linked, meaning the article did not get a fair count in the global ranking process. This could be improved by using, for example, Google Translate when no link exists.

5.2 Performance

With the framework running smoothly on the test machine at real-time rates, I increased the processing speed of the past data to emulate a larger dataflow. The framework keeps up at 10x the processing speed of real time without any caveats, and at 15-20x speedups only the English and global rankings see minor degradation in service, namely a 5 second delay compared to the other languages after 24 hours. The program appears to scale well, and in a distributed setting Storm would hopefully place each language-based pipeline on a single machine, meaning the network would only carry global rankings traffic, which itself consists only of the diffs of two 5 minute counter slots.

By emitting only diffs, data transfer is reduced by up to 180x between sliding counters and local rankers or translators, and between translators and the global ranker. This is because our 6 hour sliding window contains 360 slots, and with diff-based counting we only need to emit the first and last counters.

Running the four languages and global trend real-time rankers requires only 300-500MB of memory at any given point. This is because the amount of streaming data is relatively small and the data is only kept for a few hours before being discarded. The CPU usage is relatively low, around 25% per core when running at a 15x speedup, and the bandwidth used to query for and receive Wikipedia data is 6 and 25 KiB/s on average respectively.
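The 180x figure follows from simple arithmetic, shown here as a small Python sketch. It assumes that slots carry comparable amounts of data, so the result is an upper bound, not a measured value; the function name is hypothetical.

```python
def diff_reduction_factor(window_slots, slots_per_diff=2):
    """How many times less data a diff carries than the full window.

    Assumes every slot is roughly the same size.
    """
    return window_slots / slots_per_diff


# 360 slots in the window, of which only the newest and the expired
# slot are emitted on each shift.
factor = diff_reduction_factor(360)
```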

5.3 Storm as a framework for RecentWikiTrends

Developing the application in Storm took significantly longer than expected given the fairly straightforward topology design and convenient deployment tools such as Leiningen, Maven and the storm-deploy tool for setting up Storm on Amazon EC2. The framework lacked helper libraries and I ended up having to implement sliding windows and slot-based counters on my own, something which surprised me given Storm’s popularity and the frequency of sliding windows in stream processing.

Storm provides good fault-tolerance mechanisms and message guarantees for general process-and-discard programs through its tree-based dependency graph. As described in Section 3.2, however, this is not well suited to sliding window applications, and I ended up having to implement a large part of the fault-tolerance mechanisms for RecentWikiTrends myself.

The control offered when constructing the streaming graph turned out to be incredibly useful, with the ability to limit parallelisation and control stream distribution directly. Getting the application to run was simple as well, and it is portable, with Maven able to download and install the dependencies required to run the program. Similarly, writing and running tests was easy and allowed several bugs in the diff-based counting to be eliminated.

The Storm-provided untyped tuple and values datatypes turned out to be useful in setting up the program and reduced the time required to set up transfer types. Retrieving data by index appears to be the standard way of reading these, which seems unreliable in the event of schema changes, but each stream has declared field names which can be used to retrieve the data, largely overcoming this problem.

Storm’s debugging option displays all log messages and tuple transfer details from the topology on the executor, something that vastly simplifies debugging of the distributed system. One problem with this approach, however, is the overwhelming amount of data produced even with few workers. This was overcome by piping the log data to a file which is then tailed and grepped for content of interest.
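The point about index-based versus name-based access to untyped tuples can be illustrated with a toy model. This is a Python stand-in, not Storm's tuple class; the `StreamTuple` type, its methods and the pageid value are hypothetical.

```python
class StreamTuple:
    """Untyped values paired with the field names declared for their stream."""

    def __init__(self, fields, values):
        if len(fields) != len(values):
            raise ValueError("fields and values must have the same length")
        self.fields = list(fields)
        self.values = list(values)

    def by_index(self, index):
        return self.values[index]

    def by_field(self, name):
        return self.values[self.fields.index(name)]


old = StreamTuple(["title", "count"], ["Boeing 777", 12])
# A schema change inserts a new field at the front of the stream.
new = StreamTuple(["pageid", "title", "count"], [42, "Boeing 777", 12])
```

Code that reads position 0 silently starts receiving the page id after the change, while code that asks for the `title` field keeps working.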

6 Conclusion

While not reacting as quickly as Twitter to recent events, we observe frequent updates on news-related articles, as in the case of the Malaysia Airlines Flight 370 disappearance evaluated here. It would be interesting to run tests on other historical events, both larger and smaller, to gain further confidence in this method of detecting events. In particular it would be interesting to compare against a historical event where most of the information was released at once, as this could lead to fewer, larger edits of the Wikipedia article instead of many easily detectable small changes, as was the case with the constantly reported new findings regarding Malaysia Airlines Flight 370. This may be partly solvable by taking update size into account, however.

The use of a distributed streaming framework was a natural fit for the real-time nature of the problem, but the amount of data was easily handled on the dual-core Intel i5 commodity machine used for the evaluation, and as such a distributed framework may have been excessive for this scenario.

7 Future Directions

The RecentWikiTrends application could easily be modified to take updates from sources of data other than Wikipedia. An interesting extension of the program in this direction would be to combine several data sources, potentially with different weights, to list current events happening in the world more accurately.

References

[1] Malaysia Airlines. Saturday, March 08, 07:30 AM MYT +0800 Media Statement - MH370 incident released at 7.24am. http://www.malaysiaairlines.com/my/en/site/dark-site.html. Accessed: 2014-04-03.

[2] Apache Storm. Apache Storm website. http://storm.incubator.apache.org/. Accessed: 2014-04-02.

[3] Apache Storm. Storm guaranteeing message processing.

[4] Archive.org. Internet Archive Wayback Machine. http://archive.org/web/. Accessed: 2014-04-03.

[5] Various Authors. Wikipedia, the free encyclopedia. http://en.wikipedia.org/. Accessed: 2014-04-02.

[6] Karl Hodge. 10 news stories that broke on Twitter first. TechRadar, September 2010. http://www.techradar.com/news/world-of-tech/internet/10-news-stories-that-broke-on-twitter-first-719532.

[7] Kristina Lerman and Rumi Ghosh. Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. ICWSM, 10:90–97, 2010.

[8] Nathan Marz. About me – thoughts from the red planet. http://nathanmarz.com/about/.

[9] MediaWiki. API:Recentchanges. http://www.mediawiki.org/wiki/API:Recentchanges. Accessed: 2014-03-39.

[10] Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin. Fast data in the era of big data: Twitter’s real-time related query suggestion architecture. In Proceedings of the 2013 international conference on Management of data, pages 1147–1158. ACM, 2013.
