RecentWikiTrends: A Storm-based tracker to observe trends in Wikipedia article alteration

Gustaf Helgesson (gjrh2)
Clare Hall
April 4, 2014

1 Introduction

It is widely known that Twitter trends correlate strongly with major recent events around the globe, such as the Arab Spring and the death of pop star Michael Jackson, sometimes even breaking the news [6]. Aside from Twitter's in-house trending pages, several frameworks have been developed to extract news from the site, such as [7].

Here we introduce RecentWikiTrends, an Apache Storm-based application which extracts and computes trending Wikipedia articles in real time, based on article changes for a given set of languages. The aim is to use the program to observe correlations between frequent article changes and recent events, similar to what is done in the Twitter systems mentioned above. RecentWikiTrends observes trends on both a per-language and a cross-language basis, the latter accomplished by translating the given Wikipedia articles to a common language.

This report discusses Storm, the framework used for the application (Section 2), and the more detailed system design of RecentWikiTrends (Section 3). A test scenario is based on Wikipedia updates following the disappearance of Malaysia Airlines Flight 370 (Section 4). Observed correlations and findings are then discussed (Section 5) and future directions identified (Section 7).

2 Storm

Storm is a data stream processing engine originally developed at BackType; after Twitter acquired BackType, Storm was released as open source [8]. Similarly to other data stream processing systems such as IBM's SPADE and Apache Samza, Storm attempts to solve the problem of large-scale real-time data processing, something traditionally not handled well by batch-based systems such as Hadoop [10]. Storm was one of the first such frameworks to be released as open source and is used extensively by companies such as Groupon, Alibaba and Twitter [2].

When evaluating streaming frameworks I compared Storm against Samza and S4. I chose Storm over S4 for the additional control it offers; Storm does not, for example, try to dynamically change the cluster setup as S4 does. Apache Samza looked comparable but is rather immature at this stage; at the time of writing, some fault-tolerance mechanisms have yet to be implemented [2].

Figure 1: RecentWikiTrends's web interface.

Storm applications are written explicitly in terms of streams, where the programmer forms a graph of system input components (spouts) and system transformation/output components (bolts). Storm itself is written in Clojure, a JVM-based language, and Storm applications can be written either directly in a JVM language or through Storm's streaming interface [2]. When developing RecentWikiTrends I chose to use Java to get more control over the application's integration with Storm, as well as to allow it to make full use of the built-in libraries.

3 System Design

RecentWikiTrends subscribes to a given set of Wikipedia language editions and outputs the top n most edited articles over the last m hours using sliding windows. Rankings are computed on both a per-language and a global basis, where all non-English edits are translated to English before their counts are added. The rankings are kept up to date in a MySQL database, which is exposed to the user through the web interface seen in Fig. 1.

3.1 Topology

The Storm topology for RecentWikiTrends is outlined in Fig. 3.
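As a rough illustration, a topology of this shape could be wired up with Storm's Java API roughly as follows. This is only a sketch: the component class names mirror the spout and bolts described below, but the constructors, groupings, parallelism hints and window parameters are assumptions rather than the report's actual code.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class RecentWikiTrendsTopology {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // English pipeline: spout -> sliding-window counter -> local ranker.
            // Grouping on the page ID keeps all edits of an article at one counter task.
            builder.setSpout("en-spout", new WikiSpout("en"));
            builder.setBolt("en-counter", new ArticleCountingBolt(6 * 3600, 600), 2)
                   .fieldsGrouping("en-spout", new Fields("pageid"));
            builder.setBolt("en-ranker", new LocalRankingBolt(20))
                   .globalGrouping("en-counter");

            // One non-English pipeline (here Swedish) with an extra translation step.
            builder.setSpout("sv-spout", new WikiSpout("sv"));
            builder.setBolt("sv-counter", new ArticleCountingBolt(6 * 3600, 600), 2)
                   .fieldsGrouping("sv-spout", new Fields("pageid"));
            builder.setBolt("sv-ranker", new LocalRankingBolt(20))
                   .globalGrouping("sv-counter");
            builder.setBolt("sv-translator", new TranslatorBolt("sv"))
                   .shuffleGrouping("sv-counter");

            // The global ranker merges English counts with translated non-English counts.
            builder.setBolt("global-ranker", new GlobalRankingBolt(20))
                   .globalGrouping("en-counter")
                   .globalGrouping("sv-translator");

            // Local test cluster; a real deployment would submit via StormSubmitter.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("RecentWikiTrends", new Config(),
                    builder.createTopology());
        }
    }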
Each language has its own pipeline consisting of article extraction, sliding window based counting and a local ranking. To obtain global rankings, each language's article counter sends page IDs and counts to a Wikipedia translator bolt, which translates the pages and emits English article names and counts. The following sections cover the design of the individual components.

Figure 2: The diff-based sliding window data transfers. As slots are shifted, the counter emits the new and old slot counts to the ranker. The ranker then adds the new slot and subtracts the old slot from its global map before computing the new rankings.

WikiSpout

A WikiSpout obtains article changes for a given language in JSON format using the public Wikipedia API. It pulls data from the site every 10 seconds and sends the article names, page IDs, counts and the query time to the sliding-window-based article counter. Each Wikipedia language runs as a separate MediaWiki installation, which means each language edition has to be queried separately.

Article Counting Bolt

The Article Counting Bolt performs sliding window based counting using a fixed window length and slot length, which together determine the number of slots and the update granularity. Once a slot's time has passed, the bolt outputs diffs in terms of addition and subtraction counts to its attached ranker(s). Additions are the counts from the head of the window and subtractions are the counts from the old tail dropping off the new window. The rankers then update their local copies of the current counts, as seen in Fig. 2. A language can make use of multiple Article Counting Bolts if necessary.

Local Ranking Bolt

The Local Ranking Bolt maintains state over the current sliding window and updates it with the diffs given by the Article Counting Bolt. After processing a diff it recomputes the top 20 rankings for the current window before writing them to a MySQL database, making them available for consumption at the web interface. Only one Local Ranking Bolt is attached to the Article Counting Bolts of a given language, so that it has a complete view of the counts for that language.

Translator Bolt

Similarly to the Local Ranking Bolt, the Translator Bolt receives article count diffs from the sliding window counter. The bolt then translates article names to English using Wikipedia's inter-language link information, as exposed in its public API. To reduce the load on Wikipedia's servers the lookups are performed in batches and a local cache of translations is maintained. After translation, the Translator Bolt emits the translated article names and counts to the Global Ranking Bolt. Any article not linked to an English Wikipedia page is omitted from the global rankings.

Figure 3: RecentWikiTrends's Storm topology. In this example the topology consists of the English edition and one non-English Wikipedia source.

Global Ranking Bolt

The Global Ranking Bolt receives tuple diffs from all Translator Bolts and the English Article Counting Bolt. To ensure fair input from all sources it stores a list of diff messages per language and does not output new rankings until every language has had its diff delivered. This is necessary because the English Wikipedia's recent changes feed updates far more frequently than those of the other editions, normally seeing more than ten times as many changes as the Swedish edition, for example. In our evaluation, articles are weighted equally regardless of language.
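To make the diff mechanism of Fig. 2 concrete, the following is a minimal, Storm-independent sketch of the slot rotation an Article Counting Bolt could perform. The class and method names are illustrative and are not taken from the report's code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Minimal sketch of the slot-based sliding window used for article counting. */
    public class SlidingWindowCounter {
        private final List<Map<String, Long>> slots = new ArrayList<Map<String, Long>>();
        private int head = 0; // index of the slot currently being filled

        public SlidingWindowCounter(int numSlots) {
            for (int i = 0; i < numSlots; i++) {
                slots.add(new HashMap<String, Long>());
            }
        }

        /** Record one change to the given article in the current slot. */
        public void increment(String article) {
            Map<String, Long> slot = slots.get(head);
            Long current = slot.get(article);
            slot.put(article, current == null ? 1L : current + 1);
        }

        /**
         * Called when the current slot's time has passed. Returns the diff to
         * emit to the rankers: the finished head slot as additions and the
         * expiring tail slot as subtractions. The tail is then cleared and
         * becomes the new head.
         */
        public WindowDiff rotate() {
            Map<String, Long> additions = new HashMap<String, Long>(slots.get(head));
            int tail = (head + 1) % slots.size();
            Map<String, Long> subtractions = new HashMap<String, Long>(slots.get(tail));
            slots.get(tail).clear();
            head = tail;
            return new WindowDiff(additions, subtractions);
        }

        /** Diff emitted to the ranking bolts. */
        public static class WindowDiff {
            public final Map<String, Long> additions;
            public final Map<String, Long> subtractions;

            public WindowDiff(Map<String, Long> additions, Map<String, Long> subtractions) {
                this.additions = additions;
                this.subtractions = subtractions;
            }
        }
    }

A ranking bolt receiving such a diff would add the addition counts to, and subtract the subtraction counts from, its own window map before recomputing its top-20 list.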
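The report does not show how the translation lookups are issued. A plausible sketch, assuming the MediaWiki langlinks API (prop=langlinks with lllang=en) is what is meant by the inter-language link information, is given below; JSON parsing and cache population are left out, and the class name is hypothetical.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Scanner;

    /** Sketch of a batched, cached title translation lookup against the MediaWiki API. */
    public class TitleTranslator {
        private final String lang;                                  // source language, e.g. "sv"
        private final Map<String, String> cache = new HashMap<String, String>();

        public TitleTranslator(String lang) {
            this.lang = lang;
        }

        /** Fetch the English titles linked from a batch of source-language titles. */
        public String fetchLangLinks(List<String> titles) throws IOException {
            StringBuilder joined = new StringBuilder();
            for (String title : titles) {
                if (cache.containsKey(title)) {
                    continue;                                       // already translated
                }
                if (joined.length() > 0) {
                    joined.append('|');                             // API batch separator
                }
                joined.append(title);
            }
            if (joined.length() == 0) {
                return "";                                          // everything was cached
            }
            String url = "https://" + lang + ".wikipedia.org/w/api.php"
                    + "?action=query&prop=langlinks&lllang=en&lllimit=max&format=json"
                    + "&titles=" + URLEncoder.encode(joined.toString(), "UTF-8");

            InputStream in = new URL(url).openStream();
            Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A");
            String json = scanner.hasNext() ? scanner.next() : "";
            scanner.close();
            return json;  // a real implementation would parse this and fill the cache
        }
    }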
3.2 Fault-Tolerance

Storm's message passing guarantees are implemented by building a tuple dependency tree, where a tuple is acknowledged and deleted only once it has been processed throughout the system [3]. This is, however, poorly suited to a sliding window environment, as tuples either remain dependencies for the entire length of the window, in our case 6 hours, or need to be constantly replayed from the spout until their window has passed. In addition to the sliding window itself, the diff-based counter update scheme puts further strain on the system in the event of a fault, as large amounts of state need to be preserved for each of the system's bolts.

To solve this I added my own fault-tolerance to the Storm components. On a ranker failure, the newly started ranker can retrieve the lost sliding window information by querying the sliding window counter for the entire dataset (see Fig. 4). The handling of sliding counter failures is more complicated and is outlined in Fig. 5. Translation nodes are, caching aside, stateless and can simply be restarted by the Storm cluster as necessary. After a fault, a newly started Global Ranker can request that the translator nodes re-emit the full sliding window, similar to what is done on a Local Ranker failure.

Figure 4: Illustration of how ranker failures are handled in RecentWikiTrends. The ranker normally only sees the differences in counts between sliding windows before computing new ranks. Upon failure this information is lost and needs to be retrieved. This is done by a feedback loop from the ranker to the sliding window counter, where the ranker asks for the full sliding window using a special message.

Figure 5: In the event of a sliding counter failure, the new sliding counter is unaware of the subtraction part of the diff needed to update the ranker. Without further action this would lead to the ranker keeping counts of lost data indefinitely, which would likely result in incorrect rankings. By keeping external state about the timestamp of the last processed slot in Zookeeper, the newly started Article Counting Bolt can compute the time of its missing tail slot. The bolt can then retrieve the recent changes for this time period directly from Wikipedia and add them to the subtraction diff for its ranker, which can carry on as if nothing had happened.
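The report only states that external state is kept in Zookeeper. The sketch below shows one plausible way to checkpoint and recover the last processed slot timestamp using the plain ZooKeeper client; the znode path, data format and client choice are assumptions, not the report's implementation.

    import java.io.IOException;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    /**
     * Sketch of keeping a counter's last-emitted slot timestamp in ZooKeeper so a
     * restarted Article Counting Bolt can work out which tail slot it is missing.
     * Parent znodes are assumed to exist already.
     */
    public class SlotCheckpoint {
        private final ZooKeeper zk;
        private final String path;   // e.g. "/recentwikitrends/sv/last-slot" (hypothetical)

        public SlotCheckpoint(String connectString, String path) throws IOException {
            this.zk = new ZooKeeper(connectString, 10000, new Watcher() {
                public void process(WatchedEvent event) { /* connection events ignored in this sketch */ }
            });
            this.path = path;
        }

        /** Record the end timestamp (epoch millis) of the slot that was just emitted. */
        public void store(long slotEndMillis) throws KeeperException, InterruptedException {
            byte[] data = Long.toString(slotEndMillis).getBytes();
            if (zk.exists(path, false) == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1);    // -1: ignore znode version
            }
        }

        /** Read the checkpoint after a restart, or -1 if none has been written yet. */
        public long load() throws KeeperException, InterruptedException {
            if (zk.exists(path, false) == null) {
                return -1L;
            }
            return Long.parseLong(new String(zk.getData(path, false, null)));
        }
    }

On restart, the bolt would compare the stored timestamp with the current time to determine which tail slot it missed, fetch that period from Wikipedia's recent changes feed, and emit it as the subtraction half of its next diff, as described in Fig. 5.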