George Mason University Technical Reports
Department of Computer Science
4400 University Drive MS#4A5, Fairfax, VA 22030-4444 USA
http://cs.gmu.edu/ · 703-993-1530

Predicting Network Response Times Using Social Information

Chen Liang, Sharath Hiremagalore, Angelos Stavrou, Huzefa Rangwala
[email protected] [email protected] [email protected] [email protected]
Technical Report GMU-CS-TR-2010-18

Abstract

Social networks and discussion boards have become a significant outlet where people communicate and express their opinions freely. Although the social networks themselves are usually well provisioned, the participating users frequently point to external links to substantiate their discussions. Unfortunately, the sudden heavy traffic load imposed on the external, linked web sites causes them to become unresponsive, leading to what people call the "Flash Crowd" effect. Flash crowds present a real challenge, as it is not possible to predict their intensity or time of occurrence. Moreover, although increasingly capable, most present-day web hosting servers and caching systems are designed to handle only a nominal load of requests before they become unresponsive. This can happen due to either the limited bandwidth or the limited processing power allocated to the hosting site.

In this paper, we quantify the prevalence of flash crowd events for a popular social discussion board (Digg). Using PlanetLab, we measured the response times of 1289 unique popular websites. We were able to verify that 89% of the popular URLs suffered variations in their response times. In an effort to identify flash crowds ahead of time, we evaluate and compare traffic forecasting mechanisms. We show that predicting network traffic using network measurements has very limited success and cannot be used for large-scale prediction. However, by analyzing the content and structure of the social discussions, we were able to classify 86% of the popular web sites within 5 minutes of their submission, and 95% of the sites when more (5 hours) of social content became available. Our work indicates that we can effectively leverage social activity to forecast network events that would otherwise be infeasible to anticipate.

1 Introduction

Public discussion boards have become popular over the years due to their crowd-sourcing nature. Indeed, their members have the ability to post and express their opinions anonymously on stories that are shared publicly. The popularity of these stories is voted upon by other anonymous readers who are also members of the discussion board. Over the last few years, several sites, such as Digg [23], [11], and Delicious [24], have offered these services. Through these sites, users organize, share, and discuss interesting references to externally hosted content and other websites.

There has been a plethora of research that focuses on analyzing the discussion network structure [34] and user relationships [20, 2], and even on using the social network as an anti-spam and defense mechanism. One aspect of discussion boards that has received less research attention is the effect they have on externally hosted websites. Indeed, public discussion boards and crowd-sourcing sites can cause the instantaneous popularity of a website, owing to discussions in or posts of other users, known as the "Flash Crowd" effect: a steep and sudden surge in the amount of traffic seen at these sites. As a result, these unanticipated flash crowds in the network traffic may cause a disruption in the existing communication infrastructure and disrupt the services provided by the website. But how prevalent is this "Flash Crowd" phenomenon?

We show that a large portion of the websites that become popular through stories on public discussion boards suffer from the flash crowd phenomenon. These websites exhibit high latency and response time variation as they become increasingly popular. To support our hypothesis, we measured, periodically and over a long period of time, the download times for all the external URLs that were submitted to a social discussion board, using many network vantage points. We used PlanetLab, a distributed platform that provides servers located all over the globe. The external websites' response times were measured concurrently on several PlanetLab nodes across North America. Computing the changes in website response time from different locations eliminates the bias introduced by observing measurements at a single location. Then, we computed the correlation between the variation in the measured network latency and the popularity increase of each website linked to the social discussion board. We were able to confirm that 89% of the popular URLs were adversely affected, with 50% having correlation values above 0.7. This is a significant portion of the submitted URLs and warrants investigation into techniques to predict these sudden spikes of traffic ahead of time.

Ordinarily, it would be possible to forecast the load on a server by observing the trends in network latency over a period of time. However, we show that the impulse in network traffic caused by flash crowds is difficult to predict using network measurements and, therefore, allows little time to take any preventive action, such as temporarily adding capacity or rate-limiting the existing users. However, by analyzing the content and structure of the social discussions, we were able to classify 86% of the popular web sites within 5 minutes of their submission, and 95% of the sites when more (5 hours) of social content became available.

For our experiments we use Digg [23], a popular social bookmarking tool. Digg allows users to share news, images, and videos as stories among themselves. Early reader information helps us to predict the popularity of a story. When a top story is prominently displayed on Digg, more users read it. Stories are categorized into topics, some being more popular than others. Relationships between users in Digg are implicitly created when users read, comment on, or rate each other's stories. Digg associates a Digg Number with each story to keep track of the number of users who like or dislike it. Using these factors, as well as information about the users, we are able to predict the popularity of a given story. Our work demonstrates that we can effectively leverage measurements of social activity to forecast network events that would otherwise be infeasible to measure or respond to relying only on network measurements.

The rest of this paper is organized as follows. Section 2 provides the problem motivation and the traffic correlation for external URLs using a distributed measurement infrastructure. Section 3 provides prediction results using social features. We present our experimental results on real Digg story data and show that we can predict the popularity of stories using historical story data both accurately and quickly. In Section 4 we detail the related work. We conclude the paper in Section 5 with some discussion of the potential to derive network characteristics from data mining of social events and our plans for future work.

2 Correlating Popularity with Response Time

2.1 Motivation

Our initial target was to assess the extent of the "Flash Crowd" effect for websites that are linked to popular stories on social discussion boards. Figure 1 illustrates the motivation for our problem. The layout of the Digg home page presents users with the most popular links (stories) to external web resources. A story gains popularity as users comment and "Digg up" a story, i.e., click on a link to increase the Digg Number of that story. More popular stories are prominently displayed at the top of the website. This can lead to some stories becoming very popular in a short span of time, increasing the load on the servers that host them. The consequence is a bad user experience, where the site loads very slowly or times out as an effect of the flash crowd. But how prevalent is the "Flash Crowd" phenomenon for publicly accessible discussion boards?

Figure 1: This figure illustrates the effects of a "Flash Crowd" event. The popularity of the social discussion board causes the externally linked website to become slow or even unresponsive.

2.2 Website Response Time

The first step in estimating the prevalence of the Flash Crowd effect is to accurately measure the network download and response time of all the external web sites that are linked via the social network discussion board. This study has to be done over a long period of time and for many URLs spanning many different and geographically distributed external story websites. Moreover, to be able to perform a non-biased estimate of the web site latency, we had to perform our measurements from many geographically and network-wise distinct network points. To that end, we deployed the latency measurement code on 30 nodes in PlanetLab. PlanetLab provides nodes with the same server specification, namely
a 1.6 GHz CPU, 2 GB of memory, and a 40 GB hard disk. Every 10 minutes, we identified the 500 most popular stories on Digg based on their score, and we stored the URLs that they point to on external websites on each of the PlanetLab nodes. For each of those external URLs, we computed the network latency of the hosting website by measuring the amount of time required to download its content to the PlanetLab nodes. To achieve that, we employed wget [21], a popular HTTP mirroring tool. We selected wget because of its simplicity and its capability to measure the network-dependent content download time precisely, without being affected by the potential delays introduced by JavaScript or other active content.

Furthermore, throughout our measurements, we downloaded only first-level content; we did not follow links or retrieve content from websites that pointed outside the domain of the measured URL. Of course, over time new stories become popular while others are removed; we keep track of all the stories. In addition, we did not perform all our downloads simultaneously, to avoid performance degradation due to network limits or bandwidth exhaustion. Instead, we probed only 20 URLs within a 5-minute window of time, with random start times. We repeated the network latency measurements every ten minutes and collected the timing results for each site.

To account for the fact that some websites may become unresponsive and lead to the stalling and accumulation of wget processes, we chose to terminate all unresponsive downloads within 2 minutes if no data had been received from the remote website. Moreover, in order to perform accurate correlations between the Digg score and the network latency trends, we had to make sure that the PlanetLab hosts had accurate time within the window of measurement (2 minutes). We used a Network Time Protocol (NTP) server to synchronize all hosts every minute. In addition, to prevent naturally slow websites from being repeatedly probed, we set up a maximum probing window. This window enabled us to treat subsequent measurements based on the historical trends that the site had exhibited. We initially computed the maximum interval between every two adjacent measurements that we captured. Then, we set the time window at 1, 2, 4, and 8 times the maximum interval, allowing the site more time to transmit data. With this algorithm, we were not only able to identify websites that were slow or unresponsive, but also to provide a better estimate of the time during which these sites exhibited this behavior, because we obtained more measurements. However, as we discuss next, to correlate with the Digg score, we used a uniform representation of time in which subsequent measurements were averaged.

Figure 2: Comparison Between Two Correlation Coefficient Methods: 1. Correlation Value Between Digg Number and Latency; 2. Correlation Value Between Digg Number and Standard Deviation of Latency (STD of Latency)

2.3 Correlation Methodology

We used the 2-D correlation coefficient tool in MATLAB. The correlation coefficient is computed as:

r = \frac{\sum_{k=1}^{m} \sum_{l=1}^{n} (A_{kl} - \bar{A})(B_{kl} - \bar{B})}{\sqrt{\left( \sum_{k=1}^{m} \sum_{l=1}^{n} (A_{kl} - \bar{A})^{2} \right) \left( \sum_{k=1}^{m} \sum_{l=1}^{n} (B_{kl} - \bar{B})^{2} \right)}}    (1)

(with \bar{A} and \bar{B} being the means of the elements of the matrices A and B). This formula is Pearson's product-moment correlation coefficient, which captures only a linear dependence between A and B. The correlation coefficient ranges from -1 to 1. A value of 1 implies that a linear equation describes the relationship between A and B perfectly, with all data points lying on a line for which B increases as A increases. A value of -1 implies that all data points lie on a line for which B decreases as A increases. In our experiments, we expected the latency to increase as the Digg number increases. Hence, we considered only correlation values close to 1 as indicating good performance.

2.4 Correlating Popularity to Latency

We deployed our code on the PlanetLab nodes, and we tracked the Digg number and response times by downloading content and measuring the resulting network latency until completion for a series of URLs
linked in the social discussion board. These numbers represent averages over many measurements. Initially, we tried to directly correlate increases in the perceived network latency for the download of the URLs with the increase in popularity. All of our results indicated that latency and Digg number are not directly correlated. As the Digg number increases, the increasing variable is not the value of the latency itself, but rather the variation in the measured response times and perceived latency. We therefore computed the correlation between the Digg number and the standard deviation of the latency to show that, as the Digg number increases, the latency becomes highly volatile. Indeed, we used the standard deviation (STD) of the website response time to model the variations of the latency. We then correlated the computed latency STD values of each URL with its corresponding Digg number over all time periods. We used a fixed time window of 20000 seconds, which is at least 2 times larger than the average capture interval. Within each time window, we collected the average Digg number and the maximum latency for both the correlation value computation and the standard deviation of latency.

Figure 2 depicts the comparison between the two correlation results. It appears that the Digg number correlates better with the STD of latency than with the latency itself. This is primarily due to the fact that the sites do not become completely unresponsive, but their responses exhibit heavy fluctuations clearly correlated with the increase in popularity. Although helpful in understanding the trends, the results from the two sites were not sufficient to achieve statistical significance, and they were limited to two nodes. Next, we present results that are more representative.

2.5 Correlation Values Among PlanetLab Nodes

We generated results from eight US-based nodes and display them as a distribution in Figure 3. This figure shows the number of URLs plotted according to the correlation value between the Digg number and the STD of latency. Indeed, 89% of the URLs have a correlation value above 0.4. Meanwhile, 50% of them had a correlation value between 0.7 and 1, which indicates very strong correlation.

Figure 3: Comparison of Correlation Coefficients Between Eight US-based Nodes

Moreover, to verify that this is not a localized phenomenon and that our measurements are not biased, we randomly chose three pairs of nodes from the same areas of the United States: the Central area, the East Coast area, and the Great Lakes area. Because all of the nodes were tested at the same time, they shared the same stories provided by the Digg website. We captured the correlation coefficient values at these nodes and plotted the paired results in Figures 4(a), 4(b) and 4(c). Two diagonal lines bound the area that we consider tolerable for similarity of correlation values: the more points that fall within the lines, the more related the results are. All graphs show that most points gather in the top right area, around [0.5, 1] on both axes. Each node has a large percentage of high-correlation URLs, as shown in Figure 3, and most high-correlation URLs fall between the diagonal lines, which means that both nodes observed a similar correlation value for the same URL at the same time of measurement. Some points lie outside the diagonal-line area; these are URLs with a high correlation value at one node but a low correlation value at the other. Overall, two nodes that are close to each other have a high percentage of similar correlation values for the same measurements. This shows that latency correlates well with the Digg number regionally.

2.6 Geo-locating Unresponsive Servers

Latency varies for different reasons, such as network connections, user locations, and story servers. To assess the geo-location of the unresponsive servers, we analyzed the countries where the websites of the externally linked stories or URLs point to. The geo-map in Figure 5 shows the unresponsive rates for each country. The areas in light yellow are countries that do not host any of the story servers obtained from Digg in our study. The areas in shades of green denote countries hosting the story servers; the darker the green, the more unresponsive stories the country has. It is easy to observe that the United States has the most unresponsive stories, totaling 204 of the 221 observed unresponsive stories. It is followed by the United Kingdom, Ireland, Germany, Canada, Australia, and the Netherlands.
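As an illustration, the window-based correlation of Sections 2.3 and 2.4 can be sketched as follows. This is a minimal NumPy sketch, not the paper's actual MATLAB code: `corr2` stands in for MATLAB's 2-D correlation coefficient of Eq. (1), the helper names and the synthetic trace are hypothetical, and the real study fed per-URL measurements from the PlanetLab nodes into these windows.

```python
import numpy as np

def corr2(a, b):
    """Pearson product-moment correlation of two equally sized arrays,
    following Eq. (1) (MATLAB's 2-D correlation coefficient)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

def windowed_features(timestamps, digg_numbers, latencies, window=20000):
    """Aggregate the per-window average Digg number and latency STD,
    mirroring the fixed 20000-second windows of Section 2.4."""
    timestamps = np.asarray(timestamps, dtype=float)
    digg_numbers = np.asarray(digg_numbers, dtype=float)
    latencies = np.asarray(latencies, dtype=float)
    avg_digg, std_lat = [], []
    t = timestamps.min()
    while t < timestamps.max():
        mask = (timestamps >= t) & (timestamps < t + window)
        if mask.sum() >= 2:  # need at least two samples for a deviation
            avg_digg.append(digg_numbers[mask].mean())
            std_lat.append(latencies[mask].std())
        t += window
    return np.array(avg_digg), np.array(std_lat)

# Hypothetical trace: popularity rises while latency grows more volatile.
ts = np.arange(0, 200000, 600)             # one probe every 10 minutes
digg = np.linspace(10, 900, ts.size)       # steadily rising Digg number
rng = np.random.default_rng(0)
lat = 1.0 + (digg / 900.0) * rng.random(ts.size)  # volatility tracks popularity
avg_digg, std_lat = windowed_features(ts, digg, lat)
r = corr2(avg_digg, std_lat)
print(round(float(r), 2))
```

On such a trace, correlating the Digg number with the STD of latency (rather than the raw latency) yields a value close to 1, which is the effect Figure 2 reports.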


Figure 4: Scatter plots of correlation values between pairs of nodes: (a) Colorado State University and University of Denver; (b) Cornell University and Columbia University; (c) Cleveland State University and DePaul University
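For concreteness, the per-node measurement loop of Section 2.2, which produced the data behind Figures 2-4, can be sketched as below. This is a simplified single-node sketch: the URL list and output handling are hypothetical placeholders, and the actual wget invocation on the PlanetLab nodes may have used different or additional flags.

```python
import random
import subprocess
import time

# Two-minute kill switch for unresponsive downloads (Section 2.2).
TIMEOUT_SECONDS = 120

def wget_command(url, timeout=TIMEOUT_SECONDS):
    """Build a wget invocation that fetches first-level content of a single
    page and gives up after `timeout` seconds without receiving data."""
    return [
        "wget",
        "--no-parent",                      # stay inside the measured URL's path
        "--read-timeout", str(timeout),
        "--tries", "1",
        "--output-document", "/dev/null",   # discard content; we only time it
        url,
    ]

def measure_latency(url):
    """Return the wall-clock download time in seconds, or None on failure."""
    start = time.monotonic()
    result = subprocess.run(wget_command(url), capture_output=True)
    elapsed = time.monotonic() - start
    return elapsed if result.returncode == 0 else None

def probe_batch(urls, batch_size=20, window_seconds=300):
    """Schedule `batch_size` URLs inside a 5-minute window at random start
    offsets, as in Section 2.2; returns (url, start_offset) pairs in order."""
    batch = urls[:batch_size]
    offsets = sorted(random.uniform(0, window_seconds) for _ in batch)
    return list(zip(batch, offsets))

schedule = probe_batch([f"http://example{i}.test/" for i in range(50)])
print(len(schedule))  # 20 probes scheduled within the window
```

Repeating such a batch every ten minutes, and terminating any stalled wget after two minutes, yields the latency time series that the correlation analysis above consumes.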

3 Forecasting Latency

Although we are able to establish that volatile response times can be directly attributed to the increase in popularity for a significant portion of the external URLs, it is not clear whether we can forecast either the Digg number or the spike in latency using solely network measurements. An important factor for prediction is timing: the detection time before the latency becomes large. If the detection time is too short, we are not able to detect the trend. We set up an algorithm that decides whether a URL is detectable based on whether a detection time exists. The detection time is set at the point where the latency has reached 90% of a spike. Meanwhile, a window is set at 5, 10, or 15 latency captures, and the latency value is computed as the average over the captures in each window. If the average latency of each successive window is slightly increasing, we consider the URL detectable. Based on this algorithm, we found that the highest percentage of URLs that we could predict ahead of time is merely 2.4% of all URLs, obtained when the window size is 10. This means that, by relying purely on network latency measurements, we do not have enough reaction time.

Using network measurements alone, the number of URLs for which we can predict the increase in network latency is very limited. That is primarily because the latency usually increases suddenly, only after the increase in popularity. Meanwhile, the diversity and regionality of the correlation values show that the latency can differ considerably depending on the location of the PlanetLab node.

To address this limitation, we decided to use forecasting based on the content of the social discussion board. We extracted features about early user comments using the Digg API, along with the Digg number. This approach provides us with early information with which to study Digg number trends. However, to compute the earliest prediction time, we first have to know the general trends in Digg number growth for both upcoming and popular stories. To achieve this, the Digg numbers of the top 1000 stories are captured every half hour, and we continuously update these stories. If new stories approach the top 1000, we add them and start to record their Digg numbers. Meanwhile, we also keep updating the prior stories until they have been removed from the Digg website. Hence, the duration of each story is different.

For an early warning system to be able to predict flash crowds, we would like to arrive at the prediction results as early as possible: the earlier the prediction results are available, the more time administrators have to react to the flash crowd. Estimating the time required for a prediction result is therefore an important aspect of our proposed framework. To this end, we divide our task into two mutually exclusive requirements: first, estimating the network latency of web resources (stories) posted on Digg; second, predicting the popularity of Digg stories by mining the social network characteristics of Digg. Figure 6 shows the two prediction mechanisms used to validate our results.

Digg provides us with an extensive and convenient API to interact with its website. We make use of this API to obtain the latest set of popular stories posted. Using this mechanism, we collect and store each story's URL. A set of 500 story URLs and their features across all the topics available in Digg are downloaded repeatedly every ten minutes. The collected data is then used by the two prediction techniques to independently predict the popularity; the following subsections describe these two techniques.

The life span of each story can be quite different. If a story has just been added to the group, it will have a very small Digg number; if a story had been "Dugg" for a long time before its addition to our system, it will already have had a high Digg number at the beginning of the evaluation. We therefore focus only on stories that have been submitted within a relatively equal length of time.
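The window-based detectability check described above can be sketched as follows. This is a minimal sketch under our reading of the algorithm; the threshold for a "slight increase" (`min_growth`) and the example traces are hypothetical.

```python
def detection_time(latencies, spike_index, threshold=0.9):
    """Return the first index at which latency reaches `threshold` (90%)
    of the spike value, or None if it never does."""
    target = threshold * latencies[spike_index]
    for i in range(spike_index + 1):
        if latencies[i] >= target:
            return i
    return None

def is_detectable(latencies, window=10, min_growth=0.0):
    """A URL is detectable if the average latency of each successive
    window of `window` captures is (slightly) increasing."""
    averages = [
        sum(latencies[i:i + window]) / window
        for i in range(0, len(latencies) - window + 1, window)
    ]
    if len(averages) < 2:
        return False
    return all(b - a > min_growth for a, b in zip(averages, averages[1:]))

# Hypothetical traces: one ramps up gradually, one spikes abruptly.
gradual = [1.0 + 0.05 * i for i in range(40)]
abrupt = [1.0] * 39 + [10.0]
print(is_detectable(gradual), is_detectable(abrupt))  # prints: True False
```

The abrupt trace illustrates why purely network-based prediction fails: the latency reaches 90% of the spike only at the spike itself, leaving no reaction time, which matches the 2.4% detectability result above.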

Figure 5: A geo-map of the hosting countries of unresponsive websites out of the 1289 popular stories that appeared on Digg over a period of two weeks. Darker shades of green indicate a higher density.
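The per-country tally behind Figure 5 can be sketched as a simple aggregation. This is a hypothetical sketch: the URL-to-country mapping is a stand-in for whatever IP geo-location source was actually used, and the sample data is invented.

```python
from collections import Counter

def unresponsive_by_country(measurements, country_of):
    """Tally unresponsive downloads per hosting country, as plotted in
    Figure 5. `measurements` maps URL -> True if the site was unresponsive;
    `country_of` maps URL -> country code (e.g. from a geo-location database)."""
    tally = Counter()
    for url, unresponsive in measurements.items():
        if unresponsive:
            tally[country_of(url)] += 1
    return tally

# Hypothetical sample; the real study observed 204 of 221 unresponsive
# story servers hosted in the United States.
lookup = {"http://a.test/": "US", "http://b.test/": "US", "http://c.test/": "GB"}
obs = {"http://a.test/": True, "http://b.test/": True, "http://c.test/": False}
print(unresponsive_by_country(obs, lookup.get).most_common())  # [('US', 2)]
```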

We set up a time length, spanning from the submission time to the current time, of approximately 40 to 75 hours. We separate the stories based on their Digg number into five categories: 0-50, 51-100, 101-200, 201-500, and 501-1000. Since the Digg number of an upcoming story is relatively small, its thresholds are set only to 50, 100, and 200. Most popular stories have a Digg number of at least 50 by the time their status becomes popular. Figure 7(a) shows the percentage of URLs with Digg numbers above 50, 100, 200, 500, and 1000. The total number of URLs decreases because we focus only on stories with a time length within the 40-75 hour range; we can therefore rule out stories that are either newly submitted or have been posted for a long time. As finished stories are removed from the Digg website, the total number of stories that we are tracking is reduced.

When a story is originally submitted, it is categorized as an upcoming story. Figure 7(b) illustrates the trend of the Digg number for upcoming stories. Within the first 10 hours of our experiment, only 60 percent of the upcoming stories had a Digg number higher than 50, and of those, only 24 percent had a Digg number above 100.

The popular stories, on the other hand, gain Digg numbers relatively faster than the upcoming ones. When an upcoming story reaches a Digg number higher than 100, its status changes to popular; therefore, by the time it becomes popular, its Digg number is already above 100. Figure 7(a) shows the Digg number trends of popular stories. For the first 10 hours of our experiment, approximately 98 percent of the stories already had a Digg number higher than 100, and 90 percent of them had a Digg number above 200. It is therefore too late to predict the trends of popular stories once they have reached a high level of popularity. We need to predict the trends of stories that are still upcoming or recently submitted by a user.

3.1 Prediction Methods

To achieve the earliest prediction time, we began our experiment at the time that each story was submitted. We captured the test data at different times after the story was submitted. The training data was captured over the whole time length (120 hours), while testing data was captured at 5, 10, 15, 30, 60, 120, 300, 600, and 900 minutes, and at the complete time (120 hours), after the story was submitted. In addition, for our prediction, we used the following features:

• Comment Statistics. The number of comments for a posted story and the average word length of the comments are used as features. The Digg comment system forms a hierarchical tree structure where each comment is associated with its own level. A comment made directly on the story is considered a first-level comment. We count the number of comments associated with the top 4 levels for each story.

• Digg User Feedback. Digg users have the option to comment on whether they like a story or not, but they can also rate the comments. The Digg website provides the "up" and "down" scores, which represent positive and negative feedback from users. We use the sum of up scores and the sum of down scores over all comments associated with the story as two features.


Figure 7: Percentage of stories with Digg numbers above 50, 100, 200, 500, and 1000 over time, for stories in (a) the popular queue and (b) the upcoming queue. The popular queue contains stories that are already above 100 in popularity. Notice that stories do not become popular immediately, but after a few hours; this gives us time to analyze their content and forecast their popularity and network response fluctuations.

• User Community Structure and Membership We Nearest Neighbor Classifier [1]. For the support vector used two entropy values computed for categories machine classifiers, we applied linear and radial basis and topics as a measure knowledge associated kernel function. For the K-class classification in SVMs, with users. The entropy is defined as H1 = we trained as one-versus-rest classifiers for each of K − ∑i pilog(pi). With respect to the categories, i it- classes. Ensemble of classifiers have been known to erates over eight Digg categories (World Business, outperform individual classifiers. Therefore, we use Technology, Science, Gaming, Sports, Entertain- AdaBoost [8] a meta algorithm that trains successive ment, Life Style and Offbeat) and pi denotes the classifiers with an emphasis on previously misclassified probability for the user to belong to category i. On instances. Additionally, we also test the prediction by the other hand, with respect to the topics, the en- MultiBoost [32] also a meta algorithm, and an extension tropy is defined as i iterates over 51 topics with pi to AdaBoost algorithm. Finally, We used Classification denoting the probability for the user to belong to Via Regression (CVR) [7] by applying a type of deci- topic i. sion tree with linear regression functions at the leaves that generates more accurate classifiers. To have better Overall, our approach leverages eight features to train performance of the classification, we performed 5-fold the prediction models, as does previous research. We cross validation. “Weka” Toolkit [33] and LibSVM [6] focus on the earliest time of predicting correct Digg num- are major tools for the popularity prediction. ber. Furthermore, the number of class for presenting Digg number of story was set up as three independent multi-classification group, that is 2-class, 4-class and 3.2 Prediction Results 8-class. 
For each of the 2-class, 4-class and 8-class classifiers, the bins were set at Digg-number intervals of 2750, 500 and 250, respectively. For example, for the 2-class problem the two bins were "below 2750" and "above 2750" Diggs, while for the 14-class problem each bin spans a Digg-number interval of only 250. The more classes a problem has, the harder it is to predict. Classification performance was evaluated as K-way classification accuracy (QK), where K is the number of classes in each prediction group. In addition, we use the area under the receiver operating characteristic (ROC) curve [5], i.e., the average area under the plot of the true positive rate versus the false positive rate.

In this study we experimented with several classification techniques, among them the C4.5 decision tree [31]. Of the seven classification algorithms we tried, we present in Figure 8 the four methods with the best classification performance. Figure 8 also illustrates how the Q2, Q4 and Q8 accuracies change as the time allowed for collecting test data increases; the vertical line at each point represents the range of the confidence interval [13]. The confidence intervals are relatively small, which means the accuracy estimates are reliable. The one exception is the 8-class CVR result for the first 10 minutes, which is mostly due to its low accuracy. As time increases, the accuracies of all three classification problems improve.

SVM has superior performance in most cases. From the data in Figure 8, the SVM linear regression method already reaches 86% accuracy in 2-class classification within the first five minutes.

Figure 8: Multi-classification (Q2, Q4, Q8) accuracy with confidence intervals (shown as vertical lines) for (a) the SVM linear and radial basis function regression methods; (b) the ensemble AdaBoostM1 and Classification Via Regression methods.
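To make the setup concrete, the following sketch (our own illustration, not the authors' code) shows how a Digg count maps to a class bin, how the K-way accuracy QK is computed, and how a normal-approximation 95% confidence interval of the kind drawn as error bars in Figure 8 could be derived. The function names, the test-set size and the example counts are hypothetical.

```python
import math

def digg_bin(diggs, num_classes, interval):
    """Map a Digg count to one of `num_classes` bins of width `interval`;
    the top bin is open-ended (e.g. 2 classes with interval 2750 gives
    the "below 2750" / "above 2750" split)."""
    return min(diggs // interval, num_classes - 1)

def k_way_accuracy(true_bins, predicted_bins):
    """Q_K: the fraction of stories assigned to the correct bin."""
    hits = sum(t == p for t, p in zip(true_bins, predicted_bins))
    return hits / len(true_bins)

def accuracy_confidence_interval(acc, n, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy
    estimated from n test stories (one common construction of the
    standard confidence interval the paper cites as [13])."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)

# Hypothetical example: 86% accuracy measured on 500 test stories
# yields an interval of roughly 0.83-0.89, i.e. a narrow error bar.
low, high = accuracy_confidence_interval(0.86, 500)
```

The narrower this interval, the more reliable the reported accuracy, which is the sense in which the small vertical bars in Figure 8 indicate reliable estimates.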

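The area under the ROC curve [5] used as the second evaluation metric can also be computed without plotting, as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting one half). A minimal sketch, with hypothetical classifier scores:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive),
    via the rank (Mann-Whitney) formulation: the fraction of
    positive/negative pairs ranked correctly, ties counting 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: perfect separation gives AUC = 1.0, while a
# classifier that scores everything identically gets 0.5 (chance level).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```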
For the same five-minute window, the SVM radial basis function (RBF) kernel reaches 62% and 54% accuracy for the 4-class and 8-class problems, respectively. Over the whole data set, the best performance we reach is 95% accuracy in 2-class classification, achieved by both SVM variants. SVM RBF also has the best Q8 accuracy (61%), while CVR has the best Q4 result, reaching 73% over the whole time period. In general, the SVM methods deliver better accuracy than the ensemble methods for most test-data collection times.

The ROC results for the four methods are presented in Figure 9. CVR has better ROC results than the other methods, except for its unstable results during the first 10 minutes of 8-class classification. In the first five minutes, even though SVM has the highest accuracy on the 2-class, 4-class and 8-class problems, the ensemble CVR method has a much higher ROC value in 2-class and 4-class classification, with accuracy at the same level as SVM or only slightly below it. The same holds over the complete 120 hours: CVR has the highest ROC for all three multi-class problems while its accuracy remains at the top level. Hence the CVR method is actually more reliable than the other methods.

Summarizing the results above: given enough time, the best accuracy we can obtain is 95%, with the SVM RBF method on Q2. Moreover, the shortest time within which we can achieve a good Q2 prediction (above 85% accuracy and 0.8 ROC) is the first 15 minutes, using the ensemble CVR algorithm.

Figure 6: Digg Number Prediction Architecture. Two measurements are implemented: (1) prediction based on the correlation between the latency and the Digg number of stories, and (2) prediction based on early-comment features drawn from Digg information about users and stories.

4 Related Work

Network traffic prediction has received significant attention during the last decade [16, 3, 22]. Li et al. [19] proposed a method to identify network anomalies using sketch subspaces. Their work requires a large amount of historical data, which is not feasible to obtain for websites externally linked from social discussion boards. The same assumption of access to historical trends holds for the work by Sengar et al. [25] and Fu-Ke et al. [9].

Flash crowds are the phenomenon in which there is an acute increase in the volume of network traffic; they are difficult to predict.

Figure 9: Multi-classification (2-class, 4-class and 8-class) results for the area under the receiver operating characteristic (ROC) curve, for the SVM methods (linear and radial basis function regression) and the ensemble methods (AdaBoostM1 and Classification Via Regression).

The flash crowd effect has been referred to by different names, such as the hot-spot and "Slashdot" effect [10]. Most previous research takes a reactive approach to the flash crowd problem by proposing replication of services [26]. Jung et al. [14] attempt to prevent flash crowds by blocking requests from malicious clients, which they achieve by distinguishing between the characteristics of a flash crowd and those of a DDoS attack. Baryshnikov et al. [4] argue that it is possible to design a framework to predict flash crowds in web traffic, and discuss techniques for predicting them from network load statistics. None of the above work exploits social networks and data mining techniques to predict traffic volatility.

In terms of social-network popularity prediction, Szabo et al. [27] introduced a regression model that predicts the popularity of posts on the Digg network from their popularity ratings at an earlier time interval. In contrast, the method introduced here predicts the popularity of posts using different features and models user participation explicitly in the form of comments. Another document recommendation model [17] captures users reading certain posts and explicit relationships between friends. Jamali et al. [12] have shown that it is possible to predict the popularity of a story by mining Digg; however, their approach was geared towards long-term analysis and requires up to ten hours of historical data to produce a prediction. We present results that start the evaluation as early as 2 minutes.

Another recent thread of research focuses on collective behavior prediction via extraction of the social dimension [29, 28]. In addition, Kunegis et al. [15] proposed a mechanism for mining social networks with negative edges. Finally, Tsagkias et al. [30] presented techniques to predict the volume of online news stories, while Lerman et al. [18] focused on the analysis of social voting patterns on Digg. Analyzing collective behavior and voting patterns can potentially help predict future story popularity on social discussion boards and news sites; we do not, however, make use of these techniques in this paper.

5 Conclusions

Our initial goal was to quantify the effects of social discussion boards on popular externally linked stories and websites in terms of network response time. To that end, we measured the download times of the websites hosting popular stories for 1289 distinct URLs over a period of two weeks. By correlating the variation of the measured latency with the increase in popularity, we were able to show that the network response times of 89% of the popular URLs were affected, with over 50% of the stories having correlation values greater than 0.7.

Furthermore, knowing that there is a direct correlation between popularity and network response times, we investigated mechanisms to forecast the popularity of the externally hosted stories. Using features extracted from the content and structure of the social discussions, we were able to correctly classify as popular approximately 86% of the stories within just five minutes of their submission. This number increases further, to 95%, when we collect five hours of online discussions. Our study shows that there is a clear benefit in using information derived from social activity to predict potentially abrupt increases in demand that can cause delays or become debilitating for the underlying network infrastructure.
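The correlation values quoted above are standard Pearson coefficients between a story's measured latency series and its Digg-count series. As a minimal, self-contained sketch (the two series below are hypothetical):

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. per-interval download latency and Digg count for one story."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical story: latency (ms) rising together with popularity
# gives a coefficient close to 1.0 (well above the 0.7 level that the
# text singles out for more than half of the stories).
r = pearson_correlation([120, 150, 300, 480], [100, 900, 2400, 5100])
```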

References

[1] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6(1):37–66, January 1991.

[2] N. Ali-Hasan and L. A. Adamic. Expressing social relationships on the blog through links and comments. In International Conference on Weblogs and Social Media (ICWSM), 2007.

[3] P. Barford, J. Kline, D. Plonka, and A. Ron. A signal analysis of network traffic anomalies. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, pages 71–82. ACM, 2002.

[4] Y. Baryshnikov, E. Coffman, G. Pierre, D. Rubenstein, M. Squillante, and T. Yimwadsana. Predictability of web-server traffic congestion. In Proceedings of the 10th International Workshop on Web Content Caching and Distribution, pages 97–103, Washington, DC, USA, 2005. IEEE Computer Society.

[5] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.

[6] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten. Using model trees for classification. Machine Learning, 32:63–76, July 1998.

[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, August 1997.

[9] S. Fu-Ke, Z. Wei, and C. Pan. An engineering approach to prediction of network traffic based on time-series model. In International Joint Conference on Artificial Intelligence (JCAI '09), pages 432–435. IEEE, 2009.

[10] A. Halavais. The Slashdot effect: Analysis of a large-scale public conversation on the world wide web. University of Washington, 2001.

[11] S. Huffman and A. Ohanian. reddit. http://www.reddit.com/.

[12] S. Jamali and H. Rangwala. Digging Digg: Comment mining, popularity prediction, and social network analysis. In WISM '09–AICI '09, pages 6–6, Shanghai University of Electric Power, Shanghai, China, 2009.

[13] V. J. Easton and J. H. McColl. Confidence interval. In Statistics Glossary v1.1, 1997.

[14] J. Jung, B. Krishnamurthy, and M. Rabinovich. Flash crowds and denial of service attacks: characterization and implications for CDNs and web sites. In Proceedings of the 11th International Conference on World Wide Web (WWW '02), pages 293–304, New York, NY, USA, 2002. ACM.

[15] J. Kunegis, A. Lommatzsch, and C. Bauckhage. The Slashdot Zoo: Mining a social network with negative edges. In Proceedings of the 18th International Conference on World Wide Web, pages 741–750. ACM, 2009.

[16] A. Lakhina, M. Crovella, and C. Diot. Characterization of network-wide anomalies in traffic flows. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pages 201–206. ACM, 2004.

[17] K. Lerman. Social information processing in news aggregation. IEEE Internet Computing, 11(6):16–28, 2007.

[18] K. Lerman and A. Galstyan. Analysis of social voting patterns on Digg. In Proceedings of the First Workshop on Online Social Networks, pages 7–12. ACM, 2008.

[19] X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan, G. Iannaccone, and A. Lakhina. Detection and identification of network anomalies using sketch subspaces. In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pages 147–152. ACM, 2006.

[20] G. Mishne and N. Glance. Leave a reply: An analysis of weblog comments. In Third Annual Workshop on the Weblogging Ecosystem, 2006.

[21] H. Niksic. GNU wget, 1996.

[22] K. Papagiannaki, N. Taft, Z. Zhang, and C. Diot. Long-term forecasting of Internet backbone traffic. IEEE Transactions on Neural Networks, 16(5):1110–1124, 2005.

[23] K. Rose. Digg. http://digg.com/news.

[24] J. Schachter. Delicious. http://www.delicious.com/.

[25] H. Sengar, X. Wang, H. Wang, D. Wijesekera, and S. Jajodia. Online detection of network traffic anomalies using behavioral distance. In 17th International Workshop on Quality of Service (IWQoS), pages 1–9. IEEE, 2009.

[26] S. Sivasubramanian, M. Szymaniak, G. Pierre, and M. Steen. Replication for web hosting systems. ACM Computing Surveys, 36(3):291–334, 2004.

[27] G. Szabo and B. Huberman. Predicting the popularity of online content. Technical report, HP Labs, pages 1–6, 2008.

[28] L. Tang and H. Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 817–826. ACM, 2009.

[29] L. Tang and H. Liu. Toward collective behavior prediction via social dimension extraction. IEEE Intelligent Systems, 2010.

[30] M. Tsagkias, W. Weerkamp, and M. de Rijke. Predicting the volume of comments on online news stories. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1765–1768. ACM, 2009.

[31] G. Webb. Decision tree grafting. In IJCAI-97: Fifteenth International Joint Conference on Artificial Intelligence, pages 846–851. Morgan Kaufmann, 1997.

[32] G. I. Webb. Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40:159–196, August 2000.

[33] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

[34] K. Zhongbao and Z. Changshui. Reply networks on a bulletin board system. Physical Review E, 67(3):036117, March 2003.