Characterizing the Life Cycle of Online News Stories Using Social Media
Total Page:16
File Type:pdf, Size:1020Kb
Characterizing the Life Cycle of Online News Stories Using Social Media Reactions Carlos Castillo Mohammed El-Haddad Jurgen¨ Pfeffer Qatar Computing Al Jazeera Carnegie Mellon University Research Institute Doha, Qatar Pittsburgh, USA Doha, Qatar mohammed.haddad [email protected] [email protected] @aljazeera.net Matt Stempeck MIT Center for Civic Media Cambridge, USA [email protected] ABSTRACT The study of patterns of consumption of online news has This paper presents a study of the life cycle of news ar- attracted considerable attention from the research com- ticles posted online. We describe the interplay between munity for over a decade. This research started with the website visitation patterns and social media reactions to analysis of access patterns to websites, and has expanded the news content. We show that we can use this hy- to include topics such as new engagement metrics, per- brid observation method to characterize distinct classes sonalized news recommendations and summaries, etc. of articles. We also find that social media reactions can (see Section 2 for an overview). be used to predict future visitation patterns early and One line of research looks at consumption and interac- accurately. tion patterns as a single time series and attempts sev- We validate our methods using qualitative analysis as eral prediction tasks on it. For example, predicting total well as quantitative analysis on data from a large inter- comments from early comments [18, 28], total visits from national news network, for a set of articles generating early visits [16], etc. More recent works incorporate at- more than 3,000,000 visits and 200,000 social media re- tributes from each specific article (e.g. topic, source, actions. We show that it is possible to model accurately etc.) into the prediction [4]. the overall traffic articles will ultimately receive by ob- We adopt a novel approach, in which we integrate serving the first ten to twenty minutes of social media different types of interactions of users with an online reactions. Achieving the same prediction accuracy with news article including visits, social media reactions, and visits alone would require to wait for three hours of data. search/referrals. We evaluate our methods on data from We also describe significant improvements on the accu- Al Jazeera English, a large international news network, racy of the early prediction of shelf-life for news stories. deeply characterizing different classes of articles, and predicting their total number of page views and their 1. INTRODUCTION effective shelf-life (the effective shelf-life of an article is Traditional newspapers have been in decline in recent the time span during which it receives most of its visits). years in terms of readership and revenue; in comparison, The characterization and prediction of user behavior digital online news have been steadily increasing accord- around news articles is valuable for a news organization, ing to both metrics.1 Recent surveys have shown that as it allows them (i) to gain a better understanding of about half of the population of the US gets their news how people consume different types of news online; (ii) to online, and about one third goes online every day for deliver more relevant and engaging content in a proactive news.2 manner; and (iii) to improve the allocation of resources to developing stories over their life cycle. 1http://stateofthemedia.org/2012/overview-4/key-findings/ 2http://www.people-press.org/2012/09/27/section-2- online-and-digital-news-2/ Our contributions. In this paper we present a quali- tative and quantitative analysis of the life cycle of online Permission to make digital or hard copies of all or part of this work news stories. Our main contributions are the following: for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advan- • We find that social media reactions can contribute tage and that copies bear this notice and the full citation on the first substantially to the understanding of visitation pat- page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. terns in online news. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request per- • We characterize two fundamental classes of news sto- missions from [email protected]. ries: breaking news and in-depth articles, and describe CSCW'14 , February 15{19, 2014, Baltimore, Maryland, USA. ACM 978-1-4503-2540-0/14/02$15.00. the differences in users' behavior around them. http://dx.doi.org/10.1145/2531602.2531623 • We describe classes of short-term audience response of Twitter) after being exposed to the information by a profiles to news articles in terms of visits and social certain number of her neighbors. media reactions (decreasing, steady, increasing, and Yang and Leskovec [30] describe six classes of tempo- rebounding). ral shapes of attention. Attention is measured in terms • We improve significantly the accuracy and timeliness of the number of appearances of a given phrase (of a of predictive models of total visits and shelf-life of ar- variation of it) corresponding to an event. The patterns ticles, by incorporating social media reactions. describe the distribution of attention over time, as well as the ordering in which different types of media (profes- The remainder of this paper is organized as follows. Sec- sional blogs, news agencies, etc.) will \break" the story. tion 2 provides an overview of previous works related to ours. Section 3 introduces our data collection and de- In general, previous works have established that the evo- fines the concepts and variables we use. The main re- lution of the popularity of different on-line items de- sults of our paper are presented as descriptive and pre- pends on their class. Figueiredo et al. [10] describe how dictive analysis in the two following sections: Section 4 YouTube videos that are posted to a \top" page on the describes user behavior with respect to different classes website, and videos that are making use of professionally of articles, and Section 5 demonstrates the importance produced content, are different from randomly-chosen of incorporating social media information into the pre- videos in terms of their visit patterns. dictive modeling of visits. The last section concludes the Recently, researchers at URL shortening service Bit.ly [6] paper. described how an article's half-life (see definition in Section 3) is affected by topics, extending a previ- 2. RELATED WORK ous observation than in general there are some top- One of the earliest published studies of user behavior in ics that are more time-sensitive than others [12]. online news was conducted by Aikat [3], who studied the For instance, business-related articles have on aver- web sites of two large newspapers from November 1995 age a longer half-life, while articles related to poli- to May 1997. This work describes many of the patterns tics/celebrities/entertainment have an intermediate one. still seen in news sites today: visits occur mostly during Sports-related articles have in comparison a shorter half- weekdays and working hours; readers \skim" pages for life. Previously, Bit.ly researchers [5] have shown that information so dwell times tend to be short, and there are this half-life is also affected by the social media platform clear traffic \bursts" that can be attributed to specific where the link is first posted (e.g. links on Facebook news developments. were longer-lived than links on Twitter). With the advent in recent years of what can be consid- ered as new forms of journalism (blogs) and new prop- We deepen and complement previous works on agation mechanisms for news (micro-blogs and online behavioral-driven characterization of online content, by social networking sites), the volume of research publi- describing the life-cycle of online news articles consider- cations in this area has increased considerably. In this ing their visitation patterns as well as their social media section we overview a few previous works closely related reactions. to ours, but our coverage is by no means complete. Prediction of users' activity. The prediction of the Behavioral-driven article classification. Previous volume of user activities with respect to on-line content works including [8, 19] that have studied online activi- items has attracted a considerable amount of research. ties around online resources (e.g. visiting, voting, shar- This is attested by a number of papers, some of which ing, etc.), have consistently identified broad classes of are outlined in Table 1. Another active topic that is temporal patterns. These classes can be generally char- closely related, but different, is that of predicting real- acterized, first, by the presence or absence of a clear world variables such as sales or profits using social media \peak" of activity; and second, by the amount of activ- signals (e.g. [13] and many others). ity before and after the peak. Over the years, the models used to predict user behav- Crane and Sornette [8] describe classes of visitation pat- ior in social media have increased in complexity. For in- terns to online videos, and present models that are stance, Bandari et al. [4] and Ruan et al. [26] incorporate consistent with propagation phenomena in social net- into their models features extracted from the content of works. Lehmann et al. [19] extend these classes by ob- the articles, such as topics. Yin et al. [31] study vot- serving that for Twitter \hashtags" (user-defined top- ing behavior over on-line contents and describe a model ics) the distributions of activity in different periods that considers that users are divided into two popula- (before/during/after) induce distinct clusters of activ- tions: a group that follows the majority opinion, and a ity that can be interpreted considering the semantics of group that does not.