Cost-Effective Online Trending Topic Detection and Popularity Prediction in Microblogging
Total Page:16
File Type:pdf, Size:1020Kb
Cost-Effective Online Trending Topic Detection and Popularity Prediction in Microblogging ZHONGCHEN MIAO and KAI CHEN, Shanghai Jiao Tong University YI FANG, Santa Clara University JIANHUA HE, Aston University YI ZHOU and WENJUN ZHANG, Shanghai Jiao Tong University HONGYUAN ZHA, Georgia Institute of Technology 18 Identifying topic trends on microblogging services such as Twitter and estimating those topics’ future popu- larity have great academic and business value, especially when the operations can be done in real time. For any third party, however, capturing and processing such huge volumes of real-time data in microblogs are almost infeasible tasks, as there always exist API (Application Program Interface) request limits, monitor- ing and computing budgets, as well as timeliness requirements. To deal with these challenges, we propose a cost-effective system framework with algorithms that can automatically select a subset of representative users in microblogging networks in offline, under given cost constraints. Then the proposed system can online monitor and utilize only these selected users’ real-time microposts to detect the overall trending topics and predict their future popularity among the whole microblogging network. Therefore, our proposed system framework is practical for real-time usage as it avoids the high cost in capturing and processing full real-time data, while not compromising detection and prediction performance under given cost constraints. Experiments with real microblogs dataset show that by tracking only 500 users out of 0.6 million users and processing no more than 30,000 microposts daily, about 92% trending topics could be detected and predicted by the proposedr system and, on average, more than 10 hours earlier than they appear in official trends lists. CCS Concepts: Information systems → Data stream mining; Document topic models; Retrieval efficiency; Content analysis and feature selection; Information extraction; Social networks; Additional Key Words and Phrases: Topic detection, prediction, microblogging, cost ACM Reference Format: Zhongchen Miao, Kai Chen, Yi Fang, Jianhua He, Yi Zhou, Wenjun Zhang, and Hongyuan Zha. 2016. Cost- effective online trending topic detection and popularity prediction in microblogging. ACM Trans. Inf. Syst. 35, 3, Article 18 (December 2016), 36 pages. DOI: http://dx.doi.org/10.1145/3001833 This work was supported in part by the National Key Research and Development Program of China (2016YFB1001003), National Natural Science Foundation of China (61521062, 61527804), Shanghai Science and Technology Committees of Scientific Research Project (Grants No. 14XD1402100 and No. 15JC1401700), and the 111 Program (B07022). Authors’ addresses: Z. Miao, K. Chen (corresponding author), Y. Zhou, and W. Zhang, Department of Electronic Engineering, Shanghai Jiao Tong University, China; emails: {miaozhongchen, kchen, zy_21th, zhangwenjun}@sjtu.edu.cn; Y. Fang, Department of Computer Engineering, Santa Clara University, U.S.A.; email: [email protected]; J. He, School of Engineering and Applied Science, Aston University, U.K.; email: [email protected]; H. Zha, School of Computational Science and Engineering, Georgia Institute of Technol- ogy, U.S.A.; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. c 2016 ACM 1046-8188/2016/12-ART18 $15.00 DOI: http://dx.doi.org/10.1145/3001833 ACM Transactions on Information Systems, Vol. 35, No. 3, Article 18, Publication date: December 2016. 18:2 Z. Miao et al. 1. INTRODUCTION Nowadays people’s daily life across the world is closely tied to online social networks. Microblogging services (e.g., Twitter1 and Weibo2), as one of the representative online social network services, provide a more convenient approach for everybody around the world to read news, deliver messages, and exchange opinions than traditional media such as TV or newspaper. So huge quantities of users tend to post microposts in microblogging services and talk about the things they just witnessed, the news they just heard, or the ideas they just thought. In our article, the topic of a micropost refers to a group of keywords (such as the name of items, news headline, or thesis of ideas) in its content. Those semantically related microposts that talk about the same items, news, or thoughts within a given time window are the set of microposts of that topic. Commonly, microblogging social networks are filled with a large number of var- ied topics all the time. However, if one topic is suddenly mentioned or discussed by an unusual amount of microposts within a relatively short time period, that topic is “trend- ing” in microblogging services. As microblogging becomes the earliest and fastest source of information, these trending topics shown on social networks are often referred to the breaking news or events that have or will have societal impact in our real lives, such as first-hand reports of natural or man-made disasters, leaking an excellent product, unexpected sports winners, and very controversial remarks/opinions. Because of this, identifying trending topics in microblogging networks is receiving increasing interest among academic researchers as well as industries. Moreover, it will produce even more scientific, social, and commercial value if the trending topics can be detected in real time and those topics’ future popularity can be predicted at early stages. For example, the early awareness of first-hand reports of disasters can give rescuers more priceless time to reach the incident site and help more victims. Taking another example, higher predicted popularity and longer lifetime of a “leaks of a specific product” trending topic can bode will for the product’s future reputation and sales, so businesspeople can be prepared to increase inventory and production. In fact, microblogging service providers themselves, such as Twitter and Weibo, are publishing their official trending topic lists regularly. However unfortunately, these official lists are commonly delayed in publishing, small in size (Top-10 only), and not fully customized by individual user’s preferences. They also do not contain topics’ future popularity prediction function at all, so it is hard to tell how long a trending topic will last. More critically, there are concerns that some trending topics will never appear in these official lists that are subjected to the service provider’s commercial considerations or even government censorship policy [Chen et al. 2013a]. If we rely only on the official trends lists, then some topics would most likely be missed or delayed by us. Therefore, business companies, organizations, or even individuals are in need of a reliable online real-time trending topics detection and prediction system on microblogging services and other social networks, which can produce impartial, accurate, and even customized results from a third-party perspective. Traditionally, online trending topics detection and prediction systems for microblog- ging comprise three major steps: (1) retrieving microposts and related information from a microblogging website as much as possible, (2) detecting trends from the obtained microblog dataset, and (3) predicting the detected topics’ future popularity using the ob- tained microblog dataset. In this way, the performance in steps 2 and 3 largely depends on the quantity and quality of the dataset retrieved in step 1. If at some time periods the sampled data source is biased [Morstatter et al. 2013, 2014], then an extra large-scaled 1http://twitter.com., the top microblogging service worldwide. 2http://weibo.com., the top microblogging service in China. ACM Transactions on Information Systems, Vol. 35, No. 3, Article 18, Publication date: December 2016. Cost-Effective Online Trending Topic Detection and Popularity Prediction in Microblogging 18:3 dataset at those time periods will be needed in order to remove the bias and get repre- sentative topic detection results for all times. However, for any third-party analyzers, the tremendous number of users in microblogging services and the ever-growing vol- umes of microposts pose significant challenges to the capturing and processing of such a big scale dataset in real time. Although we are in an era of cloud-based services, it is still very challenging for any third-party analyzers to acquire the full real-time data stream of the whole microblogging network in time, as microblogging services companies are heavily limiting and narrowing the API (Application Program Interface) requests rate per account or per IP (Internet Protocol) address to prevent large dataset collection.3 Moreover, to get online detection results in real time, it requires a large resource budget on network bandwidth, IP address, storage, CPU (Central Processing Unit), and RAM (Random-access Memory) in cloud-based services for collecting and processing these large scaled data in time. As a result, in practical usage, the cost to obtain the fresh data and to detect and predict trending topics in real time should be seriously considered, and how to make a full use of the limited budget becomes a very important problem. To deal with this difficulty, in this article we propose a cost-effective detection and prediction framework for trending topics in microblogging services. The core notion of the framework is to select a small subset of representative users among the whole microblog users in offline based on historical data.