Real-Time Detection and Sorting of News on Microblogging Platforms

PACLIC 29

Wenting Tu, David W. Cheung, Nikos Mamoulis, Min Yang, Ziyu Lu
Department of Computer Science
The University of Hong Kong
Pokfulam, Hong Kong
fwttu,dcheung,nikos,myang,[email protected]

29th Pacific Asia Conference on Language, Information and Computation, pages 462 - 470, Shanghai, China, October 30 - November 1, 2015. Copyright 2015 by Wenting Tu, David Cheung, Nikos Mamoulis, Min Yang and Ziyu Lu.

Abstract

Due to the increasing popularity of microblogging platforms (e.g., Twitter), detecting real-time news from microblogs (e.g., tweets) has recently drawn a lot of attention. Most of the previous work on this subject detects news by analyzing propagation patterns of microblogs. This approach has two limitations: (i) many non-news microblogs (e.g., marketing activities) have propagation patterns similar to those of news microblogs and can therefore be falsely reported as news; (ii) using propagation patterns to identify news involves a time delay until the pattern is formed, so news is not detected in real time. We propose an alternative approach which, motivated by the necessity of real-time detection of news, does not rely on propagation of posts. Moreover, we propose a real-time sorting strategy that orders the detected news microblogs using a translational approach. An experimental evaluation on a large-scale microblogging dataset demonstrates the effectiveness of our approach.

1 Introduction

Microblogging platforms (e.g., Twitter or SinaWeibo) have become very popular and their role as news media has been recognized. As people actively talk about what is happening, microblogs are the place where first-hand news appears. In fact, over 85% of the leading topics on Twitter are news by nature (H. P et al, 2010).

Most of the recent works on news detection from microblogs rely on temporal patterns of propagation (G. L. et al, 2010; R. G. and K. Lerman, 2010; J. L and J. Yang, 2011; F. Z. et al, 2012). For instance, (F. Z. et al, 2012) assumes that bursty topics in microblogs correspond to events that have attracted the most online attention. To find such events, this work uses a model to detect bursty topics, assuming that a global event is likely to follow a time-dependent global topic distribution. Although detection methods relying on the propagation characteristics of microblogs are based on reasonable assumptions, they have certain limitations. First, some microblogs not related to news have propagation characteristics very similar to those of news microblogs. For example, a microblog with a promotion or a gift may follow a propagation pattern similar to that of a popular news event. Second, propagation behavior can only be analyzed after a microblog has been posted for a certain amount of time.

Previous work on news detection based on propagation knowledge cannot perform real-time detection, since propagation knowledge can only be obtained a period of time after microblogs are published. Some works explicitly mention that trying to detect microblogs using propagation knowledge in a short time reduces the effectiveness. For example, in (G. L. et al, 2010), experiments on Twitter data show that using 1-day propagation knowledge can mainly detect topics related to daily activities; only by using a 2-day history can this method detect some real emerging topics.

Therefore, using propagation characteristics is not a good idea if the objective is to detect news as soon as possible. An additional challenge is how to sort and present the newborn news microblogs according to their importance. Most of the current news detection platforms sort the microblogs by their publication time or their popularity. However, at any point in time, there can be many newborn news microblogs, all of which have close publication times; thus, sorting them by publication time may fail to show important news at the top. Besides, as mentioned before, newborn news microblogs have limited propagation information; thus, it is very difficult to assess the popularity of newborn news microblogs.

In this paper, we propose an alternative system for detecting and sorting microblogs with news in real time. Our framework does not rely on any propagation knowledge. Our system consists of three modules: news-microblog expert detection, news microblog detection, and news sorting. We observe that there exists a group of expert users whose microblogs are all of a single type (e.g., news). In the first module, we apply a methodology for selecting expert users based on their professionalism and activity. By treating the microblogs posted by the experts as the training corpus, the second module builds an ensemble classifier to detect news microblogs. The ensemble model combines weak classifiers trained from the corpora of different experts into a strong classifier. Moreover, it can be updated at low cost: once an expert posts some news microblogs, we only need to update the module corresponding to that expert's corpus instead of the whole model. The third module defines a score for each detected news microblog, in order to rank these microblogs. In this module, we first propose a novel text representation called Behavior-Actor-Venue bag of words (BAVbow) for news microblogs, which consolidates the most informative text from them. Then, we apply value transfer with confidence on the BAVbow representation, using the scores of the training corpus to rank the new microblogs whose scores are unknown.

We conduct experiments on data obtained from the microblogging service SinaWeibo, one of the most popular sites in China, used by well over 30% of Chinese Internet users, with a market penetration similar to that of Twitter. The effectiveness of each module is verified based on information collected by a group of users.

The remainder of this paper is organized as follows. In Section 2, we introduce our methodology by discussing in detail the news detection framework and the three sub-modules. Section 3 presents our experimental analysis. We conclude the paper and suggest directions for possible future work in Section 4.

2 Our methodology

Our system includes three modules. In a training session, the News-microblog Expert Detection module detects a set of microblogging users who actively post news microblogs. The posts by these experts form the training corpus of news microblogs, used to train the other two modules: the Expert-ensemble Classifier and the BAV Sorter. The Expert-ensemble Classifier (Section 2.2) is used to classify newborn microblogs as news or non-news. It combines base classifiers constructed from the experts' corpus by considering the professionalism and activity degrees of experts. The BAV Sorter module (Section 2.3) provides a new representation method for news microblogs and employs a value transfer strategy to define an importance score for each new post classified as news by the Expert-ensemble Classifier. After the system has been trained, newly posted microblogs can be classified as news/non-news by the Expert-ensemble Classifier, and those posts detected to be news can be ranked according to their importance by the BAV Sorter module. In the remainder of this section, we describe the three modules in detail.

2.1 News-microblog Expert Detection

Since microblogging data are large and updated at a high rate, it is not possible to label them manually. As an alternative, we propose an automatic corpus construction method, motivated by the observation that there exists a group of users whose microblogs are of a single type only. In the news domain, some real-world examples include @头条新闻 (#breaking news#) from SinaWeibo and @BBCWorld from Twitter, which always post news microblogs. Next, we present our methodology for finding these users, whom we call news experts. The selection strategy considers two characteristics of users: professionalism and activity.

2.1.1 Expert Candidates Retrieval via User Profile

Microblogging platforms have a very large number of users and it is impossible to analyze the microblogs written by all of them. Thus, it is necessary to select a subset of them, which is expected to include the news experts. The search for news experts will then be confined to this subset. There are two types of data that describe a user: his/her profile and the microblogs he/she posts. Profile information can be divided into three parts: (i) Description: This part includes usernames and other descriptive data given by the users themselves. (ii) Authority: Microblogging platforms provide verifications for some users, called verified accounts. Verification is currently used to establish authenticity in Twitter. The verified badge helps users discover high-quality sources of information. (iii) Influence: A natural feature that indicates the influence of a microblogging user is the number of followers, since this number indicates how many peo-

[...]priateness. The selection is based on the following rules: (i) microblogs posted by experts should focus on the type we are interested in (i.e., news); (ii) experts should be active, so that they provide time-relevant microblogs to be used in training our classification model. Thus, a candidate expert is more professional if a large percentage of his/her microblogs belong to the type. The more active the expert is, the more up-to-date his/her corpus is and the more adaptive it is to newborn microblogs.

Based on the above, we define the professionalism and activity degree of each candidate. To measure the professionalism of a candidate ec ∈ EC, we need to use a classifier which indicates whether a post by ec is a piece of news.
