Predicting and Characterising User Impact on Twitter
Vasileios Lampos¹, Nikolaos Aletras², Daniel Preoţiuc-Pietro² and Trevor Cohn³
¹ Department of Computer Science, University College London
² Department of Computer Science, University of Sheffield
³ Computing and Information Systems, The University of Melbourne
[email protected], {n.aletras, d.preotiuc}@dcs.shef.ac.uk, [email protected]
Abstract

The open structure of online social networks and their uncurated nature give rise to problems of user credibility and influence. In this paper, we address the task of predicting the impact of Twitter users based only on features under their direct control, such as usage statistics and the text posted in their tweets. We approach the problem as regression and apply linear as well as nonlinear learning methods to predict a user impact score, estimated by combining the numbers of the user's followers, followees and listings. The experimental results point out that a strong prediction performance is achieved, especially for models based on the Gaussian Processes framework. Hence, we can interpret various modelling components, transforming them into indirect ‘suggestions’ for impact boosting.

1 Introduction

Online social networks have become a widespread medium for information dissemination and interaction between millions of users (Huberman et al., 2009; Kwak et al., 2010), turning, at the same time, into a popular subject for interdisciplinary research, involving domains such as Computer Science (Sakaki et al., 2010), Health (Lampos and Cristianini, 2012) and Psychology (Boyd et al., 2010). Open access, along with the property of structured content retrieval for publicly posted data, has brought the microblogging platform of Twitter into the spotlight.

Vast quantities of human-generated text from a range of themes, including opinions, news and everyday activities, spread over a social network. Naturally, issues arise, like user credibility (Castillo et al., 2011) and content attractiveness (Suh et al., 2010), and quite often trustful or appealing information transmitters are identified by an impact assessment.¹ Intuitively, it is expected that user impact cannot be defined by a single attribute, but depends on multiple user actions, such as posting frequency and quality, interaction strategies, and the text or topics of the written communications.

In this paper, we start by predicting user impact as a statistical learning task (regression). For that purpose, we firstly define an impact score function for Twitter users driven by basic account properties. Afterwards, from a set of accounts, we measure several publicly available attributes, such as the quantity of posts or interaction figures. Textual attributes are also modelled, either by word frequencies or, more generally, by clusters of related words which quantify a topic-oriented participation. The main hypothesis being tested is whether textual and non-textual attributes encapsulate patterns that affect the impact of an account.

To model this data, we present a method based on nonlinear regression using Gaussian Processes, a Bayesian non-parametric class of methods (Rasmussen and Williams, 2006), proven more effective in capturing the multimodal user features. The modelling choice of excluding components that are not under an account's direct control (e.g. received retweets), combined with a significant user impact prediction performance (r = .78), enabled us to investigate further how specific aspects of a user's behaviour relate to impact, by examining the parameters of the inferred model.

Among our findings, we identify relevant features for this task and confirm that consistent activity and broad interaction are deciding impact factors. Informativeness, estimated by computing a joint user-topic entropy, contributes well to the separation between low and high impact accounts. Use case scenarios based on combinations of features are also explored, leading to findings such as that engaging about ‘serious’ or more ‘light’ topics may not register a differentiation in impact.

¹ For example, the influence assessment metric of Klout – http://www.klout.com.
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 405–413, Gothenburg, Sweden, April 26–30 2014. © 2014 Association for Computational Linguistics
2 Data

For the experimental process of this paper, we formed a Twitter data set (D1) of more than 48 million tweets produced by |U| = 38,020 users geolocated in the UK in the period between 14/04/2011 and 12/04/2012 (both dates included, ∆t = 365 days). D1 is a temporal subset of the data set used for modelling UK voting intentions in (Lampos et al., 2013). Geolocation of users was carried out by matching the location field in their profile with UK city names on DBpedia, as well as by checking that the user's timezone is set to G.M.T. (Rout et al., 2013). The use of a common greater geographical area (UK) was essential in order to derive a data set with language and topic homogeneity.

A distinct Twitter data set (D2) consisting of approx. 400 million tweets was formed for learning term clusters (Section 4.2). D2 was retrieved from Twitter's Gardenhose stream (a 10% sample of the entire stream) from 02/01 to 28/02/2011. D1 and D2 were processed using TrendMiner's pipeline (Preoţiuc-Pietro et al., 2012).

[Figure 1: Histogram of the user impact scores in our data set. The solid black line represents a generalised extreme value probability distribution fitted to our data, and the dashed line denotes the mean impact score (= 6.776). User @spam? is a sample account with φin = 10, φout = 1000 and φλ = 0; @lampos is a very active account, whereas @nikaletras is a regular user.]

3 User Impact Definition

On the microblogging platform of Twitter, user – or, in general, account – popularity is usually quantified by the raw number of followers (φin ≥ 0), i.e. other users interested in this account. Likewise, a user can follow others, which we denote as his set of followees (φout ≥ 0). It is expected that users with high numbers of followers are also popular in the real world, being well-known artists, politicians, brands and so on. However, non-popular entities, the majority in the social network, can also gain a great number of followers by exploiting, for example, a follow-back strategy.² Therefore, using solely the number of followers to quantify impact may lead to inaccurate outcomes (Cha et al., 2010). A natural alternative, the ratio φin/φout, is not a reliable metric, as it is invariant to scaling, i.e. it cannot differentiate accounts of the type {φin, φout} = {m, n} and {γ×m, γ×n}. We resolve this problem by squaring the number of followers (φin²/φout); note that the previous expression is equal to (φin − φout) × (φin/φout) + φin and thus, it incorporates the ratio as well as the difference between followers and followees.

An additional impact indicator is the number of times an account has been listed by others (φλ ≥ 0). Lists provide a way to curate content on Twitter; thus, users included in many lists are attractors of interest. Indeed, Pearson's correlation between φin and φλ for all the accounts in our data set is equal to .765 (p < .001); the two metrics are correlated, but not entirely, and on those grounds, it would be reasonable to use both for quantifying impact.

Consequently, we have chosen to represent user impact (S) as a log function of the number of followers, followees and listings, given by

S(φin, φout, φλ) = ln( (φλ + θ)(φin + θ)² / (φout + θ) ),   (1)

where θ is a smoothing constant set equal to 1 so that the natural logarithm is always applied on a real positive number. Figure 1 shows the impact score distribution for all the users in our sample, including some pointers to less or more popular Twitter accounts. The depicted user impact scores form the response variable in the regression models presented in the following sections.

4 User Account Features

This section presents the features used in the user impact prediction task. They are divided into two categories: non-textual and text-based. All features have the joint characteristic of being under the user's direct control, something essential for characterising impact based on the actions of a user. Attributes such as the number of received retweets or @-mentions (of a user in the tweets of others) were not considered, as they are not controlled by the account itself.

² An account follows other accounts randomly expecting that they will follow back.
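The impact score of Eq. 1 is straightforward to compute; the following sketch uses our own function and argument names (not from any code released with the paper) and also illustrates why the score, unlike the plain ratio φin/φout, is not invariant to scaling both counts:

```python
import math

def impact_score(followers, followees, listings, theta=1.0):
    """User impact score S (Eq. 1): a log function of the numbers of
    followers (phi_in), followees (phi_out) and listings (phi_lambda).
    The smoothing constant theta keeps the argument of ln positive."""
    return math.log(
        (listings + theta) * (followers + theta) ** 2 / (followees + theta)
    )

# A follow-back profile (many followees, no listings) scores low
# despite having followers, e.g. the @spam? example of Figure 1:
spam_like = impact_score(10, 1000, 0)

# Scaling followers and followees by the same factor changes the score,
# so accounts {m, n} and {gamma*m, gamma*n} are differentiated:
small = impact_score(100, 100, 5)
large = impact_score(1000, 1000, 5)
```

Here `spam_like` is negative (ln of a value below 1), while `large > small` even though both accounts have the same follower-to-followee ratio.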
a1: # of tweets
a2: proportion of retweets
a3: proportion of non-duplicate tweets
a4: proportion of tweets with hashtags
a5: hashtag-tokens ratio in tweets
a6: proportion of tweets with @-mentions
a7: # of unique @-mentions in tweets
a8: proportion of tweets with @-replies
a9: links ratio in tweets
a10: # of favourites the account made
a11: total # of tweets (entire history)
a12: using default profile background (binary)
a13: using default profile image (binary)
a14: enabled geolocation (binary)
a15: population of account's location
a16: account's location latitude
a17: account's location longitude
a18: proportion of days with nonzero tweets

Table 1: Non-textual attributes for a Twitter account used in the modelling process. All attributes refer to a set of 365 days (∆t), with the exception of a11, the total number of tweets in the entire history of an account. Attributes ai, i ∈ {2−6, 8, 9} are ratios of a1, whereas attribute a18 is a proportion of ∆t.

4.1 Non-textual attributes

The non-textual attributes (a) are derived either from general user behaviour statistics or directly from the account's profile. Table 1 presents the 18 attributes we extracted and used in our models.

4.2 Text features

We process the text in the tweets of D1 and compute daily unigram frequencies. By discarding terms that appear fewer than 100 times, we form a vocabulary of size |V| = 71,555. We then form a user term-frequency matrix of size |U| × |V| with the mean term frequencies per user during the time interval ∆t. All term frequencies are normalised by the total number of tweets posted by the user.

Apart from single-word frequencies, we are also interested in deriving a more abstract representation for each user. To achieve this, we learn word clusters from a distinct reference corpus (D2) that could potentially represent specific domains of discussion (or topics). From a multitude of proposed techniques, we have chosen to apply spectral clustering (Shi and Malik, 2000; Ng et al., 2002), a hard-clustering method appropriate for high-dimensional data and non-convex clusters (von Luxburg, 2007). Spectral clustering performs graph partitioning on the word-by-word similarity matrix. In our case, term similarity is reflected by the Normalised Pointwise Mutual Information (NPMI), an information-theoretic measure indicating which words co-occur in the same context (Bouma, 2009). We use the random walk graph Laplacian and only keep the largest component of the resulting graph, eliminating most stop words in the process. The number of clusters needs to be specified in advance, and each cluster's most representative words are identified by the following metric of centrality:

Cw(c) = ( Σ_{v ∈ c} NPMI(w, v) ) / (|c| − 1),   (2)

where w is the target word and c the cluster it belongs to (|c| denotes the cluster's size). Examples of extracted word clusters are illustrated in Table 4. Other techniques were also applied, such as online LDA (Hoffman et al., 2010), but we found that the results were not satisfactory, perhaps due to the short message length and the foreign terms co-occurring within a tweet. After forming the clusters using D2, we compute a topic score (τ) for each user-topic pair in D1, representing a normalised user-word frequency sum per topic.

5 Methods

This section presents the various modelling approaches for the underlying inference task, the impact score (S) prediction of Twitter users based on a set of their actions.

5.1 Learning functions for regression

We formulate this problem as a regression task, i.e. we infer a real-numbered value based on a set of observed features. As a simple baseline, we apply Ridge Regression (RR) (Hoerl and Kennard, 1970), a regularised version of ordinary least squares. Most importantly, we focus on nonlinear methods for the impact score prediction task, given the multimodality of the feature space. Recently, it was shown by Cohn and Specia (2013) that Support Vector Machines for Regression (SVR) (Vapnik, 1998; Smola and Schölkopf, 2004), commonly considered the state-of-the-art for NLP regression tasks, can be outperformed by Gaussian Processes (GPs), a kernelised, probabilistic approach to learning (Rasmussen and Williams, 2006). Their setting is close to ours, in that they had few (17) features and were also aiming to predict a complex continuous phenomenon (human post-editing time). The initial stages of our experimental process confirmed that GPs performed better than SVR.
We thus based our modelling around them, including RR for comparison.

In GP regression, for the inputs x ∈ R^d we want to learn a function f: R^d → R that is drawn from a GP prior

f(x) ∼ GP(m(x), k(x, x′)),   (3)

where m(x) and k(x, x′) denote the mean (set to 0 in our experiments) and covariance (or kernel) functions respectively. The GP kernel function represents the covariance between pairs of inputs. We wish to limit f to smooth functions over the inputs, with different smoothness in each input dimension, assuming that some features are more useful than others. This can be accommodated by a squared exponential covariance function with Automatic Relevance Determination (ARD) (Neal, 1996; Williams and Rasmussen, 1996):

k_ard(x, x′) = σ² exp( − Σ_i (x_i − x′_i)² / (2ℓ_i²) ),   (4)

where σ² denotes the overall variance and ℓ_i is the length-scale parameter for feature x_i; all hyperparameters are learned from data during model inference. Parameter ℓ_i is inversely proportional to the feature's relevancy in the model, i.e. high values of ℓ_i indicate a low degree of relevance for the corresponding x_i. By setting ℓ_i = ℓ in Eq. 4, we learn a common length-scale for all the dimensions – this is known as the isotropic squared exponential function (k_iso), since it is based purely on the difference |x − x′|. k_iso is a preferred choice when the dimensionality of the input space is high.

Having set our covariance functions, predictions are conducted using Bayesian integration

P(y∗ | x∗, O) = ∫ P(y∗ | x∗, f) P(f | O) df,   (5)

where y∗ is the response variable, O a labelled training set and x∗ the current observation. We learn the hyperparameters of the model by maximising the log marginal likelihood P(y | O) using gradient ascent. However, inference becomes intractable when many training instances (n) are present, as the number of computations needed is O(n³) (Quiñonero-Candela and Rasmussen, 2005). Since our training samples are tens of thousands, we apply a sparse approximation method (FITC), which bases parameter learning on a few inducing points in the training set (Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2006).

5.2 Models

For predicting user impact on Twitter, we develop three regression models that build on each other. The first and simplest one (A) uses only the non-textual attributes as features; the performance of A is tested using RR,³ SVR as well as a GP model. For SVR we used an RBF kernel (equivalent to k_iso), whereas for the GP we applied the following covariance function:

k(a, a′) = k_ard(a, a′) + k_noise(a, a′) + β,   (6)

where k_noise(a, a′) = σ² × δ(a, a′), δ is a Kronecker delta function and β is the regression bias; this function consists of (|a| + 3) hyperparameters. Note that the sum of covariance functions is also a valid covariance function (Rasmussen and Williams, 2006).

The second model (AW) extends model A by adding word frequencies as features. The 500 most frequent terms in D1 are discarded as stop words and we use the following 2,000 ones (denoted by w). Setting x = {a, w}, the covariance function becomes

k(x, x′) = k_ard(a, a′) + k_iso(w, w′) + k_noise(x, x′) + β,   (7)

where we apply k_iso on the term frequencies due to their high dimensionality; the number of hyperparameters is (|a| + 5). This is an intermediate model aiming to evaluate whether the incorporation of text improves prediction performance.

Finally, in the third model (AC), instead of relying on the high-dimensional space of single words, we use topic-oriented collections of terms extracted by applying spectral clustering (see Section 4.2). By denoting the set of different clusters or topics as τ and the entire feature space as x = {a, τ}, the covariance function now becomes

k(x, x′) = k_ard(x, x′) + k_noise(x, x′) + β.   (8)

The number of hyperparameters is equal to (|a| + |τ| + 3) and this model is applied for |τ| = 50 and 100.

6 Experiments

Here we present the experimental results for the user impact prediction task and then investigate the factors that can affect it.

6.1 Predictive Accuracy

We evaluated the performance of the proposed models via 10-fold cross-validation. Results are presented in Table 2.

³ Given that the representation of attributes a16 and a17 (latitude, longitude) is ambiguous in a linear model, they were not included in the RR-based models.
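Before turning to the results, the kind of GP configuration described in Section 5.2 (an ARD squared exponential plus a noise term, in the spirit of Eq. 6) can be approximated with off-the-shelf tools. The sketch below uses scikit-learn as a stand-in for the GP framework used in the paper (it maximises the log marginal likelihood but does not implement the FITC sparse approximation), on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # 200 users, 3 attributes
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters

# A vector of per-feature length-scales gives ARD (cf. k_ard, Eq. 4);
# WhiteKernel plays the role of k_noise in Eq. 6.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(3)) \
         + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Learned length-scales: an irrelevant feature ends up with a large
# length-scale, the relevant one with a small length-scale.
ls = gp.kernel_.k1.k2.length_scale
```

After fitting, `ls[0]` should be the smallest of the three length-scales, reproducing in miniature the relevance ranking used in Section 6.2.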
Root Mean Squared Error (RMSE) and Pearson's correlation (r) between predictions and responses were used as the performance metrics. Overall, the best performance in terms of both RMSE (2.21 impact points) and linear correlation (r = .78, p < .001) is achieved by the GP model (AC) that combines non-textual attributes with 100 topic clusters; the difference in performance with all other models is statistically significant.⁴ The linear baseline (RR) follows the same pattern of improvement through the different models, but never manages to reach the performance of the nonlinear alternative. As mentioned previously, we have also tried SVR with an RBF kernel for model A (parameters were optimised on a held-out development set) and the performance (RMSE: 2.33, r = .75, p < .001) was significantly worse than the one achieved by the GP model.

Model           Linear (RR)        Nonlinear (GP)
                r       RMSE       r       RMSE
A               .667    2.642      .759    2.298
AW              .712    2.529      .768    2.263
AC, |τ| = 50    .703    2.518      .774    2.234
AC, |τ| = 100   .714    2.480      .780    2.210

Table 2: Average performance (RMSE and Pearson's r) derived from 10-fold cross-validation for the task of user impact score prediction.

Notice that when word-based features are introduced in model AW, performance improves. This was one of the motivations for including text in the modelling, apart from the notion that the posted content should also affect general impact. Lastly, turning this problem from regression to classification by creating 3 impact score pseudo-classes based on the .25 and the .9 quantiles of the response variable (4.3 and 11.4 impact score points respectively) and by using the outputs of model AC (|τ| = 100) in each phase of the 10-fold cross-validation, we achieve a 75.86% classification accuracy.⁵

Model           Top relevant features
A               a13*, a11, a7, a1, a9, a8, a18, a4, a6, a3
AW              a7, a1, a11, a13*, a9, a8, a18, a4, a6, a15
AC, |τ| = 50    a13*, a11, a7, τ′1, a1, a9, a8, τ′2, a6, τ′3
AC, |τ| = 100   a13*, a11, a7, a1, a9, τ1, τ2, τ3, a18, a8

Table 3: The 10 most relevant features in descending relevance order for all GP models. τ′i and τi denote word clusters (may vary in each model).⁶

[Figure 2: User impact distribution (x-axis: impact points, y-axis: # of user accounts) for users with a low (L) or a high (H) participation in a selection of relevant non-textual attributes. Dot-dashed lines denote the respective mean impact score; the red line is the mean of the entire sample (= 6.776).]

6.2 Qualitative Analysis

Given the model's strong performance, we now conduct a more thorough analysis to identify and characterise the properties that affect aspects of user impact. The GP's length-scale parameters (ℓ_i) – which are inversely proportional to feature relevancy – are used for ranking feature importance. Note that since our data set consists of UK users, some results may be biased towards specific cultural properties.

Non-textual attributes. Table 3 lists the 10 most relevant attributes (or topics, where applicable) as extracted in each GP model. Ranking is determined by the mean value of the length-scale parameter for each feature in the 10-fold cross-validation process. We do not show feature rankings derived from the RR models, as we focus on the models with the best performance.

⁴ Indicated by performing a t-test (5% significance level).
⁵ Similar performance scores can be estimated for different class threshold settings.
⁶ Length-scales are comparable for features of the same variance (z-scored). Binary features (denoted by *) are not z-scored, but for comparison purposes we have rescaled their length-scales using the feature's variance.
Label (topic)                     µ(ℓ) ± σ(ℓ)    Cluster's words ranked by centrality                                        |c|
τ1: Weather                       3.73 ± 1.80    mph, humidity, barometer, gust, winds, hpa, temperature, kt, #weather [...]  309
τ2: Healthcare/Finance/Housing    5.44 ± 1.55    nursing, nurse, rn, registered, bedroom, clinical, #news, estate, #hospital, rent, healthcare, therapist, condo, investment, furnished, medical, #nyc, occupational, investors, #ny, litigation, tutors, spacious, foreclosure [...]  1281
τ3: Politics                      6.07 ± 2.86    senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections, charged, election, charges, #religion, arrest, repeal, dems, #christian, reform, democratic, pleads, #jesus, #atheism [...]  950
τ4: Showbiz/Movies/TV             7.36 ± 2.25    damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer, kristen, stefan, robert, catholic, stewart, katherine, lois, jackson, vampire, natalie, #vampirediaries, tempah, tinie, weasley, turner, rowland [...]  1943
τ5: Commerce                      7.83 ± 2.77    chevrolet, inventory, coupon, toyota, mileage, sedan, nissan, adde, jeep, 4x4, 2002, #coupon, enhanced, #deal, dodge, gmc, 20%, suv, 15%, 2005, 2003, 2006, coupons, discount, hatchback, purchase, #ebay, 10% [...]  608
τ6: Twitter Hashtags              8.22 ± 2.98    #teamfollowback, #500aday, #tfb, #instantfollowback, #ifollowback, #instantfollow, #followback, #teamautofollow, #autofollow, #mustfollow [...]  194
τ7: Social Unrest                 8.37 ± 5.52    #egypt, #tunisia, #iran, #israel, #palestine, tunisia, arab, #jan25, iran, israel, protests, egypt, #yemen, #iranelection, israeli, #jordan, regime, yemen, #gaza, protesters, #lebanon, #syria, egyptian, #protest, #iraq [...]  321
τ8: Non-English                   8.45 ± 3.80    yg, nak, gw, gue, kalo, itu, aku, aja, ini, gak, klo, sih, tak, mau, buat [...]  469
τ9: Horoscope/Gambling            9.11 ± 3.07    horoscope, astrology, zodiac, aries, libra, aquarius, pisces, taurus, virgo, capricorn, horoscopes, sagitarius, comprehensive, lottery, jackpot [...]  1354
τ10: Religion/Sports              10.29 ± 6.27   #jesustweeters, psalm, christ, #nhl, proverbs, unto, salvation, psalms, lord, kjv, righteousness, niv, bible, pastor, #mlb, romans, awards, nhl [...]  1610

Table 4: The 10 most relevant topics (for model AC, |τ| = 100) in the prediction of a user's impact score, together with their most central words. The topics are ranked by their mean length-scale, µ(ℓ), in the 10-fold cross-validation process (σ(ℓ) is the respective standard deviation).
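The centrality measure of Eq. 2, which ranks the words shown in Table 4 within their clusters, can be sketched as follows (a minimal illustration with our own names; the paper computes NPMI from word co-occurrence statistics over the tweets of D2):

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalised PMI (Bouma, 2009): PMI(x, y) / -log p(x, y),
    bounded in [-1, 1]; 1 for perfect co-occurrence, 0 for independence."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def centrality(word, cluster, npmi_table):
    """Eq. 2: mean NPMI between `word` and every other word of its
    cluster; `npmi_table[(u, v)]` holds precomputed NPMI values."""
    total = sum(npmi_table[(word, v)] for v in cluster if v != word)
    return total / (len(cluster) - 1)
```

Sorting a cluster's words by `centrality` in descending order yields rankings of the kind displayed in the third column of Table 4.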
It is worth mentioning, though, that RR's outputs also followed similar ranking patterns, e.g. the top 5 features in model A were a18, a7, a3, a11 and a9. Notice that, across all models, among the strongest features are the total number of posts, either in the entire account's history (a11) or within the 365-day interval of our experiment (a1), and the number of unique @-mentions (a7), good indicators of user activity and user interaction respectively. Feature a13 is also a very good predictor, but is of limited utility for modelling our data set because very few accounts maintain the default profile photo (0.4%). Less relevant attributes (not shown) are the ones related to the location of a user (a16, a17), signalling that the whereabouts of a user may not necessarily relate to impact. Another low-relevance attribute is the number of favourites that an account made (a10), something reasonable, as those weak endorsements do not affect the main stream of content updates in the social network.

In Figure 2, we present the distribution of user impact for accounts with low (left side) and high (right side) participation in a selection of non-textual attributes. Low (L) and high (H) participation are defined by selecting the 500 accounts with the lowest and highest scores for the specific attribute. The means of (L) and (H) are compared with the mean impact score in our sample. As anticipated, accounts with low activity (a11) are likely to be assigned impact scores far below the mean, while very active accounts may follow a quite opposite pattern. Avoiding mentioning (a7) or replying (a8) to others may not affect (on average) an impact score positively or negatively; however, accounts that do many unique @-mentions are distributed around a clearly higher impact score. On the other hand, users that overdo @-replies are distributed below the mean impact score. Furthermore, accounts that post irregularly with gaps longer than a day (a18) or avoid using links in their tweets (a9) will probably appear in the low impact score range.

Topics. Regarding prediction accuracy (Table 2), performance improves when topics are included. In turn, some of the topics replace non-textual attributes in the relevancy ranking (Table 3). Table 4 presents the 10 most relevant topic word clusters based on their mean length-scale µ(ℓ) in the 10-fold cross-validation process for the best-performing GP model (AC, |τ| = 100). We see that clusters with their most central words representing topics such as ‘Weather’, ‘Healthcare/Finance’, ‘Politics’ and ‘Showbiz’ come up on top.

Contrary to the non-textual attributes, accounts with low participation in a topic (for the vast majority of topics) were distributed along impact score values lower than the mean. Based on the fact that word clusters are not small in size, this is a rational outcome indicating that accounts with small word-frequency sums (i.e. the ones that do not tweet much) will more likely be users with small impact