Predicting and Characterising User Impact on Twitter
Vasileios Lampos¹, Nikolaos Aletras², Daniel Preoţiuc-Pietro² and Trevor Cohn³
¹ Department of Computer Science, University College London
² Department of Computer Science, University of Sheffield
³ Computing and Information Systems, The University of Melbourne
[email protected], {n.aletras, d.preotiuc}@dcs.shef.ac.uk, [email protected]
Abstract

The open structure of online social networks and their uncurated nature give rise to problems of user credibility and influence. In this paper, we address the task of predicting the impact of Twitter users based only on features under their direct control, such as usage statistics and the text posted in their tweets. We approach the problem as regression and apply linear as well as nonlinear learning methods to predict a user impact score, estimated by combining the numbers of the user's followers, followees and listings. The experimental results point out that a strong prediction performance is achieved, especially for models based on the Gaussian Processes framework. Hence, we can interpret various modelling components, transforming them into indirect ‘suggestions’ for impact boosting.

1 Introduction

Online social networks have become a widespread medium for information dissemination and interaction between millions of users (Huberman et al., 2009; Kwak et al., 2010), turning, at the same time, into a popular subject for interdisciplinary research, involving domains such as Computer Science (Sakaki et al., 2010), Health (Lampos and Cristianini, 2012) and Psychology (Boyd et al., 2010). Open access, along with the property of structured content retrieval for publicly posted data, has brought the microblogging platform of Twitter into the spotlight.

Vast quantities of human-generated text from a range of themes, including opinions, news and everyday activities, spread over a social network. Naturally, issues arise, like user credibility (Castillo et al., 2011) and content attractiveness (Suh et al., 2010), and quite often trustful or appealing information transmitters are identified by an impact assessment.¹ Intuitively, it is expected that user impact cannot be defined by a single attribute, but depends on multiple user actions, such as posting frequency and quality, interaction strategies, and the text or topics of the written communications.

In this paper, we start by predicting user impact as a statistical learning task (regression). For that purpose, we firstly define an impact score function for Twitter users driven by basic account properties. Afterwards, from a set of accounts, we measure several publicly available attributes, such as the quantity of posts or interaction figures. Textual attributes are also modelled, either by word frequencies or, more generally, by clusters of related words which quantify a topic-oriented participation. The main hypothesis being tested is whether textual and non-textual attributes encapsulate patterns that affect the impact of an account.

To model this data, we present a method based on nonlinear regression using Gaussian Processes, a Bayesian non-parametric class of methods (Rasmussen and Williams, 2006), proven more effective in capturing the multimodal user features. The modelling choice of excluding components that are not under an account's direct control (e.g. received retweets), combined with a significant user impact prediction performance (r = .78), enabled us to investigate further how specific aspects of a user's behaviour relate to impact, by examining the parameters of the inferred model.

Among our findings, we identify relevant features for this task and confirm that consistent activity and broad interaction are deciding impact factors. Informativeness, estimated by computing a joint user-topic entropy, contributes well to the separation between low and high impact accounts. Use case scenarios based on combinations of features are also explored, leading to findings such as that engaging about ‘serious’ or more ‘light’ topics may not register a differentiation in impact.

¹ For example, the influence assessment metric of Klout – http://www.klout.com.
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 405–413, Gothenburg, Sweden, April 26–30 2014. © 2014 Association for Computational Linguistics
2 Data

For the experimental process of this paper, we formed a Twitter data set (D1) of more than 48 million tweets produced by |U| = 38,020 users geolocated in the UK in the period between 14/04/2011 and 12/04/2012 (both dates included, ∆t = 365 days). D1 is a temporal subset of the data set used for modelling UK voting intentions in (Lampos et al., 2013). Geolocation of users was carried out by matching the location field in their profile with UK city names on DBpedia, as well as by checking that the user's timezone is set to G.M.T. (Rout et al., 2013). The use of a common greater geographical area (UK) was essential in order to derive a data set with language and topic homogeneity.

A distinct Twitter data set (D2) consisting of approx. 400 million tweets was formed for learning term clusters (Section 4.2). D2 was retrieved from Twitter's Gardenhose stream (a 10% sample of the entire stream) from 02/01 to 28/02/2011. D1 and D2 were processed using TrendMiner's pipeline (Preoţiuc-Pietro et al., 2012).

[Figure 1: Histogram of the user impact scores in our data set. The solid black line represents a generalised extreme value probability distribution fitted to our data, and the dashed line denotes the mean impact score (= 6.776). User @spam? is a sample account with φin = 10, φout = 1000 and φλ = 0; @lampos is a very active account, whereas @nikaletras is a regular user.]

3 User Impact Definition

On the microblogging platform of Twitter, user – or, in general, account – popularity is usually quantified by the raw number of followers (φin ≥ 0), i.e. other users interested in this account. Likewise, a user can follow others, which we denote as his set of followees (φout ≥ 0). It is expected that users with high numbers of followers are also popular in the real world, being well-known artists, politicians, brands and so on. However, non-popular entities, the majority in the social network, can also gain a great number of followers by exploiting, for example, a follow-back strategy.² Therefore, using solely the number of followers to quantify impact may lead to inaccurate outcomes (Cha et al., 2010). A natural alternative, the ratio φin/φout, is not a reliable metric, as it is invariant to scaling, i.e. it cannot differentiate accounts of the type {φin, φout} = {m, n} and {γ×m, γ×n}. We resolve this problem by squaring the number of followers (φin²/φout); note that the previous expression is equal to (φin − φout) × (φin/φout) + φin and thus, it incorporates the ratio as well as the difference between followers and followees.

An additional impact indicator is the number of times an account has been listed by others (φλ ≥ 0). Lists provide a way to curate content on Twitter; thus, users included in many lists are attractors of interest. Indeed, Pearson's correlation between φin and φλ for all the accounts in our data set is equal to .765 (p < .001); the two metrics are correlated, but not entirely, and on those grounds, it would be reasonable to use both for quantifying impact.

Consequently, we have chosen to represent user impact (S) as a log function of the number of followers, followees and listings, given by

S(φin, φout, φλ) = ln( (φλ + θ)(φin + θ)² / (φout + θ) ),   (1)

where θ is a smoothing constant set equal to 1 so that the natural logarithm is always applied on a real positive number. Figure 1 shows the impact score distribution for all the users in our sample, including some pointers to less or more popular Twitter accounts. The depicted user impact scores form the response variable in the regression models presented in the following sections.

4 User Account Features

This section presents the features used in the user impact prediction task. They are divided into two categories: non-textual and text-based. All features have the joint characteristic of being under the user's direct control, something essential for characterising impact based on the actions of a user. Attributes such as the number of received retweets or @-mentions (of a user in the tweets of others) were not considered, as they are not controlled by the account itself.

² An account follows other accounts randomly expecting that they will follow back.
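The impact score of Eq. 1 is straightforward to compute; the following sketch uses our own function and argument names (not from any code released with the paper) and also illustrates why the score, unlike the plain ratio φin/φout, is not invariant to scaling both counts:

```python
import math

def impact_score(followers, followees, listings, theta=1.0):
    """User impact score S (Eq. 1): a log function of the numbers of
    followers (phi_in), followees (phi_out) and listings (phi_lambda).
    The smoothing constant theta keeps the argument of ln positive."""
    return math.log(
        (listings + theta) * (followers + theta) ** 2 / (followees + theta)
    )

# A follow-back profile (many followees, no listings) scores low
# despite having followers, e.g. the @spam? example of Figure 1:
spam_like = impact_score(10, 1000, 0)

# Scaling followers and followees by the same factor changes the score,
# so accounts {m, n} and {gamma*m, gamma*n} are differentiated:
small = impact_score(100, 100, 5)
large = impact_score(1000, 1000, 5)
```

Here `spam_like` is negative (ln of a value below 1), while `large > small` even though both accounts have the same follower-to-followee ratio.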
a1: # of tweets
a2: proportion of retweets
a3: proportion of non-duplicate tweets
a4: proportion of tweets with hashtags
a5: hashtag-tokens ratio in tweets
a6: proportion of tweets with @-mentions
a7: # of unique @-mentions in tweets
a8: proportion of tweets with @-replies
a9: links ratio in tweets
a10: # of favourites the account made
a11: total # of tweets (entire history)
a12: using default profile background (binary)
a13: using default profile image (binary)
a14: enabled geolocation (binary)
a15: population of account's location
a16: account's location latitude
a17: account's location longitude
a18: proportion of days with nonzero tweets

Table 1: Non-textual attributes for a Twitter account used in the modelling process. All attributes refer to a set of 365 days (∆t), with the exception of a11, the total number of tweets in the entire history of an account. Attributes ai, i ∈ {2−6, 8, 9} are ratios of a1, whereas attribute a18 is a proportion of ∆t.

4.1 Non-textual attributes

The non-textual attributes (a) are derived either from general user behaviour statistics or directly from the account's profile. Table 1 presents the 18 attributes we extracted and used in our models.

4.2 Text features

We process the text in the tweets of D1 and compute daily unigram frequencies. By discarding terms that appear fewer than 100 times, we form a vocabulary of size |V| = 71,555. We then form a user term-frequency matrix of size |U| × |V| with the mean term frequencies per user during the time interval ∆t. All term frequencies are normalised by the total number of tweets posted by the user.

Apart from single-word frequencies, we are also interested in deriving a more abstract representation for each user. To achieve this, we learn word clusters from a distinct reference corpus (D2) that could potentially represent specific domains of discussion (or topics). From a multitude of proposed techniques, we have chosen to apply spectral clustering (Shi and Malik, 2000; Ng et al., 2002), a hard-clustering method appropriate for high-dimensional data and non-convex clusters (von Luxburg, 2007). Spectral clustering performs graph partitioning on the word-by-word similarity matrix. In our case, term similarity is reflected by the Normalised Pointwise Mutual Information (NPMI), an information-theoretic measure indicating which words co-occur in the same context (Bouma, 2009). We use the random walk graph Laplacian and only keep the largest component of the resulting graph, eliminating most stop words in the process. The number of clusters needs to be specified in advance, and each cluster's most representative words are identified by the following metric of centrality:

Cw(c) = ( Σ_{v ∈ c} NPMI(w, v) ) / (|c| − 1),   (2)

where w is the target word and c the cluster it belongs to (|c| denotes the cluster's size). Examples of extracted word clusters are illustrated in Table 4. Other techniques were also applied, such as online LDA (Hoffman et al., 2010), but we found that the results were not satisfactory, perhaps due to the short message length and the foreign terms co-occurring within a tweet. After forming the clusters using D2, we compute a topic score (τ) for each user-topic pair in D1, representing a normalised user-word frequency sum per topic.

5 Methods

This section presents the various modelling approaches for the underlying inference task, the impact score (S) prediction of Twitter users based on a set of their actions.

5.1 Learning functions for regression

We formulate this problem as a regression task, i.e. we infer a real-numbered value based on a set of observed features. As a simple baseline, we apply Ridge Regression (RR) (Hoerl and Kennard, 1970), a regularised version of ordinary least squares. Most importantly, we focus on nonlinear methods for the impact score prediction task, given the multimodality of the feature space. Recently, it was shown by Cohn and Specia (2013) that Support Vector Machines for Regression (SVR) (Vapnik, 1998; Smola and Schölkopf, 2004), commonly considered the state-of-the-art for NLP regression tasks, can be outperformed by Gaussian Processes (GPs), a kernelised, probabilistic approach to learning (Rasmussen and Williams, 2006). Their setting is close to ours, in that they had few (17) features and were also aiming to predict a complex continuous phenomenon (human post-editing time). The initial stages of our experimental process confirmed that GPs performed better than SVR.
We thus based our modelling around them, including RR for comparison.

In GP regression, for the inputs x ∈ R^d we want to learn a function f: R^d → R that is drawn from a GP prior

f(x) ∼ GP(m(x), k(x, x′)),   (3)

where m(x) and k(x, x′) denote the mean (set to 0 in our experiments) and covariance (or kernel) functions respectively. The GP kernel function represents the covariance between pairs of inputs. We wish to limit f to smooth functions over the inputs, with different smoothness in each input dimension, assuming that some features are more useful than others. This can be accommodated by a squared exponential covariance function with Automatic Relevance Determination (ARD) (Neal, 1996; Williams and Rasmussen, 1996):

k_ard(x, x′) = σ² exp( − Σ_i (x_i − x′_i)² / (2ℓ_i²) ),   (4)

where σ² denotes the overall variance and ℓ_i is the length-scale parameter for feature x_i; all hyperparameters are learned from data during model inference. Parameter ℓ_i is inversely proportional to the feature's relevancy in the model, i.e. high values of ℓ_i indicate a low degree of relevance for the corresponding x_i. By setting ℓ_i = ℓ in Eq. 4, we learn a common length-scale for all the dimensions – this is known as the isotropic squared exponential function (k_iso), since it is based purely on the difference |x − x′|. k_iso is a preferred choice when the dimensionality of the input space is high.

Having set our covariance functions, predictions are conducted using Bayesian integration

P(y∗ | x∗, O) = ∫ P(y∗ | x∗, f) P(f | O) df,   (5)

where y∗ is the response variable, O a labelled training set and x∗ the current observation. We learn the hyperparameters of the model by maximising the log marginal likelihood P(y | O) using gradient ascent. However, inference becomes intractable when many training instances (n) are present, as the number of computations needed is O(n³) (Quiñonero-Candela and Rasmussen, 2005). Since our training samples are tens of thousands, we apply a sparse approximation method (FITC), which bases parameter learning on a few inducing points in the training set (Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2006).

5.2 Models

For predicting user impact on Twitter, we develop three regression models that build on each other. The first and simplest one (A) uses only the non-textual attributes as features; the performance of A is tested using RR,³ SVR as well as a GP model. For SVR we used an RBF kernel (equivalent to k_iso), whereas for the GP we applied the following covariance function:

k(a, a′) = k_ard(a, a′) + k_noise(a, a′) + β,   (6)

where k_noise(a, a′) = σ² × δ(a, a′), δ is a Kronecker delta function and β is the regression bias; this function consists of (|a| + 3) hyperparameters. Note that the sum of covariance functions is also a valid covariance function (Rasmussen and Williams, 2006).

The second model (AW) extends model A by adding word frequencies as features. The 500 most frequent terms in D1 are discarded as stop words and we use the following 2,000 ones (denoted by w). Setting x = {a, w}, the covariance function becomes

k(x, x′) = k_ard(a, a′) + k_iso(w, w′) + k_noise(x, x′) + β,   (7)

where we apply k_iso on the term frequencies due to their high dimensionality; the number of hyperparameters is (|a| + 5). This is an intermediate model aiming to evaluate whether the incorporation of text improves prediction performance.

Finally, in the third model (AC), instead of relying on the high-dimensional space of single words, we use topic-oriented collections of terms extracted by applying spectral clustering (see Section 4.2). By denoting the set of different clusters or topics as τ and the entire feature space as x = {a, τ}, the covariance function now becomes

k(x, x′) = k_ard(x, x′) + k_noise(x, x′) + β.   (8)

The number of hyperparameters is equal to (|a| + |τ| + 3) and this model is applied for |τ| = 50 and 100.

6 Experiments

Here we present the experimental results for the user impact prediction task and then investigate the factors that can affect it.

6.1 Predictive Accuracy

We evaluated the performance of the proposed models via 10-fold cross-validation. Results are presented in Table 2.

³ Given that the representation of attributes a16 and a17 (latitude, longitude) is ambiguous in a linear model, they were not included in the RR-based models.
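Before turning to the results, the kind of GP configuration described in Section 5.2 (an ARD squared exponential plus a noise term, in the spirit of Eq. 6) can be approximated with off-the-shelf tools. The sketch below uses scikit-learn as a stand-in for the GP framework used in the paper (it maximises the log marginal likelihood but does not implement the FITC sparse approximation), on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # 200 users, 3 attributes
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters

# A vector of per-feature length-scales gives ARD (cf. k_ard, Eq. 4);
# WhiteKernel plays the role of k_noise in Eq. 6.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(3)) \
         + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Learned length-scales: an irrelevant feature ends up with a large
# length-scale, the relevant one with a small length-scale.
ls = gp.kernel_.k1.k2.length_scale
```

After fitting, `ls[0]` should be the smallest of the three length-scales, reproducing in miniature the relevance ranking used in Section 6.2.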
Root Mean Squared Error (RMSE) and Pearson's correlation (r) between predictions and responses were used as the performance metrics. Overall, the best performance in terms of both RMSE (2.21 impact points) and linear correlation (r = .78, p < .001) is achieved by the GP model (AC) that combines non-textual attributes with 100 topic clusters; the difference in performance with all other models is statistically significant.⁴ The linear baseline (RR) follows the same pattern of improvement through the different models, but never manages to reach the performance of the nonlinear alternative. As mentioned previously, we have also tried SVR with an RBF kernel for model A (parameters were optimised on a held-out development set) and the performance (RMSE: 2.33, r = .75, p < .001) was significantly worse than the one achieved by the GP model.

Model           Linear (RR)        Nonlinear (GP)
                r       RMSE       r       RMSE
A               .667    2.642      .759    2.298
AW              .712    2.529      .768    2.263
AC, |τ| = 50    .703    2.518      .774    2.234
AC, |τ| = 100   .714    2.480      .780    2.210

Table 2: Average performance (RMSE and Pearson's r) derived from 10-fold cross-validation for the task of user impact score prediction.

Notice that when word-based features are introduced in model AW, performance improves. This was one of the motivations for including text in the modelling, apart from the notion that the posted content should also affect general impact. Lastly, turning this problem from regression to classification by creating 3 impact score pseudo-classes based on the .25 and the .9 quantiles of the response variable (4.3 and 11.4 impact score points respectively) and by using the outputs of model AC (|τ| = 100) in each phase of the 10-fold cross-validation, we achieve a 75.86% classification accuracy.⁵

Model           Top relevant features
A               a13*, a11, a7, a1, a9, a8, a18, a4, a6, a3
AW              a7, a1, a11, a13*, a9, a8, a18, a4, a6, a15
AC, |τ| = 50    a13*, a11, a7, τ′1, a1, a9, a8, τ′2, a6, τ′3
AC, |τ| = 100   a13*, a11, a7, a1, a9, τ1, τ2, τ3, a18, a8

Table 3: The 10 most relevant features in descending relevance order for all GP models. τ′i and τi denote word clusters (may vary in each model).⁶

[Figure 2: User impact distribution (x-axis: impact points, y-axis: # of user accounts) for users with a low (L) or a high (H) participation in a selection of relevant non-textual attributes. Dot-dashed lines denote the respective mean impact score; the red line is the mean of the entire sample (= 6.776).]

6.2 Qualitative Analysis

Given the model's strong performance, we now conduct a more thorough analysis to identify and characterise the properties that affect aspects of user impact. The GP's length-scale parameters (ℓ_i) – which are inversely proportional to feature relevancy – are used for ranking feature importance. Note that since our data set consists of UK users, some results may be biased towards specific cultural properties.

Non-textual attributes. Table 3 lists the 10 most relevant attributes (or topics, where applicable) as extracted in each GP model. Ranking is determined by the mean value of the length-scale parameter for each feature in the 10-fold cross-validation process. We do not show feature rankings derived from the RR models, as we focus on the models with the best performance.

⁴ Indicated by performing a t-test (5% significance level).
⁵ Similar performance scores can be estimated for different class threshold settings.
⁶ Length-scales are comparable for features of the same variance (z-scored). Binary features (denoted by *) are not z-scored, but for comparison purposes we have rescaled their length-scales using the feature's variance.
Label (topic)                     µ(ℓ) ± σ(ℓ)    Cluster's words ranked by centrality                                        |c|
τ1: Weather                       3.73 ± 1.80    mph, humidity, barometer, gust, winds, hpa, temperature, kt, #weather [...]  309
τ2: Healthcare/Finance/Housing    5.44 ± 1.55    nursing, nurse, rn, registered, bedroom, clinical, #news, estate, #hospital, rent, healthcare, therapist, condo, investment, furnished, medical, #nyc, occupational, investors, #ny, litigation, tutors, spacious, foreclosure [...]  1281
τ3: Politics                      6.07 ± 2.86    senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections, charged, election, charges, #religion, arrest, repeal, dems, #christian, reform, democratic, pleads, #jesus, #atheism [...]  950
τ4: Showbiz/Movies/TV             7.36 ± 2.25    damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer, kristen, stefan, robert, catholic, stewart, katherine, lois, jackson, vampire, natalie, #vampirediaries, tempah, tinie, weasley, turner, rowland [...]  1943
τ5: Commerce                      7.83 ± 2.77    chevrolet, inventory, coupon, toyota, mileage, sedan, nissan, adde, jeep, 4x4, 2002, #coupon, enhanced, #deal, dodge, gmc, 20%, suv, 15%, 2005, 2003, 2006, coupons, discount, hatchback, purchase, #ebay, 10% [...]  608
τ6: Twitter Hashtags              8.22 ± 2.98    #teamfollowback, #500aday, #tfb, #instantfollowback, #ifollowback, #instantfollow, #followback, #teamautofollow, #autofollow, #mustfollow [...]  194
τ7: Social Unrest                 8.37 ± 5.52    #egypt, #tunisia, #iran, #israel, #palestine, tunisia, arab, #jan25, iran, israel, protests, egypt, #yemen, #iranelection, israeli, #jordan, regime, yemen, #gaza, protesters, #lebanon, #syria, egyptian, #protest, #iraq [...]  321
τ8: Non-English                   8.45 ± 3.80    yg, nak, gw, gue, kalo, itu, aku, aja, ini, gak, klo, sih, tak, mau, buat [...]  469
τ9: Horoscope/Gambling            9.11 ± 3.07    horoscope, astrology, zodiac, aries, libra, aquarius, pisces, taurus, virgo, capricorn, horoscopes, sagitarius, comprehensive, lottery, jackpot [...]  1354
τ10: Religion/Sports              10.29 ± 6.27   #jesustweeters, psalm, christ, #nhl, proverbs, unto, salvation, psalms, lord, kjv, righteousness, niv, bible, pastor, #mlb, romans, awards, nhl [...]  1610

Table 4: The 10 most relevant topics (for model AC, |τ| = 100) in the prediction of a user's impact score, together with their most central words. The topics are ranked by their mean length-scale, µ(ℓ), in the 10-fold cross-validation process (σ(ℓ) is the respective standard deviation).
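The centrality measure of Eq. 2, which ranks the words shown in Table 4 within their clusters, can be sketched as follows (a minimal illustration with our own names; the paper computes NPMI from word co-occurrence statistics over the tweets of D2):

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalised PMI (Bouma, 2009): PMI(x, y) / -log p(x, y),
    bounded in [-1, 1]; 1 for perfect co-occurrence, 0 for independence."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def centrality(word, cluster, npmi_table):
    """Eq. 2: mean NPMI between `word` and every other word of its
    cluster; `npmi_table[(u, v)]` holds precomputed NPMI values."""
    total = sum(npmi_table[(word, v)] for v in cluster if v != word)
    return total / (len(cluster) - 1)
```

Sorting a cluster's words by `centrality` in descending order yields rankings of the kind displayed in the third column of Table 4.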
It is worth mentioning, though, that RR's outputs also followed similar ranking patterns, e.g. the top 5 features in model A were a18, a7, a3, a11 and a9. Notice that, across all models, among the strongest features are the total number of posts, either in the entire account's history (a11) or within the 365-day interval of our experiment (a1), and the number of unique @-mentions (a7), good indicators of user activity and user interaction respectively. Feature a13 is also a very good predictor, but is of limited utility for modelling our data set because very few accounts maintain the default profile photo (0.4%). Less relevant attributes (not shown) are the ones related to the location of a user (a16, a17), signalling that the whereabouts of a user may not necessarily relate to impact. Another low-relevance attribute is the number of favourites that an account made (a10), something reasonable, as those weak endorsements do not affect the main stream of content updates in the social network.

In Figure 2, we present the distribution of user impact for accounts with low (left side) and high (right side) participation in a selection of non-textual attributes. Low (L) and high (H) participation are defined by selecting the 500 accounts with the lowest and highest scores for the specific attribute. The means of (L) and (H) are compared with the mean impact score in our sample. As anticipated, accounts with low activity (a11) are likely to be assigned impact scores far below the mean, while very active accounts may follow a quite opposite pattern. Avoiding mentioning (a7) or replying (a8) to others may not affect (on average) an impact score positively or negatively; however, accounts that do many unique @-mentions are distributed around a clearly higher impact score. On the other hand, users that overdo @-replies are distributed below the mean impact score. Furthermore, accounts that post irregularly with gaps longer than a day (a18) or avoid using links in their tweets (a9) will probably appear in the low impact score range.

Topics. Regarding prediction accuracy (Table 2), performance improves when topics are included. In turn, some of the topics replace non-textual attributes in the relevancy ranking (Table 3). Table 4 presents the 10 most relevant topic word clusters based on their mean length-scale µ(ℓ) in the 10-fold cross-validation process for the best-performing GP model (AC, |τ| = 100). We see that clusters with their most central words representing topics such as ‘Weather’, ‘Healthcare/Finance’, ‘Politics’ and ‘Showbiz’ come up on top.

Contrary to the non-textual attributes, accounts with low participation in a topic (for the vast majority of topics) were distributed along impact score values lower than the mean. Based on the fact that word clusters are not small in size, this is a rational outcome indicating that accounts with small word-frequency sums (i.e. the ones that do not tweet much) will more likely be users with small impact