Scientific Influence and Social Prestige: A Twitter Network Analysis of Scientific Leaders

Arslan Majid, Richard Allred, Amir Aminjavaheri, Eric Wang
Department of Electrical and Computer Engineering
University of Utah, Salt Lake City, Utah

Abstract: This paper analyzes the 50 most-followed scientists on Twitter, as identified in [1]. Scientists, like many others, use Twitter to express their scientific and social ideas. Interestingly, the social fame of these scientists is not necessarily related to their scientific influence, and the study of this relationship is the central idea of this paper. More specifically, we consider the 50 most famous scientists on the internet [1] and characterize the features that relate to their scientific influence versus their social fame. We approach the problem by forming networks of hashtags, mentions, and retweets and analyzing them with network analysis tools. Furthermore, we employ factor analysis to relate the features extracted from these networks to one another.

I. INTRODUCTION

As social media become more popular, the analysis of social networks has grown in usefulness. This research centers on understanding how scientific information flows through the Twitter network of 50 prominent scientists. The investigation focused on the correlations between social media fame and scientific influence. Social media fame was quantified by the K-index, Twitter profile information, and network centrality characteristics, while scientific influence was quantified by the H-index. The correlation study was performed using factor analysis and robust principal component analysis, which clearly separated scientific influence from social media fame and activity.

The 50 scientific leaders were selected by [1], which identified the most-followed scientists on Twitter. For these 50, the most recent tweets were extracted and analyzed. The Twitter API limits the number of tweets that can be extracted for each user to about 3200. For some scientists this covered only the past few months of activity, while for others, who tweet sparsely, it reached back to 2008. The number of tweets per scientist ranged from 6 to 3246, with about 60% of the scientists reaching the extraction limit.

Each scientist's professional contribution was quantified by the H-index [2], which uses the number of publications and the number of times those publications have been cited to approximate the scientific impact of an individual. A related measure, which compares a scientist's social media following to their scientific contribution, is the K-index [3]. This metric is a function of the number of Twitter followers and the number of scientific citations. It is defined as

$$\text{K-index} = \frac{\#\,\text{Followers}}{43.3 \cdot (\#\,\text{Citations})^{0.32}} \tag{1}$$

The rest of this paper is organized as follows. In Section II, we discuss the network analysis tools and methods used in the paper, as well as how we form the corresponding networks for our analysis. Section III contains the results and analysis. We conclude the paper in Section IV.

II. METHODS

A. Network Characteristics

With the rise of social network platforms such as Twitter, LinkedIn, and Facebook, much interest has been placed on understanding how these networks transfer, digest, and propagate information. In an effort to understand how scientific ideas flow, the Twitter activity of 50 leading scientists is studied. Here, network characteristics such as node degree, closeness, betweenness, and eigenvector centrality are related to measures of scientific influence and social network activity.

To form the networks, several approaches were taken, yielding both undirected and directed weighted graphs. One approach [4] is to count the number of hashtags used in common by two Twitter users, yielding an undirected weighted graph. Another possibility is to count the number of times a Twitter user retweets another user, yielding a directed weighted graph. A last possibility is to count the number of times a Twitter user mentions another Twitter user by their Twitter handle.

To identify the most important nodes in the graph, several measures of centrality [5] are explored. These metrics are:

1) Degree centrality measures the number of other nodes each node is connected to. It can be thought of as how connected a node is and how likely it is to observe whatever quantity is flowing through the network.
2) Closeness centrality is defined as the inverse of the farness of a node, where farness is the sum of the distances between the given node and all other nodes. Thus the larger the closeness centrality (equivalently, the smaller the farness), the closer the node is to the middle of the network.
3) Betweenness centrality is determined by counting the number of times a node lies on the shortest path between two other nodes. This measure can be useful for identifying those nodes in a network which, if removed, would result in an overall reduction in information transfer capability.
4) Eigenvector centrality is given by the dominant eigenvector of the adjacency matrix that describes the network. The motivation for this metric is to value nodes which, in addition to scoring well on the above centrality measures, are also in the neighborhood of other important nodes. This concept is very similar to the approach used by Google in their PageRank algorithm.

The objective of the study is to determine which network characteristics are correlated with the scientific and social influence measures, the H-index and the K-index. To that end, the above centrality measures were computed for the given networks.

B. Network Formation

In order to study some of the characteristics of our scientists, we translate the raw data into networks. More specifically, we formed the following networks:

• Network of users and hashtags: In this network, we connect each of the scientists to the hashtags that they have used in their tweets. Fig. 1 illustrates this network.
• Network of mentions: If user a mentions user b, we connect a to b with a directed edge in order to form the network of mentions. Fig. 2 illustrates the network of mentions within the scientists. We have also studied the global network of mentions. It should be noted that, in Twitter, a mention is used in two cases: replying to a user, or bringing a subject to that user's attention. We do not aim to differentiate between these two purposes.
• Network of retweets: In the third network that we study, we directly connect each user to the users that they have retweeted.

Fig. 1: Network of scientists and hashtags. To keep the visualization simple, only the most frequently used hashtags are shown.

C. Factor Analysis

A factor analysis is used to determine correlations among the many dimensions that exist within our data set. The approach is similar to [6], where the authors determine factors related to retweeting. The purpose of the analysis is to help interpret which components of a scientist's profile and tweet data relate to the parameters of interest: the K-index and the H-index.

The procedure that we follow is based on the principal component method [7] for factor analysis. Call M a matrix that is a subset of the data acquired through the Twitter API, describing various features of the Twitter data. Section III details the Twitter features we extracted, and M is defined in (2).

$$M = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_p \end{bmatrix} = \text{Matrix of all traits} \tag{2}$$

The raw data M is standardized to a matrix Z to remove the effects of features that have higher variances and means than others. Given this notation, we can write a factor analysis model for Z as:

$$Z = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_p \end{bmatrix} = \begin{bmatrix} \ell_{11} f_1 + \cdots + \ell_{1m} f_m + \epsilon_1 \\ \ell_{21} f_1 + \cdots + \ell_{2m} f_m + \epsilon_2 \\ \vdots \\ \ell_{p1} f_1 + \cdots + \ell_{pm} f_m + \epsilon_p \end{bmatrix} = Lf + \epsilon \tag{3}$$

Here L is a matrix of factor loadings, f is a set of common factors, and $\epsilon$ is the residual error. The factors are taken to be uncorrelated with one another (i.e., $\operatorname{cov}(f_i, f_j) = 0$ for $i \neq j$), so that they represent a new set of mutually uncorrelated variables. In summary, the model represents p observable features in terms of m unobserved and uncorrelated factors.

To estimate the factor loadings L, we turn to principal component analysis and diagonalize the covariance matrix of Z in (3) by taking its spectral decomposition.

$$R_{ZZ} = \operatorname{cov}(Z) = ZZ^T = E \Lambda E^T \tag{4}$$

We make the following assumptions: 1) the residual error component and the factors are uncorrelated, 2) the variance of the factors is 1, 3) the factors have zero mean, and 4) the covariance matrix of the errors, $\Psi = \operatorname{cov}(\epsilon)$, is a diagonal matrix, indicating that the off-diagonal residuals are negligible. Under these assumptions, $R_{ZZ}$ can be expanded from equation (3).

$$R_{ZZ} = LL^T + \Psi = E \Lambda E^T \tag{5}$$

If we take $LL^T = E \Lambda E^T$, then $\Psi = 0$, but this is not a parsimonious model since there are as many factors m as features p. To obtain a simpler model where $m < p$, only the principal components are taken.

$$LL^T = E_0 \Lambda_0 E_0^T \tag{6}$$

$E_0$ contains the m eigenvectors in E that have the largest eigenvalues, and $\Lambda_0$ is the diagonal matrix of those m eigenvalues. Using equation (6), the factor loadings L can be taken as

$$L = E_0 \sqrt{\Lambda_0} \tag{7}$$

The above L gives the factor loadings, the parameter of interest in factor analysis. Because the factors are uncorrelated with one another, they may be represented as axes in a graph, and the factor loadings then indicate the degree to which each feature is correlated with a given factor. Each factor is a linear combination of all the features in Z, and the factor itself represents an unobserved variable derived from the set of observed variables. By taking the m factors derived from the largest m principal components, we extract the combinations of features of Z that maximally account for the variation within Z.

In short, factor analysis describes the variability among observed, correlated features in terms of unobserved, uncorrelated variables called factors. The large number of observed features is assumed to be a weighted linear combination of a small number of factors plus some residual.
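The principal component method of equations (4) to (7) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the 3 x 3 correlation matrix `R` is hypothetical, the eigen-solver is a simple power iteration with deflation chosen to avoid external libraries, and the communality is computed as the sum of squared loadings (the usual textbook convention, which may differ from the Euclidean-distance wording used in Section III by a square root).

```python
import math

def top_eigenpairs(matrix, m, iterations=500):
    """Return the m largest (eigenvalue, eigenvector) pairs of a symmetric
    positive semi-definite matrix via power iteration with deflation."""
    p = len(matrix)
    work = [row[:] for row in matrix]  # copy, because deflation mutates it
    pairs = []
    for _factor in range(m):
        # Uneven start vector lowers the chance of starting orthogonal to the
        # dominant eigenvector (an assumption of this sketch, not a guarantee).
        vec = [float(i + 1) for i in range(p)]
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        # Rayleigh quotient of the converged unit vector is its eigenvalue.
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(p) for j in range(p))
        pairs.append((value, vec))
        # Deflate: subtract the component just found before the next pass.
        for i in range(p):
            for j in range(p):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

def factor_loadings(corr, m):
    """Loadings L = E_0 * sqrt(Lambda_0), as in equation (7).
    One row per feature, one column per factor."""
    pairs = top_eigenpairs(corr, m)
    return [[math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs]
            for i in range(len(corr))]

def communalities(loadings):
    """Sum of squared loadings per feature (textbook convention)."""
    return [sum(x * x for x in row) for row in loadings]

# Hypothetical correlation matrix: features 0 and 1 are strongly correlated,
# feature 2 is uncorrelated with both.
R = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

L = factor_loadings(R, m=2)
# Fraction of total variance captured by the two retained factors.
explained = sum(val for val, _ in top_eigenpairs(R, 2)) / len(R)
print([[round(x, 3) for x in row] for row in L])
print([round(c, 3) for c in communalities(L)], round(explained, 3))
```

The sign of each loading column is arbitrary, since an eigenvector and its negative are equally valid, so only magnitudes and communalities should be compared across runs or tools. For real data a library eigen-solver would be the more robust choice; power iteration is used here only to keep the sketch self-contained.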

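The centrality measures listed in Section II-A can be illustrated on a toy graph. The sketch below is a minimal example under stated assumptions: the four-node edge list is hypothetical (it is not taken from the study's data), the graph is treated as undirected, unweighted, and connected, and betweenness centrality is omitted for brevity. Eigenvector centrality is computed by power iteration on the adjacency matrix plus the identity, which has the same eigenvectors as the adjacency matrix but avoids oscillation on bipartite graphs.

```python
import math
from collections import deque

def build_adjacency(edges):
    """Return {node: set(neighbours)} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_centrality(adj):
    """Number of neighbours of each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def closeness_centrality(adj):
    """Inverse of farness, where farness is the sum of shortest-path hop
    counts to all other nodes. Assumes a connected graph with >= 2 nodes."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = 1.0 / sum(dist.values())
    return result

def eigenvector_centrality(adj, iterations=200):
    """Dominant eigenvector of the adjacency matrix, found by power
    iteration on (A + I)."""
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        new = {node: score[node] + sum(score[n] for n in adj[node])
               for node in adj}
        norm = math.sqrt(sum(x * x for x in new.values()))
        score = {node: x / norm for node, x in new.items()}
    return score

# Hypothetical graph: an edge links two users who share at least one hashtag.
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

adj = build_adjacency(EDGES)
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
eigen = eigenvector_centrality(adj)
for node in sorted(adj):
    print(node, degree[node], round(closeness[node], 3), round(eigen[node], 3))
```

In this toy graph node `a` ranks highest on all three measures, and the peripheral node `d` has the largest farness and therefore the smallest closeness, which is the sense in which a larger closeness value indicates a more central node.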
The factor analysis of Section II-C can therefore be taken as a way of creating a low-rank approximation of Z.

III. RESULTS

A. Network Analysis

In this section we discuss some of the characteristics of the networks that we defined in Section II-B. In the network of users and hashtags, the most frequently used hashtags are science (by 26 scientists), ff (24), ebola (16), autism (14), and irony (12). However, the most important hashtags according to eigenvector centrality are science, NASA, space, and breaking. This suggests that most of the high-ranked scientists in our network tweet about space science rather than other areas such as biology or engineering.

In our network of mentions, on the other hand, the most frequently mentioned users are guardian (by 31 scientists), nytimes (27), edyong209 (25), YouTube (24), washingtonpost (23), and neiltyson (21). This shows that our scientists mostly mention news agencies and journalists.

Finally, in our network of retweets, the most often retweeted users are Steven Pinker (retweeted by 12 scientists), Richard Dawkins (10), and Phil Plait (10).

Fig. 2: Network of mentions within the scientists.

B. Factor Analysis

The matrix Z that we use is derived from a large, sparse, and error-prone matrix of Twitter data from the set of 50 famous scientists. We note that, after considering the entire set of 50 scientists, we found that removing the scientist with the largest K-index and the scientist with the largest H-index provided a much more informative analysis of the dataset. These two scientists were significantly affecting the results of the factor analysis because they were outliers in many different features, given their placement in K-index and H-index. For example, Neil deGrasse Tyson had a K-index that was 30 times larger than the mean, and as a result his impact on the factor loadings was much larger than that of the other scientists.

Table I lists the features that were extracted from the set. Many other features were analyzed but were removed based on their communalities: any feature with a communality below 0.15 was removed. A communality is defined here as the Euclidean distance of a given feature's factor loading across the m factors. Communalities can be interpreted as a metric of how well the m-factor model explains a feature. The higher the communality, the better a feature is represented by the factor analysis.

TABLE I: Tweet Features

| Feature | Definition |
|---|---|
| Followers | # of users who follow the author |
| Citations | # of citations a scientist has accumulated |
| K-index | Ratio of a scientist's # of followers to citations |
| H-index | Measure of citation impact for published work by the scientist |
| Tweet Fav. Count | # of tweets the author has favorited |
| Friends Count | Author's friend count (# of users the author follows) |
| Tweet Count | # of tweets the author has posted |
| Sci. Ment. Others | # of tweets in which an author has mentioned someone else |
| Sci. is Ment. | # of tweets a scientist is mentioned in |
| RT Count | Total retweet count for all of the author's tweets |
| Total Hashtags Used | # of total hashtags used by the author |
| Real Name in Handle | Does the author have their real name in their handle? |
| Twitter Verified | Is the author a verified Twitter user? |

The features of Table I are used to develop the matrix Z. The eigenvalues obtained from Z are provided in Table II, with the cumulative fraction of variance described by the corresponding eigenvectors given next to each. The first 3 factors account for 65% of the total variance in Z, and therefore we use an m = 3 factor model for our data set.

TABLE II: Factors and corresponding eigenvalues

| Factor | Eigenvalue (sorted) | Cumulative variance (fraction) |
|---|---|---|
| 1 | 3.88 | 0.29 |
| 2 | 2.77 | 0.51 |
| 3 | 1.80 | 0.65 |
| 4 | 1.11 | 0.73 |
| 5 | 0.86 | 0.80 |
| 6 | 0.69 | 0.85 |
| 7 | 0.55 | 0.89 |
| 8 | 0.44 | 0.93 |
| 9 | 0.33 | 0.95 |
| 10 | 0.21 | 0.97 |
| 11 | 0.12 | 0.98 |
| 12 | 0.11 | 0.99 |
| 13 | 0.06 | 1.00 |

Figures 3, 4, and 5 present the factor analysis results. One caveat of analyzing a large number of features is that the factor maps become difficult to interpret: as the number of variables increases, more and more features pull and push other features towards each other. Another problem is that factor analysis results can be interpreted in many different ways. The way we interpret them is that features with a factor loading > 0.5 for a given factor represent more interesting phenomena than features with factor loadings < 0.5 for the same factor. Recall that a factor loading is the correlation between a factor and a feature.

Fig. 3: Factor map with Factor 1 (x-axis) and Factor 2 (y-axis).

Based on the factor maps, we interpret Factor 1 as representing social media fame, because the K-index, number of followers, and tweet favorite count all correlate well with it. Factor 2 represents the scientific influence of the scientists, as it correlates very well with the number of citations and the H-index. Factor 3 is more difficult to analyze since it does not align with any one tweet feature as well as Factors 1 and 2 do. However, we take Factor 3 to represent the network activity of the scientists, since it correlates very well with the tweet count and with scientists mentioning other Twitter users.

Factor 1 represents social media fame very well and speaks to how effective the K-index is at measuring social media fame. Take Figure 3 for example. Here, we see that the K-index is strongly correlated with the number of followers, tweet favorite count, retweet count, Twitter verification, and to some extent the count of mentions of the scientist. This indicates that scientists with social media fame have an active user base supporting them by retweeting, mentioning, and favoriting their tweets. Interestingly enough, scientists who are verified by Twitter are more likely to have a higher K-index than those who are not verified.

Factor 2 speaks of scientific influence. In Figure 3, we see an interesting effect: scientists who have significant influence in their field are not tweeting often. To a lesser extent, these scientists tend to have their real name shown in their Twitter profile and, though they are not tweeting often, their tweets are favorited by users.

On a higher level, the most interesting feature of the factor analysis is that it has effectively separated social media influence and scientific influence into two different uncorrelated variables, Factor 1 and Factor 2. Admittedly, one of the weaknesses of the factor analysis results is that only 48 scientists (or trials) are in Z, and such a small sample limits the statistical reliability of the results.

IV. CONCLUSION

The analysis of Twitter networks is capable of revealing interesting insights into human connections and activities. For this set of 50 scientists, the factor analysis suggests that Twitter activity is negatively associated with scientific influence as measured by citations and the H-index. Future work could center on understanding the network structure in more depth and on providing a means of predicting behavior from the network characteristics.

ACKNOWLEDGMENT

Thanks to the CHASM Lab from the Communication Department at the University of Utah for providing the data and motivation for this project. The CHASM team includes Dr. Sara Yeo, Dr. Avery Holton, Dr. Ye Sun, Veronica Dawson, and Elliot Fenech.

REFERENCES

[1] J. You, "Who are the science stars of Twitter?" Science, vol. 345, no. 6203, pp. 1440–1441, 2014.
[2] J. E. Hirsch, "An index to quantify an individual's scientific research output," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 46, pp. 16569–16572, 2005.
[3] N. Hall, "The Kardashian index: a measure of discrepant social media profile for scientists," Genome Biology, vol. 15, no. 7, p. 424, 2014.
[4] K. Holmberg, T. D. Bowman, S. Haustein, and I. Peters, "Astrophysicists' conversational connections on Twitter," PLoS ONE, vol. 9, no. 8, p. e106086, 2014.
[5] M. Newman, Networks: An Introduction. Oxford University Press, 2010.

Fig. 4: Factor map with Factor 1 (x-axis) and Factor 3 (y-axis).

Fig. 5: Factor map with Factor 2 (x-axis) and Factor 3 (y-axis).

[6] B. Suh, L. Hong, P. Pirolli, and E. H. Chi, "Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network," in 2010 IEEE Second International Conference on Social Computing (SocialCom). IEEE, 2010, pp. 177–184.
[7] R. Gorsuch, Factor Analysis, 2nd ed. Hillsdale, NJ: LEA, 1983.