Finding Influencers in Social Networks

Carolina de Figueiredo Bento

Dissertation submitted to obtain the Master Degree in Information Systems and Computer Engineering

Jury
President: Prof. Dr. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Dr. Bruno Emanuel da Graça Martins
Co-Supervisor: Prof. Dr. Pável Pereira Calado
Member: Prof. Dr. Alexandre Paulo Lourenço Francisco

November 2012

Abstract

From the millions of users that social platforms have, one can observe that the activities of a select number of users are more rapidly perceived and spread through the network than those of others. These users are the influencers. They generate trends and shape opinions in social networks, being crucial in areas such as marketing or opinion mining.

In my MSc thesis, I studied network analysis methods to identify influencers, experimenting with different types of networks, namely location-based networks from services like FourSquare or Twitter, that include relationships between users and between users and the locations they have visited, and academic citation networks, i.e., networks that relate scientific papers through citations.

Within location-based networks, I estimated the most influential nodes through a set of network analysis techniques. Since there is no ground-truth list, i.e., a list containing a set of well-known, accepted influencers, the veracity of these results was assessed by comparison to traditional measures (e.g., the number of friends a user has). The majority of the influencers are not the ones with the highest number of friends.

Within academic citation networks, the most influential papers identified were indeed considered important publications, being authored by renowned authors and recipients of important awards, and constituting fundamental reading or recent developments on a topic. I also developed a framework to predict future influence scores and download counts through a combination of features. Accurate estimates were obtained through the use of learning methods such as RT-Rank.

Keywords: Social Networks, Network Analysis, Impact Scores, Finding Influencers


Resumo

Os serviços de social networking têm milhões de utilizadores; contudo, apercebemo-nos de que a actividade de um grupo selecto de utilizadores é mais rapidamente captada e propagada pela rede do que a de outros. Chamamos a este grupo os influentes. Eles criam tendências e dominam as opiniões nas redes sociais, sendo cruciais em áreas como o marketing ou o opinion mining.

Na minha tese, estudei métodos de análise de redes para identificar influentes, analisando dois tipos de redes, nomeadamente redes baseadas na localização, provindas de serviços como o FourSquare ou o Twitter, que incluem relações entre os utilizadores e entre estes e os locais que visitaram, e redes de citações académicas, i.e., relacionando artigos científicos através de citações.

Em redes baseadas na localização, estimaram-se quais os nós mais influentes, através de um conjunto de técnicas de análise de redes. A veracidade destes resultados foi aferida comparando com medidas tradicionais (e.g., o número de amigos de um utilizador), dado não existir uma lista de influentes para validação, i.e., uma lista contendo um conjunto de influentes unanimemente reconhecidos.

Em redes de citações académicas, os artigos obtidos como mais influentes são realmente publicações importantes, por serem da autoria de cientistas de renome galardoados no passado, por serem publicações essenciais ou por serem desenvolvimentos recentes num tópico específico. Desenvolvi também uma framework que prevê futuros valores de influência e o futuro total de downloads efectuados, combinando características como valores de influência anteriores. Através da utilização de métodos de aprendizagem como o RT-Rank, é possível realizar estimativas precisas.

Palavras-chave: Redes Sociais, Análise de Redes, Valores de Influência, Encontrar Influentes


Acknowledgments

First and foremost, I have to thank my parents, sister and brother-in-law for the unconditional support and selflessness throughout these years, and especially during my MSc thesis.

I must thank my advisors, Prof. Dr. Bruno Martins and Prof. Dr. Pável Calado, for all the support, motivation, patience and availability. It is very comforting to be able to share ideas and openly discuss new ways of addressing a problem with such ease. I must also thank them for giving me the opportunity of being part of projects such as the European Digital Mathematics Library (EuDML) and the Services for Intelligent Geographical Information Systems (SInteliGIS), both funded by the Portuguese Foundation for Science and Technology (FCT) through the project grants with references 250503 in CIP-ICT-PSP.2009.2.4 and PTDC/EIA-EIA/109840/2009, respectively.

I thank all the colleagues and close friends that have accompanied me throughout the years, and especially the ones who have filled these last couple of years with so much joy, laughter and camaraderie. So, to Ana Silva, João Lobato Dias, Luís Santos, João Amaro, Pedro Cruz, Jacqueline Jardim, Maria Rosa, Luís Luciano, Carlos Simões, Mafalda Abreu, Célia Tavares and, thankfully, many others, I express my enormous gratitude for keeping me (in)sane.

Last, but definitely not least, I must thank my boyfriend, João Fernandes, for the unconditional love, support, patience and confidence, for helping me be more creative and acute during the stressful times, and for showing me there is always a light at the end of the tunnel.


Contents

Abstract
Resumo
Acknowledgments

1 Introduction
1.1 Hypothesis and Methodology
1.2 Main Contributions
1.3 Organization of the Dissertation

2 Fundamental Concepts
2.1 Fundamental Concepts in Graph Theory
2.2 Influencers in Social Networks
2.3 Prestige, Popularity and Attention in Social Networks
2.4 Recognition, Novelty, Homophily and Reciprocity
2.5 Active versus Inactive Users, User Retention, Confounding, Social Influence and Social Correlation
2.6 Information Cascades
2.7 Information Diffusion Models and Measures
2.8 Graph Measures and Bibliographic Indexes
2.9 Unsupervised Rank Aggregation Approaches
2.10 Supervised Learning for Rank Aggregation
2.11 Summary

3 Related Work
3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm
3.2 The PageRank Algorithm and its Variants
3.2.1 Weighted PageRank
3.2.2 Topic-Sensitive PageRank
3.2.3 TwitterRank
3.3 The Influence-Passivity (IP) Algorithm
3.4 Citation and Co-Authorship Networks
3.5 Temporal Issues in Ranking Scientific Articles
3.6 Summary

4 Finding Influencers in Social Networks
4.1 Available Resources for Finding Influencers
4.1.1 Characterizing Networks
4.2 Analysis of Location-based Social Networks
4.2.1 Data Collection from Online Services
4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm
4.3 Analysis of Academic Social Networks
4.3.1 Predicting Future Influence Scores and Download Counts
4.3.2 The Learning Approach
4.4 Summary

5 Validation Experiments
5.1 The Considered Datasets
5.2 Evaluation Methodology
5.3 The Obtained Results
5.3.1 Finding Influencers
5.3.2 Predicting Future PageRank Scores and Download Counts
5.4 Summary

6 Conclusions
6.1 Summary of Results
6.2 Future Work

Bibliography

Appendices
A Important Awards in Computer Science

List of Tables

5.1 Characterization of the FourSquare and Twitter networks.
5.2 Characterization of the DBLP dataset.
5.3 Characterization of the DBLP network.
5.4 User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph built from the FourSquare dataset.
5.5 User influence scores for the PageRank and HITS algorithms, for the User Graph built from the FourSquare dataset.
5.6 User influence scores for the IP algorithm, over the network built from the FourSquare dataset.
5.7 Spot influence scores for the PageRank and HITS algorithms (which present the exact same top-10), for the User+Spot Graph built from the FourSquare dataset.
5.8 User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph built from the Twitter dataset.
5.9 User influence scores for the PageRank and HITS algorithms, for the User Graph built from the Twitter dataset.
5.10 Spot influence scores for the PageRank and HITS algorithms, for the User+Spot Graph built from the Twitter dataset.
5.11 PageRank scores for the top-10 highest ranked papers of the DBLP dataset.
5.12 Results for the prediction of PageRank impact scores for papers in the DBLP dataset.
5.13 Results for the prediction of download numbers for papers in the DBLP dataset.

List of Figures

2.1 A graph with the set of vertices V={1, ..., 8}, the set of edges E={(1, 2), (2, 4), (3, 4), ...}, encoding a path P with length 6 (adapted from Diestel, 2005).
2.2 Graph with three components and two SCCs denoted by dashed lines (adapted from Easley & Kleinberg (2010) and Cormen et al. (2001)).
2.3 Flowchart for the Single Transferable Vote rule.
2.4 Learning-To-Rank (L2R) framework (adapted from Liu (2009)).
3.5 A graph with hubs and authorities (adapted from Kleinberg (1998)).
3.6 A graph illustrating the computation of PageRank (adapted from Page et al. (1998)).
3.7 The general TwitterRank framework (adapted from Weng et al. (2010)).
4.8 Example of a location-based social network (adapted from Zheng & Zhou (2011)).
4.9 A sequence of subdivisions of the world sphere, starting from the octahedron, down to level 5, corresponding to 8192 spherical triangles. The circular triangles have been plotted as planar ones, for simplicity (adapted from Szalay et al. (2007)).
4.10 The HTM recursive division process (adapted from Szalay et al. (2007)).
4.11 Transformation of the original network graph (left) to our IP algorithm graph (right).
4.12 Structure of the citation graph built upon the DBLP data.
4.13 Framework for predicting future PageRank scores and download counts.
5.14 Degree distribution for nodes in the User+Spot Graph and the User Graph, from the FourSquare and Twitter datasets.
5.15 Degree distribution for the DBLP dataset from 2008 to 2011.

Chapter 1

Introduction

The rise of platforms such as Twitter1 or Google+2, with their focus on user-generated content and social networks, has brought the study of authority and influence over social networks to the forefront of current research. For companies and other public entities, identifying and engaging with influential authors in social media is critical, since any opinions they express can rapidly spread far and wide. For users, when presented with a vast amount of content relevant to a topic of interest, sorting content by the source's authority or influence can also assist in information retrieval.

There has been a substantial amount of recent work studying influence and the diffusion of information in social networks. Moreover, there has also been much work in the field of network analysis that has focused explicitly on sociometry, including quantitative measures of influence, authority, centrality or prestige. These measures (e.g., degree centrality or betweenness centrality) are essentially heuristics, usually based on intuitive notions such as access and control over resources, or brokerage of information.

In the context of my MSc thesis, I conducted a thorough study on the problem of identifying the most influential nodes in a social network. With two different types of networks at hand, namely location-based social networks from services such as FourSquare or Twitter, and academic citation networks encoding relations between papers, the main focus was on applying well-known network analysis techniques and algorithms.

One of the most important contributions of this work consisted in adapting the Influence-Passivity (IP) algorithm, initially intended strictly for Twitter data and relying on re-tweets to capture information flow, to the context of location-based social networks, where the propagation of information happens via the locations that users visit, i.e., exploring patterns in which a new user j visits a location l after one of his friends i has already visited l.

1 http://twitter.com/
2 https://plus.google.com/
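The location-mediated propagation pattern described above can be made concrete. The sketch below is a simplified illustration, not the thesis's actual implementation; the function name and data layout (timestamped check-in tuples and a friends map) are assumptions made for this example.

```python
from collections import defaultdict

def influence_pairs(checkins, friends):
    """Extract (i, j) pairs where user j checked into a location l
    after a friend i had already visited l.

    checkins: list of (timestamp, user, location) tuples.
    friends:  dict mapping each user to a set of friends.
    Returns a dict counting how often i preceded friend j at a location.
    """
    counts = defaultdict(int)
    # Earliest visit time of each user at each location.
    first_visit = {}
    for t, u, l in sorted(checkins):
        first_visit.setdefault((u, l), t)
    for (j, l), tj in first_visit.items():
        for i in friends.get(j, set()):
            ti = first_visit.get((i, l))
            if ti is not None and ti < tj:
                counts[(i, j)] += 1  # i potentially influenced j
    return dict(counts)
```

Counts of this kind are what an IP-style algorithm could then use in place of re-tweet edges.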

In what regards the study of influence in academic social networks, I studied techniques for estimating future influence scores and future download counts. In this context, I specifically developed a framework to predict the future PageRank scores and future download counts of scientific articles that were downloaded from the ACM Digital Library1, for a specific year, through a combination of features that include the age of the article and previous PageRank scores.

1.1 Hypothesis and Methodology

In the context of my MSc thesis, I focused on the task of identifying the most influential users in a social network, working with two types of networks, namely (1) location-based networks from services like FourSquare or Twitter, that include relationships between users in the network and between users and the locations they have visited, and (2) academic citation networks, i.e., networks that relate scientific papers according to their citation relationships. The main hypothesis I tried to validate was that we can identify the most influential users through social network analysis techniques and algorithms. More specifically, with location-based networks, the presence of locations aids in the propagation of influence scores through the network, while with academic citation networks, one can focus on the temporal dynamics of networks, using networks from the past to predict future networks and assessing how influence scores evolve over time.

In order to validate the research hypothesis, we began by collecting real and up-to-date data from two social networking platforms, namely FourSquare2 and Twitter. Regarding academic social networks, a citation network was built with data from the DBLP3 digital library. In location-based social networks, different ranking algorithms were computed and the top-10 highest ranked users and the top-10 highest ranked spots were extracted and analyzed. To assess the accuracy of these results, we conducted an empirical analysis of our top-10, looking into user profiles and spot check-ins, in order to understand how profile characteristics relate to influence in the network. The quality of the results from the academic citation network was assessed by cross-checking the authors of the top-10 highest ranked scientific papers in the DBLP collection against the recipients of various renowned scientific awards – see Appendix A. For the experiments on the estimation of future influence scores and future download counts of scientific papers, a set of evaluation metrics, including the normalized root mean squared error and the Spearman correlation, was used to assess the quality of our predictions against the real influence scores.

1 http://dl.acm.org/
2 https://foursquare.com/
3 http://www.informatik.uni-trier.de/~ley/db/
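The two evaluation metrics mentioned above can be computed directly. The sketch below is a minimal illustration (the simple Spearman formula used here assumes no tied ranks), not the evaluation code actually used in the thesis.

```python
def nrmse(actual, predicted):
    """Root mean squared error, normalised by the range of the actual values."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return (mse ** 0.5) / (max(actual) - min(actual))

def spearman(x, y):
    """Spearman rank correlation between two score lists (no-ties version)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A perfect prediction yields an NRMSE of 0 and a Spearman correlation of 1.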

1.2 Main Contributions

The following are the most important contributions of this thesis, ordered by relevance:

• I conducted a thorough study of ranking algorithms, with special focus on the PageRank algorithm and its variants. I specifically implemented the HITS and the Influence-Passivity (IP) algorithms. The IP algorithm was adapted to the context of location-based social networks. I computed the influence score for each node and extracted the highest ranked nodes in each type of network. The implementation of the HITS and IP algorithms was made available as an open-source project1, so that it can be re-used by others researching this topic.

• I implemented crawlers to extract data from FourSquare and from Twitter, from which I built networks with two types of nodes, namely users and spots (i.e., the locations users have visited). These networks were used in experiments for finding the most influential nodes through algorithms such as HITS, PageRank or IP. The source code for the FourSquare crawler was made available as an open-source project2, so it can be re-used by others researching this topic.

• I built a citation network with data from the DBLP digital library, extracting its most influential papers after computing the PageRank algorithm. The accuracy of these results was assessed by cross-checking the authors of these papers against a list of recipients of various renowned scientific awards. From this experiment, we could conclude that the majority of the most influential papers in this network are authored by recipients of important scientific awards.

• I developed a framework to predict the future PageRank scores and the future download counts of scientific papers, for a specific year, using the citation network built from DBLP data. This task was addressed through an ensemble learning regression algorithm. I assessed the impact that different features have on the accuracy of the results. Our predictions were compared to the real PageRank scores and the real number of downloads from the ACM Digital Library for each specific paper and year. We concluded that in some cases, depending on the combination of features used, having more information can deviate the results negatively, while in others, as we combine more information, predictions become closer to the real values. Globally, this prediction approach proved to be accurate, with results very close to the real values.

1 http://code.google.com/p/ezgraph/
2 http://code.google.com/p/fscrawler/

1.3 Organization of the Dissertation

The structure of the rest of this document is as follows: Chapter 2 presents fundamental concepts in social network analysis. Chapter 3 describes the most significant work related to the task of finding influencers in social networks, and to the analysis of location-based social networks. Chapter 4 details the work developed in the context of my MSc thesis, namely the methodology for data collection, how the networks were built, the specific implementation and adaptation of the IP algorithm, and the methodology for finding the influential nodes in the networks. Regarding the experiment on the prediction of future PageRank scores, Chapter 4 also describes the features and the learning approach that was used. Chapter 5 describes the validation experiments and the obtained results, alongside a brief discussion. Finally, Chapter 6 closes this document, highlighting the most important conclusions of this MSc thesis and presenting possible paths for improvement and future work.

Chapter 2

Fundamental Concepts

This chapter introduces the fundamental concepts related to the problem of finding influencers in social networks. After a brief introduction to graph theory, more specific concepts are presented, such as what it is to be an influencer, the distinction between popularity and prestige, what one means when discussing social gestures, and the social gestures that are most relevant in the context of this MSc thesis, namely homophily and reciprocity. Finally, this chapter introduces the fundamental concepts behind graph centrality measures, bibliometric indexes and rank aggregation approaches, the latter concerning the combination of the outputs of various ranking methods to generate a consensual ranked list.

2.1 Fundamental Concepts in Graph Theory

A graph G can be represented as a pair G = (V, E), where V or V(G) is the set of vertices or nodes and E or E(G) is the set of edges or links between the nodes (Figure 2.1). The number of vertices of a graph indicates the graph's order (Diestel, 2005). Graphs are usually used to represent networks, either undirected (Figure 2.1) or directed (i.e., digraphs, in which the edges have a direction from a node A to a node B). A way of representing a directed graph D is with an adjacency matrix, a square matrix A = A(D) where each cell (i, j) has a value equal to 1 if there is an edge from i to j, and a value equal to 0 otherwise (Harary, 1962).

In what regards graph measures, the degree d_G(i), or valency, of a vertex i in an undirected graph G is the number |E(i)| of edges at i, which is equal to the number of neighbours of i, i.e., the number of vertices that are adjacent to i. It can be mathematically expressed as follows, where a(i, j) denotes a cell in the graph's adjacency matrix:

d_G(i) = \sum_j a(i, j) = \sum_j a(j, i) \qquad (2.1)

In what regards directed graphs, we have the same notation as in undirected graphs, with the exception that, when specifying the set of edges E, all pairs of connected vertices have to be oriented. Besides the measure of degree, one can also measure the in-degree d_G^{in}(i) and the out-degree d_G^{out}(i) of a vertex i, which are, respectively, the number of incoming and outgoing edges of that vertex (Clark & Holton, 1991). The in-degree and out-degree also correspond to the cardinality of, respectively, the set of predecessors and the set of successors of a node, and can be formally expressed as follows:

d_G^{in}(i) = \sum_j a(j, i) \qquad (2.2)

d_G^{out}(i) = \sum_j a(i, j) \qquad (2.3)
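Under the adjacency-matrix representation, Equations 2.1 to 2.3 translate directly into code. A minimal sketch, using plain lists of lists for the matrix:

```python
def in_degree(A, i):
    """In-degree of vertex i: number of incoming edges (Equation 2.2)."""
    return sum(A[j][i] for j in range(len(A)))

def out_degree(A, i):
    """Out-degree of vertex i: number of outgoing edges (Equation 2.3)."""
    return sum(A[i][j] for j in range(len(A)))

def degree(A, i):
    """Degree of vertex i in an undirected graph (Equation 2.1).
    For a symmetric adjacency matrix, row and column sums coincide."""
    return out_degree(A, i)
```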

One might also want to represent a weighted network, i.e., a network in which each edge is assigned a specific weight. A weighted network can be expressed as an adjacency matrix in which each entry indicates the weight w_{ij} of the corresponding edge, as follows (Newman, 2004):

A_{ij} = w_{ij} \qquad (2.4)

When representing a weighted network with a graph, one just has to add the weights to each edge, thus defining a weighted graph. For a weighted network, besides the in-degree and out-degree of a vertex i, one is usually more interested in the strength of i, i.e., the sum of the weights w of the corresponding edges. The in-strength s_i^{in} and out-strength s_i^{out} of a vertex i are expressed as follows (Luciano et al., 2005):

s_i^{in} = \sum_j w(j, i) \qquad (2.5)

s_i^{out} = \sum_j w(i, j) \qquad (2.6)
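Equations 2.5 and 2.6 can be computed in the same way as the degree measures, with the weight matrix taking the place of the 0/1 adjacency matrix:

```python
def in_strength(W, i):
    """In-strength of vertex i: sum of weights of incoming edges (Equation 2.5)."""
    return sum(W[j][i] for j in range(len(W)))

def out_strength(W, i):
    """Out-strength of vertex i: sum of weights of outgoing edges (Equation 2.6)."""
    return sum(W[i][j] for j in range(len(W)))
```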

Also important in graph analysis is the notion of a stochastic matrix. A square matrix A = (a_{kλ}) can only be called stochastic if all its elements are non-negative and if the following condition is verified (Brauer, 1952):

\sum_{\lambda=1}^{n} a_{k\lambda} = 1, \quad k = 1, 2, ..., n \qquad (2.7)

Stochastic matrices can be used to encode weighted graphs where the indegree or the outdegree cor- respond to probability distributions.
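For illustration, a weighted adjacency matrix can be turned into a row-stochastic matrix by normalising each row, which is the form required by random-walk interpretations of the kind used by PageRank-style methods. This is a sketch; the handling of all-zero rows shown here is just one possible convention.

```python
def row_stochastic(W):
    """Normalise each row of a non-negative weight matrix so it sums to 1,
    yielding a stochastic matrix in the sense of Equation 2.7. Rows of
    dangling nodes (all zeros) are left unchanged here; PageRank-style
    methods typically patch them with a uniform distribution instead."""
    P = []
    for row in W:
        s = sum(row)
        P.append([w / s for w in row] if s > 0 else list(row))
    return P
```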

A path within a graph is a non-empty sub-graph P = (V, E) such that V = {x_0, x_1, ..., x_k} and E = {(x_0, x_1), (x_1, x_2), ..., (x_{k-1}, x_k)}, where the x_i are all distinct from one another – see Figure 2.1. The nodes x_0 and x_k are called the ends of the path P (Bondy & Murty, 1976). For undirected and unweighted graphs, the number of edges (|E|) in a path is the length of the path.

Figure 2.1: A graph with the set of vertices V={1, ..., 8}, the set of edges E={(1, 2), (2, 4), (3, 4), ...}, encoding a path P with length 6 (adapted from Diestel, 2005).

One might also be interested in determining the geodesic path, i.e., the shortest path, between two vertices. The geodesic path between vertices i and j is the path between them that has the minimum length (Luciano et al., 2005).
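The length of a geodesic path in an unweighted graph can be found with a breadth-first search. A minimal sketch, assuming an adjacency-list representation:

```python
from collections import deque

def geodesic_length(adj, source, target):
    """Length of the geodesic (shortest) path between two vertices in an
    unweighted graph, via breadth-first search; returns None if no path.

    adj: dict mapping each vertex to an iterable of neighbours."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return None
```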

When describing the structure of a graph, one can partition it into components, or connected components, i.e., subsets of nodes in which every node has a path to every other node, but which are not part of a larger set that is also internally connected (Gibbons, 1985) – see Figure 2.2. A directed graph can have strongly connected components (SCCs), which are sets of nodes such that, for any nodes i and j belonging to the set, there is a directed path from i to j and from j to i (Gibbons, 1985). Dangling nodes are defined as nodes that have no outlinks. Figure 2.2 illustrates both these concepts in a graph.
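Both of these structural notions are straightforward to compute. The sketch below finds dangling nodes and the connected components of an undirected graph via repeated traversal; the adjacency-list layout is an assumption made for illustration.

```python
def dangling_nodes(adj):
    """Vertices with no outgoing links (no outlinks)."""
    return {u for u, nbrs in adj.items() if not nbrs}

def connected_components(adj):
    """Connected components of an undirected graph, by repeated traversal.
    adj: dict mapping each vertex to a set of neighbours."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components
```

Strongly connected components of a directed graph need a slightly richer traversal (e.g., Tarjan's or Kosaraju's algorithm), which is omitted here.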

2.2 Influencers in Social Networks

Influence in social networks is very important, not only from the perspective of information flow, but also for network analysis applications aimed at business and marketing purposes. As for what it is to be influential, many authors have their own particular definitions.

Watts & Dodds (2007) define an influential person, or opinion leader, as an individual who is part of a minority and who has influence over a great number of peers. This influential individual belongs to the top q% of the influence distribution p(n), having as a premise that an individual i, within a population of size N, influences n_i other randomly chosen individuals, where n_i comes from p(n) and refers to how many people i influences regarding a specific topic X.

Figure 2.2: Graph with three components and two SCCs denoted by dashed lines (adapted from Easley & Kleinberg (2010) and Cormen et al. (2001)).

From work developed in the Web Ecology Project, in the context of Twitter, an influential is defined as a user whose actions (i.e., interactions such as replies, retweets, mentions or attributions) have the potential to initiate an action from another user (Leavitt et al., 2009). These actions are called markers of influence and should be taken into account when assessing the influence of Twitter users, instead of the simplistic measure of follower count, which assumes that the user with the greatest number of followers is the most influential.

Bakshy et al. (2011), also on Twitter, consider that if a person B is following a person A, if person A posted a URL earlier than person B did, and if person A is the only one of B's friends who has posted that specific URL, then person A has influenced person B to post that URL. Regarding the computation of influence, the authors recognize that three different approaches can be considered if person B has more than one friend who has posted the same URL:

i. First Influence, crediting exclusively the person who first posted the content, thus assuming that individuals are influenced when they first see novel information, even if they do not act on it immediately;

ii. Last Influence, crediting the last person who posted the content;

iii. Split Influence, crediting equally all friends that posted that specific content before its most recent post. This last approach assumes that either the likelihood of noticing novel content or the intention of acting upon it steadily accumulates, as the information is reposted by more and more friends.
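The three crediting approaches above can be sketched as a single function; the signature and the list-of-prior-posters representation are illustrative assumptions, not Bakshy et al.'s actual code.

```python
def influence_credit(posters, policy="split"):
    """Assign influence credit among the friends who posted a URL before
    the user did, following the three crediting policies described by
    Bakshy et al. (2011).

    posters: list of friends ordered by posting time (earliest first)."""
    if not posters:
        return {}
    if policy == "first":
        return {posters[0]: 1.0}
    if policy == "last":
        return {posters[-1]: 1.0}
    # "split": credit all prior posters equally
    share = 1.0 / len(posters)
    return {p: share for p in posters}
```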

In turn, and still in the realm of Twitter, Cha et al. (2010) defined three types of influence for a user, instead of just one. These metrics are directly related to interpersonal activities:

i. Indegree Influence, counting the total number of followers, to determine the size of the user's audience in the network;

ii. Retweet Influence, counting the total number of retweets containing a user's name, to measure the ability of a user to generate content that is spread by others through the network (i.e., his pass-along value);

iii. Mention Influence, counting the total number of mentions of a user's name, to measure the ability of engaging other users in a conversation.
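As an illustration, these three metrics could be computed from a simple interaction log as sketched below; the record layout ('retweet_of', 'mentions') is an assumed toy format, not any real Twitter API schema.

```python
from collections import Counter

def cha_influence(users, follows, tweets):
    """Compute the three influence metrics of Cha et al. (2010) from a toy
    interaction log. All field names here are illustrative assumptions.

    follows: list of (follower, followee) pairs.
    tweets:  list of dicts with optional 'retweet_of' and 'mentions' fields."""
    indegree = Counter(followee for _, followee in follows)
    retweets = Counter(t["retweet_of"] for t in tweets if t.get("retweet_of"))
    mentions = Counter(m for t in tweets for m in t.get("mentions", []))
    return {u: {"indegree": indegree[u],
                "retweet": retweets[u],
                "mention": mentions[u]} for u in users}
```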

Another important aspect is that influence in social networks is determined by the information flow through the network, i.e., the flow of user content and its propagation through the network (Romero et al., 2011).

2.3 Prestige, Popularity and Attention in Social Networks

Although popularity and prestige are two distinct concepts, they are commonly mistaken for one another. Both concepts are related to influence, since prestigious and/or popular users are more likely to be influential.

One can define popularity as a direct quantification of the level of attention someone has in a social network (Romero et al., 2011). For instance, one can assess popularity in Digg1 or in YouTube2 by, respectively, the number of votes (Diggs) and the number of views that the content of a given user has (Szabo & Huberman, 2010).

As for the notion of prestige, it is most commonly associated with scholarly networks, such as paper citation and journal citation networks. In this realm, there is also a distinction between journal popularity and journal prestige: the former corresponds to journals that are frequently cited by journals with little prestige, while the latter corresponds to journals that have few citations, but only from highly prestigious journals (Bollen et al., 2006).

In a scholarly network, the popularity of an author, journal or paper is the number of times it was cited by other nodes in the network, while its prestige is the number of times it was cited by other highly cited nodes in the network (Ding & Cronin, 2011).

In the academic realm, attention is seen as a form of payment, as well as the main input to scientific production (Franck, 1999). Scientific publications earn attention when cited by other authors in their

1 http://digg.com/
2 http://youtube.com/

publications. In other social networks, too, attention is regarded as a form of value and as a catalyst for more contributions in the social network (Wu et al., 2009).

2.4 Recognition, Novelty, Homophily and Reciprocity

Influential and popular people are recognized by their peers and also by many others outside their communities. Recognition, be it in blogs, academia or social media, is expressed by referencing a person's work, opinions or ideas, and it can have a bidirectional relationship with influence, since the more influential the content a user references, the more influential the user can become (Agarwal et al., 2008).

Novelty is also correlated with influence, in the sense that novel ideas generally exert more influence. In the blogosphere, novelty is also correlated with the number of outlinks of a blog post. Nevertheless, this is a negative correlation, as a greater number of outlinks indicates that the post refers to many other blog posts, revealing that the post is not likely to be novel (Agarwal et al., 2008).

In the context of human interaction, homophily refers to the observation that similar people, i.e., people with similar characteristics, interests and/or preferences, tend to be more in contact with each other than with people with whom they have fewer characteristics and/or preferences in common. As stated in the work of McPherson et al. (2001), homophily implies that distance in terms of social characteristics translates into network distance, i.e., the number of relationships through which a piece of information must travel to connect two individuals.

Another important social phenomenon is reciprocity, arising from following relationships in social networks such as Twitter, where a user tends to follow back a user who followed him first. This is revealed by the high correlation between the numbers of friends and followers: the more friends a user has, the more followers he usually has, and vice-versa (Weng et al., 2010).

Weng et al. (2010), in their study of TwitterRank, addressed the presence of homophily and reciprocity on Twitter, considering that these characteristics underlie following relationships, giving more meaning to social ties and to the identification of influential people on Twitter.

2.5 Active versus Inactive Users, User Retention, Confounding, Social Influence and Social Correlation

When a user performs an action for the first time, such as purchasing a product or visiting a website, one can state that the user has become active. With a total number of a already active friends, a user has an activation probability p(a), which can be modeled with a logistic function expressed as follows:

p(a) = e^{α ln(a+1) + β} / (1 + e^{α ln(a+1) + β})    (2.8)

In the formula, α and β are coefficients, with α measuring social correlation. Both can be estimated using maximum likelihood logistic regression (Anagnostopoulos et al., 2008).
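As an illustration, Equation 2.8 can be sketched in a few lines of Python; the function name and the coefficient values below are illustrative choices, not values from Anagnostopoulos et al. (2008):

```python
import math

def activation_probability(a, alpha, beta):
    """Eq. 2.8: probability that a user becomes active, given that
    a of his friends are already active; alpha measures social
    correlation (alpha = 0 means no correlation with friends)."""
    z = alpha * math.log(a + 1) + beta
    return math.exp(z) / (1 + math.exp(z))

# With alpha = 0, the probability ignores the number of active friends:
assert activation_probability(0, 0.0, -2.0) == activation_probability(9, 0.0, -2.0)

# With alpha > 0, more active friends mean a higher activation probability:
probs = [activation_probability(a, 1.2, -3.0) for a in range(5)]
assert all(p1 < p2 for p1, p2 in zip(probs, probs[1:]))
```

In practice, α and β would be fit to observed activations by maximum likelihood logistic regression, as described above.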

An active user can become a retained user if he stays active in the network, therefore affecting the retention of other users and keeping them from leaving the network (Heidemann et al., 2010). This can also be used as an evaluation metric to identify influential users in social networks, as Heidemann et al. (2010) proposed to do.

Also, one can state that two adjacent nodes u and v in a social network have a social correlation tie if the events that turned u into an active user are correlated with the events that turned v into an active user as well. This behavioral correlation can be accounted for by homophily, by confounding factors (i.e., the environment) and by social influence (Anagnostopoulos et al., 2008).

Confounding factors are influences from external elements, which end up affecting the individuals that are closer to one another in a social network. This can be mathematically expressed as the presence of a confounding variable X and a set of active individuals W, both in a social network G, such that the set of active individuals W comes from a distribution that is correlated with X (Anagnostopoulos et al., 2008). Under confounding, the individuals’ choices of becoming friends with others and of becoming active are exclusively affected by the same unobserved variable X.

The phenomenon of social influence is also one of the causes of social correlation. With social influence, the actions of individuals can induce their friends to act in the same way, which can occur via (i) setting an example to their friends, (ii) informing friends about the action taken, or (iii) increasing the value of an action for their friends (Anagnostopoulos et al., 2008).

2.6 Information Cascades

In the theory behind information cascades, we assume that agents observe private signals of some inherent state and make public decisions. Subsequent decision-makers face the difficulty of knowing whether their own private signal is significant when choosing a state that is unlikely, given the public decisions previously observed (Anderson & Holt, 1995).

We are in the presence of an information cascade when all decisions (initial and subsequent) coincide, in the sense that it is optimal for subsequent decision-makers to ignore their private signals and follow the pattern that has been established. For example, suppose that a worker is not hired by several prospective employers because of poor interview performances. With this public decision information, a subsequent prospective employer may also not hire the worker, because the worker’s information is dominated by the negative signals inferred from the previous rejections, even if the candidate does well in his interview (i.e., a positive private signal). Therefore, an information cascade can result from rational inferences that others’ decisions are based on information that dominates one’s own signal (Anderson & Holt, 1995).

From the work developed by Papagelis et al. (2009) in the context of the blogosphere, we have that a cascade can be characterized by its (i) size, i.e., the number of nodes involved in the cascade, excluding its initiator; (ii) height, i.e., the height of the resulting spanning tree, after a depth-first search traversal on the cascade; and (iii)-(v) the minimum, mean and maximum reaction times of all posts in the cascade, excluding its initiator.

In social networks, there are many factors that influence information cascades, such as the graphical interface used to interact with the network (Millen & Patterson, 2002), the fact that an in-topic conversa- tion/interaction is being maintained (Arguello et al., 2006), or positive attention and feedback (Huberman et al., 2009).

The analysis of information cascades can provide insight on public opinion over a variety of topics (Papagelis et al., 2009). It is therefore related to the task of finding influential users in a social network, since influential users are the ones who tend to shape, i.e., influence, the opinions of other users in the social network.

2.7 Information Diffusion Models and Measures

Arising, respectively, from the realms of marketing, sociology and economics, the work of Young (2009) presents three information diffusion models, namely (i) social contagion, (ii) social influence and (iii) social learning.

In social contagion, information spreads like in an epidemic, i.e., people spread information when they contact with others who have already been in contact with that same information (Young, 2009). This model is, thus, based on exposure. The homogeneous contagion model at time t can be mathematically described as the following ordinary differential equation:

ṗ(t) = (λ p(t) + γ)(1 − p(t))    (2.9)

In the formula, λ and γ are non-negative parameters, not both equal to zero, corresponding respectively to the instantaneous rates at which a current non-adopter hears about the information from a previous adopter within the group (λ) and outside the group (γ).
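The qualitative behavior of Equation 2.9 can be illustrated with a simple forward-Euler integration; the parameter values, function name and step size below are illustrative choices:

```python
def simulate_contagion(lam, gamma, p0=0.0, dt=0.01, steps=2000):
    """Forward-Euler integration of the homogeneous contagion model
    (Eq. 2.9): dp/dt = (lam * p + gamma) * (1 - p), where lam is the
    within-group and gamma the outside-group contact rate."""
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p += dt * (lam * p + gamma) * (1 - p)
        trajectory.append(p)
    return trajectory

traj = simulate_contagion(lam=0.5, gamma=0.05)
# The adopter proportion grows monotonically and saturates near 1:
assert all(a <= b + 1e-12 for a, b in zip(traj, traj[1:]))
assert traj[-1] > 0.95
```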

In social influence, users spread information when enough other people in their group have already been in contact with it. In a standard model, it is assumed that users have different social thresholds, which determine if they will spread that information or not, as a function of the number of others that have already spread it. Users are, thus, moved by social pressure, in the sense that the aforementioned thresholds refer to their degree of responsiveness to social influence. The threshold of user i is the minimum proportion r_i ≥ 0 such that i will only spread information if at least a proportion r_i of the members of the group have already done the same. If r_i > 1, then for user i to spread the information, more than the whole group would have to have spread it as well; therefore, in this latter case, i never spreads the information. With F(r) being the cumulative distribution function of thresholds in some given population, the proportion of people whose thresholds have been crossed at time t is F(p(t)). Having λ as the instantaneous rate at which people are converted to spread the information, and with a proportion p(t) having already spread it, the proportion of users whose thresholds have already been crossed, but who have not yet spread the information, is F(p(t)) − p(t) (Young, 2009). Thus, this model can be expressed as follows:

ṗ(t) = λ [F(p(t)) − p(t)],  λ > 0    (2.10)
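The behavior of Equation 2.10 can likewise be illustrated with a simple forward-Euler simulation; the threshold distributions F used below are illustrative choices:

```python
def simulate_thresholds(lam, F, p0, dt=0.01, steps=4000):
    """Forward-Euler integration of the threshold model (Eq. 2.10):
    dp/dt = lam * (F(p) - p), where F is the cumulative distribution
    of thresholds and p(t) the proportion already spreading."""
    p = p0
    for _ in range(steps):
        p += dt * lam * (F(p) - p)
    return p

# With thresholds uniform on [0, 1], F(p) = p and the seed never grows:
assert abs(simulate_thresholds(1.0, lambda p: p, p0=0.2) - 0.2) < 1e-9

# With thresholds uniform on [0, 0.5], F(p) = min(2p, 1): any initial
# seed keeps crossing new thresholds until the whole population spreads.
final = simulate_thresholds(1.0, lambda p: min(2 * p, 1.0), p0=0.05)
assert final > 0.99
```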

In a social learning model, users spread information once they have enough empirical evidence to convince them that the information is worth spreading. Thus, users make rational use of previously gathered evidence in order to reach a decision (e.g., when a new smartphone is out in the market, people tend to see how it works for others over some period of time before trying it for themselves). Due to sources of heterogeneity, such as discrepancies in their prior beliefs, the amount of information they have gathered, or idiosyncratic costs, people may spread information at different times. In this type of model, which explains why people would spread information given that others have already spread it, the adoption decision flows directly from the rational evaluation of evidence. There are two types of social learning models, namely (i) social learning models with direct observation, where the evidence comes from other people’s experiences, i.e., people believe that the information is worth spreading because other people have spread it and their spreading payoff is fully observable, and (ii) herding models, where only the spreading act is observable (Young, 2009).

In a social learning model with direct observation, one can assume that (Young, 2009):

i. Payoffs are observable;

ii. Payoffs generated by different individuals and/or at different points in time are independent and equally informative;

iii. Users are risk-neutral and myopic (i.e., they consider only immediate payoffs);

iv. There is no idiosyncratic component to payoffs due to differences in user’s types, although users may have different costs (not necessarily observable);

v. There are differences in users’ prior beliefs about how good the information is relative to the status quo;

vi. There are differences in the average number of people users observe, and hence in the amount of information they have;

vii. The population is fully mixed.

In this case, the system becomes very simple and the various types of heterogeneity are reduced to a composite index that measures the probability of a given user spreading the information, conditional on the amount of information that has been generated so far in the population (Young, 2009). Regarding information diffusion measures, the most common are (i) speed, which considers when a diffusion instance will take place, and whether it will take place at all, (ii) scale, i.e., the number of instances that were affected at a first degree, and (iii) range, which measures how far the diffusion chain can continue in depth (Yang & Counts, 2010).

2.8 Graph Centrality Measures and Bibliographic Indexes

In graph theory, graph centrality measures provide a way of measuring the varying importance of network vertices, according to specific criteria and the role played by the nodes of a network. In Bibliometrics, an area concerned with the analysis of patterns in scientific literature, bibliometric indexes are used to evaluate the quality, impact and relevance of the work of a particular scientist, usually by analyzing the citation graph. In the context of this MSc thesis, both these areas are particularly important, because they can provide robust approaches for estimating influence. Some of the most important network centrality metrics are as follows:

i. Degree Centrality: Degree centrality is a measure of the popularity of a node in a network (Newman, 2003). It is defined according to the number of edges connected to a particular vertex in the network, and is mathematically expressed as follows:

C_D(v) = d_G(v) / (n − 1)    (2.11)

In the formula, d_G(v) is the degree of vertex v and n is the total number of vertices in the network.

ii. Betweenness Centrality: This measure is based on the number of shortest paths that pass through a vertex. For instance, the betweenness of a vertex i is the fraction of geodesic paths between pairs of vertices of the network that happen to pass through i. In case of more than one shortest path between a pair of vertices, each path is given an equal weight such that their sum is equal to one (Newman, 2003). Assuming that g_i^(jk) is the number of geodesic paths from vertex j to vertex k that pass through i, that n_jk is the total number of geodesic paths from vertex j to vertex k, and that n is the total number of vertices in the network, the betweenness of vertex i is computed as follows:

b_i = ( Σ_{j<k} g_i^(jk) / n_jk ) / ( ½ n(n−1) )    (2.12)

With the betweenness measure, the extent to which a node has control over the information that flows between others can be estimated.

iii. Closeness Centrality: This measure is defined as the average geodesic distance, i.e., the average shortest path, between a vertex and all the other vertices that are reachable from it. By measuring a vertex’s closeness, we can measure how long it will take to spread information from this particular vertex to the other vertices in the network (Freeman, 1978). Closeness centrality can be mathematically expressed as follows:

C_C(i) = 1 / Σ_{j∈V\{i}} g(i,j)    (2.13)

In the formula, V represents the total set of vertices of the network and g(i,j) is the distance of the geodesic path between vertices i and j.

iv. Eigenvector Centrality: This measure weights the contacts of a node according to their own centralities, taking into account the whole pattern of the network and computing the weighted sum of both direct and indirect connections of every length. Therefore, having the graph G(V,E), the adjacency matrix A, λ as the largest eigenvalue of A, and n as the number of vertices, the eigenvector centrality x_i of node i can be expressed as follows (Bonacich, 2007):

λ x_i = Σ_{j=1}^{n} a_ij x_j ,  i = 1, ..., n    (2.14)

v. Clustering Coefficient: As a measure for transitivity, Watts & Strogatz (1998) introduced the clustering coefficient. This coefficient measures the degree to which the neighbours of a vertex are themselves connected to one another, and it can be globally expressed as follows (Kaiser, 2008):

C = Σ_{i∈V} Γ_i / Σ_{i∈V} d_G(i)(d_G(i) − 1)    (2.15)

In the formula, i is a vertex of graph G, which has V as its set of vertices, d_G(i) is the degree of i, and Γ_i is the number of edges between the neighbours of vertex i. The above global definition of the clustering coefficient is obtained through the computation of a local clustering coefficient which, for undirected graphs, is defined as in Equation 2.16 and, for directed graphs, as expressed in Equation 2.17:

C(i) = 2|e_jk| / ( d_G(i)(d_G(i) − 1) )    (2.16)

C(i) = |e_jk| / ( d_G(i)(d_G(i) − 1) )    (2.17)

In both formulas, i, j and k are vertices of graph G, d_G(i) is the degree of i, and |e_jk| represents the total number of existing edges between the neighbours of vertex i.

vi. Average Path Length: This measure determines the distance between any pair of vertices, and it can be used to determine if the graph is characteristic of a social network (Réka & Barabási, 2002). It is computed as the average length over all shortest paths between pairs of vertices (Luciano et al., 2006), and it can be mathematically expressed as follows:

⟨L⟩ = (1 / (n(n−1))) Σ_{i,k∈V} g_ik    (2.18)

In the formula, V is the set of vertices in the network, g_ik represents the distance of the geodesic path between vertices i and k, and n represents the total number of vertices in the graph.

To compute these network centrality measures, some readily available open-source libraries can be used. These include:

i. Gephi (http://gephi.org/developers/) (Bastian et al., 2009): A Java library for social network analysis and data visualization;

ii. NetworkX (http://networkx.lanl.gov/) (Hagberg et al., 2008): A Python library to create, manipulate and analyze complex networks;

iii. Network Workbench (http://nwb.cns.iu.edu/): A Java framework for large-scale network analysis and data visualization;

iv. iGraph (http://igraph.sourceforge.net/): A C library for graph analysis which integrates with the R package (http://www.r-project.org/) for data visualization and statistical computing, which also provides other methods for social network analysis;

v. CiteSpace (http://cluster.cis.drexel.edu/~cchen/citespace/) (Chen, 2006): A Java application for visualizing and analyzing trends and patterns in scientific literature;

vi. NetKit-SRL (http://netkit-srl.sourceforge.net/) (Macskassy & Provost, 2007): A set of Java packages which provide an implementation of several graph centrality measures;

vii. CytoScape (http://www.cytoscape.org/) (Shannon et al., 2003): A Java software platform for visualization, which also provides network analysis via plugins.
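For illustration, the simplest of these measures can also be computed directly, without any of the above libraries. The following self-contained sketch implements degree centrality (Equation 2.11), closeness centrality (Equation 2.13) and the local clustering coefficient (Equation 2.16) on a toy undirected graph; the function names, vertex names and adjacency structure are illustrative:

```python
from collections import deque

def degree_centrality(adj, v):
    """Eq. 2.11: degree of v divided by n - 1."""
    return len(adj[v]) / (len(adj) - 1)

def closeness_centrality(adj, v):
    """Eq. 2.13: reciprocal of the sum of geodesic distances from v
    to every other vertex (the graph is assumed to be connected)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search computes the geodesic distances
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return 1 / sum(d for node, d in dist.items() if node != v)

def local_clustering(adj, v):
    """Eq. 2.16 (undirected case): fraction of pairs of neighbours
    of v that are themselves connected."""
    neigh = list(adj[v])
    d = len(neigh)
    if d < 2:
        return 0.0
    links = sum(1 for i in range(d) for j in range(i + 1, d)
                if neigh[j] in adj[neigh[i]])
    return 2 * links / (d * (d - 1))

# A toy graph: a triangle a-b-c, plus a pendant vertex d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
assert degree_centrality(adj, "a") == 1.0       # a is connected to all others
assert closeness_centrality(adj, "a") == 1 / 3  # distances 1 + 1 + 1
assert local_clustering(adj, "b") == 1.0        # b's neighbours a, c are linked
```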

As for bibliometric indexes, some of the most widely used are as follows:

i. The h-index and its variants: Proposed by Hirsch (2010) to quantitatively represent the output of a researcher, this index measures the productivity and total impact of a scientist, supporting comparisons between scientists of different ages (Hirsch, 2010). A researcher has an h-index of h if h of his/her N_p papers (N_p being the total number of published papers) have at least h citations each, and the other (N_p − h) papers have ≤ h citations each.
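The definition above translates directly into code; this is a minimal sketch, with illustrative citation counts:

```python
def h_index(citations):
    """A paper list has h-index h if h papers received >= h citations
    each and the remaining papers received <= h citations each."""
    cits = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cits, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

assert h_index([10, 8, 5, 4, 3]) == 4   # four papers with >= 4 citations each
assert h_index([25, 8, 5, 3, 3]) == 3   # one highly cited paper does not help
assert h_index([0, 0]) == 0
```

Note how, in the second example, the heavily cited first paper does not raise the index: this loss of citation information is precisely one of the drawbacks that the variants discussed next try to address.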

Several variants of this metric have been proposed, in order to deal with some of the problems of the original h-index. One such extension is the contemporary h-index (Sidiropoulos et al., 2007), which takes into account the age of an article and allows us to acknowledge the work of young promising scientists and of senior scientists who happen to still be active. The contemporary h-index is based on a score S^c(i), computed for each article i as follows:

S^c(i) = γ · (Y(now) − Y(i) + 1)^{−δ} · |C(i)|    (2.19)

In the formula, Y(i) represents the year of publication of article i, and C(i) represents the set of articles that cite article i (so |C(i)| is its citation count). The parameter δ is set to 1, so that S^c(i) is the total number of citations received by article i, divided by the age of the article. Since this parameter alone makes the score S^c(i) too small, the coefficient γ is set to 4: the citations of an article published in the current year count four times and, consequently, the citations of an article published 4 years ago count only once. With this approach, as time goes by, older articles gradually lose their value.

In brief, a researcher has a contemporary h-index of h^c if h^c of his/her N_p articles have a score of S^c(i) ≥ h^c each, and the remaining (N_p − h^c) articles each have a score of S^c(i) ≤ h^c.
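Under the definitions above (with γ = 4 and δ = 1), the contemporary score and the resulting index can be sketched as follows; the function names and example years are illustrative:

```python
def contemporary_score(year_now, year_pub, n_citations, gamma=4, delta=1):
    """Eq. 2.19: S^c(i) = gamma * (Y(now) - Y(i) + 1)^(-delta) * |C(i)|."""
    return gamma * (year_now - year_pub + 1) ** (-delta) * n_citations

def contemporary_h_index(scores):
    """h^c articles have S^c(i) >= h^c each; the rest have <= h^c each."""
    ranked = sorted(scores, reverse=True)
    hc = 0
    for rank, s in enumerate(ranked, start=1):
        if s >= rank:
            hc = rank
    return hc

# Citations to a current-year article count four times ...
assert contemporary_score(2012, 2012, 10) == 40.0
# ... while an article whose age term equals 4 has them counted once.
assert contemporary_score(2012, 2009, 10) == 10.0
assert contemporary_h_index([40.0, 10.0, 2.0]) == 2
```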

Another variant is the trend h-index, which addresses the fact that the h-index does not take into account the age of citations (Sidiropoulos et al., 2007). Articles that continue to be cited along the years indicate that the topic/solution is still up to date and that the respective scientist can be an influential mind, who still has an impact on younger scientists. As an article is continually cited, we can also be in the presence of a trend-setter, i.e., a scientist whose work is, in some way, pioneering and/or who is currently working on something that is considered trendy. Hence, the trend h-index, with γ, δ and Y(i) as defined in Equation 2.19, is based on the value of:

S^t(i) = γ · Σ_{∀x∈C(i)} (Y(now) − Y(x) + 1)^{−δ}    (2.20)

In brief, a researcher has a trend h-index of h^t if h^t of his/her N_p articles have a score of S^t(i) ≥ h^t each, and the remaining (N_p − h^t) articles each have a score of S^t(i) ≤ h^t.

There is also the normalized h-index, which mitigates the fact that scientists from different research areas do not publish the same number of articles, providing a fairer h-index metric (Sidiropoulos et al., 2007). A researcher has a normalized h-index of h^n = h/N_p if h of his/her N_p articles have received at least h citations each, and the remaining (N_p − h) articles have received no more than h citations each.

Recent work developed by Devezas et al. (2011) applied the h-index to the task of ranking web blogs. Analogously to Bibliometrics, blogs can be seen as the authors, and the posts as the papers published by them. Therefore, a blog has an h-index of h if h of its N posts have at least h inlinks each and the remaining (N − h) posts have no more than h inlinks each. The h-index turned out to be a more balanced metric for assessing the importance of a blog, compared to using the indegree.

ii. The g-index: This index is an improvement over the h-index, measuring the global citation performance of a list of articles (Egghe, 2006). A set of papers has a g-index of g if g is the highest unique rank such that the top g papers have, together, at least g² citations. This is valid if the list of articles is sorted in decreasing order of the number of citations received by each article, and the top g + 1 papers have, together, less than (g + 1)² citations. Thus, with α > 2 denoting the Lotkaian exponent (Lotka, 1926) and with T denoting the total number of sources, i.e., articles, the g-index can be mathematically expressed as follows:

g = ((α − 1)/(α − 2))^{(α−1)/α} · T^{1/α} ,  with  α = 1 + ln(growth rate of sources) / ln(growth rate of items)    (2.21)

In the formula, the sources are the scientific articles and the items are the citations between those articles (Egghe, 2009).

iii. The a-index: The a-index is a derived index, dependent on the h-index and on a constant a, ranging between 3 and 5, that helps us to better understand the relation between the total number of citations (N_c,tot) and the h-index (Zhang, 2009). The a-index allows us to describe the magnitude of the hit contributions of individual scientists, and is defined as follows (Sidiropoulos et al., 2007):

N_c,tot = a h²    (2.22)

This index can be used as a secondary metric to rank and evaluate scientists, due to the fact that h² underestimates the N_c,tot of the h most cited papers, which is usually greater than h², and disregards the papers that have fewer than h citations (Hirsch, 2010).

iv. The e-index: This metric was proposed by Zhang (2009) to address two specific drawbacks of the original h-index, namely:

• Loss of citation information - excess citations are ignored, making the comparisons based only on the h-index misleading;

• Low resolution - the h-index takes natural numbers as values, instead of real numbers, hence confining the results to a relatively narrow range.

The e-index can formally be defined as follows:

e² = Σ_{j=1}^{h} cit_j − h²    (2.23)

In the formula, cit_j is the number of citations received by the j-th paper, and the value e² is expressed as a real number. This index is also related to the aforementioned a-index in the following way:

a = h + e²/h    (2.24)

v. The ISI Impact Factor: This index measures the popularity of a journal with reference to a specific year. It is defined as the mean number of citations that occurred in the specified year to articles that were published in the journal during the prior two years (Bollen et al., 2006).

IF(v_i, t) = Σ_j c(v_j, v_i, t) / n(v_i)    (2.25)

In the formula, c(v_j, v_i, t) is the number of citations from journal v_j to journal v_i in year t, and n(v_i) corresponds to the number of publications in journal v_i during the two years prior to t, which ends up normalizing the resulting citation count into a mean 2-year citation rate (Bollen et al., 2006).
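The g-index, e-index and a-index defined above can likewise be computed from a plain list of citation counts. This is a sketch under the definitions in Equations 2.23 and 2.24, with an illustrative citation record:

```python
def g_index(citations):
    """g-index (Egghe): largest rank g such that the g most cited
    papers together received at least g^2 citations."""
    cits = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cits, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

def e_index(citations):
    """Eq. 2.23: e^2 is the citation excess of the h core over h^2.
    Returns the pair (e, h)."""
    cits = sorted(citations, reverse=True)
    h = sum(1 for rank, c in enumerate(cits, start=1) if c >= rank)
    return (sum(cits[:h]) - h * h) ** 0.5, h

def a_index(citations):
    """Eq. 2.24: a = h + e^2 / h."""
    e, h = e_index(citations)
    return h + e * e / h

cits = [10, 8, 5, 4, 3]            # an illustrative citation record
assert g_index(cits) == 5          # top 5 papers: 30 >= 25 citations
e, h = e_index(cits)
assert h == 4 and abs(e * e - 11) < 1e-9   # h core 10+8+5+4 = 27; 27 - 16 = 11
assert abs(a_index(cits) - 6.75) < 1e-9    # 4 + 11/4
```

Note how the same record has h = 4 but g = 5: the g-index rewards the excess citations of the most cited papers, which the h-index ignores.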

vi. The Y-Factor: Due to the fact that there can be discrepancies between the values of the ISI Impact Factor and of the Weighted PageRank, introduced in Section 3.2.1 (i.e., a journal may have a high ISI Impact Factor, but a low Weighted PageRank value), this measure results from the multiplication of both these values. The Y-factor of journal v_j can be mathematically expressed as follows:

Y(v_j) = IF(v_j) × PR_w(v_j)    (2.26)

When assessing the authority of an individual, it is important to use these measures not only individually, but also in combination, since scientific impact can be seen as a multi-dimensional construct (Bollen et al., 2009).

2.9 Unsupervised Rank Aggregation Approaches

Given that each metric introduced in the previous section produces an ordering for the nodes in a graph, we can leverage rank aggregation methods from social choice theory (i.e., voting protocols) to combine the individual rankings.

In the realm of voting protocols, we consider that there are voters who submit votes over their favorite alternatives, i.e., the candidates. Determining the winner, or the best ordering of the candidates, requires aggregating the rankings of all voters. This process depends on the voting rule that is used, and it can be defined as follows: let C be the set of candidates, R(C) the set of all possible rankings of the candidates, and n the number of voters. A voting rule is a mapping from R(C)^n to C, if one wishes to produce a winner, and from R(C)^n to R(C), if one wishes to produce an aggregate ranking (Conitzer, 2006a). The most common voting rules are as follows:

i. Scoring Rules - Borda Rule, Plurality Rule and Veto Rule: Let α⃗ = ⟨α_1, ..., α_m⟩ be a vector of integers. For each voter, α_1 is the number of points that a candidate gets if the voter ranks him first, α_2 the number of points that candidate gets if the voter ranks him second, and so on.

With the Plurality Rule, candidates are ranked simply in terms of how often voters have ranked them in first place, thus having a system of scores corresponding to α⃗ = ⟨1, 0, ..., 0⟩. With this rule, it is irrelevant how voters rank the candidates below the top candidate.

The Veto Rule is the opposite of the Plurality Rule, because it is based on a system of scores with α⃗ = ⟨1, 1, ..., 1, 0⟩, i.e., it only takes into account how often a candidate is not ranked in last place. As such, each voter vetoes a single candidate, and the least vetoed candidate wins the election (Procaccia et al., 2006).

The Borda Rule is based on a system of scores with α⃗ = ⟨m − 1, m − 2, ..., 0⟩, which means that a candidate obtains m − 1 points for the first position in the preference of a voter, m − 2 points for the second position, and so forth, with m representing the total number of candidates. The candidate who accumulates the maximum number of points from all voters is the winner (Kiselev, 2008).

ii. Single Transferable Vote (STV): This is a method to calculate the result of an election with the guarantee of proportional representation, under reasonable conditions, for the sets of voters who share a set of most preferred candidates (Geller, 2002). Running through m − 1 rounds, the STV voting rule is based upon three principles:

• Order of preference - the candidates are listed in ordinal preference by the voters (i.e., in descending order).

• Quota - the number of votes needed for a candidate to win the election must be calculated in the following way:

q = ⌊ |V| / (e + 1) ⌋ + 1    (2.27)

In the formula, |V | represents the total number of voters and e the number of seats available in the election (i.e., the number of candidates to elect). In each round, if a candidate gets a greater number of votes than the quota, that candidate is automatically elected.

• Transfer - When a candidate c is elected and there are still seats to be filled, the surplus of the votes of the newly elected candidate must be redistributed to each voter’s next ranked candidate. The transfer value f_c takes into account the quota q and the number of votes w_c that candidate c has. It is computed as follows:

f_c = (w_c − q) / w_c    (2.28)

When, in a given round, the top voted candidate does not have enough votes to be elected (i.e., his total number of votes is less than the quota), the last placed candidate is eliminated, and that candidate’s votes are redistributed to the next highest ranked candidate of each voter for whom the recently eliminated candidate was the top preference.
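As a worked illustration of the quota (Equation 2.27) and the transfer value (Equation 2.28), consider the following sketch; the function names and the election sizes are illustrative:

```python
def droop_quota(num_voters, num_seats):
    """Eq. 2.27: q = floor(|V| / (e + 1)) + 1."""
    return num_voters // (num_seats + 1) + 1

def transfer_value(votes, quota):
    """Eq. 2.28: f_c = (w_c - q) / w_c, the fraction of each vote for
    an elected candidate c that is passed on to the voter's next
    ranked candidate."""
    return (votes - quota) / votes

# 100 voters filling 2 seats: a candidate needs 34 votes to be elected.
q = droop_quota(100, 2)
assert q == 34
# A candidate elected with 50 votes passes on 16/50 of each of them.
assert abs(transfer_value(50, q) - 0.32) < 1e-12
```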

The flowchart in Figure 2.3 depicts the steps taken in each round, in order to conclude the election. It is based on the additional information provided by an online simulation of the Single Transferable Vote system (http://stv.humancube.com/).

Figure 2.3: Flowchart for the Single Transferable Vote rule.

iii. Plurality Rule with Run-Off: This rule proceeds in two rounds. In the first, all candidates are eliminated, except the ones with the highest plurality scores, i.e., the candidates with the first and second highest numbers of votes in the election. Then, as in the STV rule, all the votes are transferred to these two candidates. The second round, which is called the run-off, is used to decide the final winner of the election, from the two remaining candidates. All candidates are ranked according to their plurality scores, except the top two, whose relative ranking is determined according to the results of the second round.

iv. Maximin: Letting N(c_1, c_2) be the number of votes that show a preference for candidate c_1 over candidate c_2, the maximin score (also known as the Simpson score) assigned to a candidate c_1 is as follows:

s(c_1) = min_{c_2≠c_1} N(c_1, c_2)    (2.29)

In the formula, s(c_1) is the worst score of candidate c_1 in a pairwise election. As all candidates are ranked by their scores, the winner of the election is the candidate with the highest maximin score.

v. Copeland: For any two candidates c_1 and c_2, we simulate a pairwise election, so we can determine how many voters prefer c_1 over c_2 and how many prefer c_2 over c_1 (Xia et al., 2011). All candidates are ranked by their score, and they gain or lose a Copeland point for, respectively, each pairwise election they win or lose (Conitzer & Sandholm, 2005). If there is a tie, Copeland points are also assigned to the candidates. Therefore, for a pairwise election between candidates c_1 and c_2, a score is assigned according to the following procedure:

C(c_1, c_2) = 1, if N(c_1, c_2) > N(c_2, c_1);  1/2, if N(c_1, c_2) = N(c_2, c_1);  0, if N(c_1, c_2) < N(c_2, c_1)    (2.30)

Then, the Copeland Score of candidate c1 is given by:

s(c_1) = Σ_{c_2≠c_1} C(c_1, c_2)    (2.31)

The candidate who has the highest score wins the election.
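The positional and pairwise rules described so far can be illustrated on a small profile of ballots; the candidates and the ballots below are, of course, illustrative:

```python
from itertools import combinations

# Five ballots, each ranking three candidates, most preferred first:
ballots = [
    ["a", "b", "c"], ["a", "b", "c"], ["b", "c", "a"],
    ["b", "a", "c"], ["c", "a", "b"],
]
candidates = ["a", "b", "c"]
m = len(candidates)

# Borda rule: m - 1 points for a first place, m - 2 for a second, ...
borda = {c: 0 for c in candidates}
for ballot in ballots:
    for pos, c in enumerate(ballot):
        borda[c] += m - 1 - pos
assert borda == {"a": 6, "b": 6, "c": 3}     # a and b tie under Borda

# N[x][y]: number of voters preferring candidate x over candidate y.
N = {x: {y: 0 for y in candidates} for x in candidates}
for ballot in ballots:
    for i, x in enumerate(ballot):
        for y in ballot[i + 1:]:
            N[x][y] += 1

# Maximin (Eq. 2.29): each candidate's worst pairwise support.
maximin = {x: min(N[x][y] for y in candidates if y != x) for x in candidates}
assert maximin == {"a": 3, "b": 2, "c": 1}   # the Borda tie is broken: a wins

# Copeland (Eqs. 2.30-2.31): one point per pairwise win, half per tie.
copeland = {c: 0.0 for c in candidates}
for x, y in combinations(candidates, 2):
    if N[x][y] > N[y][x]:
        copeland[x] += 1
    elif N[x][y] < N[y][x]:
        copeland[y] += 1
    else:
        copeland[x] += 0.5
        copeland[y] += 0.5
assert copeland == {"a": 2.0, "b": 1.0, "c": 0.0}
```

The example also shows why one may want several rules: Borda leaves a and b tied, while the pairwise rules both rank a first.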

vi. Bucklin: The Bucklin score of a candidate c is the smallest number l_c such that more than half of the voters rank c among the top l_c positions, i.e., such that B(l_c) > n/2, where B(l_c) is the number of voters ranking c among the top l_c positions (Xia et al., 2011). The winner is the candidate with the lowest Bucklin score. All candidates are ranked in inverse order of l_c and, if there is a tie, B(l_c) is used as a tie-breaker.

vii. Slater: In the Slater voting rule, we choose a ranking of candidates that is inconsistent with the outcomes of as few pairwise elections as possible (Conitzer, 2006b). An inconsistency corresponds to the case in which, for a pair of candidates c_1 and c_2, c_1 is ranked higher than c_2, but c_2 defeats c_1 in their pairwise election. Therefore, the intent of the Slater ranking is to minimize such inconsistencies.

viii. Kemeny: Similarly to the Slater rule, a ranking is a Kemeny ranking if it minimizes the number of inconsistencies. However, this rule produces a ranking that aims at minimizing the number of times that the ranking is inconsistent with an individual vote on the relative order of two candidates. Therefore, an inconsistency in the terminology of the Kemeny ranking is defined as follows: given the aggregate ranking r, a pair of candidates (c_1, c_2), and the vote of a voter r_a, we have an inconsistency if r_a ranks c_1 higher than c_2, but r ranks c_2 higher than c_1.

ix. Cup and its variants: The Cup rule runs a single-elimination contest to decide which candidate wins the election. It does not produce a full aggregate ranking of the candidates, and it requires an additional schedule for matching up the remaining candidates. The rule is defined by a balanced binary tree T, where each candidate is assigned to a leaf through the aforementioned schedule. To each of the remaining non-leaf nodes is assigned the winner of the pairwise election between that node’s children. There is a winner when a candidate is assigned to the root node.

As for the Cup rule’s variations, we have the regular cup, which assumes that all voters know to which leaf each candidate is assigned prior to their voting, and the randomized cup, in which the assignment of candidates to leaves is chosen uniformly at random after the voting. Votes can be weighted, and there can thus be a different interpretation of the weight, such that it represents the decision power of a voting agent in a setting where not all agents are considered equal, e.g., a weight of K counting as K votes of weight 1.

2.10 Supervised Learning for Rank Aggregation

In the previous section, we presented unsupervised techniques to perform rank aggregation. Nevertheless, one can also use supervised learning techniques to address this task. To this end, Learning to Rank (L2R) has emerged as a way of using machine learning techniques for rank aggregation (Li, 2011).

In L2R, there are two general phases, namely, learning and ranking. The learning phase takes training data as input, which corresponds to ranked lists of objects, with each object being described by a set of features (i.e., a set of simple ranking measures that we want to combine). Given a new set of objects, one aims at predicting the best possible ranking, combining the available information. Figure 2.4 illustrates the general framework.

Figure 2.4: The Learning-To-Rank (L2R) framework (adapted from Liu (2009)).

Learning-to-Rank methods can be categorized according to three different types of approaches, namely, pointwise, pairwise, and listwise (Li, 2011; Liu, 2009).

In the pointwise approach, the ranking problem is transformed into a classification, regression or ordinal classification problem. The input space contains each object's feature vector, while the output space contains the ranking order predicted for each object (Liu, 2009). The loss function is said to be pointwise because it is defined on a single object's feature vector (Li, 2011), and it inspects the ground-truth ranking order of each single object. The hypothesis space in a pointwise approach contains the functions that take the feature vector of an object as input and predict the ranking order of that same object (Li, 2011).

In the pairwise approach, the ranking problem is transformed into a pairwise classification problem, i.e., one classifies a given pair of objects according to whether the pair is in the correct ranking order or not. In this approach, the loss function is pairwise, as it is defined on a pair of feature vectors.

The listwise approach takes ranked lists of objects as instances and, unlike the aforementioned ap- proaches, it maintains the group structure of the ranked lists. This approach also learns a ranking model from the given training data, which can later assign scores to feature vectors, and then ranks these feature vectors using those scores.

One particular supervised listwise ranking method is CRanking (Lebanon & Lafferty, 2002), which applies the following probabilistic model:

\[ P(\pi \mid \theta, \Sigma) = \frac{1}{Z(\theta, \Sigma)} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right) \tag{2.32} \]

In the formula, π is the final ranking, Σ = (σ_1, ..., σ_k) are the basic rankings being combined, d is a distance between two rankings (e.g., Kendall's τ), and θ = (θ_1, ..., θ_k) is a vector of weighting parameters. Z is a normalization factor over all the possible rankings, and can be defined as follows:

\[ Z(\theta, \Sigma) = \sum_{\pi} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right) \tag{2.33} \]

When learning, the algorithm is given S = \{(\Sigma_i, \pi_i)\}_{i=1}^{m} as training data, in order to build a model for rank aggregation. Maximum Likelihood Estimation is used to learn the model's parameters. Considering that, in the training data, both the final ranking and the basic rankings are full ranking lists, the likelihood function can be computed as follows:

\[ L(\theta) = \sum_{i=1}^{m} \log \frac{\exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi_i, \sigma_{i,j}) \right)}{\sum_{\pi \in Q} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_{i,j}) \right)} \tag{2.34} \]

For the final step of prediction, the algorithm is given the learned model and the basic rankings Σ. Then, the probability distribution P(π|θ, Σ) over final rankings is calculated, in order to be later used when computing the expected rank of each object. Objects are finally sorted according to their expected rank, the latter being defined as follows:

\[ E(\pi(i) \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot P(\pi(i) = r \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot \sum_{\pi \in Q,\, \pi(i) = r} P(\pi \mid \theta, \Sigma) \tag{2.35} \]
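For small n, this prediction step can be sketched by brute force in Python. The toy basic rankings and the negative weights θ below are assumptions made for illustration (with negative weights, rankings close to the basic ones receive higher probability):

```python
# Brute-force sketch of the CRanking prediction step: enumerate all
# permutations, weight each by Equation 2.32 with Kendall's tau as the
# distance d, and compute the expected rank of each object (Equation 2.35).
from itertools import permutations
from math import exp

def kendall_tau(pi, sigma):
    """Number of object pairs that the two rankings order differently."""
    n = len(pi)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (pi.index(i) < pi.index(j)) != (sigma.index(i) < sigma.index(j)))

def expected_ranks(basic, theta):
    n = len(basic[0])
    rankings = list(permutations(range(n)))
    weights = [exp(sum(t * kendall_tau(pi, s) for t, s in zip(theta, basic)))
               for pi in rankings]
    z = sum(weights)  # the normalization factor of Equation 2.33
    return [sum((pi.index(o) + 1) * w / z for pi, w in zip(rankings, weights))
            for o in range(n)]

# Two basic rankings that agree on placing object 0 first
ranks = expected_ranks([(0, 1, 2), (0, 2, 1)], theta=[-1.0, -0.5])
```

Since both basic rankings place object 0 first, object 0 obtains the smallest expected rank, and the expected ranks always sum to n(n+1)/2.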

2.11 Summary

In this chapter, the fundamental concepts regarding the tasks of characterizing a network and finding the network's most influential nodes were introduced. Broader concepts such as prestige, popularity or recognition were also explored, distinguishing them from what it means to be an influencer. Other related network analysis topics were introduced, namely information cascades and information diffusion models, since the most influential nodes in a network have the capacity to disseminate information through the network at a much faster pace, reaching a greater number of other nodes. Learning-To-Rank and rank aggregation techniques were also introduced as ways of combining different ranking lists to produce a single, global and uniform ranking list.

Chapter 3

Related Work

This chapter presents the most important related work in the context of my MSc thesis. The chapter starts by presenting the HITS algorithm and Google's PageRank algorithm for ranking web pages, discussing how the latter evolved from its original formulation to more detailed and specific approaches, such as the Weighted PageRank algorithm and the Topic-Sensitive PageRank algorithm. Then, the chapter introduces the IP Algorithm, a recent development that extends the benefits of PageRank and determines the influence and passivity of network nodes based on their capacity to forward information. In the specific realm of Twitter, we present TwitterRank, an approach to measure the influence of a Twitter user based on the principle of homophily regarding the topics that users write about. Finally, we take a deeper look at the work that has been done in Bibliometrics in order to find influencers in citation and co-authorship networks, also describing works that take into account the temporal evolution of graphs.

3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm

The HITS algorithm, a Web page ranking method developed by Kleinberg (Kleinberg, 1998), is based on the notion of authorities and hubs. The authorities, i.e., pages that have a great number of inlinks, have a mutually reinforcing relationship with the hubs, i.e., pages that have outlinks to many related authorities, in such a way that a good hub is a page that points to many good authorities, and a good authority is a page that is pointed to by many good hubs – see Figure 3.5. This relationship is put into use through the iterative procedure shown in Algorithm 1, which maintains and updates the weights of each page (Kleinberg, 1998).

Figure 3.5: A graph with hubs and authorities (adapted from Kleinberg (1998)).

Algorithm 1 The Hyperlinked Induced Topic Search (HITS) Algorithm
  G: a graph with n interlinked pages
  k: a constant corresponding to the number of iterations
  z: the vector (1, 1, 1, ..., 1) ∈ R^n
  Set x_0 := z
  Set y_0 := z
  for i = 1, 2, ..., k do
      Apply x_p = Σ_{q: q→p} y_q to (x_{i−1}, y_{i−1}), obtaining new x-weights x′_i
      Apply y_p = Σ_{q: p→q} x_q to (x′_i, y_{i−1}), obtaining new y-weights y′_i
      Normalize x′_i, obtaining new authority scores x_i
      Normalize y′_i, obtaining new hub scores y_i
  end for

In order to compute the HITS algorithm, the aforementioned Gephi^1, NetworkX^2 and Network Workbench^3 software packages can be used.
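Since Algorithm 1 is short, it can also be transcribed directly; the following Python sketch uses a small hypothetical graph, given as (source, target) edges, and L2 normalization:

```python
# Plain-Python transcription of Algorithm 1 (HITS): x holds authority
# scores, y holds hub scores, both refined over k iterations.
def hits(edges, k=50):
    nodes = sorted({n for e in edges for n in e})
    x = {n: 1.0 for n in nodes}  # authority weights
    y = {n: 1.0 for n in nodes}  # hub weights
    for _ in range(k):
        x_new = {p: sum(y[q] for (q, t) in edges if t == p) for p in nodes}
        y_new = {p: sum(x_new[t] for (q, t) in edges if q == p) for p in nodes}
        nx = sum(v * v for v in x_new.values()) ** 0.5
        ny = sum(v * v for v in y_new.values()) ** 0.5
        x = {n: v / nx for n, v in x_new.items()}
        y = {n: v / ny for n, v in y_new.items()}
    return x, y

# h1 and h2 point to both authorities; h3 points only to a1
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
auth, hub = hits(edges)
```

Here a1 ends up with the top authority score, since it is pointed to by all three hubs.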

3.2 The PageRank Algorithm and its Variants

The PageRank algorithm arose in the context of the development of Google’s search engine, at the time described as a prototype of a large-scale search engine that made heavy use of the hyperlinked structure of the web (Brin & Page, 1998).

PageRank is based on principles from academic citation analysis, applied to the web. It can be mathe- matically expressed as follows:

\[ PR(A) = \frac{1 - d}{N} + d \sum_{i} \frac{PR(T_i)}{C(T_i)} \tag{3.36} \]

A page A has pages T_1, ..., T_n that point to it (i.e., that cite page A), and C(T_i) is the number of outlinks of page T_i. The term N corresponds to the total number of pages in the network. The free parameter d is called the damping factor and controls the performance of the algorithm, being usually set to 0.85. In a random web surfer scenario, the surfer can restart his search with probability 1 − d, by jumping to a page that is chosen uniformly at random, instead of following a random link, which he does with probability d (Chen et al., 2007). Figure 3.6 depicts the computation of the PageRank score for a three-node network.

1 http://gephi.org/developers/
2 http://networkx.lanl.gov/
3 http://nwb.cns.iu.edu/

Figure 3.6: A graph illustrating the computation of PageRank (adapted from Page et al. (1998)).

From Figure 3.6, one can see that page A has an inlink from page C and two outlinks, to pages B and C. Therefore, page A splits its PageRank score of 0.4 between its two outlinks, transferring a value of 0.2 to each of pages B and C. In turn, page B has a PageRank score of 0.2, which A transferred to it. Because B only has an outlink to C, it transfers its entire PageRank score to page C. Finally, page C, which receives PageRank scores of 0.2 from A and from B, accumulates a PageRank score of 0.4, which is entirely transferred to its only outlink, page A.
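This walk-through can be reproduced with a short Python sketch of Equation 3.36; setting d very close to 1 mimics the idealized, damping-free scores of the figure (the graph and the value d = 0.9999 are choices made for this illustration):

```python
# Power-iteration sketch of Equation 3.36 on the three-page graph of
# Figure 3.6: A -> {B, C}, B -> C, C -> A.
def pagerank(links, d=0.9999, iters=1000):
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iters):
        pr = {p: (1 - d) / n + d * sum(pr[q] / len(links[q])
                                       for q in links if p in links[q])
              for p in links}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(links)  # A and C converge to ~0.4, B to ~0.2
```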

A page can achieve a high PageRank score if it has many other pages pointing to it, i.e., if it is highly cited, or if some of the pages that point to it have themselves a high PageRank score.

Even though PageRank was originally formulated over directed graphs, the works of Perra & Fortunato (2008) and of Mihalcea (2004) revealed that PageRank can also be applied to undirected graphs, in which case every vertex has equal indegree and outdegree.

In the realm of Bibliometrics, PageRank is used as a method complementary to citation analysis, since it mitigates the drawback of citation counts, which do not take into account the importance of the citing papers. PageRank allows us to identify publications that are being referenced by highly cited articles (Ding et al., 2009).

Authors such as Chen et al. (2007) suggested setting d = 0.5, based on the hypothesis that, in the context of citation networks, the entries in the reference list of a typical paper are collected by following citation paths of average length 2. Chen et al.'s justification is based on the empirical observation that about 50% of the articles in the reference list of a paper A have at least one citation following the pattern B → C, in which article C is also part of A's reference list. Thus, the authors assume that there is a feed-forward loop among A, B and C, such that A → B, B → C and, consequently, A → C.

Due to its probabilistic nature, and also to the fact that each node is guaranteed to be visited, PageRank scores are not comparable across different graphs. To mitigate this, Berberich et al. (2006) proposed a normalization of the PageRank scores, which eliminates any dependency on the size of the graph. The normalized PageRank score can be computed as follows:

\[ \widehat{PR}(v) = \frac{PR(v)}{\frac{1}{|V|} \left( d + (1 - d) \sum_{u \in D} PR(u) \right)} \tag{3.37} \]

In the formula, the denominator represents the lower-bound for Equation 3.36, while |V| is the total number of vertices in the graph and D ⊆ V is the set of dangling nodes (i.e., nodes without outlinks).

Alternatively to the random surfer model, and specifically for social phenomena such as epidemics or word-of-mouth recommendation, Ghosh et al. (2011) proposed a broadcast-based non-conservative diffusion model, due to the fact that these phenomena can be modeled as contact processes, in which an active (infected) node activates its neighbours, via broadcast, with some probability. The difference between this model and the random surfer model is that, while the latter conserves the amount of substance being diffused on the network, in the former the amount of information changes as it spreads from an individual to his neighbours. Ghosh et al. (2011) state that PageRank is a steady state solution of conservative diffusion and, therefore, a conservative metric, while Alpha-Centrality, a non-conservative metric that measures the total number of paths from a node, exponentially attenuated by their length, is a steady state solution of linear non-conservative diffusion. In their study, the authors propose an efficient algorithm for computing Alpha-Centrality.

To compute the PageRank algorithm, we can use some readily available open-source software libraries, such as the aforementioned Gephi, NetworkX and Network Workbench packages, or the LAW WebGraph^1 Java library for large-scale web graph analysis (Boldi & Vigna, 2004).

3.2.1 Weighted PageRank

In the original PageRank algorithm from Equation 3.36, we have no notion of hyperlink weight, and thus all hyperlinks express the same degree of relationship between the pages they link (Bollen et al., 2006). However, in many practical applications, not all links express the same type of relationship.

Acknowledging that some links in a web page may be more important than others, Xing & Ghorbani (2004) proposed a Weighted PageRank algorithm that assigns higher scores to more important links, instead of the traditional even division among the outlinks of a page. Each link is assigned a value that is proportional to the popularity of the destination node, i.e., proportional to its numbers of inlinks and outlinks.

1 http://webgraph.dsi.unimi.it/

In this approach, there is an inlink weight W^in_(v,u) and an outlink weight W^out_(v,u). The inlink weight of link (v, u) is based on the number of inlinks of page u and on the number of inlinks of all the pages that are referenced by page v. The outlink weight is analogous. They are calculated as follows:

\[ W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p} \qquad W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p} \tag{3.38} \]

In the formulas, I_u and I_p represent, respectively, the numbers of inlinks of pages u and p, while O_u and O_p represent the numbers of outlinks of pages u and p. R(v) is the set of pages referenced by page v. Considering the introduction of these two weights, the Weighted PageRank algorithm can be mathematically expressed as follows:

\[ PR(u) = (1 - d) + d \sum_{v \in B(u)} PR(v) \, W^{in}_{(v,u)} W^{out}_{(v,u)} \tag{3.39} \]

In the formula, B(u) is the set of pages that link to page u.

The studies conducted within the work of Xing and Ghorbani revealed that their Weighted PageRank algorithm has a better performance than the original PageRank.
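A sketch of this computation in Python, over a small hypothetical four-edge graph, could look as follows (the graph and the number of iterations are assumptions made for the example):

```python
# Sketch of the Weighted PageRank of Xing & Ghorbani (Equations 3.38 and
# 3.39): each edge (v, u) carries inlink and outlink weights derived from
# the popularity of the pages in v's reference set R(v).
def weighted_pagerank(edges, d=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [v for (u, v) in edges if u == n] for n in nodes}
    inl = {n: [u for (u, v) in edges if v == n] for n in nodes}

    def w_in(v, u):   # Equation 3.38, inlink weight of edge (v, u)
        return len(inl[u]) / sum(len(inl[p]) for p in out[v])

    def w_out(v, u):  # Equation 3.38, outlink weight of edge (v, u)
        return len(out[u]) / sum(len(out[p]) for p in out[v])

    pr = {n: 1.0 for n in nodes}
    for _ in range(iters):  # Equation 3.39
        pr = {u: (1 - d) + d * sum(pr[v] * w_in(v, u) * w_out(v, u)
                                   for v in inl[u])
              for u in pr}
    return pr

pr = weighted_pagerank([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])
```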

Fiala et al. (2008) also proposed modifications to the original PageRank algorithm, adapting it to bibliographic networks. The authors take into account both citation and co-authorship information: in a graph whose nodes correspond to the authors of the papers, each edge (u, v) ∈ E, with E the set of edges between the vertices, is associated with weights w_{u,v}, c_{u,v} and b_{u,v}. The value w_{u,v} is the number of citations from author u to author v, the value c_{u,v} is the number of common publications by u and v, and b_{u,v} can assume different values, depending on the semantics of the edge weights that we want to stress. The new ranking for authors is defined as follows:

\[ R(u) = \frac{1-d}{|A|} + d \sum_{(v,u) \in E} R(v) \cdot \frac{\frac{w_{v,u}}{\sum_{(v,j) \in E} w_{v,j}} \cdot \frac{c_{v,u}+1}{b_{v,u}+1}}{\sum_{(v,k) \in E} \frac{w_{v,k}}{\sum_{(v,j) \in E} w_{v,j}} \cdot \frac{c_{v,k}+1}{b_{v,k}+1}} \tag{3.40} \]

In the formula, |A| is the number of vertices (i.e., the number of authors) and d is a damping factor, empirically set to d = 0.9.

In this approach, the Weighted PageRank algorithm is recovered if, in Equation 3.40, the coefficients b and c equal zero.

Bollen et al. (2006), when applying the Weighted PageRank algorithm to journal citation networks, took into account journal citation frequencies in the transfer of PageRank values, so that the prestige of a journal can be accordingly transferred along the iterations of the algorithm. They referred to this transferred value as the Propagation Proportion between journals and defined it as follows:

\[ w(v_j, v_i) = \frac{W(v_j, v_i)}{\sum_{k} W(v_j, v_k)} \tag{3.41} \]

In the formula, W(v_j, v_i) is the weight of the link between journals v_j and v_i, normalized by the weights of journal v_j's outlinks. In the application of the Weighted PageRank algorithm described by Bollen et al. (2006), the number of outlinks C(T_i) from Equation 3.36 has been replaced with the Propagation Proportion, resulting in the following equation:

\[ PR_w(v_i) = \frac{1 - d}{N} + d \sum_{j} PR_w(v_j) \times w(v_j, v_i) \tag{3.42} \]

On the other hand, within the work of Yan & Ding(2011), citation counts are incorporated with the network topology, resulting in the following integrated Weighted PageRank algorithm:

\[ PR_w(p) = (1 - d) \frac{CC(p)}{\sum_{j=1}^{N} CC(p_j)} + d \sum_{i=1}^{k} \frac{PR_w(p_i)}{C(p_i)} \tag{3.43} \]

In the formula, CC(p) represents the number of citations pointing to an author p, \sum_{j=1}^{N} CC(p_j) is the sum of the citation counts of all the nodes in the network, and the (1 − d) term, as in previous PageRank definitions, ensures that the results sum up to one. Yan & Ding (2011) pointed out two extreme scenarios regarding the variation of d. If d = 0, then each node's score equals CC(p) / \sum_{j=1}^{N} CC(p_j), i.e., its normalized citation count. Also, and in accordance with Boldi et al. (2005), when d → 1^−, PageRank becomes unstable and its convergence rate slows.

3.2.2 Topic-Sensitive PageRank

The link-structure of the Web is used in the original PageRank algorithm to pre-compute topic-independent scores that reflect the importance of web pages. The pre-computed importance scores can afterwards be combined with other Information Retrieval scores, e.g., term frequency, to produce a ranking of the pages towards specific user queries (Brin & Page, 1998).

Haveliwala (2002) proposed a Topic-Sensitive PageRank algorithm, where one computes offline a set of PageRank vectors, which are biased towards a set of representative basis topics from the Open Directory Project^1. For each page, and regarding the considered set of topics, a set of importance scores is created and, at query-time, the similarity between the query and/or user context and the topics is calculated. To achieve the final ranking, one linearly combines the topic-sensitive vectors, weighted by the similarity of the query towards the topics.

1 http://www.dmoz.org/

The mathematical approach to this Topic-Sensitive PageRank is as follows. Consider a query q and its context q′ in the page u; we may have a search in context (i.e., the user is viewing a document and selects a term from it, in order to get more information about the selected term). The context q′ consists of all the terms in u if we have a search in context, and otherwise q′ consists only of the query q itself. For each topic c_j, the following quantity is computed:

\[ P(c_j \mid q') = \frac{P(c_j) \cdot P(q' \mid c_j)}{P(q')} \propto P(c_j) \cdot \prod_{i} P(q'_i \mid c_j) \tag{3.44} \]

In the formula, each q′_i is a term of the context q′, and P(q′_i|c_j) can be computed from the class term-vector D_j, which consists of the terms of the documents below each of the 16 top-level categories of the Open Directory Project (ODP). Finally, a composite, query-sensitive importance score s_{qd} is computed for each document as follows:

\[ s_{qd} = \sum_{j} P(c_j \mid q') \cdot r_{jd} \tag{3.45} \]

In the formula, r_{jd} is the rank of document d given the PageRank vector PR(α, v_j) for topic c_j. In turn, PR(α, v_j) has as parameters a bias factor α and the non-uniform damping vector v_j, with T_j being the set of URLs in the ODP category c_j:

 1  , i ∈ Tj |Tj | vji = (3.46)  0, i∈ / Tj

The bias factor, similarly to PageRank’s damping factor, can influence the biasing degree of the resulting vector towards the topic vector that was used. This bias was heuristically set to α = 0.25 by the authors.
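The offline biasing and query-time combination can be sketched as follows; the four-page graph, the two topics, and the topic probabilities P(c_j|q′) are all hypothetical values chosen for illustration:

```python
# Sketch of Topic-Sensitive PageRank: one rank vector per topic, biased by
# the non-uniform damping vector of Equation 3.46, then linearly combined
# at query time as in Equation 3.45 (here combining full score vectors
# rather than ranks, a simplification).
def biased_pagerank(links, topic_pages, alpha=0.25, iters=100):
    # teleport only to the pages of the topic (Equation 3.46)
    v = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in links}
    pr = dict(v)
    for _ in range(iters):
        pr = {p: alpha * v[p] + (1 - alpha) * sum(pr[q] / len(links[q])
                                                  for q in links if p in links[q])
              for p in links}
    return pr

links = {"sports1": ["sports2", "news1"], "sports2": ["sports1"],
         "news1": ["news2"], "news2": ["news1", "sports1"]}
topics = {"sports": {"sports1", "sports2"}, "news": {"news1", "news2"}}
vectors = {t: biased_pagerank(links, pages) for t, pages in topics.items()}

p_topic = {"sports": 0.9, "news": 0.1}  # hypothetical P(c_j | q')
s_qd = {doc: sum(p_topic[t] * vectors[t][doc] for t in topics) for doc in links}
```

With a sports-heavy query context, the top-scoring page is a sports page, even though both topic vectors were computed offline over the same graph.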

3.2.3 TwitterRank

In the context of Twitter, the popular microblogging service, there is often the need to determine who the influential users are.

From the work of Weng et al. (2010) arose TwitterRank, an extension of the PageRank algorithm that takes both the topic similarity between users and the link structure of the social network into account. Indeed, the influence of a user may vary across different topics, since a Twitter user can have interests or expertise in many distinct areas.

In the same way that, in Bibliometrics, citation count is the simplest method to assess the influence of an author in an author-publication network, on Twitter the follower count, i.e., the total number of people who follow a particular user, has been interpreted as a good indicator of influence. Nevertheless, Weng et al. (2010) observed that 72.4% of the users follow more than 80% of their followers, and that 80.5% of the users have 80% of their friends (i.e., twitterers whose updates are being followed) following them back. There are two possible explanations for this: either the act of following is so casual that a twitterer randomly follows other twitterers and they, politely, just follow them back, or the following relationship reflects the existence of a strong similarity among users, due to their interest in the topics the twitterers tweet about. The latter denotes the homophily phenomenon.

The general framework proposed for TwitterRank is depicted in Figure 3.7. First, in the topic distillation phase, the topics twitterers are interested in are extracted on the basis of what they tweet about. Then, a topic-specific relationship network is built, based on the previously gathered topics. Finally, the TwitterRank algorithm is applied to measure the topic-sensitive influence of a twitterer, taking into account both the topics that were distilled and the structure of the topic-specific relationship network. Top topics are identified in the order of the probabilities of topic presence, as captured in a matrix WT with W unique words in tweets and T topics, in which each entry W_{it} gives the number of times the unique word w_i has been assigned to topic t.

Figure 3.7: The general TwitterRank framework (adapted from Weng et al. (2010)).

This approach addresses two important shortcomings of PageRank, namely the fact that it does not take into account (i) the interests of the nodes of the network, and (ii) the indegree associated with the follower count in Twitter.

To mathematically describe the topic-specific TwitterRank algorithm, we can see the Twitter network as a directed graph D(V, E), where the vertices V are the twitterers and the edges E are the following connections between two twitterers. These connections are directed from follower to friend. In a random surfer scenario, the surfer visits each twitterer with a certain topic-specific probability, by following the appropriate edge in D. The transition matrix for topic t, P_t, from follower s_i to friend s_j, is defined as follows, where |τ_j| is the number of tweets published by s_j, and Σ_{a: s_i follows s_a} |τ_a| sums the number of tweets published by all of s_i's friends.

\[ P_t(i, j) = \frac{|\tau_j|}{\sum_{a:\, s_i \text{ follows } s_a} |\tau_a|} \cdot sim_t(i, j) \tag{3.47} \]

The similarity between s_i and s_j regarding topic t, denoted by sim_t(i, j), is defined as follows:

\[ sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}| \tag{3.48} \]

In the formula, DT′ is the row-normalization of matrix DT, with D being the twitterers and T the topics. In DT′, each row is the probability distribution of twitterer s_i's interest over the T topics. Thus, the similarity between s_i and s_j in topic t is assessed through the difference between the probabilities that the two are interested in topic t. The higher their similarity, the higher the transition probability from s_i to s_j.

There is also the possibility of having some twitterers following one another in such a cyclic way that they do not follow anyone outside that particular circle of following relations, which can end up in an accumulation of high influence that is not distributed. To account for this situation, Weng et al. (2010) introduced a teleportation vector E_t that captures the probability that a random surfer jumps to some twitterer instead of following the edges of graph D. The teleportation vector is defined as follows:

\[ E_t = DT''_{\cdot t} \tag{3.49} \]

In the formula, DT''_{\cdot t} is the t-th column of DT″, the column-normalized form of matrix DT, the latter being part of the results from the topic distillation phase. Each entry of DT contains the number of times the words in a twitterer's tweets have been assigned to a specific topic.

Thus, the topic-specific TwitterRank can be calculated as follows:

\[ \overrightarrow{TR_t} = \gamma P_t \times \overrightarrow{TR_t} + (1 - \gamma) E_t \tag{3.50} \]

In the formula, γ is a parameter that directly controls the probability of teleportation, analogous to PageRank's damping factor, and has a value that can range from 0 to 1, usually set to γ = 0.85.

Equation 3.50 gives the representation of the topic-specific TwitterRank vectors that are generated. However, these vectors only refer to a twitterer's influence in individual topics. To measure the overall influence of a twitterer across different topics, we need to compute the aggregated TwitterRank vector as follows:

\[ \overrightarrow{TR} = \sum_{t} r_t \cdot \overrightarrow{TR_t} \tag{3.51} \]

In the formula, \overrightarrow{TR_t} is the TwitterRank vector for topic t, and r_t is the weight assigned to topic t and associated with \overrightarrow{TR_t}.
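The iteration of Equation 3.50 and the aggregation of Equation 3.51 can be sketched with hypothetical numbers as follows (the matrices, teleportation vectors and topic weights are toy values; the update sums the incoming probability mass, i.e., it applies the transpose of P_t to the current vector):

```python
# Toy sketch of TwitterRank: P[t][i][j] is the topic-t transition
# probability from follower i to friend j (Equation 3.47), E[t] the
# teleportation vector (Equation 3.49), gamma the teleportation control.
def twitterrank_topic(P, E, gamma=0.85, iters=200):
    n = len(E)
    tr = [1.0 / n] * n
    for _ in range(iters):  # fixed point of Equation 3.50
        tr = [gamma * sum(P[i][j] * tr[i] for i in range(n))
              + (1 - gamma) * E[j] for j in range(n)]
    return tr

P = {"t1": [[0.0, 0.6, 0.4], [0.5, 0.0, 0.5], [0.3, 0.7, 0.0]],
     "t2": [[0.0, 0.2, 0.8], [0.9, 0.0, 0.1], [0.5, 0.5, 0.0]]}
E = {"t1": [0.2, 0.5, 0.3], "t2": [0.4, 0.4, 0.2]}
r = {"t1": 0.7, "t2": 0.3}  # hypothetical topic weights

per_topic = {t: twitterrank_topic(P[t], E[t]) for t in P}
overall = [sum(r[t] * per_topic[t][u] for t in P) for u in range(3)]  # Eq. 3.51
```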

Weng et al. (2010) observed that the most active twitterers are not necessarily the most influential in each topic. Also, due to the consideration of the topical dimension, there is a higher correlation between TwitterRank and the Topic-Sensitive PageRank (Section 3.2.2) than with the indegree or with the original PageRank algorithm. The experiments conducted by Weng et al. (2010), which used a Twitter dataset with messages from Singapore-based twitterers collected in April 2009, showed that TwitterRank outperforms other related algorithms, including both PageRank and the algorithm that Twitter was using at the time of their study.

3.3 The Influence-Passivity (IP) Algorithm

Romero et al. (2011) came to the conclusion that, for a user to be considered influential, he not only has to be popular and get attention from his peers, but he also has to overcome passivity, a state in which a user receives information but does not propagate it through the network. Thus, this approach determines both the influence and the passivity of a user, based on his information forwarding activity.

The algorithm proposed by Romero et al. (2011) is similar to HITS and to PageRank. However, the dif- ference in this approach is that the diffusion behaviour among the users is also taken into consideration. This work was conducted on Twitter and assigns to every user both a passivity score and an influence score, which respectively correspond to the authority and hub scores in the HITS algorithm. The use of passivity in the algorithm comes from the evidence that users in Twitter are generally passive and thus, when determining the influence of a user, taking into account the passivity of all the people that are influenced by him is also very important. The following assumptions are considered by the authors:

1. The influence score of a user depends on the number of people he influences, as well as on their passivity.

2. The influence score of a user depends on how dedicated the people that he influences are. This dedication is measured by the amount of attention a user pays to some other user, as compared to everyone else.

3. The passivity score of a user depends on the influence of those who he is exposed to, but not influenced by.

4. The passivity score of a user depends on how much he rejects some other user’s influence, com- pared to everyone else’s influence.

Given these assumptions, one should note that the network graph for this algorithm is a weighted graph G = (N, E, W) with N nodes, E edges and W edge weights, where weight w_{ij} represents the ratio between the influence that node i exerted over node j and the total influence that i attempted to exert over j. The output of the IP Algorithm is a function I : N → [0, 1] and a function P : N → [0, 1], which represent each node's relative influence and passivity, respectively. For each edge e = (i, j) ∈ E, the authors defined an acceptance rate u_{ij}, representing the amount of influence that user j accepted from user i, normalized by the influence j accepted from all users in the network, which thus reflects the loyalty user j has to user i. The acceptance rate is defined as follows:

\[ u_{ij} = \frac{w_{ij}}{\sum_{k:(k,j) \in E} w_{kj}} \tag{3.52} \]

There is also a rejection rate, which is the opposite of the acceptance rate, because 1 − w_{ji} is the amount of influence that user i rejects from user j. Thus, the rejection rate v_{ji} is the influence that user i rejected from user j, normalized by the total influence rejected from j by all other users in the network. The rejection rate v_{ji} is mathematically expressed as follows:

\[ v_{ji} = \frac{1 - w_{ji}}{\sum_{k:(j,k) \in E} (1 - w_{jk})} \tag{3.53} \]

The IP Algorithm is thus based on two operations that relate directly to the aforementioned assumptions. The operation that updates a user's influence I_i is as follows:

\[ I_i \leftarrow \sum_{j:(i,j) \in E} u_{ij} P_j \tag{3.54} \]

In the formula, the term P_j corresponds to the passivity referred to in Assumption 1, and the term u_{ij} to the amount of dedication referred to in Assumption 2. As for the operation that updates a user's passivity P_i, it is as follows:

\[ P_i \leftarrow \sum_{j:(j,i) \in E} v_{ji} I_j \tag{3.55} \]

In the formula, the term I_j corresponds to the influence referred to in Assumption 3, and v_{ji} to the rejection rate referred to in Assumption 4.

The algorithm takes as input a weighted graph and computes the IP scores for each node in m iterations, as depicted in the pseudo-code of Algorithm 2.

Algorithm 2 The Influence-Passivity (IP) Algorithm
  G(N, E, W): an influence graph with N nodes, E edges and W edge weights
  I_0 ← (1, 1, ..., 1) ∈ R^|N|
  P_0 ← (1, 1, ..., 1) ∈ R^|N|
  for i = 1 → m do
      Update P_i using the operation P_i ← Σ_{j:(j,i)∈E} v_{ji} I_j and the values I_{i−1}
      Update I_i using the operation I_i ← Σ_{j:(i,j)∈E} u_{ij} P_j and the values P_i
      for j = 1 → |N| do
          I_j = I_j / Σ_{k∈N} I_k
          P_j = P_j / Σ_{k∈N} P_k
      end for
  end for
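The pseudo-code above can be sketched directly in Python; the four-edge graph and its weights below are hypothetical:

```python
# Plain-Python sketch of Algorithm 2: W maps directed edges (i, j) to the
# weights w_ij; u and v are the acceptance and rejection rates of
# Equations 3.52 and 3.53.
def influence_passivity(W, m=50):
    nodes = {n for edge in W for n in edge}
    u = {(i, j): w / sum(w2 for (k, j2), w2 in W.items() if j2 == j)
         for (i, j), w in W.items()}
    v = {(j, i): (1 - w) / sum(1 - w2 for (j2, k), w2 in W.items() if j2 == j)
         for (j, i), w in W.items()}
    I = {n: 1.0 for n in nodes}
    P = {n: 1.0 for n in nodes}
    for _ in range(m):
        P = {i: sum(v[(j, i)] * I[j] for (j, i2) in W if i2 == i) for i in nodes}
        I = {i: sum(u[(i, j)] * P[j] for (i2, j) in W if i2 == i) for i in nodes}
        si, sp = sum(I.values()), sum(P.values())
        I = {n: val / si for n, val in I.items()}
        P = {n: val / sp for n, val in P.items()}
    return I, P

W = {("a", "b"): 0.8, ("a", "c"): 0.4, ("b", "c"): 0.5, ("c", "a"): 0.3}
I, P = influence_passivity(W)
```

After each iteration the influence and passivity scores are normalized, so each function sums to one over all nodes, as required by the [0, 1] output ranges.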

The authors also concluded that there is a weak correlation between popularity and influence. The IP Algorithm turned out to provide better indicators of popularity than PageRank.

3.4 Citation and Co-Authorship Networks

In Bibliometrics, there are two classes of ranking algorithms. In the class of collection-based ranking algorithms, a weighted graph is used whose nodes correspond to collections, e.g., journals and conference proceedings, with the weighted edges representing the total number of citations pointing from one collection to the other. The other class corresponds to publication-based ranking algorithms, where the nodes of the citation graph are individual publications and the edges represent citations between papers (Sidiropoulos & Manolopoulos, 2005).

Both PageRank (Brin & Page, 1998) and HITS (Kleinberg, 1998) are part of the second class of ranking algorithms, while the ISI Impact Factor (Bollen et al., 2006) is part of the first class.

Neither PageRank nor HITS is perfectly suitable for Bibliometrics: the latter because a publication only gets a high authority score if there are good hubs pointing to it, and the former because it was designed in a way that a node's score is mostly affected by the scores of the nodes that point to it, and less by the number of incoming links. Following this assessment, Sidiropoulos & Manolopoulos (2005) introduced SCEAS Rank, a collection-based ranking algorithm in which scores are computed over a weighted graph whose nodes correspond to collections. SCEAS can be defined as follows:

\[ S_j = \sum_{i \to j} \frac{S_i + b}{N_i} \, a^{-1} \qquad (a \geq 1,\ b > 0) \tag{3.56} \]

In the formula, N_i is the number of outgoing citations of node i, b is the direct citation enforcement factor, which is used so that citations from zero-scored nodes can also contribute to the score of the publications they cite, and a denotes the speed at which an indirect citation enforcement converges to zero. If a change in the score of node i occurs, it affects the score of a node j that is x nodes away with a factor of a^{−x}. The SCEAS approach also has the following advantages over the PageRank and HITS algorithms:

1. A node’s score is affected by the number of incoming citations.

2. The algorithm's computation and convergence are very fast. In the experiment conducted by Sidiropoulos & Manolopoulos (2005) with a DBLP dataset, they verified that SCEAS needed half the time needed by PageRank, and about 1/10 of the time needed by HITS.

3. A node’s score is less affected by the score of distant nodes and, whenever new nodes and ci- tations are added to the network, the new score’s computation can be performed incrementally, using the previous score vector as the input vector for the computation.
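As a sketch, Equation 3.56 can be iterated to a fixed point on a tiny hypothetical citation graph (the parameter values a = 2 and b = 1 are toy choices respecting the constraints a ≥ 1, b > 0):

```python
# Fixed-point sketch of SCEAS (Equation 3.56) on a toy citation graph given
# as (citing, cited) pairs; Ni is the number of outgoing citations of i.
def sceas(edges, a=2.0, b=1.0, iters=30):
    nodes = {n for e in edges for n in e}
    out_n = {n: sum(1 for (i, j) in edges if i == n) for n in nodes}
    s = {n: 0.0 for n in nodes}
    for _ in range(iters):
        s = {j: sum((s[i] + b) / (out_n[i] * a) for (i, j2) in edges if j2 == j)
             for j in nodes}
    return s

# p3 cites p1 and p2; p2 cites p1; p1 cites nothing
scores = sceas([("p2", "p1"), ("p3", "p1"), ("p3", "p2")])
```

Thanks to the b term, even the zero-scored paper p3 contributes to the scores of the papers it cites, and the influence of distant nodes is attenuated by the a^{−x} factor.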

Specifically for co-authorship networks, where the graph nodes represent authors and the edges represent ties between two authors, Liu et al. (2005) proposed AuthorRank, a modification to the PageRank algorithm that is computed over a weighted directed co-authorship graph.

The co-authorship graph is directed and weighted in order to express the magnitude of the relationship between two authors and is, as in the Weighted PageRank, represented by G = (V, E, W), with a set V of authors, a set E of co-authorship relationships, and a set W of normalized weights w_{ij} connecting authors v_i and v_j. The normalized weights w_{ij} are such that the outgoing weights of an author sum up to one, and they are computed as follows:

w_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}} \qquad (3.57)

In the formula, cij and cik correspond to the co-authorship frequency (Equation 3.58), which is also correlated with exclusivity.

The idea behind co-authorship frequency is to assign more weight to authors that co-publish more papers together, and do so exclusively (Liu et al., 2005). For a set of m articles, co-authorship frequency is defined as follows:

c_{ij} = \sum_{k=1}^{m} g_{i,j,k} \qquad (3.58)

In turn, exclusivity, i.e., giving more weight to co-authorship ties in articles with fewer total co-authors than in articles with a large number of co-authors (Liu et al., 2005), is defined for authors vi and vj, who co-author article ak, as follows:

g_{i,j,k} = \frac{1}{f(a_k) - 1} \qquad (3.59)

In the formula, f(ak) is the total number of authors of article ak.

The magnitude of the connection between two authors is determined by the following factors:

1. Frequency of co-authorship: Authors that co-author frequently should have a higher co-authorship weight;

2. Total number of co-authors on articles: Less weight should be assigned to the co-author relationship if the article has many authors.

Therefore, the AuthorRank of author i is expressed as follows:

AR(i) = (1 - d) + d \sum_{j=0}^{n} AR(j) \times w_{j,i} \qquad (3.60)

In the formula above, AR(j) is the AuthorRank score of the backlinking node j, and w_{j,i} corresponds to the weight of the edge between node j and node i.

Also, when exclusivity and collaboration frequency are taken into account, one can assess that some ties are more prestigious than others.
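The weighting scheme in Equations 3.57–3.59 can be sketched as follows; the toy article list and the function name are assumptions made for illustration.

```python
# Hedged sketch of the AuthorRank edge weights: exclusivity g_{i,j,k},
# co-authorship frequency c_ij, and normalized weights w_ij.
from collections import defaultdict

def coauthor_weights(articles):
    """articles: list of author lists; returns normalized weights w[i][j]."""
    c = defaultdict(float)  # co-authorship frequency c_ij (Equation 3.58)
    for authors in articles:
        if len(authors) < 2:
            continue
        # g_{i,j,k} = 1 / (f(a_k) - 1): fewer co-authors, stronger tie
        exclusivity = 1.0 / (len(authors) - 1)
        for i in authors:
            for j in authors:
                if i != j:
                    c[(i, j)] += exclusivity
    row_sum = defaultdict(float)
    for (i, j), cij in c.items():
        row_sum[i] += cij
    w = defaultdict(dict)
    for (i, j), cij in c.items():
        w[i][j] = cij / row_sum[i]  # weights out of each author sum to one
    return w

# Toy corpus: ana and bo co-publish often and mostly exclusively.
papers = [["ana", "bo"], ["ana", "bo"], ["ana", "bo", "carl"]]
w = coauthor_weights(papers)
```

In this toy corpus the ana–bo tie receives more weight than the ana–carl tie, since ana and bo co-publish more frequently and more exclusively.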

3.5 Temporal Issues in Ranking Scientific Articles

Citation networks are generally static, since a scientific article cannot lose citations throughout the years and articles do not disappear from the network. On the other hand, social networks are generally characterized as dynamic networks, which change at a very fast pace, due to new users making new connections and former users leaving the social network, breaking the ties they had established. Still, even in the case of citation networks, new articles are constantly being introduced. Therefore, time is a key factor in social network analysis.

Sayyadi and Getoor developed FutureRank, which computes the expected PageRank score of a scientific article based on the citations it will obtain in the future (Sayyadi & Getoor, 2009). This number of future citations is referred to as the usefulness of the article, and the authors assumed that recent articles are more useful. Nevertheless, older and highly cited articles still obtain a good ranking, due to being cited by recent articles. The algorithm is computed on a network that has two different types of nodes, namely articles and authors, and is thus unfolded into two distinct networks: (i) a citation network connecting articles through citation edges, and (ii) an authorship network connecting articles and authors through co-authorship edges. In the second network, articles can be mapped to the authorities and authors to the hubs of the HITS algorithm. As the networks share nodes, information is passed from one to the other.

In short, FutureRank runs one step of PageRank on the first network, in order to transfer authority from the articles to their references, and one step of HITS on the second network. These results are repeatedly combined until convergence is reached. The ranking of articles also involves a personalized PageRank vector, which is pre-computed based on the current time and the publication time of the articles, instead of being based on the number of nodes in the network, as in the original PageRank algorithm.

The CiteRank algorithm (Walker et al., 2007) makes use of publication time in order to rank articles: each researcher, independently of others, is assumed to start his search with recent articles, proceeding along a chain of citations until fully satisfied. The output of the algorithm can be seen as an estimate of the traffic to an article, i.e., the probability of encountering the article via a path of any length, and it is correlated with the number of citations, in the sense that the larger the number of citations, the more likely the article is to be visited through one of its incoming links. CiteRank is similar to the PageRank algorithm in all respects, except that CiteRank initially distributes random surfers exponentially with age, with probability

\rho_i = e^{-age_i / \tau_{dir}}

where age_i is the age of the i-th article and \tau_{dir} is the time decay constant, thus favoring recent articles.
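The age-based starting distribution of CiteRank can be sketched as follows; the value of τ and the example ages are assumptions chosen only to illustrate the exponential bias toward recent articles.

```python
# Minimal sketch of CiteRank's starting distribution rho_i = exp(-age_i / tau),
# normalized so the random surfers form a probability distribution.
import math

def citerank_start(ages, tau=2.6):
    """ages: article ages in years; returns the normalized surfer distribution."""
    raw = [math.exp(-a / tau) for a in ages]
    total = sum(raw)
    return [r / total for r in raw]

# Four articles: brand new, 1, 5, and 20 years old.
rho = citerank_start([0, 1, 5, 20])
```

The newest article receives the largest share of initial surfers, so recent work is favored before any citation-following takes place.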

3.6 Summary

This chapter presented what has been previously done regarding the task of finding influencers in a network, with its main focus on the PageRank algorithm and on the different variants that have arisen over the years. The Influence-Passivity (IP) algorithm was also presented, i.e., a novel approach to influence, based on the HITS and PageRank algorithms, that also takes information diffusion into account. Finally, we glanced at a recent trending research topic concerning temporal issues in ranking scientific articles, specifically the prediction of future PageRank scores in a citation network, based on the future citations that an article may receive.


Chapter 4

Finding Influencers in Social Networks

This chapter presents and details the work that was developed in the context of my MSc thesis. I focused on studying and developing techniques to identify influential nodes in a network so that, given a network, one can characterize it and assess which nodes exert more influence over others, i.e., which nodes induce others to adopt a particular behavior, e.g., forwarding a message or visiting a renowned monument or concert venue.

Two distinct experiments were conducted, each with a different type of network. In the first experiment, we collected real and up-to-date data from a location-based social networking service, namely FourSquare, and from Twitter, a social networking and microblogging service, building social networks from the collected data. The network built from FourSquare's data is commonly called a location-based social network, due to its inclusion of information from users' interactions with other users, as well as users' interactions with locations, as they check in at different places. The second experiment involved data from DBLP, a digital library containing information about academic publications and their citations, from which a citation network was built.

With the work that was developed, we wanted to test the hypothesis that we can identify a network's most influential nodes through network analysis metrics and algorithms. These techniques were applied to different kinds of social networks, in order to explore influence in distinct contexts. In the experiment with location-based social networks, we wanted to test how good these social network analysis metrics and algorithms are at the task of identifying the most relevant nodes. On the other hand, when experimenting with academic social networks, we wanted to identify the most important papers in the dataset and test whether it was possible to predict the future influence scores of the nodes in the network, based on their previous influence scores.

The remainder of this chapter is organized as follows: first, we introduce the main software packages that were used and extended in the course of this research. Then, we describe the metrics used to characterize the social networks of our experiments. In Section 4.2 we thoroughly describe the experiment with location-based social networks, while in Section 4.3 we describe the experiment with the academic social network built from DBLP, covering the process of data collection, the algorithms that were computed, and the methods used to find influential nodes. We finish this chapter with a brief summary of what has been presented.

4.1 Available Resources for Finding Influencers

To perform our experiments and fulfill the tasks of characterizing a social network and finding its most influential nodes, we used several state-of-the-art algorithms and open-source software packages for network analysis, among which is the LAW Webgraph open-source software package.

LAW Webgraph is an open-source project developed by researchers from the Laboratory for Web Algorithmics at the University of Milan. It contains a Java library for large-scale web graph analysis, presenting a novel approach to graph compression that enables the creation and storage of web-scale graphs. Among other things, the LAW Webgraph package contains an implementation of the PageRank algorithm, which was the first algorithm we used for assessing the influence of nodes in our experiments. Since we intended to extend this software package with the HITS and IP algorithms, the structure of LAW's PageRank implementation served as a template for our algorithmic extensions.

For the implementation of the HITS algorithm we followed the pseudo-code in Algorithm 1, in which two different scores have to be computed - the hub score and the authority score. The computation of these scores is based, respectively, on the outlinks and inlinks of every node in the graph. Through LAW Webgraph's API we could only access the successors of a node. To overcome this limitation when computing the HITS algorithm, we built both the graph and its transpose, instead of just the graph, so that we could access both the successors and the predecessors of each node through the transpose of the original graph (i.e., the inlinks of a node are its outlinks in the graph's transpose).
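The transpose trick described above can be sketched in a few lines of Python; the adjacency-list representation and the toy graph are assumptions, standing in for the LAW Webgraph structures used in the actual implementation.

```python
# Sketch of HITS computed with a graph and its transpose: since only
# successors are exposed, predecessors are read off the transposed graph.

def transpose(graph):
    """Reverse every edge: inlinks become outlinks."""
    t = {n: [] for n in graph}
    for u, succs in graph.items():
        for v in succs:
            t.setdefault(v, []).append(u)
    return t

def hits(graph, iterations=30):
    gt = transpose(graph)
    hub = {n: 1.0 for n in graph}
    auth = {n: 1.0 for n in graph}
    for _ in range(iterations):
        # authority: sum of hub scores of predecessors (via the transpose)
        auth = {n: sum(hub[p] for p in gt.get(n, [])) for n in graph}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # hub: sum of authority scores of successors (via the original graph)
        hub = {n: sum(auth[s] for s in graph[n]) for n in graph}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Toy graph: a and b both point to c.
g = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(g)
```

Here c, pointed to by two hubs, gets the highest authority score, while a and b act as hubs.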

Analogously, the Influence-Passivity (IP) algorithm involves the computation of two scores - the influence score and the passivity score. Thus, two graphs were again built. In this implementation we followed the pseudo-code in Algorithm 2 from Section 3.3.

4.1.1 Characterizing Networks

To understand aspects such as the dimension of our generated graphs or how well connected their nodes are, some well-known network analysis metrics were used.

With the average path length, one can assess the average distance between the nodes in our networks, understanding how tightly connected they are (e.g., a small average path length indicates that all nodes are closely connected, which means that it will be easy to spread information through the network). The clustering coefficient allows us to assess how close the neighbours in our networks are to one another, i.e., whether neighbours tend to form clusters with a large number of ties between them. On the other hand, by studying the degree distribution of the nodes in a network, one can assess whether we are in the presence of a large-scale network characterized by a power-law distribution of node degrees, i.e., a network in which the majority of the nodes have few connections, but where a smaller set of nodes holds an extremely large number of connections. These well-connected nodes are called hubs, and they can also be seen as central points of aggregation in the network.
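As a minimal illustration of inspecting a degree distribution, the sketch below computes the fraction of nodes at each degree; the sample degree sequence is an assumption, not data from the thesis, and a real power-law check would involve fitting over many more nodes.

```python
# Illustrative check for a heavy-tailed degree distribution.
from collections import Counter

def degree_histogram(degrees):
    """Map each degree value to the fraction of nodes having that degree."""
    counts = Counter(degrees)
    n = len(degrees)
    return {deg: c / n for deg, c in sorted(counts.items())}

# Most nodes have few connections; one hub holds many.
degrees = [1, 1, 1, 1, 2, 2, 3, 50]
hist = degree_histogram(degrees)
```

In a large-scale network, plotting this histogram on log-log axes would show the roughly straight line characteristic of a power law, with the hubs forming the long tail.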

4.2 Analysis of Location-based Social Networks

A traditional social network comprises a single type of node, corresponding to the users in the network. The edges between these nodes represent the friendship ties between the users. In turn, a location-based social network has all the properties of a social network; however, we now have two types of nodes instead of just one, namely (1) user nodes, which are the users in the network, who can be friends with other users, and (2) location nodes, which are the locations users have visited or mentioned in their personal messages. Therefore, one can say that a location-based social network also has two types of edges or social ties, namely (1) user-user ties, corresponding to the edges between two users and in all respects similar to the edges existing in social networks, and (2) user-location ties, corresponding to the edges between users and locations, which are derived from a user mentioning or visiting a specific location. Location-based social networks yield a great amount of information, because one can look at them according to two layers: one where the users are connected to their friends, and an underlying layer where users are connected to locations. The latter is an intersecting layer through which one can identify the most visited locations (i.e., locations that are connected to a larger number of users) and, from a location perspective, which locations exert more influence over the users they are connected to - see Figure 4.8.

Most online social networking services have public APIs, which allow the search and extraction of publicly available, real and up-to-date data. In our experiments, all the considered social network platforms provided access to a public API. Thus, the first step to gather information from these social networking services was to request data from the API and store it in a structured way, e.g., an XML file, for subsequent processing. With the raw data organized, it was then filtered to decouple user information from location information and from relationship ties. The different ranking algorithms and network analysis metrics were finally applied to a graph generated from the relationship ties in the previously filtered data.

Figure 4.8: Example of a location-based social network (adapted from Zheng & Zhou (2011)).

Data was collected from two different social network platforms: FourSquare and Twitter. FourSquare is a location-based social network that allows users to check in at different locations, called venues in its terminology, ranging from restaurants to nightclubs, movie theaters, university campuses, or a city's most iconic monuments. It was founded in 2009 and is a web application specially intended to be used on mobile devices. With the widespread availability of smartphones and mobile gadgets with an Internet connection, FourSquare's network and service have been growing and evolving throughout the years, reaching the milestone of 7 million registered users in 2011.

In FourSquare, registered users can search for other users or venues, e.g., one can search for Indian Restaurant near New York and access an extensive list of restaurants, each one with an address and a geospatial location, user-uploaded photos, reviews by users that have checked in there, as well as a list of venues similar to the one searched for. Venues can be associated with categories and tags. There is also an underlying game-play concept in this kind of social network, encouraging continuous interaction: (i) users earn points for checking in at venues or adding new venues to FourSquare, (ii) users earn badges if they check in at various different venues or complete tasks, and (iii) a FourSquare user can become mayor of a specific venue if he has checked in at that venue on more days than anyone else over a period of 60 days.

On the other hand, Twitter is a social networking and microblogging service that allows users to post messages of up to 140 characters - the tweets. Created in 2006, it has grown to be one of the most well-known social networks, with over 500 million active users. Initially, Twitter was only accessible via its website, but today one has a multitude of mobile applications at hand to manage one's account, tweet wherever one pleases, and also attach links to tweets. Nowadays, many Twitter users tweet as they arrive (or check in) at a specific location, deliberately attaching the geographical coordinates of that place to their tweet. This way, we can associate Twitter users with locations, building a location-based social network.

4.2.1 Data Collection from Online Services

To extract data about users and venues in FourSquare, we used the FourSquare API1, which returns JSON2 objects containing the result of each API call. For simplicity of use, an open-source Java implementation3 of the FourSquare API was used, providing straightforward methods to interact with the FourSquare API. This Java API includes all the methods in the official FourSquare API. However, the functionality of the method that searches for venues (i.e., venuesSearch) was not fully implemented, so a simple change to the FourSquare Java API was needed in order to extract reliable data. The original API's venuesSearch method allows one to obtain the set of venues that are near the provided latitude-longitude coordinates and within a specified radius of up to 5 km, but this radius functionality was not implemented in the open-source FourSquare Java API; we therefore added the radius parameter to the venuesSearch API call, taking full advantage of that functionality and obtaining more venues per call - see the pseudo-code in Algorithm 3. We also defined a bounding box for the New York City-Manhattan area, restricting our data collection to that geographical area in order to make a more contained study.

Algorithm 3 Pseudocode for the extraction of user and friend data from FourSquare.

latmax: maximum latitude for the New York City - Manhattan bounding box
longmax: maximum longitude for the New York City - Manhattan bounding box
latmin: minimum latitude for the New York City - Manhattan bounding box
longmin: minimum longitude for the New York City - Manhattan bounding box
lat: current latitude
long: current longitude
radius = 1000 (i.e., 1 km)
userSet: set of users from a venue

for all lat ∈ [latmin, latmax] and long ∈ [longmin, longmax] do
    venueSet ← all venues for (lat, long) within radius
    for all venue ∈ venueSet do
        Retrieve and store venue info
        userSet ← all of the venue's visiting users
        for all user ∈ userSet do
            Retrieve the user's friends
            Store friend information
        end for
    end for
end for

As for Twitter, we used the Twitter Public Stream API4, which provides a sample of roughly 1% of all the tweets being published at any given time. The data collection process had the following phases:

1. From that 1% of tweets, only the ones which had geographical coordinates were selected. Also, for each tweet we collected information such as the user id, the users he is following, and the users that are following him. Afterwards, with the coordinates associated with a user's tweet, we could establish user-location ties and, with the following and follower relationships, one could establish user-user ties.

1https://developer.foursquare.com 2http://www.json.org/ 3http://code.google.com/p/foursquare-api-java/ 4https://dev.twitter.com/docs/streaming-apis

2. From the collected user information, the users with the greatest number of connections were selected, and the data about their friends and followers was gathered.

3. Afterwards, similarly to what was done for FourSquare, all the collected data was filtered in order to keep only the information about tweets posted within the New York City-Manhattan area.

In order to discretize the geospatial coordinates, we used the Hierarchical Triangular Mesh (HTM) approach to divide the Earth's surface into a set of triangular regions, each roughly occupying an equal area of the Earth (Dutton, 1996; Szalay et al., 2007). In brief, the HTM offers a multi-level recursive decomposition of a spherical approximation to the Earth's surface. It starts at level zero with an octahedron and, by projecting the edges of the octahedron onto the sphere, it creates 8 spherical triangles, 4 in the Northern and 4 in the Southern hemisphere. Four of these triangles share a vertex at the pole, and the sides opposite to the pole form the equator. Each of the 8 spherical triangles can be split into four smaller triangles by introducing new vertices at the midpoints of each side, and adding great circle arc segments to connect the new vertices with the existing ones - see Figure 4.9.

Figure 4.9: A sequence of subdivisions of the world sphere, starting from the octahedron, down to level 5 corre- sponding to 8192 spherical triangles. The circular triangles have been plotted as planar ones, for simplicity (adapted from Szalay et al. (2007)).

This sub-division process can be repeated recursively, until we reach the desired level of resolution, as shown in Figure 4.10. The triangles in this mesh are the regions used in our representation of the Earth, and every triangle, at any resolution, is represented by a single numeric ID. For each location given by a pair of coordinates on the surface of the Earth, there is an ID representing the triangle, at a particular resolution, that contains the corresponding point. Notice that the proposed representation scheme contains a parameter k that controls the resolution, i.e. the area of the triangular regions. With a resolution of k, the number of regions n used to represent the Earth corresponds to n = 8 · 4k.
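The relation between the resolution parameter k and the number of regions can be checked directly; the Earth surface area constant below is an approximation, used only to estimate the average trixel size.

```python
# Sketch of the HTM resolution formula n = 8 * 4**k, plus an approximate
# average region area (the Earth surface area value is an assumption).

EARTH_SURFACE_KM2 = 510_072_000  # approximate total surface area

def htm_regions(k):
    """Number of triangular regions at resolution level k."""
    return 8 * 4 ** k

def avg_trixel_area_km2(k):
    """Rough average area of one trixel at level k."""
    return EARTH_SURFACE_KM2 / htm_regions(k)

n5 = htm_regions(5)
```

At level 5 this yields the 8192 spherical triangles mentioned in the caption of Figure 4.9, each covering on the order of 60,000 km² on average.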

Figure 4.10: The HTM recursive division process (adapted from Szalay et al. (2007)).

From the geographical coordinates found in some of the collected tweets, we computed the Hierarchical Triangular Mesh (HTM) so that we could assign a trixel representation to each geographical coordinate. With a trixel representation instead of a latitude-longitude representation, one has more freedom in specifying the range of the collected locations. In our case, we established three ranges of trixels, according to their resolution, i.e., locations with resolution 25, resolution 20, and resolution 10.

Nevertheless, this data collection process had some limitations. The main limitation of the FourSquare API was that its rate limit for authenticated calls per hour is set to 500, which is a very low threshold considering that we performed an extensive crawl and that each request for the listing of a user's friends is a frequent authenticated API call. As for the Twitter API, we had a rate limit of 600 calls per hour and, upon exceeding that limit, we had to wait until the next hour to make more API calls. This made us disregard a large number of tweets during the waiting time.

4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm

A major contribution of this work was the adaptation and implementation of the aforementioned Influence-Passivity (IP) algorithm. Developed by Romero et al. (2011), the IP algorithm was part of a study on information propagation in Twitter, in which the authors came to the conclusion that most users of this social network act as passive consumers of information, not forwarding content to the network. This algorithm presents a novel way of quantifying the influence of nodes in a network, by considering that each node has an influence score as well as a passivity score. These scores have a mutually reinforcing relationship, like the hub and authority scores in the HITS algorithm (Kleinberg, 1998).

For our implementation, some changes had to be made to the original IP algorithm, in order to adapt it to location-based social networks and perform an edge weight calculation consistent with the datasets we were working with. For the Twitter data collected by Romero et al. (2011), the weight of an edge e = (i, j) was assigned as follows:

w_e = \frac{S_{ij}}{Q_{ij}} \qquad (4.61)

In the formula, Qij represents the number of URLs that node i mentioned and Sij is the number of URLs that were mentioned by node i and retweeted by node j.

In the case of our datasets from FourSquare and Twitter, we wanted to generate a weight exclusively based on user-location and user-user ties, instead of URLs or retweets, as proposed by the original authors. Thus, we built a graph that rather than having two types of nodes, i.e., locations and users, would only have user nodes, estimating exclusively the influence of users in the network.

To calculate the weight of the edges between users, we adapted the Qij and Sij parameters, taking Qij as the number of locations node i has visited and Sij as the number of locations visited by both i and j, i.e., the number of commonly visited locations between nodes i and j, with i having visited each location before j visited it. In our adaptation of the algorithm, user influence is always dependent on the popularity of the locations a user has visited.

The original graph built from our datasets is depicted in Figure 4.11, i.e., the left-most graph which includes two types of nodes: (i) user nodes, represented by U1...U4, and (ii) location nodes, represented by S1...S3, and has undirected user-location ties and directed user-user ties. Also, the right-most graph in Figure 4.11 is the result of our adaptation of the IP algorithm, generating a network graph that only has directed and weighted user-user ties and has some differences regarding its structure, e.g., the original user-user edges no longer exist and new edges arise from common visits to locations. The connection between two nodes is associated with a non-negative, non-zero weight if they share a visited location, e.g., U3 and U2 both visited location S2 so there is a new edge from U3 to U2, with the weight w2, because U3 visited S2 after U2 had visited it.
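The adapted weight computation can be sketched as follows, following the definition of S_ij given above (i visited the common location before j did); the check-in log format, with (user, location, timestamp) tuples, is an assumption made for illustration.

```python
# Sketch of the adapted IP edge weights: Q_i counts the locations user i
# visited, S_ij counts locations both visited with i visiting first, and
# the edge (i, j) gets weight S_ij / Q_i.

def user_user_weights(checkins):
    """checkins: list of (user, location, timestamp) tuples."""
    first_visit = {}  # (user, location) -> earliest check-in time
    for user, loc, t in checkins:
        key = (user, loc)
        if key not in first_visit or t < first_visit[key]:
            first_visit[key] = t
    visited = {}  # user -> {location: first visit time}
    for (user, loc), t in first_visit.items():
        visited.setdefault(user, {})[loc] = t
    weights = {}
    for i, locs_i in visited.items():
        q_i = len(locs_i)
        for j, locs_j in visited.items():
            if i == j:
                continue
            # locations i visited before j did
            s_ij = sum(1 for loc, t in locs_i.items()
                       if loc in locs_j and t < locs_j[loc])
            if s_ij:
                weights[(i, j)] = s_ij / q_i
    return weights

# u2 checks in at s2 first; u3 follows later and also visits s1.
log = [("u2", "s2", 1), ("u3", "s2", 2), ("u3", "s1", 3)]
w = user_user_weights(log)
```

Only the pair with a shared location and the required visit order receives an edge, so the resulting user-user graph matches the structure illustrated in Figure 4.11.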

4.3 Analysis of Academic Social Networks

Alongside social networks, this work focused on assessing the influence of nodes in an academic social network, which is a network where the nodes either refer to the authors of scientific papers, connected via co-authorship ties that form a co-authorship network, or to the scientific papers themselves, connected through citation ties, originating a citation network. We wanted to assess which were the most influential papers in the scientific community, i.e., the ones that were gathering more attention, either due to the importance of their author(s), or due to being about a trending topic or an important breakthrough. To do so, we gathered the already organized data from the DBLP digital library, via the Arnetminer Project1, which contains information about scientific papers from 1935 to 2011, including the abstract and the

1http://arnetminer.org/DBLP_Citation

Figure 4.11: Transformation of the original network graph (left) to our IP algorithm graph (right).

number of citations. From this data we built a citation network for a set of time-stamps ranging from 2007 to 2011, as depicted in Figure 4.12, in order to have a record of how the network evolved over time.

Figure 4.12: Structure of the citation graph built upon the DBLP data.

Although any other ranking algorithm could have been used, in the case of the DBLP citation network the most influential papers in the dataset were determined through the computation of the PageRank algorithm. The top-10 highest ranked papers were then selected and their full information was gathered, in order to cross-check the set of authors of each paper with the recipients of renowned computer science and engineering awards, such as the Gerard Salton Award or the Turing Award, identifying which of these authors were distinguished by the scientific community.

4.3.1 Predicting Future Influence Scores and Download Counts

Instead of computing the future PageRank scores of scientific papers based on their future citations, as did Sayyadi & Getoor (2009), we created a framework to predict the future PageRank scores of scientific papers in a citation network for a specific year, based on their previous PageRank scores, among other features. The same principle was also applied to the prediction of download counts for scientific articles downloaded from the ACM Digital Library website in the year 2011.

The framework depicted in Figure 4.13 predicts the future PageRank scores and future download counts in three distinct phases:

1. Feature Vector Creation The first phase prepares the input for the subsequent computations related to the prediction of importance scores. Given the dataset, either for paper citations or download counts, one generates the different features, namely the text, age, and PageRank score features, and stores them in a relational database, so that feature vectors can then be generated.

2. Prediction In a second phase, one creates training and test files from the generated feature vector files, in order to proceed with the computation of a machine learning technique intended for predicting the future PageRank scores and the future download counts.

3. Accuracy Assessment Finally, to assess the quality of the obtained results, one proceeds with the computation of various evaluation metrics.

Figure 4.13: Framework for predicting future PageRank scores and download counts.

Each aforementioned phase is a preparation for the following one. To predict the PageRank scores and the download counts, we relied on features that can represent the characteristics of the information in the dataset. The following types of features were considered:

1. Absolute Scores - Includes the PageRank score resulting from the computation of the algorithm over papers published up to a specific year, inclusive. Regarding the PageRank score of a paper, we defined 5 different cumulative time-stamps, from 2007 to 2011, so that we could have access to the respective PageRank scores in each of the k previous years.

2. Differential Scores - Includes the Rank Change Rate (Racer), representing the change rate of PageRank score between two consecutive years, capturing the evolution of PageRank scores.

The Rank Change Rate between two time-stamps ti and ti+1, for paper p, is given by the following equation:

racer(p, t_i) = \frac{rank(p, t_{i+1}) - rank(p, t_i)}{rank(p, t_{i+1})} \qquad (4.62)
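A quick numeric check of the Rank Change Rate in Equation 4.62, with assumed example scores:

```python
# Sketch of the Racer feature: relative change in PageRank score
# between two consecutive time-stamps. The example values are assumptions.

def racer(rank_prev, rank_next):
    """Change rate of the PageRank score between consecutive years."""
    return (rank_next - rank_prev) / rank_next

# A paper whose score rose from 0.02 to 0.05 gained 60% of its new score.
r = racer(0.02, 0.05)
```

Positive values indicate a paper whose influence is growing, which is exactly the evolution signal the feature is meant to capture.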

3. Profile Information - Includes the Average PageRank Score, which represents the average PageRank score of all publications that have an author in common with the paper's set of authors, and the Maximum PageRank Score, which represents the maximum PageRank score of all publications that have an author in common with the paper's set of authors.

4. Age - Includes the difference between the present year and the publication year of a paper, i.e., its age.

5. Text - Includes the term frequency score for the top 100 most frequent tokens in the abstracts and titles of publications, excluding the terms in a standard English stop-word list.

For each aforementioned type of feature, except age and text, its value for the previous k years, with k ranging from 1 to 3, was considered, e.g., when predicting the future PageRank score for the year 2010, one first predicted that score only with information from the PageRank score of the previous year (k = 1, i.e., 2009), then with information from the two previous years (k = 2, i.e., 2009 and 2008), and finally from the three previous years (k = 3, i.e., 2009, 2008, and 2007).

In order to enrich the way we made our predictions, we made a structured combination of the previously enumerated types of features, which fit into three different groups:

• 1 - In this group we used exclusively the PageRank scores of the paper as features.

• 1 + 2 - In this group we used both PageRank and Racer scores of the paper as features.

• 1 + 2 + 3 - In this group we used PageRank scores, Racer scores, Average Author scores and Maximum Author scores as features.

The remaining text and age features were separately added to the aforementioned combinations of features, enabling the creation of two distinct subsets of results. Thus, alongside the different ranges of k used, one could assess whether, for a particular type of feature or group of features, adding more information about previous years would improve or degrade the accuracy of our results. Also, for a straightforward computation of the Racer, the Average PageRank score, the Maximum PageRank score, and the feature vectors, the PageRank scores of each paper in each time-stamp, the information about the authors of the papers, and the information about download counts were stored in a relational database.

4.3.2 The Learning Approach

To predict future PageRank scores and future download counts, we used an ensemble machine learning technique included in the RT-Rank1 package, an open-source project that implements various machine learning algorithms based on regression trees.

The algorithm we used, called Initialized Gradient Boosting Regression Trees (IGBRT), is essentially a point-wise machine learning algorithm developed by the team from Washington University in St. Louis for the 2010 Yahoo Learning-To-Rank Challenge. The algorithm is shown in Algorithm 4, and it is based on Gradient Boosting Regression Trees (GBRT) (Mohan et al., 2011). GBRT is a machine learning technique based on tree averaging, which uses a set of trees to classify a new object, instead of the single best tree (Oliver & Hand, 1995). It sequentially adds small trees (d ≈ 4), each with high bias, and, in each iteration, the new tree to be added focuses strictly on the objects that are responsible for the current remaining regression error. IGBRT follows the guidelines of SVMlight2, proposed by Joachims (1999, 2002).

Algorithm 4 Initialized Gradient Boosted Regression Trees (Squared Loss)

Input: data set D = {(x_1, y_1), ..., (x_n, y_n)}
Parameters: α, M_B, d, K_RF, M_RF
F ← RandomForests(D, K_RF, M_RF)
Initialization: r_i = y_i − F(x_i), for i = 1 → n
for t = 1 → M_B do
  T_t ← Cart({(x_1, r_1), ..., (x_n, r_n)}, f, d)   {Build a CART of depth d, with all f features, and targets r_i}
  for i = 1 → n do
    r_i ← r_i − α T_t(x_i)   {Update the residual of each sample x_i}
  end for
end for
return T(·) = F(·) + α Σ_{t=1..M_B} T_t(·)   {Combine the regression trees T_1, ..., T_{M_B} with the random forest F}

With the intention of addressing GBRT's main weakness, i.e., the inherent trade-off between the step-size and the early stopping, Mohan et al. (2011) proposed an ensemble algorithm that starts off at a point very close to the global minimum and refines the already good predictions. Thus, instead of initializing the algorithm with an all-zero function, as in GBRT, the IGBRT algorithm is initialized with the predictions of Random Forests (Breiman, 2001), since the latter are known to be resistant to overfitting, insensitive to parameter settings, and to require no additional parameter tuning. IGBRT uses GBRT to further refine the results of Random Forests, which are regarded by the authors

1https://sites.google.com/site/rtranking/ 2http://svmlight.joachims.org/

as a good starting point for the algorithm.
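To make the two-stage procedure of Algorithm 4 concrete, the sketch below reproduces its structure with scikit-learn: a Random Forests initialization, followed by boosting with shallow regression trees fitted to the residuals. This is an illustrative simplification under our own naming and parameter defaults, not the RT-Rank implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

def igbrt_fit(X, y, alpha=0.1, n_boost=100, depth=4, n_rf_trees=100, seed=0):
    """IGBRT sketch: initialize with Random Forests, then boost on the residuals."""
    rf = RandomForestRegressor(n_estimators=n_rf_trees, random_state=seed).fit(X, y)
    residual = y - rf.predict(X)                 # r_i = y_i - F(x_i)
    trees = []
    for _ in range(n_boost):                     # for t = 1 -> M_B
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        residual = residual - alpha * tree.predict(X)   # update each residual r_i
        trees.append(tree)
    return rf, trees

def igbrt_predict(model, X, alpha=0.1):
    """T(.) = F(.) + alpha * sum over t of T_t(.)"""
    rf, trees = model
    return rf.predict(X) + alpha * sum(tree.predict(X) for tree in trees)

# Toy usage on synthetic data: boosting the residuals should not worsen the
# training error of the random forest alone.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + np.sin(5 * X[:, 1])
model = igbrt_fit(X, y, n_boost=50)
pred = igbrt_predict(model, X)
```

Because each boosting step subtracts a fraction α of a least-squares fit to the current residuals, the training error is non-increasing over iterations, which is the property IGBRT exploits on top of the already strong Random Forests starting point.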

4.4 Summary

In this chapter I detailed the two types of experiments that were conducted within my MSc thesis. I began by explaining the characteristics of location-based social networks and of academic social networks, emphasizing their peculiarities. Then, for each experiment, I described the datasets, the data collection technique, and the methodology for finding the influencers in the network, along with the algorithms that were used. For the particular case of academic social networks, a novel approach to predicting future PageRank scores and future download counts was also presented.


Chapter 5

Validation Experiments

This chapter presents the results of the undertaken experiments and the evaluation methodology used to assess the veracity of the obtained results. Beginning with a concise characterization of all the datasets that were used and their respective networks, the evaluation methodology is then presented, comprising all the metrics that were used to assess the quality and veracity of the results. Finally, the obtained results for each experiment are presented and further discussed. The results comprise the experiments for finding influencers in FourSquare and Twitter, and in the citation network built upon the DBLP dataset, as well as the experiments for predicting the future PageRank scores of scientific papers from 2010 and 2011 in the DBLP citation network, and the prediction of download counts for the scientific papers published in 2011, downloaded from the ACM Digital Library.

5.1 The Considered Datasets

This section presents the characterization of all the datasets that were used, along with that of their corresponding networks.

In order to understand the structural differences between a location-based social network and a social network that consists only of relationships between users, and how this structure affects influence estimation, we created two different graphs for both the FourSquare and Twitter datasets. First we considered a graph consisting of the original location-based network built upon the data that was crawled, which we called the User+Spot Graph. Afterwards, we disregarded all the user-location relationships and built a graph consisting only of user-user ties, which we called the User Graph.
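The derivation of the User Graph from the User+Spot Graph can be sketched as follows. The toy arcs and the `spot:` prefix are illustrative inventions, not the thesis data model.

```python
# Arcs of a toy User+Spot Graph, stored as (source, target) pairs; spot node
# ids carry a "spot:" prefix (an assumption made only for this sketch).
user_spot_arcs = [
    ("u1", "u2"), ("u2", "u1"), ("u3", "u1"),   # user-user ties
    ("u1", "spot:cafe"), ("u3", "spot:cafe"),   # user-spot visits
]

def is_spot(node):
    return node.startswith("spot:")

# The User Graph keeps only the arcs in which neither endpoint is a spot.
user_arcs = [(s, t) for (s, t) in user_spot_arcs if not is_spot(s) and not is_spot(t)]
print(user_arcs)  # [('u1', 'u2'), ('u2', 'u1'), ('u3', 'u1')]
```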

In the case of the DBLP dataset, the distinction between two graphs was not needed, because our focus was on creating a citation network upon which we could estimate the PageRank scores of its nodes and use them as features for the algorithm that predicts future influence scores and future download counts of papers. As for FourSquare and Twitter, this structural difference presents interesting results

when estimating user influence.

                                                      FourSquare   Twitter
Spots       Total                                     48,257       1,358
            HTM Resolution 10                         —            13
            HTM Resolution 20                         —            1,277
            HTM Resolution 25                         —            1,358
Users       Total                                     447,545      2,603,505
            Relations                                 970,587      3,218,997
            Visiting Spots                            16,960       1,017
Arcs        PageRank & HITS (User+Spot Graph)         2,539,986    3,757,555
            PageRank & HITS (User Graph)              1,017,887    3,576,157
            IP Algorithm                              1,017,887    —
Nodes       PageRank & HITS (User+Spot Graph)         451,664      2,604,863
            PageRank & HITS (User Graph)              403,407      2,603,505
            IP Algorithm                              447,545      —
InDegree    Minimum (User+Spot Graph)                 0            1
            Maximum (User+Spot Graph)                 3,166        38,542
            Average (User+Spot Graph)                 2.8626       5.6162
            Minimum (User Graph)                      0            1
            Maximum (User Graph)                      3,166        38,452
            Average (User Graph)                      2.5478       5.6256
OutDegree   Minimum (User+Spot Graph)                 0            1
            Maximum (User+Spot Graph)                 1,000        460,466
            Average (User+Spot Graph)                 74.8821      1.5615
            Minimum (User Graph)                      0            1
            Maximum (User Graph)                      1,000        460,466
            Average (User Graph)                      60.5829      1.5618
Average Degree          Total (User+Spot Graph)       5.4640       3.8868
                        Users (User+Spot Graph)       5.6714       2.8878
                        Spots (User+Spot Graph)       5.7118       1.0376
                        Total (User Graph)            5.0488       2.8872
Average Path Length     User+Spot Graph               4.7369       3.9776
                        User Graph                    4.7764       3.9823
Clustering Coefficient  User+Spot Graph               0.2987       0.1156
                        User Graph                    0.3718       0.1152

Table 5.1: Characterization of the FourSquare and Twitter networks.

Regarding the characteristics of both graphs in the FourSquare and Twitter datasets, depicted in Table 5.1, one can acknowledge that while the first dataset is more complete in terms of user-location ties and quantitative spot information, the latter is more complete in terms of user-user ties and user friendship information. This behaviour occurs because FourSquare is a pure location-based network focused on sharing the locations users have visited, while Twitter is a microblogging and social network platform focused on the exchange of messages between users, thus giving priority to the relationships between users and their friends and followers. In what regards the HTM resolution, we used a resolution of 26.

When considering the average path length and the clustering coefficient, one can assess that while the nodes in the FourSquare network are closer to each other, the neighbours of nodes in Twitter are closer to one another than in FourSquare. The latter phenomenon has to do with the fact that we could collect a greater extent of data for friends of users in the Twitter dataset, resulting in the scenario

where friends of different users can, themselves, be friends and/or have friends in common. Also, one can observe that the User Graph naturally has a greater average path length and a greater clustering coefficient than the User+Spot Graph, because the User Graph has fewer nodes and, thus, shortens the distance between users and neighbourhoods of users, previously parted by the spots between them. The academic citation network built upon DBLP data comprises scientific papers from 1935 to 2011 and, from Table 5.2, one can also have an idea of the dimension of the dataset for each of the considered time-stamps, as well as how complete the information about the scientific papers is.

Regarding the degree distribution in the FourSquare and Twitter networks, in both the User+Spot Graph and the User Graph, one can acknowledge from Figure 5.14 that the degree distribution for these datasets follows a power-law distribution, which is a characteristic of large-scale networks, i.e., networks in which the majority of the nodes have very few connections, while very few nodes have a high number of connections. Nevertheless, from the values of average path length and clustering coefficient, one can say that both the FourSquare and Twitter networks are not representative of large-scale networks, because in large-scale networks, besides the power-law distribution for the degree, the average path length must be much smaller than the clustering coefficient, revealing that the nodes are very close to each other and their neighbourhoods are highly clustered.
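The two structural statistics used in this comparison can be computed as in the sketch below, here with the networkx library on a toy undirected graph (the graph itself is made up for illustration; the thesis computations used much larger networks).

```python
import networkx as nx

# A toy graph: a triangle a-b-c with a pendant node d attached to c.
g = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

avg_path = nx.average_shortest_path_length(g)  # mean distance over all node pairs
clustering = nx.average_clustering(g)          # how interlinked each node's neighbourhood is

print(round(avg_path, 3), round(clustering, 3))  # 1.333 0.583
```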

          Publications   Citations    Authors   Papers with   Papers with   Average Terms
                                                Downloads     Abstract      Per Paper
Overall   1,572,277      2,084,019    601,339   17,973        529,498       104
2007      135,277        1,150,195    330,001   15,516        343,837       95
2008      146,714        1,611,761    385,783   17,188        419,747       98
2009      155,299        1,958,352    448,951   17,973        504,900       101
2010      129,173        2,082,864    469,719   17,973        529,201       103
2011      8,418          2,083,947    469,917   17,973        529,498       104

Table 5.2: Characterization of the DBLP dataset.

On the other hand, one can acknowledge from the network characterization in Table 5.3 that the academic social network that was built naturally grows at each time-stamp, although this growth is not as significant in the last two time-stamps as it is in the first two.

Focusing on the average path length and the clustering coefficient, one can conclude that as we include more papers in the network, i.e., at each time-stamp, papers are closer to one another through the existence of more citation relationships between them, even though they tend not to be as clustered together over time.

From the plots in Figure 5.15, one can acknowledge that the number of papers increases through the years. However, these new papers tend to have few citations, and so the tail of the plots gets thicker throughout the years, i.e., new papers with few citations are frequently added to the dataset, while the number of highly cited papers remains almost unaltered.

[Figure 5.14 comprises four log-log plots of degree (y-axis) versus node id (x-axis): the degree distributions in FourSquare (User+Spot Graph and User Graph) and in Twitter (User+Spot Graph and User Graph).]

Figure 5.14: Degree distribution for nodes in the User+Spot Graph and the User Graph, from the FourSquare and Twitter datasets.

5.2 Evaluation Methodology

When assessing the quality and veracity of the results for the top-10 highest ranked users and spots in the FourSquare and Twitter datasets, we conducted an empirical analysis and relied on profile information, due to the fact that this research area is still evolving and there are no strict parameters or ground-truth lists to truly assess the influence of a node in these networks. On the other hand, when assessing the veracity of the DBLP top-10 highest ranked papers, we empirically analyzed our results against a list of recipients of renowned scientific awards, like the Gerard Salton Award and the Turing Award, and, if the authors were not part of that list, we also checked their academic publication profiles1 in order to assess whether they were renowned scientists.

In the case of the experiment of future PageRank and future download count prediction, we used a set of error metrics. One of these metrics is Kendall’s Tau, which corresponds to a value ranging between

1http://academic.research.microsoft.com/

        In-Degree               Out-Degree            Degree                 Average       Clustering
        Min   Max     Avg       Min   Max    Avg      Min   Max     Avg      Path Length   Coefficient
2007    0     1,508   2.9153    0     227    2.9153   0     1,508   5.8329   6.1800        0.1323
2008    0     1,875   3.5357    0     266    3.5357   0     1,875   7.0790   6.1047        0.1319
2009    0     2,207   3.6993    0     269    3.6993   0     2,207   7.4012   6.0833        0.1314
2010    0     2,306   3.7670    0     269    3.7670   0     2,306   7.5430   6.0665        0.1312
2011    0     2,311   3.7673    0     269    3.7673   0     2,311   7.5367   6.0676        0.1310

Table 5.3: Characterization of the DBLP network.

[−1, 1] and is defined as follows:

τ = 1 − 2c_i / ((1/2) n_i (n_i − 1))    (5.63)

In the formula, c_i is the number of concordant pairs between the produced ranked list and the ground-truth list, and n_i is the length of the two lists (Li, 2011). The aforementioned LAW-Webgraph software package includes an implementation of this metric.

We can also assess the level of correlation between two ranked lists using Spearman's correlation coefficient (i.e., Spearman's ρ), according to the formula below:

ρ = 1 − (6 Σ_{i=1}^{n} (x_i − y_i)²) / (n³ − n)    (5.64)

In the formula, x_1, ..., x_n and y_1, ..., y_n are the two rankings of the n objects (Best & Roberts, 1975). This metric was computed via its implementation in the R-Project open-source statistical software1, a statistical language and software package that includes various mathematical and statistical techniques, being also suitable for large amounts of data. Both Kendall's Tau and Spearman's correlation measure the strength of the association between two ranked lists (Cha et al., 2010). The correlation ranges between [−1, 1] and, hence, if it is close to −1, one can determine that the variables are negatively correlated, whereas if it is close to +1 they are positively correlated.
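Both coefficients are simple to compute directly. The sketch below implements Spearman's ρ exactly as in Equation (5.64), and Kendall's τ in its standard no-ties form (concordant minus discordant pairs over the total number of pairs); the example rankings are made up.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau without ties: (concordant - discordant) / (n(n-1)/2)."""
    pairs = list(combinations(range(len(x)), 2))
    s = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1 for i, j in pairs)
    return s / len(pairs)

def spearman_rho(x, y):
    """Spearman's rho for two rankings of n objects: 1 - 6*sum(d_i^2)/(n^3 - n)."""
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(x, y)) / (n ** 3 - n)

truth, predicted = [1, 2, 3, 4, 5], [1, 3, 2, 4, 5]   # one adjacent pair swapped
print(kendall_tau(truth, predicted), spearman_rho(truth, predicted))  # 0.8 0.9
```

A single swapped pair leaves both coefficients close to +1, i.e., a strongly positive association between the two rankings.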

In order to measure the accuracy of the prediction models, we used the normalized root-mean-squared error (NRMSE) metric between our predictions and the true values, which is given by the formula:

NRMSE = sqrt( (1/N) Σ_{i=1}^{N} (x_{1,i} − x_{2,i})² ) / (x_max − x_min)    (5.65)

The average absolute error, i.e., the average of the differences between the inferred (predicted) values and the actual values, was also used, and was especially relevant for assessing the quality of the predictions of download counts.
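The two error measures can be sketched as follows; the example values are made up.

```python
def nrmse(actual, predicted):
    """Normalized RMSE, as in Equation (5.65): the RMSE divided by the range
    of the actual values."""
    n = len(actual)
    rmse = (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5
    return rmse / (max(actual) - min(actual))

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between predicted and actual values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(round(nrmse(actual, predicted), 4), mean_absolute_error(actual, predicted))  # 0.0707 2.0
```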

1http://www.r-project.org/

[Figure 5.15 comprises four log-log plots of degree (y-axis) versus node id (x-axis): the degree distributions in DBLP for 2008, 2009, 2010 and 2011.]

Figure 5.15: Degree distribution for the DBLP dataset from 2008 to 2011.

5.3 The Obtained Results

This section exhibits the results obtained from the various conducted experiments, alongside their discussion. First, the results from the experiments for finding influencers in FourSquare and Twitter, as well as in the DBLP citation network, are presented and discussed, where we assess the quality of these results and whether the top-10 lists of individuals and spots produced by the different algorithms really correspond to the top-10 influencers and influential spots in the network.

The results for the experiment of predicting future PageRank scores and download counts are then presented, alongside their discussion, where we compare the output of the different evaluation metrics computed for the different groups of features, in order to understand whether the task of predicting future PageRank scores and future download counts could be successfully accomplished with the framework that was developed.

5.3.1 Finding Influencers

In the following sections, the results of the computation of the PageRank, HITS and IP algorithms for the FourSquare and Twitter datasets are presented, as well as the results of the computation of the PageRank algorithm for the DBLP dataset. While the results for the first two datasets comprise the top-10 highest ranked users and the top-10 highest ranked spots in the network, the results for DBLP highlight solely the most influential papers in the DBLP digital library dataset.

We begin by presenting and discussing the results from the experiments with the FourSquare and Twitter datasets, in this order; then we present and discuss the influence estimation for the DBLP dataset, closing this section with the results from the experiment on predicting future PageRank scores and download counts.

In order to identify the most influential users and spots in the FourSquare and Twitter datasets, average anonymous users and spots (e.g., streets) are identified, respectively, by Person-XXXX and Spot-YY:ZZ, where XXXX corresponds to the real user id, and YY and ZZ correspond to the latitude and longitude associated with that spot id in the network, while publicly well-known companies, locations/venues and people are identified by their real names, e.g., Ellen DeGeneres for users and Dunkin' Donuts for spots.

5.3.1.1 Location-based social networks: FourSquare & Twitter

From the user influence scores for the PageRank and HITS algorithms depicted in Table 5.4, one can acknowledge that the addition of spots to the network reveals well-known influentials, such as worldwide celebrities, TV channels or magazines.

Each cell lists Name (Friends, Likes):

PageRank                    | HITS - Authority           | HITS - Hub
TimeOut NY (—, 122,172)     | ZAGAT (—, 328,189)         | ZAGAT (—, 328,189)
Lucky Mag. (—, 164,323)     | TimeOut NY (—, 122,172)    | MTV (—, 731,067)
ZAGAT (—, 328,189)          | MTV (—, 731,067)           | Bravo TV (—, 375,363)
NYPL (—, 61,132)            | Bravo TV (—, 375,363)      | History Chnl (—, 541,847)
MTV (—, 731,067)            | History Chnl (—, 541,847)  | The NY Times (—, 367,008)
Person-12935563 (956, 20)   | Starbucks (—, 929,915)     | Starbucks (—, 929,915)
Bravo TV (—, 375,363)       | The NY Times (—, 367,008)  | VH1 (—, 380,987)
Person-1478079 (981, 96)    | Lucky Mag. (—, 164,323)    | People Mag. (—, 372,008)
NYC Parks (—, 17,429)       | VH1 (—, 380,987)           | TimeOut NY (—, 122,172)
History Chnl (—, 541,847)   | NYPL (—, 61,132)           | The WSJ (—, 227,894)

Table 5.4: User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the FourSquare dataset.

Meanwhile, in the User Graph, as depicted in Table 5.5, average users of social platforms are distinguished both by the PageRank algorithm and by the HITS algorithm, the latter when ordered by hub scores. In this case, average users are highlighted through their great amount of mayorships, checkins, tips about locations, and friends. Mostly through their outlinks, they become network users that other users want to follow and listen to.

Each cell lists Name (Friends, Likes):

PageRank                      | HITS - Authority              | HITS - Hub
Person-11890308 (794, 84)     | ZAGAT (—, 328,189)            | Person-2630685 (110, 817)
Person-449480 (1,000, 374)    | MTV (—, 731,067)              | Person-1127366 (39, 749)
Person-1544684 (987, 144)     | Bravo TV (—, 375,363)         | Person-4148169 (77, 899)
Person-619656 (823, 8)        | History Chnl (—, 541,847)     | Person-634270 (216, 755)
Person-4071912 (1,004, 860)   | Starbucks (—, 929,915)        | Person-42695 (128, 775)
NYCHA (807, 59)               | The NY Times (—, 367,000)     | Person-1011520 (39, 723)
Person-6935835 (990, 275)     | VH1 (—, 380,987)              | Person-3231666 (14, 713)
Person-6004767 (958, 319)     | Ellen DeGeneres (—, 457,155)  | Person-7991820 (3, 767)
Person-10934560 (1,001, 64)   | TimeOut NY (—, 122,172)       | Person-3290360 (62, 632)
Person-10554269 (985, 4)      | People Mag. (—, 372,008)      | Person-6483868 (95, 765)

Table 5.5: User influence scores for the PageRank and HITS algorithms, for the User Graph, built from the FourSquare dataset.

When the location-based network was reshaped to connect only the users that have visited at least one location in common, for the IP algorithm, the average FourSquare user is distinguished, yet again, due to a combination of factors that include a great amount of mayorships, checkins, tips about locations, and friends, as one can acknowledge from Table 5.6.

In brief, the fact that worldwide TV channels, magazines, and celebrities are highlighted in a network that contains both users and spots reveals a close connection between these well-known influentials and the spots, through a continuous activity that is intended to gather and retain their followers. When these ties are removed, the connections between real users prevail.

Name             Friends   Likes
Person-9797197   52        10
Person-9726342   5         —
Person-9615360   25        9
Person-9578554   34        —
Person-9553862   4         —
Person-9450025   47        7
Person-9264407   43        —
Person-8956766   28        —
Person-8916830   47        4
Person-884020    95        32

Table 5.6: User influence scores for the IP algorithm, built from the FourSquare dataset.

As for the most influential spots in the FourSquare dataset, the top-10 highest ranked spots resulting from the computation of both the PageRank and HITS algorithms, whether sorted by authority or by hub score, were the same. Focusing on the type of spots that were highlighted, they mainly include bars, boardwalks and other spots near the New York coastline, due to the fact that the data collection was done during the months of August and early September of 2012.

Name                             Checkins
Tattoo Shot Lounge               227
Dunkin’ Donuts                   970
Gargiulo’s Restaurant            697
The Freak Bar                    540
Ruby’s Bar & Grill               2,025
Coney Island Beach & Boardwalk   36,206
Cha Cha’s                        1,142
Denny’s Delight                  84
Coney Island Sound               280
Coney Island Polar Bear Club     85

Table 5.7: Spot influence scores for PageRank and HITS algorithms (that present the exact same top-10), for the User+Spot Graph, built from the FourSquare dataset.

When finding influencers in the Twitter dataset, one must acknowledge that users tweet wherever they are, be it at home, while waiting for a doctor's appointment, etc. Therefore, many of the locations that we could identify are not necessarily venues, i.e., the geographic coordinates associated with a tweet may point to a street or avenue, and not to a theater, museum or restaurant, as happened in the FourSquare experiment. Nevertheless, this is only due to the inner characteristics of the Twitter social network, which is content- and user-centered, and not location-centered like FourSquare. Because social networks have a dynamic behaviour, i.e., they can change over time with the addition or loss of users and relationship ties, the third highest ranked user for HITS - Authority in Tables 5.8 and 5.9 had a profile on Twitter and was active during our crawl, between July and August of 2012, but no longer has a Twitter profile, thus being marked with a * after the user id.

In the case of the Twitter dataset, the results from the computation of the IP algorithm are not presented, because the obtained results were not coherent and not nearly comparable with the ones that were obtained for FourSquare.

From Table 5.8, we can observe that the HITS algorithm, with influence sorted by authority or hub score, reveals Twitter users that are well known to the public and who exert significant influence due to their roles in society, e.g., by being an entrepreneur, a journalist or an actor. Also, due to their professional activity and media exposure, one can say that they can shape conversations; they are users other network users want to listen to. Conversely, from the top-10 generated by the PageRank algorithm, one can acknowledge that friendship ties among anonymous (to the public) users are highlighted.

Regarding the User Graph, we can see that the output from the HITS and PageRank algorithms, depicted in Table 5.9, is exactly the same as for the User+Spot Graph. This reinforces the fact that, in this particular dataset, there is a greater number of relationships among users than between users and locations, so when these location ties are disregarded, the strong ties between users naturally prevail. Also, one can see from Tables 5.8 and 5.9 that, yet again, the total number of followers and friends is not necessarily correlated with influence on Twitter.

Each cell lists Name (Followers, Following):

PageRank                           | HITS - Authority                   | HITS - Hub
Person-67779865 (45,702, 41,870)   | J. Wortham (463,772, 3,424)        | J. Lupton (301,965, 276,780)
J. K. Pulver (469,092, 38,542)     | J. K. Pulver (469,092, 38,542)     | NOH8 Campaign (426,079, 251,158)
JobsDirectUSA.com (17,075, 18,782) | Person-325410549* (—, —)           | Person-25915690 (595,404, 192,241)
Person-479562736 (16,703, 16,241)  | B. Thurston (124,722, 5,707)       | M. Allen (144,540, 55,678)
America Hires (11,824, 13,006)     | StumbleUpon (72,133, 10,370)       | Person-203455506 (188,527, 41,190)
Person-52306188 (9,989, 9,878)     | DL Hughley (73,835, 886)           | NY Daily News (85,821, 10,681)
Person-35844123 (10,030, 9,761)    | J. Rampton (47,593, 578)           | Person-18704291 (19,212, 21,098)
Person-24883913 (11,191, 9,583)    | Person-51560438 (103,721, 14,766)  | J. Calacanis (151,155, 112,248)
Person-213105865 (8,531, 9,965)    | Person-67779865 (45,699, 41,868)   | 92YTribeca (13,015, 10,560)
Person-30735143 (7,837, 8,513)     | Person-1536651 (34,216, 456)       | C.C. Chapman (34,512, 28,505)

Table 5.8: User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.

Each cell lists Name (Followers, Following):

PageRank                           | HITS - Authority                   | HITS - Hub
Person-67779865 (45,702, 41,870)   | J. Wortham (463,772, 3,424)        | J. Lupton (301,965, 276,780)
J. K. Pulver (469,092, 38,542)     | J. K. Pulver (469,092, 38,542)     | NOH8 Campaign (426,079, 251,158)
JobsDirectUSA.com (17,075, 18,782) | Person-325410549* (—, —)           | Person-25915690 (595,404, 192,241)
Person-479562736 (16,703, 16,241)  | B. Thurston (124,722, 5,707)       | M. Allen (144,540, 55,678)
America Hires (11,824, 13,006)     | StumbleUpon (72,133, 10,370)       | Person-203455506 (188,527, 41,190)
Person-52306188 (9,989, 9,878)     | DL Hughley (73,835, 886)           | NY Daily News (85,821, 10,681)
Person-35844123 (10,030, 9,761)    | J. Rampton (47,593, 578)           | Person-18704291 (19,212, 21,098)
Person-24883913 (11,191, 9,583)    | Person-51560438 (103,721, 14,766)  | J. Calacanis (151,155, 112,248)
Person-213105865 (8,531, 9,965)    | Person-67779865 (45,699, 41,868)   | 92YTribeca (13,015, 10,560)
Person-30735143 (7,837, 8,513)     | Person-1536651 (34,216, 456)       | C.C. Chapman (34,512, 28,505)

Table 5.9: User influence scores for the PageRank and HITS algorithms, for the User Graph, built from the Twitter dataset.

As one can observe from Table 5.10, the great majority of the top-10 highest ranked spots are not venues per se; the geographical locations associated with these tweets correspond to streets or avenues, due to the use of Twitter in various mobile applications. Nevertheless, some well-known spots, like Times Square and the JFK Airport, are naturally highlighted. Also, one can acknowledge that, in this particular case, the spots with the greater number of checkins turn out to be the most influential spots in the dataset.

Each cell lists Name (Checkins):

PageRank                          | HITS - Authority                  | HITS - Hub
Broadway - Times Square (4)       | Pace University (8)               | Spot40.71498749:-73.95485289 (2)
JFK Airport (2)                   | Spot40.679254:-73.8632521 (1)     | Spot40.7827699:-73.95211752 (1)
JFK Airport (Subway Station) (1)  | Spot40.67982674:-73.86344992 (1)  | Spot40.76619859:-73.91322359 (1)
Spot40.80567362:-73.91862858 (1)  | Spot40.6792906:-73.8622276 (1)    | Skin Magic Ltd (1)
Spot40.66931554:-74.20359207 (1)  | Park Lane Hotel (1)               | Spot40.76614592:-73.91323331 (1)
Spot40.73262798:-73.98359375 (1)  | Astoria Bowl (1)                  | Spot40.76616717:-73.91319381 (1)
Rosa Mexicano (Restaurant) (1)    | Spot40.7166368:-73.9543937 (1)    | Broadway - Times Square (1)
The Abyssinian Baptist Church (1) | Columbus Circle (1)               | Spot40.75612638:-73.90477465 (1)
St Luke’s School (1)              | Spot40.86745661:-74.12978901 (1)  | Spot40.76113205:-73.97952078 (1)
Spot40.742727:-73.994372 (1)      | Spot40.89064994:-73.89948689 (1)  | JFK Airport (1)

Table 5.10: Spot influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.

5.3.1.2 Academic social network: DBLP

Table 5.11 presents the top-10 highest ranked papers from the citation network built upon DBLP data, where recipients of scientific awards are highlighted in bold. From this table, one can acknowledge that the top-10 remained unaltered for scientific papers published until 2010 and until 2011, and that the

majority of these publications are authored by recipients of one or more of the renowned awards from the list in Appendix A.

Focusing on the titles of these scientific papers, one can also verify that this top-10 comprises publications that can be considered breakthroughs in a specific research area, e.g., Gerard Salton's leading work in information retrieval, or inevitable textbook references, e.g., Cormen et al.'s Introduction to Algorithms. Nevertheless, even if the authors are not recipients of renowned scientific awards, the fact that they collaborate with many other authors leads them to be cited in a greater number of publications, reinforcing their PageRank score.

Paper                                                       | Authors                                                     | PageRank 2010 | PageRank 2011
A Unified Approach to Functional Dependencies and Relations | Philip A. Bernstein, J. Richard Swenson, Dennis Tsichritzis | 0.000903919   | 0.000903646
On the Semantics of the Relational Data Model               | Hans Albrecht Schmid, J. Richard Swenson                    | 0.000891394   | 0.000891123
Database Abstractions: Aggregation and Generalization       | John Miles Smith, Diane C. P. Smith                         | 0.000860181   | 0.00085993
Smalltalk-80: The Language and Its Implementation           | Adele Goldberg, David Robson                                | 0.000763314   | 0.000763174
A Characterization of Ten Hidden-Surface Algorithms         | Ivan E. Sutherland, Robert F. Sproull, Robert A. Schumacker | 0.000716136   | 0.000716507
An algorithm for hidden line elimination                    | R. Galimberti                                               | 0.000706674   | 0.000707118
Introduction to Modern Information Retrieval                | Gerard Salton, Michael McGill                               | 0.000699671   | 0.000699584
C4.5: Programs for Machine Learning                         | J. Ross Quinlan                                             | 0.000635416   | 0.000636705
Introduction to Algorithms                                  | Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest    | 0.000592198   | 0.000592414
Compilers: Principles, Techniques, and Tools                | Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman                | 0.000528325   | 0.000528235

Table 5.11: PageRank scores for top-10 highest ranked papers of the DBLP dataset.

5.3.2 Predicting Future PageRank Scores and Download Counts

In this section, the experiment regarding the prediction of future influence scores and future download counts is detailed and thoroughly discussed. For a better understanding, we call the model for predicting future PageRank scores and download counts that includes the age of each article the age model, and the model that includes the age of the article plus the term frequencies of the 100 most frequent words in the abstract and title of each paper the text model (see Table 5.12).
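As a sketch of how the text model's textual features could be extracted (the tokenization and the helper names below are assumptions for illustration, not the thesis's exact pipeline), the 100 most frequent words across all titles and abstracts define a shared vocabulary, and each paper is then represented by the term frequency of those words:

```python
import re
from collections import Counter

def top_k_vocab(documents, k=100):
    """Vocabulary: the k most frequent words over all titles + abstracts."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [word for word, _ in counts.most_common(k)]

def tf_features(document, vocab):
    """Term-frequency vector of a single paper over the shared vocabulary."""
    tokens = Counter(re.findall(r"[a-z]+", document.lower()))
    return [tokens[word] for word in vocab]

docs = ["PageRank for citation networks", "Predicting PageRank scores"]
vocab = top_k_vocab(docs, k=100)
```

These term frequencies are then appended to the age feature to form the text model's input.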

From Table 5.12, and considering the experiment of predicting the PageRank scores for the year of 2010, both models provided very similar results, both improving as we added more information: comparing the three groups of features (PageRank scores alone; PageRank scores with Racer scores; and PageRank scores with Racer scores plus the average and maximum PageRank scores of the author), and also comparing within the same group, the quality of the results improves consistently. Only for the set of features that combines the PageRank score of one previous year with its respective Racer score and the author's average and maximum PageRank scores is the age model outperformed by the text model. Comparing the error rates for the same year, one can see that, for both models, the error rate increases as we add more information, causing the predictions to deviate. Nevertheless, for the first two groups of features, the text model has a lower error rate than the age model, while the opposite happens for the third group of features.

Having computed the absolute error for all the groups of features in both models, the results show that, on average, the text model always has a lower absolute error than the age model.

| Model | Features | ρ (2010) | τ (2010) | NRMSE (2010) | ρ (2011) | τ (2011) | NRMSE (2011) |
|---|---|---|---|---|---|---|---|
| Age | Rank k = 1 | 0.9725065 | 0.9163994 | 0.0003224 | 0.9929880 | 0.9837121 | 0.0001057 |
| Age | Rank k = 2 | 0.9836493 | 0.9381865 | 0.0006161 | 0.9999050 | 0.9994758 | 0.0000995 |
| Age | Rank k = 3 | 0.9890716 | 0.9506366 | 0.0006391 | 0.9999002 | 0.9993787 | 0.0004768 |
| Age | Racer + Rank k = 1 | 0.9724540 | 0.9173649 | 0.0003469 | 0.9998887 | 0.9994037 | 0.0002322 |
| Age | Racer + Rank k = 2 | 0.9837098 | 0.9387564 | 0.0006520 | 0.9999004 | 0.9992955 | 0.0001634 |
| Age | Racer + Rank k = 3 | 0.9888725 | 0.9493687 | 0.0006605 | 0.9952435 | 0.9866206 | 0.0005492 |
| Age | A + R + Rank k = 1 | 0.9675213 | 0.9098510 | 0.0005354 | 0.9998529 | 0.9994497 | 0.0002530 |
| Age | A + R + Rank k = 2 | 0.9840530 | 0.9355465 | 0.0008336 | 0.9998353 | 0.9993422 | 0.0002962 |
| Age | A + R + Rank k = 3 | 0.9892456 | 0.9468673 | 0.0006986 | 0.9938021 | 0.9828511 | 0.0005317 |
| Text | Rank k = 1 | 0.9708719 | 0.9101722 | 0.0003608 | 0.9992124 | 0.9979693 | 0.0002479 |
| Text | Rank k = 2 | 0.9831039 | 0.9310399 | 0.0006268 | 0.9997962 | 0.9992362 | 0.0004543 |
| Text | Rank k = 3 | 0.9886945 | 0.9451537 | 0.0006276 | 0.9995012 | 0.9983375 | 0.0005800 |
| Text | Racer + Rank k = 1 | 0.9711170 | 0.9098901 | 0.0005515 | 0.9994290 | 0.9984499 | 0.0001590 |
| Text | Racer + Rank k = 2 | 0.9832037 | 0.9314405 | 0.0006747 | 0.9997300 | 0.9990720 | 0.0001919 |
| Text | Racer + Rank k = 3 | 0.9887959 | 0.9470102 | 0.0006667 | 0.9994104 | 0.9980729 | 0.0006416 |
| Text | A + R + Rank k = 1 | 0.9705230 | 0.9984499 | 0.0001590 | 0.9997019 | 0.9990583 | 0.0002480 |
| Text | A + R + Rank k = 2 | 0.9837012 | 0.9990720 | 0.0001919 | 0.9998617 | 0.9993443 | 0.0002800 |
| Text | A + R + Rank k = 3 | 0.9888386 | 0.9980729 | 0.0006416 | 0.9998793 | 0.9993885 | 0.0006987 |

Table 5.12: Results for the prediction of future PageRank scores for papers in the DBLP dataset.
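The ρ, τ, and NRMSE columns measure, respectively, Spearman's rank correlation, Kendall's rank correlation, and the normalized root mean squared error between predicted and real values. A minimal stdlib-only sketch of these three measures (assuming no tied values, which the library implementations handle more carefully):

```python
import math
from itertools import combinations

def ranks(xs):
    """Rank positions (1-based) of each value; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman_rho(x, y):
    """Spearman's rho via the classic rank-difference formula."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    conc = disc = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def nrmse(true, pred):
    """Root mean squared error, normalized by the range of true values."""
    n = len(true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / n)
    return rmse / (max(true) - min(true))
```

Values of ρ and τ close to 1 mean the predicted ranking closely matches the real one, while a small NRMSE means the predicted scores themselves are close to the real scores.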

For the year of 2011, as we add more information to the models, the text model outperforms the age model, as shown in the last two sets of features from the third group. Also, in the scenario in which the models only have information about the immediately previous PageRank score, the age model is again outperformed by the text model. Nevertheless, when considering the error rates of both models for this year, the text model has an overall higher error rate than the age model, showing that, even though the quality of the results predicted by the age model is lower, those results are more accurate.

As occurred in the computation of the absolute error for the year 2010, the results for the year of 2011 show that, in all groups of features in both models, the text model has, on average, a lower absolute error than the age model.

Regarding the prediction of download counts, depicted in Table 5.13, one can see that using a text model increases the quality of our results. In the age model, we can verify that adding information about the Racer to the previous PageRank scores affects the results negatively, while combining previous PageRank scores with Racer and the author's average and maximum PageRank scores provides better results with a lower error rate. From this we can conclude that the age model provides a more accurate prediction as it becomes more complete. The opposite happens in all groups of the text model, i.e., as we add more information to the model within the same group, the quality of the results decreases, even though they are far better than the corresponding results in the age model.

We can also verify that the age model, for the groups of features that only include previous PageRank scores, and for the ones that combine previous PageRank scores with Racer and the author's average and maximum PageRank scores, has a lower error rate than the corresponding groups in the text model. Thus, even though the text model has better overall results, its error rate is greater than that of the age model for download count prediction.

As for the absolute error, the results show that, generally, the text model has a lower absolute error than the age model in all groups except the third.

| Model | Features | ρ | τ | NRMSE |
|---|---|---|---|---|
| Age | Rank k = 1 | 0.3864814 | 0.2742998 | 0.0080585 |
| Age | Rank k = 2 | 0.4221492 | 0.3001470 | 0.0029377 |
| Age | Rank k = 3 | 0.4323201 | 0.3080974 | 0.0028074 |
| Age | Racer + Rank k = 1 | 0.4396605 | 0.3076576 | 0.0076713 |
| Age | Racer + Rank k = 2 | 0.3370149 | 0.4747241 | 0.0078403 |
| Age | Racer + Rank k = 3 | 0.3313412 | 0.4612442 | 0.0088301 |
| Age | A + R + Rank k = 1 | 0.3377553 | 0.2558403 | 0.0147155 |
| Age | A + R + Rank k = 2 | 0.5335481 | 0.3894899 | 0.0088093 |
| Age | A + R + Rank k = 3 | 0.5406937 | 0.3962472 | 0.0078576 |
| Text | Rank k = 1 | 0.5250188 | 0.3837016 | 0.0086955 |
| Text | Rank k = 2 | 0.5261168 | 0.3849615 | 0.0087775 |
| Text | Rank k = 3 | 0.5060003 | 0.3674801 | 0.0091976 |
| Text | Racer + Rank k = 1 | 0.5325432 | 0.3887987 | 0.0085328 |
| Text | Racer + Rank k = 2 | 0.5224018 | 0.3822982 | 0.0089440 |
| Text | Racer + Rank k = 3 | 0.5087407 | 0.3703400 | 0.0091979 |
| Text | A + R + Rank k = 1 | 0.5709764 | 0.4234845 | 0.0076071 |
| Text | A + R + Rank k = 2 | 0.5651282 | 0.4180070 | 0.0079000 |
| Text | A + R + Rank k = 3 | 0.5608946 | 0.4148554 | 0.0088935 |

Table 5.13: Results for the prediction of download numbers for papers in the DBLP dataset.

In brief, from the results in Tables 5.12 and 5.13, we can see that predicting the number of downloads is a harder task than predicting future PageRank scores. We can also see that, when predicting future PageRank scores, the more information is added to the model, the more the results deviate; the opposite happens when predicting the number of downloads.

Comparing the years of 2010 and 2011, we can see that predicting the PageRank scores of a more recent year is easier than predicting the PageRank scores of a progressively more distant year in the past.

5.4 Summary

In this chapter I presented and discussed the results obtained from the experiments of finding influencers in FourSquare and Twitter, as well as in the DBLP citation network, and from the experiments for predicting future PageRank scores and future download counts for scientific papers downloaded from the ACM Digital Library.

Regarding location-based social networks, one can conclude that, most of the time, the most influential users in a network are not the ones who have the most followers. From the results one can see that, in the User Graph, relationships between users unknown to the public prevail, while TV channels, celebrities and worldwide magazines are highlighted and thus among the most influential users in the User+Spot Graph.

As for the experiment with the DBLP citation network, results have shown that the proposed frame- work, based on an ensemble regression model, offers highly accurate predictions, providing an effective mechanism to support the future ranking of papers in academic digital libraries.

Chapter 6

Conclusions

In my MSc thesis I proposed to explore the task of finding influential users in a social network, with the aid of network analysis techniques and algorithms. As I intended to perform experiments with different types of social networks, I began by collecting real and up-to-date data from both FourSquare and Twitter, in order to build two distinct location-based social networks, and gathered a dataset from the DBLP digital library, already structured in the context of the Arnetminer project, so that an academic citation network could be built.

Influence was then estimated through the computation of state-of-the-art ranking algorithms, such as PageRank, HITS and IP. In the particular case of the IP algorithm, and concerning location-based social networks, we wanted to estimate user influence exclusively; thus, instead of building a network with user-user and user-location ties, the original implementation of the IP algorithm was adapted so that the resulting network graph consisted solely of weighted user-user ties.

Regarding the academic citation network, besides estimating influence for all the papers in the dataset, we also addressed a recent research topic and developed a framework to predict the future influence scores of scientific papers, and the future download counts of papers downloaded from the ACM Digital Library, for a specific year, based on the previous years' influence scores. In this experiment we could test and combine different sets of features, resulting in two different models for the prediction of future influence scores: (1) a model including the age of the paper, and (2) a model including the 100 most frequent words in all papers' titles and abstracts.

Rank aggregation was also part of the initial objectives of this work, in order to combine the output of the different algorithms; nonetheless, due to some difficulties with the completion of the remaining tasks included in the MSc thesis work, this task could not be addressed in time.

With the results of our experiments we could perform a detailed characterization of the aforementioned social networks, and verify that social network analysis techniques can be used to assess the most influential nodes of a network. As for the prediction of future influence scores, we can conclude that the framework that was developed for academic citation networks provides reliable and accurate estimates, very close to the real values.

A major limitation of this work resides in the evaluation of the results regarding location-based networks. Unlike academic social networks, where one can assess the validity of the most influential authors or articles through an extensive list of renowned scientific awards that have earned prestige throughout the years, location-based social network analysis is a recent area of study: there is not yet a list of characteristics that flawlessly indicates that a user or a spot is influential, nor a series of public prizes that award people, companies or spots for their relevance and influence in a specific context. Therefore, this evaluation had to be done by comparison to well-known state-of-the-art social network analysis metrics. Also, social networks are dynamic, so the set of users or spots that can be considered influential or trendy today might be different if we make the same estimation, under the same conditions, in a couple of months or a year.

6.1 Summary of Results

In brief, the following are the most important contributions of my MSc thesis, according to their relevance:

Crawling software I implemented crawlers to extract data from FourSquare and from Twitter, using their respective APIs. From the data that was collected I built two location-based networks, from which I extracted their most influential nodes. The source code for the FourSquare crawler was made available as an open-source project1, so it can be re-used by others researching this topic.

Implementation and adaptation of the Influence-Passivity (IP) algorithm Having conducted a thorough study of ranking algorithms, with special focus on the PageRank algorithm and its variants, I implemented the Influence-Passivity (IP) algorithm. The originality of this implementation resides in the fact that the network is built in such a way that it only contains user-user arcs, and the weight assigned to each edge depends on the number of spots that the two users have visited in common. This adaptation of IP reflects the fact that in location-based networks information spreads differently than in typical social networks. The code for the implementation of the IP algorithm was made available as an open-source project2, so it can be used and improved by others researching this topic.

1 http://code.google.com/p/fscrawler/
2 http://code.google.com/p/ezgraph/
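The edge-weighting scheme of the adapted IP graph can be sketched as follows (the `visits` mapping and the helper name are illustrative assumptions, not the released implementation):

```python
from itertools import combinations

def user_user_weights(visits):
    """visits: user -> set of visited spot ids.
    Returns weighted user-user edges, where the weight of an edge
    is the number of spots the two users have visited in common."""
    weights = {}
    for u, v in combinations(sorted(visits), 2):
        common = len(visits[u] & visits[v])
        if common > 0:
            weights[(u, v)] = common
    return weights

edges = user_user_weights({"ana": {1, 2, 3}, "bob": {2, 3}, "eve": {9}})
```

Users with no spots in common get no edge at all, so the resulting graph contains only the weighted user-user ties the adapted IP algorithm operates on.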

Academic Citation Network From the already structured DBLP data, organized in the context of the Arnetminer project, I built an academic citation network and extracted its most influential papers through the computation of the PageRank algorithm. The results were validated against an extensive list of renowned scientific awards, leading to the conclusion that the majority of the top-10 highest ranked papers in the network are either authored by recipients of the aforementioned awards, represent breakthroughs or essential textbooks on a specific topic, or are authored by scientists who have collaborated and co-authored with a great number of other scientists.

Framework for prediction of future PageRank scores and future download counts I developed a framework to predict, for a specific year, the future PageRank score and the future download count of a scientific paper, using the academic citation network mentioned in the previous item.

This task was addressed through an ensemble learning regression algorithm, the IGBRT. I also assessed the impact that different features, and combinations of features such as previous PageRank scores or the age of the paper, have on the accuracy of the results. Our predictions were compared to the real PageRank scores and the real number of downloads in the ACM Digital Library for each specific paper and year, and we concluded that in some cases, depending on the combination of features used, adding information can negatively deviate the results, while in others, as we combine more information, the predictions become closer to the real values. Globally, this approach to future PageRank prediction proved to be accurate, with the predicted results very close to the real values.
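As an illustration of how the feature groups used in the experiments can be assembled per paper before being fed to the regressor, consider the following sketch (the field names `pagerank`, `racer`, `author_avg` and `author_max` are hypothetical stand-ins for the thesis's internal representation):

```python
def feature_vector(paper, year, k, use_racer=False, use_author=False):
    """Builds one of the three feature groups: the k previous PageRank
    scores, optionally the k previous Racer scores, and optionally the
    author's average and maximum PageRank scores."""
    feats = [paper["pagerank"][year - i] for i in range(1, k + 1)]
    if use_racer:
        feats += [paper["racer"][year - i] for i in range(1, k + 1)]
    if use_author:
        feats += [paper["author_avg"], paper["author_max"]]
    return feats

paper = {
    "pagerank": {2007: 0.1, 2008: 0.2, 2009: 0.3},
    "racer": {2007: 0.5, 2008: 0.6, 2009: 0.7},
    "author_avg": 0.25,
    "author_max": 0.4,
}
x = feature_vector(paper, year=2010, k=2, use_racer=True, use_author=True)
```

A vector like `x` would then serve as one training or prediction instance for the ensemble regressor, with the paper's real PageRank score (or download count) for the target year as the label.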

6.2 Future Work

In terms of future work, it would be important to address all the tasks that I initially intended to fulfill, namely, to conduct rank aggregation in the aforementioned experiments. It would also be very interesting to find the most influential users and spots in more complete datasets, which could result in a much richer network and subsequent analysis.

Taking advantage of the fact that this research area is still in its infancy, we could combine the work of this MSc thesis with the work of Lima & Musolesi (2012), which adapts well-known local and global social network analysis metrics that are location-agnostic, like degree or clustering coefficient, by giving them a spatial context, e.g., calculating the degree of a node in the network while only considering the friends of this node that are associated with a specific geographical location, such as a city or a state.

Also, because social networks are dynamic networks, i.e., their structure can change over time with the addition or loss of nodes and relationships, we could integrate state-of-the-art frameworks and algorithms in order to include the passage of time in the networks we have studied. Even though dynamic networks have frequently been addressed in terms of network visualization (Demoll & Mcfarland, 2005), works such as that of Berger-Wolf & Saia (2006) break away from conventional network analysis by proposing a mathematical framework for dynamic network analysis.

On the other hand, we could also extend our work with the implementation of the temporal distance metrics proposed by Tang et al. (2009), which can be applied to networks that change over time and allow us to capture the properties of these time-varying graphs, such as the delay, duration and time order of interactions between nodes.

Bibliography

AGARWAL, N., LIU, H., TANG, L. & YU, P.S. (2008). Identifying the influential bloggers in a community. In Proceedings of the 2008 International Conference on Web Search and Web Data Mining.

ANAGNOSTOPOULOS, A., KUMAR, R. & MAHDIAN, M. (2008). Influence and correlation in social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

ANDERSON, L.R. & HOLT, C.A. (1995). Information cascades in the laboratory. American Economic Review, 87.

ARGUELLO, J., BUTLER, B.S., JOYCE, E., KRAUT, R., LING, K.S., ROSÉ, C. & WANG, X. (2006). Talk to me: foundations for successful individual-group interactions in online communities. In Proceedings of the 2006 SIGCHI Conference on Human Factors in Computing Systems.

BAKSHY, E., HOFMAN, J.M., MASON, W.A. & WATTS, D.J. (2011). Everyone's an influencer: quantifying influence on twitter. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining.

BASTIAN, M., HEYMANN, S. & JACOMY, M. (2009). Gephi: An open source software for exploring and manipulating networks. In Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media.

BERBERICH, K., BEDATHUR, S. & WEIKUM, G. (2006). Rank synopses for efficient time travel on the web graph. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management.

BERGER-WOLF, T.Y. & SAIA, J. (2006). A framework for analysis of dynamic social networks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

BEST, D.J. & ROBERTS, D.E. (1975). Algorithm AS 89: The upper tail probabilities of Spearman's rho. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24.

BOLDI, P. & VIGNA, S. (2004). The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web.

BOLDI, P., SANTINI, M. & VIGNA, S. (2005). PageRank as a function of the damping factor. In Proceedings of the 14th International Conference on World Wide Web.

BOLLEN, J., RODRIGUEZ, M.A. & VAN DE SOMPEL, H. (2006). Journal status. Scientometrics, 69.

BOLLEN, J., VAN DE SOMPEL, H., HAGBERG, A. & CHUTE, R. (2009). A principal component analysis of 39 scientific impact measures. Public Library of Science, 4.

BONACICH, P. (2007). Some unique properties of eigenvector centrality. Social Networks, 29.

BONDY, J.A. & MURTY, U.S.R. (1976). Graph Theory with Applications. Macmillan.

BRAUER, A. (1952). Limits for the characteristic roots of a matrix. IV: Applications to stochastic matrices. Duke Mathematical Journal, 19.

BREIMAN, L. (2001). Random forests. Machine Learning, 45.

BRIN, S. & PAGE, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International Conference on World Wide Web.

CHA, M., HADDADI, H., BENEVENUTO, F. & GUMMADI, K.P. (2010). Measuring user influence in twitter: The million follower fallacy. In Proceedings of the 2010 International AAAI Conference on Weblogs and Social Media.

CHEN, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science, 57.

CHEN, P., XIE, H., MASLOV, S. & REDNER, S. (2007). Finding scientific gems with Google's PageRank algorithm. Journal of Informetrics, 1.

CLARK, J. & HOLTON, D.A. (1991). A First Look at Graph Theory. World Scientific.

CONITZER, V. (2006a). Computational Aspects of Preference Aggregation. Ph.D. thesis, Carnegie Mellon University.

CONITZER, V. (2006b). Computing Slater rankings using similarities among candidates. In Proceedings of the 21st National Conference on Uncertainty in Artificial Intelligence.

CONITZER, V. & SANDHOLM, T. (2005). Common voting rules as maximum likelihood estimators. In Proceedings of the 2005 National Conference on Uncertainty in Artificial Intelligence.

CORMEN, T.H., LEISERSON, C.E., RIVEST, R.L. & STEIN, C. (2001). Introduction to Algorithms. The MIT Press, 2nd edn.

DEMOLL, B.S. & MCFARLAND, D. (2005). The art and science of dynamic network visualization. Journal of Social Structure, 7.

DEVEZAS, J., NUNES, S. & RIBEIRO, C. (2011). Using the H-index to estimate blog authority. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media.

DIESTEL, R. (2005). Graph Theory, vol. 173. Springer-Verlag, Heidelberg, 3rd edn.

DING, Y. & CRONIN, B. (2011). Popular and/or prestigious? Measures of scholarly esteem. Information Processing and Management, 47.

DING, Y., YAN, E., FRAZHO, A. & CAVERLEE, J. (2009). PageRank for ranking authors in co-citation networks. Journal of the American Society for Information Science and Technology, 60.

DUTTON, G. (1996). Improving locational specificity of map data - a multi-resolution, metadata-driven approach and notation. International Journal of Geographical Information Science, 10.

EASLEY, D. & KLEINBERG, J. (2010). Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press.

EGGHE, L. (2006). Theory and practise of the g-index. Scientometrics, 69.

EGGHE, L. (2009). Lotkaian informetrics and applications to social networks. The Bulletin of the Belgian Mathematical Society, 16.

FIALA, D., ROUSSELOT, F. & JEŽEK, K. (2008). PageRank for bibliographic networks. Scientometrics, 76.

FRANCK, G. (1999). Essays on science and society: Scientific communication - a vanity fair? Science, 286.

FREEMAN, L.C. (1978). Centrality in social networks: conceptual clarification. Social Networks, 215.

GELLER, C. (2002). Single transferable vote with Borda elimination: A new vote counting system. Tech. rep., Deakin University, Faculty of Business and Law, School of Accounting, Economics and Finance.

GHOSH, R., LERMAN, K., SURACHAWALA, T., VOEVODSKI, K. & TENG, S.H. (2011). Non-conservative diffusion and its application to social network analysis. ArXiv pre-print.

GIBBONS, A. (1985). Algorithmic Graph Theory. Cambridge University Press.

HAGBERG, A.A., SCHULT, D.A. & SWART, P.J. (2008). Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference.

HARARY, F. (1962). The determinant of the adjacency matrix of a graph. Society for Industrial and Applied Mathematics, 4.

HAVELIWALA, T.H. (2002). Topic-sensitive PageRank. In Proceedings of the 11th International Conference on World Wide Web.

HEIDEMANN, J., KLIER, M. & PROBST, F. (2010). Identifying key users in online social networks: A PageRank based approach. In Proceedings of the 31st International Conference on Information Systems.

HIRSCH, J.E. (2010). An index to quantify an individual's scientific research output that takes into account the effect of multiple coauthorship. Scientometrics, 85.

HUBERMAN, B.A., ROMERO, D.M. & WU, F. (2009). Crowdsourcing, attention and productivity. Journal of Information Science, 35.

JOACHIMS, T. (1999). Making large-scale support vector machine learning practical. In Advances in Kernel Methods, MIT Press.

JOACHIMS, T. (2002). Learning to Classify Text Using Support Vector Machines. Kluwer.

KAISER, M. (2008). Mean clustering coefficients: the role of isolated nodes and leafs on clustering measures for small-world networks. New Journal of Physics, 10.

KISELEV, V. (2008). On eligibility by the Borda voting rules. International Journal of Game Theory, 37.

KLEINBERG, J.M. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms.

LEAVITT, A., BURCHARD, E., FISHER, D. & GILBERT, S. (2009). The influentials: New approaches for analyzing influence on twitter. Webecology Project.

LEBANON, G. & LAFFERTY, J.D. (2002). Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning.

LI, H. (2011). Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool Publishers.

LIMA, A. & MUSOLESI, M. (2012). Spatial dissemination metrics for location-based social networks. In Proceedings of the 4th ACM International Workshop on Location-Based Social Networks (LBSN 2012), colocated with ACM UbiComp 2012.

LIU, T.Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3.

LIU, X., BOLLEN, J., NELSON, M.L. & VAN DE SOMPEL, H. (2005). Co-authorship networks in the digital library research community. Information Processing and Management, 41.

LOTKA, A.J. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Science, 16.

LUCIANO, RODRIGUES, F.A., TRAVIESO, G. & BOAS, V.P.R. (2005). Characterization of complex networks: A survey of measurements. Advances in Physics, 56.

LUCIANO, RODRIGUES, F.A., TRAVIESO, G. & BOAS, V.P.R. (2006). Characterization of complex networks: A survey of measurements. Advances in Physics, 56.

MACSKASSY, S.A. & PROVOST, F. (2007). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8.

MCPHERSON, M., SMITH-LOVIN, L. & COOK, J.M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27.

MIHALCEA, R. (2004). Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the 2004 Annual Meeting of the Association for Computational Linguistics.

MILLEN, D.R. & PATTERSON, J.F. (2002). Stimulating social engagement in a community network. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work.

MOHAN, A., CHEN, Z. & WEINBERGER, K.Q. (2011). Web-search ranking with initialized gradient boosted regression trees. Journal of Machine Learning Research - Proceedings Track, 14.

NEWMAN, M.E.J. (2003). A measure of betweenness centrality based on random walks. Social Networks, 27.

NEWMAN, M.E.J. (2004). Analysis of weighted networks. Physical Review E, 70.

OLIVER, J.J. & HAND, D.J. (1995). On pruning and averaging decision trees. In Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann.

PAGE, L., BRIN, S., MOTWANI, R. & WINOGRAD, T. (1998). The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference.

PAPAGELIS, M., BANSAL, N. & KOUDAS, N. (2009). Information cascades in the blogosphere: A look behind the curtain. In Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media.

PERRA, N. & FORTUNATO, S. (2008). Spectral centrality measures in complex networks. Physical Review E, 78.

PROCACCIA, A.D., ZOHAR, A. & ROSENSCHEIN, J.S. (2006). Automated design of voting rules by learning from examples. In Proceedings of the 1st International Workshop on Computational Social Choice.

REKA, A. & BARABÁSI, A.L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74.

ROMERO, D.M., GALUBA, W., ASUR, S. & HUBERMAN, B.A. (2011). Influence and passivity in social media. In Proceedings of the 20th International Conference Companion on World Wide Web.

SAYYADI, H. & GETOOR, L. (2009). FutureRank: Ranking scientific articles by predicting their future PageRank. In Proceedings of the 2009 SIAM International Conference on Data Mining.

SHANNON, P., MARKIEL, A., OZIER, O., BALIGA, N.S., WANG, J.T., RAMAGE, D., AMIN, N., SCHWIKOWSKI, B. & IDEKER, T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13.

SIDIROPOULOS, A. & MANOLOPOULOS, Y. (2005). A citation-based system to assist prize awarding. ACM SIGMOD Record, 34.

SIDIROPOULOS, A., KATSAROS, D. & MANOLOPOULOS, Y. (2007). Generalized Hirsch h-index for disclosing latent facts in citation networks. Scientometrics, 72.

SZABO, G. & HUBERMAN, B.A. (2010). Predicting the popularity of online content. Communications of the ACM, 53.

SZALAY, A.S., GRAY, J., FEKETE, G., KUNSZT, P.Z., KUKOL, P. & THAKAR, A. (2007). Indexing the sphere with the hierarchical triangular mesh. Technical report.

TANG, J., MUSOLESI, M., MASCOLO, C. & LATORA, V. (2009). Temporal distance metrics for social network analysis. In Proceedings of the 2nd ACM Workshop on Online Social Networks.

WALKER, D., XIE, H., YAN, K.K. & MASLOV, S. (2007). Ranking scientific publications using a simple model of network traffic. Journal of Statistical Mechanics.

WATTS, D.J. & DODDS, P.S. (2007). Influentials, networks, and public opinion formation. Journal of Consumer Research, 34.

WATTS, D.J. & STROGATZ, S.H. (1998). Collective dynamics of 'small-world' networks. Nature, 393.

WENG, J., LIM, E.P., JIANG, J. & HE, Q. (2010). TwitterRank: finding topic-sensitive influential twitterers. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining.

WU, F., WILKINSON, D.M. & HUBERMAN, B.A. (2009). Feedback loops of attention in peer production. In Proceedings of the 2009 International Conference on Computational Science and Engineering.

XIA, L., LANG, J. & MONNOT, J. (2011). Possible winners when new alternatives join: New results coming up. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems.

XING, W. & GHORBANI, A. (2004). Weighted PageRank algorithm. In Proceedings of the 2004 Annual Conference on Communication Networks and Services Research.

YAN, E. & DING, Y. (2011). Discovering author impact: A PageRank perspective. Information Processing and Management, 47.

YANG, J. & COUNTS, S. (2010). Predicting the speed, scale, and range of information diffusion in twitter. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media.

YOUNG, H.P. (2009). Innovation diffusion in heterogeneous populations: Contagion, social influence, and social learning. American Economic Review, 99.

ZHANG, C.T. (2009). The e-index, complementing the H-index for excess citations. Public Library of Science, 4.

ZHENG, Y. & ZHOU, X., eds. (2011). Computing with Spatial Trajectories. Springer.


Appendix A

Important Awards in Computer Science

The following renowned awards were used as ground-truth lists in the task of assessing the validity of the PageRank scores obtained for the DBLP dataset:

• A. M. Turing Award1

• Knuth Prize2

• IEEE John von Neumann Medal3

• IEEE Emanuel R. Piore Award4

• ACM SIGMOD Edgar F. Codd Innovations Award5

• ACM SIGMOD Best Paper Award6

• ACM SIGMOD Test of Time Award7

• ACM Software System Award8

• ACM Innovation Award9

• National Science Foundation Presidential Young Investigator Award10

1 http://amturing.acm.org/
2 http://www.sigact.org/Prizes/Knuth/
3 http://www.ieee.org/about/awards/medals/vonneumann.html
4 http://www.ieee.org/about/awards/tfas/piore.html
5 http://www.sigmod.org/sigmod-awards/sigmod-awards#innovations
6 http://www.sigmod.org/sigmod-awards/sigmod-awards#bestpaper
7 http://www.sigmod.org/sigmod-awards/sigmod-awards#time
8 http://awards.acm.org/homepage.cfm?srt=all&awd=149
9 http://www.sigkdd.org/awards_innovation.php
10 http://www.nsf.gov/awards/presidential.jsp

• SIGIR Gerard Salton Award1

1 http://www.sigir.org/awards/awards.html
