Finding Influencers in Social Networks

Carolina de Figueiredo Bento

Dissertation submitted to obtain the Master Degree in Information Systems and Computer Engineering

Jury
President: Prof. Dr. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Dr. Bruno Emanuel da Graça Martins
Co-Supervisor: Prof. Dr. Pável Pereira Calado
Member: Prof. Dr. Alexandre Paulo Lourenço Francisco

November 2012

Abstract

From the millions of users that social platforms have, one can observe that the activities of a select number of users are more rapidly perceived and spread through the network than those of others. These users are the influencers. They generate trends and shape opinions in social networks, being crucial in areas such as marketing or opinion mining.

In my MSc thesis, I studied network analysis methods to identify influencers, experimenting with different types of networks, namely location-based networks from services like FourSquare or Twitter, that include relationships between users and between users and the locations they have visited, and academic citation networks, i.e., networks that relate scientific papers through citations.

Within location-based networks, I estimated the most influential nodes through a set of network analysis techniques. Since there is no ground-truth list, i.e., a list containing a set of well-known, accepted influencers, the veracity of these results was assessed by comparison to traditional measures (e.g., the number of friends a user has). The majority of the influencers are not the ones with the highest number of friends.

Within academic citation networks, the most influential papers identified were indeed considered important publications, being authored by renowned authors and recipients of important awards, and constituting fundamental reading or recent developments on a topic. I also developed a framework to predict future influence scores and download counts through a combination of features. Accurate estimates were obtained through the use of learning methods such as RT-Rank.

Keywords: Social Networks, Network Analysis, Impact Scores, Finding Influencers


Resumo

Os serviços de social networking têm milhões de utilizadores; contudo, apercebemo-nos de que a actividade de um grupo selecto de utilizadores é mais rapidamente captada e propagada pela rede do que a de outros. Chamamos a este grupo os influentes. Eles criam tendências e dominam as opiniões nas redes sociais, sendo cruciais em áreas como o marketing ou o opinion mining.

Na minha tese, estudei métodos de análise de redes para identificar influentes, analisando dois tipos de redes, nomeadamente redes baseadas na localização, provindas de serviços como o FourSquare ou o Twitter, que incluem relações entre os utilizadores e entre estes e os locais que visitaram, e redes de citações académicas, i.e., relacionando artigos científicos através de citações.

Em redes baseadas na localização, estimaram-se quais os nós mais influentes, através de um conjunto de técnicas de análise de redes. A veracidade destes resultados foi aferida comparando com medidas tradicionais (e.g., o número de amigos de um utilizador), dado não existir uma lista de influentes para validação, i.e., uma lista contendo um conjunto de influentes unanimemente reconhecidos.

Em redes de citações académicas, os artigos obtidos como mais influentes são realmente publicações importantes, por serem da autoria de cientistas de renome galardoados no passado, por serem publicações essenciais ou por serem desenvolvimentos recentes num tópico específico. Desenvolvi também uma framework que prevê futuros valores de influência e o futuro total de downloads efectuados, combinando características como valores de influência anteriores. Através da utilização de métodos de aprendizagem como o RT-Rank, é possível realizar estimativas precisas.

Palavras-chave: Redes Sociais, Análise de Redes, Valores de Influência, Encontrar Influentes


Acknowledgments

First and foremost, I have to thank my parents, sister and brother-in-law for the unconditional support and selflessness throughout these years, and especially during my MSc thesis.

I must thank my advisors, Prof. Dr. Bruno Martins and Prof. Dr. Pável Calado, for all the support, motivation, patience and availability. It is very comforting to be able to share ideas and openly discuss new ways of addressing a problem with such ease. I must also thank them for giving me the opportunity of being part of projects such as the European Digital Mathematics Library (EuDML) and the Services for Intelligent Geographical Information Systems (SInteliGIS), both funded by the Portuguese Foundation for Science and Technology (FCT) through the project grants with references 250503 in CIP-ICT-PSP.2009.2.4 and PTDC/EIA-EIA/109840/2009, respectively.

I thank all the colleagues and close friends that have accompanied me throughout the years, and especially the ones who have filled these last couple of years with so much joy, laughter and camaraderie. So, to Ana Silva, João Lobato Dias, Luís Santos, João Amaro, Pedro Cruz, Jacqueline Jardim, Maria Rosa, Luís Luciano, Carlos Simões, Mafalda Abreu, Célia Tavares and, thankfully, many others, I express my enormous gratitude for keeping me (in)sane.

Last, but definitely not least, I must thank my boyfriend, João Fernandes, for the unconditional love, support, patience and confidence, for helping me be more creative and acute during the stressful times, and for showing me there is always a light at the end of the tunnel.


Contents

Abstract
Resumo
Acknowledgments

1 Introduction
1.1 Hypothesis and Methodology
1.2 Main Contributions
1.3 Organization of the Dissertation

2 Fundamental Concepts
2.1 Fundamental Concepts in Graph Theory
2.2 Influencers in Social Networks
2.3 Prestige, Popularity and Attention in Social Networks
2.4 Recognition, Novelty, Homophily and Reciprocity
2.5 Active versus Inactive Users, User Retention, Confounding, Social Influence and Social Correlation
2.6 Information Cascades
2.7 Information Diffusion Models and Measures
2.8 Graph Measures and Bibliographic Indexes
2.9 Unsupervised Rank Aggregation Approaches
2.10 Supervised Learning for Rank Aggregation
2.11 Summary

3 Related Work
3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm
3.2 The PageRank Algorithm and its Variants
3.2.1 Weighted PageRank
3.2.2 Topic-Sensitive PageRank
3.2.3 TwitterRank
3.3 The Influence-Passivity (IP) Algorithm
3.4 Citation and Co-Authorship Networks
3.5 Temporal Issues in Ranking Scientific Articles
3.6 Summary

4 Finding Influencers in Social Networks
4.1 Available Resources for Finding Influencers
4.1.1 Characterizing Networks
4.2 Analysis of Location-based Social Networks
4.2.1 Data Collection from Online Services
4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm
4.3 Analysis of Academic Social Networks
4.3.1 Predicting Future Influence Scores and Download Counts
4.3.2 The Learning Approach
4.4 Summary

5 Validation Experiments
5.1 The Considered Datasets
5.2 Evaluation Methodology
5.3 The Obtained Results
5.3.1 Finding Influencers
5.3.2 Predicting Future PageRank Scores and Download Counts
5.4 Summary

6 Conclusions
6.1 Summary of Results
6.2 Future Work

Bibliography

Appendices
A Important Awards in Computer Science

List of Tables

5.1 Characterization of the FourSquare and Twitter networks.
5.2 Characterization of the DBLP dataset.
5.3 Characterization of the DBLP network.
5.4 User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph built from the FourSquare dataset.
5.5 User influence scores for the PageRank and HITS algorithms, for the User Graph built from the FourSquare dataset.
5.6 User influence scores for the IP algorithm, over the network built from the FourSquare dataset.
5.7 Spot influence scores for the PageRank and HITS algorithms (which present the exact same top-10), for the User+Spot Graph built from the FourSquare dataset.
5.8 User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph built from the Twitter dataset.
5.9 User influence scores for the PageRank and HITS algorithms, for the User Graph built from the Twitter dataset.
5.10 Spot influence scores for the PageRank and HITS algorithms, for the User+Spot Graph built from the Twitter dataset.
5.11 PageRank scores for the top-10 highest ranked papers of the DBLP dataset.
5.12 Results for the prediction of PageRank impact scores for papers in the DBLP dataset.
5.13 Results for the prediction of download numbers for papers in the DBLP dataset.

List of Figures

2.1 A graph with the set of vertices V={1, ..., 8}, the set of edges E={(1, 2), (2, 4), (3, 4), ...}, encoding a path P with length 6 (adapted from Diestel, 2005).
2.2 Graph with three components and two SCCs denoted by dashed lines (adapted from Easley & Kleinberg (2010) and Cormen et al. (2001)).
2.3 Flowchart for the Single Transferable Vote rule.
2.4 Learning-To-Rank (L2R) framework (adapted from Liu (2009)).
3.5 A graph with hubs and authorities (adapted from Kleinberg (1998)).
3.6 A graph illustrating the computation of PageRank (adapted from Page et al. (1998)).
3.7 The general TwitterRank framework (adapted from Weng et al. (2010)).
4.8 Example of a location-based social network (adapted from Zheng & Zhou (2011)).
4.9 A sequence of subdivisions of the world sphere, starting from the octahedron, down to level 5, corresponding to 8192 spherical triangles. The circular triangles have been plotted as planar ones, for simplicity (adapted from Szalay et al. (2007)).
4.10 The HTM recursive division process (adapted from Szalay et al. (2007)).
4.11 Transformation of the original network graph (left) to our IP algorithm graph (right).
4.12 Structure of the citation graph built upon the DBLP data.
4.13 Framework for predicting future PageRank scores and download counts.
5.14 Degree distribution for nodes in the User+Spot Graph and the User Graph, from the FourSquare and Twitter datasets.
5.15 Degree distribution for the DBLP dataset from 2008 to 2011.

Chapter 1

Introduction

The rise of platforms such as Twitter1 or Google+2, with their focus on user-generated content and social networks, has brought the study of authority and influence over social networks to the forefront of current research. For companies and other public entities, identifying and engaging with influential authors in social media is critical, since any opinions they express can rapidly spread far and wide. For users, when presented with a vast amount of content relevant to a topic of interest, sorting content by the source's authority or influence can also assist in information retrieval.

There has been a substantial amount of recent work studying influence and the diffusion of information in social networks. Moreover, there has also been much work in the field of network analysis that has focused explicitly on sociometry, including quantitative measures of influence, authority, centrality or prestige. These measures (e.g., degree centrality or betweenness centrality) are essentially heuristics, usually based on intuitive notions such as access and control over resources, or brokerage of information.

In the context of my MSc thesis, I conducted a thorough study on the problem of identifying the most influential nodes in a social network. With two different types of networks at hand, namely location-based social networks from services such as FourSquare or Twitter, and academic citation networks encoding relations between papers, the main focus was on applying well-known network analysis techniques and algorithms.

One of the most important contributions of this work consisted in adapting the Influence-Passivity (IP) algorithm, initially intended strictly for Twitter data and relying on re-tweets to capture information flow, to the context of location-based social networks, where the propagation of information happens via the locations that users visit, i.e., exploring patterns in which a new user j visits a location l after one of his friends i has already visited l.

1 http://twitter.com/
2 https://plus.google.com/
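The location-mediated propagation pattern described above can be made concrete. The sketch below is a simplified illustration, not the thesis's actual implementation; the function name and data layout (timestamped check-in tuples and a friends map) are assumptions made for this example.

```python
from collections import defaultdict

def influence_pairs(checkins, friends):
    """Extract (i, j) pairs where user j checked into a location l
    after a friend i had already visited l.

    checkins: list of (timestamp, user, location) tuples.
    friends:  dict mapping each user to a set of friends.
    Returns a dict counting how often i preceded friend j at a location.
    """
    counts = defaultdict(int)
    # Earliest visit time of each user at each location.
    first_visit = {}
    for t, u, l in sorted(checkins):
        first_visit.setdefault((u, l), t)
    for (j, l), tj in first_visit.items():
        for i in friends.get(j, set()):
            ti = first_visit.get((i, l))
            if ti is not None and ti < tj:
                counts[(i, j)] += 1  # i potentially influenced j
    return dict(counts)
```

Counts of this kind are what an IP-style algorithm could then use in place of re-tweet edges.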

In what regards the study of influence in academic social networks, I studied techniques for estimating future influence scores and future download counts. In this context, I specifically developed a framework to predict the future PageRank scores and future download counts of scientific articles that were downloaded from the ACM Digital Library1, for a specific year, through a combination of features that include the age of the article and previous PageRank scores.

1.1 Hypothesis and Methodology

In the context of my MSc thesis, I focused on the task of identifying the most influential users in a social network, working with two types of networks, namely (1) location-based networks from services like FourSquare or Twitter, that include relationships between users in the network and between users and the locations they have visited, and (2) academic citation networks, i.e., networks that relate scientific papers according to their citation relationships. The main hypothesis I tried to validate was that we can identify the most influential users through social network analysis techniques and algorithms. More specifically, with location-based networks, the presence of locations aids in the propagation of influence scores through the network, while with academic citation networks, one can focus on the temporal dynamics of networks, using networks from the past to predict future networks and assessing how influence scores evolve over time.

In order to validate the research hypothesis, we began by collecting real and up-to-date data from two social networking platforms, namely FourSquare2 and Twitter. Regarding academic social networks, a citation network was built with data from the DBLP3 digital library. In location-based social networks, different ranking algorithms were computed and the top-10 highest ranked users and the top-10 highest ranked spots were extracted and analyzed. To assess the accuracy of these results, we conducted an empirical analysis of our top-10, looking into user profiles and spot check-ins, in order to understand how profile characteristics relate to influence in the network. The quality of the results from the academic citation network was assessed by cross-checking the authors of the top-10 highest ranked scientific papers in the DBLP collection against the recipients of various renowned scientific awards – see Appendix A. For the experiments on the estimation of future influence scores and future download counts of scientific papers, a set of evaluation metrics, including the normalized root mean squared error and the Spearman correlation, was used to assess the quality of our predictions against the real influence scores.

1 http://dl.acm.org/
2 https://foursquare.com/
3 http://www.informatik.uni-trier.de/~ley/db/
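The two evaluation metrics mentioned above can be computed directly. The sketch below is a minimal illustration (the simple Spearman formula used here assumes no tied ranks), not the evaluation code actually used in the thesis.

```python
def nrmse(actual, predicted):
    """Root mean squared error, normalised by the range of the actual values."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return (mse ** 0.5) / (max(actual) - min(actual))

def spearman(x, y):
    """Spearman rank correlation between two score lists (no-ties version)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A perfect prediction yields an NRMSE of 0 and a Spearman correlation of 1.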

1.2 Main Contributions

The following are the most important contributions of this thesis, ordered by relevance:

• I conducted a thorough study of ranking algorithms, with special focus on the PageRank algorithm and its variants. I specifically implemented the HITS and the Influence-Passivity (IP) algorithms. The IP algorithm was adapted to the context of location-based social networks. I computed the influence score for each node and extracted the highest ranked nodes in each type of network. The implementation of the HITS and IP algorithms was made available as an open-source project1, so that it can be re-used by others researching this topic.

• I implemented crawlers to extract data from FourSquare and from Twitter, from which I built networks with two types of nodes, namely users and spots (i.e., the locations users have visited). These networks were used in experiments for finding the most influential nodes through algorithms such as HITS, PageRank or IP. The source code for the FourSquare crawler was made available as an open-source project2, so it can be re-used by others researching this topic.

• I built a citation network with data from the DBLP digital library, extracting its most influential papers after computing the PageRank algorithm. The accuracy of these results was assessed by cross-checking the authors of these papers against a list of recipients of various renowned scientific awards. From this experiment, we could conclude that the majority of the most influential papers in this network are authored by recipients of important scientific awards.

• I developed a framework to predict the future PageRank scores and the future download counts of scientific papers, for a specific year, using the citation network built from DBLP data. This task was addressed through an ensemble learning regression algorithm. I assessed the impact that different features have on the accuracy of the results. Our predictions were compared to the real PageRank scores and the real number of downloads from the ACM Digital Library for each specific paper and year. We concluded that in some cases, depending on the combination of features used, having more information can deviate the results negatively, while in others, as we combine more information, predictions become closer to the real values. Globally, this prediction approach proved to be accurate, with results very close to the real values.

1 http://code.google.com/p/ezgraph/
2 http://code.google.com/p/fscrawler/

1.3 Organization of the Dissertation

The structure of the rest of this document is as follows: Chapter 2 presents fundamental concepts in social network analysis. Chapter 3 describes the most significant work related to the task of finding influencers in social networks, and to the analysis of location-based social networks. Chapter 4 details the work developed in the context of my MSc thesis, namely the methodology for data collection, how the networks were built, the specific implementation and adaptation of the IP algorithm, and the methodology for finding the influential nodes in the networks. Regarding the experiment on the prediction of future PageRank scores, Chapter 4 also describes the features and the learning approach that was used. Chapter 5 describes the validation experiments and the obtained results, alongside a brief discussion. Finally, Chapter 6 closes this document, highlighting the most important conclusions of this MSc thesis and presenting possible paths for improvement and future work.

Chapter 2

Fundamental Concepts

This chapter introduces the fundamental concepts related to the problem of finding influencers in social networks. After a brief introduction to graph theory, more specific concepts are presented, such as what it is to be an influencer, the distinction between popularity and prestige, what one means when discussing social gestures, and the social gestures that are most relevant in the context of this MSc thesis, namely homophily and reciprocity. Finally, this chapter introduces the fundamental concepts behind graph centrality measures, bibliometric indexes and rank aggregation approaches, the latter concerning the combination of the outputs of various ranking methods to generate a consensual ranked list.

2.1 Fundamental Concepts in Graph Theory

A graph G can be represented as a pair G = (V, E), where V or V(G) is the set of vertices or nodes and E or E(G) is the set of edges or links between the nodes (Figure 2.1). The number of vertices of a graph indicates the graph's order (Diestel, 2005). Graphs are usually used to represent networks, either undirected (Figure 2.1) or directed (i.e., digraphs, in which the edges have a direction from a node A to a node B). A way of representing a directed graph D is with an adjacency matrix, a square matrix A = A(D) where each cell (i, j) has a value equal to 1 if there is an edge from i to j, and a value equal to 0 otherwise (Harary, 1962).

In what regards graph measures, the degree d_G(i), or valency, of a vertex i in an undirected graph G is the number |E(i)| of edges at i, which is equal to the number of neighbours of i, i.e., the number of vertices that are adjacent to i. It can be mathematically expressed as follows, where a(i, j) denotes a cell in the graph's adjacency matrix:

d_G(i) = \sum_j a(i, j) = \sum_j a(j, i) \qquad (2.1)

In what regards directed graphs, we have the same notation as in undirected graphs, with the exception that, when specifying the set of edges E, all pairs of connected vertices have to be oriented. Besides the measure of degree, one can also measure the in-degree d_G^{in}(i) and the out-degree d_G^{out}(i) of a vertex i, which are, respectively, the number of incoming and outgoing edges of that vertex (Clark & Holton, 1991). The in-degree and out-degree also correspond to the cardinality of, respectively, the set of predecessors and the set of successors of a node, and can be formally expressed as follows:

d_G^{in}(i) = \sum_j a(j, i) \qquad (2.2)

d_G^{out}(i) = \sum_j a(i, j) \qquad (2.3)
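Under the adjacency-matrix representation, Equations 2.1 to 2.3 translate directly into code. A minimal sketch, using plain lists of lists for the matrix:

```python
def in_degree(A, i):
    """In-degree of vertex i: number of incoming edges (Equation 2.2)."""
    return sum(A[j][i] for j in range(len(A)))

def out_degree(A, i):
    """Out-degree of vertex i: number of outgoing edges (Equation 2.3)."""
    return sum(A[i][j] for j in range(len(A)))

def degree(A, i):
    """Degree of vertex i in an undirected graph (Equation 2.1).
    For a symmetric adjacency matrix, row and column sums coincide."""
    return out_degree(A, i)
```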

One might also want to represent a weighted network, i.e., a network in which each edge is assigned a specific weight. A weighted network can be expressed as an adjacency matrix in which each entry indicates the weight w_{ij} of the corresponding edge, as follows (Newman, 2004):

A_{ij} = w_{ij} \qquad (2.4)

When representing a weighted network with a graph, one just has to add the weights to each edge, thus defining a weighted graph. For a weighted network, besides the in-degree and out-degree of a vertex i, one is usually more interested in the strength of i, i.e., the sum of the weights w of the corresponding edges. The in-strength s_i^{in} and out-strength s_i^{out} of a vertex i are expressed as follows (Luciano et al., 2005):

s_i^{in} = \sum_j w(j, i) \qquad (2.5)

s_i^{out} = \sum_j w(i, j) \qquad (2.6)
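Equations 2.5 and 2.6 can be computed in the same way as the degree measures, with the weight matrix taking the place of the 0/1 adjacency matrix:

```python
def in_strength(W, i):
    """In-strength of vertex i: sum of weights of incoming edges (Equation 2.5)."""
    return sum(W[j][i] for j in range(len(W)))

def out_strength(W, i):
    """Out-strength of vertex i: sum of weights of outgoing edges (Equation 2.6)."""
    return sum(W[i][j] for j in range(len(W)))
```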

Also important in graph analysis is the notion of a stochastic matrix. A square matrix A = (a_{kλ}) can only be called stochastic if all its elements are non-negative and if the following condition is verified (Brauer, 1952):

\sum_{\lambda=1}^{n} a_{k\lambda} = 1, \quad k = 1, 2, ..., n \qquad (2.7)

Stochastic matrices can be used to encode weighted graphs where the indegree or the outdegree cor- respond to probability distributions.
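For illustration, a weighted adjacency matrix can be turned into a row-stochastic matrix by normalising each row, which is the form required by random-walk interpretations of the kind used by PageRank-style methods. This is a sketch; the handling of all-zero rows shown here is just one possible convention.

```python
def row_stochastic(W):
    """Normalise each row of a non-negative weight matrix so it sums to 1,
    yielding a stochastic matrix in the sense of Equation 2.7. Rows of
    dangling nodes (all zeros) are left unchanged here; PageRank-style
    methods typically patch them with a uniform distribution instead."""
    P = []
    for row in W:
        s = sum(row)
        P.append([w / s for w in row] if s > 0 else list(row))
    return P
```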

A path within a graph is a non-empty sub-graph P = (V, E) such that V = {x_0, x_1, ..., x_k} and E = {(x_0, x_1), (x_1, x_2), ..., (x_{k-1}, x_k)}, where the x_i are all distinct from one another – see Figure 2.1. The nodes x_0 and x_k are called the ends of the path P (Bondy & Murty, 1976). For undirected and unweighted graphs, the number of edges (|E|) in a path is the length of the path.

Figure 2.1: A graph with the set of vertices V={1, ..., 8}, the set of edges E={(1, 2), (2, 4), (3, 4), ...}, encoding a path P with length 6 (adapted from Diestel, 2005).

One might also be interested in determining the geodesic path, i.e., the shortest path, between two vertices. The geodesic path between vertices i and j is the path between them that has the minimum length (Luciano et al., 2005).
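The length of a geodesic path in an unweighted graph can be found with a breadth-first search. A minimal sketch, assuming an adjacency-list representation:

```python
from collections import deque

def geodesic_length(adj, source, target):
    """Length of the geodesic (shortest) path between two vertices in an
    unweighted graph, via breadth-first search; returns None if no path.

    adj: dict mapping each vertex to an iterable of neighbours."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return None
```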

When describing the structure of a graph, one can partition it into components, or connected components, i.e., subsets of nodes in which every node has a path to every other node, but which are not part of a larger set that is also internally connected (Gibbons, 1985) – see Figure 2.2. A directed graph can have strongly connected components (SCCs), which are sets of nodes such that, for any nodes i and j belonging to the set, there is a directed path from i to j and from j to i (Gibbons, 1985). Dangling nodes are defined as nodes that have no outlinks. Figure 2.2 illustrates both these concepts in a graph.
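Both of these structural notions are straightforward to compute. The sketch below finds dangling nodes and the connected components of an undirected graph via repeated traversal; the adjacency-list layout is an assumption made for illustration.

```python
def dangling_nodes(adj):
    """Vertices with no outgoing links (no outlinks)."""
    return {u for u, nbrs in adj.items() if not nbrs}

def connected_components(adj):
    """Connected components of an undirected graph, by repeated traversal.
    adj: dict mapping each vertex to a set of neighbours."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components
```

Strongly connected components of a directed graph need a slightly richer traversal (e.g., Tarjan's or Kosaraju's algorithm), which is omitted here.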

2.2 Influencers in Social Networks

Influence in social networks is very important, not only from the perspective of information flow, but also for network analysis applications aimed at business and marketing purposes. As for what it is to be influential, many authors have their own particular definitions.

Watts & Dodds (2007) define an influential person, or opinion leader, as an individual who is part of a minority and who has influence over a great number of peers. This influential individual belongs to the top q% of the influence distribution p(n), having as a premise that an individual i, within a population of size N, influences n_i other randomly chosen individuals, where n_i comes from p(n) and refers to how many people i influences regarding a specific topic X.

Figure 2.2: Graph with three components and two SCCs denoted by dashed lines (adapted from Easley & Kleinberg (2010) and Cormen et al. (2001)).

From work developed in the Web Ecology Project, in the context of Twitter, an influential is defined as a user whose actions (i.e., interactions such as replies, retweets, mentions or attributions) have the potential to initiate an action from another user (Leavitt et al., 2009). These actions are called markers of influence and should be taken into account when assessing the influence of Twitter users, instead of the simplistic measure of follower count, which assumes that the user with the greatest number of followers is the most influential.

Bakshy et al. (2011), also on Twitter, consider that if a person B is following a person A, if person A posted a URL earlier than person B did, and if person A is the only one of B's friends who has posted that specific URL, then person A has influenced person B to post that URL. Regarding the computation of influence, the authors recognize that three different approaches can be considered if person B has more than one friend who has posted the same URL:

i. First Influence, crediting exclusively the person who first posted the content, thus assuming that individuals are influenced when they first see novel information, even if they do not act on it immediately;

ii. Last Influence, crediting the last person who posted the content;

iii. Split Influence, crediting equally all friends that posted that specific content before its most recent post. This last approach assumes that either the likelihood of noticing novel content or the intention of acting upon it steadily accumulates, as the information is reposted by more and more friends.
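The three crediting approaches above can be sketched as a single function; the signature and the list-of-prior-posters representation are illustrative assumptions, not Bakshy et al.'s actual code.

```python
def influence_credit(posters, policy="split"):
    """Assign influence credit among the friends who posted a URL before
    the user did, following the three crediting policies described by
    Bakshy et al. (2011).

    posters: list of friends ordered by posting time (earliest first)."""
    if not posters:
        return {}
    if policy == "first":
        return {posters[0]: 1.0}
    if policy == "last":
        return {posters[-1]: 1.0}
    # "split": credit all prior posters equally
    share = 1.0 / len(posters)
    return {p: share for p in posters}
```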

In turn, and still in the realm of Twitter, Cha et al. (2010) defined three types of influence for a user, instead of just one. These metrics are directly related to interpersonal activities:

i. Indegree Influence, counting the total number of followers, to determine the size of the user's audience in the network;

ii. Retweet Influence, counting the total number of retweets containing a user's name, to measure the ability of a user to generate content that is spread by others through the network (i.e., his pass-along value);

iii. Mention Influence, counting the total number of mentions of a user's name, to measure the ability of engaging other users in a conversation.
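As an illustration, these three metrics could be computed from a simple interaction log as sketched below; the record layout ('retweet_of', 'mentions') is an assumed toy format, not any real Twitter API schema.

```python
from collections import Counter

def cha_influence(users, follows, tweets):
    """Compute the three influence metrics of Cha et al. (2010) from a toy
    interaction log. All field names here are illustrative assumptions.

    follows: list of (follower, followee) pairs.
    tweets:  list of dicts with optional 'retweet_of' and 'mentions' fields."""
    indegree = Counter(followee for _, followee in follows)
    retweets = Counter(t["retweet_of"] for t in tweets if t.get("retweet_of"))
    mentions = Counter(m for t in tweets for m in t.get("mentions", []))
    return {u: {"indegree": indegree[u],
                "retweet": retweets[u],
                "mention": mentions[u]} for u in users}
```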

Another important aspect is that influence in social networks is determined by the information flow through the network, i.e., the flow of user content and its propagation through the network (Romero et al., 2011).

2.3 Prestige, Popularity and Attention in Social Networks

Although popularity and prestige are two distinct concepts, they are commonly mistaken for one another. Both concepts are related to influence, since prestigious and/or popular users are more likely to be influential.

One can define popularity as a direct quantification of the level of attention someone has in a social network (Romero et al., 2011). For instance, one can assess popularity in Digg1 or in YouTube2 by, respectively, the number of votes (Diggs) and the number of views that the content of a given user has (Szabo & Huberman, 2010).

As for the notion of prestige, it is most commonly associated with scholarly networks, such as paper citation and journal citation networks. In this realm, there is also a distinction between journal popularity and journal prestige: the former corresponds to journals that are frequently cited by journals with little prestige, while the latter corresponds to journals that have few citations, but only from highly prestigious journals (Bollen et al., 2006).

In a scholarly network, the popularity of an author, journal or paper is the number of times it was cited by other nodes in the network, while its prestige is the number of times it was cited by other highly cited nodes in the network (Ding & Cronin, 2011).

In the academic realm, attention is seen as a form of payment, as well as the main input to scientific production (Franck, 1999). Scientific publications earn attention when cited by other authors in their

1 http://digg.com/
2 http://youtube.com/

publications. In other social networks, too, attention is regarded as a form of value and as a catalyst for more contributions in the social network (Wu et al., 2009).

2.4 Recognition, Novelty, Homophily and Reciprocity

Influential and popular people are recognized by their peers and also by many others outside their communities. Recognition, be it in blogs, academia or social media, is expressed by referencing a person's work, opinions or ideas, and it can have a bidirectional relationship with influence, since the more influential the content a user references, the more influential the user can become (Agarwal et al., 2008).

Novelty is also correlated with influence, in the sense that novel ideas generally exert more influence. In the blogosphere, novelty is also correlated with the number of outlinks of a blog post. Nevertheless, this is a negative correlation, as a greater number of outlinks indicates that the post refers to many other blog posts, revealing that the post is not likely to be novel (Agarwal et al., 2008).

In the context of human interaction, homophily refers to the observation that similar people, i.e., people with similar characteristics, interests and/or preferences, tend to be more in contact with each other than with people with whom they have fewer characteristics and/or preferences in common. As stated in the work of McPherson et al. (2001), homophily implies that distance in terms of social characteristics translates into network distance, i.e., the number of relationships through which a piece of information must travel to connect two individuals.

Another important social phenomenon is reciprocity, arising from following relationships in social networks such as Twitter, where a user tends to follow back a user who followed him first. This is revealed by the high correlation between the numbers of friends and followers: the more friends a user has, the more followers he usually has, and vice-versa (Weng et al., 2010).

Weng et al. (2010), in their study of TwitterRank, addressed the presence of homophily and reciprocity on Twitter, considering that these characteristics underlie following relationships, giving more meaning to social ties and to the identification of influential people on Twitter.

2.5 Active versus Inactive Users, User Retention, Confounding, Social Influence and Social Correlation

When a user performs an action for the first time, such as purchasing a product or visiting a website, one can state that the user has become active. With a total number of a already active friends, a user has an activation probability p(a), which can be modeled with a logistic function expressed as follows:

p(a) = e^{α ln(a+1) + β} / (1 + e^{α ln(a+1) + β})    (2.8)

In the formula, α and β are coefficients, with α measuring social correlation. Both can be estimated using maximum likelihood logistic regression (Anagnostopoulos et al., 2008).
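As an illustration, Equation 2.8 can be sketched in a few lines of Python; the function name and the coefficient values below are illustrative choices, not values from Anagnostopoulos et al. (2008):

```python
import math

def activation_probability(a, alpha, beta):
    """Eq. 2.8: probability that a user becomes active, given that
    a of his friends are already active; alpha measures social
    correlation (alpha = 0 means no correlation with friends)."""
    z = alpha * math.log(a + 1) + beta
    return math.exp(z) / (1 + math.exp(z))

# With alpha = 0, the probability ignores the number of active friends:
assert activation_probability(0, 0.0, -2.0) == activation_probability(9, 0.0, -2.0)

# With alpha > 0, more active friends mean a higher activation probability:
probs = [activation_probability(a, 1.2, -3.0) for a in range(5)]
assert all(p1 < p2 for p1, p2 in zip(probs, probs[1:]))
```

In practice, α and β would be fit to observed activations by maximum likelihood logistic regression, as described above.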

An active user can become a retained user if he stays active in the network, therefore affecting the retention of other users and keeping them from leaving the network (Heidemann et al., 2010). This can also be used as an evaluation metric to identify influential users in social networks, as Heidemann et al. (2010) proposed to do.

Also, one can state that two adjacent nodes u and v in a social network have a social correlation tie if the events that turned u into an active user are correlated with the events that turned v into an active user as well. This behavioral correlation can be accounted for by homophily, by confounding factors (i.e., the environment) and by social influence (Anagnostopoulos et al., 2008).

Confounding factors are influences from external elements, which end up affecting the individuals that are closer to one another in a social network. This can be mathematically expressed as the presence of a confounding variable X and a set of active individuals W, both in a social network G, such that the set of active individuals W comes from a distribution that is correlated with X (Anagnostopoulos et al., 2008). Under confounding, the individuals’ choices of becoming friends with others and of becoming active are exclusively affected by the same unobserved variable X.

The phenomenon of social influence is also one of the causes of social correlation. With social influence, the actions of individuals can induce their friends to act in the same way, which can occur via (i) setting an example to their friends, (ii) informing friends about the action taken, or (iii) increasing the value of an action for their friends (Anagnostopoulos et al., 2008).

2.6 Information Cascades

In the theory behind information cascades, we assume that agents observe private signals of some inherent state and make public decisions. Subsequent decision-makers face the difficulty of knowing whether their own private signal is significant when choosing a state that is unlikely, given the public decisions previously observed (Anderson & Holt, 1995).

We are in the presence of an information cascade when all decisions (initial and subsequent) coincide, in the sense that it is optimal for subsequent decision-makers to ignore their private signals and follow the pattern that has been established. For example, suppose that a worker is not hired by several prospective employers because of poor interview performances. With this public decision information, a subsequent prospective employer may also not hire the worker, because the worker’s information is dominated by the negative signals inferred from the previous rejections, even if the candidate does well in his interview (i.e., a positive private signal). Therefore, an information cascade can result from rational inferences that others’ decisions are based on information that dominates one’s own signal (Anderson & Holt, 1995).

From the work developed by Papagelis et al. (2009) in the context of the blogosphere, we have that a cascade can be characterized by its (i) size, i.e., the number of nodes involved in the cascade, excluding its initiator; (ii) height, i.e., the height of the resulting spanning tree, after a depth-first search traversal on the cascade; and (iii)-(v) the minimum, mean and maximum reaction times of all posts in the cascade, excluding its initiator.

In social networks, there are many factors that influence information cascades, such as the graphical interface used to interact with the network (Millen & Patterson, 2002), the fact that an in-topic conversa- tion/interaction is being maintained (Arguello et al., 2006), or positive attention and feedback (Huberman et al., 2009).

The analysis of information cascades can provide insight on public opinion over a variety of topics (Papagelis et al., 2009). It is therefore related to the task of finding influential users in a social network, since influential users are the ones who tend to shape, i.e., influence, the opinions of other users in the social network.

2.7 Information Diffusion Models and Measures

Arising, respectively, from the realms of marketing, sociology and economics, the work of Young (2009) presents three information diffusion models, namely (i) social contagion, (ii) social influence and (iii) social learning.

In social contagion, information spreads like in an epidemic, i.e., people spread information when they contact with others who have already been in contact with that same information (Young, 2009). This model is, thus, based on exposure. The homogeneous contagion model at time t can be mathematically described as the following ordinary differential equation:

ṗ(t) = (λ p(t) + γ)(1 − p(t))    (2.9)

In the formula, λ and γ are non-negative parameters, not both equal to zero, corresponding respectively to the instantaneous rates at which a current non-adopter hears about the information from a previous adopter within the group (λ) and outside the group (γ).
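The qualitative behavior of Equation 2.9 can be illustrated with a simple forward-Euler integration; the parameter values, function name and step size below are illustrative choices:

```python
def simulate_contagion(lam, gamma, p0=0.0, dt=0.01, steps=2000):
    """Forward-Euler integration of the homogeneous contagion model
    (Eq. 2.9): dp/dt = (lam * p + gamma) * (1 - p), where lam is the
    within-group and gamma the outside-group contact rate."""
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p += dt * (lam * p + gamma) * (1 - p)
        trajectory.append(p)
    return trajectory

traj = simulate_contagion(lam=0.5, gamma=0.05)
# The adopter proportion grows monotonically and saturates near 1:
assert all(a <= b + 1e-12 for a, b in zip(traj, traj[1:]))
assert traj[-1] > 0.95
```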

In social influence, users spread information when enough other people in their group have already been in contact with it. In a standard model, it is assumed that users have different social thresholds, which determine if they will spread that information or not, as a function of the number of others that have already spread it. Users are, thus, moved by social pressure, in the sense that the aforementioned thresholds refer to their degree of responsiveness to social influence. The threshold of user i is the minimum proportion r_i ≥ 0 such that i will only spread information if at least a proportion r_i of the members of the group have already done the same. If r_i > 1, then for user i to spread the information, more than the whole group would have to have spread it as well; therefore, in this latter case, i never spreads the information. With F(r) being the cumulative distribution function of thresholds in some given population, the proportion of people whose thresholds have been crossed at time t is F(p(t)). Having λ as the instantaneous rate at which people are converted to spread the information, and with a proportion p(t) having already spread it, the proportion of users whose thresholds have already been crossed, but who have not yet spread the information, is F(p(t)) − p(t) (Young, 2009). Thus, this model can be expressed as follows:

ṗ(t) = λ [F(p(t)) − p(t)],  λ > 0    (2.10)
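The behavior of Equation 2.10 can likewise be illustrated with a simple forward-Euler simulation; the threshold distributions F used below are illustrative choices:

```python
def simulate_thresholds(lam, F, p0, dt=0.01, steps=4000):
    """Forward-Euler integration of the threshold model (Eq. 2.10):
    dp/dt = lam * (F(p) - p), where F is the cumulative distribution
    of thresholds and p(t) the proportion already spreading."""
    p = p0
    for _ in range(steps):
        p += dt * lam * (F(p) - p)
    return p

# With thresholds uniform on [0, 1], F(p) = p and the seed never grows:
assert abs(simulate_thresholds(1.0, lambda p: p, p0=0.2) - 0.2) < 1e-9

# With thresholds uniform on [0, 0.5], F(p) = min(2p, 1): any initial
# seed keeps crossing new thresholds until the whole population spreads.
final = simulate_thresholds(1.0, lambda p: min(2 * p, 1.0), p0=0.05)
assert final > 0.99
```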

In a social learning model, users spread information once they have enough empirical evidence to convince them that the information is worth spreading. Thus, users make rational use of previously gathered evidence in order to reach a decision (e.g., when a new smartphone is out in the market, people tend to see how it works for others over some period of time before trying it for themselves). Due to sources of heterogeneity, such as discrepancies in their prior beliefs, the amount of information they have gathered, or idiosyncratic costs, people may spread information at different times. In this type of model, which explains why people would spread information given that others have already spread it, the adoption decision flows directly from the rational evaluation of evidence. There are two types of social learning models, namely (i) social learning models with direct observation, where the evidence comes from other people’s experiences, i.e., people believe that the information is worth spreading because other people have spread it and their spreading payoff is fully observable, and (ii) herding models, where only the spreading act is observable (Young, 2009).

In a social learning model with direct observation, one can assume that (Young, 2009):

i. Payoffs are observable;

ii. Payoffs generated by different individuals and/or at different points in time are independent and equally informative;

iii. Users are risk-neutral and myopic (i.e., they consider only immediate payoffs);

iv. There is no idiosyncratic component to payoffs due to differences in user’s types, although users may have different costs (not necessarily observable);

v. There are differences in users’ prior beliefs about how good the information is relative to the status quo;

vi. There are differences in the average number of people users observe, and hence in the amount of information they have;

vii. The population is fully mixed.

In this case, the system becomes very simple and the various types of heterogeneity are reduced to a composite index that measures the probability of a given user spreading the information, conditional on the amount of information that has been generated so far in the population (Young, 2009). Regarding information diffusion measures, the most common are (i) speed, which considers when a diffusion instance will take place, and whether it will take place at all, (ii) scale, i.e., the number of instances that were affected at a first degree, and (iii) range, which measures how far the diffusion chain can continue in depth (Yang & Counts, 2010).

2.8 Graph Centrality Measures and Bibliographic Indexes

In graph theory, graph centrality measures provide a way of measuring the varying importance of network vertices, according to specific criteria and the role played by the nodes of a network. In Bibliometrics, an area concerned with the analysis of patterns in scientific literature, bibliometric indexes are used to evaluate the quality, impact and relevance of the work of a particular scientist, usually by analyzing the citation graph. In the context of this MSc thesis, both these areas are particularly important, because they can provide robust approaches for estimating influence. Some of the most important network centrality metrics are as follows:

i. Degree Centrality: Degree centrality is a measure of the popularity of a node in a network (Newman, 2003). It is defined according to the number of edges connected to a particular vertex in the network, and is mathematically expressed as follows:

C_D(v) = d_G(v) / (n − 1)    (2.11)

In the formula, d_G(v) is the degree of vertex v and n is the total number of vertices in the network.

ii. Betweenness Centrality: This measure is based on the number of shortest paths that pass through a vertex. For instance, the betweenness of a vertex i is the fraction of geodesic paths between pairs of vertices of the network that happen to pass through i. In case of more than one shortest path between a pair of vertices, each path is given an equal weight such that their sum is equal to one (Newman, 2003). Assuming that g_i^(jk) is the number of geodesic paths from vertex j to vertex k that pass through i, that n_jk is the total number of geodesic paths from vertex j to vertex k, and that n is the total number of vertices in the network, the betweenness of vertex i is computed as follows:

b_i = ( Σ_{j<k} g_i^(jk) / n_jk ) / ( ½ n(n−1) )    (2.12)

With the betweenness measure, the extent to which a node has control over the information that flows between others can be estimated.

iii. Closeness Centrality: This measure is defined as the average geodesic distance, i.e., the average shortest path, between a vertex and all the other vertices that are reachable from it. By measuring a vertex’s closeness, we can measure how long it will take to spread information from this particular vertex to the other vertices in the network (Freeman, 1978). Closeness centrality can be mathematically expressed as follows:

C_C(i) = 1 / Σ_{j∈V\{i}} g(i,j)    (2.13)

In the formula, V represents the total set of vertices of the network and g(i,j) is the distance of the geodesic path between vertices i and j.

iv. Eigenvector Centrality: This measure weights the contacts of a node according to their own centralities, taking into account the whole pattern of the network and computing the weighted sum of both direct and indirect connections of every length. Therefore, having the graph G(V,E), the adjacency matrix A, λ as the largest eigenvalue of A, and n as the number of vertices, the eigenvector centrality x_i of node i can be expressed as follows (Bonacich, 2007):

λ x_i = Σ_{j=1}^{n} a_ij x_j ,  i = 1, ..., n    (2.14)

v. Clustering Coefficient: As a measure for transitivity, Watts & Strogatz (1998) introduced the clustering coefficient. This coefficient measures the degree to which the neighbours of a vertex are themselves connected to one another, and it can be globally expressed as follows (Kaiser, 2008):

C = Σ_{i∈V} Γ_i / Σ_{i∈V} d_G(i)(d_G(i) − 1)    (2.15)

In the formula, i is a vertex of graph G, which has V as its set of vertices, d_G(i) is the degree of i, and Γ_i is the number of edges between the neighbours of vertex i. The above global definition of the clustering coefficient is obtained through the computation of a local clustering coefficient which, for undirected graphs, is defined as in Equation 2.16 and, for directed graphs, as expressed in Equation 2.17:

C(i) = 2|e_jk| / ( d_G(i)(d_G(i) − 1) )    (2.16)

C(i) = |e_jk| / ( d_G(i)(d_G(i) − 1) )    (2.17)

In both formulas, i, j and k are vertices of graph G, d_G(i) is the degree of i, and |e_jk| represents the total number of existing edges between the neighbours of vertex i.

vi. Average Path Length: This measure determines the distance between any pair of vertices, and it can be used to determine if the graph is characteristic of a social network (Réka & Barabási, 2002). It is computed as the average length over all shortest paths between pairs of vertices (Luciano et al., 2006), and it can be mathematically expressed as follows:

⟨L⟩ = (1 / (n(n−1))) Σ_{i,k∈V} g_ik    (2.18)

In the formula, V is the set of vertices in the network, g_ik represents the distance of the geodesic path between vertices i and k, and n represents the total number of vertices in the graph.

To compute these network centrality measures, some readily available open-source libraries can be used. These include:

i. Gephi (http://gephi.org/developers/) (Bastian et al., 2009): A Java library for social network analysis and data visualization;

ii. NetworkX (http://networkx.lanl.gov/) (Hagberg et al., 2008): A Python library to create, manipulate and analyze complex networks;

iii. Network Workbench (http://nwb.cns.iu.edu/): A Java framework for large-scale network analysis and data visualization;

iv. iGraph (http://igraph.sourceforge.net/): A C library for graph analysis which integrates with the R package (http://www.r-project.org/) for data visualization and statistical computing, which also provides other methods for social network analysis;

v. CiteSpace (http://cluster.cis.drexel.edu/~cchen/citespace/) (Chen, 2006): A Java application for visualizing and analyzing trends and patterns in scientific literature;

vi. NetKit-SRL (http://netkit-srl.sourceforge.net/) (Macskassy & Provost, 2007): A set of Java packages which provide an implementation of several graph centrality measures;

vii. CytoScape (http://www.cytoscape.org/) (Shannon et al., 2003): A Java software platform for visualization, which also provides network analysis via plugins.
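For illustration, the simplest of these measures can also be computed directly, without any of the above libraries. The following self-contained sketch implements degree centrality (Equation 2.11), closeness centrality (Equation 2.13) and the local clustering coefficient (Equation 2.16) on a toy undirected graph; the function names, vertex names and adjacency structure are illustrative:

```python
from collections import deque

def degree_centrality(adj, v):
    """Eq. 2.11: degree of v divided by n - 1."""
    return len(adj[v]) / (len(adj) - 1)

def closeness_centrality(adj, v):
    """Eq. 2.13: reciprocal of the sum of geodesic distances from v
    to every other vertex (the graph is assumed to be connected)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search computes the geodesic distances
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return 1 / sum(d for node, d in dist.items() if node != v)

def local_clustering(adj, v):
    """Eq. 2.16 (undirected case): fraction of pairs of neighbours
    of v that are themselves connected."""
    neigh = list(adj[v])
    d = len(neigh)
    if d < 2:
        return 0.0
    links = sum(1 for i in range(d) for j in range(i + 1, d)
                if neigh[j] in adj[neigh[i]])
    return 2 * links / (d * (d - 1))

# A toy graph: a triangle a-b-c, plus a pendant vertex d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
assert degree_centrality(adj, "a") == 1.0       # a is connected to all others
assert closeness_centrality(adj, "a") == 1 / 3  # distances 1 + 1 + 1
assert local_clustering(adj, "b") == 1.0        # b's neighbours a, c are linked
```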

As for bibliometric indexes, some of the most widely used are as follows:

i. The h-index and its variants: Proposed by Hirsch (2010) to quantitatively represent the output of a researcher, this index measures the productivity and total impact of a scientist, supporting comparisons between scientists of different ages (Hirsch, 2010). A researcher has an h-index of h if h of his/her N_p papers (N_p being the total number of published papers) have at least h citations each, and the other (N_p − h) papers have ≤ h citations each.
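The definition above translates directly into code; this is a minimal sketch, with illustrative citation counts:

```python
def h_index(citations):
    """A paper list has h-index h if h papers received >= h citations
    each and the remaining papers received <= h citations each."""
    cits = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cits, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

assert h_index([10, 8, 5, 4, 3]) == 4   # four papers with >= 4 citations each
assert h_index([25, 8, 5, 3, 3]) == 3   # one highly cited paper does not help
assert h_index([0, 0]) == 0
```

Note how, in the second example, the heavily cited first paper does not raise the index: this loss of citation information is precisely one of the drawbacks that the variants discussed next try to address.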

Several variants of this metric have been proposed, in order to deal with some of the problems of the original h-index. One such extension is the contemporary h-index (Sidiropoulos et al., 2007), which takes into account the age of an article and allows us to acknowledge the work of young promising scientists and of senior scientists who happen to still be active. The contemporary h-index is based on a score S^c(i), computed for each article i as follows:

S^c(i) = γ · (Y(now) − Y(i) + 1)^{−δ} · |C(i)|    (2.19)

In the formula, Y(i) represents the year of publication of article i, and C(i) represents the set of articles that cite article i (so |C(i)| is its citation count). The parameter δ is set to 1, so that S^c(i) is the total number of citations received by article i, divided by the age of the article. Since this parameter alone makes the score S^c(i) too small, the coefficient γ is set to 4: the citations of an article published in the current year count four times and, consequently, the citations of an article published 4 years ago count only once. With this approach, as time goes by, older articles gradually lose their value.

In brief, a researcher has a contemporary h-index of h^c if h^c of his/her N_p articles have a score of S^c(i) ≥ h^c each, and the remaining (N_p − h^c) articles each have a score of S^c(i) ≤ h^c.
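Under the definitions above (with γ = 4 and δ = 1), the contemporary score and the resulting index can be sketched as follows; the function names and example years are illustrative:

```python
def contemporary_score(year_now, year_pub, n_citations, gamma=4, delta=1):
    """Eq. 2.19: S^c(i) = gamma * (Y(now) - Y(i) + 1)^(-delta) * |C(i)|."""
    return gamma * (year_now - year_pub + 1) ** (-delta) * n_citations

def contemporary_h_index(scores):
    """h^c articles have S^c(i) >= h^c each; the rest have <= h^c each."""
    ranked = sorted(scores, reverse=True)
    hc = 0
    for rank, s in enumerate(ranked, start=1):
        if s >= rank:
            hc = rank
    return hc

# Citations to a current-year article count four times ...
assert contemporary_score(2012, 2012, 10) == 40.0
# ... while an article whose age term equals 4 has them counted once.
assert contemporary_score(2012, 2009, 10) == 10.0
assert contemporary_h_index([40.0, 10.0, 2.0]) == 2
```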

Another variant is the trend h-index, which addresses the fact that the h-index does not take into account the age of citations (Sidiropoulos et al., 2007). Articles that continue to be cited along the years indicate that the topic/solution is still up to date and that the respective scientist can be an influential mind, who still has an impact on younger scientists. As an article is continually cited, we can also be in the presence of a trend-setter, i.e., a scientist whose work is, in some way, pioneering and/or who is currently working on something that is considered trendy. Hence, the trend h-index, with γ, δ and Y(i) as defined in Equation 2.19, is based on the value of:

S^t(i) = γ · Σ_{∀x∈C(i)} (Y(now) − Y(x) + 1)^{−δ}    (2.20)

In brief, a researcher has a trend h-index of h^t if h^t of his/her N_p articles have a score of S^t(i) ≥ h^t each, and the remaining (N_p − h^t) articles each have a score of S^t(i) ≤ h^t.

There is also the normalized h-index, which mitigates the fact that scientists from different research areas do not publish the same number of articles, providing a fairer h-index metric (Sidiropoulos et al., 2007). A researcher has a normalized h-index of h^n = h/N_p if h of his/her N_p articles have received at least h citations each, and the remaining (N_p − h) articles have received no more than h citations each.

Recent work developed by Devezas et al. (2011) applied the h-index to the task of ranking web blogs. Analogously to Bibliometrics, blogs can be seen as the authors, and the posts as the papers published by them. Therefore, a blog has an h-index of h if h of its N posts have at least h inlinks each and the remaining (N − h) posts have no more than h inlinks each. The h-index turned out to be a more balanced metric for assessing the importance of a blog, compared to using the indegree.

ii. The g-index: This index is an improvement over the h-index, measuring the global citation performance of a list of articles (Egghe, 2006). A set of papers has a g-index of g if g is the highest unique rank such that the top g papers have, together, at least g² citations. This is valid if the list of articles is sorted in decreasing order of the number of citations received by each article, and the top g + 1 papers have, together, less than (g + 1)² citations. Thus, with α > 2 denoting the Lotkaian exponent (Lotka, 1926) and with T denoting the total number of sources, i.e., articles, the g-index can be mathematically expressed as follows:

g = ((α − 1)/(α − 2))^{(α−1)/α} · T^{1/α} ,  with  α = 1 + ln(growth rate of sources) / ln(growth rate of items)    (2.21)

In the formula, the sources are the scientific articles and the items are the citations between those articles (Egghe, 2009).

iii. The a-index: The a-index is a derived index, dependent on the h-index and on a constant a, ranging between 3 and 5, that helps us to better understand the relation between the total number of citations (N_c,tot) and the h-index (Zhang, 2009). The a-index allows us to describe the magnitude of the hit contributions of individual scientists, and is defined as follows (Sidiropoulos et al., 2007):

N_c,tot = a h²    (2.22)

This index can be used as a secondary metric to rank and evaluate scientists, due to the fact that h² underestimates the N_c,tot of the h most cited papers, which is usually greater than h², and disregards the papers that have fewer than h citations (Hirsch, 2010).

iv. The e-index: This metric was proposed by Zhang (2009) to address two specific drawbacks of the original h-index, namely:

• Loss of citation information - excess citations are ignored, making the comparisons based only on the h-index misleading;

• Low resolution - the h-index takes natural numbers as values, instead of real numbers, hence confining the results to a relatively narrow range.

The e-index can formally be defined as follows:

e² = Σ_{j=1}^{h} cit_j − h²    (2.23)

In the formula, cit_j is the number of citations received by the j-th paper, and the value e² is expressed as a real number. This index is also related to the aforementioned a-index in the following way:

a = h + e²/h    (2.24)

v. The ISI Impact Factor: This index measures the popularity of a journal with reference to a specific year. It is defined as the mean number of citations that occurred in the specified year to articles that were published in the journal during the prior two years (Bollen et al., 2006).

IF(v_i, t) = Σ_j c(v_j, v_i, t) / n(v_i)    (2.25)

In the formula, c(v_j, v_i, t) is the number of citations from journal v_j to journal v_i in year t, and n(v_i) corresponds to the number of publications in journal v_i during the two years prior to t, which ends up normalizing the resulting citation count into a mean 2-year citation rate (Bollen et al., 2006).
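The g-index, e-index and a-index defined above can likewise be computed from a plain list of citation counts. This is a sketch under the definitions in Equations 2.23 and 2.24, with an illustrative citation record:

```python
def g_index(citations):
    """g-index (Egghe): largest rank g such that the g most cited
    papers together received at least g^2 citations."""
    cits = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cits, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

def e_index(citations):
    """Eq. 2.23: e^2 is the citation excess of the h core over h^2.
    Returns the pair (e, h)."""
    cits = sorted(citations, reverse=True)
    h = sum(1 for rank, c in enumerate(cits, start=1) if c >= rank)
    return (sum(cits[:h]) - h * h) ** 0.5, h

def a_index(citations):
    """Eq. 2.24: a = h + e^2 / h."""
    e, h = e_index(citations)
    return h + e * e / h

cits = [10, 8, 5, 4, 3]            # an illustrative citation record
assert g_index(cits) == 5          # top 5 papers: 30 >= 25 citations
e, h = e_index(cits)
assert h == 4 and abs(e * e - 11) < 1e-9   # h core 10+8+5+4 = 27; 27 - 16 = 11
assert abs(a_index(cits) - 6.75) < 1e-9    # 4 + 11/4
```

Note how the same record has h = 4 but g = 5: the g-index rewards the excess citations of the most cited papers, which the h-index ignores.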

vi. The Y-Factor: Due to the fact that there can be discrepancies between the values of the ISI Impact Factor and of the Weighted PageRank, introduced in Section 3.2.1 (i.e., a journal may have a high ISI Impact Factor, but a low Weighted PageRank value), this measure results from the multiplication of both these values. The Y-factor of journal v_j can be mathematically expressed as follows:

Y(v_j) = IF(v_j) × PR_w(v_j)    (2.26)

When assessing the authority of an individual, it is important to use these measures not only individually, but also in combination, since scientific impact can be seen as a multi-dimensional construct (Bollen et al., 2009).

2.9 Unsupervised Rank Aggregation Approaches

Given that each metric introduced in the previous section produces an ordering for the nodes in a graph, we can leverage rank aggregation methods from social choice theory (i.e., voting protocols) to combine the individual rankings.

In the realm of voting protocols, we consider that there are voters who submit votes over their favorite alternatives, i.e., the candidates. Determining the winner, or the best ordering of the candidates, requires aggregating the rankings of all voters. This process depends on the voting rule that is used, and it can be defined as follows: let C be the set of candidates, R(C) the set of all possible rankings of the candidates, and n the number of voters. A voting rule is a mapping from R(C)^n to C, if one wishes to produce a winner, and from R(C)^n to R(C), if one wishes to produce an aggregate ranking (Conitzer, 2006a). The most common voting rules are as follows:

i. Scoring Rules - Borda Rule, Plurality Rule and Veto Rule: Let α⃗ = ⟨α_1, ..., α_m⟩ be a vector of integers. For each voter, α_1 is the number of points that a candidate gets if the voter ranks him first, α_2 the number of points that candidate gets if the voter ranks him second, and so on.

With the Plurality Rule, candidates are ranked simply in terms of how often voters have ranked them in first place, thus having a system of scores corresponding to α⃗ = ⟨1, 0, ..., 0⟩. With this rule, it is irrelevant how voters rank the candidates below the top candidate.

The Veto Rule is the opposite of the Plurality Rule, because it is based on a system of scores with α⃗ = ⟨1, 1, ..., 1, 0⟩, i.e., it only takes into account how often a candidate is not ranked in last place. As such, each voter vetoes a single candidate, and the least vetoed candidate wins the election (Procaccia et al., 2006).

The Borda Rule is based on a system of scores with α⃗ = ⟨m − 1, m − 2, ..., 0⟩, which means that a candidate obtains m − 1 points for the first position in the preference of a voter, m − 2 points for the second position, and so forth, with m representing the total number of candidates. The candidate who accumulates the maximum number of points from all voters is the winner (Kiselev, 2008).

ii. Single Transferable Vote (STV): This is a method to calculate the result of an election with the guarantee of proportional representation, under reasonable conditions, for the sets of voters who share a set of most preferred candidates (Geller, 2002). Running through m − 1 rounds, the STV voting rule is based upon three principles:

• Order of preference - the candidates are listed in ordinal preference by the voters (i.e., in descending order).

• Quota - the number of votes needed for a candidate to win the election must be calculated in the following way:

q = ⌊ |V| / (e + 1) ⌋ + 1    (2.27)

In the formula, |V | represents the total number of voters and e the number of seats available in the election (i.e., the number of candidates to elect). In each round, if a candidate gets a greater number of votes than the quota, that candidate is automatically elected.

• Transfer - When a candidate c is elected and there are still seats to be filled, the surplus of the votes of the newly elected candidate must be redistributed to each voter’s next ranked candidate. The transfer value f_c takes into account the quota q and the number of votes w_c that candidate c has. It is computed as follows:

f_c = (w_c − q) / w_c    (2.28)

When, in a given round, the top voted candidate does not have enough votes to be elected (i.e., his total number of votes is less than the quota), the last placed candidate is eliminated, and that candidate’s votes are redistributed to the next highest ranked candidate of each voter for whom the recently eliminated candidate was the top preference.
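As a worked illustration of the quota (Equation 2.27) and the transfer value (Equation 2.28), consider the following sketch; the function names and the election sizes are illustrative:

```python
def droop_quota(num_voters, num_seats):
    """Eq. 2.27: q = floor(|V| / (e + 1)) + 1."""
    return num_voters // (num_seats + 1) + 1

def transfer_value(votes, quota):
    """Eq. 2.28: f_c = (w_c - q) / w_c, the fraction of each vote for
    an elected candidate c that is passed on to the voter's next
    ranked candidate."""
    return (votes - quota) / votes

# 100 voters filling 2 seats: a candidate needs 34 votes to be elected.
q = droop_quota(100, 2)
assert q == 34
# A candidate elected with 50 votes passes on 16/50 of each of them.
assert abs(transfer_value(50, q) - 0.32) < 1e-12
```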

The flowchart in Figure 2.3 depicts the steps taken in each round, in order to conclude the election. It is based on the additional information provided by an online simulation of the Single Transferable Vote system (http://stv.humancube.com/).

Figure 2.3: Flowchart for the Single Transferable Vote rule.

iii. Plurality Rule with Run-Off: This rule proceeds in two rounds. In the first, all candidates are eliminated, except the ones with the highest plurality scores, i.e., the candidates with the first and second highest numbers of votes in the election. Then, as in the STV rule, all the votes are transferred to these two candidates. The second round, which is called the run-off, is used to decide the final winner of the election, from the two remaining candidates. All candidates are ranked according to their plurality scores, except the top two, whose relative ranking is determined according to the results of the second round.

iv. Maximin: Letting N(c_1, c_2) be the number of votes that show a preference for candidate c_1 over candidate c_2, the maximin score (also known as the Simpson score) assigned to a candidate c_1 is as follows:

s(c_1) = min_{c_2≠c_1} N(c_1, c_2)    (2.29)

In the formula, s(c_1) is the worst score of candidate c_1 in a pairwise election. As all candidates are ranked by their scores, the winner of the election is the candidate with the highest maximin score.

v. Copeland: For any two candidates c_1 and c_2, we simulate a pairwise election, so we can determine how many voters prefer c_1 over c_2 and how many prefer c_2 over c_1 (Xia et al., 2011). All candidates are ranked by their score, and they gain or lose a Copeland point for, respectively, each pairwise election they win or lose (Conitzer & Sandholm, 2005). If there is a tie, Copeland points are also assigned to the candidates. Therefore, for a pairwise election between candidates c_1 and c_2, a score is assigned according to the following procedure:

C(c_1, c_2) = 1, if N(c_1, c_2) > N(c_2, c_1);  1/2, if N(c_1, c_2) = N(c_2, c_1);  0, if N(c_1, c_2) < N(c_2, c_1)    (2.30)

Then, the Copeland Score of candidate c1 is given by:

s(c_1) = Σ_{c_2≠c_1} C(c_1, c_2)    (2.31)

The candidate who has the highest score wins the election.
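The positional and pairwise rules described so far can be illustrated on a small profile of ballots; the candidates and the ballots below are, of course, illustrative:

```python
from itertools import combinations

# Five ballots, each ranking three candidates, most preferred first:
ballots = [
    ["a", "b", "c"], ["a", "b", "c"], ["b", "c", "a"],
    ["b", "a", "c"], ["c", "a", "b"],
]
candidates = ["a", "b", "c"]
m = len(candidates)

# Borda rule: m - 1 points for a first place, m - 2 for a second, ...
borda = {c: 0 for c in candidates}
for ballot in ballots:
    for pos, c in enumerate(ballot):
        borda[c] += m - 1 - pos
assert borda == {"a": 6, "b": 6, "c": 3}     # a and b tie under Borda

# N[x][y]: number of voters preferring candidate x over candidate y.
N = {x: {y: 0 for y in candidates} for x in candidates}
for ballot in ballots:
    for i, x in enumerate(ballot):
        for y in ballot[i + 1:]:
            N[x][y] += 1

# Maximin (Eq. 2.29): each candidate's worst pairwise support.
maximin = {x: min(N[x][y] for y in candidates if y != x) for x in candidates}
assert maximin == {"a": 3, "b": 2, "c": 1}   # the Borda tie is broken: a wins

# Copeland (Eqs. 2.30-2.31): one point per pairwise win, half per tie.
copeland = {c: 0.0 for c in candidates}
for x, y in combinations(candidates, 2):
    if N[x][y] > N[y][x]:
        copeland[x] += 1
    elif N[x][y] < N[y][x]:
        copeland[y] += 1
    else:
        copeland[x] += 0.5
        copeland[y] += 0.5
assert copeland == {"a": 2.0, "b": 1.0, "c": 0.0}
```

The example also shows why one may want several rules: Borda leaves a and b tied, while the pairwise rules both rank a first.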

vi. Bucklin: The Bucklin score of a candidate c is the smallest number l_c such that more than half of the voters rank c among the top l_c positions, i.e., such that B(l_c) > n/2, where B(l_c) is the number of voters ranking c among the top l_c positions (Xia et al., 2011). The winner is the candidate with the lowest Bucklin score. All candidates are ranked in inverse order of l_c and, if there is a tie, B(l_c) is used as a tie-breaker.

vii. Slater: In the Slater voting rule, we choose a ranking of candidates that is inconsistent with the outcomes of as few pairwise elections as possible (Conitzer, 2006b). An inconsistency corresponds to the case in which, for a pair of candidates c_1 and c_2, c_1 is ranked higher than c_2, but c_2 defeats c_1 in their pairwise election. Therefore, the intent of the Slater ranking is to minimize such inconsistencies.

viii. Kemeny: Similarly to the Slater rule, a ranking is a Kemeny ranking if it minimizes the number of inconsistencies. However, this rule produces a ranking that aims at minimizing the number of times that the ranking is inconsistent with an individual vote on the relative order of two candidates. Therefore, an inconsistency in the terminology of the Kemeny ranking is defined as follows: given the aggregate ranking r, a pair of candidates (c_1, c_2), and the vote of a voter r_a, we have an inconsistency if r_a ranks c_1 higher than c_2, but r ranks c_2 higher than c_1.

ix. Cup and its variants: The Cup rule runs a single-elimination contest to decide which candidate wins the election. It does not produce a full aggregate ranking of the candidates, and it requires an additional schedule for matching up the remaining candidates. The rule is defined by a balanced binary tree T, where each candidate is assigned to a leaf through the aforementioned schedule. To each of the remaining non-leaf nodes is assigned the winner of the pairwise election between that node’s children. There is a winner when a candidate is assigned to the root node.

As for the Cup rule’s variations, we have the regular cup, which assumes that all voters know to which leaf each candidate is assigned prior to their voting, and the randomized cup, in which the assignment of candidates to leaves is chosen uniformly at random after the voting. Votes can be weighted, and there can thus be a different interpretation of the weight, such that it represents the decision power of a voting agent in a setting where not all agents are considered equal, e.g., a weight of K counting as K votes of weight 1.

2.10 Supervised Learning for Rank Aggregation

In the previous section, we presented unsupervised techniques to perform rank aggregation. Nevertheless, one can also use supervised learning techniques to address this task. To this end, Learning to Rank (L2R) has emerged as a way of using machine learning techniques for rank aggregation (Li, 2011).

In L2R, there are two general phases, namely, learning and ranking. The learning phase takes training data as input, which corresponds to ranked lists of objects, with each object being described by a set of features (i.e., a set of simple ranking measures that we want to combine). Given a new set of objects, one aims at predicting the best possible ranking, combining the available information. Figure 2.4 illustrates the general framework.

Figure 2.4: The Learning-To-Rank (L2R) framework (adapted from Liu (2009)).

Learning-to-Rank methods can be categorized according to three different types of approaches, namely, pointwise, pairwise, and listwise (Li, 2011; Liu, 2009).

In the pointwise approach, the ranking problem is transformed into a classification, regression or ordinal classification problem. The input space contains each object's feature vector, while the output space contains the ranking order predicted for each object (Liu, 2009). The loss function is said to be pointwise because it is defined on a single object's feature vector (Li, 2011), and it inspects the ground-truth ranking order of each single object. The hypothesis space in a pointwise approach contains the functions that take the feature vector of an object as input and predict the ranking order of that same object (Li, 2011).

In the pairwise approach, the ranking problem is transformed into a pairwise classification problem, i.e., one classifies a given pair of objects according to whether the pair is in the correct ranking order or not. In this approach, the loss function is pairwise, as it is defined on a pair of feature vectors.

The listwise approach takes ranked lists of objects as instances and, unlike the aforementioned ap- proaches, it maintains the group structure of the ranked lists. This approach also learns a ranking model from the given training data, which can later assign scores to feature vectors, and then ranks these feature vectors using those scores.

One particular supervised listwise ranking method is CRanking (Lebanon & Lafferty, 2002), which applies the following probabilistic model:

\[ P(\pi \mid \theta, \Sigma) = \frac{1}{Z(\theta, \Sigma)} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right) \tag{2.32} \]

In the formula, π is the final ranking, Σ = (σ_1, ..., σ_k) are the basic rankings being combined, d is a distance between two rankings (e.g., Kendall's τ), and θ = (θ_1, ..., θ_k) is a vector of weighting parameters. Z is a normalization factor over all the possible rankings, and can be defined as follows:

\[ Z(\theta, \Sigma) = \sum_{\pi} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right) \tag{2.33} \]

When learning, the algorithm is given S = \{(\Sigma_i, \pi_i)\}_{i=1}^{m} as training data, in order to build a model for rank aggregation. Maximum Likelihood Estimation is used to learn the model's parameters. Considering that, in the training data, both the final ranking and the basic rankings are full ranking lists, the likelihood function can be computed as follows:

\[ L(\theta) = \sum_{i=1}^{m} \log \frac{\exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi_i, \sigma_{i,j}) \right)}{\sum_{\pi \in Q} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_{i,j}) \right)} \tag{2.34} \]

For the final step of prediction, the algorithm is given the learned model and the basic rankings Σ. Then, the probability distribution P(π|θ, Σ) over final rankings is calculated, in order to be later used when computing the expected rank of each object. Objects are finally sorted according to their expected rank, the latter being defined as follows:

\[ E(\pi(i) \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot P(\pi(i) = r \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot \sum_{\pi \in Q,\, \pi(i) = r} P(\pi \mid \theta, \Sigma) \tag{2.35} \]
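For small n, this prediction step can be sketched by brute force in Python. The toy basic rankings and the negative weights θ below are assumptions made for illustration (with negative weights, rankings close to the basic ones receive higher probability):

```python
# Brute-force sketch of the CRanking prediction step: enumerate all
# permutations, weight each by Equation 2.32 with Kendall's tau as the
# distance d, and compute the expected rank of each object (Equation 2.35).
from itertools import permutations
from math import exp

def kendall_tau(pi, sigma):
    """Number of object pairs that the two rankings order differently."""
    n = len(pi)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (pi.index(i) < pi.index(j)) != (sigma.index(i) < sigma.index(j)))

def expected_ranks(basic, theta):
    n = len(basic[0])
    rankings = list(permutations(range(n)))
    weights = [exp(sum(t * kendall_tau(pi, s) for t, s in zip(theta, basic)))
               for pi in rankings]
    z = sum(weights)  # the normalization factor of Equation 2.33
    return [sum((pi.index(o) + 1) * w / z for pi, w in zip(rankings, weights))
            for o in range(n)]

# Two basic rankings that agree on placing object 0 first
ranks = expected_ranks([(0, 1, 2), (0, 2, 1)], theta=[-1.0, -0.5])
```

Since both basic rankings place object 0 first, object 0 obtains the smallest expected rank, and the expected ranks always sum to n(n+1)/2.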

2.11 Summary

In this chapter, the fundamental concepts regarding the tasks of characterizing a network and finding the network's most influential nodes were introduced. Broader concepts such as prestige, popularity or recognition were also explored, distinguishing them from what it means to be an influencer. Other related network analysis topics were introduced, namely information cascades and information diffusion models, since the most influential nodes in a network have the capacity to disseminate information through the network at a much faster pace, reaching a greater number of other nodes. Learning-To-Rank and rank aggregation techniques were also introduced as ways of combining different ranking lists to produce a single, global and uniform ranking list.

Chapter 3

Related Work

This chapter presents the most important related work in the context of my MSc thesis. The chapter starts by presenting the HITS algorithm and Google's PageRank algorithm for ranking web pages, discussing how the latter evolved from its original formulation to more detailed and specific approaches, such as the Weighted PageRank algorithm and the Topic-Sensitive PageRank algorithm. Then, the chapter introduces the IP Algorithm, a recent development that extends the benefits of PageRank and determines the influence and passivity of network nodes based on their capacity to forward information. In the specific realm of Twitter, we present TwitterRank, an approach to measure the influence of a Twitter user based on the principle of homophily regarding the topics that users write about. Finally, we take a deeper look at the work that has been done in Bibliometrics in order to find influencers in citation and co-authorship networks, also describing works that take into account the temporal evolution of graphs.

3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm

The HITS algorithm, a Web page ranking method developed by Kleinberg (Kleinberg, 1998), is based on the notion of authorities and hubs. The authorities, i.e., pages that have a great number of inlinks, have a mutually reinforcing relationship with the hubs, i.e., pages that have outlinks to many related authorities, in such a way that a good hub is a page that points to many good authorities, and a good authority is a page that is pointed to by many good hubs – see Figure 3.5. This relationship is put into use through the iterative procedure shown in Algorithm 1, which maintains and updates the weights of each page (Kleinberg, 1998).

Figure 3.5: A graph with hubs and authorities (adapted from Kleinberg (1998)).

Algorithm 1 The Hyperlinked Induced Topic Search (HITS) Algorithm
  G: a graph with n interlinked pages
  k: a constant corresponding to the number of iterations
  z: the vector (1, 1, 1, ..., 1) ∈ R^n
  Set x_0 := z
  Set y_0 := z
  for i = 1, 2, ..., k do
      Apply x_p = Σ_{q: q→p} y_q to (x_{i−1}, y_{i−1}), obtaining new x-weights x′_i
      Apply y_p = Σ_{q: p→q} x_q to (x′_i, y_{i−1}), obtaining new y-weights y′_i
      Normalize x′_i, obtaining new authority scores x_i
      Normalize y′_i, obtaining new hub scores y_i
  end for

In order to compute the HITS algorithm, the aforementioned Gephi^1, NetworkX^2 and Network Workbench^3 software packages can be used.
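Since Algorithm 1 is short, it can also be transcribed directly; the following Python sketch uses a small hypothetical graph, given as (source, target) edges, and L2 normalization:

```python
# Plain-Python transcription of Algorithm 1 (HITS): x holds authority
# scores, y holds hub scores, both refined over k iterations.
def hits(edges, k=50):
    nodes = sorted({n for e in edges for n in e})
    x = {n: 1.0 for n in nodes}  # authority weights
    y = {n: 1.0 for n in nodes}  # hub weights
    for _ in range(k):
        x_new = {p: sum(y[q] for (q, t) in edges if t == p) for p in nodes}
        y_new = {p: sum(x_new[t] for (q, t) in edges if q == p) for p in nodes}
        nx = sum(v * v for v in x_new.values()) ** 0.5
        ny = sum(v * v for v in y_new.values()) ** 0.5
        x = {n: v / nx for n, v in x_new.items()}
        y = {n: v / ny for n, v in y_new.items()}
    return x, y

# h1 and h2 point to both authorities; h3 points only to a1
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
auth, hub = hits(edges)
```

Here a1 ends up with the top authority score, since it is pointed to by all three hubs.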

3.2 The PageRank Algorithm and its Variants

The PageRank algorithm arose in the context of the development of Google’s search engine, at the time described as a prototype of a large-scale search engine that made heavy use of the hyperlinked structure of the web (Brin & Page, 1998).

PageRank is based on principles from academic citation analysis, applied to the web. It can be mathe- matically expressed as follows:

\[ PR(A) = \frac{1 - d}{N} + d \sum_{i} \frac{PR(T_i)}{C(T_i)} \tag{3.36} \]

A page A has pages T_1, ..., T_n that point to it (i.e., that cite page A), and C(T_i) is the number of outlinks of page T_i. The term N corresponds to the total number of pages in the network. The free parameter d is called the damping factor and controls the performance of the algorithm, being usually set to 0.85. In a random web surfer scenario, the surfer can restart his search with probability 1 − d, by jumping to a page that is chosen uniformly at random, instead of following a random link, which he does with probability d (Chen et al., 2007). Figure 3.6 depicts the computation of the PageRank score for a three-node network.

1 http://gephi.org/developers/
2 http://networkx.lanl.gov/
3 http://nwb.cns.iu.edu/

Figure 3.6: A graph illustrating the computation of PageRank (adapted from Page et al. (1998)).

From Figure 3.6, one can see that page A has an inlink from page C and two outlinks, to pages B and C. Therefore, page A splits its PageRank score of 0.4 between its two outlinks, transferring a value of 0.2 to each of pages B and C. In turn, page B has a PageRank score of 0.2, which A transferred to it. Because B only has an outlink to C, it transfers its entire PageRank score to page C. Finally, page C, which receives PageRank scores of 0.2 from A and from B, accumulates a PageRank score of 0.4, which is entirely transferred to its only outlink, page A.
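This walk-through can be reproduced with a short Python sketch of Equation 3.36; setting d very close to 1 mimics the idealized, damping-free scores of the figure (the graph and the value d = 0.9999 are choices made for this illustration):

```python
# Power-iteration sketch of Equation 3.36 on the three-page graph of
# Figure 3.6: A -> {B, C}, B -> C, C -> A.
def pagerank(links, d=0.9999, iters=1000):
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iters):
        pr = {p: (1 - d) / n + d * sum(pr[q] / len(links[q])
                                       for q in links if p in links[q])
              for p in links}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(links)  # A and C converge to ~0.4, B to ~0.2
```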

A page can achieve a high PageRank score if it has many other pages pointing to it, i.e., if it is highly cited, or if some of the pages that point to it have themselves a high PageRank score.

Even though PageRank was originally formulated over directed graphs, the works of Perra & Fortunato (2008) and of Mihalcea (2004) revealed that PageRank can also be applied to undirected graphs, in which case every vertex has equal indegree and outdegree.

In the realm of Bibliometrics, PageRank is used as a method complementary to citation analysis, since it mitigates the drawback of citation counts, which do not take into account the importance of the citing papers. PageRank allows us to identify publications that are being referenced by highly cited articles (Ding et al., 2009).

Authors such as Chen et al. (2007) suggested setting d = 0.5, based on the hypothesis that, in the context of citation networks, the entries in the reference list of a typical paper are collected by following citation paths of average length 2. Chen et al.'s justification is based on the empirical observation that about 50% of the articles in the reference list of a paper A have at least one citation following the pattern B → C, in which article C is also part of A's reference list. Thus, the authors assume that there is a feed-forward loop among A, B and C, such that A → B, B → C and, consequently, A → C.

Due to its probabilistic nature, and also to the fact that each node is guaranteed to be visited, PageRank scores are not comparable across different graphs. To mitigate this, Berberich et al. (2006) proposed a normalization of the PageRank scores, which eliminates any dependency on the size of the graph. The normalized PageRank score can be computed as follows:

\[ \widehat{PR}(v) = \frac{PR(v)}{\frac{1}{|V|} \left( d + (1 - d) \sum_{u \in D} PR(u) \right)} \tag{3.37} \]

In the formula, the denominator represents the lower-bound for Equation 3.36, while |V| is the total number of vertices in the graph and D ⊆ V is the set of dangling nodes (i.e., nodes without outlinks).

Alternatively to the random surfer model, and specifically for social phenomena such as epidemics or word-of-mouth recommendation, Ghosh et al. (2011) proposed a broadcast-based non-conservative diffusion model, due to the fact that these phenomena can be modeled as contact processes, in which an active (infected) node activates its neighbours, via broadcast, with some probability. The difference between this model and the random surfer model is that, while the latter conserves the amount of substance being diffused on the network, in the former the amount of information changes as it spreads from an individual to his neighbours. Ghosh et al. (2011) state that PageRank is a steady state solution of conservative diffusion and, therefore, a conservative metric, while Alpha-Centrality, a non-conservative metric that measures the total number of paths from a node, exponentially attenuated by their length, is a steady state solution of linear non-conservative diffusion. In their study, the authors propose an efficient algorithm for computing Alpha-Centrality.

To compute the PageRank algorithm, we can use some readily available open-source software libraries, such as the aforementioned Gephi, NetworkX and Network Workbench packages, or the LAW WebGraph^1 Java library for large-scale web graph analysis (Boldi & Vigna, 2004).

3.2.1 Weighted PageRank

In the original PageRank algorithm from Equation 3.36, we have no notion of hyperlink weight, and thus all hyperlinks express the same degree of relationship between the pages they link (Bollen et al., 2006). However, in many practical applications, not all links express the same type of relationship.

Acknowledging that some links in a web page may be more important than others, Xing & Ghorbani (2004) proposed a Weighted PageRank algorithm that assigns higher scores to more important links, instead of the traditional even division among the outlinks of a page. Each link is assigned a value that is proportional to the popularity of the destination node, i.e., proportional to its numbers of inlinks and outlinks.

1 http://webgraph.dsi.unimi.it/

In this approach, there is an inlink weight W^in_(v,u) and an outlink weight W^out_(v,u). The inlink weight of link (v, u) is based on the number of inlinks of page u and on the number of inlinks of all the pages that are referenced by page v. The outlink weight is analogous. They are calculated as follows:

\[ W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p} \qquad W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p} \tag{3.38} \]

In the formulas, I_u and I_p represent, respectively, the numbers of inlinks of pages u and p, while O_u and O_p represent the numbers of outlinks of pages u and p. R(v) is the set of pages referenced by page v. Considering the introduction of these two weights, the Weighted PageRank algorithm can be mathematically expressed as follows:

\[ PR(u) = (1 - d) + d \sum_{v \in B(u)} PR(v) \, W^{in}_{(v,u)} W^{out}_{(v,u)} \tag{3.39} \]

In the formula, B(u) is the set of pages that link to page u.

The studies conducted within the work of Xing and Ghorbani revealed that their Weighted PageRank algorithm has a better performance than the original PageRank.
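A sketch of this computation in Python, over a small hypothetical four-edge graph, could look as follows (the graph and the number of iterations are assumptions made for the example):

```python
# Sketch of the Weighted PageRank of Xing & Ghorbani (Equations 3.38 and
# 3.39): each edge (v, u) carries inlink and outlink weights derived from
# the popularity of the pages in v's reference set R(v).
def weighted_pagerank(edges, d=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [v for (u, v) in edges if u == n] for n in nodes}
    inl = {n: [u for (u, v) in edges if v == n] for n in nodes}

    def w_in(v, u):   # Equation 3.38, inlink weight of edge (v, u)
        return len(inl[u]) / sum(len(inl[p]) for p in out[v])

    def w_out(v, u):  # Equation 3.38, outlink weight of edge (v, u)
        return len(out[u]) / sum(len(out[p]) for p in out[v])

    pr = {n: 1.0 for n in nodes}
    for _ in range(iters):  # Equation 3.39
        pr = {u: (1 - d) + d * sum(pr[v] * w_in(v, u) * w_out(v, u)
                                   for v in inl[u])
              for u in pr}
    return pr

pr = weighted_pagerank([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])
```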

Fiala et al. (2008) also proposed modifications to the original PageRank algorithm, adapting it to bibliographic networks. The authors take into account both citation and co-authorship information: in a graph whose nodes correspond to the authors of the papers, each edge (u, v) ∈ E, with E the set of edges between the vertices, is associated with weights w_{u,v}, c_{u,v} and b_{u,v}. The value w_{u,v} is the number of citations from author u to author v, the value c_{u,v} is the number of common publications by u and v, and b_{u,v} can assume different values, depending on the semantics of the edge weights that we want to stress. The new ranking for authors is defined as follows:

\[ R(u) = \frac{1-d}{|A|} + d \sum_{(v,u) \in E} R(v) \cdot \frac{\frac{w_{v,u}}{\sum_{(v,j) \in E} w_{v,j}} \cdot \frac{c_{v,u}+1}{b_{v,u}+1}}{\sum_{(v,k) \in E} \frac{w_{v,k}}{\sum_{(v,j) \in E} w_{v,j}} \cdot \frac{c_{v,k}+1}{b_{v,k}+1}} \tag{3.40} \]

In the formula, |A| is the number of vertices (i.e., the number of authors) and d is a damping factor, empirically set to d = 0.9.

In this approach, the Weighted PageRank algorithm is recovered if, in Equation 3.40, the coefficients b and c equal zero.

Bollen et al. (2006), when applying the Weighted PageRank algorithm to journal citation networks, took into account journal citation frequencies in the transfer of PageRank values, so that the prestige of a journal can be accordingly transferred along the iterations of the algorithm. They referred to this transferred value as the Propagation Proportion between journals and defined it as follows:

\[ w(v_j, v_i) = \frac{W(v_j, v_i)}{\sum_{k} W(v_j, v_k)} \tag{3.41} \]

In the formula, W(v_j, v_i) is the weight of the link between journals v_j and v_i, normalized by the weights of journal v_j's outlinks. In the application of the Weighted PageRank algorithm described by Bollen et al. (2006), the number of outlinks C(T_i) from Equation 3.36 has been replaced with the Propagation Proportion, resulting in the following equation:

\[ PR_w(v_i) = \frac{1 - d}{N} + d \sum_{j} PR_w(v_j) \times w(v_j, v_i) \tag{3.42} \]

On the other hand, within the work of Yan & Ding(2011), citation counts are incorporated with the network topology, resulting in the following integrated Weighted PageRank algorithm:

\[ PR_w(p) = (1 - d) \frac{CC(p)}{\sum_{j=1}^{N} CC(p_j)} + d \sum_{i=1}^{k} \frac{PR_w(p_i)}{C(p_i)} \tag{3.43} \]

In the formula, CC(p) represents the number of citations pointing to an author p, \sum_{j=1}^{N} CC(p_j) is the sum of the citation counts of all the nodes in the network, and the (1 − d) term, as in previous PageRank definitions, ensures that the results sum up to one. Yan & Ding (2011) pointed out two extreme scenarios regarding the variation of d. If d = 0, then each node's score equals CC(p) / \sum_{j=1}^{N} CC(p_j), i.e., its normalized citation count. Also, and in accordance with Boldi et al. (2005), when d → 1^−, PageRank becomes unstable and its convergence rate slows.

3.2.2 Topic-Sensitive PageRank

The link-structure of the Web is used in the original PageRank algorithm to pre-compute topic-independent scores that reflect the importance of web pages. The pre-computed importance scores can afterwards be combined with other Information Retrieval scores, e.g., term frequency, to produce a ranking of the pages towards specific user queries (Brin & Page, 1998).

Haveliwala (2002) proposed a Topic-Sensitive PageRank algorithm, where one computes offline a set of PageRank vectors, which are biased towards a set of representative basis topics from the Open Directory Project^1. For each page, and regarding the considered set of topics, a set of importance scores is created and, at query-time, the similarity between the query and/or user context and the topics is calculated. To achieve the final ranking, one linearly combines the topic-sensitive vectors, weighted by the similarity of the query towards the topics.

1 http://www.dmoz.org/

The mathematical approach to this Topic-Sensitive PageRank is as follows. Consider a query q and its context q′ in the page u; we may have a search in context (i.e., the user is viewing a document and selects a term from it, in order to get more information about the selected term). The context q′ consists of all the terms in u if we have a search in context, and otherwise q′ consists only of the query q itself. For each topic c_j, the following quantity is computed:

\[ P(c_j \mid q') = \frac{P(c_j) \cdot P(q' \mid c_j)}{P(q')} \propto P(c_j) \cdot \prod_{i} P(q'_i \mid c_j) \tag{3.44} \]

In the formula, each q′_i is a term of the context q′, and P(q′_i|c_j) can be computed from the class term-vector D_j, which consists of the terms of the documents below each of the 16 top-level categories of the Open Directory Project (ODP). Finally, a composite, query-sensitive importance score s_{qd} is computed for each document as follows:

\[ s_{qd} = \sum_{j} P(c_j \mid q') \cdot r_{jd} \tag{3.45} \]

In the formula, r_{jd} is the rank of document d given the PageRank vector PR(α, v_j) for topic c_j. In turn, PR(α, v_j) has as parameters a bias factor α and the non-uniform damping vector v_j, with T_j being the set of URLs in the ODP category c_j:

 1  , i ∈ Tj |Tj | vji = (3.46)  0, i∈ / Tj

The bias factor, similarly to PageRank’s damping factor, can influence the biasing degree of the resulting vector towards the topic vector that was used. This bias was heuristically set to α = 0.25 by the authors.
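The offline biasing and query-time combination can be sketched as follows; the four-page graph, the two topics, and the topic probabilities P(c_j|q′) are all hypothetical values chosen for illustration:

```python
# Sketch of Topic-Sensitive PageRank: one rank vector per topic, biased by
# the non-uniform damping vector of Equation 3.46, then linearly combined
# at query time as in Equation 3.45 (here combining full score vectors
# rather than ranks, a simplification).
def biased_pagerank(links, topic_pages, alpha=0.25, iters=100):
    # teleport only to the pages of the topic (Equation 3.46)
    v = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in links}
    pr = dict(v)
    for _ in range(iters):
        pr = {p: alpha * v[p] + (1 - alpha) * sum(pr[q] / len(links[q])
                                                  for q in links if p in links[q])
              for p in links}
    return pr

links = {"sports1": ["sports2", "news1"], "sports2": ["sports1"],
         "news1": ["news2"], "news2": ["news1", "sports1"]}
topics = {"sports": {"sports1", "sports2"}, "news": {"news1", "news2"}}
vectors = {t: biased_pagerank(links, pages) for t, pages in topics.items()}

p_topic = {"sports": 0.9, "news": 0.1}  # hypothetical P(c_j | q')
s_qd = {doc: sum(p_topic[t] * vectors[t][doc] for t in topics) for doc in links}
```

With a sports-heavy query context, the top-scoring page is a sports page, even though both topic vectors were computed offline over the same graph.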

3.2.3 TwitterRank

In the context of Twitter, the popular microblogging service, there is often the need to determine who the influential users are.

From the work of Weng et al. (2010) arose TwitterRank, an extension of the PageRank algorithm that takes both the topic similarity between users and the link structure of the social network into account. Indeed, the influence of a user may vary across different topics, since a Twitter user can have interests or expertise in many distinct areas.

In the same way that, in Bibliometrics, citation count is the simplest method to assess the influence of an author in an author-publication network, on Twitter the follower count, i.e., the total number of people who follow a particular user, has been interpreted as a good indicator of influence. Nevertheless, Weng et al. (2010) observed that 72.4% of the users follow more than 80% of their followers, and that 80.5% of the users have 80% of their friends (i.e., twitterers whose updates are being followed) following them back. There are two possible explanations for this: either the act of following is so casual that a twitterer randomly follows other twitterers and they, politely, just follow them back, or the following relationship reflects the existence of a strong similarity among users, due to their interest in the topics the twitterers tweet about. The latter denotes the homophily phenomenon.

The general framework proposed for TwitterRank is depicted in Figure 3.7. First, in the topic distillation phase, the topics twitterers are interested in are extracted on the basis of what they tweet about. Then, a topic-specific relationship network is built, based on the previously gathered topics. Finally, the TwitterRank algorithm is applied to measure the topic-sensitive influence of a twitterer, taking into account both the topics that were distilled and the structure of the topic-specific relationship network. Top topics are identified in the order of the probabilities of topic presence, as captured in a matrix WT with W unique words in tweets and T topics, in which each entry W_{it} gives the number of times the unique word w_i has been assigned to topic t.

Figure 3.7: The general TwitterRank framework (adapted from Weng et al. (2010)).

This approach addresses two important shortcomings of PageRank, namely the fact that it does not take into account (i) the interests of the nodes of the network, and (ii) the indegree associated with the follower count in Twitter.

To mathematically describe the topic-specific TwitterRank algorithm, we can see the Twitter network as a directed graph D(V, E), where the vertices V are the twitterers and the edges E are the following connections between two twitterers. These connections are directed from follower to friend. In a random surfer scenario, the surfer visits each twitterer with a certain topic-specific probability, by following the appropriate edge in D. The transition matrix for topic t, P_t, from follower s_i to friend s_j, is defined as follows, where |τ_j| is the number of tweets published by s_j, and Σ_{a: s_i follows s_a} |τ_a| sums the number of tweets published by all of s_i's friends.

\[ P_t(i, j) = \frac{|\tau_j|}{\sum_{a:\, s_i \text{ follows } s_a} |\tau_a|} \cdot sim_t(i, j) \tag{3.47} \]

The similarity between s_i and s_j regarding topic t, denoted by sim_t(i, j), is defined as follows:

\[ sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}| \tag{3.48} \]

In the formula, DT′ is the row-normalization of matrix DT, with D being the twitterers and T the topics. In DT′, each row is the probability distribution of twitterer s_i's interest over the T topics. Thus, the similarity between s_i and s_j in topic t is assessed through the difference between the probabilities that the two are interested in topic t. The higher their similarity, the higher the transition probability from s_i to s_j.

There is also the possibility of having some twitterers following one another in such a cyclic way that they do not follow anyone outside that particular circle of following relations, which can end up in an accumulation of high influence that is not distributed. To account for this situation, Weng et al. (2010) introduced a teleportation vector E_t that captures the probability that a random surfer jumps to some twitterer instead of following the edges of graph D. The teleportation vector is defined as follows:

\[ E_t = DT''_{\cdot t} \tag{3.49} \]

In the formula, DT''_{\cdot t} is the t-th column of DT″, the column-normalized form of matrix DT, the latter being part of the results from the topic distillation phase. Each entry of DT contains the number of times the words in a twitterer's tweets have been assigned to a specific topic.

Thus, the topic-specific TwitterRank can be calculated as follows:

\[ \overrightarrow{TR_t} = \gamma P_t \times \overrightarrow{TR_t} + (1 - \gamma) E_t \tag{3.50} \]

In the formula, γ is a parameter that directly controls the probability of teleportation, analogous to PageRank's damping factor, and has a value that can range from 0 to 1, usually set to γ = 0.85.

Equation 3.50 gives the representation of the topic-specific TwitterRank vectors that are generated. However, these vectors only refer to a twitterer's influence in individual topics. To measure the overall influence of a twitterer across different topics, we need to compute the aggregated TwitterRank vector as follows:

\[ \overrightarrow{TR} = \sum_{t} r_t \cdot \overrightarrow{TR_t} \tag{3.51} \]

In the formula, \overrightarrow{TR_t} is the TwitterRank vector for topic t, and r_t is the weight assigned to topic t and associated with \overrightarrow{TR_t}.
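The iteration of Equation 3.50 and the aggregation of Equation 3.51 can be sketched with hypothetical numbers as follows (the matrices, teleportation vectors and topic weights are toy values; the update sums the incoming probability mass, i.e., it applies the transpose of P_t to the current vector):

```python
# Toy sketch of TwitterRank: P[t][i][j] is the topic-t transition
# probability from follower i to friend j (Equation 3.47), E[t] the
# teleportation vector (Equation 3.49), gamma the teleportation control.
def twitterrank_topic(P, E, gamma=0.85, iters=200):
    n = len(E)
    tr = [1.0 / n] * n
    for _ in range(iters):  # fixed point of Equation 3.50
        tr = [gamma * sum(P[i][j] * tr[i] for i in range(n))
              + (1 - gamma) * E[j] for j in range(n)]
    return tr

P = {"t1": [[0.0, 0.6, 0.4], [0.5, 0.0, 0.5], [0.3, 0.7, 0.0]],
     "t2": [[0.0, 0.2, 0.8], [0.9, 0.0, 0.1], [0.5, 0.5, 0.0]]}
E = {"t1": [0.2, 0.5, 0.3], "t2": [0.4, 0.4, 0.2]}
r = {"t1": 0.7, "t2": 0.3}  # hypothetical topic weights

per_topic = {t: twitterrank_topic(P[t], E[t]) for t in P}
overall = [sum(r[t] * per_topic[t][u] for t in P) for u in range(3)]  # Eq. 3.51
```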

Weng et al. (2010) observed that the most active twitterers are not necessarily the most influential in each topic. Also, due to the consideration of the topical dimension, there is a higher correlation between TwitterRank and the Topic-Sensitive PageRank (Section 3.2.2) than with the indegree or with the original PageRank algorithm. The experiments conducted by Weng et al. (2010), which used a Twitter dataset with messages from Singapore-based twitterers collected in April 2009, showed that TwitterRank outperforms other related algorithms, including both PageRank and the algorithm that Twitter was using at the time of their study.

3.3 The Influence-Passivity (IP) Algorithm

Romero et al. (2011) came to the conclusion that, for a user to be considered influential, he not only has to be popular and get attention from his peers, but he also has to overcome passivity, a state in which a user receives information but does not propagate it through the network. Thus, this approach determines both the influence and the passivity of a user, based on his information forwarding activity.

The algorithm proposed by Romero et al. (2011) is similar to HITS and to PageRank. However, the dif- ference in this approach is that the diffusion behaviour among the users is also taken into consideration. This work was conducted on Twitter and assigns to every user both a passivity score and an influence score, which respectively correspond to the authority and hub scores in the HITS algorithm. The use of passivity in the algorithm comes from the evidence that users in Twitter are generally passive and thus, when determining the influence of a user, taking into account the passivity of all the people that are influenced by him is also very important. The following assumptions are considered by the authors:

1. The influence score of a user depends on the number of people he influences, as well as on their passivity.

2. The influence score of a user depends on how dedicated the people that he influences are. This dedication is measured by the amount of attention a user pays to some other user, as compared to everyone else.

3. The passivity score of a user depends on the influence of those who he is exposed to, but not influenced by.

4. The passivity score of a user depends on how much he rejects some other user’s influence, com- pared to everyone else’s influence.

Given these assumptions, one should note that the network graph for this algorithm is a weighted graph G = (N, E, W) with N nodes, E edges and W edge weights, where weight w_{ij} represents the ratio between the influence that node i exerted over node j and the total influence that i attempted to exert over j. The output of the IP Algorithm is a function I : N → [0, 1] and a function P : N → [0, 1], which represent each node's relative influence and passivity, respectively. For each edge e = (i, j) ∈ E, the authors defined an acceptance rate u_{ij}, representing the amount of influence that user j accepted from user i, normalized by the influence j accepted from all users in the network, which thus reflects the loyalty user j has to user i. The acceptance rate is defined as follows:

\[ u_{ij} = \frac{w_{ij}}{\sum_{k:(k,j) \in E} w_{kj}} \tag{3.52} \]

There is also a rejection rate, which is the opposite of the acceptance rate, because 1 − w_{ji} is the amount of influence that user i rejects from user j. Thus, the rejection rate v_{ji} is the influence that user i rejected from user j, normalized by the total influence rejected from j by all other users in the network. The rejection rate v_{ji} is mathematically expressed as follows:

\[ v_{ji} = \frac{1 - w_{ji}}{\sum_{k:(j,k) \in E} (1 - w_{jk})} \tag{3.53} \]

The IP Algorithm is thus based on two operations that relate directly to the aforementioned assumptions. The operation that updates a user's influence I_i is as follows:

\[ I_i \leftarrow \sum_{j:(i,j) \in E} u_{ij} P_j \tag{3.54} \]

In the formula, the term P_j corresponds to the passivity referred to in Assumption 1, and the term u_{ij} to the amount of dedication referred to in Assumption 2. As for the operation that updates a user's passivity P_i, it is as follows:

\[ P_i \leftarrow \sum_{j:(j,i) \in E} v_{ji} I_j \tag{3.55} \]

In the formula, the term I_j corresponds to the influence referred to in Assumption 3, and v_{ji} to the rejection rate referred to in Assumption 4.

The algorithm takes as input a weighted graph and computes the IP scores for each node in m iterations, as depicted in the pseudo-code of Algorithm 2.

Algorithm 2 The Influence-Passivity (IP) Algorithm
  G(N, E, W): an influence graph with N nodes, E edges and W edge weights
  I_0 ← (1, 1, ..., 1) ∈ R^|N|
  P_0 ← (1, 1, ..., 1) ∈ R^|N|
  for i = 1 → m do
      Update P_i using the operation P_i ← Σ_{j:(j,i)∈E} v_{ji} I_j and the values I_{i−1}
      Update I_i using the operation I_i ← Σ_{j:(i,j)∈E} u_{ij} P_j and the values P_i
      for j = 1 → |N| do
          I_j = I_j / Σ_{k∈N} I_k
          P_j = P_j / Σ_{k∈N} P_k
      end for
  end for
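The pseudo-code above can be sketched directly in Python; the four-edge graph and its weights below are hypothetical:

```python
# Plain-Python sketch of Algorithm 2: W maps directed edges (i, j) to the
# weights w_ij; u and v are the acceptance and rejection rates of
# Equations 3.52 and 3.53.
def influence_passivity(W, m=50):
    nodes = {n for edge in W for n in edge}
    u = {(i, j): w / sum(w2 for (k, j2), w2 in W.items() if j2 == j)
         for (i, j), w in W.items()}
    v = {(j, i): (1 - w) / sum(1 - w2 for (j2, k), w2 in W.items() if j2 == j)
         for (j, i), w in W.items()}
    I = {n: 1.0 for n in nodes}
    P = {n: 1.0 for n in nodes}
    for _ in range(m):
        P = {i: sum(v[(j, i)] * I[j] for (j, i2) in W if i2 == i) for i in nodes}
        I = {i: sum(u[(i, j)] * P[j] for (i2, j) in W if i2 == i) for i in nodes}
        si, sp = sum(I.values()), sum(P.values())
        I = {n: val / si for n, val in I.items()}
        P = {n: val / sp for n, val in P.items()}
    return I, P

W = {("a", "b"): 0.8, ("a", "c"): 0.4, ("b", "c"): 0.5, ("c", "a"): 0.3}
I, P = influence_passivity(W)
```

After each iteration the influence and passivity scores are normalized, so each function sums to one over all nodes, as required by the [0, 1] output ranges.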

The authors also concluded that there is a weak correlation between popularity and influence. The IP Algorithm turned out to provide better indicators of popularity than PageRank.

3.4 Citation and Co-Authorship Networks

In Bibliometrics, there are two classes of ranking algorithms. In the class of collection-based ranking algorithms, a weighted graph is used whose nodes correspond to collections, e.g., journals and conference proceedings, with the weighted edges representing the total number of citations pointing from one collection to the other. The other class corresponds to publication-based ranking algorithms, where the nodes of the citation graph are individual publications and the edges represent citations between papers (Sidiropoulos & Manolopoulos, 2005).

Both PageRank (Brin & Page, 1998) and HITS (Kleinberg, 1998) are part of the second class of ranking algorithms, while the ISI Impact Factor (Bollen et al., 2006) is part of the first class.

Neither PageRank nor HITS is perfectly suitable for Bibliometrics: the latter because a publication only gets a high authority score if there are good hubs pointing to it, and the former because it was designed in a way that a node's score is mostly affected by the scores of the nodes that point to it, and less by the number of incoming links. Following this assessment, Sidiropoulos & Manolopoulos (2005) introduced SCEAS Rank, a collection-based ranking algorithm in which scores are computed over a weighted graph whose nodes correspond to collections. SCEAS can be defined as follows:

\[ S_j = \sum_{i \to j} \frac{S_i + b}{N_i} \, a^{-1} \qquad (a \geq 1,\ b > 0) \tag{3.56} \]

In the formula, N_i is the number of outgoing citations of node i, b is the direct citation enforcement factor, which is used so that citations from zero-scored nodes can also contribute to the score of the publications they cite, and a denotes the speed at which an indirect citation enforcement converges to zero. If a change in the score of node i occurs, it affects the score of a node j that is x nodes away with a factor of a^{−x}. The SCEAS approach also has the following advantages over the PageRank and HITS algorithms:

1. A node’s score is affected by the number of incoming citations.

2. The algorithm's computation and convergence are very fast. In the experiment conducted by Sidiropoulos & Manolopoulos (2005) with a DBLP dataset, they verified that SCEAS needed half the time needed by PageRank, and about 1/10 of the time needed by HITS.

3. A node’s score is less affected by the score of distant nodes and, whenever new nodes and ci- tations are added to the network, the new score’s computation can be performed incrementally, using the previous score vector as the input vector for the computation.
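As a sketch, Equation 3.56 can be iterated to a fixed point on a tiny hypothetical citation graph (the parameter values a = 2 and b = 1 are toy choices respecting the constraints a ≥ 1, b > 0):

```python
# Fixed-point sketch of SCEAS (Equation 3.56) on a toy citation graph given
# as (citing, cited) pairs; Ni is the number of outgoing citations of i.
def sceas(edges, a=2.0, b=1.0, iters=30):
    nodes = {n for e in edges for n in e}
    out_n = {n: sum(1 for (i, j) in edges if i == n) for n in nodes}
    s = {n: 0.0 for n in nodes}
    for _ in range(iters):
        s = {j: sum((s[i] + b) / (out_n[i] * a) for (i, j2) in edges if j2 == j)
             for j in nodes}
    return s

# p3 cites p1 and p2; p2 cites p1; p1 cites nothing
scores = sceas([("p2", "p1"), ("p3", "p1"), ("p3", "p2")])
```

Thanks to the b term, even the zero-scored paper p3 contributes to the scores of the papers it cites, and the influence of distant nodes is attenuated by the a^{−x} factor.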

Specifically for co-authorship networks, where the graph nodes represent authors and the edges represent ties between two authors, Liu et al. (2005) proposed AuthorRank, a modification to the PageRank algorithm that is computed over a weighted directed co-authorship graph.

The co-authorship graph is directed and weighted in order to express the magnitude of the relationship between two authors and is, as in the Weighted PageRank, represented by G = (V, E, W), with a set V of authors, a set E of co-authorship relationships, and a set W of normalized weights w_{ij} connecting authors v_i and v_j. The normalized weights w_{ij} are such that the outgoing weights of an author sum up to one, and they are computed as follows:

w_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}} \qquad (3.57)

In the formula, cij and cik correspond to the co-authorship frequency (Equation 3.58), which is also correlated with exclusivity.

The idea behind co-authorship frequency is to assign more weight to authors that co-publish more papers together, and do so exclusively (Liu et al., 2005). For a set of m articles, co-authorship frequency is defined as follows:

c_{ij} = \sum_{k=1}^{m} g_{i,j,k} \qquad (3.58)

In turn, exclusivity, i.e., giving more weight to co-authorship ties in articles with fewer total co-authors than in articles with a large number of co-authors (Liu et al., 2005), is defined for authors vi and vj, who co-author article ak, as follows:

g_{i,j,k} = \frac{1}{f(a_k) - 1} \qquad (3.59)

In the formula, f(ak) is the total number of authors of article ak.

The magnitude of the connection between two authors is determined by the following factors:

1. Frequency of co-authorship: Authors that co-author frequently should have a higher co-authorship weight;

2. Total number of co-authors on articles: Less weight should be assigned to the co-author relationship if the article has many authors.

Therefore, the AuthorRank of author i is expressed as follows:

AR(i) = (1 - d) + d \sum_{j=0}^{n} AR(j) \times w_{j,i} \qquad (3.60)

In the formula above, AR(j) is the AuthorRank score of the backlinking node j, and w_{j,i} corresponds to the weight of the edge between node j and node i.

Also, when exclusivity and collaboration frequency are taken into account, one can assess that some ties are more prestigious than others.
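The weighting scheme in Equations 3.57–3.59 can be sketched as follows; the toy article list and the function name are assumptions made for illustration.

```python
# Hedged sketch of the AuthorRank edge weights: exclusivity g_{i,j,k},
# co-authorship frequency c_ij, and normalized weights w_ij.
from collections import defaultdict

def coauthor_weights(articles):
    """articles: list of author lists; returns normalized weights w[i][j]."""
    c = defaultdict(float)  # co-authorship frequency c_ij (Equation 3.58)
    for authors in articles:
        if len(authors) < 2:
            continue
        # g_{i,j,k} = 1 / (f(a_k) - 1): fewer co-authors, stronger tie
        exclusivity = 1.0 / (len(authors) - 1)
        for i in authors:
            for j in authors:
                if i != j:
                    c[(i, j)] += exclusivity
    row_sum = defaultdict(float)
    for (i, j), cij in c.items():
        row_sum[i] += cij
    w = defaultdict(dict)
    for (i, j), cij in c.items():
        w[i][j] = cij / row_sum[i]  # weights out of each author sum to one
    return w

# Toy corpus: ana and bo co-publish often and mostly exclusively.
papers = [["ana", "bo"], ["ana", "bo"], ["ana", "bo", "carl"]]
w = coauthor_weights(papers)
```

In this toy corpus the ana–bo tie receives more weight than the ana–carl tie, since ana and bo co-publish more frequently and more exclusively.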

3.5 Temporal Issues in Ranking Scientific Articles

Citation networks are generally static, since a scientific article cannot lose citations throughout the years and articles do not disappear from the network. On the other hand, social networks are generally characterized as dynamic networks, which change at a very fast pace, due to new users making new connections and former users leaving the social network, breaking the ties they had established. Still, even in the case of citation networks, new articles are constantly being introduced. Therefore, time is a key factor in social network analysis.

Sayyadi and Getoor developed FutureRank, which computes the expected PageRank score of a scientific article based on the citations it will obtain in the future (Sayyadi & Getoor, 2009). This number of future citations is referred to as the usefulness of the article, and the authors assumed that recent articles are more useful. Nevertheless, older and highly cited articles still obtain a good ranking, due to being cited by recent articles. The algorithm is computed on a network that has two different types of nodes, namely articles and authors, and is thus unfolded into two distinct networks: (i) a citation network connecting articles through citation edges, and (ii) an authorship network connecting articles and authors through co-authorship edges. In the second network, articles can be mapped to the authorities and authors to the hubs of the HITS algorithm. As the networks share nodes, information is passed from one to the other.

In short, FutureRank runs one step of PageRank on the first network, in order to transfer authority from the articles to their references, and one step of HITS on the second network. These results are repeatedly combined until convergence is reached. The ranking of articles also involves a personalized PageRank vector, which is pre-computed based on the current time and the publication time of the articles, instead of being based on the number of nodes in the network, as in the original PageRank algorithm.

The CiteRank algorithm (Walker et al., 2007) makes use of publication time in order to rank articles: each researcher, independently of others, is assumed to start his search with recent articles, proceeding along a chain of citations until fully satisfied. The output of the algorithm can be seen as an estimate of the traffic to an article, i.e., the probability of encountering the article via a path of any length, and it is correlated with the number of citations, in the sense that the larger the number of citations, the more likely the article is to be visited through one of its incoming links. CiteRank is similar to the PageRank algorithm in all respects, except that CiteRank initially distributes random surfers exponentially with age, with probability

\rho_i = e^{-age_i / \tau_{dir}}

where age_i is the age of the i-th article and \tau_{dir} is the time decay constant, thus favoring recent articles.
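The age-based starting distribution of CiteRank can be sketched as follows; the value of τ and the example ages are assumptions chosen only to illustrate the exponential bias toward recent articles.

```python
# Minimal sketch of CiteRank's starting distribution rho_i = exp(-age_i / tau),
# normalized so the random surfers form a probability distribution.
import math

def citerank_start(ages, tau=2.6):
    """ages: article ages in years; returns the normalized surfer distribution."""
    raw = [math.exp(-a / tau) for a in ages]
    total = sum(raw)
    return [r / total for r in raw]

# Four articles: brand new, 1, 5, and 20 years old.
rho = citerank_start([0, 1, 5, 20])
```

The newest article receives the largest share of initial surfers, so recent work is favored before any citation-following takes place.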

3.6 Summary

This chapter presented what has been previously done regarding the task of finding influencers in a network, with its main focus on the PageRank algorithm and on the different variants that have arisen over the years. The Influence-Passivity (IP) algorithm was also presented, i.e., a novel approach to influence, based on the HITS and PageRank algorithms, that also takes information diffusion into account. Finally, we glanced at a recent trending research topic concerning temporal issues in ranking scientific articles, specifically the prediction of future PageRank scores in a citation network, based on the future citations that an article may receive.


Chapter 4

Finding Influencers in Social Networks

This chapter presents and details the work that was developed in the context of my MSc thesis. I focused on studying and developing techniques to identify influential nodes in a network so that, given a network, one can characterize it and assess which nodes exert more influence over others, i.e., which nodes induce others to adopt a particular behavior, e.g., forwarding a message or visiting a renowned monument or concert venue.

Two distinct experiments were conducted, each with a different type of network. In the first experiment, we collected real and up-to-date data from a location-based social networking service, namely FourSquare, and from Twitter, a social networking and microblogging service, building social networks from the collected data. The network built from FourSquare's data is commonly called a location-based social network, due to its inclusion of information from users' interactions with other users, as well as users' interactions with locations, as they check in at different places. The second experiment involved data from DBLP, a digital library containing information about academic publications and their citations, from which a citation network was built.

With the work that was developed, we wanted to test the hypothesis that we can identify a network's most influential nodes through network analysis metrics and algorithms. These techniques were applied to different kinds of social networks, in order to explore influence in distinct contexts. In the experiment with location-based social networks, we wanted to test how good these social network analysis metrics and algorithms are at the task of identifying the most relevant nodes. On the other hand, when experimenting with academic social networks, we wanted to identify the most important papers in the dataset and test whether it was possible to predict the future influence scores of the nodes in the network, based on their previous influence scores.

The remainder of this chapter is organized as follows: first, we introduce the main software packages that were used and extended in the course of this research. Then, we describe the metrics used to characterize the social networks of our experiments. In Section 4.2 we thoroughly describe the experiment with location-based social networks, while in Section 4.3 we describe the experiment with the academic social network built from DBLP, covering the process of data collection, the algorithms that were computed, and the methods used to find influential nodes. We finish this chapter with a brief summary of what has been presented.

4.1 Available Resources for Finding Influencers

To perform our experiments and fulfill the tasks of characterizing a social network and finding its most influential nodes, we used several state-of-the-art algorithms and open-source software packages for network analysis, among which is the LAW Webgraph open-source software package.

LAW Webgraph is an open-source project developed by researchers from the Laboratory for Web Algorithmics at the University of Milan. It contains a Java library for large-scale web graph analysis, presenting a novel approach to graph compression that enables the creation and storage of web-scale graphs. Among other things, the LAW Webgraph package contains an implementation of the PageRank algorithm, which was the first algorithm we used for assessing the influence of nodes in our experiments. Since we intended to extend this software package with the HITS and IP algorithms, the structure of LAW's PageRank implementation served as a template for our algorithmic extensions.

For the implementation of the HITS algorithm we followed the pseudo-code in Algorithm 1, in which two different scores have to be computed - the hub score and the authority score. The computation of these scores is based, respectively, on the outlinks and inlinks of every node in the graph. Through LAW Webgraph's API we could only access the successors of a node. To overcome this limitation when computing the HITS algorithm, we built both the graph and its transpose, instead of just the graph, so that we could access both the successors and the predecessors of each node through the transpose of the original graph (i.e., the inlinks of a node are its outlinks in the graph's transpose).
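The transpose trick described above can be sketched in a few lines of Python; the adjacency-list representation and the toy graph are assumptions, standing in for the LAW Webgraph structures used in the actual implementation.

```python
# Sketch of HITS computed with a graph and its transpose: since only
# successors are exposed, predecessors are read off the transposed graph.

def transpose(graph):
    """Reverse every edge: inlinks become outlinks."""
    t = {n: [] for n in graph}
    for u, succs in graph.items():
        for v in succs:
            t.setdefault(v, []).append(u)
    return t

def hits(graph, iterations=30):
    gt = transpose(graph)
    hub = {n: 1.0 for n in graph}
    auth = {n: 1.0 for n in graph}
    for _ in range(iterations):
        # authority: sum of hub scores of predecessors (via the transpose)
        auth = {n: sum(hub[p] for p in gt.get(n, [])) for n in graph}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # hub: sum of authority scores of successors (via the original graph)
        hub = {n: sum(auth[s] for s in graph[n]) for n in graph}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Toy graph: a and b both point to c.
g = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(g)
```

Here c, pointed to by two hubs, gets the highest authority score, while a and b act as hubs.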

Analogously, the Influence-Passivity (IP) algorithm involves the computation of two scores - the influence score and the passivity score. Thus, two graphs were again built. In this implementation we followed the pseudo-code in Algorithm 2 from Section 3.3.

4.1.1 Characterizing Networks

To understand aspects such as the dimension of our generated graphs or how well connected their nodes are, some well-known network analysis metrics were used.

With the average path length, one can assess the average distance between the nodes in our networks, understanding how tightly connected they are (e.g., a small average path length indicates that all nodes are closely connected, which means that it will be easy to spread information through the network). The clustering coefficient allows us to assess how close the neighbours in our networks are to one another, i.e., whether neighbours tend to form clusters with a large number of ties between them. On the other hand, by studying the degree distribution of the nodes in a network, one can assess whether we are in the presence of a large-scale network characterized by a power-law distribution of node degrees, i.e., a network in which the majority of the nodes have few connections, but where a smaller set of nodes holds an extremely large number of connections. These well-connected nodes are called hubs, and they can also be seen as central points of aggregation in the network.
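As a minimal illustration of inspecting a degree distribution, the sketch below computes the fraction of nodes at each degree; the sample degree sequence is an assumption, not data from the thesis, and a real power-law check would involve fitting over many more nodes.

```python
# Illustrative check for a heavy-tailed degree distribution.
from collections import Counter

def degree_histogram(degrees):
    """Map each degree value to the fraction of nodes having that degree."""
    counts = Counter(degrees)
    n = len(degrees)
    return {deg: c / n for deg, c in sorted(counts.items())}

# Most nodes have few connections; one hub holds many.
degrees = [1, 1, 1, 1, 2, 2, 3, 50]
hist = degree_histogram(degrees)
```

In a large-scale network, plotting this histogram on log-log axes would show the roughly straight line characteristic of a power law, with the hubs forming the long tail.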

4.2 Analysis of Location-based Social Networks

A traditional social network comprises a single type of node, corresponding to the users in the network. The edges between these nodes represent the friendship ties between the users. In turn, a location-based social network has all the properties of a social network; however, we now have two types of nodes instead of just one, namely (1) user nodes, which are the users in the network, who can be friends with other users, and (2) location nodes, which are the locations users have visited or mentioned in their personal messages. Therefore, one can say that a location-based social network also has two types of edges or social ties, namely (1) user-user ties, corresponding to the edges between two users and in all respects similar to the edges existing in social networks, and (2) user-location ties, corresponding to the edges between users and locations, which are derived from a user mentioning or visiting a specific location. Location-based social networks yield a great amount of information, because one can look at them according to two layers: one where the users are connected to their friends, and an underlying layer where users are connected to locations. The latter is an intersecting layer through which one can identify the most visited locations (i.e., locations that are connected to a larger number of users) and, from a location perspective, which locations exert more influence over the users they are connected to - see Figure 4.8.

Most online social networking services have public APIs, which allow the search and extraction of publicly available, real and up-to-date data. In our experiments, all the considered social network platforms provided access to a public API. Thus, the first step to gather information from these social networking services was to request data from the API and store it in a structured way, e.g., an XML file, for subsequent processing. With the raw data organized, it was then filtered to decouple user information from location information and from relationship ties. The different ranking algorithms and network analysis metrics were finally applied to a graph generated from the relationship ties in the previously filtered data.

Figure 4.8: Example of a location-based social network (adapted from Zheng & Zhou (2011)).

Data was collected from two different social network platforms: FourSquare and Twitter. FourSquare is a location-based social network that allows users to check in at different locations, called venues in its terminology, ranging from restaurants to nightclubs, movie theaters, university campuses, or a city's most iconic monuments. It was founded in 2009 and is a web application specially intended to be used on mobile devices. With the widespread availability of smartphones and mobile gadgets with an Internet connection, FourSquare's network and service have been growing and evolving throughout the years, reaching the milestone of 7 million registered users in 2011.

In FourSquare, registered users can search for other users or venues, e.g., one can search for Indian Restaurant near New York and access an extensive list of restaurants, each one with an address and a geospatial location, user-uploaded photos, reviews by users that have checked in there, as well as a list of venues similar to the one searched for. Venues can be associated with categories and tags. There is also an underlying game-play concept in this kind of social network, encouraging continuous interaction: (i) users earn points for checking in at venues or adding new venues to FourSquare, (ii) users earn badges if they check in at various different venues or complete tasks, and (iii) a FourSquare user can become mayor of a specific venue if he has checked in at that venue on more days than anyone else over a period of 60 days.

On the other hand, Twitter is a social networking and microblogging service that allows users to post messages of up to 140 characters - the tweets. Created in 2006, it has grown to be one of the most well-known social networks, with over 500 million active users. Initially, Twitter was only accessible via its website, but today one has a multitude of mobile applications at hand to manage one's account, tweet wherever one pleases, and also attach links to tweets. Nowadays, many Twitter users tweet as they arrive (or check in) at a specific location, deliberately attaching the geographical coordinates of that place to their tweet. This way, we can associate Twitter users with locations, building a location-based social network.

4.2.1 Data Collection from Online Services

To extract data about users and venues in FourSquare, we used the FourSquare API1, which returns JSON2 objects containing the result of each API call. For simplicity of use, an open-source Java implementation3 of the FourSquare API was used, providing straightforward methods to interact with the FourSquare API. This Java API includes all the methods in the official FourSquare API. However, the functionality of the method that searches for venues (i.e., venuesSearch) was not fully implemented, so a simple change to the FourSquare Java API was needed in order to extract reliable data. The original API's venuesSearch method allows one to obtain the set of venues that are near the provided latitude-longitude coordinates and within a specified radius of up to 5 km, but this radius functionality was not implemented in the open-source FourSquare Java API; we therefore added the radius parameter to the venuesSearch API call, taking full advantage of that functionality and obtaining more venues per call - see the pseudo-code in Algorithm 3. We also defined a bounding box for the New York City-Manhattan area, restricting our data collection to that geographical area in order to make a more contained study.

Algorithm 3 Pseudocode for the extraction of user and friend data from FourSquare.

latmax: maximum latitude for the New York City - Manhattan bounding box
longmax: maximum longitude for the New York City - Manhattan bounding box
latmin: minimum latitude for the New York City - Manhattan bounding box
longmin: minimum longitude for the New York City - Manhattan bounding box
lat: current latitude
long: current longitude
radius = 1000 (i.e., 1 km)
userSet: set of users from a venue

for all lat ∈ [latmin, latmax] and long ∈ [longmin, longmax] do
    venueSet ← all venues for (lat, long) within radius
    for all venue ∈ venueSet do
        Retrieve and store venue info
        userSet ← all of the venue's visiting users
        for all user ∈ userSet do
            Retrieve the user's friends
            Store friend information
        end for
    end for
end for

As for Twitter, we used the Twitter Public Stream API4, which provides a sample of roughly 1% of all the tweets being published at any given time. The data collection process had the following phases:

1. From that 1% of tweets, only the ones which had geographical coordinates were selected. Also, for each tweet we collected information such as the user id, the users he is following, and the users that are following him. Afterwards, with the coordinates associated with a user's tweet, we could establish user-location ties and, with the following and follower relationships, one could establish user-user ties.

1https://developer.foursquare.com 2http://www.json.org/ 3http://code.google.com/p/foursquare-api-java/ 4https://dev.twitter.com/docs/streaming-apis

2. From the collected user information, the users with the greatest number of connections were selected, and the data about their friends and followers was gathered.

3. Afterwards, similarly to what was done for FourSquare, all the collected data was filtered in order to keep only the information about tweets posted within the New York City-Manhattan area.

In order to discretize the geospatial coordinates, we used the Hierarchical Triangular Mesh (HTM) approach to divide the Earth's surface into a set of triangular regions, each roughly occupying an equal area of the Earth (Dutton, 1996; Szalay et al., 2007). In brief, the HTM offers a multi-level recursive decomposition of a spherical approximation to the Earth's surface. It starts at level zero with an octahedron and, by projecting the edges of the octahedron onto the sphere, it creates 8 spherical triangles, 4 in the Northern and 4 in the Southern hemisphere. Four of these triangles share a vertex at the pole, and the sides opposite to the pole form the equator. Each of the 8 spherical triangles can be split into four smaller triangles by introducing new vertices at the midpoints of each side, and adding great circle arc segments to connect the new vertices with the existing ones - see Figure 4.9.

Figure 4.9: A sequence of subdivisions of the world sphere, starting from the octahedron, down to level 5 corre- sponding to 8192 spherical triangles. The circular triangles have been plotted as planar ones, for simplicity (adapted from Szalay et al. (2007)).

This sub-division process can be repeated recursively, until we reach the desired level of resolution, as shown in Figure 4.10. The triangles in this mesh are the regions used in our representation of the Earth, and every triangle, at any resolution, is represented by a single numeric ID. For each location given by a pair of coordinates on the surface of the Earth, there is an ID representing the triangle, at a particular resolution, that contains the corresponding point. Notice that the proposed representation scheme contains a parameter k that controls the resolution, i.e. the area of the triangular regions. With a resolution of k, the number of regions n used to represent the Earth corresponds to n = 8 · 4k.
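The relation between the resolution parameter k and the number of regions can be checked directly; the Earth surface area constant below is an approximation, used only to estimate the average trixel size.

```python
# Sketch of the HTM resolution formula n = 8 * 4**k, plus an approximate
# average region area (the Earth surface area value is an assumption).

EARTH_SURFACE_KM2 = 510_072_000  # approximate total surface area

def htm_regions(k):
    """Number of triangular regions at resolution level k."""
    return 8 * 4 ** k

def avg_trixel_area_km2(k):
    """Rough average area of one trixel at level k."""
    return EARTH_SURFACE_KM2 / htm_regions(k)

n5 = htm_regions(5)
```

At level 5 this yields the 8192 spherical triangles mentioned in the caption of Figure 4.9, each covering on the order of 60,000 km² on average.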

Figure 4.10: The HTM recursive division process (adapted from Szalay et al. (2007)).

From the geographical coordinates found in some of the collected tweets, we computed the Hierarchical Triangular Mesh (HTM) so that we could assign a trixel representation to each geographical coordinate. With a trixel representation instead of a latitude-longitude representation, one has more freedom in specifying the range of the collected locations. In our case, we established three ranges of trixels, according to their resolution, i.e., locations with resolution 25, resolution 20, and resolution 10.

Nevertheless, this data collection process had some limitations. The main limitation of the FourSquare API was that its rate limit for authenticated calls per hour is set to 500, which is a very low threshold considering that we performed an extensive crawl and that each request for the listing of a user's friends is a frequent authenticated API call. As for the Twitter API, we had a rate limit of 600 calls per hour and, upon exceeding that limit, we had to wait until the next hour to make more API calls. This made us disregard a large number of tweets during the waiting time.

4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm

A major contribution of this work was the adaptation and implementation of the aforementioned Influence-Passivity (IP) algorithm. Developed by Romero et al. (2011), the IP algorithm was part of a study on information propagation in Twitter, in which the authors came to the conclusion that most users of this social network act as passive consumers of information, not forwarding content to the network. This algorithm presents a novel way of quantifying the influence of nodes in a network, by considering that each node has an influence score as well as a passivity score. These scores have a mutually reinforcing relationship, like the hub and authority scores in the HITS algorithm (Kleinberg, 1998).

For our implementation, some changes had to be made to the original IP algorithm, in order to adapt it to location-based social networks and perform an edge weight calculation consistent with the datasets we were working with. For the Twitter data collected by Romero et al. (2011), the weight of an edge e = (i, j) was assigned as follows:

w_e = \frac{S_{ij}}{Q_{ij}} \qquad (4.61)

In the formula, Qij represents the number of URLs that node i mentioned and Sij is the number of URLs that were mentioned by node i and retweeted by node j.

In the case of our datasets from FourSquare and Twitter, we wanted to generate a weight exclusively based on user-location and user-user ties, instead of URLs or retweets, as proposed by the original authors. Thus, we built a graph that rather than having two types of nodes, i.e., locations and users, would only have user nodes, estimating exclusively the influence of users in the network.

To calculate the weight of the edges between users, we adapted the Qij and Sij parameters, taking Qij as the number of locations node i has visited and Sij as the number of locations visited by both i and j, i.e., the number of commonly visited locations between nodes i and j, with i having visited each location before j visited it. In our adaptation of the algorithm, user influence is always dependent on the popularity of the locations a user has visited.

The original graph built from our datasets is depicted in Figure 4.11, i.e., the left-most graph which includes two types of nodes: (i) user nodes, represented by U1...U4, and (ii) location nodes, represented by S1...S3, and has undirected user-location ties and directed user-user ties. Also, the right-most graph in Figure 4.11 is the result of our adaptation of the IP algorithm, generating a network graph that only has directed and weighted user-user ties and has some differences regarding its structure, e.g., the original user-user edges no longer exist and new edges arise from common visits to locations. The connection between two nodes is associated with a non-negative, non-zero weight if they share a visited location, e.g., U3 and U2 both visited location S2 so there is a new edge from U3 to U2, with the weight w2, because U3 visited S2 after U2 had visited it.
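The adapted weight computation can be sketched as follows, following the definition of S_ij given above (i visited the common location before j did); the check-in log format, with (user, location, timestamp) tuples, is an assumption made for illustration.

```python
# Sketch of the adapted IP edge weights: Q_i counts the locations user i
# visited, S_ij counts locations both visited with i visiting first, and
# the edge (i, j) gets weight S_ij / Q_i.

def user_user_weights(checkins):
    """checkins: list of (user, location, timestamp) tuples."""
    first_visit = {}  # (user, location) -> earliest check-in time
    for user, loc, t in checkins:
        key = (user, loc)
        if key not in first_visit or t < first_visit[key]:
            first_visit[key] = t
    visited = {}  # user -> {location: first visit time}
    for (user, loc), t in first_visit.items():
        visited.setdefault(user, {})[loc] = t
    weights = {}
    for i, locs_i in visited.items():
        q_i = len(locs_i)
        for j, locs_j in visited.items():
            if i == j:
                continue
            # locations i visited before j did
            s_ij = sum(1 for loc, t in locs_i.items()
                       if loc in locs_j and t < locs_j[loc])
            if s_ij:
                weights[(i, j)] = s_ij / q_i
    return weights

# u2 checks in at s2 first; u3 follows later and also visits s1.
log = [("u2", "s2", 1), ("u3", "s2", 2), ("u3", "s1", 3)]
w = user_user_weights(log)
```

Only the pair with a shared location and the required visit order receives an edge, so the resulting user-user graph matches the structure illustrated in Figure 4.11.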

4.3 Analysis of Academic Social Networks

Alongside social networks, this work focused on assessing the influence of nodes in an academic social network, which is a network where the nodes either refer to the authors of scientific papers, connected via co-authorship ties that form a co-authorship network, or to the scientific papers themselves, connected through citation ties, originating a citation network. We wanted to assess which were the most influential papers in the scientific community, i.e., the ones that were gathering more attention, either due to the importance of their author(s), or due to being about a trending topic or an important breakthrough. To do so, we gathered the already organized data from the DBLP digital library, via the Arnetminer Project1, which contains information about scientific papers from 1935 to 2011, including the abstract and the

1http://arnetminer.org/DBLP_Citation

Figure 4.11: Transformation of the original network graph (left) to our IP algorithm graph (right).

number of citations. From this data we built a citation network for a set of time-stamps ranging from 2007 to 2011, as depicted in Figure 4.12, in order to have a record of how the network evolved over time.

Figure 4.12: Structure of the citation graph built upon the DBLP data.

Although any other ranking algorithm could have been used, in the case of the DBLP citation network the most influential papers in the dataset were determined through the computation of the PageRank algorithm. The top-10 highest ranked papers were then selected and their full information was gathered, in order to cross-check the set of authors of each paper with the recipients of renowned computer science and engineering awards, such as the Gerard Salton Award or the Turing Award, identifying which of these authors were distinguished by the scientific community.

4.3.1 Predicting Future Influence Scores and Download Counts

Instead of computing the future PageRank scores of scientific papers based on their future citations, as did Sayyadi & Getoor (2009), we created a framework to predict the future PageRank scores of scientific papers in a citation network for a specific year, based on their previous PageRank scores, among other features. The same principle was also applied to the prediction of download counts for scientific articles downloaded from the ACM Digital Library website in the year 2011.

The framework depicted in Figure 4.13 predicts the future PageRank scores and future download counts in three distinct phases:

1. Feature Vector Creation The first phase prepares the input for the subsequent computations related to the prediction of importance scores. Given the dataset, either for paper citations or download counts, one generates the different features, namely the text, age, and PageRank score features, and stores them in a relational database, so that feature vectors can then be generated.

2. Prediction In a second phase, one creates training and test files from the generated feature vector files, in order to proceed with the computation of a machine learning technique intended for predicting the future PageRank scores and the future download counts.

3. Accuracy Assessment Finally, to assess the quality of the obtained results, one proceeds with the computation of various evaluation metrics.

Figure 4.13: Framework for predicting future PageRank scores and download counts.

Each aforementioned phase is a preparation for the following one. To predict the PageRank scores and the download counts, we relied on features that can represent the characteristics of the information in the dataset. The following types of features were considered:

1. Absolute Scores - Includes the PageRank score resulting from the computation of the algorithm over papers published up to a specific year, inclusive. Regarding the PageRank score of a paper, we defined 5 different cumulative time-stamps, from 2007 to 2011, so that we could have access to the respective PageRank scores in each of the k previous years.

2. Differential Scores - Includes the Rank Change Rate (Racer), representing the change rate of PageRank score between two consecutive years, capturing the evolution of PageRank scores.

The Rank Change Rate between two time-stamps ti and ti+1, for paper p, is given by the following equation:

racer(p, t_i) = \frac{rank(p, t_{i+1}) - rank(p, t_i)}{rank(p, t_{i+1})} \qquad (4.62)
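A quick numeric check of the Rank Change Rate in Equation 4.62, with assumed example scores:

```python
# Sketch of the Racer feature: relative change in PageRank score
# between two consecutive time-stamps. The example values are assumptions.

def racer(rank_prev, rank_next):
    """Change rate of the PageRank score between consecutive years."""
    return (rank_next - rank_prev) / rank_next

# A paper whose score rose from 0.02 to 0.05 gained 60% of its new score.
r = racer(0.02, 0.05)
```

Positive values indicate a paper whose influence is growing, which is exactly the evolution signal the feature is meant to capture.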

3. Profile Information - Includes the Average PageRank Score, which represents the average PageRank score of all publications that have an author in common with the paper's set of authors, and the Maximum PageRank Score, which represents the maximum PageRank score of all publications that have an author in common with the paper's set of authors.

4. Age - Includes the difference between the present year and the publication year of a paper, i.e., its age.

5. Text - Includes the term frequency score for the top 100 most frequent tokens in the abstracts and titles of publications, excluding the terms in a standard English stop-word list.

For each aforementioned type of feature, except age and text, its value for the previous k years, with k ranging from 1 to 3, was considered, e.g., when predicting the future PageRank score for the year 2010, one first predicted that score only with information from the PageRank score of the previous year (k = 1, i.e., 2009), then with information from the two previous years (k = 2, i.e., 2009 and 2008), and finally from the three previous years (k = 3, i.e., 2009, 2008, and 2007).

In order to enrich the way we made our predictions, we made a structured combination of the previously enumerated types of features, which fit into three different groups:

• 1 - In this group we used exclusively the PageRank scores of the paper as features.

• 1 + 2 - In this group we used both PageRank and Racer scores of the paper as features.

• 1 + 2 + 3 - In this group we used PageRank scores, Racer scores, Average Author scores and Maximum Author scores as features.

The remaining text and age features were separately added to the aforementioned combinations of features, enabling the creation of two distinct subsets of results. Thus, alongside the different ranges of k used, one could assess whether, for a particular type of feature or group of features, adding more information about previous years would improve or degrade the accuracy of our results. Also, for a straightforward computation of the Racer, the Average PageRank score, the Maximum PageRank score, and the feature vectors, the PageRank scores of each paper in each time-stamp, the information about the authors of the papers, and the information about download counts were stored in a relational database.

4.3.2 The Learning Approach

To predict future PageRank scores and future download counts, we used an ensemble machine learning technique included in the RT-Rank1 package, an open-source project that implements various machine learning algorithms based on regression trees.

The algorithm we used, called Initialized Gradient Boosting Regression Trees (IGBRT), is essentially a point-wise machine learning algorithm developed by the team from Washington University in St. Louis for the 2010 Yahoo Learning-To-Rank Challenge. The algorithm is shown in Algorithm 4, and it is based on Gradient Boosting Regression Trees (GBRT) (Mohan et al., 2011). GBRT is a machine learning technique based on tree averaging, which uses a set of trees to classify a new object, instead of the single best tree (Oliver & Hand, 1995). It sequentially adds small trees (d ≈ 4), each with high bias, and, in each iteration, the new tree to be added focuses strictly on the objects that are responsible for the current remaining regression error. IGBRT follows the guidelines of SVMlight2, proposed by Joachims (1999, 2002).

Algorithm 4 Initialized Gradient Boosted Regression Trees (Squared Loss)

Input: data set D = {(x_1, y_1), ..., (x_n, y_n)}
Parameters: α, M_B, d, K_RF, M_RF
F ← RandomForests(D, K_RF, M_RF)
Initialization: r_i = y_i − F(x_i), for i = 1 → n
for t = 1 → M_B do
  T_t ← Cart({(x_1, r_1), ..., (x_n, r_n)}, f, d)   {Build a CART of depth d, with all f features, and targets r_i}
  for i = 1 → n do
    r_i ← r_i − α T_t(x_i)   {Update the residual of each sample x_i}
  end for
end for
return T(·) = F(·) + α Σ_{t=1..M_B} T_t(·)   {Combine the regression trees T_1, ..., T_{M_B} with the random forest F}

With the intention of addressing GBRT's main weakness, i.e., the inherent trade-off between the step-size and the early stopping, Mohan et al. (2011) proposed an ensemble algorithm that starts off at a point very close to the global minimum and refines the already good predictions. Thus, instead of initializing the algorithm with an all-zero function, as in GBRT, the IGBRT algorithm is initialized with the predictions of Random Forests (Breiman, 2001), since the latter are known to be resistant to overfitting, insensitive to parameter settings, and to require no additional parameter tuning. IGBRT uses GBRT to further refine the results of Random Forests, which are regarded by the authors

1https://sites.google.com/site/rtranking/ 2http://svmlight.joachims.org/

as a good starting point for the algorithm.
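To make the two-stage procedure of Algorithm 4 concrete, the sketch below reproduces its structure with scikit-learn: a Random Forests initialization, followed by boosting with shallow regression trees fitted to the residuals. This is an illustrative simplification under our own naming and parameter defaults, not the RT-Rank implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

def igbrt_fit(X, y, alpha=0.1, n_boost=100, depth=4, n_rf_trees=100, seed=0):
    """IGBRT sketch: initialize with Random Forests, then boost on the residuals."""
    rf = RandomForestRegressor(n_estimators=n_rf_trees, random_state=seed).fit(X, y)
    residual = y - rf.predict(X)                 # r_i = y_i - F(x_i)
    trees = []
    for _ in range(n_boost):                     # for t = 1 -> M_B
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        residual = residual - alpha * tree.predict(X)   # update each residual r_i
        trees.append(tree)
    return rf, trees

def igbrt_predict(model, X, alpha=0.1):
    """T(.) = F(.) + alpha * sum over t of T_t(.)"""
    rf, trees = model
    return rf.predict(X) + alpha * sum(tree.predict(X) for tree in trees)

# Toy usage on synthetic data: boosting the residuals should not worsen the
# training error of the random forest alone.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + np.sin(5 * X[:, 1])
model = igbrt_fit(X, y, n_boost=50)
pred = igbrt_predict(model, X)
```

Because each boosting step subtracts a fraction α of a least-squares fit to the current residuals, the training error is non-increasing over iterations, which is the property IGBRT exploits on top of the already strong Random Forests starting point.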

4.4 Summary

In this chapter I detailed the two types of experiments that were conducted within my MSc thesis. I began by explaining the characteristics of location-based social networks and of academic social networks, emphasizing their peculiarities. Then, for each experiment, I described the datasets, the data collection technique, and the methodology for finding the influencers in the network, along with the algorithms that were used. For the particular case of academic social networks, a novel approach to predicting future PageRank scores and future download counts was also presented.


Chapter 5

Validation Experiments

This chapter presents the results of the undertaken experiments and the evaluation methodology used to assess the veracity of the obtained results. Beginning with a concise characterization of all the datasets that were used and their respective networks, the evaluation methodology is then presented, comprising all the metrics that were used to assess the quality and veracity of the results. Finally, the obtained results for each experiment are presented and further discussed. The results comprise the experiments for finding influencers in FourSquare and Twitter, and in the citation network built upon the DBLP dataset, as well as the experiments for predicting the future PageRank scores of scientific papers from 2010 and 2011 in the DBLP citation network, and the prediction of download counts for the scientific papers published in 2011, downloaded from the ACM Digital Library.

5.1 The Considered Datasets

This section presents the characterization of all the datasets that were used, along with that of their corresponding networks.

In order to understand the structural differences between a location-based social network and a social network that consists only of relationships between users, and how this structure affects influence estimation, we created two different graphs for both the FourSquare and Twitter datasets. First we considered a graph consisting of the original location-based network built upon the data that was crawled, which we called the User+Spot Graph. Afterwards, we disregarded all the user-location relationships and built a graph consisting only of user-user ties, which we called the User Graph.
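The derivation of the User Graph from the User+Spot Graph can be sketched as follows. The toy arcs and the `spot:` prefix are illustrative inventions, not the thesis data model.

```python
# Arcs of a toy User+Spot Graph, stored as (source, target) pairs; spot node
# ids carry a "spot:" prefix (an assumption made only for this sketch).
user_spot_arcs = [
    ("u1", "u2"), ("u2", "u1"), ("u3", "u1"),   # user-user ties
    ("u1", "spot:cafe"), ("u3", "spot:cafe"),   # user-spot visits
]

def is_spot(node):
    return node.startswith("spot:")

# The User Graph keeps only the arcs in which neither endpoint is a spot.
user_arcs = [(s, t) for (s, t) in user_spot_arcs if not is_spot(s) and not is_spot(t)]
print(user_arcs)  # [('u1', 'u2'), ('u2', 'u1'), ('u3', 'u1')]
```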

In the case of the DBLP dataset, the distinction between two graphs was not needed, because our focus was on creating a citation network upon which we could estimate the PageRank scores of its nodes and use them as features for the algorithm that predicts future influence scores and future download counts of papers. As for FourSquare and Twitter, this structural difference presents interesting results

when estimating user influence.

                                                      FourSquare   Twitter
Spots       Total                                     48,257       1,358
            HTM Resolution 10                         —            13
            HTM Resolution 20                         —            1,277
            HTM Resolution 25                         —            1,358
Users       Total                                     447,545      2,603,505
            Relations                                 970,587      3,218,997
            Visiting Spots                            16,960       1,017
Arcs        PageRank & HITS (User+Spot Graph)         2,539,986    3,757,555
            PageRank & HITS (User Graph)              1,017,887    3,576,157
            IP Algorithm                              1,017,887    —
Nodes       PageRank & HITS (User+Spot Graph)         451,664      2,604,863
            PageRank & HITS (User Graph)              403,407      2,603,505
            IP Algorithm                              447,545      —
InDegree    Minimum (User+Spot Graph)                 0            1
            Maximum (User+Spot Graph)                 3,166        38,542
            Average (User+Spot Graph)                 2.8626       5.6162
            Minimum (User Graph)                      0            1
            Maximum (User Graph)                      3,166        38,452
            Average (User Graph)                      2.5478       5.6256
OutDegree   Minimum (User+Spot Graph)                 0            1
            Maximum (User+Spot Graph)                 1,000        460,466
            Average (User+Spot Graph)                 74.8821      1.5615
            Minimum (User Graph)                      0            1
            Maximum (User Graph)                      1,000        460,466
            Average (User Graph)                      60.5829      1.5618
Average Degree          Total (User+Spot Graph)       5.4640       3.8868
                        Users (User+Spot Graph)       5.6714       2.8878
                        Spots (User+Spot Graph)       5.7118       1.0376
                        Total (User Graph)            5.0488       2.8872
Average Path Length     User+Spot Graph               4.7369       3.9776
                        User Graph                    4.7764       3.9823
Clustering Coefficient  User+Spot Graph               0.2987       0.1156
                        User Graph                    0.3718       0.1152

Table 5.1: Characterization of the FourSquare and Twitter networks.

Regarding the characteristics of both graphs in the FourSquare and Twitter datasets, depicted in Table 5.1, one can acknowledge that while the first dataset is more complete in terms of user-location ties and quantitative spot information, the latter is more complete in terms of user-user ties and user friendship information. This behaviour occurs because FourSquare is a pure location-based network focused on sharing the locations users have visited, while Twitter is a microblogging and social network platform focused on the exchange of messages between users, thus giving priority to the relationships between users and their friends and followers. In what regards the HTM resolution, we used a resolution of 26.

When considering the average path length and the clustering coefficient, one can assess that while the nodes in the FourSquare network are closer to each other, the neighbours of nodes in Twitter are closer to one another than in FourSquare. The latter phenomenon has to do with the fact that we could collect a greater extent of data for friends of users in the Twitter dataset, resulting in the scenario

where friends of different users can, themselves, be friends and/or have friends in common. Also, one can observe that the User Graph naturally has a greater average path length and a greater clustering coefficient than the User+Spot Graph, because the User Graph has fewer nodes and, thus, shortens the distance between users and neighbourhoods of users, previously parted by the spots between them. The academic citation network built upon DBLP data comprises scientific papers from 1935 to 2011 and, from Table 5.2, one can also have an idea of the dimension of the dataset for each of the considered time-stamps, as well as how complete the information about the scientific papers is.

Regarding the degree distribution in the FourSquare and Twitter networks, in both the User+Spot Graph and the User Graph, one can acknowledge from Figure 5.14 that the degree distribution for these datasets follows a power-law distribution, which is a characteristic of large-scale networks, i.e., networks in which the majority of the nodes have very few connections, while very few nodes have a high number of connections. Nevertheless, from the values of average path length and clustering coefficient, one can say that both the FourSquare and Twitter networks are not representative of large-scale networks, because in large-scale networks, besides the power-law distribution for the degree, the average path length must be much smaller than the clustering coefficient, revealing that the nodes are very close to each other and their neighbourhoods are highly clustered.
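The two structural statistics used in this comparison can be computed as in the sketch below, here with the networkx library on a toy undirected graph (the graph itself is made up for illustration; the thesis computations used much larger networks).

```python
import networkx as nx

# A toy graph: a triangle a-b-c with a pendant node d attached to c.
g = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

avg_path = nx.average_shortest_path_length(g)  # mean distance over all node pairs
clustering = nx.average_clustering(g)          # how interlinked each node's neighbourhood is

print(round(avg_path, 3), round(clustering, 3))  # 1.333 0.583
```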

          Publications   Citations    Authors   Papers with   Papers with   Average Terms
                                                Downloads     Abstract      Per Paper
Overall   1,572,277      2,084,019    601,339   17,973        529,498       104
2007      135,277        1,150,195    330,001   15,516        343,837       95
2008      146,714        1,611,761    385,783   17,188        419,747       98
2009      155,299        1,958,352    448,951   17,973        504,900       101
2010      129,173        2,082,864    469,719   17,973        529,201       103
2011      8,418          2,083,947    469,917   17,973        529,498       104

Table 5.2: Characterization of the DBLP dataset.

On the other hand, one can acknowledge from the network characterization in Table 5.3 that the academic social network that was built naturally grows at each time-stamp, although this growth is not as significant in the last two time-stamps as it is in the first two.

Focusing on the average path length and the clustering coefficient, one can conclude that as we include more papers in the network, i.e., at each time-stamp, papers are closer to one another through the existence of more citation relationships between them, even though they tend not to be as clustered together over time.

From the plots in Figure 5.15, one can acknowledge that the number of papers increases through the years. However, these new papers tend to have few citations, and so the tail of the plots gets thicker throughout the years, i.e., new papers with few citations are frequently added to the dataset, while the number of highly cited papers remains almost unaltered.

[Figure 5.14 comprises four log-log plots of degree (y-axis) versus node id (x-axis): the degree distributions in FourSquare (User+Spot Graph and User Graph) and in Twitter (User+Spot Graph and User Graph).]

Figure 5.14: Degree distribution for nodes in the User+Spot Graph and the User Graph, from the FourSquare and Twitter datasets.

5.2 Evaluation Methodology

When assessing the quality and veracity of the results for the top-10 highest ranked users and spots in the FourSquare and Twitter datasets, we conducted an empirical analysis and relied on profile information, due to the fact that this research area is still evolving and there are no strict parameters or ground-truth lists to truly assess the influence of a node in these networks. On the other hand, when assessing the veracity of the DBLP top-10 highest ranked papers, we empirically analyzed our results against a list of recipients of renowned scientific awards, like the Gerard Salton Award and the Turing Award, and, if the authors were not part of that list, we also checked their academic publication profiles1 in order to assess whether they were renowned scientists.

In the case of the experiment of future PageRank and future download count prediction, we used a set of error metrics. One of these metrics is Kendall’s Tau, which corresponds to a value ranging between

1http://academic.research.microsoft.com/

        In-Degree               Out-Degree            Degree                 Average       Clustering
        Min   Max     Avg       Min   Max    Avg      Min   Max     Avg      Path Length   Coefficient
2007    0     1,508   2.9153    0     227    2.9153   0     1,508   5.8329   6.1800        0.1323
2008    0     1,875   3.5357    0     266    3.5357   0     1,875   7.0790   6.1047        0.1319
2009    0     2,207   3.6993    0     269    3.6993   0     2,207   7.4012   6.0833        0.1314
2010    0     2,306   3.7670    0     269    3.7670   0     2,306   7.5430   6.0665        0.1312
2011    0     2,311   3.7673    0     269    3.7673   0     2,311   7.5367   6.0676        0.1310

Table 5.3: Characterization of the DBLP network.

[−1, 1] and is defined as follows:

τ = 1 − 2c_i / ((1/2) n_i (n_i − 1))    (5.63)

In the formula, c_i is the number of concordant pairs between the produced ranked list and the ground-truth list, and n_i is the length of the two lists (Li, 2011). The aforementioned LAW-Webgraph software package includes an implementation of this metric.

We can also assess the level of correlation between two ranked lists using Spearman's correlation coefficient (i.e., Spearman's ρ), according to the formula below:

ρ = 1 − (6 Σ_{i=1}^{n} (x_i − y_i)²) / (n³ − n)    (5.64)

In the formula, x_1, ..., x_n and y_1, ..., y_n are the two rankings of the n objects (Best & Roberts, 1975). This metric was computed via its implementation in the R-Project open-source statistical software1, a statistical language and software package that includes various mathematical and statistical techniques, being also suitable for large amounts of data. Both Kendall's Tau and Spearman's correlation measure the strength of the association between two ranked lists (Cha et al., 2010). The correlation ranges between [−1, 1] and, hence, if it is close to −1, one can determine that the variables are negatively correlated, whereas if it is close to +1 they are positively correlated.
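Both coefficients are simple to compute directly. The sketch below implements Spearman's ρ exactly as in Equation (5.64), and Kendall's τ in its standard no-ties form (concordant minus discordant pairs over the total number of pairs); the example rankings are made up.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau without ties: (concordant - discordant) / (n(n-1)/2)."""
    pairs = list(combinations(range(len(x)), 2))
    s = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1 for i, j in pairs)
    return s / len(pairs)

def spearman_rho(x, y):
    """Spearman's rho for two rankings of n objects: 1 - 6*sum(d_i^2)/(n^3 - n)."""
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(x, y)) / (n ** 3 - n)

truth, predicted = [1, 2, 3, 4, 5], [1, 3, 2, 4, 5]   # one adjacent pair swapped
print(kendall_tau(truth, predicted), spearman_rho(truth, predicted))  # 0.8 0.9
```

A single swapped pair leaves both coefficients close to +1, i.e., a strongly positive association between the two rankings.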

In order to measure the accuracy of the prediction models, we used the normalized root-mean-squared error (NRMSE) metric between our predictions and the true values, which is given by the formula:

NRMSE = sqrt( (1/N) Σ_{i=1}^{N} (x_{1,i} − x_{2,i})² ) / (x_max − x_min)    (5.65)

The average absolute error, i.e., the average of the differences between the inferred (predicted) values and the actual values, was also used, and was especially relevant for assessing the quality of the predictions of download counts.
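The two error measures can be sketched as follows; the example values are made up.

```python
def nrmse(actual, predicted):
    """Normalized RMSE, as in Equation (5.65): the RMSE divided by the range
    of the actual values."""
    n = len(actual)
    rmse = (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5
    return rmse / (max(actual) - min(actual))

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between predicted and actual values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(round(nrmse(actual, predicted), 4), mean_absolute_error(actual, predicted))  # 0.0707 2.0
```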

1http://www.r-project.org/

[Figure 5.15 comprises four log-log plots of degree (y-axis) versus node id (x-axis): the degree distributions in DBLP for 2008, 2009, 2010 and 2011.]

Figure 5.15: Degree distribution for the DBLP dataset from 2008 to 2011.

5.3 The Obtained Results

This section exhibits the results obtained from the various conducted experiments, alongside their discussion. First, the results from the experiments for finding influencers in FourSquare and Twitter, as well as in the DBLP citation network, are presented and discussed, where we assess the quality of these results and whether the top-10 lists of individuals and spots produced by the different algorithms really correspond to the top-10 influencers and influential spots in the network.

The results for the experiment of predicting future PageRank scores and download counts are then presented, alongside their discussion, where we compare the output of the different evaluation metrics computed for the different groups of features, in order to understand whether the task of predicting future PageRank scores and future download counts could be successfully accomplished with the framework that was developed.

5.3.1 Finding Influencers

In the following sections, the results of the computation of the PageRank, HITS and IP algorithms for the FourSquare and Twitter datasets are presented, as well as the results of the computation of the PageRank algorithm for the DBLP dataset. While the results for the first two datasets comprise the top-10 highest ranked users and the top-10 highest ranked spots in the network, the results for DBLP highlight solely the most influential papers in the DBLP digital library dataset.

We begin by presenting and discussing the results from the experiments with the FourSquare and Twitter datasets, in this order; then we present and discuss the influence estimation for the DBLP dataset, closing this section with the results from the experiment on predicting future PageRank scores and download counts.

In order to identify the most influential users and spots in the FourSquare and Twitter datasets, average anonymous users and spots (e.g., streets) are identified, respectively, by Person-XXXX and Spot-YY:ZZ, where XXXX corresponds to the real user id, and YY and ZZ correspond to the latitude and longitude associated with that spot id in the network, while publicly well-known companies, locations/venues and people are identified by their real names, e.g., Ellen DeGeneres for users and Dunkin' Donuts for spots.

5.3.1.1 Location-based social networks: FourSquare & Twitter

From the user influence scores for the PageRank and HITS algorithms depicted in Table 5.4, one can acknowledge that the addition of spots to the network reveals well-known influentials, such as worldwide celebrities, TV channels or magazines.

Each cell lists Name (Friends, Likes):

PageRank                    | HITS - Authority           | HITS - Hub
TimeOut NY (—, 122,172)     | ZAGAT (—, 328,189)         | ZAGAT (—, 328,189)
Lucky Mag. (—, 164,323)     | TimeOut NY (—, 122,172)    | MTV (—, 731,067)
ZAGAT (—, 328,189)          | MTV (—, 731,067)           | Bravo TV (—, 375,363)
NYPL (—, 61,132)            | Bravo TV (—, 375,363)      | History Chnl (—, 541,847)
MTV (—, 731,067)            | History Chnl (—, 541,847)  | The NY Times (—, 367,008)
Person-12935563 (956, 20)   | Starbucks (—, 929,915)     | Starbucks (—, 929,915)
Bravo TV (—, 375,363)       | The NY Times (—, 367,008)  | VH1 (—, 380,987)
Person-1478079 (981, 96)    | Lucky Mag. (—, 164,323)    | People Mag. (—, 372,008)
NYC Parks (—, 17,429)       | VH1 (—, 380,987)           | TimeOut NY (—, 122,172)
History Chnl (—, 541,847)   | NYPL (—, 61,132)           | The WSJ (—, 227,894)

Table 5.4: User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the FourSquare dataset.

Meanwhile, in the User Graph, as depicted in Table 5.5, average users of social platforms are distinguished both by the PageRank algorithm and by the HITS algorithm, the latter when ordered by hub scores. In this case, average users are highlighted through their great amount of mayorships, checkins, tips about locations, and friends. Mostly through their outlinks, they become network users that other users want to follow and listen to.

Each cell lists Name (Friends, Likes):

PageRank                      | HITS - Authority              | HITS - Hub
Person-11890308 (794, 84)     | ZAGAT (—, 328,189)            | Person-2630685 (110, 817)
Person-449480 (1,000, 374)    | MTV (—, 731,067)              | Person-1127366 (39, 749)
Person-1544684 (987, 144)     | Bravo TV (—, 375,363)         | Person-4148169 (77, 899)
Person-619656 (823, 8)        | History Chnl (—, 541,847)     | Person-634270 (216, 755)
Person-4071912 (1,004, 860)   | Starbucks (—, 929,915)        | Person-42695 (128, 775)
NYCHA (807, 59)               | The NY Times (—, 367,000)     | Person-1011520 (39, 723)
Person-6935835 (990, 275)     | VH1 (—, 380,987)              | Person-3231666 (14, 713)
Person-6004767 (958, 319)     | Ellen DeGeneres (—, 457,155)  | Person-7991820 (3, 767)
Person-10934560 (1,001, 64)   | TimeOut NY (—, 122,172)       | Person-3290360 (62, 632)
Person-10554269 (985, 4)      | People Mag. (—, 372,008)      | Person-6483868 (95, 765)

Table 5.5: User influence scores for the PageRank and HITS algorithms, for the User Graph, built from the FourSquare dataset.

When the location-based network was reshaped to connect only the users that have visited at least one location in common, for the IP algorithm, the average FourSquare user is distinguished, yet again, due to a combination of factors that include a great amount of mayorships, checkins, tips about locations, and friends, as one can acknowledge from Table 5.6.

In brief, the fact that worldwide TV channels, magazines, and celebrities are highlighted in a network that contains both users and spots reveals a close connection between these well-known influentials and the spots, through a continuous activity that is intended to gather and retain their followers. When these ties are removed, the connections between real users prevail.

Name             Friends   Likes
Person-9797197   52        10
Person-9726342   5         —
Person-9615360   25        9
Person-9578554   34        —
Person-9553862   4         —
Person-9450025   47        7
Person-9264407   43        —
Person-8956766   28        —
Person-8916830   47        4
Person-884020    95        32

Table 5.6: User influence scores for the IP algorithm, built from the FourSquare dataset.

As for the most influential spots in the FourSquare dataset, the top-10 highest ranked spots resulting from the computation of both the PageRank and HITS algorithms, whether sorted by authority or by hub score, were the same. Focusing on the type of spots that were highlighted, they mainly include bars, boardwalks and other spots near the New York coastline, due to the fact that the data collection was done during the months of August and early September of 2012.

Name                             Checkins
Tattoo Shot Lounge               227
Dunkin’ Donuts                   970
Gargiulo’s Restaurant            697
The Freak Bar                    540
Ruby’s Bar & Grill               2,025
Coney Island Beach & Boardwalk   36,206
Cha Cha’s                        1,142
Denny’s Delight                  84
Coney Island Sound               280
Coney Island Polar Bear Club     85

Table 5.7: Spot influence scores for PageRank and HITS algorithms (that present the exact same top-10), for the User+Spot Graph, built from the FourSquare dataset.

When finding influencers in the Twitter dataset, one must acknowledge that users tweet wherever they are, be it at home, while waiting for a doctor's appointment, etc. Therefore, many of the locations that we could identify are not necessarily venues, i.e., the geographic coordinates associated with a tweet may point to a street or avenue, and not to a theater, museum or restaurant, as happened in the FourSquare experiment. Nevertheless, this is only due to the inner characteristics of the Twitter social network, which is content- and user-centered, and not location-centered like FourSquare. Because social networks have a dynamic behaviour, i.e., they can change over time with the addition or loss of users and relationship ties, the third highest ranked user for HITS - Authority in Tables 5.8 and 5.9 had a profile on Twitter and was active during our crawl, between July and August of 2012, but no longer has a Twitter profile, thus being marked with a * after the user id.

In the case of the Twitter dataset, the results from the computation of the IP algorithm are not presented, because the obtained results were not coherent and not nearly comparable with the ones that were obtained for FourSquare.

From Table 5.8, we can observe that the HITS algorithm, with influence sorted by authority or hub score, reveals Twitter users that are well known to the public and who exert significant influence due to their roles in society, e.g., by being an entrepreneur, a journalist or an actor. Also, due to their professional activity and media exposure, one can say that they can shape conversations; they are users other network users want to listen to. Conversely, from the top-10 generated by the PageRank algorithm, one can acknowledge that friendship ties among anonymous (to the public) users are highlighted.

Regarding the User Graph, we can see that the output from the HITS and PageRank algorithms, depicted in Table 5.9, is exactly the same as for the User+Spot Graph. This reinforces the fact that, in this particular dataset, there is a greater number of relationships among users than between users and locations, so when these location ties are disregarded, the strong ties between users naturally prevail. Also, one can see from Tables 5.8 and 5.9 that, yet again, the total number of followers and friends is not necessarily correlated with influence on Twitter.

Each cell lists Name (Followers, Following):

PageRank                           | HITS - Authority                   | HITS - Hub
Person-67779865 (45,702, 41,870)   | J. Wortham (463,772, 3,424)        | J. Lupton (301,965, 276,780)
J. K. Pulver (469,092, 38,542)     | J. K. Pulver (469,092, 38,542)     | NOH8 Campaign (426,079, 251,158)
JobsDirectUSA.com (17,075, 18,782) | Person-325410549* (—, —)           | Person-25915690 (595,404, 192,241)
Person-479562736 (16,703, 16,241)  | B. Thurston (124,722, 5,707)       | M. Allen (144,540, 55,678)
America Hires (11,824, 13,006)     | StumbleUpon (72,133, 10,370)       | Person-203455506 (188,527, 41,190)
Person-52306188 (9,989, 9,878)     | DL Hughley (73,835, 886)           | NY Daily News (85,821, 10,681)
Person-35844123 (10,030, 9,761)    | J. Rampton (47,593, 578)           | Person-18704291 (19,212, 21,098)
Person-24883913 (11,191, 9,583)    | Person-51560438 (103,721, 14,766)  | J. Calacanis (151,155, 112,248)
Person-213105865 (8,531, 9,965)    | Person-67779865 (45,699, 41,868)   | 92YTribeca (13,015, 10,560)
Person-30735143 (7,837, 8,513)     | Person-1536651 (34,216, 456)       | C.C. Chapman (34,512, 28,505)

Table 5.8: User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.

Each cell lists Name (Followers, Following):

PageRank                           | HITS - Authority                   | HITS - Hub
Person-67779865 (45,702, 41,870)   | J. Wortham (463,772, 3,424)        | J. Lupton (301,965, 276,780)
J. K. Pulver (469,092, 38,542)     | J. K. Pulver (469,092, 38,542)     | NOH8 Campaign (426,079, 251,158)
JobsDirectUSA.com (17,075, 18,782) | Person-325410549* (—, —)           | Person-25915690 (595,404, 192,241)
Person-479562736 (16,703, 16,241)  | B. Thurston (124,722, 5,707)       | M. Allen (144,540, 55,678)
America Hires (11,824, 13,006)     | StumbleUpon (72,133, 10,370)       | Person-203455506 (188,527, 41,190)
Person-52306188 (9,989, 9,878)     | DL Hughley (73,835, 886)           | NY Daily News (85,821, 10,681)
Person-35844123 (10,030, 9,761)    | J. Rampton (47,593, 578)           | Person-18704291 (19,212, 21,098)
Person-24883913 (11,191, 9,583)    | Person-51560438 (103,721, 14,766)  | J. Calacanis (151,155, 112,248)
Person-213105865 (8,531, 9,965)    | Person-67779865 (45,699, 41,868)   | 92YTribeca (13,015, 10,560)
Person-30735143 (7,837, 8,513)     | Person-1536651 (34,216, 456)       | C.C. Chapman (34,512, 28,505)

Table 5.9: User influence scores for the PageRank and HITS algorithms, for the User Graph, built from the Twitter dataset.

As one can observe from Table 5.10, the great majority of the top-10 highest ranked spots are not venues per se; the geographical locations associated with these tweets correspond to streets or avenues, due to the use of Twitter in various mobile applications. Nevertheless, some well-known spots, like Times Square and the JFK Airport, are naturally highlighted. Also, one can acknowledge that, in this particular case, the spots with the greater number of checkins turn out to be the most influential spots in the dataset.

Each cell lists Name (Checkins):

PageRank                          | HITS - Authority                  | HITS - Hub
Broadway - Times Square (4)       | Pace University (8)               | Spot40.71498749:-73.95485289 (2)
JFK Airport (2)                   | Spot40.679254:-73.8632521 (1)     | Spot40.7827699:-73.95211752 (1)
JFK Airport (Subway Station) (1)  | Spot40.67982674:-73.86344992 (1)  | Spot40.76619859:-73.91322359 (1)
Spot40.80567362:-73.91862858 (1)  | Spot40.6792906:-73.8622276 (1)    | Skin Magic Ltd (1)
Spot40.66931554:-74.20359207 (1)  | Park Lane Hotel (1)               | Spot40.76614592:-73.91323331 (1)
Spot40.73262798:-73.98359375 (1)  | Astoria Bowl (1)                  | Spot40.76616717:-73.91319381 (1)
Rosa Mexicano (Restaurant) (1)    | Spot40.7166368:-73.9543937 (1)    | Broadway - Times Square (1)
The Abyssinian Baptist Church (1) | Columbus Circle (1)               | Spot40.75612638:-73.90477465 (1)
St Luke’s School (1)              | Spot40.86745661:-74.12978901 (1)  | Spot40.76113205:-73.97952078 (1)
Spot40.742727:-73.994372 (1)      | Spot40.89064994:-73.89948689 (1)  | JFK Airport (1)

Table 5.10: Spot influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.

5.3.1.2 Academic social network: DBLP

Table 5.11 presents the top-10 highest ranked papers from the citation network built upon DBLP data, where recipients of scientific awards are highlighted in bold. From this table, one can acknowledge that the top-10 remained unaltered for scientific papers published until 2010 and until 2011, and that the

majority of these publications are authored by recipients of one or more of the renowned awards from the list in Appendix A.

Focusing on the titles of these scientific papers, one can also verify that this top-10 comprises publications that can be considered breakthroughs in a specific research area, e.g., Gerard Salton's leading work in information retrieval, or inevitable textbook references, e.g., Cormen et al.'s Introduction to Algorithms. Nevertheless, even if the authors are not recipients of renowned scientific awards, the fact that they collaborate with many other authors leads them to be cited in a greater number of publications, reinforcing their PageRank score.

Paper                                                       | Authors                                                     | PageRank 2010 | PageRank 2011
A Unified Approach to Functional Dependencies and Relations | Philip A. Bernstein, J. Richard Swenson, Dennis Tsichritzis | 0.000903919   | 0.000903646
On the Semantics of the Relational Data Model               | Hans Albrecht Schmid, J. Richard Swenson                    | 0.000891394   | 0.000891123
Database Abstractions: Aggregation and Generalization       | John Miles Smith, Diane C. P. Smith                         | 0.000860181   | 0.00085993
Smalltalk-80: The Language and Its Implementation           | Adele Goldberg, David Robson                                | 0.000763314   | 0.000763174
A Characterization of Ten Hidden-Surface Algorithms         | Ivan E. Sutherland, Robert F. Sproull, Robert A. Schumacker | 0.000716136   | 0.000716507
An algorithm for hidden line elimination                    | R. Galimberti                                               | 0.000706674   | 0.000707118
Introduction to Modern Information Retrieval                | Gerard Salton, Michael McGill                               | 0.000699671   | 0.000699584
C4.5: Programs for Machine Learning                         | J. Ross Quinlan                                             | 0.000635416   | 0.000636705
Introduction to Algorithms                                  | Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest    | 0.000592198   | 0.000592414
Compilers: Principles, Techniques, and Tools                | Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman                | 0.000528325   | 0.000528235

Table 5.11: PageRank scores for top-10 highest ranked papers of the DBLP dataset.

5.3.2 Predicting Future PageRank Scores and Download Counts

In this section, the experiment regarding the prediction of future influence scores and future download counts is detailed and thoroughly discussed. For a better understanding, we call the model for predicting future PageRank scores and download counts that includes the age of each article the age model, and the model that includes the age of the article plus the term frequencies of the 100 most frequent words in the abstract and title of each paper the text model (see Table 5.12).
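As a sketch of how the text model's textual features could be extracted (the tokenization and the helper names below are assumptions for illustration, not the thesis's exact pipeline), the 100 most frequent words across all titles and abstracts define a shared vocabulary, and each paper is then represented by the term frequency of those words:

```python
import re
from collections import Counter

def top_k_vocab(documents, k=100):
    """Vocabulary: the k most frequent words over all titles + abstracts."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [word for word, _ in counts.most_common(k)]

def tf_features(document, vocab):
    """Term-frequency vector of a single paper over the shared vocabulary."""
    tokens = Counter(re.findall(r"[a-z]+", document.lower()))
    return [tokens[word] for word in vocab]

docs = ["PageRank for citation networks", "Predicting PageRank scores"]
vocab = top_k_vocab(docs, k=100)
```

These term frequencies are then appended to the age feature to form the text model's input.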

From Table 5.12, and considering the experiment of predicting the PageRank scores for the year of 2010, both models provided very similar results, both improving as we added more information: comparing the three groups of features (PageRank scores alone; PageRank scores with Racer scores; and PageRank scores with Racer scores plus the average and maximum PageRank scores of the author), and also comparing within the same group, the quality of the results improves consistently. Only for the set of features that combines the PageRank score of one previous year with its respective Racer score and the author's average and maximum PageRank scores is the age model outperformed by the text model. Comparing the error rates for the same year, one can see that, for both models, the error rate increases as we add more information, causing the predictions to deviate. Nevertheless, for the first two groups of features, the text model has a lower error rate than the age model, while the opposite happens for the third group of features.

Having computed the absolute error for all the groups of features in both models, the results show that, on average, the text model always has a lower absolute error than the age model.

| Model | Features | ρ (2010) | τ (2010) | NRMSE (2010) | ρ (2011) | τ (2011) | NRMSE (2011) |
|---|---|---|---|---|---|---|---|
| Age | Rank k = 1 | 0.9725065 | 0.9163994 | 0.0003224 | 0.9929880 | 0.9837121 | 0.0001057 |
| Age | Rank k = 2 | 0.9836493 | 0.9381865 | 0.0006161 | 0.9999050 | 0.9994758 | 0.0000995 |
| Age | Rank k = 3 | 0.9890716 | 0.9506366 | 0.0006391 | 0.9999002 | 0.9993787 | 0.0004768 |
| Age | Racer + Rank k = 1 | 0.9724540 | 0.9173649 | 0.0003469 | 0.9998887 | 0.9994037 | 0.0002322 |
| Age | Racer + Rank k = 2 | 0.9837098 | 0.9387564 | 0.0006520 | 0.9999004 | 0.9992955 | 0.0001634 |
| Age | Racer + Rank k = 3 | 0.9888725 | 0.9493687 | 0.0006605 | 0.9952435 | 0.9866206 | 0.0005492 |
| Age | A + R + Rank k = 1 | 0.9675213 | 0.9098510 | 0.0005354 | 0.9998529 | 0.9994497 | 0.0002530 |
| Age | A + R + Rank k = 2 | 0.9840530 | 0.9355465 | 0.0008336 | 0.9998353 | 0.9993422 | 0.0002962 |
| Age | A + R + Rank k = 3 | 0.9892456 | 0.9468673 | 0.0006986 | 0.9938021 | 0.9828511 | 0.0005317 |
| Text | Rank k = 1 | 0.9708719 | 0.9101722 | 0.0003608 | 0.9992124 | 0.9979693 | 0.0002479 |
| Text | Rank k = 2 | 0.9831039 | 0.9310399 | 0.0006268 | 0.9997962 | 0.9992362 | 0.0004543 |
| Text | Rank k = 3 | 0.9886945 | 0.9451537 | 0.0006276 | 0.9995012 | 0.9983375 | 0.0005800 |
| Text | Racer + Rank k = 1 | 0.9711170 | 0.9098901 | 0.0005515 | 0.9994290 | 0.9984499 | 0.0001590 |
| Text | Racer + Rank k = 2 | 0.9832037 | 0.9314405 | 0.0006747 | 0.9997300 | 0.9990720 | 0.0001919 |
| Text | Racer + Rank k = 3 | 0.9887959 | 0.9470102 | 0.0006667 | 0.9994104 | 0.9980729 | 0.0006416 |
| Text | A + R + Rank k = 1 | 0.9705230 | 0.9984499 | 0.0001590 | 0.9997019 | 0.9990583 | 0.0002480 |
| Text | A + R + Rank k = 2 | 0.9837012 | 0.9990720 | 0.0001919 | 0.9998617 | 0.9993443 | 0.0002800 |
| Text | A + R + Rank k = 3 | 0.9888386 | 0.9980729 | 0.0006416 | 0.9998793 | 0.9993885 | 0.0006987 |

Table 5.12: Results for the prediction of future PageRank scores for papers in the DBLP dataset.
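The ρ, τ, and NRMSE columns measure, respectively, Spearman's rank correlation, Kendall's rank correlation, and the normalized root mean squared error between predicted and real values. A minimal stdlib-only sketch of these three measures (assuming no tied values, which the library implementations handle more carefully):

```python
import math
from itertools import combinations

def ranks(xs):
    """Rank positions (1-based) of each value; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman_rho(x, y):
    """Spearman's rho via the classic rank-difference formula."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    conc = disc = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def nrmse(true, pred):
    """Root mean squared error, normalized by the range of true values."""
    n = len(true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / n)
    return rmse / (max(true) - min(true))
```

Values of ρ and τ close to 1 mean the predicted ranking closely matches the real one, while a small NRMSE means the predicted scores themselves are close to the real scores.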

For the year of 2011, as we add more information to the models, the text model outperforms the age model, as shown in the last two sets of features from the third group. Also, in the scenario in which the models only have information about the immediately previous PageRank score, the age model is again outperformed by the text model. Nevertheless, when considering the error rates of both models for this year, the text model has an overall higher error rate than the age model, showing that, even though the quality of the results predicted by the age model is lower, those results are more accurate.

As occurred in the computation of the absolute error for the year 2010, the results for the year of 2011 show that, in all groups of features in both models, the text model has, on average, a lower absolute error than the age model.

Regarding the prediction of download counts, depicted in Table 5.13, one can see that using a text model increases the quality of our results. In the age model, we can verify that adding information about the Racer to the previous PageRank scores affects the results negatively, while combining previous PageRank scores with Racer and the author's average and maximum PageRank scores provides better results with a lower error rate. From this we can conclude that the age model provides a more accurate prediction as it becomes more complete. The opposite happens in all groups of the text model, i.e., as we add more information to the model within the same group, the quality of the results decreases, even though they are far better than the corresponding results in the age model.

We can also verify that the age model, for the groups of features that only include previous PageRank scores, and for the ones that combine previous PageRank scores with Racer and the author's average and maximum PageRank scores, has a lower error rate than the corresponding groups in the text model. Thus, even though the text model has better overall results, its error rate is greater than that of the age model for download count prediction.

As for the absolute error, the results show that, generally, the text model has a lower absolute error than the age model in all groups except the third.

| Model | Features | ρ | τ | NRMSE |
|---|---|---|---|---|
| Age | Rank k = 1 | 0.3864814 | 0.2742998 | 0.0080585 |
| Age | Rank k = 2 | 0.4221492 | 0.3001470 | 0.0029377 |
| Age | Rank k = 3 | 0.4323201 | 0.3080974 | 0.0028074 |
| Age | Racer + Rank k = 1 | 0.4396605 | 0.3076576 | 0.0076713 |
| Age | Racer + Rank k = 2 | 0.3370149 | 0.4747241 | 0.0078403 |
| Age | Racer + Rank k = 3 | 0.3313412 | 0.4612442 | 0.0088301 |
| Age | A + R + Rank k = 1 | 0.3377553 | 0.2558403 | 0.0147155 |
| Age | A + R + Rank k = 2 | 0.5335481 | 0.3894899 | 0.0088093 |
| Age | A + R + Rank k = 3 | 0.5406937 | 0.3962472 | 0.0078576 |
| Text | Rank k = 1 | 0.5250188 | 0.3837016 | 0.0086955 |
| Text | Rank k = 2 | 0.5261168 | 0.3849615 | 0.0087775 |
| Text | Rank k = 3 | 0.5060003 | 0.3674801 | 0.0091976 |
| Text | Racer + Rank k = 1 | 0.5325432 | 0.3887987 | 0.0085328 |
| Text | Racer + Rank k = 2 | 0.5224018 | 0.3822982 | 0.0089440 |
| Text | Racer + Rank k = 3 | 0.5087407 | 0.3703400 | 0.0091979 |
| Text | A + R + Rank k = 1 | 0.5709764 | 0.4234845 | 0.0076071 |
| Text | A + R + Rank k = 2 | 0.5651282 | 0.4180070 | 0.0079000 |
| Text | A + R + Rank k = 3 | 0.5608946 | 0.4148554 | 0.0088935 |

Table 5.13: Results for the prediction of download numbers for papers in the DBLP dataset.

In brief, from the results in Tables 5.12 and 5.13, we can see that predicting the number of downloads is a harder task than predicting future PageRank scores. We can also see that, when predicting future PageRank scores, the more information is added to the model, the more the results deviate; the opposite happens when predicting the number of downloads.

Comparing the years of 2010 and 2011, we can see that predicting the PageRank scores of a more recent year is easier than predicting the PageRank scores of a progressively more distant year in the past.

5.4 Summary

In this chapter I presented and discussed the results obtained from the experiments of finding influencers in FourSquare and Twitter, as well as in the DBLP citation network, and from the experiments for predicting future PageRank scores and future download counts for scientific papers downloaded from the ACM Digital Library.

Regarding location-based social networks, one can conclude that, most of the time, the most influential users in a network are not the ones who have the most followers. From the results one can see that, in the User Graph, relationships between users unknown to the public prevail, while TV channels, celebrities and worldwide magazines are highlighted and thus among the most influential users in the User+Spot Graph.

As for the experiment with the DBLP citation network, results have shown that the proposed frame- work, based on an ensemble regression model, offers highly accurate predictions, providing an effective mechanism to support the future ranking of papers in academic digital libraries.

Chapter 6

Conclusions

In my MSc thesis I proposed to explore the task of finding influential users in a social network, with the aid of network analysis techniques and algorithms. As I intended to perform experiments with different types of social networks, I began by collecting real and up-to-date data from both FourSquare and Twitter, in order to build two distinct location-based social networks, and gathered a dataset from the DBLP digital library, already structured in the context of the Arnetminer project, so that an academic citation network could be built.

Influence was then estimated through the computation of state-of-the-art ranking algorithms, such as PageRank, HITS and IP. In the particular case of the IP algorithm, and concerning location-based social networks, we wanted to estimate user influence exclusively; thus, instead of building a network with user-user and user-location ties, the original implementation of the IP algorithm was adapted so that the resulting network graph consisted solely of weighted user-user ties.

Regarding the academic citation network, besides estimating influence for all the papers in the dataset, we also addressed a recent research topic and developed a framework to predict the future influence scores of scientific papers, and the future download counts of papers downloaded from the ACM Digital Library, for a specific year, based on the previous years' influence scores. In this experiment we could test and combine different sets of features, resulting in two different models for the prediction of future influence scores: (1) a model including the age of the paper, and (2) a model including the 100 most frequent words in all papers' titles and abstracts.

Rank aggregation was also part of the initial objectives of this work, in order to combine the output of the different algorithms; nonetheless, due to some difficulties with the completion of the remaining tasks included in the MSc thesis work, this task could not be addressed in time.

With the results of our experiments we could perform a detailed characterization of the aforementioned social networks, and verify that social network analysis techniques can be used to assess the most influential nodes of a network. As for the prediction of future influence scores, we can conclude that the framework that was developed for academic citation networks provides reliable and accurate estimates, very close to the real values.

A major limitation of this work resides in the evaluation of the results regarding location-based networks. Unlike academic social networks, where one can assess the validity of the most influential authors or articles through an extensive list of renowned scientific awards that have earned prestige throughout the years, location-based social network analysis is a recent area of study: there is not yet a list of characteristics that flawlessly indicates that a user or a spot is influential, nor a series of public prizes that award people, companies or spots for their relevance and influence in a specific context. Therefore, this evaluation had to be done by comparison to well-known state-of-the-art social network analysis metrics. Also, social networks are dynamic, so the set of users or spots that can be considered influential or trendy today might be different if we make the same estimation, under the same conditions, in a couple of months or a year.

6.1 Summary of Results

In brief, the following are the most important contributions of my MSc thesis, according to their relevance:

Crawling software I implemented crawlers to extract data from FourSquare and from Twitter, using their respective APIs. From the data that was collected I built two location-based networks, from which I extracted their most influential nodes. The source code for the FourSquare crawler was made available as an open-source project1, so it can be re-used by others researching this topic.

Implementation and adaptation of the Influence-Passivity (IP) algorithm Having conducted a thorough study of ranking algorithms, with special focus on the PageRank algorithm and its variants, I implemented the Influence-Passivity (IP) algorithm. The originality of this implementation resides in the fact that the network is built in such a way that it only contains user-user arcs, and the weight assigned to each edge depends on the number of spots that the two users have visited in common. This adaptation of IP reflects the fact that in location-based networks information spreads differently than in typical social networks. The code for the implementation of the IP algorithm was made available as an open-source project2, so it can be used and improved by others researching this topic.

1 http://code.google.com/p/fscrawler/
2 http://code.google.com/p/ezgraph/
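The edge-weighting scheme of the adapted IP graph can be sketched as follows (the `visits` mapping and the helper name are illustrative assumptions, not the released implementation):

```python
from itertools import combinations

def user_user_weights(visits):
    """visits: user -> set of visited spot ids.
    Returns weighted user-user edges, where the weight of an edge
    is the number of spots the two users have visited in common."""
    weights = {}
    for u, v in combinations(sorted(visits), 2):
        common = len(visits[u] & visits[v])
        if common > 0:
            weights[(u, v)] = common
    return weights

edges = user_user_weights({"ana": {1, 2, 3}, "bob": {2, 3}, "eve": {9}})
```

Users with no spots in common get no edge at all, so the resulting graph contains only the weighted user-user ties the adapted IP algorithm operates on.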

Academic Citation Network From the already structured DBLP data, organized in the context of the Arnetminer project, I built an academic citation network and extracted its most influential papers through the computation of the PageRank algorithm. The results were validated against an extensive list of renowned scientific awards, leading to the conclusion that the majority of the top-10 highest ranked papers in the network are either authored by recipients of the aforementioned awards, represent breakthroughs or essential textbooks on a specific topic, or are authored by scientists who have collaborated and co-authored with a great number of other scientists.

Framework for prediction of future PageRank scores and future download counts I developed a framework to predict, for a specific year, the future PageRank score and the future download count of a scientific paper, using the academic citation network mentioned in the previous item.

This task was addressed through an ensemble learning regression algorithm, the IGBRT. I also assessed the impact that different features, and combinations of features such as previous PageRank scores or the age of the paper, have on the accuracy of the results. Our predictions were compared to the real PageRank scores and the real number of downloads in the ACM Digital Library for each specific paper and year, and we concluded that in some cases, depending on the combination of features used, adding information can negatively deviate the results, while in others, as we combine more information, the predictions become closer to the real values. Globally, this approach to future PageRank prediction proved to be accurate, with the predicted results very close to the real values.
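As an illustration of how the feature groups used in the experiments can be assembled per paper before being fed to the regressor, consider the following sketch (the field names `pagerank`, `racer`, `author_avg` and `author_max` are hypothetical stand-ins for the thesis's internal representation):

```python
def feature_vector(paper, year, k, use_racer=False, use_author=False):
    """Builds one of the three feature groups: the k previous PageRank
    scores, optionally the k previous Racer scores, and optionally the
    author's average and maximum PageRank scores."""
    feats = [paper["pagerank"][year - i] for i in range(1, k + 1)]
    if use_racer:
        feats += [paper["racer"][year - i] for i in range(1, k + 1)]
    if use_author:
        feats += [paper["author_avg"], paper["author_max"]]
    return feats

paper = {
    "pagerank": {2007: 0.1, 2008: 0.2, 2009: 0.3},
    "racer": {2007: 0.5, 2008: 0.6, 2009: 0.7},
    "author_avg": 0.25,
    "author_max": 0.4,
}
x = feature_vector(paper, year=2010, k=2, use_racer=True, use_author=True)
```

A vector like `x` would then serve as one training or prediction instance for the ensemble regressor, with the paper's real PageRank score (or download count) for the target year as the label.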

6.2 Future Work

In terms of future work, it would be important to address all the tasks that I initially intended to fulfill, namely, to conduct rank aggregation in the aforementioned experiments. It would also be very interesting to find the most influential users and spots in more complete datasets, which could result in a much richer network and subsequent analysis.

Taking advantage of the fact that this research area is still in its infancy, we could combine the work of this MSc thesis with the work of Lima & Musolesi (2012), which adapts well-known local and global social network analysis metrics that are location-agnostic, like degree or clustering coefficient, by giving them a spatial context, e.g., calculating the degree of a node in the network while only considering the friends of this node that are associated with a specific geographical location, such as a city or a state.

Also, because social networks are dynamic networks, i.e., their structure can change over time with the addition or loss of nodes and relationships, we could integrate state-of-the-art frameworks and algorithms in order to include the passage of time in the networks we have studied. Even though dynamic networks have frequently been addressed in terms of network visualization (Demoll & Mcfarland, 2005), works such as that of Berger-Wolf & Saia (2006) break away from conventional network analysis by proposing a mathematical framework for dynamic network analysis.

On the other hand, we could also extend our work with the implementation of the temporal distance metrics proposed by Tang et al. (2009), which can be applied to networks that change over time and allow us to capture the properties of these time-varying graphs, such as the delay, duration and time order of interactions between nodes.

Bibliography

AGARWAL, N., LIU, H., TANG, L. & YU, P.S. (2008). Identifying the influential bloggers in a community. In Proceedings of the 2008 International Conference on Web Search and Web Data Mining.

ANAGNOSTOPOULOS, A., KUMAR, R. & MAHDIAN, M. (2008). Influence and correlation in social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

ANDERSON, L.R. & HOLT, C.A. (1995). Information cascades in the laboratory. American Economic Review, 87.

ARGUELLO, J., BUTLER, B.S., JOYCE, E., KRAUT, R., LING, K.S., ROSÉ, C. & WANG, X. (2006). Talk to me: foundations for successful individual-group interactions in online communities. In Proceedings of the 2006 SIGCHI Conference on Human Factors in Computing Systems.

BAKSHY, E., HOFMAN, J.M., MASON, W.A. & WATTS, D.J. (2011). Everyone's an influencer: quantifying influence on twitter. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining.

BASTIAN, M., HEYMANN, S. & JACOMY, M. (2009). Gephi: An open source software for exploring and manipulating networks. In Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media.

BERBERICH, K., BEDATHUR, S. & WEIKUM, G. (2006). Rank synopses for efficient time travel on the web graph. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management.

BERGER-WOLF, T.Y. & SAIA, J. (2006). A framework for analysis of dynamic social networks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

BEST, D.J. & ROBERTS, D.E. (1975). Algorithm AS 89: The upper tail probabilities of Spearman's rho. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24.

BOLDI, P. & VIGNA, S. (2004). The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web.

BOLDI, P., SANTINI, M. & VIGNA, S. (2005). PageRank as a function of the damping factor. In Proceedings of the 14th International Conference on World Wide Web.

BOLLEN, J., RODRIGUEZ, M.A. & VAN DE SOMPEL, H. (2006). Journal status. Scientometrics, 69.

BOLLEN, J., VAN DE SOMPEL, H., HAGBERG, A. & CHUTE, R. (2009). A principal component analysis of 39 scientific impact measures. Public Library of Science, 4.

BONACICH, P. (2007). Some unique properties of eigenvector centrality. Social Networks, 29.

BONDY, J.A. & MURTY, U.S.R. (1976). Graph Theory with Applications. Macmillan.

BRAUER, A. (1952). Limits for the characteristic roots of a matrix. IV: Applications to stochastic matrices. Duke Mathematical Journal, 19.

BREIMAN, L. (2001). Random forests. Machine Learning, 45.

BRIN, S. & PAGE, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International Conference on World Wide Web.

CHA, M., HADDADI, H., BENEVENUTO, F. & GUMMADI, K.P. (2010). Measuring user influence in twitter: The million follower fallacy. In Proceedings of the 2010 International AAAI Conference on Weblogs and Social Media.

CHEN, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science, 57.

CHEN, P., XIE, H., MASLOV, S. & REDNER, S. (2007). Finding scientific gems with Google's PageRank algorithm. Journal of Informetrics, 1.

CLARK, J. & HOLTON, D.A. (1991). A First Look at Graph Theory. World Scientific.

CONITZER, V. (2006a). Computational Aspects of Preference Aggregation. Ph.D. thesis, Carnegie Mellon University.

CONITZER, V. (2006b). Computing Slater rankings using similarities among candidates. In Proceedings of the 21st National Conference on Uncertainty in Artificial Intelligence.

CONITZER, V. & SANDHOLM, T. (2005). Common voting rules as maximum likelihood estimators. In Proceedings of the 2005 National Conference on Uncertainty in Artificial Intelligence.

CORMEN, T.H., LEISERSON, C.E., RIVEST, R.L. & STEIN, C. (2001). Introduction to Algorithms. The MIT Press, 2nd edn.

DEMOLL, B.S. & MCFARLAND, D. (2005). The art and science of dynamic network visualization. Journal of Social Structure, 7.

DEVEZAS, J., NUNES, S. & RIBEIRO, C. (2011). Using the H-index to estimate blog authority. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media.

DIESTEL, R. (2005). Graph Theory, vol. 173. Springer-Verlag, Heidelberg, 3rd edn.

DING, Y. & CRONIN, B. (2011). Popular and/or prestigious? Measures of scholarly esteem. Information Processing and Management, 47.

DING, Y., YAN, E., FRAZHO, A. & CAVERLEE, J. (2009). PageRank for ranking authors in co-citation networks. Journal of the American Society for Information Science and Technology, 60.

DUTTON, G. (1996). Improving locational specificity of map data - a multi-resolution, metadata-driven approach and notation. International Journal of Geographical Information Science, 10.

EASLEY, D. & KLEINBERG, J. (2010). Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press.

EGGHE, L. (2006). Theory and practise of the g-index. Scientometrics, 69.

EGGHE, L. (2009). Lotkaian informetrics and applications to social networks. The Bulletin of the Belgian Mathematical Society, 16.

FIALA, D., ROUSSELOT, F. & JEŽEK, K. (2008). PageRank for bibliographic networks. Scientometrics, 76.

FRANCK, G. (1999). Essays on science and society: Scientific communication - a vanity fair? Science, 286.

FREEMAN, L.C. (1978). Centrality in social networks: conceptual clarification. Social Networks, 215.

GELLER, C. (2002). Single transferable vote with Borda elimination: A new vote counting system. Tech. rep., Deakin University, Faculty of Business and Law, School of Accounting, Economics and Finance.

GHOSH, R., LERMAN, K., SURACHAWALA, T., VOEVODSKI, K. & TENG, S.H. (2011). Non-conservative diffusion and its application to social network analysis. ArXiv pre-print.

GIBBONS, A. (1985). Algorithmic Graph Theory. Cambridge University Press.

HAGBERG, A.A., SCHULT, D.A. & SWART, P.J. (2008). Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference.

HARARY, F. (1962). The determinant of the adjacency matrix of a graph. Society for Industrial and Applied Mathematics, 4.

HAVELIWALA, T.H. (2002). Topic-sensitive PageRank. In Proceedings of the 11th International Conference on World Wide Web.

HEIDEMANN, J., KLIER, M. & PROBST, F. (2010). Identifying key users in online social networks: A PageRank based approach. In Proceedings of the 31st International Conference on Information Systems.

HIRSCH, J.E. (2010). An index to quantify an individual's scientific research output that takes into account the effect of multiple coauthorship. Scientometrics, 85.

HUBERMAN, B.A., ROMERO, D.M. & WU, F. (2009). Crowdsourcing, attention and productivity. Journal of Information Science, 35.

JOACHIMS, T. (1999). Making large-scale support vector machine learning practical. In Advances in Kernel Methods, MIT Press.

JOACHIMS, T. (2002). Learning to Classify Text Using Support Vector Machines. Kluwer.

KAISER, M. (2008). Mean clustering coefficients: the role of isolated nodes and leafs on clustering measures for small-world networks. New Journal of Physics, 10.

KISELEV, V. (2008). On eligibility by the Borda voting rules. International Journal of Game Theory, 37.

KLEINBERG, J.M. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms.

LEAVITT, A., BURCHARD, E., FISHER, D. & GILBERT, S. (2009). The influentials: New approaches for analyzing influence on twitter. Webecology Project.

LEBANON, G. & LAFFERTY, J.D. (2002). Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning.

LI, H. (2011). Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool Publishers.

LIMA, A. & MUSOLESI, M. (2012). Spatial dissemination metrics for location-based social networks. In Proceedings of the 4th ACM International Workshop on Location-Based Social Networks (LBSN 2012), colocated with ACM UbiComp 2012.

LIU, T.Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3.

LIU, X., BOLLEN, J., NELSON, M.L. & VAN DE SOMPEL, H. (2005). Co-authorship networks in the digital library research community. Information Processing and Management, 41.

LOTKA, A.J. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Science, 16.

LUCIANO, RODRIGUES, F.A., TRAVIESO, G. & BOAS, V.P.R. (2005). Characterization of complex networks: A survey of measurements. Advances in Physics, 56.

LUCIANO, RODRIGUES, F.A., TRAVIESO, G. & BOAS, V.P.R. (2006). Characterization of complex networks: A survey of measurements. Advances in Physics, 56.

MACSKASSY, S.A. & PROVOST, F. (2007). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8.

MCPHERSON, M., SMITH-LOVIN, L. & COOK, J.M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27.

MIHALCEA, R. (2004). Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the 2004 Annual Meeting of the Association for Computational Linguistics.

MILLEN, D.R. & PATTERSON, J.F. (2002). Stimulating social engagement in a community network. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work.

MOHAN, A., CHEN, Z. & WEINBERGER, K.Q. (2011). Web-search ranking with initialized gradient boosted regression trees. Journal of Machine Learning Research - Proceedings Track, 14.

NEWMAN, M.E.J. (2003). A measure of betweenness centrality based on random walks. Social Networks, 27.

NEWMAN, M.E.J. (2004). Analysis of weighted networks. Physical Review E, 70.

OLIVER, J.J. & HAND, D.J. (1995). On pruning and averaging decision trees. In Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann.

PAGE, L., BRIN, S., MOTWANI, R. & WINOGRAD, T. (1998). The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference.

PAPAGELIS, M., BANSAL, N. & KOUDAS, N. (2009). Information cascades in the blogosphere: A look behind the curtain. In Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media.

PERRA, N. & FORTUNATO, S. (2008). Spectral centrality measures in complex networks. Physical Review E, 78.

PROCACCIA, A.D., ZOHAR, A. & ROSENSCHEIN, J.S. (2006). Automated design of voting rules by learning from examples. In Proceedings of the 1st International Workshop on Computational Social Choice.

REKA, A. & BARABÁSI, A.L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74.

ROMERO, D.M., GALUBA, W., ASUR, S. & HUBERMAN, B.A. (2011). Influence and passivity in social media. In Proceedings of the 20th International Conference Companion on World Wide Web.

SAYYADI, H. & GETOOR, L. (2009). FutureRank: Ranking scientific articles by predicting their future PageRank. In Proceedings of the 2009 SIAM International Conference on Data Mining.

SHANNON, P., MARKIEL, A., OZIER, O., BALIGA, N.S., WANG, J.T., RAMAGE, D., AMIN, N., SCHWIKOWSKI, B. & IDEKER, T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13.

SIDIROPOULOS, A. & MANOLOPOULOS, Y. (2005). A citation-based system to assist prize awarding. ACM SIGMOD Record, 34.

SIDIROPOULOS, A., KATSAROS, D. & MANOLOPOULOS, Y. (2007). Generalized Hirsch h-index for disclosing latent facts in citation networks. Scientometrics, 72.

SZABO, G. & HUBERMAN, B.A. (2010). Predicting the popularity of online content. Communications of the ACM, 53.

SZALAY, A.S., GRAY, J., FEKETE, G., KUNSZT, P.Z., KUKOL, P. & THAKAR, A. (2007). Indexing the sphere with the hierarchical triangular mesh. Technical report.

TANG, J., MUSOLESI, M., MASCOLO, C. & LATORA, V. (2009). Temporal distance metrics for social network analysis. In Proceedings of the 2nd ACM Workshop on Online Social Networks.

WALKER, D., XIE, H., YAN, K.K. & MASLOV, S. (2007). Ranking scientific publications using a simple model of network traffic. Journal of Statistical Mechanics.

WATTS, D.J. & DODDS, P.S. (2007). Influentials, networks, and public opinion formation. Journal of Consumer Research, 34.

WATTS, D.J. & STROGATZ, S.H. (1998). Collective dynamics of 'small-world' networks. Nature, 393.

WENG, J., LIM, E.P., JIANG, J. & HE, Q. (2010). TwitterRank: finding topic-sensitive influential twitterers. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining.

WU, F., WILKINSON, D.M. & HUBERMAN, B.A. (2009). Feedback loops of attention in peer production. In Proceedings of the 2009 International Conference on Computational Science and Engineering.

XIA, L., LANG, J. & MONNOT, J. (2011). Possible winners when new alternatives join: New results coming up. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems.

XING, W. & GHORBANI, A. (2004). Weighted PageRank algorithm. In Proceedings of the 2004 Annual Conference on Communication Networks and Services Research.

YAN, E. & DING, Y. (2011). Discovering author impact: A PageRank perspective. Information Processing and Management, 47.

YANG, J. & COUNTS, S. (2010). Predicting the speed, scale, and range of information diffusion in twitter. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media.

YOUNG, H.P. (2009). Innovation diffusion in heterogeneous populations: Contagion, social influence, and social learning. American Economic Review, 99.

ZHANG, C.T. (2009). The e-index, complementing the H-index for excess citations. Public Library of Science, 4.

ZHENG, Y. & ZHOU, X., eds. (2011). Computing with Spatial Trajectories. Springer.


Appendix A

Important Awards in Computer Science

The following renowned awards were used as ground-truth lists in the task of assessing the validity of the PageRank scores obtained for the DBLP dataset:

• A. M. Turing Award1

• Knuth Prize2

• IEEE John von Neumann Medal3

• IEEE Emanuel R. Piore Award4

• ACM SIGMOD Edgar F. Codd Innovations Award5

• ACM SIGMOD Best Paper Award6

• ACM SIGMOD Test of Time Award7

• ACM Software System Award8

• ACM Innovation Award9

• National Science Foundation Presidential Young Investigator Award10

1 http://amturing.acm.org/
2 http://www.sigact.org/Prizes/Knuth/
3 http://www.ieee.org/about/awards/medals/vonneumann.html
4 http://www.ieee.org/about/awards/tfas/piore.html
5 http://www.sigmod.org/sigmod-awards/sigmod-awards#innovations
6 http://www.sigmod.org/sigmod-awards/sigmod-awards#bestpaper
7 http://www.sigmod.org/sigmod-awards/sigmod-awards#time
8 http://awards.acm.org/homepage.cfm?srt=all&awd=149
9 http://www.sigkdd.org/awards_innovation.php
10 http://www.nsf.gov/awards/presidential.jsp

• SIGIR Gerard Salton Award1

1 http://www.sigir.org/awards/awards.html
