Analysis of Online Social Network Connections for Identification Of

Analysis of Online Social Network Connections for Identification of Influential Users: Survey and Open Research Issues MOHAMMED ALI AL-GARADI, KASTURI DEWI VARATHAN, SRI DEVI RAVANA, EJAZ AHMED,GHULAM MUJTABA, UNIVERSITY OF MALAYA, KUALA LUMPUR, MALAYSIA MUHAMMAD USMAN SHAHID KHAN, COMSATS INSTITUTE OF IT, PAKISTAN SAMEE U. KHAN, NORTH DEKOTA STATE UNIVERSITY, USA

Online social networks (OSNs) are structures that help users to interact, exchange, and propagate new ideas. The identification of the influential users in OSNs is a significant process for either accelerating the propagation of information that includes marketing applications or hindering the dissemination of unwanted contents, such as viruses, negative online behaviors, and rumors. The present paper presents a detailed survey of influential users’ identification algorithms and their performance evaluation approaches in OSNs. The survey covers recent techniques, applications, and open research issues on analysis of online social network connections for identification of influential users.

KEYWORDS Influential users; OSNs; Complex Networks; Identification Algorithms; Social Media; Big Data

1 INTRODUCTION Online social networks (OSNs) are dynamic social interaction platforms for billions of users worldwide. Information and ideas are rapidly disseminated among these users through online social interactions. The online interactions among OSN users generate a huge volume of data that provides the opportunity to study human behavioral patterns (Ratkiewicz et al., 2011). An in-depth investigation of OSNs is important to enhance the understanding of the relationships among people and to help address several issues on society and sociality. The literature on information dissemination has shown that only a few influential people have observable qualifications that shape the opinion of a large population (Katz and Lazarsfeld, 1955). The identification of influential users holds tremendous practical importance, and it has recently attracted considerable attention (Pei et al., 2014a; Kitsak et al., 2010). Targeting influential users is vital for designing techniques for either accelerating the propagation of information in marketing applications (Richardson and Domingos, 2002; Subramani and Rajagopalan, 2003) or hindering the dissemination of unwanted contents (e.g., viruses, online negative behaviors, and rumors) (Zhao et al., 2011; Kwon et al., 2013; Fire et al., 2014). Moreover, the analysis of influence patterns is significant to comprehend the rapid adoption of specific trends. Influence patterns are beneficial for advertisers to implement highly effective campaigns. Numerous recent studies have been published on the identification of influential users in social networks. The search for the best set of influential users is a many-body problem that depends on the topological interactions among users (Altarelli et al., 2013; Altarelli et al., 2014; Morone and Makse, 2015b). Influential users are generally identified by their ranking relative to topological measures. Therefore, an effective topological measurement algorithm is fundamental in identifying influential users in OSNs. However, the comparison of published techniques is challenging due to the unavailability of enough social network data for most OSNs (Pei et al., 2014a). Moreover, the consensus on the best influential users identification algorithm is also lacking in literature (Pei et al., 2014a).

Page 1 of 34

1.1 Social Network Analysis A social network is generally assumed to be viewed as graphs, where vertices denote users and edges represent relationships among users. The importance of a user in a network can be calculated by using metrics imported from graph theory (Mislove et al., 2007; Jiang et al., 2013; Barabási and Albert, 1999; Freeman, 1977; Faust, 1997). These metrics rely on network topology. For example degree centrality pertains to the quantity of edges linked to a vertex. This centrality is often interpreted as a node’s immediate risk of catching viruses or may be used for enhancing information that flows through a network for marketing strategies (Mislove et al., 2007; Jiang et al., 2013; Barabási and Albert, 1999). Social network influence pertains to the ability of a user to change the feelings, attitudes, or behaviors of other users in a network (Easley and Kleinberg, 2010). The network connection among users helps them spread influence. The strength of the link between the two nodes of a network relies on the overlap of their neighborhoods (Granovetter, 1973). Influential individuals tend to be significantly connected with a larger number of groups compared with normal individuals. However in OSNs this condition may not be always applicable to identify influential users (Cha et al., 2010). The identification of influential users within social networks has a wide range of applications, from blocking the spread of virus and disease to maximizing the diffusion of ads and marketing campaigns (Song et al., 2006; Song et al., 2007b). Several studies have assumed that in a real social network, influential users are the most important nodes for spreading influence (Goyal et al., 2010; Song et al., 2007a; Song et al., 2007b). However, Watts and Dodds (2007) argue that instead of influential users, the critical mass of easily influenced users is the main cause of fast information diffusion in the OSNs. Therefore, the measurement of influence and influential users is of significant concern from both analysis and design perspectives. This survey intensively analyzes cutting edge influential user identification techniques in OSNs. The survey focuses on investigating validation approaches to evaluate the performances of different influential user identification techniques. Recent studies on the influence maximization in OSNs are also discussed. Furthermore, taxonomy of influential spreader identification algorithms in OSNs is devised for classification. The commonalities and deviations in such studies are simultaneously examined based on identification algorithms, metrics used, types of network, and evaluation models. The paper concludes by identifying the major issues related to the identification of influential users in OSNs. The taxonomy of these issues is devised according to network-related issues, efficiency of identification algorithm-related issues, validation-related issues, understanding of the role of influential users in OSN-related issues, and user privacy-related issues. These issues are presented as future directions to guide researchers in identifying ideas for further investigations. The introduction is presented in Section 1, and Section 2 discusses existing surveys. Section 3 describes the influential user identification algorithms in OSNs. In Section 4, the performance evaluations of different influential user identification algorithms are presented. Section 5 reviews the influence maximization in OSNs. Section 6 devises the taxonomy of the identification of influential spreader studies in OSNs. Section 7 presents the applications of influential spreader identification in the OSNs context and related opportunities. Section 8 discusses the identified major open research issues related to the identification of influential users in OSNs. Section 9 concludes the paper.

2 RELATED SURVEYS Studies on the identification of influential users in the OSN context have been increasing; however, relatively few surveys have been conducted in this specific area. The related published surveys can be categorized in two: information diffusion in OSNs (Guille et al., 2013) and user influence measurement (Probst et al., 2013; Riquelme and González-Cantergiani, 2016). Guille et al. (2013) have focused on information diffusion in OSNs and its properties. Moreover, the authors have studied the approaches to identify important topics in OSNs using information diffusion properties and discusses the techniques used to model information diffusion. Probst et al. (2013) have focused on exploring the different characteristics of influential users that are used to measure user influence. Riquelme and González-Cantergiani (2016) have conducted a survey on user influence measurement on Twitter, which summarizes how activity, popularity, and user influence are measured on Twitter. However, our work covers the classification of state-of-the-art identification influential user algorithms in OSNs and a devisal of a thematic taxonomy. Furthermore, this paper investigates current

Page 2 of 34 validation approaches for evaluating the performances of different influential user identification algorithms and current studies on influence maximization in OSNs. The survey includes various applications and identifies major open research issues. This review is conducted based on the studies obtained from Web of Science, Scopus databases, and Google Scholar using keywords related to its scope, that is, “the identification of influential users in OSNs.” The first part of the keywords (queries) is related to spreaders, such as influential users, influential spreaders, super-spreaders, and influential nodes. The second part of the keywords is related to OSN websites, such as social media, online social network, OSNs, Twitter, Sina Weibo, YouTube, and LinkedIn. The criteria used to include studies in this survey is as follows: English articles and articles published between 2004 and 2017 that primarily introduce algorithms or methods for the identification of influential users in the context of OSNs. This review excludes studies based on the exclusion criteria as follows: Studies not mainly intended to use OSN networks or studies not mainly designed to identify influential users (spreaders). Moreover, anchor papers of complex networks are considered to support or contradict the reviewed papers although they are not directly applied to OSNs.

3 IDENTIFICATION OF INFLUENTIAL USERS IN OSNS Influential users were initially identified as users with numerous friends or followers. However, the concept recently evolved, and researchers began to use several factors related to network structure and user content along with number of friends in order to identify influential users (Morone and Makse, 2015b; Räbiger and Spiliopoulou, 2015; Xiao et al., 2013). Moreover, the characteristics of OSNs differ from that of traditional social networks. Considering the above-mentioned aspects, the next section explores in detail the most prominent techniques used to identify influential spreaders in the OSN context. The drawbacks of each approach are also outlined. In this section, the following two questions are addressed: 1. What are the different state-of-the-art techniques widely used to identify influential users in OSNs? 2. What are the weaknesses and issues of the state-of-the-art influential user identification techniques in OSNs?

3.1 Influential User Identification Algorithms in OSNs User influence is measured based on various factors and by using various techniques. This section investigates state- of-the-art approaches on the influence measurement and detection of influential users in the OSN context. The classification of these identification methods is based on their working principles (Liao et al., 2017; Lü et al., 2016).

3.1.1 Local Measures. Local measures, including degree, are based on local metrics such as the number of links for computing influential users in networks. Local measure is calculated through the direct neighborhood of a given user. Local measure such as degree centrality pertains to the quantity of links connecting a node, wherein a high-degree node is assumed as the authority for the largest information dissemination (Albert et al., 2000; Pastor-Satorras and Vespignani, 2001). A network can be either directed or undirected. In-degree centrality pertains to the number of edges that connect to the node, whereas out-degree centrality denotes the number of edges that originate from the node. In directed networks, in-degree centrality usually refers to the popularity of a user, whereas out-degree centrality typically indicates the sociality of a user (Mislove et al., 2007; Jiang et al., 2013). In the OSN context, degree counts refer to the size of the audience for users, the number of social relationships, or the amount of interactions. Several studies have used the degree measure to identify the most influential users in OSNs (Cha et al., 2010; Kim and Han, 2009; Romero et al., 2011; Bakshy et al., 2011). However, the degree measure alone cannot accurately reveal the influence of users, and users with a high degree are not necessarily considered as influential (Cha et al., 2010).

3.1.2 Short Path-Based Measures. Short path-based measures consider all the shortest paths that go through a user in a network to calculate the user’s influence and importance in the network. Algorithms based on short paths such as Closeness, Betweenness, and Katz algorithms are discussed in following subsections.

Closeness centrality: In a social network, Closeness centrality calculates the closeness of a node to all remaining nodes in the graph. Closeness is calculated using the distance of the average shortest path from a node to all remaining

Page 3 of 34 nodes in the graph. To any user, the closeness centrality value is reciprocal to the average geodesic distance to all other users (Bavelas, 1950; Freeman, 1979).

Betweeness centrality: Betweenness centrality of a user can be calculated by the counts of the shortest paths that pass through that user. The betweenness centrality has been used for identifying important users in OSNs. For example, Catanese et al. (2012) have applied betweenness centrality to data graphs from Facebook in order to identify the central nodes of the network.

Katz Algorithm: Katz centrality (Katz, 1953) calculates influence of users (nodes) by taking into consideration all network paths. The Katz centrality’s influence on the node is determined by all network links that pass through the node. Katz centrality differs from Closeness centrality in which only the short path lengths are considered among the nodes; Katz centrality considers all network links (Lü et al., 2016). The main difference of Katz centrality from Closeness centrality is that the former assigns a certain minimum score to every user in a network (Liao et al., 2017). Katz centrality is built with a good mathematical assumption for analyzing networks and identifying important nodes. However, Katz centrality has high computational complexity, which limits its application in large networks (Lü et al., 2016).

3.1.3 Iterative Calculation-Based Measures. Iterative calculation-based measures do not only consider the direct links among users but also the accounts for all network links to calculate user influence. In Iterative calculation-based measures, each network user contributes its ranking value to its connected users and obtains new updated scores from connected users in each iteration round. This process is iterated until each user reaches a steady state. Commonly used Iterative calculation-based measures in OSNs, such as Eigenvector centrality and PageRank-like algorithms, are discussed in the following subsections.

Eigenvector centrality: In Eigenvector centrality, not only the number of links (connected users) but also the influence score of connected users is used to calculate user influence. Consequently, in Eigenvector centrality, users receive high scores if they are connected to important users (users with high influence scores). User influence is proportional to the sum of the influence score of the users to which it is connected (Bonacich, 2007). Previous studies have used Eigenvector centrality to identify influential users in a graph (He, 2007; Duda et al., 2012; Borgatti and Everett, 2006).

PageRank-like algorithms: PageRank is a famous Google algorithm for ranking web pages introduced by Brin and Page (2012). PageRank is a global ranking of web pages, irrespective of content and constructed solely on the connected edges on a web graph. The algorithm iteratively calculates the value of a node and depends on the main metrics and connected link counts, as well as the PageRank score of all connected links. The PageRank algorithm is used in various applications such as finding an important node in a graph. The algorithm is characterized by simple assumptions, direct implementation, and comparatively low time complexity. Therefore, many studies are motivated to apply PageRank to recognize critical influential users in numerous practical situations. The PageRank algorithm is expressed as follows:

푃 푅(푢) = (1 − 푑)/푁 + 푑 ∑ 푃푅(푣)/퐿(푣) (1) 푣∈푀(푢) where N is the count of users in the network, L (v) is the count of the out-degree links from a user v, M (u) is the user in OSNs indicating or connecting a user u, and d is a damping factor (Brin and Page, 2012).

The study (Yin and Zhang, 2012) in the Sina Microblog proposes a user interaction model that considers the personal characteristics of users, such as their level of activity and readiness to retweet, in order to calculate user influence (influence score) using a model with a weighted PageRank algorithm. Moreover, this research aims to measure the influence between two users (pairwise influence) rather than measuring global influence (user influence in an entire

Page 4 of 34 network). The characteristics of OSNs differ from those of traditional web pages. Therefore, some PageRank variants have been designed to improve the identification of the influential users in OSNs.

TunkRank (Tunkelang, 2009): TunkRank algorithm is a variant of PageRank. The algorithm assumes that the influence of a user is evenly distributed among his/her followers. Therefore, the probability of a user reading any user’s tweet from the following list is equal to one out of the total number of users in the same list. The preceding concept is mathematically represented as: 1 + 푝 ∗ 퐼푛푓푙푢푒푛푐푒(푗) 퐼푛푓푙푢푒푛푐푒(푖) = ∑ ‖푓표푙푙표푤푖푛푔(푗)‖ (2) 푗 Follower of 푖 where 푝 is the probability that user (푖) is going to retweet user (푗)’s tweets, and ||following (j)|| represents the total number of users followed by user j.

TURank (Yamaguchi et al., 2010): The TURank algorithm studies the relationship among users along with the users’ posts. The TURank algorithm works on the relationship graph network of user-to-tweets. The network models the information flow in Twitter and calculates the ranking scores of users. The ranking algorithm is constructed based on several observations appeared in the study (Yamaguchi et al., 2010). The most important observations are as follows: an influential user is followed by many other influential users, a tweet retweeted by many influential users tends to be a valuable tweet, and a user who posts many valuable tweets is likely to be an influential user.

TwitterRank (Weng et al., 2010): The TwitterRank algorithm uses the topics discussed on Twitter along with the network structure to rank user influence on Twitter. TwitterRank computes each user’s topic distribution based on tweet contents using latent dirichlet allocation (Blei et al., 2003; Griffiths and Steyvers, 2004). The topic distribution helps to compute specific topic similarity scores between each pair of users. TwitterRank subsequently iteratively measures user influence on a topic as follows:

푇⃗⃗⃗⃗푅⃗⃗⃗푡 = 훾푝푡 × 푇⃗⃗⃗⃗푅⃗⃗⃗푡 + (1 − 훾)퐸푡 (3) where 푇푅푡 is the user influence on topic t, and 퐸푡 is the teleportation vector that represents the probability that a user will find another user without following the edges of the user-to-user connectivity graph. Moreover, in Equation 3, damping factor 훾 is the value between 0 and 1 to control teleportation probability, and 푝푡 is the transition probability. The transition probability in TwitterRank is defined as:

(i , ji )( j , ) T. tweets j pt simt  T. tweets a (4) a: i follows a where T.tweets j is the total number of tweets by user 푗 , ∑푖 푓표푙푙표푤푠 푎 푇푡푤푒푒푡푠푎 is the count of tweets by all friends of user 푖 , and 푆푖푚 푡(푖, 푗) is the topic t similarity between users 푖 and 푗 .

LeaderRank (Lü et al., 2011): The LeaderRank algorithm introduces the ground node concept. Ground node is connected to each node in the network using bidirectional links. The ranking process assigns one-unit prestige to all the nodes in a network except the ground node. The unit prestige of the nodes is further evenly distributed to neighboring nodes through direct links. The process continues similar to a random walk for a directed network until a steady state is reached. The authors claimed that the proposed LeaderRank algorithm has numerous advantages over PageRank algorithm. Furthermore, the convergence rate of the LeaderRank algorithm is faster than that of PageRank in a strongly connected network. Moreover, the algorithm is more tolerant of noisy data, such as spurious and missing links and is applicable to any type of network. Furthermore, the algorithm is robust against spammers.

The weighted LeaderRank (Li et al., 2014) is an improved variant of the LeaderRank algorithm. The improvement is achieved by making the ground node more biased toward nodes with more fans using a biased random walk. The weighted LeaderRank can identify the increase of influential users. The algorithm is relatively more tolerant of noisy data and more robust against intentional attacks than the original LeaderRank algorithm.

Page 5 of 34

InfluenceRank (Chen et al., 2012b): The InfluenceRank technique is calculated using two models. The first model measures the users’ relative influence, and the second model calculates the user network global influence. The users’ relative influence model is calculated based on three factors: quality of tweet, ratio of retweets, and topic similarity among users. The model can be mathematically presented as follows:

푅푒푙푎푡푖푣푒 퐼푛푓푙푢푒푛푐푒 푅퐼 (푣푖, 푣푗) = 푄푣푖 + 푅 (푣푖, 푣푗) + 푠푖푚(푣푖, 푣푗) (5) where 푅푒푙푎푡푖푣푒 퐼푛푓푙푢푒푛푐푒 푅퐼 (푣푖 , 푣푗) is the influence of user 푣푖 on user 푣푗, and 푄푣푖 is the quality of tweet and is calculated as: 푛푢푚푏푒푟 표푓 푟푒푡푤푒푒푡푠 (푣푖) + 푛푢푚푏푒푟 표푓 푐표푚푚푒푛푡푠(푣푖) 푄 = (6) 푣푖 푇표푡푎푙 푛푢푚푏푒푟 표푓 푡푤푒푒푡푠 (푣푖)

In Equation (5), 푅 (푣푖, 푣푗) refers to the retweet ratio of from users 푣푗 to 푣푖 and is calculated as:

푅푒푡푤푒푒푡푠 (푣푖, 푣푗) 푅 (푣푖, 푣푗) = (7) 푅푒푡푤푒푒푡푠 (푣푗) The function 푠푖푚(푣푖, 푣푗) in Equation (5) is the topic similarity between users 푣푖 and 푣푗. The user network global influence model of the InfluenceRank technique is a PageRank-like algorithm. The model is calculated by replacing the Relative Influence RI value into the PageRank algorithm. The model is mathematically calculated as follows:

|퐹표푙푙표푤푒푟푠 (푣)| 푅퐼 (푣푖, 푣푗) × 퐼푛푓푙푢푒푛푐푒(푣푗) 퐼푛푓푙푢푒푛푐푒(푣) = (1 − 휆) + 휆 × ∑ (8) 푁 퐹표푙푙표푤푖푛푔(푣푗) 푣푗 ∈퐹표푙푙표푤푒푟푠 (푣푖) where 휆 is the damping factor. The only difference between the above-represented method and the original PageRank algorithm is the scoring value. The scoring value in the InfluenceRank technique is unequally distributed to all followers. Moreover, the technique uses a biased random walk, and the scoring value is determined by the푅푒푙푎푡푖푣푒 퐼푛푓푙푢푒푛푐푒 (푅퐼 (푣푖, 푣푗)).

InfRank (Jabeur et al., 2012): InfRank is a PageRank-like algorithm that models OSNs to identify influential users and leaders. The study measures the users’ influence by initially measuring the ability to spread information in the network (i.e., by having a high number of retweets) and subsequently determining good influential users in the retweet list (i.e., their tweets are retweeted by influential users). The InfRank technique constructs a graph network using users as nodes and retweets as edges between users. The edge is constructed between users 푣푖 and 푣푗 if at least one tweet of

푢푠푒푟 푣푗 is retweeted by 푢푠푒푟 푣푖. The user-retweet graph has advantages and disadvantages compared with the user- follower graph. The former represents a strong social connection because the user can follow another without ever retweeting them and can retweet without following the author of the original tweet. However, the user-retweet graph is relatively sparser than the user-follower graph. Moreover, the InfRank technique distributes the ranking score retweet edges in terms of weights. Weighted edge is calculated as follows:

|푡표푡푎푙 푛푢푚푏푒푟 표푓 푡푤푒푒푡푠 푏푦 푣푗 ∩ 푡표푡푎푙 푛푢푚푏푒푟 표푓 푟푒푡푤푒푒푡푠 푏푦 푣푖 | 푤 (푣푖, 푣푗 ) = (9) |푡표푡푎푙 푛푢푚푏푒푟 표푓 푡푤푒푒푡푠 푏푦 푣푖 |

SpreadRank (Ding et al., 2013): The SpreadRank method is a variant of the PageRank method in microblogs. This method constructs a user-retweet network similar to that of the InfRank technique. User-retweet network edges are weighted by a unique retweet weight. The reweights of the edge is computed as the ratio of retweet counts to tweets. Moreover, the time interval of retweets has a significant importance in the SpreadRank method. The author argues that the faster the tweets are retweeted, the higher the diffused rate. The study uses the location of users in information cascades to measure the teleport vector that indicates that a location closer to the root (main tweets) will obtain higher scores. The influence transition from users 푣푖 to 푣푗 is calculated as follows: ∑푟푖푗 푓(푡푖푗) 푃(푣푗, 푣푖) = (10) 푙표푔푇푗 where (푡푖푗) is the time interval of the retweets, and 푇 푗 is the number of tweets by user 푣푗.

Page 6 of 34

ProfileRank (Silva et al., 2013): The ProfileRank model is inspired by the PageRank algorithm. This model introduces an integrated view of user influence and content relevance in information diffusion. ProfileRank identifies influential users by measuring the users’ abilities to create and propagate relevant content for a significant portion of the community. The algorithm is computed by the random walks on a user-content bipartite network. The aforementioned PageRank-like techniques are summarized in Table 1 along with the methodologies, objectives, input parameters, network types, and weight attributes required for the respective technique.

Table 1. Comparison of PageRank-like algorithms

Criteria Methodology Objective Input Parameters Network Weight / PageRank- Type like algorithm

TunkRank TunkRank measures user Both attention distribution and Number of User–follower The edge is weighted by (Tunkelang, influence as the expected retweeting probability are followers and network the introduced ( 푝), the 2009) number of users who will considered to recursively probability that constant probability that read a tweet. measure user influence. users will retweet. users retweet a tweet.

TURank TURank constructs a A user, who is followed by Follower, tweets, User–tweet The weights of edges are (Yamaguchi et user–tweet network in many influential users and is and retweet counts. network calculated as 푤(푒) = al., 2010) which users, their tweets, likely to be an influential user is 푤(푒푠) , and followers are linked considered in recursively 푂푢푡퐷푒푔(푢,푒푒푠) with corresponding edges. measuring user influence. where (푒푠) ∈ 퐸푠 is the edge of the same type as 푒, and 푂푢푡퐷푒푔(푢, 푒푠) is the count of outgoing edges of type 푒푠 from user 푢. TwitterRank TwitterRank uses both The topic similarity between Number of User–follower The edge is weighted by (Weng et al., network structure and users and network structure is followers and topic network the topic similarity of 2010) topic similarity in considered to recursively similarity. users. calculating user influence. measure user influence.

LeaderRank LeaderRank introduced a LeaderRank should be more Number of fans and Fan–leader– No weight is applied. (Lü et al., 2011) ground node that is tolerant of noisy data and robust ground node ground node connected to each node in against spammers to propose (bidirectional links network the network using algorithms that can effectively to every node in the bidirectional links. quantify user influence. network).

Weighted The algorithm uses a The ground node toward the Number of fans and Fan–leader– The weights are set to LeaderRank biased random walk nodes that are more popular ground node ground node the nodes according to (Li et al., 2014) instead of that used in the (with more fans) is biased to (bidirectional links network the node’s in-degree (i.e., original LeaderRank to improve the original to every node in the fan count). make the ground node LeaderRank. network). more biased toward nodes with more fans.

The algorithm works in A user’s relative influence is Number of Follower The edge is weighted by Infl two steps: the first step considered to recursively followers, quality of network a user’s relative uenceRank(Ch involves measuring a user measure users’ influence. followers, quality of influence. en et al., 2012b) relative influence, and the tweet, retweet ratio, second step involves and topic similarity. measuring user network global influence.

Page 7 of 34

InfRank The study measures user InfRank measures user influence Retweets count and Retweet The edge is weighted by (Jabeur et al., influence by initially to identify influencers, leaders, number of network the retweet ratio. 2012) measuring their ability to and discussers in online social influential users in spread information in the network microblogs. the retweet list. network and subsequently determining the good influential users in the retweet list. SpreadRank SpreadRank constructs a Both weights and time interval Retweet count Retweet Edges are weighted by (Ding et al., network using users as a of retweets are considered to and time interval of network unique retweet weight, 2013) node, and an edge exists recursively measure user retweets. i.e., ratio of number of between users 푣푖 and 푣푗, influence. retweets to tweets. when at least one tweet of user 푣푗 is retweeted by 푢푠푒푟 푣푖.

ProfileRank This algorithm measures User influence and content Users and content User–content Random walks are on a (Silva et al., the ability of users to relevance is considered to bipartite user–content bipartite 2013) create and propagate a recursively measure user directed network. relevant content to a influence. network significant portion of the community

3.1.4 Coreness-based Measures. Coreness-based measure (also known as the k-core decomposing method) assumes that the location of a user is more important than its direct connection in calculating its diffusion of influence (Kitsak et al., 2010). The k-core depends on decomposition processes of the network (Dorogovtsev et al., 2006). The decomposition prunes the network down to the k-core. The k-core is the maximal connected sub-graph of an original graph network. In the k-core decomposition processes, all nodes with degrees less than the k are repeatedly removed. The procedure is initiated on users with a degree one. Users with a degree one are assigned to the 1-shell. All users with degree 푘 = 1 are first eliminated, and pruning processes will persist till no user with 푘 = 1 exists. This process is similarly applied to the following k-shells and continues until the cores of the network are identified (Batagelj and Zaversnik, 2003). The k-shell methods are effective techniques for finding influential users in complex networks. The network’s core contains influential nodes. Therefore, influential nodes are successfully identified by the k-shell decomposition method (Kitsak et al., 2010). The influential users’ existence in the network’s core confirms that highly connected users correspond to each other. Recently, many improved variants of k-core algorithms are proposed. In (Zeng and Zhang, 2013), the mixed degree decomposition is used in overcoming limitations related to the original k-shell decomposition, such as considering only the residual degree (i.e., links between the remaining nodes) and entirely ignoring the exhausted degree (i.e., links connected to the removed nodes) (Zeng and Zhang, 2013). In other studies (Garas et al., 2012; Wei et al., 2015) the k-shell is adapted to consider weighted complex networks. However, the algorithms only considered weight as the degree of connected nodes, e.g., a node connected to high degree nodes receives high weight. Therefore, in these studies, only the degrees of nodes are considered, and the quantity of interactions is ignored. However, in the OSN context, the degree of nodes infrequently reflects user influence (Cha et al., 2010). Another study improved the original k-core by considering the interaction between the users as a weight for links (Al-garadi et al., 2017). This study is based on that the interaction is an important element in spreading of information. This study only consider single network (social network) to represent the relationship between the users which is weighted by the interaction between these users. However in the reality, the users can still interact even if there is no social network (following or follower relation) connections between them through common friends, therefore multilayer network representation of OSNs is required to be explored to deeply understand the relationships between the users and effectively represent the different connections. The unavailability of complete network data distinguishes OSNs from many complex systems and prevents the direct validation and comparisons of the efficiencies of user influence measuring techniques. Pei et al. (Pei et al., 2014a) applied different user influence measurements in identifying influential users in OSNs and noted that k-core outperformed the PageRank and degree (Pei et al., 2014a). In another research (Feng, 2011), k-shell algorithm was modified to measure user influence in Twitter. The authors used the Logarithmic mapping technique to find the k-

Page 8 of 34 core. In this technique, each k-shell level roughly represents the log value of the count of analyzed connections. The technique decomposed the network in a manner that nodes with a degree 2푘 − 1 or less are placed in a similar k-level. Authors showed that the modified k-shell methods more effectively and rapidly identify a small group of users than the original method. However, the results were validated only against Twitter usage data (i.e., average tweets/retweets from Twitter usage data). Several studies have shown the plausible circumstances that influential users do not correspond to Twitter usage data (Cha et al., 2010).

3.1.5 Machine Learning Algorithms. Learning approaches use machine learning algorithms to predict influential users, and the most common of which is supervised learning (Hinton et al., 2006). Naïve Bayes, support vector machine, and decision tree are the most common supervised learning techniques. An effective learning approach requires a robust set of features that can provide important discriminative power to better predict results. Learning approaches require sufficient datasets in training and testing the machine learning model. The majority of the studies on predicting influential users (Mei et al., 2015; Liu et al., 2014; Bigonha et al., 2012; Chai et al., 2013; Xiao et al., 2013; Cossu et al., 2015) have proposed important features instead of any specific algorithms to improve the overall prediction model. Predicting influential users in OSNs generates controversial discussions on the selection of specific features that effectively predict user influence. Basic and direct features, such as follower, retweet, and tag counts are used to predict influential users. However, Mei et al. (Mei et al., 2015) shows that aside from direct features, other effective features, such as number of public lists, new tweets, and followers’ ratio to friends can predict influential users. In another research (Liu et al., 2014), several features are extracted to train a support vector machine (SVM). Features use three different means of aggregation, namely, score-, list-, and SVM-based aggregation (Liu et al., 2014). Another approach combines user location in a network, user opinion polarity, and tweet quality to obtain a combined influence score (Bigonha et al., 2012). Moreover, a logistic regression analysis was applied to identify significant features for predicting user influence (Xiao et al., 2013). These features were used to train and compare four machine learning algorithms. The ACQR framework was proposed in (Chai et al., 2013). This framework extracts a set of features that are considered discriminatory attributes in identifying efficient users in OSNs. Features are derived from four aspects, namely, activeness, centrality, quality of post, and reputation. The four features are subsequently used to train the SVM. Cossu et al. (2015) investigate a large selection of traditional features, such as features based on user activity, local topology, stylistic aspects, tweet characteristics, and occurrence-based term weighting to identify influential users in OSNs. This study concludes that traditional features provide insignificant results. The authors propose a set of new features with enhanced performance. However, this study cannot be generalized because the comprehensive traditional features from previous literature are not used. Furthermore, the study utilizes a specific dataset with results valid only for the considered dataset. The following Table 2 compares the different features used in training the learning models to identify influential users in OSNs.

3.1.6 Other Methods. To overcome the drawbacks from techniques in identifying influential users, researchers have introduced either hybrid methods, which are proposed based on more than one technique engineered in a synergistic manner to improve methods’ effectiveness in identifying influential users, or concepts for a new technique in identifying influential users. This section discusses the recent developments in these alternative methods. k-core can identify a single influential spreader effectively by ranking, which is proven in actual big network data (Kitsak et al., 2010; Pei et al., 2014b; Pei and Makse, 2013). However, k-core ranking has severe intersection of seed influence ranges; thus, k-core performs poorly when identifying multiple influential users (Sen Pei et al., 2017). Most previous studies have identified individual influential spreaders as an isolated spreader. However, Teng et al. (2016) have identified multiple spreaders from real OSN data using the collective influence (CI) method from another study (Morone and Makse, 2015a). The CI method examines the collective influence of multiple spreaders. The set of spreaders identified by this method performed better than the commonly used algorithms, such as PageRank and k- core, in terms of the identification of multiple spreaders and high degree of information propagation. This method is effective in identifying multiple influential users in OSNs. Morone et al. (2016) have implemented the CI algorithm from another study (Morone and Makse, 2015a) in approximate linear time when nodes are deleted one by one. Furthermore, they have introduced two extensions of the original CI algorithm to optimize its performance, which

Page 9 of 34 exhibited small performance improvements over the original CI algorithm. However, these versions suffer from high computation time, e.g., 푂(푁2) compared with 푂(푁 푙표푔 푁) of the original CI algorithm, thereby limiting their application in big social networks, such as OSNs (Morone et al., 2016). Yang et al. (2016) have constructed a method using the concepts of closeness and betweenness centralities. This method modifies the original betweenness centrality by computing the degree of closeness between nodes and then creating link weights with this closeness calculation. However, this method is evaluated in small networks and was untested in real large networks, which is the main drawback of the underlying methods (closeness and betweenness centralities). Consequently, closeness and betweenness centralities require high-computation time when applied to large OSNs (which contain millions of users). Sun et al. (2016) have established a model using a set of proposed multi-features as input to the Bayesian network and PageRank. Features, such as tweet content, retweet count, follower and following counts, comments, tweet count, user authentication, professional background, and user interest, are used to construct this model. However, this model inherits the limitation of machine learning and PageRank methods in identifying influential users. Through the dense group generating algorithm, Ma et al. (Ma et al., 2017) search for the initial spreader first by finding the dense group and simultaneously selected the initial spreader from each dense group. Their proposed method is validated through the artificial susceptible-infectious (SI) model to demonstrate effectivity and efficiency. However, the disease spread of the SI model differed from the information spread process in the real world, such as in OSNs. Therefore, the proposed method’s effectivity may differ when applied on real social networks (Centola and Macy, 2007; Singh et al., 2013). A model based on real dynamic information spread in real OSNs should be investigated to confirm the effectivity of the proposed method. To overcome the limitation of k-core in identifying influential users, Sheikhahmadi and Nematbakhsh (2017) propose a hybrid technique where first super-spreaders nodes are identified using k-core with taken into consideration the degree, and friends’ diversity of the nodes; they then propose a method to reduce the overlap degree between identified nodes in order to effectively identify multiple spreaders. Zhuang et al. (2017) have introduced a method called SIRank, which calculates the influence diffusion of users in OSNs by considering user features, such as retweet time intervals and position of users in information cascades. To detect influential spreaders in the network, the method uses the features of user cascade influence and user-related influence in random walk similar to PageRank’s original concept. The proposed method shows improvement from algorithms, such as PageRank, TunkRank, RetweetRank, and degree; however, this method is based on the main concept of PageRank that caused similar drawbacks of PageRank as mentioned in 3.2.3. Xia et al. (2016) have first applied low-computational methods, such as degree or k-core, and then have eliminated all low influence users to reduce network size. The reduced network size enables the global measures that are applicable to only small networks. However, such approaches are combined separately and are not integrated as a one- stage approach, which causes the main method to still encounter drawbacks of these methods in every stage. The drawbacks of these methods are clearly not completely eliminated; therefore, such framework must be investigated to prove its effectivity in large-scale OSN networks. A trusted network is constructed by Zhu (2013) in the first stage to optimize the identification of influential users; in the second stage, time factor is considered in model construction to simulate the information dissemination process. Moreover, a dynamic algorithm is proposed in the study to identify the influential users in a network. However, the time complexity of the proposed method should be investigated when applied to large-scale networks. Moreover, the trusted user network is assumed complete to effectively implement the proposed method. However, obtaining entire networks is challenging due to the privacy settings of OSNs. Tan et al. (2016) have developed a network centrality method and used a graph convexity to describe the level of user influence before proposing a message passing method to identify influential spreaders who play major roles in spreading information in a network. However, a challenge is imposed by OSNs’ characteristics (e.g., incomplete graph network and evolving network); therefore, this method should be investigated under such features. Moreover, this approach is established based on the concept of the SI spreading model. Consequently, the methods for identifying influential spreaders in OSNs based on such artificial model do not continuously reflect the effectivity of the proposed method because these models and the real dynamic information spread in OSNs are different (Centola and Macy, 2007; Singh et al., 2013).

Page 10 of 34

Current studies have reported that critical parts of spectral properties are represented by a non-backtracking (NB) matrix, which define the properties of the bond percolation method in complex networks (Radicchi, 2011; Krzakala et al., 2013). By merging these well-known facts, researchers have proposed the NB centrality (Radicchi and Castellano, 2016) as the quantity of choice to identify influential users in chaotic topologies.

3.2 Issues and Shortcomings of Influential Users Algorithms This section discusses the drawbacks of the influence dissemination algorithms. The heuristic methods (degree, eigenvector centrality, closeness centrality, PageRank, and k-core) do not enhance the global function of influence as these methods identify the influential users individually and ignore the role of collective influence. Therefore, the results of the heuristic methods lack assurance in their results (Morone and Makse, 2015b). A recent study (Morone and Makse, 2015b) shows that the set of vital influential users is considerably smaller than the number of influential users detected by heuristic methods. Remarkably, previously ignored numerous weakly connected nodes appear among vital influential users. Ignored nodes are identified as low-degree nodes according to the structural analysis of the network surrounded by hierarchical coronas of hubs. Ignored nodes are exposed only through the optimal collective exchange of all influential users in the network (Morone and Makse, 2015b). The subsequent section discusses the drawbacks and open issues of applying the techniques to OSNs for identifying influential spreaders or users.

Page 11 of 34

Table 2. Comparison of different features used in training the learning model to identify influential users in OSNs

Propagation Network User Quality Topic Features Activity Features Features Information Features Features Features Study Reference

F 1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F 14 F15 F 16 F 17 F18 F19 F20 F21 F22 F23 (Mei et al., 2015)   − − − − − −    −   − − − − −   − −

(Liu et al., 2014)  − − − − − − −   − − −  − − − − −   −  (Bigonha et al., 2012)    −       − − −   −  − −  − − 

(Xiao et al., 2013)     − − − −   − − − − − − −    − − − (Chai et al., 2013)   − −  − − −   − −   − − − − −     (Cossu et al., 2015)               −  −  −  −  

F1 Number of reposts (e.g., sharing, retweets) by others F13 Number of (likes or favorites) F2 Number of tags (e.g., tagging or mentioning others) F14 Number of (comments or replies) F3 Hashtag (#) F15 Content quality F4 Shared URL links F16 Text feature (TF × IDF or bag or words) F5 In-degree F17 Polarity features (positive, negative, or neutral) F6 Betweenness centrality F18 Users’ topic similarity features F7 Closeness centrality F19 Topic distribution F8 Eigenvector centrality F20 Number of posts (e.g., status and tweets) F9 Number of friends (or followers) in the list F21 Number of others’ posts reposted

F10 Number of followers F22 Number of others’ posts acknowledged (e.g., like or favorite other posts) F11 Account information (e.g., official, verified, and account F23 Number of interactions with others (e.g., number of age) comments and replies to other posts)

F12 User description in the profile

Page 12 of 34

Degree centrality: The most connected nodes are generally considered as authority for the largest information dissemination, which are viewed as most influential nodes (Albert et al., 2000). Moreover, a research showed that a simple weighted in-degree outperforms complex measurements, such as PageRank (Tang and Yang, 2012). However, this fact is not always common, the degree measure alone cannot accurately reveal the influence of users, and users with a high degree are not necessarily considered as influential (Cha et al., 2010). A limitation of the degree centrality method is that hubs (i.e. the vertices with many outgoing edges) may form tightly-knit groups called “rich-clubs” (Colizza et al., 2006). The strategies based on the degree method are extremely be biased toward rich-club hubs (Morone and Makse, 2015b). However, sensible circumstances occur when influential users are not the highly connected users (Kitsak et al., 2010). Furthermore, Pei et al. (2014a) reported invalidity of the degree measure for identifying influential users. The degree measure only reflects the quantity of a user’s adjacent neighbors. A weakly connected hub located in the core of the network may generate an important effect that induces information diffusion through a huge portion of users.

Closeness centrality: The closeness centrality rates are high for individuals closer to the center of local clusters. Therefore, the closeness measure over-allocates influential users next to each other (Morone and Makse, 2015b). Moreover, closeness centrality is unsuitable for large-scale OSNs due to its high computational complexity (Chen et al., 2012a).

Betweenness centrality: Betweenness centrality is a popular technique in the complex network analysis, particularly in community detection. However, the technique suffers high computational time. The best algorithm for betweenness centrality requires a computational time equivalent to 푂(푁푀) for unweighted networks with 푁 nodes and 푀 edges (Brandes, 2001). Therefore, betweenness centrality is impractical for large OSNs.

Katz centrality: Katz centrality is constructed based on a good mathematical assumption for analyzing a network and identifying important nodes. However, Katz has high computational complexity of centrality that limits its application to large networks (Lü et al., 2016). Moreover, Katz centrality produces inadequate ranking results in networks with a diverse out-degree distribution (Liao et al., 2017).

Eigenvector centrality: Particularly, the eigenvector method is inefficient in scale-free networks. The method assigns weight to a few nodes (hub), and the remaining majority receives considerably small weights, which causes the inaccurate ranking of most nodes (Morone and Makse, 2015b). However, eigenvector centrality may result in inaccurate ranking when applied to the degree distribution for networks, such as the Internet (Barabási and Albert, 1999), e-mail (Ebel et al., 2002), and Facebook (Catanese et al., 2012), which are scale-free networks. Therefore, the eigenvector centrality inaccurately ranks the abovementioned networks. PageRank-like algorithms: PageRank and PageRank-like measurements are network-based diffusion algorithms that are computed through random walks on network graphs. These algorithm’s desirability is attributed to its known effectiveness in ranking web pages (Ghoshal and Barabási, 2011). The shortcoming of PageRank is attributed to the dissemination procedure of the node’s score to connected nodes. Specifically, a user with a high PageRank provides a significantly high score to weakly connected neighbors. Moreover, the complete OSN structure is unavailable due to the inherent limitations of OSNs caused by API restrictions and user privacy. Therefore, the PageRank algorithm is unreliable in measuring OSNs. Ghoshal and Barabási (2011) have stated that the ranking results given by PageRank are sensitive to changes in the network topology, which makes PageRank unreliable in measuring dynamic and incomplete networks. Similarly, Pei et al. (2014a) show that PageRank is unreliable in identifying influential users in OSNs. PageRank is established on the assumption of random information diffusion in the network. Nevertheless, the process of information dissemination is not totally based on random walks (Goel et al., 2012). Therefore, a considerable divergence exists between accurate outcomes and PageRank results. From above studies, it can be concluded that; PageRank is sensitive to topology changes in the network, rendering it unreliable in noisy or incomplete networks (Ghoshal and Barabási, 2011). Consequently, PageRank is unsuitable for OSNs because most networks are incomplete, and the PageRank concept fails in ranking important users in growing networks (Mariani et al., 2015).

Page 13 of 34

K-core (k-shell) algorithm: Pei et al. (2014a) have conducted a research with large datasets from LiveJournal, Twitter, and Facebook and have reported that the most influential users are located in the networks’ core. The k-core algorithm not only calculates influential spreaders more effectively than other approaches (PageRank and degree) but also distinguishes influential users more precisely. The original k-core method is proposed to deal with unweighted networks. However, in reality, most networks are weighted, and networks’ weight defines the important properties of underlying connections. Researchers in (Garas et al., 2012; Wei et al., 2015) have attempted to eliminate the limitation caused by weights in k-core algorithms and proposed the weighted k-core technique. The proposed link weights in the weighted k-core technique are only using the nodes’ degree. Consequently, a node connected to high degree nodes obtains a high weight. Moreover, the weighted k-core technique only considers the degrees of nodes and ignores interactions’ quantity in networks. However, in the OSN context, the degree of nodes infrequently reflects user influence (Cha et al., 2010). Consequently, Al-garadi et al. (2017) have developed a k-core based method that is weighted based on users’ interaction in OSNs. This approach also reduces the impact of considering all links equally when the original k-core weighs the link between users using their amount of the interaction. Various numerical simulations have been recently implemented (Liu et al., 2015a) to understand the relationship between influential nodes identified by the k-core algorithm and the influence of identified nodes in real networks. The implementations spurred the realization that not all nodes with high shell numbers are highly influential (Liu et al., 2015a). The realization helped to improve the k-core algorithm by removing repetitive connections causing core- like group issues (Liu et al., 2015b). Similarly, the k-shell ranking is not optimal to maximize the collective influence of multiple influential users (Morone and Makse, 2015b). Collective influence comes from interactions between influential users via the overlap of the spheres of influence. The overlap interactions of influential users mostly remain nonexistent within the core when receiving influence from the most ranked influential users (Morone and Makse, 2015b). Therefore, even when the best individually ranked users are located in the k-core, the collective influence is still determined by the full set of interactions.

Machine learning algorithms: Supervised learning methods have the drawback of being dependent on training data. Obtaining training data (i.e., labeled samples) is difficult, expensive, and time-consuming. A robust learning approach for detecting influential users requires a large amount of the labeled training data to effectively learn different classes of models. To alleviate this drawback, semi-supervised approaches are used with only a small amount of the labeled data. However, obtaining efficient knowledge from small samples is mostly inaccurate (Bouguessa, 2011). Another drawback of the machine learning-based researches is the absence of evidence that supports the validity of recommended models (Räbiger and Spiliopoulou, 2015). The success of the machine learning models in measuring user influence depends on several factors (Domingos, 2012). The selection of a set of the best features with a high discriminative power between influential and non-influential users is a complex task. Most machine learning models focus on feature selection (Domingos, 2012). Feature selection algorithms can significantly aid in determining the best features to train models. However, influential users form a small set compared with normal users, which may imbalance class distribution in datasets. An imbalanced class distribution can stop machine learning models from appropriately categorizing the instances of a smaller class. Machine learning methods have great potential to work with complex network methods in a synergistic manner to achieve an effective hybrid method for identifying influential users in large networks (Zanin et al., 2016). However, in-depth investigation and further research are required to enhance the integration between these two approaches. 3.3 Comparison between Influential User Algorithms This section compares the abovementioned algorithms with their advantages, disadvantages, and time complexity as bases and in accordance to state-of-the-art influential user identification algorithms as shown in Table 3.

Page 14 of 34

Table 3. Comparison between influential user identification algorithms

Algorithms Advantages Disadvantages Time complexity Degree  Simple and fast.  Degree centrality only measures the local 푂(푀),where, 푀 represents centrality features of users. number of edges.

Closeness  Performs on the side of  Closeness centrality suffers high 푂(푁3), where, 푁 represents centrality network nodes and computational time. Therefore, closeness number of nodes. results in a global centrality is not applicable for most large impact. OSNs. Betweenness  Reveals shortest paths  Betweenness suffers high computational 푂(푁3) and the best optimized centrality between all node pairs. time. Therefore, betweenness is not betweenness algorithm require  Global measure. applicable to most large OSNs. the computational time 푂(푁푀) for unweighted networks with 푁 nodes and 푀 edges (Brandes, 2001). Katz centrality  Global measure.  Obtains inadequate outcomes in networks 푂(푁3)  It Has a well-proposed with diverse out-degree distribution (Liao mathematical structure et al., 2017). (Lü et al., 2016).  Katz centrality suffers high computational time. Therefore, katz centrality is not applicable to most large OSNs. Eigenvector  Eigenvector centrality is  This centrality may result in inaccurate 푂(푁3) centrality simple and effective for measurement when used for OSNs, as the a network where degree degree distribution for most OSNs, such as is biased in a manner Facebook are scale-free networks that a node is important (Catanese et al., 2012). when linked to other important nodes. PR algorithm  Simple assumptions.  For random networks, the measurements 푂(푁 + 푀)  Direct implementation. given by PageRank are responsive to  Comparatively low perturbations in network topology, computational rendering it inaccurate for incomplete or complexity. noisy networks (Ghoshal and Barabási,  Global measure. 2011).  This algorithm is unreliable in detecting influential users in OSNs (Pei et al., 2014a).  Furthermore, it is on the assumption of random information diffusion in a network. Nevertheless, in actual networks, information diffusion processes are not totally based on random walks (Goel et al., 2012). K-core  Simple and fast.  Designed for unweighted network. 푂(푁) algorithms  Global measure.  The output of k-core has two sets of core  In OSNs with nodes, namely, the true influential nodes, incomplete data, the k- whose shell level correctly reflects their core algorithm influence in real networks, and nodes with calculates user high shell levels but are not influential influence more spreaders (i.e., core-like group) (Liu et al., efficiently than other 2015a). approaches (Pei et al., 2014a). Learning  Learning algorithms can  Learning algorithms require sufficient Learning algorithms vary algorithms predict influential users training data. based on the type of machine based in user  Learning algorithms encounter difficulties learning algorithms and the characteristics. in extracting global features to train the size of training data. These learning model; therefore, in most studies algorithms also requires the learning model is trained based on offline efforts. local users’ local features.

Page 15 of 34

4 PERFORMANCE EVALUATION OF THE INFLUENTIAL USERS’ IDENTIFICATION TECHNIQUES IN OSNS This section discusses the performance evaluation of the influential user identification techniques. Influence diffusion can be modeled in probabilistic frameworks (Cosley et al., 2010). However, the critical characteristics of OSNs distinguish them from small social networks. These characteristics, such as user privacy, high dynamism, and large- scale network, create difficulty in designing evaluation model that can effectively illustrate the information spread in the OSN context. Most researchers that have analyzed the spread of information in a social network structure report that information dissemination has a complete equivalence with the diffusion of infectious diseases (Leskovec et al., 2007a; Watts et al., 2007). Such reports have triggered the extensive implementations of disease spreading models such as the susceptible-infectious-susceptible (SIS) models and susceptible-infectious-recovered (SIR) for evaluating information diffusion (Pei et al., 2015). The aforementioned models are designed based on the simple understanding of human behavior, which may not be an illustrative of the actual dynamics of information dissemination; thus, it may provide misleading output. (Pei et al., 2015; Pei et al., 2014a). For example, studies (Pei et al., 2015; Pei et al., 2014a) have demonstrated that applying SIR and SIS evaluation models in actual networks, the k-shell shows better results than the classical centrality measures. However, Borge-Holthoefer and Moreno (2012) have employed the rumor dynamics evaluation model and suggest that k-shell is ineffective because influential users were not identified accurately. Studies (Centola and Macy, 2007; Singh et al., 2013)have also reported that the evaluation grounded on the disease spread models is impractical. Moreover, diseases and information spread differently (Centola and Macy, 2007; Singh et al., 2013). Therefore, Pei et al. (2014a) utilized the dynamics of information diffusion in actual social networks to remove the dependency of specific models that simulate dynamics. Another study (Ding et al., 2013). introduces coverage to confirm the efficacy of the proposed solution. The coverage considers link structure and time interval to identify the influence spreading capability of a node within a specified period (Ding et al., 2013). Techniques, such as Kendall’s Tau algorithm and Spearman’s rank, are commonly employed to quantify the correlation between the ranking lists obtained by artificial stochastic models or the real dissemination dynamics and the ranking lists obtained by identification algorithms (Liu et al., 2013; Garas et al., 2012). Comparison metrics, such as accuracy, F-measure, precision, and area under curve (AUC) metrics, are used to compare the performances of algorithms using manually labeled data (Xiao et al., 2013; Chai et al., 2013; Cossu et al., 2015) The drawback of the validation approach is its human intelligence requirement to categorize users as influential or non-influential. Thus, the validation scheme is time-consuming and expensive. Furthermore, humans can only judge the influence of a user-based technology on the static information on user features. However, humans cannot determine impact based on an in-depth analysis of the user’s position in the entire network. Consequently, the validation is based on the local rather than the global features of user influence. Several studies have validated their proposed algorithms by comparing the user ranking obtained by different identification algorithms with the OSN characteristics (as standard), such as volume of propagated content, number of replies per post, and user post content volume. The algorithm is considered to be the best if the ranking list is highly correlated with user characteristics (Feng, 2011). The abovementioned validation approach is simple and straightforward. However, the majority of studies have shown that influential users do not always highly correlate with OSN characteristics (Cha et al., 2010; Morone and Makse, 2015b; Räbiger and Spiliopoulou, 2015; Xiao et al., 2013). Therefore, this validation is not always applicable. Furthermore, this validation can be biased to the authors’ preferences in selecting user characteristics to validate the proposed algorithms. The abovementioned performance evaluation models include limitations. Therefore, future studies are required to propose a model that clearly illustrate actual information spread dynamics.

5 OPTIMAL INFLUENTIAL SPREADERS Influence maximization is a process of identifying the set of vital influential nodes, namely “significant seeds,” in a social network. Significant seeds maximize the spread of the influence (Kempe et al., 2003). The influence maximization problem, known as NP-hard (Kempe et al., 2003), was primarily introduced in the context of a viral marketing (Richardson and Domingos, 2002).

Page 16 of 34

Influence maximization has recently received considerable attention from researchers. In the influence maximization problem, a minimum set of influential users or seeds is searched and selected, which can be innovated to extensive disseminate information and effectively generate more adoptions in the entire network. Kempe et al. (2003) have argued that the aim of the spread of influence is monotonous and submodular. Therefore, an acquisitive approach provides a constant-factor approximation for the problem. However, the solution requires a long period to select optimal seeds. Consequently, Leskovec et al. (2007b) have proposed “cost effective lazy forward,” an algorithm 700 times faster than the one proposed by Kempe et al. (2003). Greedy techniques have effectively addressed the influence maximization problem. However, these techniques are limited by the high computational time and unsuitability to large-scale social networks (Li et al., 2015a). Chen et al. (2009) have applied a degree discount heuristic in their work to improve computational time. Degree discount heuristic reduces the degree of a newly selected seed neighbor by one. Despite their increased computational time, greedy methods are more reliable than heuristic-based methods to generate a minimum set of influential seeds. Li et al. (2015a) have introduced a conformity-aware cascade model to approximate the spread of influence in the network. The model leverages the interaction between influence and conformity to determine the influence probabilities of users using the underlying data. State-of-the-art research on influence maximization have implications relative to either the computational time (e.g., greedy approaches (Kempe et al., 2003; Leskovec et al., 2007b; Chen et al., 2009; Wang et al., 2010)), or the quality of the outcome (e.g., heuristic approaches (Chen et al., 2011; Jiang et al., 2011; Kim et al., 2013)). Moreover, only one party maximizes user influence in a specified network (Li et al., 2015b). However, in practice, one or more parties concurrently maximize user influence. Solving the influence maximization problem in a real network for at least two campaigns within the same network is among the most recent issues solved (Li et al., 2015b). Despite the extensive use of heuristic approaches to find influential users (Kitsak et al., 2010; Altarelli et al., 2013; Java et al., 2006; Nguyen and Szymanski, 2013; Weng et al., 2010; Garas et al., 2012; Wei et al., 2015), the identification of a minimal set of optimal nodes, called “optimal influential spreader,” is an unresolved problem in complex networks (Morone and Makse, 2015b). To solve the inherent problem, Morone and Makse (2015b) have plotted it onto optimal percolation in random networks in order to detect the minimum set of influential spreaders. The most influential users are considerably less than that predicted by heuristic centralities. Remarkably, a huge number of previously ignored weakly connected nodes have appeared among vital influential users. Nodes are identified as low-degree nodes according to the structural analysis of a network surrounded by the hierarchical coronas of hubs and exposed only through the optimal collective exchange of all influential users in the network (Morone and Makse, 2015b).

6 TAXONOMY OF THE IDENTIFICATION OF INFLUENTIAL USER RESEARCH IN THE OSN CONTEXT Fig 1 shows the thematic taxonomy of the identification of influential users in the OSNs. Studies that identify influential users in the OSNs are categorized based on five characteristics, namely, identification algorithms, performance evaluation, type of networks, objectives, and metrics attributes. Identification influential spreaders in OSNs context studies are then compared using these characteristics in Table 4.

6.1 Identification Algorithms The OSNs have created massive communication and social interaction among users. Recently, OSNs have attracted millions of the users. Consequently, they have become the large networks containing millions of the nodes and links. Several algorithms, such as the greedy algorithms, are accurate for identifying the influential users in the small networks. However, due to its inherent limitations, greedy algorithms are unsuitable for large networks. The limitations include the inefficiency due to a high computational time (Li et al., 2015a). State-of-the-art algorithms for identifying influential users in the OSN context include degree, closeness centrality, betweenness centrality, Katz centrality, eigenvector centrality, PageRank-like algorithm, k-core algorithm, machine learning algorithm, and other methods. The details of the algorithms are presented in Section 3.1.

Page 17 of 34

Fig 1. Taxonomy of the identification of influential user studies in OSNs

6.2 Performance evaluation The straightforward validation of the effectiveness of influential user identification algorithms is not possible because full diffusion information is unavailable in the OSNs. Technical and privacy issues arranged by users cause this data unavailability. Section 4 discusses the impact of different validation approaches on evaluating the effectiveness of influential user identification algorithms. Validation approaches are classified into the following four categories: 1) Validation through artificial stochastic models: Artificial stochastic models include susceptible-infectious- recovered (SIR), susceptible-infectious-susceptible (SIS) (Pei et al., 2015), rumor dynamics (Borge-Holthoefer and Moreno, 2012), linear threshold, and independent cascade models (AlFalahi et al., 2014). 2) Validation through real spread dynamics: This validation approach tracks actual information diffusion to obtain the ranking list based on the real spread dynamics of information. The ranking list acquired by different identification algorithms is compared with the list of the real spread dynamic of information using evaluation metrics like imprecision function and recognition rate to verify the effectiveness of an identification method with ranking list obtained from the real spread dynamic of information model (Pei et al., 2014a; Ding et al., 2013; Al-garadi et al., 2017). 3) Validation by manual annotations: The results from the different ranking lists obtained by various identification algorithms are compared with the manual annotation ranking lists. The measures, such as accuracy, F-measure,

Page 18 of 34 precision, and AUC metrics are applied to compare the performances of the algorithms with manually labeled data (Xiao et al., 2013; Chai et al., 2013; Cossu et al., 2015). 4) Validation by direct comparison: The user rankings obtained by different identification algorithms are compared with the OSN characteristics, such as in-degree number, volume of propagated content, number of replies per post, and volume of content posted by users. The algorithm is considered the best if the ranking list it generated is highly correlated with user characteristics (Feng, 2011).

6.3 Network Types The explicit and implicit OSN connections induce connection diversity. The different network connections created within OSNs describe various relational connections among users. For example, social networks characterize the social associations among users, such as following relationships in Twitter or friendships in Facebook. Moreover, propagation networks describe the diffusion networks among users, such as retweet or shared networks in Twitter or Facebook, respectively. Therefore, the application of an identification algorithm on different OSNs provides different user rankings according to the type of constructed network. The most common network types with modern features are social, propagation, engagement, and reply networks. Social networks in OSNs describe the social connections among users. For example, on Facebook, if User A is a friend of User B, then a social connection exists between them. A similar kind of connection also applies to Twitter. Propagation networks characterize how information propagates from one user to another. For example, if User A shares or retweets User B’s post, then the post is propagated from Users B to A, thereby creating a propagating or diffusing connection. Interaction networks describe the ability of a user to involve others in a conversation. An interaction network is constructed if User A tags or mentions User B, thereby creating an engaging connection from Users A to B. An interaction network can also be constructed if User A replies to User B’s post.

6.4 Objectives The OSN can be used as a platform for sharing positive or negative social activities. The identification of influential users aims to both accelerate and hinder the propagation of information.

(1) Accelerating the spread of information in OSNs: Identifying and targeting influential users is significant to enhance the spread of specific information within OSNs. This objective has several applications, such as viral marketing (Richardson and Domingos, 2002; Subramani and Rajagopalan, 2003).

(2) Hindering the spread of information in OSNs: Blocking the spread of unwanted contents, such as rumors, viruses, and spams, to influential users is one of the strategies to restrain the spread of unwanted content (Gao et al., 2011a; Cobb, 2017).

6.5 Metric Attributes The metrics extracted for OSNs are important parameters for the identification of influential user techniques. The different metrics used in algorithms generate varied ranking results (Cha et al., 2010). Therefore, understanding the correlation of the metrics with influential users is important. We have classified the metrics extracted from OSNs into network- and content-based metrics. Network based metrics deal with how users are connected with one another and describe the interaction network among network users. These metrics are related to the network structure. For example, the in-degree metric indicates the audience magnitude of a user. Propagation metric describes how information propagates through OSN networks and exhibits the ability of an OSN user to produce and propagate content throughout the network. Engagement metric shows the ability of a user to involve other users in a discussion within the network. For example, the in-degree metrics in Facebook and Twitter are the friend and follower count, respectively. The propagation metrics in Facebook and Twitter are measured through the sharing and retweeting processes, respectively. Engagement metrics are expressed in Facebook and Twitter by tagging and by mentioning other users, respectively. Furthermore, the output of applying the algorithm on OSNs, such as degree centrality, k-core, and PageRank, can be used as a network metric input to

Page 19 of 34 another identification algorithm such as machine learning algorithms (Cossu et al., 2015; Bigonha et al., 2012; Chai et al.).

Content-based metrics deal with the quality of user posts. Content-based metrics focus on the content features that make a post viral. Identifying influential users in OSNs based only on the structural analysis of the user social networks is weak because the relational connections among users do not necessarily convey a strong indication of influence (Weng et al., 2010). However, considering content similarity among users improves the ranking of an influential user. Content metrics, such as content similarity among users and content quality of a post, are combined with network- based metrics to enhance the identification of influential users (Chen et al., 2012b).

7 APPLICATIONS OF INFLUENTIAL SPREADER IDENTIFICATION IN OSNS The diffusion of information is a pervasive process referring to a wide range of phenomena, from the acceptance of innovation and ideas (Valente, 1995) to the success of marketing strategies (Watts et al., 2007), political movements (González-Bailón et al., 2011), and spread of news (Lerman and Ghosh, 2010) and viruses. The analysis of influence patterns is essential in understanding the rapid adoption of specific trends. Moreover, the knowledge about the spread of certain rumors or negative behaviors in OSNs is beneficial for designing effective methods for controlling such adverse practices. Thus, influential user identification holds practical importance, and it has recently attracted the attention of researchers (Pei et al., 2014a; Kitsak et al., 2010). The applications aim to maximize the spread of the products (Weinberg, 2009), scientific messages (Letierce et al., 2010), political movements and recruitment (González-Bailón et al., 2011), and news (Hu et al., 2012). The identification of influential users in information propagation is significant for planning strategies for accelerating information spread. However, OSN platforms also allow the spread of spams, viruses, rumors, negative behaviors, gossips, and other kinds of disinformation to users (Wen et al., 2014). The challenging task is to control the propagation of unwanted contents. Therefore, three types of strategies can be used to restrain unwanted contents in OSNs. The strategies include blocking unwanted contents at influential nodes (Pastor-Satorras and Vespignani, 2002; Zou et al., 2007; Yan et al., 2011) (Nepusz and Vicsek, 2012; Comin and da Fontoura Costa, 2011; Liu et al., 2011), creating awareness and spreading truth (Budak et al., 2011; Gao et al., 2011b), and implementing an optimal approach to restrain disinformation that could integrate the first two approaches (Wen et al., 2014). Influential communicators can be used in numerous real-life applications because new applications are constantly introduced. Consequently, this review considers several well-known applications as shown in Fig 2. However, the applications of influential users are not limited to these practices and can be applied to many other applications.

7.1 Viral Marketing Influence dissemination patterns are useful in understanding the specific reasons behind the rapid adoption of certain innovations. Patterns help marketers and advertisers implement effective campaigns. Viral marketing depends on the assumption that the purchasing decisions of a user are heavily influenced by the suggestions and recommendations of friends in social networks (Wu et al., 2004; Richardson and Domingos, 2002; Domingos, 2005; Leskovec et al., 2007a). However, some users’ opinions carry more weight in influencing a large part of a population than others. The viral marketing in OSNs aims to increase sales by the rapid spread of marketing information at low costs. This objective can be achieved by precisely identifying the most influential spreaders in OSNs (Zhu, 2013; Ho and Dempsey, 2010). However, given a large number of OSN users, identifying the best minimum set of influential users is beneficial, particularly for companies with a limited budget.

Page 20 of 34

Table 4. Comparison of the identification of influential spreader studies in the OSN context

Metrics Attribut Performance Identification Algorithm Network Type es Evaluation

Study

core

like like

Real

Other Social

model

Direct

Degree

Manual

Content

network network network

Network

Artificial

Learning

dynamics

centrality Closeness centrality centrality centrality

spreading

algorithm algorithm algorithm algorithm

PageRank

Interaction

annotations comparison

Eigenvector

Propagation

Betweenness

(Cha et al., 2010)  − − − − − − −      − − − 

(Kim and Han, 2009)  − − − − − − −    − −  − − −

(Catanese et al., 2012)  −  −  − − −  −  − − − − − −

(Tunkelang, 2009) − − − −  − − −  −  − − − − − −

(Yamaguchi et al., 2010) − − − −  − − −  −    − −  −

(Weng et al., 2010) − − − −  − − −    − − − −  −

(Lü et al., 2011) − − − −  − − −  −  − −  − − −

(Li et al., 2014) − − − −  − − −  −  − −  − − −

(Chen et al., 2012b) − − − −  − − −    − − − − − 

(Jabeur et al., 2012) − − − −  − − −  − −  − − −  − (Ding et al., 2013) − − − −  − − −  − −  − −  − −

(Silva et al., 2013) − − − −  −  −   − − − − −  − (Pei et al., 2014a)

Page 21 of 34

 − − −   − −  − − −  −  − −

(Feng, 2011) − − − − −  − −  −  − − − − − 

(Mei et al., 2015) − − − − − −  −   − − − − − − 

(Liu et al., 2014) − − − − − −  −   − − − − −  −

(Bigonha et al., 2012)     − −  −      − −  −

(Xiao et al., 2013) − − − − − −  −   − − − − −  −

(Chai et al.)  − − −  −  −    − − − −  −

(Cossu et al., 2015)     − −  −    − − − −  −

(Yang et al., 2016) − − − − − − −   −  − −  − − −

(Sun et al., 2016) − − − − − − −   −  − − − − − 

(Ma et al., 2017) − − − − − − −   −  − −  − − −

(Teng et al., 2016) − − − − − − −   −     − −

(Sheikhahmadi and Nematbakhsh, 2017) − − − − − − −   −  − −  − − −

(Zhuang et al., 2017) − − − − − − −   −  −  − − −

(Xia et al., 2016) − − − − − − − − −   −  −  − −

(Zhu, 2013) − − − − − − − −   −  −  − − −

(Tan et al., 2016) − − − − − − − −   −  −  − − −

Page 22 of 34

7.2 Political Movements Politicians in modern democracies worldwide have eagerly adopted OSN platforms to engage with users, specifically during election campaigns (Stieglitz and Dang-Xuan, 2013; Hong and Nadler, 2011). Social media automates the social signals used to encourage widespread behavior (Aral, 2012). Bond et al. (2012) have analyzed whether political behavior could disseminate over an OSN by posting a post that encourages users to vote. Results have shown that the statement influenced the actual voting behavior of millions of people. Moreover, the posts have affected not only the receiving users but also their network of friends. The consequence of social propagation in actual voting is larger than the direct effect of the posts themselves.

Fig 2. Taxonomy of influential user identification applications in the OSN context

7.3 News Spreading The OSNs have become significant channels for broadcasting and collecting news (Kwak et al., 2010). Cha et al. (2012) have discussed how a news item is announced and disseminated on OSNs. Influential news broadcasters are grouped in three, namely, individuals associated with the media, grassroots or ordinary users, and celebrities (Cha et al., 2012). Official media sources reach the majority of OSN (Twitter) users and spread most headlines daily. Opinion leaders and celebrities can reach users distant from the core of the network.

7.4 Health Applications Online social media platforms have also become substantial sources of health data. The OSN website users discuss numerous common interest topics, including health issues. For instance, Twitter is used by patients and doctors to share their personal feelings and experiences (Zhao et al., 2014a). Furthermore, OSNs comprise several health communities that aim to address the concerns of patients or users regarding various diseases. Zhao et al. (2014a) have found that influential users in health-related OSNs for cancer survivors provide social support for dealing with the disease and ameliorating the quality of life, which usually involves. Studies (Abbas et al.; Zhao et al., 2014b; Khan et

Page 23 of 34 al., 2016b) have tried to determine the most influential users and experts in online health social networks. The studies are significant in terms of proposing measures for creating an active, supportive, and sustainable online health community. Khan et al. (2016b) have suggested that separating the bloggers from health community expert is necessary for influential expert identification in online social networks. The identification of influential users in an online social health community provides an opportunity to appreciate and support the contributions to the health community and to spread health awareness (Zhao et al., 2014b).

7.5 Spread of Opinions Posts from influential users with influential friends would be extremely acknowledged and expanded by their followers and may rapidly spread in the network through information dissemination (Aral and Walker, 2012). Opinion formation is the result of information dissemination among OSN users through the communication and discussion of opinions, views, and beliefs about specific topics. Therefore, users are influenced by individual influential users or a group of influential users that present and create a public opinion (Zhang et al., 2016; Watts and Dodds, 2007). Influential users gradually used in OSNs to express views on social issues, breaking news, social events, and political views (Zhang et al., 2016).

7.6 Brand Analysis The active, abundant, and real-time interaction facilitated by OSNs considerably innovates the concept of brand management. This change can be considered significant because of its substantial effect on the performance of specific brands (Gensler et al., 2013). Identifying influential users who post about specific brands along with competing brands has become a significant approach to advertising promotions. The posts of influential users in the entire network and about specific topics (e.g., technology) can provide companies with feedback on their brands and can also aid establish brand popularity due to the dynamic and standard spread of information from top (influential users) to regular users (Zhang et al., 2016). Influential users can also support brands by extensively promoting the positive features of a brand. For example, study by Francalanci and Hussain. (2015) shows that influential users can be used by tourism agencies to identify popular tourist destinations, which can effectively identify the next destinations for their advertising campaigns (Francalanci and Hussain, 2015).

7.7 Innovation Dissemination An increasing number of social system processes are strongly influenced by a large number of individual links in a social network (Goldenberg et al., 2009). Similarly, products, ideas, or innovation implementation can be disseminated progressively through OSNs. The implementation of an innovation by users increases proportionally when their friends implement it within OSNs (Kreindler and Young, 2014; Zhang et al., 2016). Encouraging the acceptance of a specific innovation has become an important research area because technological innovations have become an essential part of users’ lifestyle (Kulviwat et al., 2009). Kulviwat et al. (2009) have reported that social influence positively affects user intention to implement an innovation (Kulviwat et al., 2009). Influential users may have significant traits, such as convincing experts, and have numerous social links; and they are also first to adopt innovations due to their numerous social connections (Goldenberg et al., 2009). Consequently, identifying influential users can benefit the establishment of an innovation’s diffusion process.

7.8 Corporate Communication OSN users have an accumulative influence on the business reputation of organizations. Influential users in OSNs can spread positive or negative posts or opinions on services or products to other OSN users. This process can significantly affect the reputation of an organization because a post within an OSN can spread to numerous users. Organizations should recognize influential users with significant social networking impact in order to respond effectively and efficiently to negative publicity or client attacks (Vollenbroek et al., 2014). Furthermore, influential users can provide an effective communication strategy, which can positively promote products or services (Jothi et al., 2011).

7.9 Spread of Natural Disaster Situation Awareness

Page 24 of 34

OSN platforms are abundant and real-time information sources relative to real-world events, especially during mass crises. Valuable information from OSNs can provide significant insights into time-critical situations for the effective and timely analysis of the hazards of current emergencies and activities (Yin et al., 2012). OSNs have been involved in disasters, such as the earthquakes in Japan (Sakaki et al., 2010) and Haiti (Yates and Paquette, 2011). OSNs ultimately serves as an advantageous knowledge management platform in responding to natural disasters (Dufty, 2012). Individuals can use OSN sites during or after emergencies to communicate internationally. Numerous crisis management scholars have recommended the utilization of OSNs to support the construction of a flexible community from disasters (Dufty, 2012; Gao et al., 2011c). Influential users can play important roles in supportively transmitting and spreading awareness to affected communities through OSN platforms because of their numerous links that largely accelerate the spread of information.

7.9 Rumor Restraint Rumors cause significant social damage. Rumors have received considerable research attention in the literature on OSNs, in which recent studies have demonstrated their rapid spread in the OSN context (Doerr et al., 2012). The majority of studies have analyzed the spread of rumors based on the network structure and the mechanism involved in the people’s belief in rumors (Liu and Chen, 2011; Cheng et al., 2013). The approaches employed to reduce the spread of rumors in OSNs include hindering rumors at the maximum number of users (Cheng et al., 2013), clarifying rumors by spreading truth (Budak et al., 2011), and integrating the first two approaches (Wen et al., 2014).

7.10 Negative Behavior Restraint Furthermore, OSN platforms affect the spread of negative behaviors, such as suicide-related behaviors (Luxton et al., 2012). The spread of the tweets “want to die” and “want to commit suicide” is significantly related to suicidal intent and behavior (Sueki, 2015). Furthermore, the spread of negative behaviors (i.e., cyber bullying) in OSNs recently intensified (Mishna et al., 2010). Therefore, increased attention is vital to comprehend and decrease cyber bullying within the online community.

7.11 Malware Restraint Malwares (e.g., viruses, Trojans, worms, and backdoor software) disrupt computers and steal users’ private information. Malwares can be transferred by the features of information propagation in OSNs (Zhu et al., 2014). In particular, once an OSN user uploads and spreads malware to all of his/her friends within the OSN, most users including their friends’ friends are infected by malware. The process extensively propagates malwares within OSNs. Researchers are currently exploring malware propagation and immunization strategies in the OSNs. (Gao et al., 2011a; Strogatz, 2001). Two major issues are generally concerned with malware propagation, namely, how to accurately model the malware propagation process in complex networks and how to restrain malware propagation efficiently. The identification of influential users is useful for proposing immunization strategies based on network user influence. The strategy selects influential users in the entire network, and immunization is subsequently applied to the selected users to achieve the maximal effect of malware containment at minimal costs (Yang et al., 2015). Moreover, user influence measurements are also employed to discriminate normal users from spammers in OSNs (Chen et al., 2013).

8 OPEN ISSUES AND FUTURE DIRECTIONS This section discusses open research issues that can be regarded as future research directions. We propose a taxonomy of the issues in the state-of-the-art, influential user identification strategies in Fig 3. Fig 3 categorizes the different issues of influential user identification in the OSN context into: (1) network-related issues, (2) efficiency of identification algorithm-related issues, (3) validation-related issues, (4) understanding the role of influential users in OSNs, and (5) user privacy-related issues.

8.1 Network-related Issues

Page 25 of 34

Network-related issues are concern on the structure of OSNs. This section discusses issues related to data collection, user connections, and nature of OSNs. Network-related issues are further categorized into three issues: (1) network data availability-, (2) connection diversity-, and (3) network evolution-related issues.

Fig 3. Taxonomy of the identification of influential spreader issues in the OSN context

8.1.1 Network data availability Previous studies have mostly analyzed OSN graphs by solely using partial network data. Data acquisition from the entire network is difficult (Chau et al., 2007) and the ranking algorithms results can be effected using such partial networks (Khan et al., 2016a). Most OSNs allow users to customize their profiles, privacy settings, and pages with the high flexibility. Therefore, developing crawlers that can efficiently handle dynamic complex networks is also difficult (Ye et al., 2010). Previous studies have failed to explain how the unavailability of the complete network data affected observations and results (Ye et al., 2010). We have categorized four issues for the effectiveness of the data-collecting crawlers in OSNs as follows: (1) crawler efficiency that can be measured by how rapidly the node and links are visited; (2) bias of several crawling algorithms (i.e., breadth-first-search crawling algorithms) toward large-scale networks(Gjoka et al., 2010); (3) influence of black holes on the crawling process (Ye et al., 2010); (4) distinct properties of OSNs despite their provision of similar services. The crawler that should be developed for future studies must consider these four issues. 8.1.2 Connection Diversity As discussed in Section 2.4.2, explicit and implicit OSN connections induce connection diversity. Furthermore, relationship strength is one of the most important factors affecting information dissemination and consequently, influence level (Bakshy et al., 2012). Relationship strength widely diverges, ranging from strong (i.e., best friend) to weak ties (i.e., acquaintances) (Gilbert and Karahalios, 2009). However, the lack of knowledge on the link strength among users in OSNs can result in networks with diverse link strengths (e.g., best friends and acquaintances are mixed) (Xiang et al., 2010). Therefore, the binary relationship (a relationship that describes a relationship that exists without considering strength) will generate a dubious relationship in information representation, which results in deceptive identification results. Several studies have examined the relationship strength models and the connection diversity for information diffusion in OSNs (Xiang et al., 2010; Romero et al., 2011; Gilbert and Karahalios, 2009; Bakshy et al., 2012). However, the manner in which strength and connection diversity affect user influence measurement still requires further investigation. In network theory, the nodes are linked by a single type of static link that describes the relationship between the nodes. However, a single edge is hypothesized to simplify network complexity in many circumstances. Ignoring the reality of multiple relationships among users or combining multiple different connections to a single weighted network changes the topological and dynamic properties of the entire system (Gomez et al., 2013; De Domenico et al., 2015; Al-Garadi et al., 2016). Moreover, the importance of nodes also changes with respect to the entire structure (Halu et al., 2013; Battiston et al., 2014). Consequently, the hypothesis may induce an incorrect identification of the most important nodes (De Domenico et al., 2015). Therefore, future studies should consider the multiple relationships among users for accurate identification. 8.1.3 Network Evolution Network anatomy is important in network analysis. Network structure affects the functionalities of user identification techniques (Strogatz, 2001). However, one of the most inherent difficulties in understanding network structure is network evolution. OSNs are dynamic and evolving. OSN users tend to build an online communication network based on several factors, such as mutual acquaintances, proximity, and common interests (Garg et al., 2009). The analysis

Page 26 of 34 of the aforementioned factors in an evolving OSN demonstrated that preferential attachment can capture network evolution, and the effects of the factors varied based on the node age (i.e., user account age) (Garg et al., 2009). Investigating network evolution helps to predict the spread of information in advance. (Chen et al., 2014). However, how OSNs evolve, what factors are responsible for this evolution, and how they can affect the user’s influence measurement should be understood comprehensively.

8.2 Effectiveness of Identification Algorithm-Related Issues The effectiveness of an applied algorithm is critical for the success of an algorithm’s applicability in real-world networks (Saito et al., 2012). As deliberated in previous sections, existing studies have aimed to identify and targeted influential users who suffered from either computational time (e.g., greedy approaches (Kempe et al., 2003; Leskovec et al., 2007b; Chen et al., 2009; Wang et al., 2010)) or result quality (e.g., heuristic approaches (Chen et al., 2011; Jiang et al., 2011; Kim et al., 2013)). Therefore, measurement algorithms should be selected based on application requirements. For example, a heuristic algorithm is more suitable than a greedy algorithm if the application requires a rapid real-time identification of influential users. Numerous studies have faced a number of issues related to the efficiency of the measures used in influential user identification techniques. Starting with the processing of a large amount of unstructured and incomplete network data subsequently, requires efficient algorithms. Furthermore, previous studies have utilized incomplete network data; thus, number of efficiency issues have emerged, which proved that the efficiency of their approaches would be difficult for researchers (Howison and Wiggins, 2011). Moreover, given the various characteristics of OSNs and the data collection limitations, a different source of bias exists in the identification of influential users, such as selection bias and bias through homophily or assortativity in networks (Aral and Walker, 2012). Therefore, the algorithms from social network analysis or traditional web pages must be optimized to be accurately applied to the OSN context (Mislove et al., 2007). Consequently, an additional investigation is required to overcome the abovementioned issues and accomplish an enhanced perception of the research subject.

8.3 Validation-related Issues Most influential user identification algorithms have been validated through information spread models and not by analyzing the dynamics of real information spread. The models, such as SIR (Kitsak et al., 2010), (Pei and Makse, 2013), SIS (Hethcote, 2000), rumor spreading models (Borge-Holthoefer and Moreno, 2012), and random walks for PageRank (Katsimpras et al., 2015) simulated information dissemination. The aforementioned models fails to generate an accurate diffusion pattern (Pei et al., 2015; Pei et al., 2014a). Studies (Centola and Macy, 2007; Singh et al., 2013) have tracked the actual dissemination processes and demonstrated that the spreads of diseases and information differed. A recent research (Pei et al., 2015) has reported three possible factors affecting the actual contagion of information, namely, human behavior (Muchnik et al., 2013; Iribarren and Moro, 2009), homophily (Aral et al., 2013), and social reinforcement (Iribarren and Moro, 2009). Subsequently, examining human behaviors related to information diffusion in OSNs is vital for many applications. Further investigation is required to enhance the understanding on how the abovementioned factors can affect the information diffusion results in OSNs.

8.4 Understanding the Role of Influential Users in OSNs Understanding the role of OSN influential users is important for an efficient identification method. The majority of the studies have assumed that targeting the most influential user is a key factor in accelerating the dissemination of information and slowing down the spread of disinformation. Influential users are commonly characterized by the presence of strong connections. However, low-degree users with a significant broker role in the network can act as information disseminators (Morone and Makse, 2015b). Moreover, the spread of influence is derived not by influential users but by a large number of easily influenced individuals (Watts and Dodds, 2007). The claim that a critical mass of influential users drives information dissemination creates an open issue on how to accurately and efficiently measure information dissemination in OSNs (Lawyer, 2015). Therefore, further investigation is required to understand the role of each user in a network, individual user characteristics, and the interplay between weakly connected and influential users in OSNs.

8.5 User Privacy-Related Issues

Page 27 of 34

A privacy issue arises when a user account in an OSN is set as private. If the profile is customized as private, then extracting information from accounts becomes impossible unless the account holder grants permission. The visibility of a user profile is necessary and important in OSNs to introduce users to a new set of users. However, changing privacy setting to public may cause attacks, such as identity slander and spamming (Hogben, 2007). Private users act as black holes for crawlers (Ye et al., 2010), thereby creating incomplete network data that yields misleading results in identifying influential users. Moreover, analyzing the personal data of users may yield extremely personal information. Hence, analyzing data while preserving privacy is challenging.

9 CONCLUSION The OSN popularity has provided a unique opportunity for studying and understanding the social interaction and communication among its users. We investigated the state-of-the-art influential user identification algorithms in OSNs. The current validation approaches employed for evaluating the performances of different influential users’ identification algorithms are also reviewed. Furthermore, we presented taxonomy of an influential user identification in this paper and highlighted numerous research issues involved in the influential user identification in OSNs. Future directions in the field may evolve in search for solutions to address the issues indicated in this study.

Acknowledgement: We would like to thank the University of Malaya UMRG (RP028D-14AET) for funding this research.

References

ABBAS, A., KHAN, M. U., ALI, M., KHAN, S. U. & YANG, L. T. A Cloud Based Framework for Identification of Influential Health Experts from Twitter. 15th International Conference on Scalable Computing and Communications (ScalCom). AL-GARADI, M. A., VARATHAN, K. D. & RAVANA, S. D. 2017. Identification of influential spreaders in online social networks using interaction weighted K-core decomposition method. Physica A: Statistical Mechanics and its Applications, 468, 278- 288. AL-GARADI, M. A., VARATHAN, K. D., RAVANA, S. D., AHMED, E. & CHANG, V. 2016. Identifying the influential spreaders in multilayer interactions of online social networks. Journal of Intelligent & Fuzzy Systems, 31, 2721-2735. ALBERT, R., JEONG, H. & BARABÁSI, A.-L. 2000. Error and attack tolerance of complex networks. nature, 406, 378-382. ALFALAHI, K., ATIF, Y. & ABRAHAM, A. 2014. Models of Influence in Online Social Networks. International Journal of Intelligent Systems, 29, 161-183. ALTARELLI, F., BRAUNSTEIN, A., DALL’ASTA, L., WAKELING, J. R. & ZECCHINA, R. 2014. Containing epidemic outbreaks by message- passing techniques. Physical Review X, 4, 021024. ALTARELLI, F., BRAUNSTEIN, A., DALL’ASTA, L. & ZECCHINA, R. 2013. Optimizing spread dynamics on graphs by message passing. Journal of Statistical Mechanics: Theory and Experiment, 2013, P09011. ARAL, S. 2012. Social science: Poked to vote. Nature, 489, 212-214. ARAL, S., MUCHNIK, L. & SUNDARARAJAN, A. 2013. Engineering social contagions: Optimal network seeding in the presence of homophily. Network Science, 1, 125-153. ARAL, S. & WALKER, D. 2012. Identifying influential and susceptible members of social networks. Science, 337, 337-341. BAKSHY, E., HOFMAN, J. M., MASON, W. A. & WATTS, D. J. Everyone's an influencer: quantifying influence on twitter. Proceedings of the fourth ACM international conference on Web search and data mining, 2011. ACM, 65-74. BAKSHY, E., ROSENN, I., MARLOW, C. & ADAMIC, L. The role of social networks in information diffusion. Proceedings of the 21st international conference on World Wide Web, 2012. ACM, 519-528. BARABÁSI, A.-L. & ALBERT, R. 1999. Emergence of scaling in random networks. science, 286, 509-512. BATAGELJ, V. & ZAVERSNIK, M. 2003. An O (m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049. BATTISTON, F., NICOSIA, V. & LATORA, V. 2014. Structural measures for multiplex networks. Physical Review E, 89, 032804. BAVELAS, A. 1950. Communication patterns in task-oriented groups. Journal of the acoustical society of America. BIGONHA, C., CARDOSO, T. N., MORO, M. M., GONÇALVES, M. A. & ALMEIDA, V. A. 2012. Sentiment-based influence detection on Twitter. Journal of the Brazilian Computer Society, 18, 169-183. BLEI, D. M., NG, A. Y. & JORDAN, M. I. 2003. Latent dirichlet allocation. the Journal of machine Learning research, 3, 993-1022. BONACICH, P. 2007. Some unique properties of eigenvector centrality. Social networks, 29, 555-564. BOND, R. M., FARISS, C. J., JONES, J. J., KRAMER, A. D., MARLOW, C., SETTLE, J. E. & FOWLER, J. H. 2012. A 61-million-person experiment in social influence and political mobilization. Nature, 489, 295-298.

Page 28 of 34

BORGATTI, S. P. & EVERETT, M. G. 2006. A graph-theoretic perspective on centrality. Social networks, 28, 466-484. BORGE-HOLTHOEFER, J. & MORENO, Y. 2012. Absence of influential spreaders in rumor dynamics. Physical Review E, 85, 026116. BOUGUESSA, M. An unsupervised approach for identifying spammers in social networks. Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on, 2011. IEEE, 832-840. BRANDES, U. 2001. A faster algorithm for betweenness centrality*. Journal of mathematical sociology, 25, 163-177. BRIN, S. & PAGE, L. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks, 56, 3825- 3833. BUDAK, C., AGRAWAL, D. & EL ABBADI, A. Limiting the spread of misinformation in social networks. Proceedings of the 20th international conference on World wide web, 2011. ACM, 665-674. CATANESE, S., DE MEO, P., FERRARA, E., FIUMARA, G. & PROVETTI, A. 2012. Extraction and analysis of facebook friendship relations. Computational Social Networks. Springer. CENTOLA, D. & MACY, M. 2007. Complex contagions and the weakness of long ties1. American Journal of Sociology, 113, 702- 734. CHA, M., BENEVENUTO, F., HADDADI, H. & GUMMADI, K. 2012. The world of connections and information flow in twitter. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 42, 991-998. CHA, M., HADDADI, H., BENEVENUTO, F. & GUMMADI, P. K. 2010. Measuring user influence in twitter: The million follower fallacy. Icwsm, 10, 30. CHAI, W., XU, W., ZUO, M. & WEN, X. ACQR: A Novel Framework to Identify and Predict Influential Users in Micro-Blogging. PACIS, 2013. 20. CHAU, D. H., PANDIT, S., WANG, S. & FALOUTSOS, C. Parallel crawling for online social networks. Proceedings of the 16th international conference on World Wide Web, 2007. ACM, 1283-1284. CHEN, D.-B., XIAO, R. & ZENG, A. 2014. Predicting the evolution of spreading on complex networks. Scientific reports, 4. CHEN, D., LÜ, L., SHANG, M.-S., ZHANG, Y.-C. & ZHOU, T. 2012a. Identifying influential nodes in complex networks. Physica a: Statistical mechanics and its applications, 391, 1777-1787. CHEN, K., ZHU, P. & XIONG, Y. Mining Spam Accounts with User Influence. Information Science and Cloud Computing Companion (ISCC-C), 2013 International Conference on, 2013. IEEE, 167-173. CHEN, W., CHENG, S., HE, X. & JIANG, F. Influencerank: An efficient social influence measurement for millions of users in microblog. Cloud and Green Computing (CGC), 2012 Second International Conference on, 2012b. IEEE, 563-570. CHEN, W., COLLINS, A., CUMMINGS, R., KE, T., LIU, Z., RINCON, D., SUN, X., WANG, Y., WEI, W. & YUAN, Y. Influence Maximization in Social Networks When Negative Opinions May Emerge and Propagate. SDM, 2011. SIAM, 379-390. CHEN, W., WANG, Y. & YANG, S. Efficient influence maximization in social networks. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009. ACM, 199-208. CHENG, J.-J., LIU, Y., SHEN, B. & YUAN, W.-G. 2013. An epidemic model of rumor diffusion in online social networks. The European Physical Journal B, 86, 1-7. COBB, S. 2017. RoT: Ransomware of Things. Trends. COLIZZA, V., FLAMMINI, A., SERRANO, M. A. & VESPIGNANI, A. 2006. Detecting rich-club ordering in complex networks. Nature physics, 2, 110-115. COMIN, C. H. & DA FONTOURA COSTA, L. 2011. Identifying the starting point of a spreading process in complex networks. Physical Review E, 84, 056105. COSLEY, D., HUTTENLOCHER, D. P., KLEINBERG, J. M., LAN, X. & SURI, S. 2010. Sequential Influence Models in Social Networks. ICWSM, 10, 26. COSSU, J.-V., DUGUÉ, N. & LABATUT, V. 2015. Detecting real-world influence through Twitter. arXiv preprint arXiv:1506.05903. DE DOMENICO, M., SOLÉ-RIBALTA, A., OMODEI, E., GÓMEZ, S. & ARENAS, A. 2015. Ranking in interconnected multilayer networks reveals versatile nodes. Nature communications, 6. DING, Z.-Y., JIA, Y., ZHOU, B., HAN, Y., HE, L. & ZHANG, J.-F. 2013. Measuring the spreadability of users in microblogs. Journal of Zhejiang University SCIENCE C, 14, 701-710. DOERR, B., FOUZ, M. & FRIEDRICH, T. 2012. Why rumors spread so quickly in social networks. Communications of the ACM, 55, 70-75. DOMINGOS, P. 2005. Mining social networks for viral marketing. IEEE Intelligent Systems, 20, 80-82. DOMINGOS, P. 2012. A few useful things to know about machine learning. Communications of the ACM, 55, 78-87. DOROGOVTSEV, S. N., GOLTSEV, A. V. & MENDES, J. F. F. 2006. K-core organization of complex networks. Physical review letters, 96, 040601. DUDA, R. O., HART, P. E. & STORK, D. G. 2012. Pattern classification, John Wiley & Sons. DUFTY, N. 2012. Using social media to build community disaster resilience. Australian Journal of Emergency Management, The, 27, 40. EASLEY, D. & KLEINBERG, J. 2010. Networks, crowds, and markets: Reasoning about a highly connected world, Cambridge University Press. EBEL, H., MIELSCH, L.-I. & BORNHOLDT, S. 2002. Scale-free topology of e-mail networks. Physical Review E, 66, 035103.

Page 29 of 34

FAUST, K. 1997. Centrality in affiliation networks. Social networks, 19, 157-191. FENG, P. E. B. J. 2011. Measuring user influence on twitter using modified k-shell decomposition. FIRE, M., GOLDSCHMIDT, R. & ELOVICI, Y. 2014. Online Social Networks: Threats and Solutions. Communications Surveys & Tutorials, IEEE, 16, 2019-2036. FRANCALANCI, C. & HUSSAIN, A. 2015. A visual analysis of social influencers and influence in the tourism domain. Information and Communication Technologies in Tourism 2015. Springer. FREEMAN, L. C. 1977. A set of measures of centrality based on betweenness. Sociometry, 35-41. FREEMAN, L. C. 1979. Centrality in social networks conceptual clarification. Social networks, 1, 215-239. GAO, C., LIU, J. & ZHONG, N. 2011a. Network immunization and virus propagation in email networks: experimental evaluation and analysis. Knowledge and information systems, 27, 253-279. GAO, C., LIU, J. & ZHONG, N. 2011b. Network immunization with distributed autonomy-oriented entities. Parallel and Distributed Systems, IEEE Transactions on, 22, 1222-1229. GAO, H., BARBIER, G. & GOOLSBY, R. 2011c. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26, 10-14. GARAS, A., SCHWEITZER, F. & HAVLIN, S. 2012. A k-shell decomposition method for weighted networks. New Journal of Physics, 14, 083030. GARG, S., GUPTA, T., CARLSSON, N. & MAHANTI, A. Evolution of an online social aggregation network: an empirical study. Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, 2009. ACM, 315-321. GENSLER, S., VÖLCKNER, F., LIU-THOMPKINS, Y. & WIERTZ, C. 2013. Managing brands in the social media environment. Journal of Interactive Marketing, 27, 242-256. GHOSHAL, G. & BARABÁSI, A.-L. 2011. Ranking stability and super-stable nodes in complex networks. Nature communications, 2, 394. GILBERT, E. & KARAHALIOS, K. Predicting tie strength with social media. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2009. ACM, 211-220. GJOKA, M., KURANT, M., BUTTS, C. T. & MARKOPOULOU, A. Walking in facebook: A case study of unbiased sampling of osns. Infocom, 2010 Proceedings IEEE, 2010. IEEE, 1-9. GOEL, S., WATTS, D. J. & GOLDSTEIN, D. G. The structure of online diffusion networks. Proceedings of the 13th ACM conference on electronic commerce, 2012. ACM, 623-638. GOLDENBERG, J., HAN, S., LEHMANN, D. R. & HONG, J. W. 2009. The role of hubs in the adoption process. Journal of marketing, 73, 1-13. GOMEZ, S., DIAZ-GUILERA, A., GOMEZ-GARDEÑES, J., PEREZ-VICENTE, C. J., MORENO, Y. & ARENAS, A. 2013. Diffusion dynamics on multiplex networks. Physical review letters, 110, 028701. GONZÁLEZ-BAILÓN, S., BORGE-HOLTHOEFER, J., RIVERO, A. & MORENO, Y. 2011. The dynamics of protest recruitment through an online network. Scientific reports, 1. GOYAL, A., BONCHI, F. & LAKSHMANAN, L. V. Learning influence probabilities in social networks. Proceedings of the third ACM international conference on Web search and data mining, 2010. ACM, 241-250. GRANOVETTER, M. S. 1973. The strength of weak ties. American journal of sociology, 1360-1380. GRIFFITHS, T. L. & STEYVERS, M. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235. GUILLE, A., HACID, H., FAVRE, C. & ZIGHED, D. A. 2013. Information diffusion in online social networks: A survey. ACM SIGMOD Record, 42, 17-28. HALU, A., MONDRAGÓN, R. J., PANZARASA, P. & BIANCONI, G. 2013. Multiplex pagerank. HE, H. 2007. Eigenvectors and reconstruction. the electronic journal of combinatorics, 14, N14. HETHCOTE, H. W. 2000. The mathematics of infectious diseases. SIAM review, 42, 599-653. HINTON, G. E., OSINDERO, S. & TEH, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural computation, 18, 1527- 1554. HO, J. Y. & DEMPSEY, M. 2010. Viral marketing: Motivations to forward online content. Journal of Business Research, 63, 1000- 1006. HOGBEN, G. 2007. Security issues and recommendations for online social networks. ENISA position paper. HONG, S. & NADLER, D. Does the early bird move the polls?: The use of the social media tool'Twitter'by US politicians and its impact on public opinion. Proceedings of the 12th Annual International Digital Government Research Conference: Digital Government Innovation in Challenging Times, 2011. ACM, 182-186. HOWISON, J. & WIGGINS, A. 2011. Validity Issues in the Use of Social Network Analysis with Digital Trace Data Journal of the Association for Information Systems, 12, 767-797. HU, M., LIU, S., WEI, F., WU, Y., STASKO, J. & MA, K.-L. Breaking news on twitter. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2012. ACM, 2751-2754. IRIBARREN, J. L. & MORO, E. 2009. Impact of human activity patterns on the dynamics of information diffusion. Physical review letters, 103, 038702.

Page 30 of 34

JABEUR, L. B., TAMINE, L. & BOUGHANEM, M. Active microbloggers: identifying influencers, leaders and discussers in microblogging networks. String Processing and Information Retrieval, 2012. Springer, 111-117. JAVA, A., KOLARI, P., FININ, T. & OATES, T. Modeling the spread of influence on the blogosphere. Proceedings of the 15th international world wide web conference, 2006. 22-26. JIANG, J., WILSON, C., WANG, X., SHA, W., HUANG, P., DAI, Y. & ZHAO, B. Y. 2013. Understanding latent interactions in online social networks. ACM Transactions on the Web (TWEB), 7, 18. JIANG, Q., SONG, G., GAO, C., WANG, Y., SI, W. & XIE, K. Simulated annealing based influence maximization in social networks. Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011. JOTHI, P. S., NEELAMALAR, M. & PRASAD, R. S. 2011. Analysis of social networking sites: A study on effective communication strategy in developing brand communication. Journal of media and communication studies, 3, 234. KATSIMPRAS, G., VOGIATZIS, D. & PALIOURAS, G. Determining Influential Users with Supervised Random Walks. Proceedings of the 24th International Conference on World Wide Web Companion, 2015. International World Wide Web Conferences Steering Committee, 787-792. KATZ, E. & LAZARSFELD, P. F. 1955. Personal Influence, The part played by people in the flow of mass communications, Transaction Publishers. KATZ, L. 1953. A new status index derived from sociometric analysis. Psychometrika, 18, 39-43. KEMPE, D., KLEINBERG, J. & TARDOS, É. Maximizing the spread of influence through a social network. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003. ACM, 137-146. KHAN, M. S., WAHAB, A. W. A., HERAWAN, T., MUJTABA, G., DANJUMA, S. & AL-GARADI, M. A. 2016a. Virtual community detection through the association between prime nodes in online social networks and its application to ranking algorithms. IEEE Access, 4, 9614-9624. KHAN, M. U. S., ALI, M., ABBAS, A., KHAN, S. & ZOMAYA, A. 2016b. Segregating Spammers and Unsolicited Bloggers from Genuine Experts on Twitter. IEEE Transactions on Dependable and Secure Computing. KIM, E. S. & HAN, S. S. An analytical way to find influencers on social networks and validate their effects in disseminating social games. Social Network Analysis and Mining, 2009. ASONAM'09. International Conference on Advances in, 2009. IEEE, 41-46. KIM, J., KIM, S.-K. & YU, H. Scalable and parallelizable processing of influence maximization for large-scale social networks? Data Engineering (ICDE), 2013 IEEE 29th International Conference on, 2013. IEEE, 266-277. KITSAK, M., GALLOS, L. K., HAVLIN, S., LILJEROS, F., MUCHNIK, L., STANLEY, H. E. & MAKSE, H. A. 2010. Identification of influential spreaders in complex networks. Nature Physics, 6, 888-893. KREINDLER, G. E. & YOUNG, H. P. 2014. Rapid innovation diffusion in social networks. Proceedings of the National Academy of Sciences, 111, 10881-10888. KRZAKALA, F., MOORE, C., MOSSEL, E., NEEMAN, J., SLY, A., ZDEBOROVÁ, L. & ZHANG, P. 2013. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110, 20935-20940. KULVIWAT, S., BRUNER, G. C. & AL-SHURIDAH, O. 2009. The role of social influence on adoption of high tech innovations: The moderating effect of public/private consumption. Journal of Business Research, 62, 706-712. KWAK, H., LEE, C., PARK, H. & MOON, S. What is Twitter, a social network or a news media? Proceedings of the 19th international conference on World wide web, 2010. ACM, 591-600. KWON, S., CHA, M., JUNG, K., CHEN, W. & WANG, Y. Prominent features of rumor propagation in online social media. Data Mining (ICDM), 2013 IEEE 13th International Conference on, 2013. IEEE, 1103-1108. LAWYER, G. 2015. Understanding the influence of all nodes in a network. Scientific reports, 5. LERMAN, K. & GHOSH, R. 2010. Information Contagion: An Empirical Study of the Spread of News on Digg and Twitter Social Networks. ICWSM, 10, 90-97. LESKOVEC, J., ADAMIC, L. A. & HUBERMAN, B. A. 2007a. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1, 5. LESKOVEC, J., KRAUSE, A., GUESTRIN, C., FALOUTSOS, C., VANBRIESEN, J. & GLANCE, N. Cost-effective outbreak detection in networks. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007b. ACM, 420-429. LETIERCE, J., PASSANT, A., BRESLIN, J. & DECKER, S. 2010. Understanding how Twitter is used to spread scientific messages. LI, H., BHOWMICK, S. S., SUN, A. & CUI, J. 2015a. Conformity-aware influence maximization in online social networks. The VLDB Journal—The International Journal on Very Large Data Bases, 24, 117-141. LI, H., CUI, J.-T. & MA, J.-F. 2015b. Social Influence Study in Online Networks: A Three-Level Review. Journal of Computer Science and Technology, 30, 184-199. LI, Q., ZHOU, T., LÜ, L. & CHEN, D. 2014. Identifying influential spreaders by weighted LeaderRank. Physica A: Statistical Mechanics and its Applications, 404, 47-55. LIAO, H., MARIANI, M. S., MEDO, M., ZHANG, Y.-C. & ZHOU, M.-Y. 2017. Ranking in evolving complex networks. Physics Reports. LIU, D. & CHEN, X. Rumor propagation in online social networks like twitter--a simulation study. Multimedia Information Networking and Security (MINES), 2011 Third International Conference on, 2011. IEEE, 278-282.

Page 31 of 34

LIU, J.-G., REN, Z.-M. & GUO, Q. 2013. Ranking the spreading influence in complex networks. Physica A: Statistical Mechanics and its Applications, 392, 4154-4159. LIU, N., LI, L., XU, G. & YANG, Z. Identifying Domain-Dependent Influential Microblog Users: A Post-Feature Based Approach. Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014. LIU, Y.-Y., SLOTINE, J.-J. & BARABÁSI, A.-L. 2011. Controllability of complex networks. Nature, 473, 167-173. LIU, Y., TANG, M., ZHOU, T. & DO, Y. 2015a. Core-like groups result in invalidation of identifying super-spreader by k-shell decomposition. Scientific reports, 5. LIU, Y., TANG, M., ZHOU, T. & DO, Y. 2015b. Improving the accuracy of the k-shell method by removing redundant links-from a perspective of spreading dynamics. arXiv preprint arXiv:1505.07354. LÜ, L., CHEN, D., REN, X.-L., ZHANG, Q.-M., ZHANG, Y.-C. & ZHOU, T. 2016. Vital nodes identification in complex networks. Physics Reports, 650, 1-63. LÜ, L., ZHANG, Y.-C., YEUNG, C. H. & ZHOU, T. 2011. Leaders in social networks, the delicious case. PloS one, 6, e21202. LUXTON, D. D., JUNE, J. D. & FAIRALL, J. M. 2012. Social media and suicide: a public health perspective. American Journal of Public Health, 102, S195-S200. MA, S., CHEN, G., FU, L., WU, W., TIAN, X., ZHAO, J. & WANG, X. 2017. Seeking powerful information initial spreaders in online social networks: a dense group perspective. Wireless Networks, 1-19. MARIANI, M. S., MEDO, M. & ZHANG, Y.-C. 2015. Ranking nodes in growing networks: When PageRank fails. Scientific reports, 5. MEI, Y., ZHONG, Y. & YANG, J. Finding and Analyzing Principal Features for Measuring User Influence on Twitter. Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on, 2015. IEEE, 478-486. MISHNA, F., COOK, C., GADALLA, T., DACIUK, J. & SOLOMON, S. 2010. Cyber bullying behaviors among middle and high school students. American Journal of Orthopsychiatry, 80, 362-374. MISLOVE, A., MARCON, M., GUMMADI, K. P., DRUSCHEL, P. & BHATTACHARJEE, B. Measurement and analysis of online social networks. Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, 2007. ACM, 29-42. MORONE, F. & MAKSE, H. A. 2015a. Influence maximization in complex networks through optimal percolation. Nature, 524, 65- 68. MORONE, F., MIN, B., BO, L., MARI, R. & MAKSE, H. A. 2016. Collective Influence Algorithm to find influencers via optimal percolation in massively large social media. Scientific reports, 6. MUCHNIK, L., PEI, S., PARRA, L. C., REIS, S. D., ANDRADE JR, J. S., HAVLIN, S. & MAKSE, H. A. 2013. Origins of power-law degree distribution in the heterogeneity of human activity in social networks. Scientific reports, 3. NEPUSZ, T. & VICSEK, T. 2012. Controlling edge dynamics in complex networks. Nature Physics, 8, 568-573. NGUYEN, T. H. & SZYMANSKI, B. K. Social ranking techniques for the web. Advances in Social Networks Analysis and Mining (ASONAM), 2013 IEEE/ACM International Conference on, 2013. IEEE, 49-55. PASTOR-SATORRAS, R. & VESPIGNANI, A. 2001. Epidemic spreading in scale-free networks. Physical review letters, 86, 3200. PASTOR-SATORRAS, R. & VESPIGNANI, A. 2002. Immunization of complex networks. Physical Review E, 65, 036104. PEI, S. & MAKSE, H. A. 2013. Spreading dynamics in complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2013, P12002. PEI, S., MUCHNIK, L., ANDRADE JR, J. S., ZHENG, Z. & MAKSE, H. A. 2014b. Searching for superspreaders of information in real- world social media. Scientific reports, 4, 5547. PEI, S., MUCHNIK, L., TANG, S., ZHENG, Z. & MAKSE, H. A. 2015. Exploring the complex pattern of information spreading in online blog communities. PROBST, F., GROSSWIELE, D.-K. L. & PFLEGER, D.-K. R. 2013. Who will lead and who will follow: Identifying Influential Users in Online Social Networks. Business & Information Systems Engineering, 5, 179-193. RÄBIGER, S. & SPILIOPOULOU, M. 2015. A framework for validating the merit of properties that predict the influence of a twitter user. Expert Systems with Applications, 42, 2824-2834. RADICCHI, F. 2011. Who is the best player ever? A complex network analysis of the history of professional tennis. PloS one, 6, e17249. RADICCHI, F. & CASTELLANO, C. 2016. Leveraging percolation theory to single out influential spreaders in networks. Physical Review E, 93, 062314. RATKIEWICZ, J., CONOVER, M., MEISS, M., GONÇALVES, B., FLAMMINI, A. & MENCZER, F. Detecting and Tracking Political Abuse in Social Media. ICWSM, 2011. RICHARDSON, M. & DOMINGOS, P. Mining knowledge-sharing sites for viral marketing. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002. ACM, 61-70. RIQUELME, F. & GONZÁLEZ-CANTERGIANI, P. 2016. Measuring user influence on Twitter: A survey. Information processing & management, 52, 949-975. ROMERO, D. M., GALUBA, W., ASUR, S. & HUBERMAN, B. A. 2011. Influence and passivity in social media. Machine learning and knowledge discovery in databases. Springer. SAITO, K., KIMURA, M., OHARA, K. & MOTODA, H. 2012. Efficient discovery of influential nodes for SIS models in social networks. Knowledge and information systems, 30, 613-635.

Page 32 of 34

SAKAKI, T., OKAZAKI, M. & MATSUO, Y. Earthquake shakes Twitter users: real-time event detection by social sensors. Proceedings of the 19th international conference on World wide web, 2010. ACM, 851-860. SEN PEI, X. T., SHAMAN, J., MORONE, F. & MAKSE, H. A. 2017. Efficient collective influence maximization in cascading processes with first-order transitions. Scientific Reports, 7. SHEIKHAHMADI, A. & NEMATBAKHSH, M. A. 2017. Identification of multi-spreader users in social networks for viral marketing. Journal of Information Science, 43, 412-423. SILVA, A., GUIMARÃES, S., MEIRA JR, W. & ZAKI, M. ProfileRank: finding relevant content and influential users based on information diffusion. Proceedings of the 7th Workshop on Social Network Mining and Analysis, 2013. ACM, 2. SINGH, P., SREENIVASAN, S., SZYMANSKI, B. K. & KORNISS, G. 2013. Threshold-limited spreading in social networks with multiple initiators. Scientific reports, 3. SONG, X., CHI, Y., HINO, K. & TSENG, B. Identifying opinion leaders in the blogosphere. Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007a. ACM, 971-974. SONG, X., CHI, Y., HINO, K. & TSENG, B. L. Information flow modeling based on diffusion rate for prediction and ranking. Proceedings of the 16th international conference on World Wide Web, 2007b. ACM, 191-200. SONG, X., TSENG, B. L., LIN, C.-Y. & SUN, M.-T. Personalized recommendation driven by information flow. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006. ACM, 509-516. STIEGLITZ, S. & DANG-XUAN, L. 2013. Social media and political communication: a social media analytics framework. Social Network Analysis and Mining, 3, 1277-1291. STROGATZ, S. H. 2001. Exploring complex networks. Nature, 410, 268-276. SUBRAMANI, M. R. & RAJAGOPALAN, B. 2003. Knowledge-sharing and influence in online social networks via viral marketing. Communications of the ACM, 46, 300-307. SUEKI, H. 2015. The association of suicide-related Twitter use with suicidal behaviour: A cross-sectional study of young internet users in Japan. Journal of affective disorders, 170, 155-160. SUN, Q., WANG, N., ZHOU, Y. & LUO, Z. 2016. Identification of Influential Online Social Network Users Based on Multi-Features. International Journal of Pattern Recognition and Artificial Intelligence, 30, 1659015. TAN, C. W., YU, P.-D., LAI, C.-K., ZHANG, W. & FU, H.-L. Optimal detection of influential spreaders in online social networks. Information Science and Systems (CISS), 2016 Annual Conference on, 2016. IEEE, 145-150. TANG, X. & YANG, C. C. 2012. Ranking user influence in healthcare social media. ACM Transactions on Intelligent Systems and Technology (TIST), 3, 73. TENG, X., PEI, S., MORONE, F. & MAKSE, H. A. 2016. Collective influence of multiple spreaders evaluated by tracing real information flow in large-scale social networks. Scientific reports, 6. TUNKELANG, D. 2009. A twitter analog to PageRank. http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-PageRank. VALENTE, T. W. 1995. Network models of the diffusion of innovations, Hampton Press Cresskill, NJ. VOLLENBROEK, W., DE VRIES, S., CONSTANTINIDES, E. & KOMMERS, P. 2014. Identification of influence in social media communities. International Journal of Web Based Communities 4, 10, 280-297. WANG, Y., CONG, G., SONG, G. & XIE, K. Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 2010. ACM, 1039-1048. WATTS, D. J. & DODDS, P. S. 2007. Influentials, networks, and public opinion formation. Journal of consumer research, 34, 441- 458. WATTS, D. J., PERETTI, J. & FRUMIN, M. 2007. Viral marketing for the real world, Harvard Business School Pub. WEI, B., LIU, J., WEI, D., GAO, C. & DENG, Y. 2015. Weighted k-shell decomposition for complex networks based on potential edge weights. Physica A: Statistical Mechanics and its Applications, 420, 277-283. WEINBERG, T. 2009. The new community rules: Marketing on the social web, O'Reilly Sebastopol, CA. WEN, S., JIANG, J., XIANG, Y., YU, S., ZHOU, W. & JIA, W. 2014. To Shut Them Up or to Clarify: Restraining the Spread of Rumors in Online Social Networks. Parallel and Distributed Systems, IEEE Transactions on, 25, 3306-3316. WENG, J., LIM, E.-P., JIANG, J. & HE, Q. Twitterrank: finding topic-sensitive influential twitterers. Proceedings of the third ACM international conference on Web search and data mining, 2010. ACM, 261-270. WU, F., HUBERMAN, B. A., ADAMIC, L. A. & TYLER, J. R. 2004. Information flow in social groups. Physica A: Statistical Mechanics and its Applications, 337, 327-335. XIA, Y., REN, X., PENG, Z., ZHANG, J. & SHE, L. 2016. Effectively identifying the influential spreaders in large-scale social networks. Multimedia Tools Appl., 75, 8829-8841. XIANG, R., NEVILLE, J. & ROGATI, M. Modeling relationship strength in online social networks. Proceedings of the 19th international conference on World wide web, 2010. ACM, 981-990. XIAO, C., ZHANG, Y., ZENG, X. & WU, Y. 2013. Predicting user influence in social media. Journal of Networks, 8, 2649-2655. YAMAGUCHI, Y., TAKAHASHI, T., AMAGASA, T. & KITAGAWA, H. 2010. Turank: Twitter user ranking based on user-tweet graph analysis. Web Information Systems Engineering–WISE 2010. Springer.

Page 33 of 34

YAN, G., CHEN, G., EIDENBENZ, S. & LI, N. Malware propagation in online social networks: nature, dynamics, and defense implications. Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, 2011. ACM, 196-206. YANG, L., QIAO, Y., LIU, Z., MA, J. & LI, X. 2016. Identifying opinion leader nodes in online social networks with a new closeness evaluation algorithm. Soft Computing, 1-12. YANG, W., WANG, H. & YAO, Y. 2015. An immunization strategy for social network worms based on network vertex influence. Communications, China, 12, 154-166. YATES, D. & PAQUETTE, S. 2011. Emergency knowledge management and social media technologies: A case study of the 2010 Haitian earthquake. International journal of information management, 31, 6-13. YE, S., LANG, J. & WU, F. Crawling online social graphs. Web Conference (APWEB), 2010 12th International Asia-Pacific, 2010. IEEE, 236-242. YIN, J., LAMPERT, A., CAMERON, M., ROBINSON, B. & POWER, R. 2012. Using social media to enhance emergency situation awareness. IEEE Intelligent Systems, 27, 52-59. YIN, Z. & ZHANG, Y. Measuring pair-wise social influence in microblog. Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Confernece on Social Computing (SocialCom), 2012. IEEE, 502-507. ZANIN, M., PAPO, D., SOUSA, P. A., MENASALVAS, E., NICCHI, A., KUBIK, E. & BOCCALETTI, S. 2016. Combining complex networks and data mining: why and how. Physics Reports, 635, 1-44. ZENG, A. & ZHANG, C.-J. 2013. Ranking spreaders by decomposing complex networks. Physics Letters A, 377, 1031-1035. ZHANG, Z.-K., LIU, C., ZHAN, X.-X., LU, X., ZHANG, C.-X. & ZHANG, Y.-C. 2016. Dynamics of information diffusion and its applications on complex networks. Physics Reports, 651, 1-34. ZHAO, K., GREER, G. E., YEN, J., MITRA, P. & PORTIER, K. 2014a. Leader identification in an online health community for cancer survivors: a social network-based classification approach. Information Systems and e-Business Management, 1-17. ZHAO, K., YEN, J., GREER, G., QIU, B., MITRA, P. & PORTIER, K. 2014b. Finding influential users of online health communities: a new metric based on sentiment influence. Journal of the American Medical Informatics Association, 21, e212-e218. ZHAO, L., WANG, Q., CHENG, J., CHEN, Y., WANG, J. & HUANG, W. 2011. Rumor spreading model with consideration of forgetting mechanism: A case of online blogging LiveJournal. Physica A: Statistical Mechanics and its Applications, 390, 2619-2625. ZHU, H., HUANG, C. & LI, H. MPPM: Malware propagation and prevention model in online SNS. Communications Workshops (ICC), 2014 IEEE International Conference on, 2014. IEEE, 682-687. ZHU, Z. 2013. Discovering the influential users oriented to viral marketing based on online social networks. Physica A: Statistical Mechanics and its Applications, 392, 3459-3469. ZHUANG, K., SHEN, H. & ZHANG, H. 2017. User spread influence measurement in microblog. Multimedia Tools and Applications, 76, 3169-3185. ZOU, C. C., TOWSLEY, D. & GONG, W. 2007. Modeling and simulation study of the propagation and defense of internet e-mail worms. Dependable and Secure Computing, IEEE Transactions on, 4, 105-118.

Page 34 of 34