
From Information Cascade to Knowledge Transfer: Predictive Analyses on Social Networks

by

Biru Cui

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing and Information Sciences

PhD Program in Computing and Information Sciences B. Thomas Golisano College of Computing and Information Sciences

Rochester Institute of Technology
Rochester, New York
July 2016

From Information Cascade to Knowledge Transfer: Predictive Analyses on Social Networks

by Biru Cui

Committee Approval: We, the undersigned committee members, certify that we have advised and/or supervised the candidate on the work described in this dissertation. We further certify that we have reviewed the dissertation manuscript and approve it in partial fulfillment of the requirements of the degree of Doctor of Philosophy in Computing and Information Sciences.

______Dr. Shanchieh Jay Yang Date Dissertation Advisor

______Dr. Andres Kwasinski Date Dissertation Committee Member

______Dr. Christopher Homan Date Dissertation Committee Member

______Dr. Linwei Wang Date Dissertation Committee Member

______Dr. Victor Perotti Date Dissertation Committee Member

Certified by:

______Dr. Pengcheng Shi Date Director, Computing and Information Sciences

Acknowledgments

When I started writing this paragraph, memories refreshed and reminded me of the moments of these past years. It has been a long and short journey: a journey of puzzlement, confusion, frustration, and starting over; a journey of friendship and encouragement; and a journey that will always remind me to be positive and never give up.

I deeply appreciate the guidance I received throughout the whole course of study from my supervisor, my mentors, and my friends: Dr. Shanchieh Jay Yang, Dr. Andres Kwasinski, Dr. Linwei Wang, Dr. Chris Homan, Dr. Justin Domke, Dr. Pengcheng Shi, Richard Tolleson, Charles Gruener, Emilio Del Plato, Haitao Du, and Wenbo Wang. I could not have achieved this without your support and help.

Finally, I would like to dedicate this dissertation to my parents, my wife, my sister, and my nephew. You are the reason I am able to keep moving forward.

Abstract

From Information Cascade to Knowledge Transfer: Predictive Analyses on Social Networks

Publication No.

Biru Cui, Ph.D. Rochester Institute of Technology, 2016

Supervisor: Dr. Shanchieh Jay Yang

As social media continues to influence our daily life, much research has focused on analyzing characteristics of social networks and tracking how information flows in social media. Information cascade originated from the study of information diffusion, which focused on how decision making is affected by others depending on the network structure. An example of such study is the SIR (Susceptible, Infected, Removed) model. Current research on information cascade mainly focuses on three open questions: diffusion models, network inference, and influence maximization. Different from these studies, this dissertation aims at deriving a better understanding of the problem of who will transfer information to whom. Particularly, we want to investigate how knowledge is transferred in social media.

The process of transferring knowledge is similar to the information cascade observed in other social networks in that both processes transfer particular information from an information container to users who do not have that information. The study first works on understanding information cascade in terms of detecting information outbreaks in Twitter and the factors affecting the cascades. Then we analyze how knowledge is transferred in the sense of adopting research topics among scholars in the DBLP network. However, knowledge transfer cannot be well modeled by scholars' publications, since a "publication" action is the result of many complicated factors and is not controlled by knowledge transfer alone.

So, we turn to the Q&A forum, a different type of social media that explicitly contains the process of transferring knowledge, where knowledge transfer is embodied by the question-and-answer process. This dissertation further investigates StackOverflow, a popular Q&A forum, and models how knowledge is transferred among StackOverflow users. The knowledge transfer includes two parts: whether a question will receive answers, and whether an answer will be accepted. By investigating these two problems, it turns out that the knowledge transfer process is affected by the temporal factor and by the knowledge level, defined as the combination of the user reputation and the posted text. Taking these factors into consideration, this work proposes TKTM (Time-based Knowledge Transfer Modeling), where the likelihood that a user transfers knowledge to another is modeled as a continuous function of time and of the knowledge level transferred. TKTM is applied to solve several predictive problems: how many user accounts will be involved in a thread to provide answers and comments over time; who will provide the answer; and who will provide the accepted answer. The results are compared to NetRate, QLI, and regression methods such as RandomForest and linear regression. In all experiments, TKTM outperforms the other methods significantly.

Table of Contents

Acknowledgments iii

Abstract iv

List of Tables xi

List of Figures xiii

Chapter 1. Introduction 1
1.1 Knowledge Mining in Social Network (Static) 1
1.2 Information Cascade in Social Network (Dynamic) 3
1.2.1 What is Information Cascade 4
1.2.2 Diffusion Models 5
1.2.3 Network Inference 7
1.2.4 Influence Maximization 8
1.2.5 Influence Estimation 9
1.3 Knowledge Mining in Q&A Forum 9
1.4 Research Problem 11

Chapter 2. Information Flow Detection 13
2.1 Influence Maximization 13
2.2 Sensing Security Vulnerability by using Twitter 14
2.2.1 Problem Definition 15
2.2.2 TCAD (Twitter Critical Account Discovery) 16
Reward Functions 17
Total Reward 19
Greedy Algorithm 20
2.2.3 Experiment Design & Result 21
2.2.4 Summary 23

Chapter 3. Information Cascade Prediction 25
3.1 Introduction 26
3.2 Conflicting Observations on Cascade Size Distribution 28
3.3 Phase-Transition and GPC 30
3.4 What Affects the Cascade Size? 33
3.4.1 Temporal Factor 33
Information Attractiveness 33
3.4.2 Spatial Factors 34
Infection at Different Locations 34
Number of Infected Friends (Parents) 35
3.5 Aggregate Cascade Model 35
3.6 Summary 38

Chapter 4. Individual Knowledge Adoption Prediction 39
4.1 Introduction 39
4.2 Influence Factors for Research Topic Adoption 41
4.2.1 Direct Observation 41
4.2.2 Indirect Observation 42
Connections 42
Topic 42
4.3 Model Formulation 43
4.3.1 NetInf* 43
4.3.2 HetNetInf 44
4.4 Experiments 46
4.4.1 The Dataset and Experimental Setting 46
4.4.2 Prediction Study 47
4.5 Summary 50

Chapter 5. Knowledge Propagation Network 53
5.1 Introduction 53
5.2 Q&A forum: StackOverflow 57
5.2.1 Dataset Description 58
5.2.2 User Network 59
5.2.3 Properties 60
Reputation 61
Thread Life Time 62
5.3 Factors affecting knowledge transfer 64
5.3.1 Questioner: Whether a Question will Receive Answers 64
5.3.2 Answerer: Which Answer will be Selected? 65
5.4 TKTM (Time based Knowledge Transfer Modeling) 68
5.5 Community analysis with TKTM 74
5.6 Evaluation on TKTM 77
5.6.1 Task 1: Thread Size Prediction 77
Thread Size Estimation with Regression 78
Thread Size Prediction with Link Weight 78
Result 80
5.6.2 Task 2: Thread Size Prediction over Time 82
5.6.3 Task 3: Individual Action Prediction 84
5.6.4 Task 4: Who will Provide the Accepted Answer 86
5.6.5 TKTM & Semantic Analysis 88
5.7 Summary 91

Chapter 6. Conclusion 92

Appendix 95

Appendix 1. Datasets 96
1.1 DBLP 96
1.2 Digg Network 97
1.3 Twitter Network 99
1.4 StackOverflow 99

Appendix 2. Network Analysis 102
2.1 Terminologies 102
2.2 Random Networks 102
2.2.1 Poisson Random Network 102
2.2.2 General Degree Distribution Random Network 104
2.3 Giant Component 105
2.4 K-core size Estimation 107
2.4.1 107
2.4.2 Example 109

Bibliography 112

List of Tables

3.1 Independent cascade generation process 29
3.2 Aggregate cascade generation process 37
4.1 Topic Adoption Prediction Performance 49
4.2 Individual Features Weight 50
5.1 Q&A forum Terminologies 57
5.2 Datasets Summary 59
5.3 Statistics related to accepted answers 63
5.4 Features for predicting #answers 65
5.5 Features for classifying whether an answer will be accepted 66
5.6 MSE of estimation on Q&A thread size 81
5.7 Estimation on Answerer in MRR 85
5.8 Algorithm of estimating who will provide accepted answer 87
5.9 Estimation on Answerer in MRR 87
5.10 Prediction on Answerer in MRR with different γ 90
1.1 DBLP:Author 96
1.2 DBLP:Paper 97
1.3 DBLP:Citation 97
1.4 DBLP:Term 97
1.5 Digg:Users 98
1.6 Digg:Friendships 98
1.7 Digg:Votes 98
1.8 Twitter:Users 99
1.9 Twitter:Friendships 99
1.10 Twitter:Tweets 100
1.11 StackOverflow datasets 101
2.1 Terminology List 103
2.2 GPC(k-core) size 110
2.3 GPC(k-core) size 111

List of Figures

1.1 Cascades with different number of seeds 6
2.1 Each layer stands for a CVE topic; each CVE has a CWE type which belongs to a category. Circles in layers are tweets. Each user may publish tweets among topics. 17
2.2 Topics coverage of selected accounts in security vulnerability categories 22
2.3 Topic publish timeliness of selected accounts in security vulnerability categories 23
3.1 Cascade size distribution 30
3.2 Cascade size distribution with different uniform infection probability λ 32
3.3 Number of seeds emerged over time 34
3.4 Exponential fitting of the decay of the infection probability at different locations 35
3.5 Distribution of the infection waiting time in the Digg network dataset 36
3.6 Cascade size distribution with the aggregate cascade generation algorithm 38
4.1 Individual Popularity 43
4.2 Prediction Performance 49
4.3 Prediction Performance 49
4.4 Feature Weight Distribution 50
4.5 Feature Weight Distribution 50
5.1 Q&A thread structure 58
5.2 Degree Distribution of user network 60
5.3 Distribution of user reputation 61
5.4 Distribution of T_c^a 62
5.5 Importance of Features 67
5.6 Network with TKTM 75
5.7 Distribution of actors' reputation, closeness and activeness 76
5.8 Thread Size Distribution 82
5.9 MSE of prediction over time (answers only) 83
5.10 Prediction on Answerer in MRR with different γ 90
2.1 Poisson Random Network Giant Component Size 106
2.2 Graphic solution of u 109
2.3 Digg network out-degree distribution 110

Chapter 1

Introduction

As social media continues to influence our daily life, much research has focused on analyzing characteristics of social networks and tracking how information flows in social media. Information cascade is one important component of social network analysis. In general, these studies can be grouped into two categories: static information mining and dynamic information mining, where information cascade belongs to the latter.

1.1 Knowledge Mining in Social Network (Static)

A social network is a heterogeneous network where each node is linked into different networks based on different properties. For example, an author can be linked into a network based on co-authorship, and can also be linked into a network based on whether two authors joined the same conference. Most work in this category focuses on how to leverage the rich heterogeneous network information to mine knowledge that is impossible or difficult to derive from a homogeneous network.

Existing research can be grouped into the following categories:

• Ranking & Clustering Using a bibliography network as an example, there are links between author-paper, paper-venue, and author-venue. Sun et al. [1] used meta-paths to define the social connections. In [2], Sun et al. learned the weights of meta-paths, and further used these meta-paths for clustering. Clustering is also combined with ranking to achieve better performance. Sun et al. [3] did the ranking and clustering together in a bibliography network, based on the assumption that authors' rank distributions over different clusters should be very different. Ji et al. [4] solved the same problem with a slightly different approach, where they assume an object's rank in a class depends on its linked features' rank in the same class.

• Object Reconciliation (Similarity) Sun et al. [5] defined the similarity of two authors through the meta-paths between them. Given the author-venue adjacency matrix W_ac, the meta-path count between authors is W_ac × (W_ac)^T. In this case, authors are similar if they attend the same venues the same number of times (see the sketch after this list).

• Link Prediction Link prediction is another big topic which attracts many researchers. There are, in general, two approaches: supervised learning and unsupervised learning. Sun et al. [1] analyzed the co-author probability for two authors who never co-authored before. They derived social connection features to learn a model which best fits co-author events with maximum likelihood. Since the link recommended is the link which has a high probability of being accessed, it can be thought of as a random walk process. Backstrom et al. [6] learned a function to assign edge weights, such that a random walker is more likely to visit the nodes to which a new link will be created. This function assigns high weight to node pairs where a link is going to be created in the future, and low weight to node pairs with no link. Heterogeneous network connections are input as features to train the function.

Other unsupervised learning methods in link prediction include: CN (common neighbors) [7], AA (each CN is weighted by its degree) [8], JA (CN over all their neighbors) [8] [9], and PA (the similarity of two nodes is evaluated by the product of their degrees) [10] [7]; a small sketch of these scores follows this list. Yang et al. [11] proposed a Multi-Relational Influence Propagation (MRIP) model based on the assumption that influence propagates through all types of links. The distribution of influence on these links depends on the correlation between them. A new link is recommended as the node pair which has a high likelihood to pass the influence. Goyal et al. [12] proposed another unsupervised learning algorithm to estimate the link strength by measuring the ratio of same actions performed by two actors. Sun et al. [13] further tried to predict when a new link will be created.
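As a concrete illustration, the following minimal Python sketch (not from the dissertation; the toy author-venue matrix and graph are hypothetical) computes the meta-path similarity W_ac × (W_ac)^T and the CN, AA, JA, and PA scores discussed above:

```python
import math
import numpy as np

# Meta-path similarity: rows are authors, columns are venues (toy counts).
W_ac = np.array([[2, 0, 1],
                 [1, 1, 0],
                 [0, 3, 0]])
# (W_ac @ W_ac.T)[i, j] counts author-venue-author meta-path instances, so
# authors who attend the same venues the same number of times score high.
meta_path = W_ac @ W_ac.T

# Unsupervised link-prediction scores on an undirected toy graph.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}

def cn(u, v):  # common neighbors
    return len(adj[u] & adj[v])

def aa(u, v):  # Adamic/Adar: each common neighbor weighted by its degree
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[u] & adj[v] if len(adj[z]) > 1)

def ja(u, v):  # Jaccard: common neighbors over all neighbors
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def pa(u, v):  # preferential attachment: product of degrees
    return len(adj[u]) * len(adj[v])

print(meta_path)
print(cn("a", "d"), aa("a", "d"), ja("a", "d"), pa("a", "d"))
```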

1.2 Information Cascade in Social Network (Dynamic)

On the other hand, another branch of research is interested in investigating how information flows in the social network, which is interchangeable with diffusion or cascading. The study focuses on solving two problems: how the information propagates over a given network structure, and how to infer the inherent network structure based on observations of information propagation.

1.2.1 What is Information Cascade

The study of diffusion models and network inference provides the foundation to analyze the information cascade. Leskovec et al. [14] first discussed the cascading behavior in blogs. They raised a fundamental question: how to model the generation of a cascade so as to understand how blogs cite and influence each other and how the popularity of old posts drops over time. A blog is the container of posts, and posts stand for the information. A cascade is defined as a tree structure with a single starting post, called the cascade initiator, that has no links to other posts. A blog has a link pointing to another blog if its post refers to (contains the URL address of) the second one's post. Each original post creates a cascade, and the information is propagated to other blogs by post referring. The observation they found in a large collection of blogs (2.5 million blogs, 21.3 million posts) is: most (97%) cascades are trivial (isolated posts); only a very small number of cascades have relatively complex topologies; and the distribution of cascade size is a power law. Similar observations were found in, but not limited to, [15], [16], [17], [18], [19], [20], [21], [22], [23], [24].

Fig. 1.1a shows a network topology, where each circle stands for a node named from letter A to Q. The directed edge from node i to j represents that j is watching i's activities, i.e., j is a fan (or follower) of i, and j designates i as its friend. If i is infected before j, i will expose its influence to j. Marked nodes represent infected nodes; the number is the time at which a node is infected. The influence network is composed of the nodes and the directed edges, and all possible cascades are supposed to be explained by this influence network. The weight of directed edges, and how to learn the directed edges in cases where the network structure is unknown, will be discussed in Chapter 4. Following the cascade extraction method of [14] and [25], if a node is infected with none of its friends infected before it, this node is selected as the initiator, or seed, of the cascade. Other members of the cascade are chosen by including infected nodes which are fans of members of the cascade and are infected later than the existing members. Fig. 1.1b shows the cascades extracted from Fig. 1.1a. Node L belongs to two cascades, originated from nodes H and Q. Though I is a fan of K, since K is infected after I, I is not influenced by K.
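The extraction rule just described can be sketched in a few lines of Python; the toy friend lists and infection times below are hypothetical, loosely mirroring Fig. 1.1:

```python
friends = {   # child -> set of parents the child watches
    "C": set(), "N": set(), "Q": set(), "H": set(),
    "G": {"C"}, "A": {"N"}, "I": {"H", "K"}, "L": {"H", "Q"}, "K": {"I"},
}
infect_time = {"N": 1, "Q": 2, "A": 3, "C": 4, "H": 5, "I": 6, "K": 7, "L": 8, "G": 9}

cascades = {}    # seed -> list of (node, time)
member_of = {}   # node -> set of seeds whose cascades the node joined
for node in sorted(infect_time, key=infect_time.get):
    earlier = {p for p in friends.get(node, set())
               if p in infect_time and infect_time[p] < infect_time[node]}
    if not earlier:  # no friend infected before it: the node starts a new cascade
        cascades[node] = [(node, infect_time[node])]
        member_of[node] = {node}
    else:            # join every cascade an earlier-infected parent belongs to
        member_of[node] = set().union(*(member_of[p] for p in earlier))
        for seed in member_of[node]:
            cascades[seed].append((node, infect_time[node]))

for seed, members in cascades.items():
    print(seed, members)   # e.g., L appears under both H's and Q's cascades
```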

1.2.2 Diffusion Models

The purpose of this set of studies is to model the information diffusion such that the cascade can be estimated and predicted. The classic influence models are the SIR and SIS models, where Kermack et al. [26] define the population transition between susceptible, infected, and recovered with immunity. Based on the similar epidemic disease propagation concept, later works model the infection using the threshold model and the cascade model. Kermack et al. [26] proposed the SIR (Susceptible, Infected, and Removed) model to describe the disease infection process among populations and to estimate the likelihood of an epidemic. Granovetter et al. [27] proposed the threshold model, where a node adopts when the weight connected to

Figure 1.1: Cascades with different number of seeds. (a) Network Topology; (b) Extracted Cascades.

it crosses a threshold. The influence each node receives is the accumulated link weight from its infected friends, and the node is infected once the received influence crosses the threshold. In [28], Goldenberg et al. define the cascade model, in which a node is independently infected by each of its neighbors with a probability. Kempe et al. [29] proved that these two models, the linear threshold model and the independent cascade (probability) model, are submodular and can be solved with a greedy algorithm that is bounded by a (1 − 1/e) approximation of the optimum. Gruhl et al. [30] characterized how topics propagate from individual to individual, and used the theory of infectious diseases to model the flow. Different from previous work, which requires knowing the social network,

Jaewon et al. [31] model the process of infection with a linear influence model, where the number of newly infected nodes after a node becomes infected is modeled as a linear function, such that the total number of infected nodes at each time point is the summation of the infection functions of all infected nodes. Leskovec et al. [14] also proposed a simple cascade generation model based on the independent cascade model. Leskovec et al. [24] developed a meme-tracking tool to capture topic threads propagated among websites. Besides the social connection features of a heterogeneous network, information adoption could also be due to external influence such as mass media; Myers et al. first discussed the external influence in [32]. The above models are discrete in time, while other works extend them to continuous time. Gomez-Rodriguez et al. [33] proposed the NetRate algorithm, where each edge is associated with an exponential time function. Du et al. [34] extended NetRate by replacing the hazard function with a linear combination of kernel functions to fit various networks.

1.2.3 Network Inference

The previous set of works assumes the influence network is known and that information is propagated through this network. For example, a retweet in Twitter explicitly shows the information spreading from the original tweeter to the retweeter through the established following relation. However, in practice, such an influence network structure may not be clear.

• Network Structure is Unknown Gomez-Rodriguez et al. [35] discovered the influence relations beneath the information diffusion network by using an approximation algorithm to best explain the time sequences in which the nodes adopt information. They selected the edges with high weights, where the edge weight is the infection probability. Myers et al. [36] studied the generalized network inference problem and solved it by transforming it into an equivalent convex optimization problem. Gomez-Rodriguez et al. [33] defined the link weight via NetRate, where it is decided by a temporal model; NetRate is learned by fitting all cascades with maximum likelihood.

• Incomplete Cascade Data Sadikov et al. [37] proposed a numerical method to estimate whole-network properties with partial information diffusion data. Similar work includes [38], where Kim et al. inferred part of the network from partially known data.

1.2.4 Influence Maximization

The above works focused on building the influence model based on intact or partial data. Another set of works discussed how to use the influence: applications based on influence propagation focus on how to maximize the effect by selecting a small number of critical nodes to start the cascade. Most applications are about how to maximize the influence. Kempe et al. [29] proposed a greedy method to find the initial seed set that reaches maximum influence, which is the fundamental work of this area. Leskovec et al. [39] proposed a greedy hill-climbing algorithm to detect information outbreaks among blogs by monitoring fewer nodes while collecting as much information as soon as possible. Chen et al. [40] analyzed the performance of the greedy algorithm and proposed improvements for influence maximization.

8 1.2.5 Influence Estimation

The study of influence maximization also introduces another problem, i.e., influence estimation. A critical step in the greedy algorithm of influence maximization is to estimate the influence of a node or a set. Kempe et al. [29] left finding the exact influence as an open question, but noted that it can be approximated by Monte Carlo simulation. However, a reasonable approximation requires a large number of simulations (more than 10,000) for each seed set. Subsequent studies investigated various approaches to speed up such estimates [39], [41], [42], [43], [44], [45], [46], [47], by either decreasing the number of simulations for Monte-Carlo-based estimation or finding an alternative algorithm. Kimura et al. [46] proposed two closely related algorithms based on shortest paths. Chen et al.'s algorithm [47] is a variant of the shortest-path approach. Aggarwal [43] proposed a steady-state spread method to estimate the influence. Yang et al. [44] proposed another way to estimate the influence by approximating the influence from in-neighbors as a linear combination. While most research worked on cascade size estimation over infinite time, Du [48] studied cascade size estimation within a time period. Cohen [49] proposed a sketch-based algorithm able to scale influence estimation to huge networks with billions of edges.

1.3 Knowledge Mining in Q&A Forum

Sec. 1.1 and Sec. 1.2 list the studies of mining knowledge in static heterogeneous networks and in dynamic information flows. Knowledge mining in a Q&A forum is the combination of these two, in that a Q&A forum is a heterogeneous network and knowledge transfer is one particular type of information propagation.

Yang et al. [50] worked on the problem of predicting whether a question will receive an answer. Anderson et al. [51] analyzed the StackOverflow dataset, focusing on predicting whether a Q&A thread will have long-lasting value, i.e., receive high or low pageviews in the future, and whether a question will receive a satisfactory answer. Asaduzzaman et al. [52] investigated unanswered questions and tried to predict how long a question will remain unanswered. Vasilescu et al. [53] investigated how people's behavior migrates from the mailing list to the Q&A forum, particularly in StackOverflow; they found users are more active if they contribute to both the mailing list and the Q&A forum, and that they answer faster in the Q&A forum than in the mailing list. Hanrahan et al. [54] modeled question difficulty by the waiting time before a question receives the accepted answer, and users' expertise by their reputation. To route questions to potential answerers, Chang et al. [55] and Li et al. [56] proposed multi-metric ranking algorithms to select potential answerers; Zhou et al. [57] proposed a classification approach to route questions and investigated the results of different combinations of features. Liu et al. [58] investigated the importance of features extracted from the question, questioner, answers, and answerers in predicting the questioner's satisfaction, and solved it as a classification problem.

Many Q&A related studies worked on whether a question can be answered and answered by whom. Classification is a common approach, and some studies investigated how to select features to improve performance. Classification is able to find the relation between the features and the result, but it ignores the inherent process of how knowledge is transferred between users.

1.4 Research Problem

As discussed above, current research on information cascade mainly focuses on network inference, diffusion models, influence estimation, and influence maximization. Different from these studies, we are more interested in solving the problem of who will adopt the information, or who will join the cascade, during the process of information propagation. In particular, we want to investigate how knowledge is transferred between persons. Compared to conventional social media, where information can be adopted by anyone and adoption is casual, depending mainly on shared interests, the action reflected in knowledge propagation also depends on the knowledge and expertise level of the person. The study is motivated to combine the merits of the studies in these two fields to model the knowledge transfer process and use the result to perform predictive analysis from a unique perspective. Thus, the research problem is how to model the process of knowledge being transferred among user accounts in social media; the study tries to model the process of knowledge transfer in a Q&A forum from the perspective of information diffusion.

The following chapters discuss these problems in detail. Chapter 2 and Chapter 3 investigate general information propagation problems, while Chapter 4 and Chapter 5 work on knowledge propagation. Chapter 2 extends influence maximization to find critical nodes in the Twitter network to detect outbreak information as soon as possible. Chapter 3 analyzes the whole information cascade's trend and investigates what factors could affect the cascading process. Chapter 4 investigates how knowledge (research topics) propagates among scholars, and predicts who will have a relevant publication by incorporating social network features among scholars. Chapter 5 works on the StackOverflow Q&A forum and develops a time-based knowledge propagation network to model how knowledge propagates from answerer to questioner. Chapter 6 gives the conclusion.

Chapter 2

Information Flow Detection

2.1 Influence Maximization

Social media is a platform which connects people all over the world into a huge network. Information can be generated at any position and flow to other parts of the network. One direct application is viral marketing, where the purpose is to spread the information (an advertisement) to as many people as possible, while keeping the cost (the number of originators) within a constraint; in other words, to maximize information influence.

Define the social network as a graph G(V, E) which contains |V| = n nodes and |E| = m edges. σ(·) is the influence function, where σ(A) is the total number of nodes activated when A is the initial set.[1] If every node v has its weight w_v, let B denote the set activated by the process with initial activation A. Then

σ(A) = Σ_{v∈B} w_v

The cost of selecting the initial set A is c(A). So, the problem is generalized as:

max_{A⊆V} σ(A) subject to c(A) ≤ L    (2.1.1)

where L is the constraint (limitation).

[1] Note: since A is a set and not a single seed, the maximum influence is for the total infection size, not the cascade size.

The problem is NP-complete, similar to the clique problem. Basically, it is to find the set A from V which gives the maximum gain. [29] is the most fundamental work in answering this question, where Kempe et al. proved the influence function σ(·) is submodular, i.e.,

σ(S ∪ {v}) − σ(S) ≥ σ(T ∪ {v}) − σ(T)    (2.1.2)

for all nodes v and all pairs of node sets S ⊆ T. This property of submodular functions guarantees that the set found by the hill-climbing greedy algorithm approximates the optimum to within a factor of (1 − 1/e). The greedy algorithm is briefly described as: start with an empty set, and repeatedly add the node that gives the maximum marginal gain. In this way, the original NP-hard problem can be solved with a greedy algorithm.
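A minimal Python sketch of this hill-climbing greedy selection, with σ(·) approximated by Monte Carlo simulation of the independent cascade model, might look as follows; the toy graph and probabilities are hypothetical, not from the dissertation:

```python
import random

def simulate_ic(graph, seeds):
    """One independent-cascade run; graph[u] = [(v, p_uv), ...]."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for v, p in graph.get(u, []):
            if v not in active and random.random() < p:
                active.add(v)
                frontier.append(v)
    return len(active)

def estimate_sigma(graph, seeds, runs=1000):
    # Monte Carlo estimate of sigma(seeds)
    return sum(simulate_ic(graph, seeds) for _ in range(runs)) / runs

def greedy_max_influence(graph, nodes, k):
    seeds = set()
    for _ in range(k):
        # add the node with maximum marginal gain sigma(S + v) - sigma(S)
        best = max((v for v in nodes if v not in seeds),
                   key=lambda v: estimate_sigma(graph, seeds | {v}))
        seeds.add(best)
    return seeds

graph = {1: [(2, 0.4), (3, 0.3)], 2: [(4, 0.5)], 3: [(4, 0.2)], 4: []}
print(greedy_max_influence(graph, [1, 2, 3, 4], 2))
```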

2.2 Sensing Security Vulnerability by using Twitter

The intuition behind this part of the work is whether the influence maximization algorithm can be applied to the security-related area. The prevalence of Twitter provides timely information in a concise format. Besides sharing people's daily life or personal opinions, Twitter can also play an important role in broadcasting emergency information to the public when facing disasters. Sakaki et al. [59] first used Twitter as social sensors to detect earthquakes in Japan. Every person who has a connection to the Internet can be a potential sensor which collects information and publishes it on Twitter. Similar uses can be applied to cyber security when new vulnerabilities or exploits are discovered. Unfortunately, as the volume of tweets and the number of Twitter accounts grow, research has shown that only a small portion of tweets are worth reading [60]. The overwhelming number of retweets, duplicated data, and trivial information makes this powerful social media less useful than it should be. So the question is how to obtain the most valuable information from the fewest accounts and fewest tweets, in the most timely fashion, for cyber vulnerability related information in Twitter.

2.2.1 Problem Definition

The problem may be formally described as follows. For a specific security vulnerability category C, find the set of critical accounts S ⊂ V so that the total reward function R(S) is maximized subject to the cost function C(S) ≤ L. Later on, the work integrates the cost constraint into the total reward function and solves the multi-criteria problem heuristically.

max_{S⊆V} R(S) subject to cost(S) ≤ L    (2.2.1)

2.2.2 TCAD (Twitter Critical Account Discovery)

The first challenge in identifying critical accounts in a specific topic area, i.e., Cyber Vulnerabilities for this work, is to determine a filtering mechanism for tweets related to the topic. In general, this requires intelligent semantic processing, allowing one to determine not only the set of relevant tweets but also their context. In the context of Cyber Vulnerability, a well defined and relatively unambiguous set of keywords is the Common Vulnerabilities and Exposures index (CVE xxxx xxxx), an 8 digit number following 'CVE' - a dictionary of publicly known information security vulnerabilities [61]. Using CVE tags as keywords finds tweets that explicitly address cyber vulnerabilities, and allows this work to concentrate on developing the algorithm that finds the critical accounts from the relevant tweets. The semantic meaning of each tweet is not treated in this work and will be addressed in future work.

Every CVE tag is assigned a CWE (Common Weakness Enumeration) type by a third party organization [62], which defines a software security weakness. Several CWEs make up a security vulnerability category. For example, CWE 119 (Failure to Constrain Operations within the Bounds of a Memory Buffer) and CWE 20 (Im- proper Input Validation) belong to the same category that causes Denial-of-Service (DoS). Fig. 2.1 illustrates the structure of CVE, CWE and security vulnerability categories along with the Twitter accounts and retweet relationship using directed edges.

In this information hierarchy, each CVE tag is treated as a unique information topic within a security vulnerability category, and tweets containing this CVE tag are attached to this topic.


Figure 2.1: Each layer stands for a CVE topic; each CVE has a CWE type which belongs to a category. Circles in layers are tweets. Each user may publish tweets among topics.

Each node in Fig. 2.1 represents a tweet, either an original or a retweet. The directed edges reflect that a retweet cites the information from the original tweet. Clearly, the original tweet is more timely than its retweets, and the time stamps further differentiate the timeliness among all tweets quantitatively. Each Twitter account may post tweets on multiple topics. An account is considered to cover a topic if it publishes a tweet with that CVE tag. The more topics an account covers using the fewest and timeliest tweets, the more critical the account is. Three reward functions are defined as follows to address this multi-criteria problem.

Reward Functions

The reward functions are defined according to what we expect to gain from the selected accounts. In this case, the reward comes from three aspects:

Timeliness (R_T). Given multiple tweets about the same news, the first one to reveal the information is highly valued, with descending value for tweets thereafter.

The reward function RT for a tweet W published by account A in a CVE topic T is defined as:

R_T(A, W, T) = 1 − D(W, W_0(T)) / max_{T∈C} D(W_L(T), W_0(T))    (2.2.2)

where W_0(T) and W_L(T) are the first and last tweets in topic T; D(W, W_0(T)) defines the time difference between tweet W and W_0(T); and max_{T∈C} D(W_L(T), W_0(T)) is the longest time period among the CVE topics in category C.

Originality (R_O). Though retweets help propagate information, it is still the original tweets that are referenced, and people prefer the original tweets over the retweets. The intuition of reward R_O is that a tweet's originality is shared by all of its retweets. This is similar to the way PageRank defines the relations between web pages.

R_O(A, W̃, T) = R_O(P, W, T) / (Ñ(W) + 1)    (2.2.3)

where W̃ is a retweet of tweet W; Ñ(W) is the total number of retweets of W; and P is the account of the original tweet W. This is a recursive definition: R_O(A, W̃, T) defines the originality of A's retweet W̃, which splits the originality of its parent tweet R_O(P, W, T).

Influence (R_F). Suppose two tweets from accounts A and B contain the same information, have the same time stamp, and are both original tweets. A should be rewarded more if its tweet has been retweeted by more accounts, implying more people are influenced by the tweet from A. Thus, besides timeliness and originality, the reward of influence is defined in (2.2.4):

R_F(A, W, T) = Ñ(W) / N(T)    (2.2.4)

where N(T) is the total number of posts in topic T.
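For illustration, the three per-tweet rewards could be computed on toy data as in the following Python sketch; the records, and the base case chosen for an original tweet's R_O, are assumptions for illustration rather than values from the dissertation:

```python
tweets = {  # tweet_id -> (account, hours since topic start, retweeted parent or None)
    "w0": ("A", 0.0, None),
    "w1": ("B", 2.0, "w0"),
    "w2": ("C", 5.0, "w0"),
}
longest_span = 10.0    # max over topics of D(W_L(T), W_0(T)) in the category
n_topic = len(tweets)  # N(T): total posts in this topic

def n_retweets(w):
    return sum(1 for _, _, p in tweets.values() if p == w)

def r_timeliness(w):   # Eq. (2.2.2)
    return 1.0 - tweets[w][1] / longest_span

def r_originality(w):  # Eq. (2.2.3): a retweet splits its parent's originality
    parent = tweets[w][2]
    if parent is None:
        return 1.0     # assumed base case for an original tweet
    return r_originality(parent) / (n_retweets(parent) + 1)

def r_influence(w):    # Eq. (2.2.4)
    return n_retweets(w) / n_topic

for w in tweets:
    print(w, r_timeliness(w), r_originality(w), r_influence(w))
```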

Total Reward

With the three reward functions R_T, R_O, and R_F, the problem becomes finding the optimal set S that maximizes the reward under multiple reward functions. Note that it is possible to have two candidate sets S and S′ where R_T(S) > R_T(S′) but R_O(S) < R_O(S′). A common approach to solving such a problem is to find the Pareto-optimal set with respect to the following objective function:

max_{S⊆V} λ_1 R_T(S) + λ_2 R_O(S) + λ_3 R_F(S)    (2.2.5)

where the λ_i are weights to trade off the importance of the reward functions. As mentioned in Sec. 2.2.2, it is ambiguous to decide the reward of an account by selecting a subset of these three reward functions, and there is no reason to strengthen or weaken any one aspect. This work uses the same weight for the three rewards while normalizing each reward to [0, 1].

Recall that the optimization should be subject to the cost of the selected accounts. This work defines the cost as the total number of tweets N(A, T) published by each selected account A on a topic T. This is equivalent to finding the reward an account earns on a topic per tweet. The final total reward an account A gains on topic T per post can then be written as:

R(A, T) = ( Σ_{W∈T} R_T(A, W, T) + R_O(A, W, T) + R_F(A, W, T) ) / N(A, T)    (2.2.6)

For a set S, the total reward of this set on a topic T is:

R(S, T) = Σ_{A∈S} R(A, T) × N(A, T) / Σ_{A∈S} N(A, T)    (2.2.7)

The total reward obtained by selecting a set of accounts in a category is the summation of the rewards over all topics belonging to this category. The reward of selecting an account set S in a category C is then:

R(S, C) = Σ_{T∈C} R(S, T)    (2.2.8)

Greedy Algorithm

Using R(S, C) to select K critical accounts from N Twitter accounts for the optimal total reward could be computationally complex with a brute-force approach, which takes time exponential in K, especially when N is huge. In practice N can be significant, since Twitter had reached 500 million active accounts by July 2012 [63]. R(S, C) is not monotonic, since selecting an account who always posts outdated retweets may decrease the total R(S, C). It is also not submodular, because it does not guarantee ∀S ⊆ S′, A ∉ S′: R(S ∪ {A}, C) − R(S, C) ≥ R(S′ ∪ {A}, C) − R(S′, C). This work developed a simple greedy algorithm which exhibits outstanding performance. The algorithm, executed independently for each category C, begins by selecting one account A into S to maximize R(S, C). Then, during each iteration, it adds the account from the remaining set that maximizes R(S, C). This greedy algorithm has a complexity of O(KN): each round iterates over the remaining accounts, and there are K rounds.
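A minimal sketch of this greedy selection, assuming per-account rewards R(A, T) and tweet counts N(A, T) have been precomputed (the data here is hypothetical), could look like:

```python
def reward_set(S, topics, R, N):
    """R(S, C) per Eqs. (2.2.7)-(2.2.8)."""
    total = 0.0
    for T in topics:
        num = sum(R[A][T] * N[A][T] for A in S)
        den = sum(N[A][T] for A in S)
        total += num / den if den else 0.0
    return total

def tcad_greedy(accounts, topics, R, N, k):
    S = set()
    for _ in range(k):  # O(K*N): each round scans the remaining accounts
        best = max((a for a in accounts if a not in S),
                   key=lambda a: reward_set(S | {a}, topics, R, N))
        S.add(best)
    return S

R = {"A": {"t1": 0.9, "t2": 0.1}, "B": {"t1": 0.2, "t2": 0.8}, "C": {"t1": 0.5, "t2": 0.4}}
N = {"A": {"t1": 3, "t2": 1}, "B": {"t1": 1, "t2": 2}, "C": {"t1": 2, "t2": 2}}
print(tcad_greedy(["A", "B", "C"], ["t1", "t2"], R, N, 2))
```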

2.2.3 Experiment Design & Result

The experiment collected about 5,000 CVE-related tweets between Sep 25 and Nov 2 of 2012. These tweets come from about 1,600 accounts and cover about 900 CVE topics. Nine common security categories are selected according to [63]: GI (Gain Information), RV (Reserved), EC (Execute Code), US (Unspecified), MC (Memory Corruption), DoS (Denial of Service), DT (Directory Traversal), GP (Gain Privilege), and CSS (Cross-site Script). The collected data is divided into two sets: one for selecting critical accounts, the other for testing how the selected accounts perform on unknown data.

To evaluate the performance, TCAD is compared to two other approaches: PageRank and number of retweeters. Account A is B's retweeter if A has one or more retweets from B. The same definition is used to define the directed links between accounts for PageRank. Each approach selects the top 10 critical accounts per category. The performance is evaluated in terms of information coverage and timeliness. The information coverage is defined as the number of topics mentioned by the critical accounts over the total topics in a category; the timeliness is defined as the normalized R_T(A, W, T) for all W ∈ C. Both are normalized to [0, 1]. A higher score stands for better topic coverage or obtaining information earlier.

Figures 2.2 and 2.3 show the information coverage and timeliness achieved by the three algorithms for the two sets of data. It is clear that, by following the selected accounts, TCAD significantly outperforms the other two popularity (retweeter) based algorithms. The only category where the two popularity-based algorithms do not perform terribly is DT (Directory Traversal). The reason is that there are only 21 accounts in the collected data for DT, while there are typically hundreds of accounts in other categories. Selecting 10 accounts out of 21 guarantees reasonable information coverage and timeliness. Coincidentally, the two popularity-based approaches choose the same 10 accounts, which is why they have the same performance in DT.

Figure 2.2: Topics coverage of selected accounts in security vulnerability categories (hit rate per category for Followers Number, PageRank, and TCAD on two data sets).

Additional experiments were run to test the scenario where the two popularity-based approaches are executed in a category-agnostic manner, i.e., accounts are selected by mixing all tweets from all categories. The results were worse or the same in all categories. This suggests a different conclusion from Cha et al. [12], who found that "influential accounts hold significant influence over a variety of topics." Another interesting observation is that some of the selected critical accounts are Twitter machines, such as scripts which publish posts when there is a security alert. These accounts may not be the preferred human expert accounts, but they do provide the most valuable information the earliest.

Figure 2.3: Topic publish timeliness of selected accounts in security vulnerability categories (timeliness per category for Followers Number, PageRank, and TCAD on two data sets).

2.2.4 Summary

The work presented in this chapter connects social network analysis to security vulnerability information sensing using Twitter. A Twitter Critical Account Discovery (TCAD) algorithm is developed to find critical accounts for Cyber Vulnerabilities based on rewards from three aspects: timeliness, originality, and influence. Compared to the two other approaches, PageRank and Number of Retweeters, the TCAD algorithm outperforms both in information coverage and timeliness. The results support our hypothesis that ranking or influence models focusing on node popularity may not effectively capture the information needed. Also, the critical accounts selected for one security category can be very different from another's, suggesting that accounts popular across many categories may not be the best accounts to follow for specific needs.

Chapter 3

Information Cascade Prediction

As discussed in Sec. 1.2, the independent cascade model and the linear threshold model are two classical, well-accepted models in social network analysis. In brief, the independent cascade model is defined as follows: when a node u is infected at time t, it has a single chance to infect each uninfected neighbor v with a probability P_{u,v}. If u succeeds, v will become infected at time t+1; otherwise, u will not make any further attempts to infect v. In the linear threshold model, each node v has a threshold L_v, and the incoming edge from its infected neighbor u is weighted as W_{u,v}. The total influence node v receives is the summation of the weights from all of its infected neighbors, Σ_{u∈K} W_{u,v}. If Σ_{u∈K} W_{u,v} > L_v, node v will be infected. Given an influence network and a cascading model, a cascade is sampled by randomly selecting an initial seed; the cascade size distribution can then be derived by sampling many times, where the size of a cascade is the total number of nodes involved. A subsequent question is what the cascade size distribution looks like. Both the independent cascade model and the linear threshold model are developed from the SIR (Susceptible, Infected, Removed) disease propagation model, which is used to estimate the likelihood of an epidemic. Thus, given the influence network and the same random initial seed selection procedure, it is also interesting to see whether there is an epidemic of information propagation and what the epidemic size (maximum cascade size) is under these cascading models.
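The two update rules can be contrasted in a tiny Python illustration; the weights, threshold, and infected set below are hypothetical toy values:

```python
import random

W = {("u1", "v"): 0.3, ("u2", "v"): 0.4}  # W_uv: weight of edge u -> v
L_v = 0.6                                  # threshold of node v
infected = {"u1", "u2"}

# Linear threshold: v is infected once the accumulated weight from its
# infected in-neighbors crosses its threshold.
total = sum(w for (u, _), w in W.items() if u in infected)
lt_infected = total > L_v

# Independent cascade: each infected neighbor u independently gets one
# chance to infect v with probability P_uv (weights double as probabilities).
ic_infected = any(random.random() < w for (u, _), w in W.items() if u in infected)

print(lt_infected, ic_infected)
```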

3.1 Introduction

Leskovec et al. [14] discussed the information cascades in blogs, where a cascade is defined as a DAG structure with a single starting post called the cascade seed. Each node of the cascade is a blog post, and a child node is any post that contains the parent post's URL. A blog has a link pointing to another blog if its post refers to (contains the URL of) the second blog's post. Each seed creates a cascade, and the information is propagated to other blogs by post referring. They found that most (97%) cascades were trivial (isolated posts); only a very small number of cascades contained multiple levels; and the distribution of cascade size follows a power law, where the cascade size is the total number of nodes involved in the cascade. Similar observations were found in [17], [18], [19]. With these observations, existing works attempted to develop cascade generation models to explain the cascade process. Kermack et al. [26] proposed the SIR (Susceptible, Infected, and Removed) model to describe the disease infection process among populations and to estimate the likelihood of an epidemic. This infection concept was borrowed and applied to explain information propagation, where infection is defined as the process by which a person adopts the information after being exposed to it from her friends. Granovetter et al. [27] proposed a threshold model where each node has a predefined threshold. The influence each node receives is the accumulated link weight from its infected friends, and the node is infected once the received influence crosses the threshold. Goldenberg et al. [28] defined an independent cascade model in which a node is infected by each of its friends independently with a probability. Leskovec et al. [14] also proposed a simple cascade generation model based on this independent cascade model. Besides modeling the cascade process, other studies attempted to predict the cascade size. Given the initial cascade size, Cheng et al. [18] predicted whether the cascade size could double by including temporal and spatial features. They solved the cascade size prediction as a classification problem, but ignored the step-by-step propagation dynamics. Also, few of the aforementioned studies noticed that the simulated cascade size distribution based on the conventional cascade generation models is very different from the actual cascade size distribution.

This part of the work attempts to fill the gap by investigating the cascade size and the factors affecting the cascade size distribution. A phase-transition phenomenon is found in cascades simulated on the Digg social network using the independent cascade model, where the size of a cascade is either very small or very large. Also, the probability of a large cascade in the actual network is much smaller than in the simulated cases. The work first develops the concept of the GPC (Giant Propagation Component) to explain the phase-transition phenomenon. While the GPC exists in the social network, the phase transition is not observed in the actual cascades. To explain the difference, this work hypothesizes that the cascade size depends on the influence network structure and is also affected by temporal and spatial

27 factors. Furthermore, a non-independent cascade model is proposed to explain the information propagation through “dig/like” actions, where a node is not infected independently by each of its infected friends, but makes the infection decision upon the aggregated effects of its friends and other temporal and spatial factors.

3.2 Conflicting Observations on Cascade Size Distribution

This section first exhibits the difference between the simulated cascade size distribution and the actual one. For convenience, this work defines parent and child to represent the friend relationship as child → parent, where the child watches the parent's activities and the parent can infect the child. Table 3.1 shows the independent cascade generation process, which is similar to the cascade generation model of [14].

In Table 3.1, i is the infected parent of j, and P_ij is the probability of node i infecting node j. According to [12], P_ij can be extracted as the number of same actions done by the two nodes over the number of actions done by i. Each simulation run generates one cascade. The experiment considers the Digg network as the influence network. The simulated cascade size distribution obtained from 100,000 runs is shown in Fig. 3.1a. However, it is very different from the actual cascade size distribution[2] shown in Fig. 3.1b.

[2] The cascades are generated using the cascade extraction method specified in [14] on the Digg dataset [64].

Algorithm 1: Independent Cascade Generation Process

    Initialize empty infected and susceptible lists;
    randomly select a node s as the seed;
    add each of s's uninfected children k into the susceptible list,
        associated with a probability P_sk;
    while the susceptible list is not empty:
        check the susceptible node j against its associated probability P_ij;
        if infected:
            add j into the infected list;
            add each of j's uninfected children k' into the susceptible list,
                associated with a probability P_jk';
        remove j from the susceptible list;

Table 3.1: Independent cascade generation process
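A runnable Python version of this generation process might look as follows; the toy graph and probabilities are hypothetical:

```python
import random

def generate_cascade(children, P, nodes):
    """children[u] = list of u's children; P[(u, v)] = infection probability."""
    seed = random.choice(nodes)
    infected = {seed}
    susceptible = [(k, P[(seed, k)]) for k in children.get(seed, [])]
    while susceptible:
        j, p = susceptible.pop()
        if j in infected:
            continue
        if random.random() < p:  # check j against its associated probability
            infected.add(j)
            susceptible.extend((k2, P[(j, k2)]) for k2 in children.get(j, [])
                               if k2 not in infected)
    return infected

children = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
P = {("a", "b"): 0.6, ("a", "c"): 0.3, ("b", "d"): 0.5, ("c", "d"): 0.2}
sizes = [len(generate_cascade(children, P, list(children))) for _ in range(10000)]
print(sum(sizes) / len(sizes))   # mean cascade size over many simulation runs
```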

There are two differences. The first is that there is a phase transition in the simulation, but no such gap in the real cascade size distribution. Steeg et al. [17] observed a similar phenomenon, where they assumed all link weights are the same in their experiment. The second is that the probability of reaching a large cascade in the actual case is much lower, and the distribution has a power-law shape. The simulation has a much higher probability (about 0.0134; 1,338 out of 100,000 cascades) of producing a large cascade (size greater than 2,500) than the actual cascades (about 0.000008; 15 out of 1,780,522 cascades).

29 6 10 1010

104 105 2 Count 10 Count

100 100 100 101 102 103 104 100 101 102 103 104 Cascade Size Cascade Size (a) Simulation (b) Actual

Figure 3.1: Cascade size distribution

3.3 Phase-Transition and GPC

To explain the first difference, the phase-transition phenomenon, we introduce the concept of the GPC (Giant Propagation Component). The key point of the phase transition is that a cascade is either very large or very small. Intuitively, to reach a large size, the propagation needs to keep infecting new nodes at each iteration, i.e., the probability of infecting at least one node must approach 1. Based on the independent cascade model, the probability P_i of a node i infecting at least one of its children is defined as:

P_i = 1 − Π_{j∈N(i)} (1 − P_ij) ≥ 1 − ε    (3.3.1)

where P_ij is the probability of node i infecting j; N(i) is the set of children that could be infected by i; and ε is a very small number. The GPC is the maximally connected component whose nodes all satisfy (3.3.1). If a node in the GPC is infected, it will infect at least one other node with high probability, and this newly infected node will attempt to infect another new node. Thus, if any node of the GPC is infected, it will cause a number of other nodes of the GPC to be infected with high probability, and this number is proportional to the size of the GPC.[3] Therefore, the existence of the GPC causes the phase-transition scenario in that:

• If any node of the GPC is infected during the propagation process, the cascade will, with high probability, reach a large size proportional to the size of the GPC plus the number of nodes adjacent to the GPC.

• If no node of the GPC is infected during the propagation process, the cascade will die quickly and end with a small size.

The algorithm to locate the GPC continuously prunes nodes that do not satisfy (3.3.1). In this way, the GPC found in the Digg network has 310 nodes with ε = 0.01.[4] Selecting any node of the GPC as the seed will infect most nodes of the GPC (about 290 out of 310), and will eventually infect about 3,000 nodes when counting the nodes adjacent to the GPC with the relevant link infection probabilities. This numerical estimate is close to the large cascade size in the simulation result shown in Fig. 3.1a. Checking all cascades in Fig. 3.1b with size larger than 2,500, the infected nodes include most nodes of the derived GPC. This also validates the statement that the activation of the GPC is the necessary condition for a large cascade.
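The pruning step can be sketched as follows in Python; the graph and probabilities are hypothetical, and a final pass keeping only the largest connected component (required for a true "component") is omitted for brevity:

```python
def find_gpc(children, P, eps=0.01):
    """Keep nodes whose chance of infecting >= 1 remaining child is >= 1 - eps."""
    nodes = set(children)
    changed = True
    while changed:
        changed = False
        for u in list(nodes):
            miss = 1.0
            for v in children.get(u, []):
                if v in nodes:
                    miss *= 1.0 - P[(u, v)]  # prob. of infecting none of them
            if 1.0 - miss < 1.0 - eps:       # condition (3.3.1) violated
                nodes.remove(u)
                changed = True
    return nodes

children = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
P = {("a", "b"): 0.995, ("b", "c"): 0.999, ("c", "a"): 0.992, ("d", "a"): 0.5}
print(find_gpc(children, P))  # {'a', 'b', 'c'}; 'd' is pruned
```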

Since the GPC is related to the infection probability, we want to test whether changing the infection probability will remove the phase transition. The following experiment assumes all link weights have the same value λ. Fig. 3.2 shows the cascade size distribution as λ increases. With a larger λ, each node has a higher chance to infect new nodes and the GPC is larger, and thus cascades are larger. The phase-transition phenomenon disappears when λ is too small to support a GPC. Though a low infection probability can eliminate the GPC and the phase transition, it also greatly constrains the size of a cascade, so that no cascade can be large. This shows that changing λ does not provide a solution to remove the phase transition.

[3] It is possible that all children of infected nodes are already infected, which will cause the propagation to stop.
[4] The GPC size estimation can be found in Appendix 2.4.

Figure 3.2: Cascade size distribution with different uniform infection probability λ.

The primary issue that causes the GPC is the "independent" infection process. This motivates us to investigate a non-independent cascade model to explain the information propagation in the Digg network. This also accords with the "dig/like" user experience in the Digg social network and Facebook. Intuitively, people make the "dig/like" decision by reading the story and being influenced by the aggregated result of the parents who "dug/liked" the same story, rather than being infected independently by each of their parents.

32 3.4 What Affects the Cascade Size?

Besides the phase transition, the other difference is that the probability of reaching a large cascade in the actual cascade size distribution is much lower than in the simulation based on the independent cascade model. In other words, a cascade reaches a large size with only a very small chance. So what prevents a cascade from growing large in the real world? To answer this question, we hypothesize that the infection probability, i.e., the likelihood of being infected, is affected by several temporal and spatial factors, which consequently affect the cascade size. The reason for selecting temporal and spatial factors is the intuition that the propagation of information depends on whether the information is new (temporal) and whether it is widely known (spatial).

3.4.1 Temporal Factor

Each story in the Digg network has its lifetime, even the most popular news. A typical story in the Digg social network will not last for more than a week. This motivates us to consider that the attractiveness of the information may constrain a cascade from propagating further. The attractiveness can be represented as the information's infection probability, which decays over time.

Information Attractiveness

The seeds are the nodes that adopt the information without being influenced by other nodes, but solely attracted by the information itself. Thus, the number of seeds that emerge over time represents the attractiveness of the information. Imagining the “information” as a virtual node, the seeds are infected by the information itself. Fig. 3.3 shows the 5th to 95th percentile of the number of newly emerged seeds for different stories over time. The red square-marker curve is the mean value, and the green diamond-marker curve is the corresponding exponential fit. These curves show that the infection probability from the information itself to the uninfected nodes decays exponentially over time.

Figure 3.3: Number of seeds emerged over time (mean and exponential fit)
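A minimal sketch of how such an exponential decay can be fitted with scipy's curve_fit follows; the seed counts below are illustrative placeholders, not the actual Digg measurements.

import numpy as np
from scipy.optimize import curve_fit

# Illustrative mean counts of newly emerged seeds (not the actual Digg data).
hours = np.array([0.0, 25, 50, 75, 100, 150, 200])
mean_seeds = np.array([95.0, 40, 18, 8, 4, 1.5, 0.7])

def exp_decay(t, a, b):
    # attractiveness modeled as a * exp(-b * t)
    return a * np.exp(-b * t)

(a, b), _ = curve_fit(exp_decay, hours, mean_seeds, p0=(100.0, 0.05))
print(f"fitted decay of information attractiveness: {a:.1f} * exp(-{b:.3f} t)")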

3.4.2 Spatial Factors

Infection at Different Locations

Location is defined as the number of hops a node is away from the seed where the information originated. By monitoring infections that happened in the same time period, the dots in Fig. 3.4 represent the normalized infection probability (defined as the number of infected nodes over the total number of children that could be infected) at different locations. The infection probability is smaller when the infection happens farther away from the seed, and the probability decays exponentially. The curve shows the fitted exponential function.

Figure 3.4: Exponential fitting of the decay of the infection probability at different locations

Number of Infected Friends (Parents)

Besides the location where the infection happens, another possible factor is the number of infected parents. According to the independent cascade model [28], the conditional infection probability increases with the number of infected parents. However, Steeg et al. [17] observed that this is not the case in the Digg network; the infection probability remains nearly constant regardless of the number of infected parents. We repeated their experiment and verified the result. This also implies that neither the independent cascade model nor the linear threshold model may be applicable to the information propagation of “dig/like” actions.

3.5 Aggregate Cascade Model

Sections 3.3 and 3.4 discussed the causes of the two differences observed in Fig. 3.1. To overcome the limitations of the current independent cascade model, we propose an “aggregate cascade model” by considering the following factors.

• The probability a node is infected depends on the time passed since the information originated.

• The probability a node is infected depends on the node's distance from the seed where the information originated.

• The infection is not triggered independently by each infected parent; instead, every node assesses its situation to make its infection decision only once. The time when the node will assess the infection is defined by an infection waiting time model.

The infection waiting time model comes from the distribution of the infection time difference between the parent node and its child. Fig.3.5 shows the waiting time distribution extracted from the actual cascades of the Digg network.

Figure 3.5: Distribution of the infection waiting time in the Digg network dataset

The waiting time distribution follows an exponential distribution $\frac{1}{\mu}\exp(-\frac{x}{\mu})$ with $\mu = 9$ (Kolmogorov-Smirnov goodness-of-fit test, sample size 243483, critical value 0.0033, p-value < 2.2e-16). Let $\delta_l$ represent the infection waiting time at location $l$, drawn from this distribution. Table 3.2 shows the aggregate cascade generation process.

Algorithm 2 Aggregate Cascade Generation Process
  Initialize empty infected, accessed and susceptible lists;
  randomly select a node s as seed;
  add each uninfected child of s into the susceptible list with a timer t and its location L;
  While the susceptible list is not empty:
    if the timer of a susceptible node j expires:
      assess node j's infection against the probability
        $P_j = \bigl(\frac{1}{|K(j)|}\sum_{i \in K(j)} P_{ij}\bigr) \cdot \exp(a\,t + b\,L)$;
      if infected:
        add node j into the infected list;
        for each of j's uninfected children k:
          if k is not in the accessed list, add k into the susceptible list with a timer t and its location L;
      remove node j from the susceptible list and add j into the accessed list;
  End While
Table 3.2: Aggregate cascade generation process

In Table 3.2, $K(j)$ is the set of infected parents which can infect $j$ and $|K(j)|$ is the size of the set; $a = -0.03$ and $b = -0.2$ are derived from Sections 3.4.1 and 3.4.2, respectively; $t$ is the elapsed time since the moment the information originated, estimated by $\sum_{l=1}^{L}\delta_l$; $L$ is the shortest distance of node $j$ to the seed through the infected parents. Fig. 3.6 shows the simulated cascade size distribution based on the aggregate cascade algorithm. The distribution has a power-law-like shape without phase transitions, similar to the distribution in Fig. 3.1b.
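The following Python sketch mirrors the generation process of Table 3.2 under stated assumptions (an event queue keyed by assessment time, exponential waiting times with µ = 9); it illustrates the algorithm and is not the authors' implementation.

import heapq
import math
import random

def aggregate_cascade(children, parents, seed, mu=9.0, a=-0.03, b=-0.2):
    """One run of the aggregate cascade generation process (Table 3.2).
    children[u]: list of u's children; parents[v]: dict parent -> P_parent,v.
    Each susceptible node is assessed exactly once, when its timer expires."""
    infected = {seed}
    accessed = {seed}          # nodes already assessed or scheduled
    heap = []                  # entries: (assessment time t, node, location L)
    for v in children.get(seed, []):
        if v not in accessed:
            accessed.add(v)
            heapq.heappush(heap, (random.expovariate(1.0 / mu), v, 1))
    while heap:
        t, j, L = heapq.heappop(heap)
        K = [u for u in parents.get(j, {}) if u in infected]  # infected parents
        if K:
            mean_p = sum(parents[j][u] for u in K) / len(K)
            # aggregated influence, discounted in time (a*t) and space (b*L)
            if random.random() < mean_p * math.exp(a * t + b * L):
                infected.add(j)
                for k in children.get(j, []):
                    if k not in accessed:
                        accessed.add(k)
                        heapq.heappush(
                            heap, (t + random.expovariate(1.0 / mu), k, L + 1))
    return infected

Repeating this simulation over many random seeds and collecting len(infected) reproduces a cascade size distribution of the kind shown in Fig. 3.6.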

Figure 3.6: Cascade size distribution with the aggregate cascade generation algorithm

3.6 Summary

The work presented in this chapter analyzes how the simulated cascade size distribution based on the independent cascade model differs from the actual cascade size distribution. The GPC is introduced to explain the phase-transition phenomenon, and temporal and spatial factors are investigated to adjust the infection probability. Furthermore, a non-independent cascade model is proposed to better reflect the essence of the information propagation of “dig/like” actions, where people make information adoption decisions based on the aggregated effects of their parents and other temporal and spatial factors.

Chapter 4

Individual Knowledge Adoption Prediction

4.1 Introduction

Information cascade has been well studied to explain how individuals adopt information, but not as much to predict future cascades, especially when the information is new and possibly not relevant to past cascades. An individual adopts the information when she receives enough influence from her infected neighbors [29], i.e., those who already adopted the information, or randomly if the underlying decision making process is unclear [29]. A body of work was developed to infer the inherent influence network [35] [36] [33] based on a number of cascades; the influence network inferred is the one which best fits all cascades. In general, the more cascades available, the more accurate the influence network one can recover.

The inferred influence network can recover the most likely influence flows based on the distribution of collected past topic adoption cascades. However, it

is unclear whether a future topic cascade can be explained by the distribution of past ones, for two reasons. First, past cascades may be too limited to cover the relationships among all actors. Second, the new topic's propagation could be irrelevant to how past topics propagated. Though topic cascades can be volatile and in many cases there are not sufficient cascades available, the way an author is influenced to work on a new topic is relatively stable; an author is likely to be influenced by her colleagues or other researchers with social connections. These social connections contain rich information describing different relationships between authors and can be used to infer the inherent influence flows.

Existing research on heterogeneous information networks mostly focuses on ranking and clustering [4], similar object search [5] and link prediction [1]. Differently, this work tries to predict how a new topic is adopted by authors as a cascade without knowing the influence network structure, rather than predicting an additional new link over the existing network. This work proposes to leverage the rich heterogeneous bibliography network information to complement the past topic cascades for determining the inherent influence network. Besides the social connections, the popularity of the topic itself also affects the adoption process. In general, authors are more likely to follow popular topics than less widely accepted ones. To this end, this work aims at developing an algorithm that finds an influence network by optimizing over the social connections and topic popularity subject to past cascades. The influence networks are then used to predict new topic adoption. DBLP data is used to demonstrate the performance improvements of the proposed method in Mean Average Precision (MAP) and Area under ROC Curve (AUC).

4.2 Influence Factors for Research Topic Adoption

This work concerns the adoption of research topics as evidenced by authors' publications. A “topic” is defined as a popular term that represents a specific scientific concept. Adopting a topic means an author has published at least one paper that contains the term in either the title or the abstract.5 An author X “follows” another author Y if X adopts a topic after Y adopts the same topic. This following behavior is also interpreted as an infection process [35]. The following lists the factors that potentially play a role in the influence one author has on others to adopt a research topic.

4.2.1 Direct Observation

The direct observation of the adoption is the time when an author adopts the topic, and a cascade can be generated based on the chronologically ordered observations. Each topic has its own cascade. According to Gomez-Rodriguez et al. [35], recovering an influence network with K edges requires 2K~5K cascades. This inferred influence network could be used to predict new topic adoption based on the assumption that if author X followed Y in many topics before, it is also likely X will follow Y again for a new topic. Generally speaking, the closer in time X followed Y, the more influence Y has on X based on this direct observation [35].

5 Topic adoption is a complex process. Nevertheless, to some extent, we believe that publishing papers with topic terms is reflective of the nature of the adoption.

4.2.2 Indirect Observation

The direct observation helps to recover the influence network based on past topic adoption cascades, but it requires a sufficient number of relevant cascades to learn the network. This limitation motivates the consideration of other factors that indirectly imply the relationship, hence the influence one author may have on another to adopt a new topic.

Connections

The basic intuition is that authors are more likely to receive influence from their colleagues, co-authors and other socially connected peers than from unknown persons. Sun et al. [1] [13] suggested that social connections could be the main reason people establish a new relation. Sun et al. [1] use meta-paths to define the social connections. A meta-path is defined on the network schema, where nodes are object types and edges are relations between object types. For example, two authors coauthoring a paper is expressed as Author-Paper-Author. To define the social relationships among authors, eight common social connections are defined: CI: cite a peer's papers; CA: are coauthors; CV: publish in the same venue; CSA: are co-authors of the same authors; CT: write about the same topic; CICI: cite papers that cite the peer's papers; CIS: cite the same papers; SCI: are cited by the same papers.

Topic Popularity

Besides the social connections affecting the adoption probability, the research topic itself is also a factor. Not only the match between the topic and the author's own interest, but also the popularity of the topic can affect how fast and easily it is adopted. The topic popularity is defined as the average number of papers adopting the topic since the first year it appeared. Fig. 4.1 shows the popularity of four selected machine learning topics (SVM, EM, PCA, KNN) extracted from 20 conferences from 1989 to 2010.

Figure 4.1: Individual Popularity

As the figure shows, different terms have different popularity curves. Given the same social connections and the same initial authors who adopted the topic earliest, different topics could result in different adoption cascades. There also exists the case where an author adopts the topic without receiving any influence from her social peers. Based on these observations, this work hypothesizes that topic popularity can be a factor that affects the adoption process. To include the topic popularity in the model, a virtual author is added, where the connection from any author to this virtual author represents the influence that comes from the topic popularity.

4.3 Model Formulation

4.3.1 NetInf*

Traditional network inference algorithms such as NetInf [35] use the direct observation, i.e., the adoption time difference, to estimate the influence probability, where the infection probability for each link (i, j) on cascade c is calculated as:

$$Pr_{ij}^{c} = \exp\Bigl(-\frac{\Delta_{ij}}{\alpha}\Bigr) \qquad (4.3.1)$$

where $\Delta_{ij}$ is the adoption time difference between i and j, and $\alpha$ is set to 1 according to [35]. Based on that, NetInf selects the links that contribute most to the infection probability over all cascades with a greedy algorithm. However, the purpose of NetInf is only to infer the most likely influence network based on past cascades. To predict on future topics, NetInf* is developed as an extension of NetInf that estimates the influence probability for all links as:

$$Pr_{ij} = 1 - \prod_{c \in C}\bigl(1 - Pr_{ij}^{c}\bigr) \qquad (4.3.2)$$

where $Pr_{ij}$ is the probability that i will adopt after j for any topic, and the transition probability matrix can be set accordingly.
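A small sketch of how (4.3.1) and (4.3.2) combine per-cascade observations into a single link probability; the time gaps in the usage line are made up for illustration.

import numpy as np

def netinf_star_link_prob(deltas, alpha=1.0):
    """Combine the per-cascade probabilities (4.3.1) into the overall
    link probability (4.3.2) for one pair of authors (i, j)."""
    per_cascade = np.exp(-np.asarray(deltas, dtype=float) / alpha)
    return 1.0 - np.prod(1.0 - per_cascade)

# e.g. X followed Y with adoption gaps of 1, 3 and 10 time units:
print(netinf_star_link_prob([1.0, 3.0, 10.0]))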

4.3.2 HetNetInf

The second algorithm, HetNetInf, uses the indirect observations, i.e., social connections and topic popularity, to generate the influence probability. Every author can be infected (adopt a topic) by an infected neighbor with some probability. This neighbor could be an actual peer with social connections, or a virtual author which represents the topic popularity. So the problem becomes how to use the social connection features and topic popularity to represent the infection probability for each author pair, and to reflect the time order in which they adopt the topic. Given authors i and j, author i is infected at time $t_i$ and j is infected at time $t_j$ ($t_i > t_j$). Let $X_{ij}$ be the feature vector between authors i and j, $X_{ij} = [X_{ij,f_1}, X_{ij,f_2}, \ldots, X_{ij,f_N}, X_{pt}]$, where $X_{ij,f_k}$ is the $k$-th social connection feature and $X_{pt}$ is the popularity of topic t:

if j is the virtual author: $X_{ij} = [0, \ldots, 0, X_{pt}]$

if j is a normal author: $X_{ij} = [X_{ij,f_1}, X_{ij,f_2}, \ldots, X_{ij,f_N}, 0]$

The probability author j infects author i in topic cascade c is defined as:

$$Pr_{ij}^{c}(\beta) = \exp\Bigl(-\frac{1}{X_{ij} \cdot \beta}\Bigr) \qquad (4.3.3)$$

where $X_{ij}$ is the feature vector between authors i and j, and $\beta > 0$ is the weight vector which maps the features to the influence probability.

Every infected neighbor of i has a chance to infect author i. The likelihood of infecting author i in cascade c is:

$$L_i^{c}(\beta) = \Bigl[1 - \prod_{j:\,t_j^c < t_i^c}\bigl(1 - Pr_{ij}^{c}(\beta)\bigr)\Bigr]^{I(t_i^c \le t^c)}\Bigl[\prod_{j:\,t_j^c \le t^c}\bigl(1 - Pr_{ij}^{c}(\beta)\bigr)\Bigr]^{1 - I(t_i^c \le t^c)} \qquad (4.3.4)$$

where $t^c$ is the end of the observation window. If i is not infected within the window, $L_i^c$ is interpreted as the probability that i is not infected by any neighbor; otherwise, $L_i^c$ stands for the probability that i is infected.

Two concerns remain. Firstly, it is unclear whether one should use the same parameter β for all authors, which would assume the same adoption behavior, i.e., the same weight distribution, for all authors. Instead, each author has her own $\beta_i$, and HetNetInf runs the optimization for each author independently. Equation (4.3.3) is updated as:

$$Pr_{ij}^{c}(\beta_i) = \exp\Bigl(-\frac{1}{X_{ij} \cdot \beta_i}\Bigr) \qquad (4.3.5)$$

For cascade set C, the likelihood of author i being infected over all cascades c ∈ C is:

$$L_i(\beta_i) = \prod_{c \in C} L_i^{c}(\beta_i) \qquad (4.3.6)$$

Substituting (4.3.4) into (4.3.6), the log likelihood is:

$$LL_i(\beta_i) = \sum_{c \in C} \log\Bigl(\Bigl[1 - \prod_{j:\,t_j^c < t_i^c}\bigl(1 - Pr_{ij}^{c}(\beta_i)\bigr)\Bigr]^{I(t_i^c \le t^c)}\Bigl[\prod_{j:\,t_j^c \le t^c}\bigl(1 - Pr_{ij}^{c}(\beta_i)\bigr)\Bigr]^{1 - I(t_i^c \le t^c)}\Bigr) \qquad (4.3.7)$$

Secondly, because the influence network is sparse, β needs to be penalized so that there are fewer links, which is equivalent to having some links with very small weight. Taking this into consideration, the optimization is to minimize the following function:

$$\min_{\beta_i}\; -LL_i(\beta_i) + \|\beta_i\|_2 \quad \text{subject to } \beta_i > 0 \qquad (4.3.8)$$

The ridge regularization term $\|\beta_i\|_2$ is the penalty for the influence network sparsity. For each author i, ten optimizations with random initial points are run to find local optima, and the best one is selected.
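A minimal sketch of the per-author optimization (4.3.5)–(4.3.8), assuming the likelihood form reconstructed above; it uses scipy.optimize.minimize with positivity bounds and random restarts, and is not the authors' original code.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, cascades):
    """-LL_i(beta) for one author i, following (4.3.3)-(4.3.7). Each cascade
    is (X, infected): X holds the feature vectors X_ij of the neighbors j who
    adopted first; infected says whether i adopted within the window."""
    nll = 0.0
    for X, infected in cascades:
        pr = np.exp(-1.0 / np.clip(X @ beta, 1e-8, None))  # (4.3.3)
        survive = np.prod(1.0 - pr)            # prob. no neighbor infects i
        lik = 1.0 - survive if infected else survive
        nll -= np.log(max(lik, 1e-12))
    return nll + np.linalg.norm(beta)          # penalty as in (4.3.8)

def fit_author(cascades, n_features, restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):                  # random restarts, best kept
        beta0 = rng.uniform(0.1, 1.0, n_features)
        res = minimize(neg_log_likelihood, beta0, args=(cascades,),
                       bounds=[(1e-6, None)] * n_features)
        if best is None or res.fun < best.fun:
            best = res
    return best.x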

4.4 Experiments

4.4.1 The Dataset and Experimental Setting

This work selects DBLP to evaluate how to infer the influence network and use it to predict who will adopt a new topic once it emerges. The experiment examines publications in the top 20 computer science conferences in four areas,6 and in two

6Data mining: KDD, PKDD, ICDM, SDM, PAKDD; Database: SIGMOD Conference, VLDB, ICDE, PODS, EDBT; Information Retrieval: SIGIR, ECIR, ACL, WWW, CIKM; and Machine Learning: NIPS, ICML, ECML, AAAI, IJCAI.

periods, T0 = [1991, 2000] and T1 = [2001, 2010]. All topic terms are extracted from the paper title or abstract, and filtered by the machine learning related keyword list on the Microsoft Academic website.7 A total of 111 machine-learning related topics that were first introduced in period T0 are selected for training, and an additional 57 topics from T1 are selected for testing; all terms appear in more than 10 papers. For each topic, the time an author first adopted the topic is recorded. An author is treated as not adopting the topic if either the author did not adopt the topic at all or adopted it beyond the monitored period. As mentioned in Section 4.2.2, eight social connection features are extracted from period T0. Including the topic popularity (TP), the feature vector includes nine features in total: CI, CA, CV, CSA, CT, CICI, CIS, SCI, TP.

The experiment selects 196 authors who published at least 1 paper in both periods in these conferences as low productive authors (Low Pub Set), and another set of 47 authors who published at least 3 papers in both periods as high productive authors (High Pub Set). The Low Pub Set is a superset of the High Pub Set. These authors are selected because they are active in both periods, such that the relationships among them are relatively stable.

4.4.2 Prediction Study

Each algorithm uses the 111 topic cascades that appeared in T0 as the training set to infer the influence network, and uses the 57 terms in T1 for testing. For each term, the authors

7http://academic.research.microsoft.com

who adopted the topic earliest are labeled as the initial authors. To predict who will follow the topic, both algorithms return the probability of adopting the topic for every author. Performance is then measured by comparing all authors' adoption probabilities (excluding the initial adopting authors) to the actual adopting authors in Mean Average Precision (MAP) and Area under ROC Curve (AUC).
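A short sketch of the per-topic scoring, assuming scikit-learn's average_precision_score and roc_auc_score as the MAP/AUC building blocks; averaging the per-topic scores over the test topics gives MAP and the mean AUC.

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def score_topic(pred_prob, adopted, initial):
    """Score one test topic: pred_prob[i] is author i's predicted adoption
    probability, adopted[i] the ground truth, and initial[i] marks the
    earliest adopters, who are excluded from scoring."""
    keep = ~np.asarray(initial, dtype=bool)
    y = np.asarray(adopted)[keep]
    p = np.asarray(pred_prob)[keep]
    return average_precision_score(y, p), roc_auc_score(y, p)

# Averaging the two scores over the 57 test topics gives MAP and mean AUC.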

Table 4.1 shows the average MAP and AUC achieved by NetInf* and HetNetInf for the topics being tested. A high MAP value means that an author who adopted the topic is also predicted with a high probability. Looking at the “High Pub Set” column under MAP, “% Authors Adopt = 0.0621” means that roughly only 1 out of 16 authors follows the topic. NetInf* has a chance of 0.1649 of a correct prediction, while HetNetInf increases the chance to 0.2121. These numbers are significantly higher than 0.0621, and HetNetInf outperforms NetInf*. These numbers, however, are not close to 1.0, which would be ideal for perfect prediction. This is because each past cascade covers less than 10% of the authors. The MAP results are worse for the “Low Pub Set”, because the same number of cascades is used to explain more authors.

In terms of AUC, HetNetInf also exhibits better performance than NetInf* for both sets. Interestingly, both algorithms achieve better AUC for the “Low Pub Set” than for the “High Pub Set”. This is because, as the network becomes larger, though the percentage of authors adopting the topic decreases, the absolute number of authors following the topic increases. The larger number of positive samples makes the true positive rate increase more smoothly, and thus improves the AUC.

Table 4.1: Topic Adoption Prediction Performance

                  MAP                           AUC
                  High Pub Set   Low Pub Set    High Pub Set   Low Pub Set
NetInf*           0.1649         0.0970         0.5624         0.6333
HetNetInf         0.2121         0.1044         0.6188         0.6376
% Authors Adopt   0.0621         0.0305

Fig. 4.2 and Fig. 4.3 show the MAP for each topic being tested on the “High Pub Set” and “Low Pub Set”, respectively. As the figures show, the result varies from topic to topic. The topic “gene expression data” shows an example of how social connections help predict the adoption of a novel topic. In this case, authors S and M published about the topic in 2003 and 2007, respectively. In T0, M only followed S once among the 111 topics. Compared to other authors who followed S many times, the influence from S to M is relatively weak in NetInf*. However, there is a very strong social connection between them. As a result, this increases the probability for M to follow S under HetNetInf. In this case, HetNetInf has MAP = 1, while NetInf* has MAP = 0.026.

Figure 4.2: Prediction Performance (High Pub Set)        Figure 4.3: Prediction Performance (Low Pub Set)

HetNetInf also provides information about each author's preference ($\beta_i$) regarding which factors decide the adoption process. For example, independent researchers may tend to discover topics by themselves, while other researchers step into a new topic under the influence of their research community. Table 4.2 shows two authors' normalized feature weights. These two authors have very different following behaviors: author X's actions are mostly affected by other authors' work in the same area, while author Y is mostly affected by the topic popularity.

Table 4.2: Individual Features Weight

Author   CI       CA       CV       CSA      CT       CICI     CIS      SCI      TP
X        0.4471   0.6248   0.0103   0.6396   0.0104   0.0103   0.0102   0.0102   0.0103
Y        0.2686   0.2514   0.2548   0.2654   0.3155   0.2540   0.2897   0.2517   0.6465

Fig. 4.4 and 4.5 further show the normalized feature weight distributions. No single feature dominates the adoption process for all authors, and the weight distribution varies significantly from author to author. This also validates HetNetInf's approach of using a different $\beta_i$ for each author.

Figure 4.4: Feature Weight Distribution (High Pub Set)        Figure 4.5: Feature Weight Distribution (Low Pub Set)

4.5 Summary

The work presented in this chapter investigates the problem of predicting which authors will adopt novel topics, and tested two algorithms using the DBLP dataset. The basis of the solution is to find the inherent influence network from past observations. The first approach, NetInf*, is based on direct observations of past following behavior on topic adoption. Its shortcoming is that the prediction performance depends heavily on the amount of collected past cascades and on whether the novel topic cascade is relevant to these past cascades. Though all topic terms selected are in the same area (machine learning), all these terms are unique. Each topic can have its own unique cascade which could be totally different from past ones. Alternatively, HetNetInf defines the author's following behavior based on the factors affecting the adoption process, such as social connections and topic popularity. Given the adoption information of the author's socially connected peers and the topic popularity, it can estimate the probability of the author adopting the topic, even when the topic is totally new and irrelevant to past ones. The experiment using DBLP shows that HetNetInf outperforms NetInf* in predicting novel topic adoption. Furthermore, HetNetInf provides information on each individual author's preference among the influence factors.

The remaining issue in the above work is that the prediction performance is still not good enough. One reason is the limited number of cascades, such that the inferred influence network may not be able to capture all possible cascade patterns, especially for future cascades. A deeper reason could be that the exposed information's influence is not the predominant factor deciding a person's actions in the real world.

Actually, most SNA works on information propagation are limited to online SNs. The reason is that information adoption can be observed in an SN as a retweet or repost, so the information adoption equals the repost-like action itself. Also, this information adoption is largely decided by seeing the same information exposed by friends, and the relation between friends is explicitly defined. So, information propagation can be well studied in a proper context where the propagation depends mostly on the exposed information's influence. However, the same influence algorithm may not apply to other contexts where information adoption does not equal the action. For example, the awareness of a commercial product may not be transferred into a purchase behavior. The product's information is presumably still propagated through the influence network; however, this information adoption cannot be observed, and only the purchase activity can be observed. In other words, the influence network is supposed to predict whether the product information is adopted, but not whether there is a purchase. Though it can still be explained by probability, this derived probability cannot make accurate predictions since awareness of the information is not the main factor. So, the intuition is that how much an action can be explained by an information propagation algorithm depends on how likely the exposed information is to be transferred into an action. The problem is then whether the influence of exposed information on a certain type of action can be quantitatively defined, so that the applicable context of an information propagation algorithm can be delimited. For this reason, we turn to the StackOverflow Q&A forum in Chapter 5, where the knowledge propagation can mostly be represented by the action of answering a question.

Chapter 5

Knowledge Propagation Network

5.1 Introduction

A Q&A forum acts as a knowledge container that includes a large number of Q&A threads, where each thread starts from a question and includes multiple answers and comments. Owing to the richness and variability of answers, it has gradually become one of the most important ways to learn and query knowledge for both beginners and experts. The Q&A threads reflect the problems users met in different scenarios, and the different approaches to solving them. A question may receive multiple answers, and it is possible that there is more than one correct answer. Though typically only one answer will be accepted, other answers also provide knowledge and value to the thread and could be helpful to others searching for the solution. All answers, and even comments on the question, can be thought of as the body of knowledge for dealing with a particular problem. The research problem is to understand how knowledge is transferred between user accounts in a Q&A forum. The specific problems are how many people will be involved in a Q&A thread (final and over time), who will provide an answer to the question, and who will provide the accepted answer.

To predict who will provide the answer, existing methods either treat this as a pure optimization problem (e.g., regression or classification) or find the potential answerer according to the similarity between the questions answered by a candidate and the newly arrived question (e.g., semantic analysis). Classification is the most common approach utilized in these studies, and many of them investigated feature selection to improve performance. Existing classification methods are able to find the relation between the selected features and the response, such as predicting whether a question will be answered based on features of the question, but they ignore the inherent process of knowledge being transferred between actors. The same holds for semantic analysis, which only cares about language similarity. Few studies have worked from the perspective of how information diffuses or how knowledge is transferred from one actor to another. This work studies the problem of how to model the process of knowledge being transferred among actors, and uses the result to perform predictive analyses such as how many user accounts will join a thread and who will provide the answer.

Though in a typical Q&A forum each question has only one accepted answer, other answers and comments also provide knowledge toward solving the question. In this work, we assume all published answers and comments provide knowledge to solve the question. More specifically, knowledge is transferred when an answering or commenting action occurs. Certainly, determining how much knowledge each answer or comment provides would require detailed semantic analysis. This work focuses on modeling how likely one actor is to provide knowledge to another, without concerning the exact amount of knowledge.

The process of knowledge being transferred in a Q&A forum is similar to the information propagation observed in other social networks in that both processes transfer particular information from an information container to users who do not have it, but it also has its own properties. First, contrary to a conventional social network where nodes joining an information cascade are information receivers, new nodes joining a Q&A thread are knowledge providers. Second, the transfer of knowledge from knowledge provider to knowledge seeker during the answering or commenting process depends on factors such as the difficulty of the question and the reputation difference. Third, while there is no explicit constraint limiting information propagation, once a question is solved, the number of answers and comments joining the Q&A thread decreases dramatically; in many cases, the accepted answer is the last post of the Q&A thread. All these make the process of knowledge being transferred between knowledge provider and knowledge seeker different from conventional information propagation.

This study focuses on StackOverflow, a popular Q&A forum. In general, a knowledge transfer process includes two parts: on the questioner's side, whether a question will receive answers; on the answerer's side, whether an answer will be accepted. We found that the knowledge level, a combination of the questioner's reputation and the question's body text, affects whether a question will receive answers, while the temporal factor dominates whether an answer will be selected. By taking these factors into consideration, TKTM (Time based Knowledge Transfer Modeling) is proposed to model the process of knowledge being transferred, where the likelihood that a person will transfer knowledge to another is modeled as a continuous function of time and knowledge level.

The experiment evaluates TKTM by applying it to estimate how many user accounts will be involved in a thread to provide knowledge, and who will answer the question. In the first set of experiments, the performance of TKTM is compared to NetRate and to regression methods such as RandomForest and linear regression; TKTM outperforms the other methods. In the experiment on who will answer the question, we compare the performance of QLI, TKTM and TKTM with semantic analysis. TKTM alone substantially outperforms QLI. Integrating semantic analysis with TKTM provides better performance than TKTM alone, depending on the level of emphasis placed on semantics.

The rest of this chapter is organized as follows: Sec. 5.2 introduces the StackOverflow dataset and discusses its properties. Sec. 5.3 analyzes the factors that affect knowledge transfer from both the questioner's and the answerer's side. Sec. 5.4 introduces TKTM, and Sec. 5.5 analyzes the hypothesized network/community derived from TKTM. Sec. 5.6 evaluates the performance of TKTM in predicting thread size and predicting individual actions, compared to other methods such as NetRate and regression methods including RandomForest, linear regression and logistic regression; Sec. 5.7 concludes.

5.2 Q&A forum: StackOverflow

A Q&A forum is composed of a number of Q&A threads, where each thread includes one question and multiple answers and comments. An actor in the Q&A forum may have different roles based on her/his action(s) in a Q&A thread. This work uses the following terminologies:

Term          Description
Actor         user who acts in a Q&A thread
Questioner    actor who publishes the question
Answerer      actor who answers the question
Commenter     actor who comments on the question or an answer
Thread size   #actors in a Q&A thread excluding the questioner

Table 5.1: Q&A forum Terminologies

Typical relationships in the Q&A forum are ⟨questioner, answerer⟩, ⟨questioner, commenter⟩, and ⟨answerer, commenter⟩.8 A post published by an actor is referred to as either a question, an answer or a comment.

A typical Q&A thread structure is shown in Fig. 5.1, where each circle stands for a post, which could be a question (Q), an answer (A) or a comment (C). The question is followed by multiple answers, and only one answer is selected as accepted per Q&A thread. Both the question and the answers can have comments. The direction of the arcs represents how knowledge is transferred between users. In general, knowledge flows from the answerer to the questioner, since the answerer

8 The study did not include voters since voter information is not published in the StackOverflow dataset.

Figure 5.1: Q&A thread structure

provides the knowledge the questioner is looking for. A comment can be thought of as complementary to the answers or the question. Sometimes a comment also points out a missing or incorrect part of an answer or the question, which can also be thought of as providing extra knowledge. Thus, the study makes the following assumption:

Assumption 1. An answer or a comment action in a Q&A thread indicates knowledge transferred from the answerer to the questioner, the commenter to the questioner, or the commenter to the answerer.

5.2.1 Dataset Description

This study utilizes the StackOverflow data dump,9 which includes multiple topics where each topic is a dataset. Each topic dataset includes many attributes describing the questions, answers, comments, and users in XML format. This work selects Posts.xml, Comments.xml, Users.xml and Votes.xml for the analysis. The content of each XML file is as follows:

9https://archive.org/details/stackexchange

• Posts.xml: type, AcceptedAnswerId, CreationDate, Score, Body, OwnerUserId, LastEditorUserId, LastEditDate, Title, Tags, AnswerCount, CommentCount, FavoriteCount

• Comments.xml: Score, Text, CreationDate, UserId

• Users.xml: Reputation, DisplayName, WebsiteUrl, Location, Views, UpVotes, DownVotes, Age, AccountId

• Votes.xml: PostId, voteType, Date

Note that Posts.xml includes information for both the questions and the answers.
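As a sketch, the dump files can be loaded with the Python standard library. The PostTypeId filter below reflects the public StackExchange dump schema (1 = question, 2 = answer); since the chapter only lists the attribute as "type", treat it as an assumption about the exact attribute name.

import xml.etree.ElementTree as ET

def load_rows(path):
    """Each StackExchange dump file is a flat list of <row> elements whose
    fields are stored as XML attributes."""
    return [row.attrib for row in ET.parse(path).getroot()]

posts = load_rows("Posts.xml")
questions = [p for p in posts if p.get("PostTypeId") == "1"]  # 1 = question
answers = [p for p in posts if p.get("PostTypeId") == "2"]    # 2 = answer
print(len(questions), "questions,", len(answers), "answers")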

In this study, two topic datasets are selected: “mechanics” and “security”. The “mechanics” set mainly includes questions related to automotive mechanics and maintenance, while “security” includes questions related to cyber security and secure coding. The mechanics dataset represents a relatively more traditional industry, where one may hypothesize that knowledge evolves more slowly than in computer science. Table 5.2 lists the statistics for these two datasets.

Dataset     #Posts   #Comments   #Votes   #Users
mechanics   15463    19175       46975    8417
security    62901    98841       377874   53728

Table 5.2: Datasets Summary

5.2.2 User Network

Since the user actions include answering and commenting, one can define a network where there is a link from user A to user B if A answers or comments on B's question or answer.

Figure 5.2: Degree Distribution of user network (indegree and outdegree)

Fig. 5.2 shows the indegree and outdegree distributions of the user network for the “security” dataset.10 Similar to other social networks, the degree distribution follows a power law, where the solid line is the best fit. Since each link reflects an answering or commenting action between two users, the power law shown in Fig. 5.2 implies that a small number of users are very active in answering other users' questions. In other words, most knowledge is transferred by a small number of users.

5.2.3 Properties

Compared to other social networks, a Q&A forum has two unique properties: each user has a reputation, and each question thread has its lifetime.

10 A similar degree distribution is also found in the “mechanics” dataset.

Reputation

As mentioned in [51], a user's reputation in the StackOverflow system is based on whether her/his published answers received positive responses. In other words, the reputation represents a user's ability to answer questions and, to some extent, represents the user's expertise level.

Figure 5.3: Distribution of user reputation ((a) mechanics, (b) security)

Fig. 5.3 shows the distribution of user reputation in the two datasets. The reputation of questioners also follows a power law, and most questioners' reputations are small. This is consistent with the observation from the degree distribution shown in Fig. 5.2 that few users are active and hold the most knowledge. Compared to the users in “mechanics”, users' reputations in “security” are relatively higher. Since reputation is related to answering actions, this implies that the users in “security” are more active.

Thread Life Time

Another unique property of a Q&A forum is that each Q&A thread has its lifetime. There are two timestamps related to a question. The first is the creation time of the accepted answer, denoted $T_c^a$. The second is the time of the last post published on the thread, $T_e$. Fig. 5.4a and Fig. 5.4b show the distribution of $T_c^a$ for the datasets “mechanics” and “security”, respectively. In these figures, only threads that have accepted answers are selected.

Figure 5.4: Distribution of $T_c^a$ ((a) mechanics, (b) security)

Half of the questions receive their accepted answers within 2 hours, while some questions take over 4 years (for example, one question was raised in 2010/11, and its $T_c^a$ is 2015/02). The exact time a question is solved, i.e., the moment the accepted answer is selected, should be between $T_c^a$ and $T_e$. However, the StackOverflow dataset does not contain this information, only which answer is the accepted one. Table 5.3 shows the statistics related to the accepted answer, where “Answs” stands for including answers only, and “AnswsCommts” stands for including both answers and comments.

Include        Dataset     %Stop   %After
Answs          mechanics   62%     20%
AnswsCommts    mechanics   42%     33%
Answs          security    58%     22%
AnswsCommts    security    35%     36%

Table 5.3: Statistics related to accepted answers

“%Stop” is the percentage of threads that stop at $T_c^a$, i.e., the accepted answer is the last post on the thread. For the threads that still receive answers or comments after the accepted answer, “%After” is the percentage of posts published after the accepted answer. In the “security” dataset (a similar observation holds for the “mechanics” dataset), considering answers only, about 58% of threads stop at $T_c^a$; for the remaining 42% of threads, 22% of answers are published after $T_c^a$. When including both answers and comments, about 35% of threads stop at $T_c^a$, and 36% of posts are published after $T_c^a$ in the threads that did not stop at $T_c^a$. Since comments are attached to answers, users can still publish comments after the accepted answer; this is also why %Stop is lower when comments are included.

Although the exact time a question is solved is unknown (not published by StackOverflow), it is bounded by $T_c^a$ and $T_e$, and $T_c^a$ can be thought of as a marker of the Q&A thread lifetime, in that only a portion of answers and comments will be published after it. The power-law distribution in Fig. 5.4 also implies that most questions are within the knowledge scope of the community, while some difficult questions require a longer time to receive the accepted answer.

5.3 Factors affecting knowledge transfer

In order to model the knowledge transfer process, we need to know what factors affect the process. Sec. 5.2.3 discussed two properties of StackOverflow: user reputation and the temporal factor. In general, knowledge transfer from an answerer to a questioner includes two parts: on the questioner's side, whether the question will receive answers; and on the answerer's side, whether an answer will be selected.11 Besides the reputation and temporal features, the following sections investigate the factors that could affect the knowledge transfer process from the questioner's and the answerer's perspectives.

5.3.1 Questioner: Whether a Question will Receive Answers

On the questioner's side, the problem is to investigate which factors affect whether a question receives answers. Since reputation represents a user's expertise level, it also reflects the knowledge level of the questions the user publishes. For example, one may hypothesize that a question raised by a user with low reputation could be solved quicker than a question raised by a high reputation user, and that its thread size is also relatively small. On the other hand, the difficulty of a question affects who will answer it. [54] modeled question difficulty according to the waiting time before a question received its accepted answer. To determine what could affect the number of answers received, we extract the features listed in Table 5.4.

11 Knowledge transfer also happens in commenting actions. Here, we mainly analyze the answering behavior, and assume the knowledge transfer in both answering and commenting is affected by similar factors.

Feature         Description
q_reput         reputation of the questioner
questions       #questions published by the questioner
title_total     #words in the title
title_nonstop   #words in the title excluding stopwords
title_unique    #unique non-stopwords in the title
body_total      #words in the body
body_nonstop    #words in the body excluding stopwords
body_unique     #unique non-stopwords in the body
AAWTime         waiting time for the accepted answer

Table 5.4: Features for predicting #answers.

The intuition behind selecting the number of words in the title and the post body is that the more detailed the question description, the better the context of the question is conveyed, which may affect whether an actor joins the thread. Also, the complexity of the question could be reflected in the length of the title or the body. RandomForest is applied to estimate how many answers a question will receive. The regression result shows that “AAWTime”, “body_total” and “q_reput” are the most important features. Since “AAWTime” is the result of answerers' actions after the question was published, the question body text and the questioner's reputation are more suitable for prediction. Thus, we use the combination of reputation and post body text to represent the knowledge level of a question or an answer.
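A hedged sketch of this regression step with scikit-learn; the file name questions_features.csv and the answer_count column are hypothetical placeholders for however the Table 5.4 features happen to be stored.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["q_reput", "questions", "title_total", "title_nonstop",
            "title_unique", "body_total", "body_nonstop", "body_unique",
            "AAWTime"]

# Hypothetical file: one row per question with the Table 5.4 features
# plus the observed number of answers as the regression target.
df = pd.read_csv("questions_features.csv")
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(df[FEATURES], df["answer_count"])
for name, imp in sorted(zip(FEATURES, rf.feature_importances_),
                        key=lambda kv: -kv[1]):
    print(f"{name:15s} {imp:.3f}")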

5.3.2 Answerer: Which Answer will be Selected?

From the answerer's perspective, the problem is to investigate which factors determine whether an answer will be selected as the accepted one. It can be solved as a classification problem. Based on the S8 feature set of [51], we made some changes and finalized 13 features that focus more on the answer and the answerer; Table 5.5 lists the details. Note that an actor's rank is derived from the actor's PageRank [65] in the user network described in Sec. 5.2.2.

Feature            Description
answerScore        score of the answer
timeDiff           time difference between when the answer and the question were published
answererReput      reputation of the answer's author
reputDiff          reputation difference between answerer and questioner
commentsNum        number of comments on the answer
avgCommentsScore   average score of the answer's comments
answererRank       rank of the answerer in the user network
rankDiff           difference between the answerer's rank and the questioner's rank
existAnswersNum    number of existing answers
avgAnswersScore    average score of existing answers
upvotes            number of up votes
downvotes          number of down votes
votes              total number of votes

Table 5.5: Features for classifying whether an answer will be accepted

In both datasets, “upvotes” and “votes” are highly correlated since most votes are up-votes, matching the observation in [51]. Another correlation is observed among “answererReput”, “answererRank” and “upvotes”. This is because a user's reputation is related to the number of up-votes received; the reputation is also related to the answering actions from which the user network, and hence “answererRank”, is derived.

Since the dataset is unbalanced, in that each Q&A thread has only one accepted answer, we over-sample the accepted answers to obtain a balanced dataset. For each answer published, the 13 features are extracted, and we use RandomForest to perform the classification. With 10-fold cross validation, the average AUC for “mechanics” and “security” is 0.948 and 0.961, respectively.
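A minimal sketch of the over-sampling and 10-fold AUC evaluation with scikit-learn; X, y and the exact over-sampling scheme below are assumptions, since the text does not specify them.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def oversample_accepted(X, y, seed=0):
    """Duplicate accepted answers (y == 1, the minority class) at random
    until both classes have the same size."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

# X: the 13 features of Table 5.5 per answer; y: 1 if the answer was accepted.
# X_bal, y_bal = oversample_accepted(X, y)
# clf = RandomForestClassifier(n_estimators=500, random_state=0)
# print(cross_val_score(clf, X_bal, y_bal, cv=10, scoring="roc_auc").mean())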

Figure 5.5: Importance of Features (mean decrease in Gini; (a) mechanics, (b) security)

Fig. 5.5 shows the importance of the features in classifying which answer will be selected on the two datasets. “timeDiff”, “avgAnswersScore”, “answersNum” and “votes” are among the most important features in both cases. Interestingly, the time difference ranks at the top in both cases. This implies that the later a user publishes an answer, the more likely the answer will be accepted, which at first sounds unreasonable. However, taking both the average score of existing answers and the time difference into consideration, it suggests that a later answerer can publish a more comprehensive answer by covering the points of existing high quality answers, and therefore has a higher probability of being accepted. Vote related features are the next most important factors. It is intuitive that the accepted answer should also receive many votes. However, since the time the answer is selected as accepted is unknown, it is unclear which votes occur before that time point and which after. Thus, we do not include the vote information as a major factor.

5.4 TKTM (Time based Knowledge Transfer Modeling)

Sec. 5.3 discussed the factors affecting knowledge transfer. In essence, modeling the process of knowledge being transferred is to learn the transfer probability. A method such as [12] can be utilized to estimate the probability based on answering/commenting actions. However, it is not able to reflect the temporal property of the Q&A thread. Also, as discussed in Sec. 5.3.2, time is an important factor: a later answer may take advantage of existing answers and is more likely to be accepted. Note that this does not mean a later answerer receives the knowledge from existing answers; on the contrary, a later answer may transfer knowledge to previous answerers, since it may provide extra or different information.

By taking the reputation, post body text and the temporal factor into account, we propose TKTM (Time based Knowledge Transfer Modeling) to model the knowledge transfer process between actors in the Q&A forum. Logically, a later answer (or comment) may transfer knowledge to all existing answers and comments, but without further information we only consider knowledge to be transferred to the actor whose question or answer is being answered or commented on. A link between two actors in TKTM does not represent friendship; instead, it represents the likelihood that one will provide knowledge to the other when they have knowledge in the same area but with different understanding, opinions or expertise levels.

Given a network of size N (N actors) and a thread set C, each thread includes a question and multiple answers and comments, tagged with time-stamps. Thus, $C = \{t^1, t^2, \cdots, t^{|C|}\}$, where $t^c = \{t_1^c, t_2^c, \cdots, t_N^c\}$ and $t_i^c$ is the time-stamp at which actor i posts an answer or comment in thread c.

The link from node i to node j represents the likelihood that node i transfers knowledge to j when i issues a post (either an answer or a comment) after j. Firstly, i can only transfer knowledge to j if i published a post after j; secondly, the difference in expertise level (represented by the reputation difference) could be a factor affecting whether and how much knowledge is transferred between nodes; thirdly, it is also affected by the knowledge level embodied in the question. As discussed in Sec. 5.3.1, the knowledge level of a question is represented by the questioner's reputation and body text. Since votes happen after the answers or comments, vote information is not included in the model. Thus, the conditional likelihood function is:

$$f_{ji}(t_i \mid t_j) = \begin{cases} f_{ji}(t_{j,i}, \delta_r(j,i), k_j^c) & t_i > t_j \\ 0 & t_i \le t_j \end{cases} \qquad (5.4.1)$$

where $k_j^c = r_j \cdot v_j^c$.

Here $t_{j,i}$ is the time difference between when j and i published their posts; $\delta_r(j,i)$ is the reputation difference between j and i; $r_j$ is the absolute reputation of j; $v_j^c$ is the number of unique words in j's post in Q&A thread c; and $k_j^c$ is the knowledge level contained in post j of thread c, the product of $r_j$ and $v_j^c$.

In general, f can take the form of an exponential function of a weighted combination of the selected features, such that

$$f_{ji}(t_{j,i}, \delta_r(j,i), k_j^c) = \exp\bigl(-(\alpha_{ji}^{(1)} t_{j,i} + \alpha_{ji}^{(2)} \delta_r(j,i) + \alpha_{ji}^{(3)} k_j^c)\bigr) \qquad (5.4.2)$$

Since the reputation difference is relatively constant,

$$1 = \int_{-\infty}^{+\infty} f_{ji}(t_{j,i}, \delta_r(j,i), k_j^c) = e^{-\alpha_{ji}^{(2)}\delta_r(j,i)} \int_{0}^{+\infty} e^{-\alpha_{ji}^{(1)} t}\,dt \int_{0}^{+\infty} e^{-\alpha_{ji}^{(3)} k}\,dk = e^{-\alpha_{ji}^{(2)}\delta_r(j,i)}\,\frac{1}{\alpha_{ji}^{(1)}}\,\frac{1}{\alpha_{ji}^{(3)}} \qquad (5.4.3)$$

So $e^{-\alpha_{ji}^{(2)}\delta_r(j,i)} = \alpha_{ji}^{(1)}\alpha_{ji}^{(3)}$, $f_{ji}$ depends only on the variables t and k, and Eq. (5.4.2) can be simplified as:

$$f_{ji}(t_{j,i}, k_j^c) = \alpha_{ji}\beta_{ji}\exp\bigl(-(\alpha_{ji} t_{j,i} + \beta_{ji} k_j^c)\bigr) \qquad (5.4.4)$$

Accordingly, the likelihood that node i does not transfer knowledge to j by the end of the thread is defined as a survival function:

$$S_{ji}(t^c - t_j) = 1 - \int_{0}^{t^c - t_j} f_{ji}(t_{j,i}, k_j^c) = 1 - \int_{0}^{t^c - t_j} \alpha_{ji}e^{-\alpha_{ji}t}\,dt \int_{0}^{+\infty} \beta_{ji}e^{-\beta_{ji}k}\,dk = e^{-\alpha_{ji}(t^c - t_j)} \qquad (5.4.5)$$

Here $t^c$ is the end time of Q&A thread c (the time elapsed from the question to the last post in the thread). The knowledge level k is positive (the product of reputation and the number of unique words of the post), so it is integrated from 0 to ∞. This survival function means that, for any question or answer j published, i does not transfer knowledge to j at any time from $t_j$ to the end time $t^c$ of the thread.

According to Assumption 1, once a post is published, it provides some knowledge to the previous ones. It is also possible that an actor publishes multiple comments/answers in a Q&A thread. In many cases, the subsequent comments published by the same actor explain her/his previous answer or comment, so the knowledge provided could be overlapping. Therefore, for each actor, only her/his first answer or comment is included. The likelihood that an actor i provides knowledge to a Q&A thread is then the likelihood that the actor published an answer or a comment on the existing answering/commenting links:

$$\ell_i^{+}(c) = f_{oi}(t_{o,i}, k_o^c)^{I(E_{oi}^c)} \prod_{j:\,j \ne o,\, t_j^c < t_i^c} f_{ji}(t_{j,i}, k_j^c)^{I(E_{ji}^c)} \qquad (5.4.6)$$

where o is the questioner, such that $t_o^c < t_i^c$ for all $i \ne o$; $I(E_{ji}^c)$ indicates whether there is an answering or commenting action from i to j in thread c; and $t_{j,i}$ is the time difference between j's and i's posts. The likelihood that a node i does not transfer any knowledge (does not publish any answer or comment) to the Q&A thread is:

$$\ell_i^{-}(c) = \prod_{j:\,t_j^c \le t^c} S_{ji}(t^c - t_j) \qquad (5.4.7)$$

It is the product of the survival functions over every node j that published a post to which i did not transfer knowledge.

Therefore, the total likelihood for a thread is:

$$\ell(c) = \prod_{i:\,t_i^c \le t^c} \ell_i^{+}(c) \prod_{i:\,t_i^c > t^c} \ell_i^{-}(c) \qquad (5.4.8)$$

and the likelihood of all threads is

$$L(C; A, B) = \prod_{c \in C} \ell(c; A, B) \qquad (5.4.9)$$

The relevant negative log likelihood is

$$-LL(C; A, B) = -\sum_{c \in C} \log \ell(c; A, B) = -\sum_{c \in C}\Bigl(\sum_{i:\,t_i^c \le t^c} \log \ell_i^{+}(c) + \sum_{i:\,t_i^c > t^c} \log \ell_i^{-}(c)\Bigr)$$
$$= -\sum_{c \in C}\Bigl(\sum_{i:\,t_i^c \le t^c} \sum_{j:\,t_j^c \le t_i^c} I(E_{ji}^c) \log f_{ji}(t_{j,i}, k_j^c) + \sum_{i:\,t_i^c > t^c} \sum_{j:\,t_j^c \le t^c} \log S_{ji}(t^c - t_j)\Bigr)$$
$$= -\sum_{c \in C}\Bigl(\sum_{i:\,t_i^c \le t^c} \sum_{j:\,t_j^c \le t_i^c} I(E_{ji}^c)\bigl(\log \alpha_{ji} + \log \beta_{ji} - \alpha_{ji} t_{j,i} - \beta_{ji} k_j^c\bigr) - \sum_{i:\,t_i^c > t^c} \sum_{j:\,t_j^c \le t^c} \alpha_{ji}(t^c - t_j)\Bigr) \qquad (5.4.10)$$

The problem is thus transformed into

$$\operatorname*{minimize}_{A,B}\; -\sum_{c \in C} \log \ell(c; A, B) \qquad (5.4.11)$$

subject to $\alpha_{j,i} \ge 0,\ \beta_{j,i} \ge 0,\ i, j = 1, \ldots, N,\ i \ne j$,

where $A := \{\alpha_{j,i} \mid i, j = 1, \ldots, N,\ i \ne j\}$ and $B := \{\beta_{j,i} \mid i, j = 1, \ldots, N,\ i \ne j\}$.

The whole optimization is computation intensive, but it can be accelerated by running the optimization for each node in parallel. Though we treat answers and comments the same during the modeling process, answers are more important than comments since most knowledge is contained in answers; the answers are also the basis of subsequent comments. Thus, each dataset is tested in two cases: including answers only, and including both answers and comments. The time stamp of the question is set to 0 (the beginning of the thread), while the time stamps of the following answers and comments are set to the time elapsed since the question was posted. All time stamps are normalized by the maximum time stamp across all threads. The reputation is normalized by the maximum actor reputation in the dataset, while the number of unique words is normalized by the maximum number of unique words in a post's body text across all posts.
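For a single directed link, the negative log-likelihood in (5.4.10) decomposes into observed-transfer terms and censored survival terms, which a generic optimizer can minimize; the sketch below assumes that per-link decomposition and uses made-up observations.

import numpy as np
from scipy.optimize import minimize

def link_nll(params, t_obs, k_obs, t_cens):
    """Negative log-likelihood of one directed link j -> i, per (5.4.10).
    t_obs, k_obs: time gaps and knowledge levels of observed transfers;
    t_cens: gaps t^c - t_j for threads where i never responded to j."""
    alpha, beta = params
    ll = (len(t_obs) * (np.log(alpha) + np.log(beta))
          - alpha * np.sum(t_obs) - beta * np.sum(k_obs)   # f_ji terms (5.4.4)
          - alpha * np.sum(t_cens))                        # survival terms (5.4.5)
    return -ll

# Made-up normalized observations for illustration:
res = minimize(link_nll, x0=[1.0, 1.0],
               args=(np.array([0.1, 0.4]), np.array([0.2, 0.05]),
                     np.array([0.8, 1.0])),
               bounds=[(1e-6, None), (1e-6, None)])
alpha_ji, beta_ji = res.x

For this per-link form the minimizer actually admits a closed form (α = n_obs/(Σ t_obs + Σ t_cens) and β = n_obs/Σ k_obs), so the optimizer above mainly illustrates the structure of the objective that is minimized per node in parallel.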

The optimal α and β from both datasets also follow a power law. This means that in the network, knowledge is transferred faster on a small portion of links; in other words, a small number of users are more active than others in answering questions. This observation matches the assumption proposed in [51] that users are organized as a reputation pyramid, where high reputation users are at the top and answer questions quicker than low reputation users at the bottom.

5.5 Community analysis with TKTM

Given the network extracted from TKTM, the relationship between two actors is represented by a link, and the weight of the link is the likelihood that one actor will transfer knowledge to the other.

Fig. 5.6 shows the network extracted with TKTM. Most actors are connected, while some accounts are isolated from the community. Since there are no links from these isolated accounts to the rest of the community, they are more likely zombie accounts that registered but were never activated, or read-only accounts whose owners only browse information from the forum but never post their own opinions.

Figure 5.6: Network with TKTM

Excluding the isolated actors who have no connection to others, Fig. 5.7 shows the distribution of actors' reputation, closeness and activeness on the two datasets. Closeness measures the distance of an actor to the rest of the network extracted with TKTM, defined as

$$C_i = \frac{1}{\sum_{j \in N} d_{ij}} \qquad (5.5.1)$$

where N is the actor set and $d_{ij}$ is the shortest path length from i to j. The larger the closeness, the more central the actor is in the network, so that it is easier to reach others. Activeness measures how many times an actor joined Q&A threads by publishing either a post or a comment. The distributions of these three metrics on the two datasets show a similar pattern: the actors are mainly grouped into 3 clusters.
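The closeness in (5.5.1) can be computed directly over the extracted network. The sketch below uses networkx with hop-count shortest paths and skips unreachable pairs; both are assumptions, as the text does not fix the distance measure.

import networkx as nx

def closeness_unnormalized(G):
    """C_i = 1 / sum_j d_ij as in (5.5.1), per node of the TKTM network.
    Hop-count shortest paths are used and unreachable pairs are skipped."""
    scores = {}
    for i in G.nodes():
        dists = nx.single_source_shortest_path_length(G, i)
        total = sum(d for j, d in dists.items() if j != i)
        scores[i] = 1.0 / total if total > 0 else 0.0
    return scores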

75 4000 5000 3500 4000 3000 2500 3000 2000 2000 1500 1000 1000 500 activeness activeness 0 0 −500 −1000 40000 200000 35000 30000 150000 25000 0.0000 20000 0.0000 100000 0.0005 15000 0.0002 50000 closeness0.0010 (centrality) 10000 closeness0.0004 (centrality) 5000 0.0006 0.0015 0.0008 0 0.0020 0 reputation 0.0010 reputation 0.0025−5000 0.0012−50000 (a) mechanics (b) security

Figure 5.7: Distribution of actors' reputation, closeness and activeness

Cluster 1 - low closeness, low reputation: these are actors who are far away from the mainstream. Cluster 2 - high closeness, low reputation: these are actors who have ways to connect to someone with high closeness, such that they can also reach others within a short distance. Another interesting observation is that some actors in this group, though they participate in many threads (high activeness), do not gain high reputation from it; essentially, they do not bring real knowledge to the community. Cluster 3 - high closeness, high reputation but low activeness: these could be the true experts in the community, who do not answer lots of questions but only answer the difficult ones with a high score in return.

Looking at the distribution of closeness alone, it ranges from 0.0002 to 0.0020. Most actors have a relatively high closeness, close to 0.0020. This observation is consistent with our finding in chapter 3 that there is a giant propagation component. Here, this component can be thought of as a component where the actors involved have a high probability of transferring knowledge to others.

The community or network derived from TKTM shows the relationships among user accounts, which provides a way to estimate the probability that a user account will transfer knowledge to another. Note that, given the TKTM network, the knowledge transfer probability could logically be derived between any two user accounts; but in a Q&A forum, we are only interested in 1-hop and 2-hop knowledge transfers, which represent the answering and commenting actions.

5.6 Evaluation on TKTM

We apply TKTM to solve several predictive analysis tasks: how many user accounts will be involved in each thread; how many user accounts will be involved in each thread over time; who will answer the question; and who will provide the accepted answer.

5.6.1 Task1: Thread Size Prediction

The first question is to estimate how many user accounts will be involved in a Q&A thread. There are mainly two types of approaches to solve this problem.

• Use regression methods to determine the relation between the properties of the initial question/questioner and the thread size;

• Learn the link weight and then estimate the thread size.

Thread Size Estimation with Regression

When an actor raises a question, the only known information is from the question and the questioner. All features from Table 5.4 except AAWTime are selected. AAWTime is excluded because it is derived from the result of received answers. Therefore, the selected features include: q reput, #questions, title total, title nostop, title unique, body total, body nostop, body unique. Linear Regression and Random Forest are utilized to train the regression models.

Thread Size Prediction with Link Weight

The second method is to learn the link weight and then estimate the thread size. We use TKTM and NetRate [66] to learn the link weight, where both methods recover the link weight as a likelihood function.

Given that j published a question, the thread size is the summation of the probabilities that each actor joins j's thread. By including answers only 12, the thread size is the summation of the probability that any actor i answers j's question:

$$TS_j = \sum_{i\ne j} P_{ji}^{(1)} \tag{5.6.1}$$

where $TS_j$ is the thread size of j's question; $P_{ji}^{(1)}$ is the probability that i answers j's question, and there is a direct link from i to j in the TKTM or NetRate learned network.

12: By including answers only, we only include answer information in the training process, such that TKTM or NetRate learns the link weights based on answering actions.

By including both answers and comments 13, the thread size is:

$$TS_j = \sum_{i\ne j}\Big(1-\big(1-P_{ji}^{(1)}\big)\big(1-P_{ji}^{(2)}\big)\Big) \tag{5.6.2}$$

In this case, $P_{ji}^{(1)}$ is the probability that i answers or comments on j's question, i.e., i can directly reach j; and $P_{ji}^{(2)}$ is the probability that i comments on answers of j's question, i.e., i is 2 hops away from j in the TKTM or NetRate learned network.

Given the network learned from TKTM or NetRate, after j published a question, if there is a direct link from i to j, the probability that i transfers knowledge to j by posting an answer or a comment by time T is:

$$P_{ji}^{(1)} = \int_0^T f_{ji}\,dt \tag{5.6.3}$$

where $f_{ji}$ is the likelihood function for the link from i to j. In TKTM, $f_{ji}$ is defined as in Eq.(5.4.4); while in NetRate, $f_{ji}=\alpha_{ji}e^{-\alpha_{ji}t}$. Similarly, if j published an answer, the probability that i posts a comment on j's answer can also be derived as in Eq.(5.6.3).
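Under the exponential forms assumed earlier, Eq.(5.6.3) admits the closed form $P^{(1)}_{ji}(T)=(1-e^{-\alpha_{ji}T})(1-e^{-\beta_{ji}k})$, and Eq.(5.6.1) is then a simple sum over incoming links. The following sketch is purely illustrative, with made-up parameter values:

```python
import numpy as np

def p_answer(alpha, beta, T, k):
    # 1-hop probability of Eq.(5.6.3) under f(t, k) = a*b*e^{-a t}*e^{-b k}
    return (1.0 - np.exp(-alpha * T)) * (1.0 - np.exp(-beta * k))

def thread_size(links, T, k):
    # Eq.(5.6.1): expected number of answerers = sum of 1-hop probabilities
    return sum(p_answer(a, b, T, k) for a, b in links)

links = [(0.5, 1.2), (2.0, 0.3), (0.1, 0.8)]  # (alpha_ji, beta_ji) per incoming link
print(thread_size(links, T=1.0, k=0.4))
```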

If i is 2-hop away from j, by time T , the probability of i joining j’s thread by

13: By including both answers and comments, TKTM or NetRate is trained based on both the answering and commenting actions.

posting a comment on an answer of j's question is:

$$
\begin{aligned}
P_{ji}^{(2)} &= 1-\prod_l\Big(1-P_{jl}^{(1)}P_{li}^{(1)}\Big)\\
&= 1-\prod_l\Big(1-\int_0^T f_{jl}(t,k_j)f_{li}(T-t,k_l)\,dt\Big)\\
&= 1-\prod_l\Big(1-\int_0^{k_j^c}\beta_{jl}e^{-\beta_{jl}k}\,dk\int_0^{k_l^c}\beta_{li}e^{-\beta_{li}k}\,dk\int_0^T\alpha_{jl}e^{-\alpha_{jl}t}\alpha_{li}e^{-\alpha_{li}(T-t)}\,dt\Big)\\
&= 1-\prod_l\Big(1-\big(1-e^{-\beta_{jl}k_j}\big)\big(1-e^{-\beta_{li}k_l}\big)\frac{\alpha_{jl}\alpha_{li}e^{-\alpha_{li}T}}{-\alpha_{jl}+\alpha_{li}}\big[e^{(-\alpha_{jl}+\alpha_{li})T}-1\big]\Big)
\end{aligned}
$$

where l is a bridge node such that there is a link from l to j and a link from i to l. For each node i two hops away from the questioner j, if the bridge node l is indeed an answerer of j's question, we apply its actual "knowledge level" computed from l's reputation and l's answer; otherwise, we randomly sample a knowledge level for $k_l$.
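As a sanity check on the last step of this derivation, the sketch below compares SciPy quadrature of the time integral with the closed form; the parameter values are arbitrary, and the closed form assumes $\alpha_{jl}\ne\alpha_{li}$.

```python
import numpy as np
from scipy.integrate import quad

a_jl, a_li, T = 0.8, 1.5, 2.0

# numerical integral of a_jl*e^{-a_jl t} * a_li*e^{-a_li (T - t)} over [0, T]
numeric, _ = quad(lambda t: a_jl * np.exp(-a_jl * t) * a_li * np.exp(-a_li * (T - t)),
                  0.0, T)
# closed form: a_jl*a_li*e^{-a_li T} * (e^{(a_li - a_jl) T} - 1) / (a_li - a_jl)
closed = a_jl * a_li * np.exp(-a_li * T) * (np.exp((a_li - a_jl) * T) - 1.0) / (a_li - a_jl)
print(numeric, closed)   # the two values should agree
```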

Result

The experiment runs 10-fold cross validation. All Q&A threads (including questions both answered and not answered) are randomly split into 10 folds. Each round selects 1 fold of threads as the testing data, and the remaining as the training data. Each method trains the regression model or learns the link weights from the training data, and tests on the testing data. The mean square error (MSE) is the difference between the prediction and the ground truth on all testing threads. Table 5.6 shows the mean MSE over all 10 rounds on "mechanics" and "security", respectively. RF and LM represent the cases of using RandomForest and LinearRegression. "Ans" means only including answers in a thread, such that the thread size is the total number of answerers in the thread; while "AnsComt" includes both answers and comments in a thread, so that the thread size is the total number of actors (answerers and commenters) in the thread. The regression models, NetRate and TKTM are trained separately for "Ans" and "AnsComt". In the "AnsComt" case, NetRate and TKTM include more edges, which come from commenting actions.

Dataset        RF     LM     NetRate  TKTM
Ans Mech       1.59   1.58   6.33     1.29
AnsComt Mech   4.37   4.39   21.30    3.86
Ans Secu       2.70   2.80   12.30    2.58
AnsComt Secu   13.00  13.27  44.55    12.90

Table 5.6: MSE of estimation on Q&A thread size
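For reference, the cross-validation protocol above can be sketched as follows for the random-forest baseline with scikit-learn; the feature matrix and thread sizes below are random stand-ins for the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((500, 8))          # q_reput, #questions, title/body word counts, ...
y = rng.poisson(2, size=500)      # thread sizes (toy stand-in for the real data)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
print(np.mean(errors))            # mean MSE over the 10 folds
```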

As shown in Table 5.6, TKTM outperforms the other methods in all cases. The two regression methods perform similarly. NetRate does not work well in this situation, since it only takes into account whether an actor has posts in a thread, without leveraging the existence of explicit answering/commenting actions. In addition, NetRate's link function depends only on time, which ignores the variety of questions/answers published. For example, for the same questioner, an answerer's behavior could differ depending on the question published.

Fig.5.8 shows the histogram of thread size (answers only) in the actual ground truth data (GT), TKTM and Random Forest (RF) 14 in dataset "security".

14: The histogram when using linear regression is similar to Random Forest.

[Figure: histograms of thread size (x-axis: threadSize, y-axis: Frequency) for the three panels below]

(a) GT (b) TKTM (c) RF

Figure 5.8: Thread Size Distribution

RF and LM are biased toward the majority of small threads, such that the estimated thread size is close to the mean value. TKTM shows the same heavy-tail pattern as the ground truth. Note that the GT histogram is not exactly a power law. This is because in the dataset some of the questions have not been answered yet, i.e., their thread size is 0, while most questions receive at least one answer.

5.6.2 Task2: Thread Size Prediction over Time

The second prediction task is to estimate the thread size over time. Table 5.6 shows the MSE of thread size prediction until the end of all threads. Compared to the regression models, TKTM has the advantage that the probability is a function of time, which can show how the thread size changes over time. Given a time T, the probability that a node will join the thread by time T can be derived as discussed in Sec.5.6.1. Different from the regression method in Sec.5.6.1, which is based on question and questioner features only, to estimate the thread size over time with a regression method the temporal factor is also included. The regression model is thus trained with the features: q reput, #questions, title total, title nostop, title unique, body total, body nostop, body unique, time, and the response (thread size).

Also, different from Sec.5.6.1 where the threads are randomly split, in this test all Q&A threads are sorted in ascending order of when the question was raised. All three methods (RF, LM, TKTM) use the first 90% of threads to train the model, and the remaining 10% of threads for testing. Only answers are included in this experiment, i.e., the thread size is the number of answerers. The MSE at each time point is the difference between the estimated thread size and the actual thread size.

Fig.5.9 shows the MSE of the three methods over time. The x-axis is the normalized time, and "10" is the maximum time of all threads in the dataset.

[Figure: MSE curves over time for TKTM, RF and LM on the two datasets]

(a) mechanics (b) security

Figure 5.9: MSE of prediction over time (answers only)

In both datasets, TKTM outperforms RF and LM. TKTM gives a better estimation from the beginning to the end of the threads. One interesting observation in "mechanics" is that there is a turning point at time "4". This is because most threads received their accepted answer before time "4", such that the actual thread size increases until time "4" and stays constant after that. Though the regression method is also able to reflect the thread size change by including the time feature, TKTM's exponential function models the thread size change more smoothly. The MSE at time "10" in Fig.5.9 is also better than the overall MSE in Table 5.6. This is because all threads are sorted in time order, such that the TKTM network trained from the training threads is able to provide more guidance in predicting future testing threads.

5.6.3 Task3: Individual Action Prediction

The third task is to predict who will provide answers to a question. Since the thread size is the number of actors joining the thread, it is necessary to know how these methods perform on predicting whether each actor will join a Q&A thread. This question is the same as asking how to route a question to answerers. QLI [56] is utilized as the baseline method. QLI calculates the probability that an actor will answer a question according to the similarity of the question against all questions answered by the actor. For each questioner and its question, TKTM is applied to estimate the probability that any other actor except the questioner will join the thread. These actors are ranked by their probability of joining the thread, from the highest to the lowest. The actual actor in each thread is then identified with a rank among all actors. The higher the rank of the actual actor, the better the prediction, i.e., the closer the prediction is to the ground truth. The final result is evaluated with the Mean Reciprocal Rank (MRR) [56], which is defined as the mean of the reciprocal of the rank.
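A tiny sketch of the MRR computation just described; the ranked lists and ground-truth answerers below are toy values.

```python
def mrr(ranked_lists, truths):
    """ranked_lists: per thread, actors sorted by predicted probability (descending);
    truths: per thread, the actor who actually answered."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        rank = ranked.index(truth) + 1        # 1-based rank of the true answerer
        total += 1.0 / rank
    return total / len(truths)

print(mrr([["u3", "u1", "u2"], ["u2", "u3", "u1"]], ["u1", "u2"]))  # (1/2 + 1)/2 = 0.75
```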

Same as in Sec.5.6.2, we use the first 90% of threads as training data, and the remaining 10% for testing. Table 5.7 shows the prediction results of TKTM and QLI on "mechanics" and "security", respectively. TKTM performs much better than the text mining method QLI.

Dataset     QLI    TKTM
Mechanics   0.013  0.232
Security    0.001  0.036

Table 5.7: Estimation on Answerer in MRR

In principle, QLI and TKTM are based on the same assumption: if an actor has specific knowledge in a field, she is more likely to answer the relevant questions. The difference is that QLI takes a text mining approach, assuming that a new question has some similarity with previous questions, while TKTM focuses on mining the likelihood that an actor transfers knowledge to another. The result implies that in these two datasets, the description of a question or the selection of words has more uncertainty, while the interests and knowledge of a person build up gradually, so that an actor's question is more likely to be answered by the same actor who answered his questions before. Comparing the performance on the two datasets, both QLI and TKTM perform worse on "security". This is because the "security" dataset has more users, making prediction more difficult. On the other hand, it also implies that the "security" dataset is more volatile, in that both the words and the actors in the community change relatively rapidly.

5.6.4 Task4: Who will Provide the Accepted Answer

The fourth task is to predict who will provide the accepted answer. As mentioned in Sec.5.3.2, question-solving is a process of accumulating answers until the final proper one is received. TKTM can estimate the probability of an actor joining a thread by a time T; conversely, it can also be used to derive when an actor will have joined the thread with a probability P. From Eq.(5.6.3),

$$T = -\frac{1}{\alpha_{qi}}\log\Big(1-\frac{P_{qi}}{1-e^{-\beta_{qi}k_q^c}}\Big) \tag{5.6.4}$$

where $k_q^c$ is the knowledge level of the question raised by q in thread c, and $P_{qi}$ is the probability that i will provide an answer to questioner q.
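A small sketch of Eq.(5.6.4): note that $P_{qi}(T)$ saturates at $1-e^{-\beta_{qi}k_q^c}$, so probabilities above that cap are unreachable (returned as infinity below). The parameter values are illustrative.

```python
import numpy as np

def time_to_answer(alpha_qi, beta_qi, k_q, prob):
    """Smallest T with P_qi(T) = prob; inf if prob exceeds the saturation cap."""
    cap = 1.0 - np.exp(-beta_qi * k_q)   # limit of P_qi(T) as T -> infinity
    if prob >= cap:
        return np.inf
    return -np.log(1.0 - prob / cap) / alpha_qi

print(time_to_answer(alpha_qi=0.8, beta_qi=1.5, k_q=0.9, prob=0.5))
```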

As Fig.5.5 has shown, whether an answer will be accepted mostly depends on when the answer is published, how many answers already exist, and the average score of the existing answers. Since the answer score is unknown during prediction, we only use the first two features to build the classifier, i.e., when an answer is published (t) and the number of existing answers (e).

Table 5.8 shows the algorithm for finding the rank of the probability that each actor will provide the accepted answer, where A is the actor set; Pr is the predefined probability used to derive the time t; and g is the ground-truth actor who provided the accepted answer to question q. The higher the rank of $p_g$, the better the performance. In this case, Pr is set to 0.5, which is to estimate when an actor will have provided an answer to the question with over a 50% chance.
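A Python rendering of the procedure of Table 5.8 below may make it concrete; here `estimate_time` is a hypothetical stand-in for Est(q, a_i, Pr) of Eq.(5.6.4), and `classifier` for the trained acceptance classifier.

```python
def rank_accepted(actors, estimate_time, classifier, prob=0.5):
    """Return {actor: probability its answer would be accepted}."""
    times = {a: estimate_time(a, prob) for a in actors}    # when each actor would answer
    scores = {}
    existing = 0
    for actor in sorted(actors, key=lambda a: times[a]):   # visit in answering order
        scores[actor] = classifier(times[actor], existing)
        existing += 1                                      # one more answer now exists
    return scores

# toy stand-ins for the learned components
est = lambda a, p: {"u1": 0.2, "u2": 0.5, "u3": 0.9}[a]
clf = lambda t, e: max(0.0, 1.0 - 0.5 * t - 0.1 * e)       # earlier + fewer answers => better
print(rank_accepted(["u1", "u2", "u3"], est, clf))
```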

We test the two datasets using the algorithm shown in Table 5.8, which is built upon TKTM.

Algorithm: Who will provide the accepted answer

Init:
    Train Classifier(T, E) for which answer will be accepted, based on when
    the answer is posted (t) and the number of existing answers (e).

Func Rank(q, Pr, A):
    for each actor a_i ∈ A:
        t_i = Est(q, a_i, Pr) according to Eq.(5.6.4)
        p_i = 0
    e = 0
    for each t_i in ascending order:
        p_i = Classifier(t_i, e)
        e = e + 1
    return probability vector P of p_i

Main:
    P = Rank(q, Pr, A)
    find rank of p_g in P

Table 5.8: Algorithm of estimating who will provide the accepted answer

The resulting MRR are 0.165 and 0.035 for "mechanics" and "security", respectively. QLI is applied as the baseline, where the rank of a user is based on the similarity between the newly posted question and the questions for which the user has previously provided the accepted answer. The result is shown in Table 5.9.

Dataset     QLI     TKTM
Mechanics   0.0013  0.165
Security    0.0003  0.035

Table 5.9: Estimation on Answerer in MRR

Compared to the results shown in Sec.5.6.3, TKTM achieves decent performance in estimating who will provide the accepted answer, which is related to the question of predicting who will provide an answer, since the latter provides a set of candidates and the selected one is one of these answerers.

5.6.5 TKTM & Semantic Analysis

The TKTM method extracts the knowledge transfer probability. Though TKTM uses the knowledge level, defined as the product of a user's reputation and the number of unique words, to evaluate whether a user's post will be attractive, it does not cover the relationship between users who have never had contact before but may transfer knowledge in the future based on their shared interests. This is also the intuition behind methods such as QLI.

Thus, in this section, we build a network based on the language similarity of any two users. We treat the language of a user as a document which includes all posts and comments she/he published, excluding stop words. The language pattern of this user is represented by TF.IDF (Term Frequency times Inverse Document Frequency) [9]. Given N documents, $f_{ij}$ is the frequency of term i in document j,

$$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}} \tag{5.6.5}$$

where $\max_k f_{kj}$ is the maximal number of occurrences of any term in document j.

$$IDF_i = \log\frac{N}{n_i} \tag{5.6.6}$$

where $n_i$ is the number of documents in which term i appears. Therefore, $TF_{ij}$ evaluates the relative frequency with which term i appears in document j, and $IDF_i$ represents whether term i is frequent across all documents: if a term is very common in all documents, its IDF will be small and the term is not able to represent the language pattern of a document.

The final TF.IDF is the product of TF and IDF

$$TF.IDF_{ij} = TF_{ij}\,IDF_i \tag{5.6.7}$$

Therefore, the similarity of two users' language is reflected by the similarity of the TF.IDF of the terms appearing in their languages (documents). Here, we use cosine similarity (other similarity metrics should also work). Given the TF.IDF vectors X and Y of two documents, the cosine similarity is:

$$\cos = \frac{X\cdot Y}{\|X\|\,\|Y\|} \tag{5.6.8}$$
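The pipeline of Eqs.(5.6.5)–(5.6.8) can be sketched compactly as follows; the three toy "documents" stand in for users' concatenated posts with stop words already removed.

```python
import math
from collections import Counter

docs = {"u1": "python regression forest model".split(),
        "u2": "python security exploit model".split(),
        "u3": "engine brake clutch".split()}

N = len(docs)
df = Counter(t for words in docs.values() for t in set(words))   # n_i per term

def tfidf(words):
    tf = Counter(words)
    peak = max(tf.values())                                      # max_k f_kj
    return {t: (c / peak) * math.log(N / df[t]) for t, c in tf.items()}

def cosine(x, y):
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    nx_ = math.sqrt(sum(v * v for v in x.values()))
    ny_ = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx_ * ny_) if nx_ and ny_ else 0.0

v = {u: tfidf(w) for u, w in docs.items()}
print(cosine(v["u1"], v["u2"]), cosine(v["u1"], v["u3"]))        # shared terms vs. none
```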

In this way, we obtain a network based on language similarity, and it is independent of TKTM. To combine TKTM and language similarity, we define the new metric:

$$R = Pr_{TKTM}(ji) + \gamma\cos_{ji} \tag{5.6.9}$$

where $Pr_{TKTM}(ji)$ is the probability that user i will transfer knowledge to j, and $\cos_{ji}$ is the language similarity between them 15. The reason we do not use a product to combine $Pr_{TKTM}(ji)$ and $\cos_{ji}$ is that $Pr_{TKTM}(ji)$ could be zero, in which case the product would cancel the effect of $\cos_{ji}$. $\cos_{ji}$ is also a value between 0 and 1, and γ is the coefficient controlling how much the prediction depends on the language similarity 16.

γ               0       0.0001  0.001   0.01    0.1     0.5     1.0
MRR(Mechanics)  0.2324  0.2652  0.2653  0.2685  0.2668  0.2379  0.1790
MRR(Security)   0.0360  0.0434  0.0429  0.0429  0.0391  0.0239  0.0240

Table 5.10: Prediction on Answerer in MRR with different γ

[Figure: log-log plot of MRR versus γ for "mechanics" and "security"]

Figure 5.10: Prediction on Answerer in MRR with different γ

Table 5.10 and Fig.5.10 show how the MRR changes with γ when predicting who will provide the answer. The optimal results occur when γ is about 0.01 and 0.0001 for the datasets "mechanics" and "security", respectively. When γ is 0, the prediction is based on TKTM only. This shows that including semantic analysis does help improve the performance over TKTM alone, depending on the level of emphasis placed on the semantics.

15: $\cos_{ji} = \cos_{ij}$. 16: Logically, γ can be any real number. In this study, we only investigate the case when γ is between 0 and 1.

5.7 Summary

The Q&A forum attracts many studies, as it has become one of the most important resources for knowledge inquiry. Most existing studies have focused on predictive analyses such as what the thread size will be and who will provide the answer. However, few studies have analyzed how knowledge is transferred among user accounts. In this chapter, we model the process of knowledge transfer in a Q&A forum, StackOverflow, by extending the concept of information diffusion. Although the process of knowledge transfer is similar to conventional information diffusion, it has its own properties such as user reputation and thread life time. This study analyzes the factors affecting knowledge transfer, and it turns out that the knowledge level of the question and questioner, together with the temporal factors, are critical to predicting whether a question will receive answers and whether an answer will be accepted. Taking these factors into account, we propose TKTM, where the knowledge transfer likelihood, or link weight, is learned from historical data and described as a continuous function of time and knowledge level. The experimental results show that TKTM outperforms the alternatives both in predicting the thread size over time and in predicting individual actions, including who will answer the question. Compared to traditional regression approaches and QLI, TKTM not only achieves better performance in predictive analysis, but also provides a unique approach that offers insights into the Q&A community. Since TKTM and semantic analysis are independent approaches, this work also investigates the possibility of integrating the two methods to achieve better performance in predicting potential answerers.

Chapter 6

Conclusion

The whole study works on the problem of how information propagates and, particularly, how knowledge is transferred. Chapter 2 analyzes the problem of how to detect information flow, and chapter 3 discusses how an information cascade forms and how factors such as temporal and spatial factors affect a cascade's growth. These observations help to model the knowledge transfer process. Indeed, the study of modeling the process of knowledge transfer among user accounts is built upon and motivated by the works on information cascade, where an information cascade is the trace along which information propagates from the originator to others. The process of knowledge transfer is similar to an information cascade in that both processes transfer particular information from those who hold it to those who do not. Chapter 4 analyzes how knowledge is transferred in the sense of adopting research topics among scholars in the DBLP network, where the likelihood of knowledge transfer between two user accounts is modeled as a function of social network spatial features. The experiment shows that prediction performance can be improved by introducing spatial social connection features. However, the improvement is marginal, because having a publication is a complex process that is not affected only by whether the knowledge has been transferred to and adopted by an author. Thus, in chapter 5, we turned to the Q&A forum, in particular the StackOverflow dataset, where the process of knowledge transfer among user accounts is mostly represented by the answering and commenting actions. The study discussed the properties of StackOverflow from three aspects: knowledge orientation, knowledge transfer over threads, and the life time of threads. To understand the life time of a thread, the study investigated the problems of when a question will be solved and which answer will be selected as the accepted one. As learned from the work in chapters 3 and 4, the process of knowledge transfer is affected by the reputation or knowledge level and by the temporal factor. Taking these factors into consideration, we proposed TKTM, where the likelihood that a person will transfer knowledge to another is modeled as a continuous function of time and knowledge level. To evaluate the performance, experiments were performed by applying TKTM to solve two problems: who will answer the question, and what the thread size is; the results were compared to NetRate and regression methods such as RandomForest, linear regression and logistic regression. In both experiments, TKTM beats all other methods. The value of TKTM is not only to obtain better results in predicting whether a user will answer a question, but to provide a platform for modeling the inherent process of how knowledge is transferred among user accounts. Also, TKTM is a function of time, so it can easily be connected to time-related problems, such as whether a user will answer a question by a given time, or extended to the problem of whether a question can be solved by a given time.

Appendix

Appendix 1

Datasets

1.1 DBLP

The Digital Bibliography and Library Project (DBLP) is a collection of bibliographic information on major computer science journals and proceedings, which can be used to build a heterogeneous information network with multi-typed objects along with rich text data. The dataset is contributed by Tang et al. [67] [68] [69] [70] [71] and contains 1,572,277 papers and 2,084,019 citation relationships as of 2011-01-08. Four databases are created to store the information of author, paper, citation and term, respectively. The structure of these four tables is listed as follows:

Table 1.1: DBLP:Author

Field      Type          Null  Key  Default  Extra
id         int(11)       No    PRI  NULL     auto_increment
author     varchar(100)  Yes   MUL  NULL
paper_id   int(11)       Yes   MUL  NULL

Table 1.2: DBLP:Paper

Field      Type          Null  Key  Default  Extra
paper_id   int(11)       No    PRI  NULL
conf       varchar(200)  Yes   MUL  NULL
citation   int(11)       Yes   MUL  NULL
year       int(11)       Yes   MUL  NULL
title      text          Yes        NULL
abstract   text          Yes        NULL

Table 1.3: DBLP:Citation

Field    Type     Null  Key  Default  Extra
id       int(11)  No    PRI  NULL     auto_increment
orig_id  int(11)  Yes   MUL  NULL
ref_id   int(11)  Yes   MUL  NULL

Table 1.4: DBLP:Term

Field       Type          Null  Key  Default  Extra
id          int(11)       No    PRI  NULL     auto_increment
term        varchar(100)  Yes   MUL  NULL
frequence   int(11)       Yes        NULL
words_num   int(11)       Yes        NULL
first_year  int(11)       Yes        NULL

1.2 Digg Network

This data set is contributed by Lerman et al. [64] and covers stories promoted to Digg's front page over a period of a month in 2009. It contains 3,018,197 votes on 3553 popular stories made by 139,409 distinct users, and 1,731,658 friendship links of 71,367 distinct users. For each story, Lerman et al. collected the list of all Digg users who voted for the story up to the time of data collection, and the time stamp of each vote. The voters' friendship links were also retrieved. The semantics of the friendship links are as follows: (user_id −→ friend_id) means that user_id is a fan of friend_id. User ids have been anonymized, but are unique in the data set: a user with a specific id in the friendship links table and a user with the same id in the votes table correspond to the same actual user. Three databases are created for users, friendships and votes.

Table 1.5: Digg:Users

Field        Type     Null  Key  Default  Extra
uid          int(11)  No    PRI  NULL     auto_increment
user_id      int(11)  Yes   MUL  NULL
friends_num  int(11)  Yes   MUL  NULL
cares_num    int(11)  Yes   MUL  NULL

Table 1.6: Digg:Friendships

Field          Type     Null  Key  Default  Extra
rid            int(11)  No    PRI  NULL     auto_increment
mutual         int(11)  Yes        NULL
create_time    int(11)  Yes        NULL
user_id        int(11)  Yes   MUL  NULL
friend_id      int(11)  Yes   MUL  NULL
prob_same_act  float    Yes        NULL
prob_netinf    float    Yes        NULL

Table 1.7: Digg:Votes

Field      Type     Null  Key  Default  Extra
vid        int(11)  No    PRI  NULL     auto_increment
vote_time  int(11)  Yes   MUL  NULL
voter_id   int(11)  Yes   MUL  NULL
story_id   int(11)  Yes   MUL  NULL

1.3 Twitter Network

This dataset is contributed by Li et al. [72]. It is a subset of Twitter, containing 284 million following relationships, 3 million user profiles and 50 million tweets. The dataset was collected in May 2011. Three databases are created for tweets, users and friendships.

Table 1.8: Twitter:Users

Field           Type         Null  Key  Default  Extra
rid             bigint(20)   No    PRI  NULL
name            varchar(45)  Yes   MUL  NULL
friend_count    bigint(20)   Yes   MUL  NULL
follower_count  bigint(20)   Yes   MUL  NULL
status_count    bigint(20)   Yes   MUL  NULL
favorite_count  bigint(20)   Yes   MUL  NULL
account_age     datetime     Yes   MUL  NULL
location        varchar(45)  Yes   MUL  NULL

Table 1.9: Twitter:Friendships

Field          Type        Null  Key  Default  Extra
rid            bigint(20)  No    PRI  NULL     auto_increment
user_id        bigint(20)  Yes   MUL  NULL
friend_id      bigint(20)  Yes   MUL  NULL
prob_same_act  float       Yes        NULL
prob_netinf    float       Yes        NULL

Table 1.10: Twitter:Tweets

Field     Type          Null  Key  Default  Extra
id        bigint(20)    No    PRI  NULL
type      varchar(45)   Yes        NULL
origin    text          Yes        NULL
text      text          Yes        NULL
url       varchar(200)  Yes        NULL
time      datetime      Yes   MUL  NULL
retcount  bigint(20)    Yes   MUL  NULL
favorite  varchar(45)   Yes        NULL
hashtag   varchar(200)  Yes   MUL  NULL

1.4 StackOverflow

Stack Overflow provides its data to the public, covering 277 different topics in programming (http://blog.stackoverflow.com/tags/cc-wiki-dump/). Each topic's raw dataset contains xml files of all posts (either a question or an answer), comments, and relevant users. The attributes of these files are:

• Posts: type, AcceptedAnswerId, CreationDate, Score, Body, OwnerUserId, LastEditorUserId, LastEditDate, Title, Tags, AnswerCount, CommentCount, FavoriteCount

• Comment: Score, Text, CreationDate, UserId

• User: Reputation, DisplayName, WebsiteUrl, Location, Views, UpVotes, DownVotes, Age, AccountId

In this study, we picked two datasets: “mechanics” and “security” 17.

17: The processing code is very general and can be directly applied to any dataset given a different directory name.

Dataset    #Users  #Threads  #Posts  #Comments
mechanics  11787   5548      19369   29114
security   69856   7730      83300   137721

Table 1.11: StackOverflow datasets

Appendix 2

Network Analysis

2.1 Terminologies

Table 2.1 shows the terminology used in the following sections.

Name                     Definition
activated                a node adopted the information
information propagation  the process of information flowing from activated nodes to others
infected                 activated
cascade                  a number of nodes activated in order, starting from an initiator; these nodes form a DAG
cascade size             the total number of nodes in the cascade
cascades size            the size of multiple cascades
seed                     the initiator of a cascade
initial set              multiple seeds involved in the same information propagation
G(V,E)                   graph
V                        node set of a graph
E                        edge set of a graph
e_{u,v}                  directional edge from v to u; information flows from u to v
σ(·)                     influence function
σ(A)                     total number of nodes activated if A is the initial set
c(A)                     cost of selecting initial set A
R(·)                     reward function
ROC                      receiver operating characteristic
AUC                      area under the ROC curve
MAP                      Mean Average Precision
P_{u,v}                  probability that node u infects v
W_{u,v}                  the weight of edge e_{u,v}
child node               a node which is watching its infected parents' behaviors
parent node              the node being watched by its child node, which can infect its child node
K(v)                     the parent set of child node v
N(u)                     the children set of parent node u

Table 2.1: Terminology List

2.2 Random Networks

Random networks are created to model real networks. In general, there are two approaches to modeling a random network.

2.2.1 Poisson Random Network

Define a network graph G(V,E) which contains |V| = n nodes and |E| = m edges. Two classic models are G(n,m) and G(n,p). G(n,m) randomly assigns m edges among the $\binom{n}{2}$ node pairs; G(n,p) includes each of the $\binom{n}{2}$ possible edges independently with the same probability p. If every m-edge network has the same probability, G(n,m) can be represented by G(n,p), where

$$P(G) = p^m (1-p)^{\binom{n}{2}-m} \tag{2.2.1}$$

$$P(m) = \binom{\binom{n}{2}}{m} p^m (1-p)^{\binom{n}{2}-m} \tag{2.2.2}$$

P(G) is the probability of obtaining a specific m-edge graph; P(m) is the probability of obtaining a graph with m edges. The mean number of edges ⟨m⟩ and the mean degree ⟨k⟩ are derived as:

$$\langle m\rangle = \sum_{m=0}^{\binom{n}{2}} m\,P(m) = \binom{n}{2}p \tag{2.2.3}$$

$$\langle k\rangle = \sum_{m=0}^{\binom{n}{2}} \frac{2m}{n}\,P(m) = (n-1)p \tag{2.2.4}$$

Let c = (n − 1)p be the average degree.

The degree distribution is:

$$P_k = \binom{n-1}{k} p^k (1-p)^{n-1-k} \approx e^{-c}\frac{c^k}{k!} \tag{2.2.5}$$

Because the degree probability follows a Poisson distribution, this random network is also called a Poisson random network.

Clustering is the probability that any two neighbors are connected. In G(n, p), the clustering is the same as p.

2.2.2 General Degree Distribution Random Network

Though the Poisson random network shows many characteristics of real networks, it lacks some of their properties, such as transitivity/clustering, community structure, and correlation between connected vertices. The degree distribution is also different. To fill the gap, the general degree distribution random network model is proposed, where a random network is generated given a specific degree distribution. A generating function is used for expressing the degree distribution.

$$g(z) = p_0 + p_1 z + p_2 z^2 + \cdots = \sum_{k=0}^{\infty} p_k z^k \tag{2.2.6}$$

For a power-law degree distribution, $p_k = Ck^{-\alpha}$ with $\alpha>0$ and $k\ge 1$, so $g(1)=\sum_{k=0}^{\infty}p_k=1$, $g'(z)=\sum_{k=0}^{\infty}kp_kz^{k-1}$, and $g'(1)=\langle k\rangle$. The power property of the generating function is: for a degree distribution $p_k$ with generating function $g(z)$, the sum of m independent variables $(k_1,\dots,k_m)$ drawn from this distribution has generating function $h(z)=[g(z)]^m$.

2.3 Giant Component

In a random network, the existence of a giant component can be proved, and the size distribution of the small components can also be derived. The giant component is the component whose size is proportional to n. When p = 0, there is no giant component and the largest component size is 1; when p = 1, the network is fully connected and the giant component size is n. Thus there is an intermediate p which separates these two phases. Define u as the probability that a vertex does not belong to the giant component (GC). Then, for a node i and any other node j, either i has no link to j, or i connects to j but j does not belong to the GC.

$$P(i\notin GC \text{ through } j) = (1-p) + up$$

$$\therefore\; u = (1-p+up)^{(n-1)} \;\Rightarrow\; u \approx e^{-c(1-u)} \tag{2.3.1}$$

Let S = 1 − u be the probability that a vertex belongs to the GC; then $S = 1-e^{-cS}$. It is hard to obtain a closed-form solution, but it can be solved graphically. As shown in Fig. 2.1, this is equivalent to finding the intersection point of $y=u$ and $y=e^{-c(1-u)}$, given different values of c (the average degree).

[Figure: the line y = S against the curves $1-e^{-cS}$ for c = 0.5, 1, 1.5]

Figure 2.1: Poisson Random Network Giant Component Size

When c > 1, there is a non-zero solution for S, i.e., a giant component exists.
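As a numerical alternative to the graphical solution, one can simply iterate the fixed-point equation; a minimal sketch:

```python
import numpy as np

def giant_component_size(c, iters=1000):
    # iterate S = 1 - exp(-c*S) to a fixed point (convergence assumed)
    S = 0.5                          # any start in (0, 1]
    for _ in range(iters):
        S = 1.0 - np.exp(-c * S)
    return S

for c in (0.5, 1.5, 3.0):
    print(c, giant_component_size(c))   # ~0 below c = 1, positive above
```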

It can also be shown that only a single giant component exists. Assume there are two giant components, that the probabilities of a node belonging to them are $S_1$ and $S_2$, and that their sizes are $S_1n$ and $S_2n$, respectively. Then there are $S_1S_2n^2$ pairs of nodes spanning the two components, and the probability that all such pairs are unconnected is:

$$
\begin{aligned}
P &= (1-p)^{S_1S_2n^2} = \Big(1-\frac{c}{n-1}\Big)^{S_1S_2n^2}\\
\ln P &= S_1S_2\lim_{n\to\infty} n^2\ln\Big(1-\frac{c}{n-1}\Big)\\
&= S_1S_2\Big(n^2\Big(-\frac{c}{n-1}+\frac{1}{2}\Big(\frac{c}{n-1}\Big)^2\Big)\Big)\\
&= S_1S_2\Big(-nc+\frac{c^2}{2}\Big)\\
P &= e^{\frac{S_1S_2c^2}{2}}\,e^{-S_1S_2nc}
\end{aligned}
\tag{2.3.2}
$$

Thus, when $n\to\infty$, $P\to 0$.

106 2.4 K-core size Estimation

2.4.1 Methodology

In a random network, the probability that a node has degree m is $p_m$. Denote by S the probability that a node belongs to the k-core (GPC); then S equals the probability that the node has at least k neighbors which also belong to the GPC:

$$S = \sum_{m=k}^{\infty} p_m \sum_{j=k}^{m} \binom{m}{j} S^j (1-S)^{m-j} \tag{2.4.1}$$

However, it is not easy to solve the above equation. Another approach is to look at the nodes which do not belong to the k-core. Randomly selecting an edge and following it to one end, the probability that the end vertex has k edges other than the followed one is called the excess degree distribution [73]:

$$q_k = \frac{(k+1)p_{k+1}}{\langle k\rangle} \tag{2.4.2}$$

where $\langle k\rangle = \sum_k k p_k$ is the mean degree of the network. The probability generating functions of $p_k$ and $q_k$ are:

$$G_0(z) = \sum_{k=0}^{\infty} p_k z^k \tag{2.4.3}$$

$$G_1(z) = \sum_{k=0}^{\infty} q_k z^k \tag{2.4.4}$$

Let u be the probability that following an edge leads to a vertex which does not belong to the GPC. This equals the probability that none of the other edges pointing to this vertex connects it to the GPC. Averaging over the excess degree distribution:

$$u = \sum_{k=0}^{\infty} q_k u^k = G_1(u) \tag{2.4.5}$$

and

$$G_1(u) = \frac{1}{\langle k\rangle}\sum_{k=0}^{\infty}(k+1)p_{k+1}u^k = \frac{1}{\langle k\rangle}\sum_{k=0}^{\infty}k p_k u^{k-1} = \frac{G_0^{(1)}(u)}{\langle k\rangle} \tag{2.4.6}$$

where $\langle k\rangle = G_0^{(1)}(1)$. Such that

$$u = G_1(u) = \frac{G_0^{(1)}(u)}{G_0^{(1)}(1)} = \frac{\sum_{k=1}^{\infty}k p_k u^{k-1}}{\sum_{k=1}^{\infty}k p_k} \tag{2.4.7}$$

Note that $G_1(0) = \frac{p_1}{\sum_{k=1}^{\infty}k p_k}$. So u can be derived from (2.4.7).

On the other hand, the probability that a vertex does not belong to the GPC is the probability that it has fewer than k edges connecting to vertices belonging to the GPC 18:

$$
\begin{aligned}
&\sum_m p_m u^m + \sum_m p_m\binom{m}{1}(1-u)u^{m-1}+\cdots+\sum_m p_m\binom{m}{k-1}(1-u)^{k-1}u^{m-k+1}\\
&= G_0(u) + (1-u)G_0^{(1)}(u)+\cdots+\frac{(1-u)^{k-1}}{(k-1)!}G_0^{(k-1)}(u)\\
&= \sum_{j=0}^{k-1}\frac{(1-u)^j}{j!}G_0^{(j)}(u)
\end{aligned}
\tag{2.4.8}
$$

So the expected size of the GPC can be calculated by plugging in the u obtained from (2.4.7):

$$S = 1 - \sum_{j=0}^{k-1}\frac{(1-u)^j}{j!}G_0^{(j)}(u) \tag{2.4.9}$$

In practice, the exact k-core can be found by repeatedly removing nodes with degree less than k.

18: A similar approach is used in [74] for estimating bicomponents.
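A numerical sketch of Eqs.(2.4.7) and (2.4.9) for an arbitrary degree distribution follows; the infinite series are truncated at a finite kmax, so the resulting values are approximations that depend on the truncation, and convergence of the fixed-point iteration is assumed.

```python
import math
import numpy as np

def gpc_size(pk, k_core, iters=2000):
    """pk[d] = probability of degree d; returns (u, S) of Eqs.(2.4.7)/(2.4.9)."""
    d = np.arange(len(pk))
    mean_k = float(np.sum(d * pk))
    u = 0.5
    for _ in range(iters):                      # fixed point of Eq.(2.4.7)
        u = float(np.sum(d[1:] * pk[1:] * u ** (d[1:] - 1)) / mean_k)
    S = 1.0
    for j in range(k_core):                     # Eq.(2.4.9), term by term
        falling = np.ones_like(d, dtype=float)  # d*(d-1)*...*(d-j+1)
        for r in range(j):
            falling = falling * np.maximum(d - r, 0)
        g0j = float(np.sum(falling * pk * u ** np.maximum(d - j, 0)))
        S -= (1.0 - u) ** j / math.factorial(j) * g0j
    return u, S

# power-law distribution of Eq.(2.4.10) with alpha = 3, truncated at kmax
kmax, alpha = 200, 3.0
pk = np.array([0.0] + [k ** -alpha for k in range(1, kmax + 1)])
pk /= pk.sum()                                  # normalize (finite-kmax zeta)
print(gpc_size(pk, k_core=3))
```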

108 2.4.2 Example

Given a random network with a power-law degree distribution where

$$p_k = \begin{cases} 0 & \text{if } k = 0\\[2pt] \dfrac{k^{-\alpha}}{\zeta(\alpha)} & \text{if } k > 0 \end{cases} \tag{2.4.10}$$

where $\zeta(\alpha)=\sum_{k=1}^{\infty}k^{-\alpha}$ and $\alpha>0$. According to Equation (2.4.7),

$$u = \frac{G_0^{(1)}(u)}{G_0^{(1)}(1)} = \frac{\sum_{k=1}^{\infty}k p_k u^{k-1}}{\sum_{k=1}^{\infty}k p_k} = \frac{\sum_{k=1}^{\infty}k\,k^{-\alpha}u^{k-1}}{\sum_{k=1}^{\infty}k\,k^{-\alpha}} \tag{2.4.11}$$

It is not easy to obtain a closed-form solution, but it can be solved graphically.

[Figure: graphical solution of Eq.(2.4.11) — the line y = u against curves for α = 1, 2, 3, 4]

Figure 2.2: Graphic solution of u

Fig. 2.2 shows the solution of u for power-law distributions with different α. There is a solution when 1.8 ≤ α ≤ 3.2, which is similar to the result of [73], which mentions that the solution of u exists when 2 < α < 3.47.

In this case, given α = 3, the approximate solution of u is about 0.85. So the GPC size is:

$$S_k = 1 - \sum_{j=0}^{k-1}\frac{(1-u)^j}{j!}G_0^{(j)}(u)\bigg|_{u=0.85} \tag{2.4.12}$$

In a 10000-node power-law network, the GPC (k-core) sizes (as fractions) for different k values are:

Table 2.2: GPC(k-core) size

     k = 1   k = 2   k = 3   k = 4   k = 5        k = 6
S_k  0.1857  0.0123  0.0031  0.0013  6.8916e-004  4.2818e-004

When k = 1, this is simply the giant component of the network.

Now, let us test on a real network. The Digg network snapshot captured in [64] has 279631 nodes; its out-degree distribution is shown in Fig. 2.3.

[Figure: log-log plot of the number of users versus out degree, with an α = 1.9 power-law fit]

Figure 2.3: Digg network out-degree distribution

The out-degree distribution of the Digg network is close to an α = 1.9 power-law degree distribution. According to the above equations, u = 0.07 and the 50-core size is about S = 0.0176. This means that in an α = 1.9 power-law network with 279631 nodes, about 0.0176 × 279631 = 4921 nodes are in the GPC. In the real Digg network, the actual 50-core includes 938 nodes. Since the real network is not exactly power law, the calculation is adjusted with the actual degree probability distribution. Plugging in the real out-degree distribution, u = 0.094, and the 50-core size is about 5593. Though there is a gap against the real network's k-core size when k is large, the estimated value is close when k is relatively small, as shown in Table 2.3. Also note, the above provides a mathematical way to estimate the GPC if the out-degree distribution is known. The exact size of the GPC can always be found by a program which keeps removing the nodes with out-degree less than k.
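The peeling procedure just mentioned takes only a few lines; the sketch below uses networkx and an undirected toy graph as a stand-in (for Digg, the peeling would use out-degree instead).

```python
import networkx as nx

def k_core_size(G, k):
    # repeatedly remove nodes whose degree falls below k until none remain
    H = G.copy()
    while True:
        low = [n for n, deg in H.degree() if deg < k]
        if not low:
            return H.number_of_nodes()
        H.remove_nodes_from(low)

G = nx.gnp_random_graph(1000, 0.01, seed=0)   # toy stand-in for the Digg graph
print(k_core_size(G, 5))                      # nx.k_core(G, 5) gives the same core
```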

Table 2.3: GPC(k-core) size

             k = 10  k = 20  k = 30  k = 40
S_k          20022   11745   8613    6795
digg k-core  16697   9358    5951    3638

Bibliography

[1] Y. Sun, R. Barber, M. Gupta, C. C. Aggarwal, and J. Han, “Co-author relationship prediction in heterogeneous bibliographic networks,” in Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’11, (Washington, DC, USA), pp. 121–128, IEEE Computer Society, 2011.

[2] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu, “Integrating meta-path selection with user-guided object clustering in heterogeneous information networks,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, (New York, NY, USA), pp. 1348–1356, ACM, 2012.

[3] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, “Rankclus: integrating clustering with ranking for heterogeneous information network analysis,” in Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’09, (New York, NY, USA), pp. 565–576, ACM, 2009.

[4] M. Ji, J. Han, and M. Danilevsky, “Ranking-based classification of heterogeneous information networks,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, (New York, NY, USA), pp. 1298–1306, ACM, 2011.

[5] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “Pathsim: Meta path-based top-k similarity search in heterogeneous information networks,” Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 992–1003, 2011.

[6] L. Backstrom and J. Leskovec, “Supervised random walks: predicting and recommending links in social networks,” in Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, (New York, NY, USA), pp. 635–644, ACM, 2011.

[7] M. E. Newman, “Clustering and preferential attachment in growing networks,” Physical review E, vol. 64, no. 2, p. 025102, 2001.

[8] L. A. Adamic and E. Adar, “Friends and neighbors on the web,” Social networks, vol. 25, no. 3, pp. 211–230, 2003.

[9] G. Chowdhury, Introduction to modern information retrieval. Facet publishing, 2010.

[10] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek, “Evolution of the social network of scientific collaborations,” Physica A: Statistical mechanics and its applications, vol. 311, no. 3, pp. 590–614, 2002.

[11] Y. Yang, N. Chawla, Y. Sun, and J. Han, “Predicting links in multi-relational and heterogeneous networks,” in 2012 IEEE 12th International Conference on Data Mining, pp. 755–764, IEEE, 2012.

[12] A. Goyal, F. Bonchi, and L. V. Lakshmanan, “Learning influence probabilities in social networks,” in Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, (New York, NY, USA), pp. 241–250, ACM, 2010.

[13] Y. Sun, J. Han, C. C. Aggarwal, and N. V. Chawla, “When will it happen?: relationship prediction in heterogeneous information networks,” in Proceedings of the fifth ACM international conference on Web search and data mining, WSDM ’12, (New York, NY, USA), pp. 663–672, ACM, 2012.

[14] J. Leskovec, M. McGlohon, C. Faloutsos, N. S. Glance, and M. Hurst, “Patterns of cascading behavior in large blog graphs,” in SDM, vol. 7, pp. 551–556, SIAM, 2007.

[15] J. Leskovec, A. Singh, and J. Kleinberg, “Patterns of influence in a recommendation network,” in Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD’06, (Berlin, Heidelberg), pp. 380–389, Springer-Verlag, 2006.

[16] J. Leskovec, L. A. Adamic, and B. A. Huberman, “The dynamics of viral marketing,” ACM Transactions on the Web (TWEB), vol. 1, no. 1, p. 5, 2007.

[17] G. V. Steeg, R. Ghosh, and K. Lerman, “What stops social epidemics?,” in Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain, July 17-21, 2011, 2011.

[18] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec, “Can cascades be predicted?,” in Proceedings of the 23rd international conference on World wide web, pp. 925–936, ACM, 2014.

[19] A.-L. Barabasi, “The origin of bursts and heavy tails in human dynamics,” Nature, vol. 435, no. 7039, pp. 207–211, 2005.

[20] J. Yang and S. Counts, “Predicting the speed, scale, and range of information diffusion in twitter,” in Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM 2010, Washington, DC, USA, May 23-26, 2010, pp. 355–358, 2010.

[21] P. A. Dow, L. A. Adamic, and A. Friggeri, “The anatomy of large facebook cascades,” in Proceedings of the Seventh International Conference on Weblogs and Social Media, ICWSM 2013, Cambridge, Massachusetts, USA, July 8-11, 2013., 2013.

[22] A. Kupavskii, L. Ostroumova, A. Umnov, S. Usachev, P. Serdyukov, G. Gusev, and A. Kustarev, “Prediction of retweet cascade size over time,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, (New York, NY, USA), pp. 2335–2338, ACM, 2012.

[23] J. Cannarella and J. A. Spechler, “Epidemiological modeling of online social network dynamics,” arXiv preprint arXiv:1401.4208, 2014.

[24] J. Leskovec, L. Backstrom, and J. Kleinberg, “Meme-tracking and the dynamics of the news cycle,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, (New York, NY, USA), pp. 497–506, ACM, 2009.

[25] R. Ghosh and K. Lerman, “A framework for quantitative analysis of cascades on networks,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, (New York, NY, USA), pp. 665–674, ACM, 2011.

[26] W. Kermack and A. McKendrick, “Contributions to the mathematical theory of epidemics. ii. the problem of endemicity,” Proceedings of the Royal society of London. Series A, vol. 138, no. 834, pp. 55–83, 1932.

[27] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, p. 1420, 1978.

[28] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: A complex systems look at the underlying process of word-of-mouth,” Marketing letters, vol. 12, no. 3, pp. 211–223, 2001.

[29] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’03, (New York, NY, USA), pp. 137–146, ACM, 2003.

[30] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, “Information diffusion through blogspace,” in Proceedings of the 13th International Conference on World Wide Web, WWW ’04, (New York, NY, USA), pp. 491–501, ACM, 2004.

[31] J. Yang and J. Leskovec, “Modeling information diffusion in implicit networks,” in International Conference on Data Mining (ICDM), pp. 599–608, 2010.

[32] S. A. Myers, C. Zhu, and J. Leskovec, “Information diffusion and external influence in networks,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, (New York, NY, USA), pp. 33–41, ACM, 2012.

[33] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf, “Uncovering the temporal dynamics of diffusion networks,” in International Conference on Machine Learning, pp. 561–568, 2011.

[34] N. Du, L. Song, M. Yuan, and A. J. Smola, “Learning networks of heterogeneous influence,” in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 2780–2788, Curran Associates, Inc., 2012.

[35] M. Gomez-Rodriguez, J. Leskovec, and A. Krause, “Inferring networks of diffusion and influence,” ACM Trans. Knowl. Discov. Data, vol. 5, pp. 21:1–21:37, Feb. 2012.

[36] S. A. Myers and J. Leskovec, “On the convexity of latent social network inference,” in Advances in Neural Information Processing Systems, pp. 1741–1749, 2010.

[37] E. Sadikov, M. Medina, J. Leskovec, and H. Garcia-Molina, “Correcting for missing data in information cascades,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, (New York, NY, USA), pp. 55–64, ACM, 2011.

[38] M. Kim and J. Leskovec, “The network completion problem: Inferring missing nodes and edges in networks,” in SDM, pp. 47–58, 2011.

[39] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, (New York, NY, USA), pp. 420–429, ACM, 2007.

[40] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in social networks,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, (New York, NY, USA), pp. 199–208, ACM, 2009.

[41] A. Goyal, W. Lu, and L. V. Lakshmanan, “Celf++: Optimizing the greedy algorithm for influence maximization in social networks,” in Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11, (New York, NY, USA), pp. 47–48, ACM, 2011.

[42] C. Zhou, P. Zhang, J. Guo, X. Zhu, and L. Guo, “Ublf: An upper bound based approach to discover influential nodes in social networks,” in Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 907–916, IEEE, 2013.

[43] C. C. Aggarwal, A. Khan, and X. Yan, On Flow Authority Discovery in Social Networks, ch. 45, pp. 522–533.

[44] Y. Yang, E. Chen, Q. Liu, B. Xiang, T. Xu, and S. A. Shad, “On approximation of real-world influence spread,” in Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, ECML PKDD’12, (Berlin, Heidelberg), pp. 548–564, Springer-Verlag, 2012.

[45] Q. Liu, B. Xiang, E. Chen, H. Xiong, F. Tang, and J. X. Yu, “Influence maximization over large-scale social networks: A bounded linear approach,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM ’14, (New York, NY, USA), pp. 171–180, ACM, 2014.

[46] M. Kimura and K. Saito, “Tractable models for information diffusion in social networks,” in Proceedings of the 10th European Conference on Principle and Practice of Knowledge Discovery in Databases, PKDD’06, (Berlin, Heidelberg), pp. 259–271, Springer-Verlag, 2006.

[47] W. Chen, C. Wang, and Y. Wang, “Scalable influence maximization for prevalent viral marketing in large-scale social networks,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, (New York, NY, USA), pp. 1029–1038, ACM, 2010.

[48] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha, “Scalable influence estimation in continuous-time diffusion networks,” in Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), pp. 3147–3155, Curran Associates, Inc., 2013.

[49] E. Cohen, D. Delling, T. Pajor, and R. F. Werneck, “Sketch-based influence maximization and computation: Scaling up with guarantees,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014, pp. 629–638, 2014.

[50] L. Yang, S. Bao, Q. Lin, X. Wu, D. Han, Z. Su, and Y. Yu, “Analyzing and predicting not-answered questions in community-based question answering services,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI’11, pp. 1273–1278, AAAI Press, 2011.

[51] A. Anderson, D. Huttenlocher, J. Kleinberg, and J. Leskovec, “Discovering value from community activity on focused question answering sites: A case study of stack overflow,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, (New York, NY, USA), pp. 850–858, ACM, 2012.

[52] M. Asaduzzaman, A. S. Mashiyat, C. K. Roy, and K. A. Schneider, “Answering questions about unanswered questions of stack overflow,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, (Piscataway, NJ, USA), pp. 97–100, IEEE Press, 2013.

[53] B. Vasilescu, A. Serebrenik, P. Devanbu, and V. Filkov, “How social q&a sites are changing knowledge sharing in open source software communities,” in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW ’14, (New York, NY, USA), pp. 342–354, ACM, 2014.

[54] B. V. Hanrahan, G. Convertino, and L. Nelson, “Modeling problem difficulty and expertise in stackoverflow,” in Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work Companion, CSCW ’12, (New York, NY, USA), pp. 91–94, ACM, 2012.

[55] S. Chang and A. Pal, “Routing questions for collaborative answering in community question answering,” in Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, (New York, NY, USA), pp. 494–501, ACM, 2013.

[56] B. Li and I. King, “Routing questions to appropriate answerers in community question answering services,” in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, (New York, NY, USA), pp. 1585–1588, ACM, 2010.

[57] T. C. Zhou, M. R. Lyu, and I. King, “A classification-based approach to question routing in community question answering,” in Proceedings of the 21st International Conference on World Wide Web, WWW ’12 Companion, (New York, NY, USA), pp. 783–790, ACM, 2012.

[58] Y. Liu, J. Bian, and E. Agichtein, “Predicting information seeker satisfaction in community question answering,” in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, (New York, NY, USA), pp. 483–490, ACM, 2008.

[59] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-time event detection by social sensors,” in Proceedings of the 19th international conference on World wide web, WWW ’10, (New York, NY, USA), pp. 851–860, ACM, 2010.

[60] B. Spice, “Press release: Quarter of tweets not worth reading, twitter users tell researchers.” http://www.cmu.edu/news/stories/archives/2012/february/feb1_twitterresearch.html, Feb 2012.

[61] “Common Vulnerabilities and Exposures.” http://cve.mitre.org/. (Sep. 2012).

[62] “Common Weakness Enumeration.” http://cwe.mitre.org/. (Oct. 2012).

[63] “Security Vulnerability Datasource.” http://www.cvedetails.com/. (Oct. 2012).

[64] K. Lerman and R. Ghosh, “Information contagion: An empirical study of the spread of news on digg and twitter social networks,” ICWSM, vol. 10, pp. 90–97, 2010.

[65] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: bringing order to the web,” 1999.

[66] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf, “Uncovering the temporal dynamics of diffusion networks,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 561–568, 2011.

[67] J. Tang, D. Zhang, and L. Yao, “Social network extraction of academic researchers,” in Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 292–301, IEEE, 2007.

[68] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998, ACM, 2008.

[69] J. Tang, L. Yao, D. Zhang, and J. Zhang, “A combination approach to web user profiling,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 1, p. 2, 2010.

[70] J. Tang, J. Zhang, R. Jin, Z. Yang, K. Cai, L. Zhang, and Z. Su, “Topic level expertise search over heterogeneous networks,” Machine Learning Journal, vol. 82, no. 2, pp. 211–237, 2011.

[71] J. Tang, A. C. Fong, B. Wang, and J. Zhang, “A unified probabilistic framework for name disambiguation in digital library,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 6, pp. 975–987, 2012.

[72] R. Li, S. Wang, H. Deng, R. Wang, and K. C.-C. Chang, “Towards social user profiling: unified and discriminative influence model for inferring home locations,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1023–1031, ACM, 2012.

[73] M. Newman, Networks: An Introduction. New York, NY, USA: Oxford University Press, Inc., 2010.

[74] M. Newman and G. Ghoshal, “Bicomponents and the robustness of networks to failure,” Physical review letters, vol. 100, no. 13, p. 138701, 2008.
