<<

ACM Transactions on Knowledge Discovery from Data

An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs For Peer Review Journal: ACM Transactions on Knowledge Discovery from Data

Manuscript ID: TKDD-2007-12-0071

Manuscript Type: Regular full paper submission

Date Submitted by the 30-Dec-2007 Author:

Complete List of Authors: Asur, Sitaram; Ohio State University, Computer Science and Engineering Parthasarathy, Srinivasan; Ohio State University, Computer Science and Engineering Ucar, Duygu; Ohio State University, Computer Science and Engineering

Keywords: Interaction networks, Evolutionary analysis Page 1 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 6 7 An Event-based Framework for Characterizing the 8 Evolutionary Behavior of Interaction Graphs 9 10 11 SITARAM ASUR 12 SRINIVASAN PARTHASARATHY 13 and 14 DUYGU UCAR 15 Ohio State University 16 17 18 For Peer Review 19 Interaction graphs are ubiquitous in many fields such as bioinformatics, sociology and physical 20 sciences. There have been many studies in the literature targeted at studying and mining these 21 graphs. However, almost all of them have studied these graphs from a static point of view. The 22 study of the evolution of these graphs over time can provide tremendous insight on the behavior of entities, communities and the flow of information among them. In this work, we present 23 an event-based characterization of critical behavioral patterns for temporally varying interaction 24 graphs. We use non-overlapping snapshots of interaction graphs and develop a framework for 25 capturing and identifying interesting events from them. We use these events to characterize 26 complex behavioral patterns of individuals and communities over time. We show how semantic 27 information can be incorporated to reason about community-behavior events. We also demonstrate 28 the application of behavioral patterns for the purposes of modeling evolution, link prediction and influence maximization. Finally, we present a diffusion model for evolving networks, based on our 29 framework. 30 31 Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications— Data Mining 32 33 General Terms: Algorithms, Measurement 34 Additional Key Words and Phrases: Interaction networks, Evolutionary analysis, Diffusion of innovations 35 36 37 38 39 1. INTRODUCTION 40 Many social and biological systems can be represented as complex interaction net- 41 works where nodes represent entities (i.e. individuals, proteins) and edges mimic 42 the interactions among them. These interaction networks arise from a wide variety 43 of scientific domains like computer science, physics, biology and sociology. Online 44 communities such as Flickr, MySpace and Orkut, e-mail networks, co-authorship 45 networks and WWW networks are examples of interesting interaction networks. 46 The study of these complex interaction networks can provide insight into their 47 structure, properties and behavior. 48 Early research in this area [Girvan and Newman 2002; Newman 2006; Barabasi 49 and Bonabeau 2003; Barabasi et al. 2002] has primarily focused on the static proper- 50 ties of these networks, neglecting the fact that most real-world interaction networks 51 are dynamic in nature. In reality, many of these networks constantly evolve over 52 time, with the addition and deletion of edges and nodes representing changes in the 53 interactions among the modeled entities. Recently, there has been some interest in 54 studying dynamic graphs [Leskovec et al. 2005; Backstrom et al. 2006; Chakrabarti 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001, Pages 1–32. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 2 of 42

1 2 3 4 5 2 · Sitaram Asur et al. 6 7 et al. 2006; Falkowski et al. 2006]. Identifying the portions of the network that are 8 changing, characterizing the type of change, predicting future events (e.g. link pre- 9 diction), and developing generic models for evolving networks are challenges that 10 need to be addressed. For instance, the rapid growth of online communities has dic- 11 tated the need for analyzing large amounts of temporal data to reveal community 12 structure, dynamics and evolution. 13 14 Interaction networks are often modular in nature. The interactions existing be- 15 tween nodes can be used to group them into clusters or communities. For instance, 16 in a social network, these clusters represent people with similar contact patterns 17 or interests. The problem of identifying these clusters or communities from static 18 graphs has bForeen extensiv elyPeerstudied [Girv anReviewand Newman 2002; Clauset et al. 2004; 19 Reichardt and Bornholdt 2004; Flake et al. 2002] over the past decade. However, 20 in the case of evolving graphs, the clusters are typically not static. Instead they 21 constantly change over time, as the network evolves. We believe that studying the 22 evolution of these clusters, in particular their formation, transitions and dissolution, 23 can be extremely useful for effectively characterizing the corresponding changes to 24 the network over time. 25 Another important aspect is the behavior of the nodes of the network. Nodes 26 of an evolving interaction network represent entities whose interaction patterns 27 change over time. The movement of nodes, their behavior and influence over other 28 nodes, can help make inferences regarding future interactions as well as predicting 29 changes to communities in the network. For instance, in a social network, if a 30 person is very sociable, the chances of him/her interacting with new people and 31 joining new groups is very high. In the case of a collaboration network, if a person 32 is known to collaborate frequently with different people, then the chance of a new 33 collaboration involving this person is high. The influence exerted by a node can 34 be studied in terms of its effect on other nodes. If several other people join a 35 community when a particular individual does, it indicates a high degree of positive 36 influence for that person. 37 38 Another important factor influencing behavior and evolution is the semantic na- 39 ture of the interactions itself. For instance, in a co-authorship network, two authors 40 are connected if they publish a paper together. The topic or the subject area of 41 the paper will definitely influence future collaborations for each of these authors. If 42 there are different authors working on similar topics, the chances of them collabo- 43 rating in the future is higher than two authors working on unrelated areas. We wish 44 to examine the influence of the semantics of the interaction on future interactions. 45 The study of diffusion or flow of information in an evolving network is impor- 46 tant for social science research, viral marketing applications and epidemiology. For 47 instance, pandemic viruses pose a severe threat to society due to their potential 48 to spread rapidly and cause tremendous widespread illnesses and deaths. In viral 49 marketing, the goal is to propagate an idea or innovation through an interaction 50 network. Analysis of the evolution of interactions in a social network and the 51 identification of influential nodes can be used to devise effective containment poli- 52 cies for pandemic disease spread in the case of epidemiological studies, as well as 53 ‘word-of-mouth’ advertising in marketing. 54 In this paper, we provide an event-based framework for characterizing the evolu- 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 3 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 3 6 7 tion of interaction networks. We begin by converting an evolving graph into static 8 snapshot graphs at different time points. We obtain clusters at each of these snap- 9 shots independently. Next, we characterize the transformations of these clusters by 10 defining and identifying certain critical events. We define efficient incremental al- 11 gorithms involving bit-matrix computations for this purpose. We use these critical 12 events to compute and reason about novel behavior-oriented measures, which offer 13 new and interesting insights for the characterization of dynamic behavior of inter- 14 action graphs. We also develop semantic information measures that can be used 15 to analyze and reason about community behavior. We illustrate our framework 16 17 on three different evolving networks - the DBLP co-authorship network, Wikipedia 18 and a clinicalFortrials patien Peert network. In eacReviewh case, the behavioral patterns that we 19 discover using our framework help us make useful inferences about cluster evolution 20 and link prediction. Finally, we use the behavioral measures to detail a diffusion 21 model for evolving networks and demonstrate their application for the task of in- 22 fluence maximization. 23 In short, the key contributions of this work are 24 (1) The identification of key critical events that occur in evolving interaction 25 networks. 26 (2) Efficient incremental algorithms for the discovery of these critical events 27 (3) Novel behavioral measures for stability, sociability, influence and popularity 28 that can be computed incrementally over time 29 30 (4) A diffusion model for evolving networks based on our framework 31 (5) Application of the events and behavioral measures on two real datasets for 32 modeling evolution, predicting behavior and trends (link prediction) and influence 33 maximization. 34 35 2. RELATED WORK 36 There has been enormous interest in mining interaction graphs for interesting pat- 37 terns in various domains. However, the majority of these studies [Girvan and 38 Newman 2002; Newman 2006; Barabasi and Bonabeau 2003; Barabasi et al. 2002; 39 Clauset et al. 2004; Reichardt and Bornholdt 2004; Flake et al. 2002] have fo- 40 cused on mining static graphs to identify community structures, patterns and novel 41 information. Recently, the dynamic behavior of clusters and communities have at- 42 tracted the interest of several groups. Leskovec et al [Leskovec et al. 2005] studied 43 the evolution of graphs based on various topological properties, such as the degree 44 distribution and small-world properties of large networks. They proposed a graph 45 generation model, called Forest Fire model, to explain their findings about evolu- 46 tionary behaviors of graphs. Backstrom et al [Backstrom et al. 2006] studied forma- 47 tion of groups and the ways they grow and evolve over time. To estimate probability 48 of an individual joining a community, they proposed using features of communi- 49 ties and individuals, applying decision-tree techniques. To identify communities 50 that are likely to grow, they also used community features on a decision-tree based 51 analysis. 52 Chakrabarti et al [Chakrabarti et al. 2006] proposed evolutionary settings for two 53 widely-used clustering algorithms (k-means and agglomerative hierarchical cluster- 54 ing). They defined evolutionary clustering as the task of incrementally obtaining 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 4 of 42

1 2 3 4 5 4 · Sitaram Asur et al. 6 7 high-quality clusters for a set of objects while also maintaining similarity with 8 clusters identified in previous timestamps. To obtain the clusters for a particular 9 snapshot, they also used history information to obtain a clustering consistent with 10 earlier snapshots. Falkowski et al [Falkowski et al. 2006] analyzed the evolution 11 of communities that are stable or fluctuating based on subgroups. Although they 12 analyzed interaction graphs, their focus is different from ours. They have exam- 13 ined overlapping snapshots of interaction graphs and applied standard statistical 14 measures to identify persistent subgroups. Our focus is on identifying key events 15 and behavioral patterns that can characterize, model and predict future behavioral 16 17 trends. We are interested in changes that occur over time. In this regard, we 18 specifically targetFornodes ofPeerthe network andReviewanalyze their evolutionary behavior. 19 The seminal paper by Samtaney et al [Samtaney et al. 1994] described an ap- 20 proach for extracting coherent regions from 2-dimensional and 3-dimensional scalar 21 and vector fields for tracking purposes. To study the evolution of these regions over 22 time, they presented certain evolutionary events for objects. Event-based meth- 23 ods have also been applied on spatial data [Yang et al. 2005] and clustered stream 24 data [Spiliopoulou et al. 2006]. Although they use events, they do not deal with 25 evolving graphs. Kalnis et al [Kalnis et al. 2005] have studied spatio-temporal 26 datasets to identify moving clusters for object trajectories. They have defined a 27 moving cluster as a sequence of spatial clusters that appear in consecutive snapshots 28 and that share a large number of common objects. 29 The use of Semantic similarity based on ontologies has been studied many times in 30 the past [Lin 1998; Ganesan et al. 2003]. It has been successfully applied on various 31 taxonomies such as WordNet [Richardson et al. 1994] and the Gene Ontology [Couto 32 et al. 2005]. Resnik [Resnik 1999] suggested a novel way to evaluate semantic 33 similarity in an ontology based on notion of information content. In our work, 34 semantic similarity concepts are used to quantify similarity for clusters over time. 35 36 3. PROBLEM DEFINITION 37 Before describing our event-based framework in detail, we introduce the basic no- 38 tations used throughout the paper. As we mentioned earlier, our focus in this work 39 is to study the evolution of graphs, in particular to understand behavioral pat- 40 terns for communities and individuals over time. In order to fully understand the 41 temporal evolution of graphs, it becomes necessary to study and characterize the 42 transformations undergone by the graph at different time instants along the way. 43 In this regard, we make use of temporal snapshots to examine static versions of the 44 evolving network at different time points. 45 46 Definition: An interaction graph G is said to be evolving if its interactions vary 47 over time. Let G = (V, E) denote a temporally varying interaction graph where V 48 represents the total unique entities and E the total interactions that exist among 49 the entities. We define a temporal snapshot Si = (Vi, Ei) of G to be a graph repre-

50 senting only entities and interactions active in a particular time interval [Tsi , Tei ], 51 called the snapshot interval. 52 As the graph evolves, new nodes and edges can appear. Similarly, nodes and 53 edges can also cease to exist. This dynamic behavior of a graph over time can thus 54 be represented as a set of S equal, non-overlapping temporal snapshots. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 5 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 5 6

7 A B F A B F A B F 8 E 9 E E 10 C D G C D G C D G 11 T1 T2 T1,2 12 13 Fig. 1. Temporal Snapshots a) at Time t=1 b) at Time t=2 c) Cumulative snapshot at 14 Time t=2 15 16 Note that, different snapshots are mutually exclusive. They do not contain any 17 information in common. This is in contrast to the representation provided in some 18 earlier researcForh [Chakrabarti Peeret al. 2006; LinReviewet al. 2004] which define a snapshot 19 considering the entire set of interactions upto the current time interval. Figure 1(a- 20 b) illustrates an example evolving graph over two time intervals. We find that 21 in the first time interval, interactions exist between A and C, and between A 22 and D. In the second time interval, these interactions do not continue to exist. 23 Figure 1(c) depicts a cumulative snapshot of the second time interval. We find that 24 the information regarding the loss of interactions AC and AD is lost. Also, the 25 community structure depicted in Figure 1(c) does not reflect the actual structure. 26 To prevent this loss of information, we choose short time intervals and generate 27 snapshots representing only the information of that specific interval. The collection 28 of all T temporal snapshots is represented by S = {S , S , . . . , ST }. 29 1 2 To study the evolution of the graph, we need a representation of its structure 30 at different snapshots. For this purpose, we generate clusters for each snapshot 31 graph. Each Si is partitioned into ki communities or clusters denoted by Ci = 32 1 2 ki th j j j {Ci , Ci , . . . , Ci }. The j cluster of Si, Ci is also a graph denoted by (Vi , Ei ) 33 j j j j 34 where Vi are nodes in Si and Ei denotes the edges between nodes in Vi . Finally, 1 2 ki 35 for each Si = (Vi, Ei), Vi ∪ Vi ∪ . . . ∪ Vi = Vi. 36 To choose a clustering algorithm for this work, we examined the performance 37 of various graph clustering algorithms on several interaction graphs in terms of 38 modularity of clusters. We found that the MCL algorithm [van Dongen 2000], a fast 39 and scalable unsupervised clustering algorithm, consistently yielded clusters of high 40 modularity. Hence, we use MCL to obtain the clusters at different timestamps 1. 41 The MCL algorithm does not require a parameter specifying the number of clusters. 42 Instead it uses a granularity parameter and the cluster structure prevalent in the 43 graph to determine the number of partitions. Accordingly, for each snapshot, the 44 number of clusters may vary depending on the interactions in that time interval. 45 We used a granularity parameter of 1.2 for our experiments, since the graphs were 46 fairly sparse. 47 Algorithm 1 shows the outline of the framework we propose. We design an incre- 48 mental strategy to mine the clusters over time to identify significant changes that 49 occur among snapshots, referred to as critical events. These events are then used to 50 study more complex behavioral patterns. In Section 5, we will describe the critical 51 events and how we find them. In Section 6, we mine these events further to find 52

53 1 54 Note that, the event-based framework we propose is relatively independent of the clustering algorithm used to obtain the snapshot clusters. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 6 of 42

1 2 3 4 5 6 · Sitaram Asur et al. 6 7 complex behavioral patterns for analysis. 8 9 10 11 Algorithm 1 Mine-Events(G,T ) 12 Input: Interaction graph G = (V, E) and T , the 13 number of intervals 14 Convert graph G = (V, E) into T temporal snapshots S = {S1, S2, . . . , ST }. 15 for i = 1 to T do 16 Cluster Si 1 2 ki Ci = {Ci , Ci , . . . , Ci } 17 end for 18 for i = 1 to TFor− 1 do Peer Review Events = Find events(Si,Si+1) [Section 5] 19 Mine Events for complex patterns [Section 6] 20 end for 21 22 23 4. DATASETS 24 25 We employ three different datasets in this work. 26 4.1 DBLP co-authorship network: 27 28 The DBLP bibliography maintains information on more than 800000 computer 29 science publications. We used the DBLP data to generate a co-authorship net- 30 work representing authors publishing in several important conferences in the field 31 of databases, data mining and AI. We chose all papers over a 10 year period (1997- 32 2006) that appeared in 28 key conferences spanning mainly these three areas. The 33 conferences we considered are - (PKDD, ACL, UAI, NIPS, KR, KDD, ICML, 34 ICCV, IJCAI, CVPR, AAAI, ER, COOPIS, SSDBM, DOOD, SSD, FODO, DAS- 35 FAA, DEXA, ICDM, IDEAS, CIKM, EDBT, ICDT, ICDE, VLDB, PODS, SIG- 36 MOD). We converted this data into a co-authorship graph, where each author is 37 represented as a node and an edge between two authors corresponds to a joint 38 publication by these two authors. The graph spanning 10 years contained 23136 39 nodes and 54989 edges. We chose the snapshot interval to be a year, resulting in 40 10 consecutive snapshot graphs. These graphs are then clustered and analyzed to 41 identify critical events and patterns. We believe that studying the evolution of the 42 DBLP dataset can afford information about the nature of collaborations and the 43 factors that influence future collaborations between authors. 44 45 46 4.2 Wikipedia Dataset 47 The Wikipedia online encyclopaedia is a large collection of webpages providing 48 comprehensive information concerning various topics. The dataset we employ rep- 49 resents the Wikipedia revision history and was obtained from Berberich [Berberich 50 et al. 2007]. It consists of a set of webpages as well as links among them. It com- 51 prises of the editing history from January 2001 to December 2005. The temporal 52 information for the creation and deletion of nodes (pages) and edges (links) are 53 also provided. Since our primary goal for using this data is to perform seman- 54 tic analysis, we make use of a category hierarchy, which we have obtained from 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 7 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 7 6 7 Gabrilovich [Gabrilovich and Markovitch 2007]. The Wikipedia revision history 8 dataset consisted of around 1.6 million webpages (nodes). We chose all webpages 9 for which category information was available We also included some pages that 10 did not have category information but had strong links to the chosen pages. The 11 resulting dataset consisted of 779005 nodes (webpages), 32.5 M edges and 78664 12 categories. We constructed snapshots of 3 month intervals, and considered the first 13 10 snapshots for our analysis. Each snapshot consisted only of nodes and edges 14 (links) that existed during that particular interval. 15 16 4.3 Clinical Trials Data: 17 18 In clinical trials,Forpharmaceutical Peercompanies Reviewtest a new drug for efficacy and toxicity 19 - efficacy to evaluate its effectiveness in curing or controlling the disease in question 20 and toxicity to determine if the drug is safe for consumption and with minimal side 21 effects. Releasing a drug that turns out to be toxic can cost companies billions of 22 dollars and more importantly lead to loss in life. In this paper we use a dataset ob- 23 tained from a major pharmaceutical company, consisting of both healthy people as 24 well as patients suffering from certain diseases (diabetes and hepatic impairment). 25 As part of the study, they were given either a placebo (a formulation that includes 26 only the inactive ingredients) or the drug under study. Liver toxicity information 27 can be obtained from eight serum analytes, often referred to in the literature as 28 the liver panel. This dataset consists of patients on the placebo and the drug in 29 the ratio 40:60. The initial snapshot of this data is composed of the measurements 30 of the analytes obtained before patients were treated with the drug or the placebo. 31 The subsequent snapshots correspond to measurements taken every week. The data 32 thus consists of 7 snapshots spanning a 6 week period since the beginning of the 33 treatment. We transformed the data for each snapshot into a graph, based on the 34 correlations that exist between the analyte values of patients. If there exists a high 35 correlation (greater than a threshold Tcorr) in the analyte values between two pa- 36 tients, the two patients have an edge between them in the snapshot graph 2. Note 37 that, if we consider each patient separately, we are limited to only intrinsic infor- 38 mation, whereas by modeling patients as a graph, we are able to utilize intrinsic as 39 well as extrinsic properties. 40 41 5. CRITICAL EVENTS 42 43 In this section, we introduce and afford a formal definition to certain critical 44 events that occur in evolving graphs. Some of the critical events described in 45 this section are inspired by a similar notion described by Samtaney et al [Samtaney 46 et al. 1994]. in the context of tracking and visualizing features. They have also 47 been used since, for tracking spatial objects [Yang et al. 2005] and clustered text 48 streams [Spiliopoulou et al. 2006]. 49 The events that we define are primarily between two consecutive timestamps but 50 it is possible to coalesce events from contiguous timestamps by analyzing the meta- 51 data collected from the event mining framework. We proceed to use these events 52 in the later sections to define more complex behavior. We distribute the critical 53 54 2We examined the distribution of correlations and picked a Tcorr value of 0.7 for our experiments 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 8 of 42

1 2 3 4 5 8 · Sitaram Asur et al. 6 7 1 1 C C 8 1 2 9 10

2 2 1 C1 11 C2 C 3

12 t=1 t=2 t=3 1 1 1 C 13 C C 6 4 5 3 C2 C 14 6 6 15 C 2 C3 C 2 C3 C4 C4 C5 16 4 4 5 5 5 6 6

17 t=4 t=5 t=6 18 ForFig. 2.PeerTemporal Snapshots Reviewat time t=1 to 6 19 20 21 events which graphs can undergo into two categories - events involving communities 22 and events involving individuals. 23 Figure 2 displays a set of snapshots of the network which will be used as a running 24 example in this section. At time t = 1, 2 clusters are discovered (shown in different 25 colors). 26 27 Events involving communities: We define 5 basic events which clusters can 28 undergo between any two consecutive time intervals or steps. Let Si and Si+1 be 29 snapshots of S at two consecutive time intervals with Ci and Ci+1 denoting the set 30 of clusters respectively. The five proposed events are: 31 j k j 32 1) Continue: A cluster Ci+1 is marked as continuation of Ci if Vi+1 is the same k 33 as Vi . We do not impose the constraint that the edge sets should be the same. 34 k j k j Continue(Ci , Ci+1) = 1 iff Vi = Vi+1 35 The main motivation behind this is that if certain nodes are always part of the 36 same cluster, any information supplied to one node will eventually reach the others. 37 Therefore, as long as the vertex set remains same, the information flow is not hin- 38 dered. The addition and deletion of edges merely indicates the strength between 39 the nodes. An example of a continue event is shown at t=2 in Figure 2. Note that 40 an extra interaction appears between the nodes in Cluster C 1 but the clusters do 41 2 not change. 42 43 2) κ-Merge: Two different clusters Ck and Cl are marked as merged if there exists 44 i i 45 a cluster in the next timestamp that contains at least κ% of the nodes belonging to these two clusters. The essential condition for a merge is : 46 k l j 47 Merge(Ci , Ci , κ) = 1 iff ∃Ci+1 such that 48 |(V k ∪ V l) ∩ V j | 49 i i i+1 > κ% (1) k l j 50 Max(|Vi ∪ Vi |, |Vi+1|) 51 k l 52 k j |Ci | l j |Ci | and |Vi ∩ Vi+1| > 2 and |Vi ∩ Vi+1| > 2 . This condition will only hold if 53 there exist edges between V k and V l in timestamp i + 1. Intuitively, it implies that 54 i i new interactions have been created between nodes which previously were part of 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 9 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 9 6 7 different clusters. This caused κ% 3 of nodes in the two original clusters to join 8 the new cluster. Note that, in an ideal or complete merge, with κ = 100, all nodes 9 in the two original clusters are found in the same cluster in the next timestamp. 10 The two original clusters are completely lost in this scenario. Figure 2 shows an 11 example of a complete merge event at t=3. The dotted lines represent the newly 12 created edges. All the nodes now belong to a single cluster (C 1). 13 3 14 3) κ-Split: A single cluster Cj is marked as split if κ% of nodes from this cluster 15 i are present in 2 different clusters in the next timestamp. The essential condition is 16 that: 17 j k l 18 Split(Ci , κFor) = 1 iff ∃C iPeer+1, Ci+1 such that Review 19 |(V k ∪ V l ) ∩ V j | i+1 i+1 i > κ% (2) 20 k l j 21 Max(|Vi+1 ∪ Vi+1|, |Vi |) 22 |Ck | |Cl | and |V k ∩ V j | > i+1 , |V l ∩ V j | > i+1 . 23 i+1 i 2 i+1 i 2 24 25 Intuitively, a split signifies that the interactions between certain nodes are broken 26 and not carried over to the current timestamp, causing the nodes to part ways and 27 join different clusters. Also note that a broken edge, by itself, does not necessarily 28 indicate a split event, as there may be other interactions existing between vertices 29 in the cluster (similar to the notion of k-connectivity). Time t=4 in Figure 2 shows 30 a split event when a cluster gets completely split into three smaller clusters. 31 k 32 4) Form:A new cluster Ci+1 is said to have been formed if none of the nodes in 33 the cluster were grouped together at the previous time interval i.e. no 2 nodes in k 34 Vi+1 existed in the same cluster at time period i. k j k j 35 F orm(Ci+1) = 1 iff ∃ no Ci such that Vi+1 ∩ Vi > 1 36 Intuitively, a form indicates the creation of a new community or new collaboration. 37 Figure 2 at time t=5 shows a form event when two new nodes appear and a new 38 cluster is formed. 39 k 40 5) Dissolve: A single cluster Ci is said to have dissolved if none of the vertices in 41 the cluster are in the same cluster in the next timestamp i.e. no two entities in the 42 original cluster have an interaction between them in the current time interval. 43 k j k j Dissolve(Ci ) = 1 iff ∃ no Ci+1 such that Vi ∩ Vi+1 > 1 44 Intuitively, a dissolve indicates the lack of contact or interactions between a group 45 of nodes in a particular time period. This might signify the breakup of a community 46 or a workgroup. Figure 2 at time t=6 shows a dissolve event when there are no 47 1 longer interactions between the three nodes in Cluster C5 resulting in a breakup of 48 the cluster into 3 clusters - C1, C2 and C3. 49 6 6 6 50 Events involving individuals: We wish to analyze not only the evolution of com- 51 munities but the influence of the behavior of individuals on communities. In this 52 regard, we introduce four basic transformations involving individuals over snap- 53 54 3We used κ values of 30 and 50 in our experiments. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 10 of 42

1 2 3 4 5 10 · Sitaram Asur et al. 6 7 shots. 8 9 1) Appear: A node is said to appear when it occurs in C j but was not present in 10 i any cluster in the earlier timestamp. 11 Appear(v, i) = 1 iff v ∈/ V and v ∈ V 12 i−1 i This simple event indicates the introduction of a person (new or returning) to a 13 network. In Figure 2, at time t = 5 two new nodes appear in the network. 14 15 2) Disappear: A node is said to disappear when it was found in a cluster C j 16 i−1 17 but is not present in any cluster in the timestamp i. 18 Disappear(v, i) = 1 iff v ∈ Vi−1 and v ∈/ Vi This indicatesForthe departure Peerof a person fromReviewa network. In Figure 2, at time t = 6 19 4 20 two nodes of cluster C5 disappear from the network.

21 j 22 3) Join: A node is said to join cluster Ci if it exists in the cluster at timestamp i. This may be due to an Appear event or due to a leave event from a different 23 j 24 cluster. Note that in case, the cluster Ci must be sufficiently similar to a cluster k 25 Ci−1

26 k j j k k j |Ci−1| k 27 Join(v, Ci ) = 1 iff ∃Ci and Ci−1 such that Ci−1 ∩ Ci > 2 and v ∈/ Vi−1 and 28 j v ∈ Vi 29 j The cluster similarity condition ensures that Ci is not a newly formed cluster. 30 This condition differentiates a Join event from a F orm event. Nodes forming a 31 new cluster will not be considered to be Join events since there will be no cluster 32 |Ck | Ck in the previous timestamp with similarity > i−1 with the newly formed 33 i−1 2 cluster. 34 35 k 4) Leave: A node is said to leave cluster Ci−1 if it no longer is present in a cluster 36 k 37 with most of the nodes in Vi−1. A node that leaves a cluster may leave the network 38 as a Disappear event or may join a different cluster. In a collaboration network, a Leave event might correspond to a student graduating and leaving a group. 39 k j j k k j |Ci−1| k 40 Leave(v, Ci ) = 1 iff ∃Ci and Ci−1 such that Ci−1 ∩ Ci > 2 and v ∈ Vi−1 j 41 and v ∈/ Vi 42 The similarity constraint between the two clusters is used to maintain cluster 43 correspondence. Note that if the original cluster dissolves, the nodes in the cluster 44 are not said to participate in a Leave event. This is due to the fact that there will 45 |Ck | no longer be a cluster with similarity > i−1 with the dissolved cluster Ck . 46 2 i−1 47 5.0.1 Algorithms for Event Extraction:. We leverage the use of efficient bit ma- 48 trix operations to compute the events between snapshots. First, for each temporal 49 snapshot, we construct a binary ki × n matrix Ti where ki is the number of clusters 50 at timestamp i and n is the number of nodes. We then compare the matrices of suc- 4 51 cessive snapshots to find events between them . Let Ti(x, :) and Ti(:, y) correspond 52

53 4 54 If the number of nodes changes between the timestamps, we will increase the length of the matrices to reflect the largest of the two 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 11 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 11 6 7 th th to the x row and y column vector of matrix Ti respectively. To compute all the 8 events between two snapshots, we perform a set of binary operations (AND and 9 OR) on the corresponding matrices. The linear operations performed to identify 10 each event are presented below. Let |x| represent the L1-norm of a binary vector x. 11 1 12 13 |x| 14 |x|1 = X xi (3) 15 i=1 16 We can compute the events as : 17 Dissolve(Ti,Ti+1) = {x|1 ≤ x ≤ ki, arg max1≤y≤k (|AND(Ti(x, :), Ti+1(y, :))|1 ≤ 1} 18 For Peeri +1Review 19 Form(Ti,Ti+1) = Dissolve(Ti+1,Ti)

20 Merge(Ti,Ti+1,κ) = {< x, y, z > |1 ≤ x ≤ ki, 21 1 ≤ y ≤ ki, x 6= y, 1 ≤ z ≤ ki+1, |AND(OR(Ti(x, :), Ti(y, :)), Ti+1(z, :))|1 ≥ κ, 22 |Ti(x,:)|1 |AND(Ti(x, :), Ti+1(z, :))|1 ≥ 2 , 23 |Ti(y,:)|1 |AND(Ti(y, :), Ti+1(z, :))|1 ≥ 2 } 24 25 Split(Ti,Ti+1,κ)= Merge(Ti+1,Ti,κ)

26 Continue(Ti,Ti+1)= {< x, y > |1 ≤ x ≤ ki, 1 ≤ y ≤ ki+1, 27 OR(Ti(x, :), Ti+1(y, :)) == AND(Ti(x, :), Ti+1(y, :))}

28 Appear(Ti,Ti+1) = {v|1 ≤ v ≤ |V |, 29 |Ti(:, v)|1 == 0, |Ti+1(:, v)|1 == 1}

30 Disappear(Ti,Ti+1) = {v|1 ≤ v ≤ |V |, 31 |Ti(:, v)|1 == 1, |Ti+1(:, v)|1 == 0}

32 Join(Ti,Ti+1) = {< y, v > |1 ≤ y ≤ ki+1, 1 ≤ v ≤ |V |, Ti+1(y, v) == 1, |Ti(x,:)|1 33 ∃x, 1 ≤ x ≤ ki s.t. |AND(Ti(x, :), Ti+1(y, :))|1 > 2 , 34 Ti(x, v) == 0} 35 Leave(Ti,Ti+1)= {< x, v > |1 ≤ x ≤ ki, 1 ≤ v ≤ |V |, Ti(x, v) == 1, 36 T x, ∃y, 1 ≤ y ≤ k s.t. |AND(T (x, :), T (y, :))| > | i( :)|1 , 37 i+1 i i+1 1 2 T (y, v) == 0} 38 i+1 T (x, y) represents the value in the xth row and yth column of T . The construc- 39 i i tion of the matrices and the operations to find the events are all linear in time 40 complexity(O(n)), assuming that k << n and k << n. The timing results 41 i i+1 42 for event detection for various values of n are shown in Table I. The number of 43 clusters, ki and ki+1 are 50 in each case. The advantage of using the bit ma- 44 trix operations is that they enable us to leverage GPU [Govindaraju et al. 2005] 45 and multi-core [Kurzak and Dongarra 2006] architectures quite efficiently. Note 46 that, since we are computing events for two timestamps at a time, the whole event 47 detection process can be trivially parallelized. 48 49 6. BEHAVIORAL ANALYSIS 50 Next, we use the critical events to study important behavioral patterns in evolving 51 graphs. Most of the research on evolving networks [Backstrom et al. 2006; Falkowski 52 et al. 2006] have focused solely on analyzing community behavior. In this section, 53 we begin by presenting some movement-based measures to study the behavior of 54 nodes in the network and their influence on others. In the second part, we describe 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 12 of 42

1 2 3 4 5 12 · Sitaram Asur et al. 6 7 Number of nodes Time (secs) 8 1000 0.006507 9 10000 0.0721 10 100000 0.714485 11 1000000 7.987603 12 2000000 10.435108 4000000 20.8885 13 14 Table I. Timing Results 15 16 how semantic information can be incorporated to aid the analysis of community- 17 based events. 18 6.1 Movement-basedForAnalysis Peer Review 19 20 Movement for individuals is defined using the basic events - Join and Leave. We 21 use these basic events to identify more complex behavior. In particular, we are 22 interested in capturing the behavioral tendencies of individuals that contribute to 23 the evolution of the graph. We wish to use these behavioral patterns to perform 24 reasoning and predict future trends of the graph. We define four behavioral mea- 25 sures that can be incrementally computed at each time interval using the events 26 discovered in the current interval. 27 6.1.1 Stability Index. The Stability index measures the tendency of a node to 28 have interactions with the same nodes over a period of time. A node is highly stable 29 if it belongs to a very stable cluster, one that does not change much over time. Let 30 cl (x) represent the cluster that node x belongs to in the ith time interval. The 31 i Stability Index (SI) for node x over T timestamps is measured incrementally as: 32 T 33 |cli(x)| 34 SI(x, T ) = X Vi (Leave(j, cl (x)) + Join(j, cl (x))) 35 i=1 Pj=1 i i 36 and Activity(x) >= Min activity (4) 37 38 39 T Here Activity(x) = Pi=1(x ∈ Vi) represents the number of time intervals x is ac- 40 tive in. The threshold Min activity corresponds to the minimum number of active 41 intervals for a node to be considered sociable. We used a Min activity value of 1/2 42 the number of time intervals, for our experiments. 43 44 Stability for the Clinical Trials Dataset: In the case of the clinical tri- 45 als data, nodes in a cluster correspond to individuals having similar observations. 46 When a node has a low Stability index score, it indicates that the observations of 47 that particular patient fluctuate appreciably. This causes the node to jump from 48 one cluster to another repeatedly. This behavior represents an anomaly and can 49 indicate possible side-effects of the drug being administered. Note that the dataset 50 contains two groups of people, one group on the placebo and the other group on the 51 drug with a distribution of 40:60. If the people with very low Stability index (out- 52 liers in this case) happen to be people on the drug, there is a reasonable indication 53 that there may be a hepatoxic effect from the drug intake, whereas if it is uniform 54 over both sets of people, it would indicate there are no noticeable side-effects. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 13 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 13 6 7 Disease Treat- Age Sex 8 ment diabetes +/- renal impairment Drug 62 M 9 diabetes +/- renal impairment Drug 59 M 10 hepatic impairment Drug 56 M diabetic neuropathy Drug 66 F 11 diabetes +/- renal impairment Drug 60 M 12 diabetes +/- renal impairment Drug 62 F 13 diabetes +/- renal impairment Drug 70 F diabetes +/- renal impairment Drug 66 M 14 diabetes +/- renal impairment Drug 55 M 15 diabetes +/- renal impairment Drug 50 M diabetes +/- renal impairment Drug 49 M 16 hepatic impairment Drug 50 M 17 diabetic neuropathy Drug 69 M 18 diabetes mellitus (type 2 niddm) Drug 52 M For hepaticPeerimpairment ReviewDrug 48 M 19 hepatic impairment Drug 48 M 20 diabetes +/- renal impairment Drug 49 M hepatic impairment Drug 49 M 21 diabetes mellitus (type 2 niddm) Placebo 56 M 22 23 24 Table II. Low Stability Index - Clinical Trials Data 25 26 Accordingly, we computed the Stability index for all the nodes in the clinical trial 27 data. On examination, we found 19 nodes having very low Stability index scores 28 (below a threshold), indicating that these nodes move between clusters in almost 29 every time interval that they are active. This suggests a significantly unstable 30 behavior exhibited by these nodes. The 19 patients are shown in Table II. 31 In this particular application unstable nodes (patients) are a cause for concern 32 since that may be an indication of toxicity. Drilling down on the nineteen most 33 unstable nodes we find that only one of them is on the placebo (rest were on the 34 drug). In fact drilling down even further it is observed that out of the top 200 most 35 unstable patients, eighty percent were on the drug. This indeed was very suspicious 36 given the original distribution (40:60) and points to potential toxicity. As it turns 37 out, according to domain experts, this drug was discontinued for toxicity 38 two years after this study was conducted. 39 40 6.1.2 Sociability Index. For the DBLP data, we define a related measure, called 41 the Sociability Index. The Sociability Index is a measure of the number of different 42 interactions that a node participates in. This behavior can be captured by the 43 number of Join and Leave events that this node is involved in. Let cli(x) be the 44 cluster that node x belongs to at time i. Then, the Sociability Index is defined as: 45 T (Join(x, cl (x)) + Leave(x, cl (x))) 46 SoI(x) = Pi=1 i+1 i (5) 47 |Activity(x)| 48 T where Activity(x) = Pi=1(x ∈ Vi) indicates the number of intervals that node x is 49 active. Similar to the stability index, this is computed incrementally. The measure 50 gives high scores to nodes that are involved in interactions with different groups. 51 Note that this measure does not represent the degree, which is a factor of the 52 number of interactions a node is involved in. A case in point was a node that had 53 a degree of 80, but a sociability of close to 0. When we examined the clusters the 54 node belonged to, we found that the node interacted with the same nodes over 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 14 of 42

1 2 3 4 5 14 · Sitaram Asur et al. 6 7 several timestamps. 8 9 Application of Sociability for Link Prediction: 10 11 12 Problem Definition: The goal in link prediction [Liben-Nowell and Kleinberg 13 2003] is to use past interaction information to predict future links between nodes. 14 Since, in this paper we are analyzing the evolution of clusters, our goal is to predict 15 future co-occurrences of nodes in clusters. 16 17 In the case of the DBLP collaboration network, two nodes are clustered together if 18 they work onForrelated pap ersPeeror belong to theReviewsame work-group, as we have seen. We 19 use the behavioral patterns we have discovered to predict the likelihood of authors 20 being clustered together in the future. 21 For prediction, we employ the Sociability index which, as we described before, gives 22 the likelihood of an author being involved in different collaborations. If an author 23 has a high Sociability Index score, the chances of him/her joining a new cluster 24 in the future is very high. We compute the Sociability Index scores for authors 25 using Equation 5. We use a degree threshold to prune these authors. 5 We find all 26 authors who have high Sociability index scores (> 0.75) and degree higher than the 27 threshold and who have not been clustered together in the past. We then predict 28 future cluster co-occurrences between them. 29 The seminal paper on link prediction [Liben-Nowell and Kleinberg 2003] provided 30 an empirical analysis of several techniques for link prediction. We adopt the same 31 scenario and split our DBLP snapshots into two parts. We use the clusterings for 32 the first 5 years (1997-2001) to predict new cluster co-occurrences for the next 5 33 years. Note that we are only considering new links between authors. Hence we 34 consider only authors that have not been clustered together previously. 35 Similar to the evaluation performed by the Liben-Nowell and Kleinberg [Liben- 36 Nowell and Kleinberg 2003], we use as our baseline a random predictor that ran- 37 domly predicts pairs of authors who have not been clustered together before, and 38 report the accuracy of all the methods methods relative to the random predictor. 39 To perform comparisons, we implement three other approaches that were shown to 40 perform well by the authors above: 41 42 Common Neighbor-based: This approach [Newman 2001; Liben-Nowell and 43 44 Kleinberg 2003] gives high similarity scores to nodes that have a large number of 45 neighbors in common. This measure is based on the notion that if two authors 46 have a large number of common neighbors and have not yet collaborated, there is 47 a good chance that they will, in the future. It is given by: 48 Score(a, b) = |γ(a) ∩ γ(b)| (6) 49 50 where γ(a) represents the neighbors of node a. 51 52 Adamic-Adar: This measure, originally proposed by Adamic and Adar [Lada 53 54 5The threshold value we used was 50 papers 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 15 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 15 6 7 Predictor Accuracy 8 Random Predictor Probability 0.14% Sociability Index 275 9 Common Neighbors 25 10 Adamic-Adar 46 11 Jaccard Coefficient 23 12 13 Table III. Cluster Link Prediction Accuracy. Accuracy score specifies the factor improve- 14 ment over the random predictor. This method of evaluation is consistent with the one 15 performed by Liben-Nowell and Kleinberg [Liben-Nowell and Kleinberg 2003]. 16 17 et al. 2003] in relation to similarity between web pages, weights a common neigh- 18 bor based onForits importance. PeerIt is defined Reviewas: 19 1 20 Score(a, b) = (7) X log(|γ(c)| 21 c∈(γ(a)∩γ(b)) 22 Nodes that have fewer neighbors are deemed more important than nodes with high 23 degrees. 24 25 26 Jaccard coefficient: This measure, a popularly used similarity metric, computes 27 the probability of two nodes having a common neighbor. 28 |γ(a) ∩ γ(b)| Score(a, b) = (8) 29 |γ(a) ∪ γ(b)| 30 31 We used all the algorithms to predict cluster links for the last 5 years (2002-2006). 32 We only considered pairs of authors who have not been clustered together in any 33 of the 5 earlier snapshot graphs. The accuracy was computed as a factor of the 34 random predictor [Liben-Nowell and Kleinberg 2003], which was found to give a 35 correct result with probability 0.14%. The results are shown in Table III. We find 36 that the Sociability Index-based method performs the best overall, outperforming 37 other approaches appreciably with a large ratio of correct predictions (275). This 38 result suggests that behavioral patterns of evolving graphs can be used to predict 39 future behavior. 40 6.1.3 Popularity Index. The Popularity index is a measure defined for a cluster 41 or community at a particular time interval. The Popularity Index of a cluster at 42 time interval [i, i + 1] is a measure of the number of nodes that are attracted to it 43 44 during that interval. It is defined as: 45 Vi Vi j j j 46 P I(Ci ) = (X Join(x, Ci )) − (X Leave(x, Ci )) (9) 47 x=1 x=1 48 This measure is based on the transformation a cluster undergoes over the course 49 of a time interval. If a cluster does not dissolve in [i, i + 1] and a large number of 50 nodes join the cluster and few leave it, then the cluster will have a high Popularity 51 Index score. Note, that the Popularity index is an influence measure defined for a 52 cluster. Also note, that this measure does not simply reflect the size of a cluster. 53 If a cluster is very large, it does not indicate that it is attracting new nodes to it. 54 In the DBLP dataset, the popularity index can be used to find topics of interest 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 16 of 42

1 2 3 4 5 16 · Sitaram Asur et al. 6 7 for a particular year. For instance, if a large number of nodes join a cluster at a 8 particular time point and a high percentage of them are working on a specific topic, 9 it indicates a buzz around that topic for that year. On the other hand, if a large 10 number of authors leave a cluster, and there are not many new nodes joining it, it 11 indicates a loss of interest in a particular topic. 12 To find hot topics, we computed the popularity index scores for each cluster, and 13 identified the most popular clusters, at each timestamp. We then examined the 14 clusters that had high popularity scores to see if a large percentage of the authors 15 in them were working on a particular topic. 16 17 We will now present an interesting result we obtained for the time span 1999- 18 2000. In 1999,Forthree authors PeerStefano Ceri, ReviewPiero Fraternali and Stefano Paraboschi 19 formed a cluster. They were involved in a few papers on XML and web applications. 20 In the next year (2000), these three authors were involved in a large number of 21 collaborations, resulting in around 50 joins to their cluster. When we examined 22 the topics of the papers that resulted, we found that 30 of these authors published 23 papers related to XML. Since there were no papers on XML before 1999, this was 24 a new and hot topic at that point. Since then there have been large number of 25 papers on XML. Figure 3 shows the original 3 person cluster as well as the authors 26 from the new cluster who were involved in XML related work in that particular 27 time interval. 28 6.1.4 Influence Index. The influence index of a node is a measure of the influence 29 this node has on others. Note that the influence that we are considering, in this 30 case, is with regard to cluster evolution. We would like to find nodes that influence 31 other nodes into participating in critical events. This behavior is measured for a 32 node x, over all timestamps, by considering all other nodes that leave or join a 33 cluster when x does. If a large number of nodes leave or join a cluster with high 34 frequency when a certain node x does, it suggests that node x has a certain positive 35 36 influence on the movement of the others. Let Companions(x) represent all nodes 37 over all timestamps that join or leave clusters with node x. The Influence for node 38 x is given by: 39 |Companions(x)| Inf(x) = (10) 40 |Moves(x)| 41 42 Here Moves(x) represents the number of Join and Leave events x participates 6 43 in. Note that, this definition by itself, does not measure influence, since nodes 44 that interact and move along with highly influential nodes will have high Influence 45 score values as well. Hence, to eliminate these follower nodes, additional pruning 46 constraints are needed. 47 Let Max Int(x) denote the node with which node x has the maximum number of 48 interactions. Let Deg(x) denote the number of neighbors of node x. 49 Influence Index(x) = Inf(x) unless any of the following hold : 50 —Inf(Max Int(x)) > Inf(x) 51 52 6 53 To compute the Influence index efficiently, we incrementally update Companions() and Deg() for Join Leave 54 all nodes. The number of and events (Moves()) are used in the Sociability case also and are stored incrementally as well. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 17 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 17 6 7 8 9 10 11 12 13 14 15 16 17 18 For Peer Review 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 Fig. 3. Illustration of certain authors belonging to a very popular cluster (1999-2000 time 35 period). Original cluster (3 authors) shown in small box. In the large graph, we show the 36 connections among 25 authors from the new cluster who published XML-related papers 37 in that time-frame. 38 39 —Deg(Max Int(x)) > Deg(x) 40 If any of the two conditions hold, Influence Index(x) = 0. 41 The additional constraints are imposed in order to ensure that we find the most 42 influential nodes in the datasets. 43 We computed the Influence Index scores for nodes in the DBLP dataset. The 44 top 20 authors are shown in Table IV. We further illustrate the use of the Influence 45 index in the next section. 46 47 6.2 Incorporating semantic content 48 For several real-world interaction networks such as online communities, WWW, 49 collaboration and citation networks, an important factor influencing behavior and 50 evolution is the semantic nature of the interactions itself. For instance, in a co- 51 authorship network, two authors are connected if they publish a paper together. 52 The topic or the subject area of the paper will definitely influence future collabo- 53 rations for each of these authors. If there are different authors working on similar 54 topics, the chances of them collaborating in the future is higher than two authors 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 18 of 42

1 2 3 4 5 18 · Sitaram Asur et al. 6 7 Author Influence 8 Index H. V. Jagadish 290.125 9 Hongjun Lu 268.5 10 Jiawei Han 266.625 Philip S. Yu 251.66 11 Rajeev Rastogi 246.85 12 Beng Chin Ooi 237 13 Tok Wang Ling 220.428 Heikki Mannila 206.5 14 Wenfei Fan 200.142 15 Qiang Yang 199 Johannes Gehrke 179.85 16 Christos Faloutsos 167.85 17 Rakesh Agrawal 157.875 18 Edward Y. Chang 153 For PeerGuy M. Lohman Review131.375 19 Dennis Shasha 129.29 20 Jennifer Widom 128.375 Hamid Pirahesh 127.625 21 Michael J. Franklin 121.5 22 Hector Garcia-Molina 118.625 23 24 Table IV. Top 20 Influence Index Values - DBLP Data 25 26 27 working on unrelated areas. In the case of Wikipedia, pages that are semantically 28 related are linked together. 29 We wish to examine the influence of the semantics of the interaction on future 30 interactions and incorporate semantic information for reasoning about evolution. 31 Apart from reasoning, we also wish to develop measures for evaluating the events 32 obtained from a semantic standpoint. For this purpose, we make use of both se- 33 mantic category hierarchies and information theoretic measures. 34 We use the DBLP co-authorship and the Wikipedia datasets for this analysis. 35 For the DBLP dataset, to quantify the relevance of author pairs and construct 36 a knowledge-base, we make use of semantic information from the topics of their 37 papers. We begin by identifying a set of unique keywords composed of frequently 38 used technical terms from the topics of all the papers in our corpus. We then group 39 related keywords together to form keyword-sets, k = {w1, ..., wn}, where each wi is 40 a keyword. Each paper can be labeled with a set of related keywords. An example of 41 a keyword-set is {0W W W 0,0 W eb0,0 Internet0}. Thus each author can be associated 42 with a paper-set, P , consisting of the union of the keyword-sets from all the papers 43 that she/he co-authored in a particular time period. P = {k1, ..., kp} where each ki 44 is a keyword-set. However, the relationship between two authors cannot be inferred 45 by merely comparing their paper-sets since different keywords are associated with 46 different semantic meanings. One needs to consider the distribution of topics and 47 the relationships among them. 48 To capture this, we constructed an ontology in the form of a hierarchy or a 49 DAG where each node represents a keyword-set. 7 Nodes at higher levels in the 50 hierarchy represent keyword-sets that are more general, while nodes closer to the 51 leaves represent more specific keywords. A node a has a child b if b is a keyword-set 52 that represents a more specific term related to a. An example hierarchy is shown 53 54 7The keyword ontology was manually constructed by the authors for this work. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 19 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 19 6

7 4 x 10 Distribution of Information Content 8 2.5 9 10 2 11

12 Data Mining 1.5 13

14 Temporal Mining 15 1

16 Time Series Number of Categories 17 0.5 18 Trend Analysis ForTransformation PeerForecasting Review 19 0 2 4 6 8 10 12 20 Information Content 21 22 23 Fig. 4. a) A sample subgraph of the keyword DAG hierarchy. b) Distribution of 24 Information Content for Wikipedia. The X-axis gives the IC values and the Y-axis 25 represents the number of categories that have that particular value. 26 27 28 29 in Figure 4(a). 30 The Wikipedia dataset that we use [Gabrilovich and Markovitch 2007], already 31 contains a category hierarchy comprising of 78664 categories. Most Wikipedia pages 32 have more than one category associated with them. The categories of Wikipedia 33 serve the same purpose as the keyword-sets detailed above. They provide infor- 34 mation for classifying webpages (nodes) of the network according to their semantic 35 relevance. We will refer to both of them (categories and keyword-sets) as terms, 36 for the rest of this section. 37 Using these hierarchies, we can define the notion of semantic similarity [Lin 1998; 38 Ganesan et al. 2003] in this context. To begin with, the Information Content (IC) 39 of a term (category or keyword-set), using Resnik’s definition [Resnik 1999], is given 40 as: 41 F (ki) 42 IC(ki) = −ln (11) 43 F (root) 44 where ki represents a term and F (ki) is the frequency of encountering that par- 45 ticular term over all the entire corpus. Here, F (root) is the frequency of the root 46 term of the hierarchy. Note that frequency count of a term includes the frequency 47 counts of all subsumed terms in an is-a hierarchy. Accordingly, the root of our 48 hierarchy includes the frequency counts of every other term in the ontology, and is 49 associated with the lowest IC value. Note that terms with smaller frequency counts 50 will therefore have higher information content values (i.e. more informative). The 51 distribution of IC for the Wikipedia dataset is shown in Figure 4(b). It can be 52 observed that most categories have high IC (low frequency) with only a few having 53 low IC (high frequency). 54 Using the above definition, the Semantic Similarity (SS) between two terms (cat- 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 20 of 42

1 2 3 4 5 20 · Sitaram Asur et al. 6 7 egories or keyword-sets) can be computed as follows: 8 9 SS(ki, kj ) = IC(lcs(ki, kj )) (12) 10 where lcs(k , k ) refers to the lowest common subsumer of terms k and k . 11 i j i j Next, we demonstrate how semantic similarity can be used to reason about 12 community-based critical events, using examples from both datasets. To aid our 13 14 analysis, we extend information theoretic measures such as information entropy and 15 mutual information in this context. 16 6.2.1 Group Merge:. The key intuition that we employ here is that the proba- 17 bility of a merge event depends on the Semantic Similarity between two clusters. 18 For instance,Forif two clusters Peerare comprised Reviewof authors working on highly related 19 topics, it stands to reason that there is a high likelihood of a merge between them. 20 We would like to investigate the strength of this relationship. 21 a b a b Let us consider two clusters Ci and Ci associated with k and k terms respec- 22 tively. A simple way to compute the semantic similarity between two clusters is 23 based on the information content of their terms, as shown below: 24 m=1 n=1 25 a b SS(m, n) Inter SS(Ca, Cb) = Pk Pk (13) 26 i i ka ∗ kb 27 Clusters with high values of Inter SS(), can be expected to contain authors or web- 28 pages with similar topics and can hence be considered candidates for a possible 29 merge in the future. 30 31 Semantic Mutual Information (SMI): To define the semantic similarity be- 32 tween two clusters, one can also employ an information theoretic mutual informa- 33 34 tion measure [Strehl and Ghosh 2002]. Mutual information is a measure of the 35 amount of statistical information shared between two distributions. We can com- 36 pute the mutual information between two clusters based on the terms of all nodes 37 belonging to those clusters. This would be applicable when there is history data 38 that can be used to estimate the probabilities. Given probabilities of terms m and 39 n occurring in a cluster as p(m) and p(n) respectively, and their co-occurrence 40 probability p(mn), we can define the Semantic Mutual Information (SMI) between a b 41 the two clusters Ci and Cj as:

42 ka kb 43 a b p(mn) SMI(C , C ) = SS(m, n) ∗ p(mn) ∗ log a b 44 i i X X k ∗k p(m) ∗ p(n) m=1 n=1 45 46 47 Note that each term pair is weighted by the information content of their most 48 informative common ancestor in the keyword ontology (i.e. Lowest Common Sub- 49 sumer). 50 In our implementation, we used the immediate history data, i.e the previous 51 snapshot’s data, to compute the probabilities of terms (categories) as well as their 52 co-occurrence probabilities 8. 53 54 8Note that it is possible to consider more history information to compute these probabilities 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 21 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 21 6 7 Cluster 1 Cluster 2 Merged Cluster 8 1) Querying Aggregate Data Exact and Approximate 1) On the Content of Materialized Aggregation in Aggregate Views. 9 Constraint Query 10 2) On the Orthographic Dimension of 2) Automatic Aggregation Constraint Databases Using Explicit Metadata 11 3) A Performance Evaluation of 3) Reachability and Connectivity 12 Spatial Join Processing Strategies Queries in Constraint Databases 13 14 Table V. Group Merge Event - Column 1 and 2 show the papers from the original 15 clusters. Column 3 shows the papers from the merged cluster. 16 17 18 The term information can be used to analyze the effect of topics on community For Peer Reviewa b based events. If two clusters in snapshot Si - Ci and Ci , are merging in time Ti into 19 c a b cluster Ci+1, the inter-cluster similarity of Ci and Ci and their similarity to the 20 c 21 new cluster formed, Ci+1 can be indicative of the evolution of a topic by combining 22 two similar sub-topics. 23 We illustrate this with an example from the DBLP dataset. In 1999, we found two 24 clusters with high semantic similarity scores (shown in the first two columns of Ta- 25 ble V), due to the common keywords - ‘constraint’, ‘query’, ‘aggregate/aggregation’. 26 We found that these two clusters merged in the following year (2000) giving a single 27 cluster. All three clusters are shown in Table V. 28 In the Wikipedia dataset, we found the merge events with highest SMI from each 29 time stamp. We provide a couple of examples below. depicting high SMI merges 30 of two clusters at timestamps 2 and 3, both into single clusters and with the main 31 motifs being Cinema and Chemical Elements respectively 9. 32 33 34 Cluster 1 at Time 2: Cinema of the United Kingdom ; Blowup ; Wilde ; Boris Karloff ; British Academy of Film and Television Arts ; Rowan Atkinson ; Time Bandits ; Ben Kingsley ; The Madness 35 of King George ; Roger Moore ; School for Scoundrels ; Withnail and I ; My Beautiful Laundrette ; 36 Michael Caine ; Stan Laurel ; No Highway ; Whisky Galore! ; James Whale ; If.... ; Born Free ; A Matter of Life and Death ; Albert Finney ; Neil Jordan ; John Boorman ; Richard Attenborough ; Our 37 Man in Havana ; First Men in the Moon ; Brighton Rock ; The Happiest Days of Your Life ; Robert 38 Morley ; Julie Walters ; The Man in the White Suit ; Truly Madly Deeply ; Mrs. Brown ; Billy Elliot ; David Lean ; Lionel Jeffries ; Forty-Ninth Parallel ; Topsy-Turvy ; Michael Powell ; My Left Foot ; 39 The African Queen ; Irene Handl ; The Ipcress File ; Leslie Phillips ; Sid James ; Hayley Mills ; The 40 Railway Children ; Joyce Grenfell ; Brassed Off ; Brief Encounter ; The Pope Must Die ; The Rebel ; The Lavender Hill Mob ; Chaplin ; James Mason ; A Canterbury Tale ; Michael Latham Powell ; Helen 41 Mirren ; Walkabout ; Peter Sellers ; The Trials of Oscar Wilde ; Hugh Grant ; Straw Dogs ; Scrooge ; 42 Blackmail ; John Schlesinger ; Ann Todd ; Alfie ; Billy Liar ; Danny Boyle ; Great Expectations ; The Ladykillers ; Little Voice ; Peter Greenaway ; Otley ; Genevieve ; The Man Who Fell to Earth ; Blithe 43 Spirit ; The Day of the Jackal ; How I Won the War ; Bill Nighy ; British Independent Film Awards ; 44 Jude Law ; Staggered ; Alec Guinness ; The Private Life of Sherlock Holmes ; The Winslow Boy ; Hattie 45 Jacques ; Joan Sims; 46 Cluster 2 at Time 2: George Lucas ; Plant ; Fish ; Cooking ; Food ; Osteichthyes ; Shark ; Species ; Nature (journal) ; Triumph of the Will ; International Energy Agency ; Fiji ; Pie ; Dicotyle- 47 don ; Economic and monetary union ; Black Narcissus ; Deathwatch ; The Entertainer ; My Learned 48 Friend ; Vivien Leigh ; The First of the Few ; Juliet Mills ; The Curse of Frankenstein ; Dead of Night ; 49 David Niven ; Mike Nichols ; The Goose Steps Out ; Elizabeth Hurley ; Oliver Twist ; A Room with a View ; The Third Man ; The Wooden Horse ; A Passage to India ; Phyllis Calvert ; The Wrong Arm of 50 the Law ; Carry On Sergeant ; To Sir, with Love ; Doctor in the House ; Ned Kelly ; Jim Sheridan ; John 51 Mills ; Dudley Moore ; Lindsay Anderson ; The Lady Vanishes ; The Adventures of Baron Munchausen ; Privilege ; The Full Monty ; 10 Rillington Place ; 84 Charing Cross Road ; Michael York ; Jenny Agutter 52 ; Joan Collins ; Whistle Down the Wind ; The Italian Job ; The Wicker Man ; The Man Who Never Was 53 54 9The merged cluster is large in both cases and hence not shown 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 22 of 42

1 2 3 4 5 22 · Sitaram Asur et al. 6

7 ; Superman ; Deborah Kerr ; The Mummy; A Fish Called Wanda ; An American Werewolf in London 8 ; Dog Soldiers ; The Collector ; Herb ; Winter ; Constitutional monarchy ; Sturmabteilung ; Reduction 9 ; New Guinea ; Spice ; Romanticism ; Altruism ; Jesus ; Kuru ; Olive ; Rose ; Hamlet ; Dill ; Onion ; Districts of Luxembourg ; Geography of Luxembourg ; Jean-Claude Juncker ; Luxembourg (district) ; 10 Luxembourg (city) ; Lemon balm ; Mint ; Carolus Linnaeus ; Essential oil ; Candy ; Lamiales ; Annual 11 plant ; Oregano ; Basil ; Chimpanzee ; Sandwich ; French cuisine ; Catalan language ; Classical music ; Sect ; Cannibalism ; Bay leaf ; Lao language ; Electronic media; 12 13 Cluster 1 at Time 3: Chemist ; Gadolinium ; Gadolinite ; Fluoride ; Rare earth ; Period 6 14 element ; Geologist ; Lanthanide ; Sulfide ; Thermal neutron ; Hexagon ; Bromide ; Selenide ; Terbium 15 ; Johan Gadolin ; Metabolism ; Didymium ; Compact disc ; Chloride ; Alpha decay ; Xenon ; Magnetic resonance imaging ; Dysprosium ; F-block ; Bastnasite ; Monazite ; Nuclear control rod ; Toxicity ; 16 Nitride ; Iodide ; Infrared ; Holmium ; Marc Delafontaine ; Garnet ; Laser ; Erbium ; Absorption band ; 17 Carl Gustaf Mosander ; Magnetic moment ; Ytterbium ; Thulium ; Niobe ; Euxenite ; Period 7 element 18 ; Vaxholm ; Greek ; Nuclear energy ; Xenotime ; Fergusonite ; Polycrase ; Chalcogenide ; Nuclear fusion ; High Flux IsotopFore Reactor ; OakPeerRidge National LabRevieworatory ; Los Alamos National Laboratory ; 1 E4 19 s ; Anhydrous ; Filter ; Charles James ; Dopant ; Ultraviolet ; 20 Cluster 2 at Time 3: Antoine Lavoisier ; Year ; Acid ; Oxygen ; Fluorine ; Bullet ; Atomic 21 number ; Scientific notation ; Electron ; Boiling point ; Half-life ; Chemical series ; Specific heat ca- 22 pacity ; Ohm ; Decay product ; Kilogram per cubic metre ; Ionization potential ; Thermal conductivity ; Natural abundance ; Chemical element ; Crystal structure ; Glass ; Decay energy ; Periodic table ; 23 Energy level ; Neutron ; Zinc ; Atomic radius ; Speed of sound ; Si ; Covalent radius ; Periodic table 24 (standard) ; Melting point ; Stable isotope ; Vapor pressure ; Molar volume ; Decay mode ; Electroneg- ativity ; Metre per second ; Electron configuration ; List of elements by name ; Isotope ; Kelvin ; 25 Electrical conductivity ; Periodic table block ; Periodic table period ; Periodic table group ; Atomic 26 mass unit ; Kilojoule per mole ; Mega ; List of elements by symbol ; Van der Waals radius ; Metal ; Color ; Pascal ; Uranium ; Nuclear weapon ; Radioactive ; Calcium ; 1 E-25 kg ; Europium ; Germanium 27 ; Dmitri Mendeleev ; P-block ; Distillation ; Transistor ; Gram ; Silicon ; Gallium ; Crystal ; Iodine ; 28 Einsteinium ; Indium ; Rectifier ; Thermistor ; Indigo ; Ferdinand Reich ; True metal ; Solder ; Thallium ; Poor metal ; Welding ; Deuterium ; Tungsten ; Light bulb ; Cubic metre ; Spallation ; 1811 ; Sodium ; 29 Alkali metal ; Lithium ; Lutetium ; Cerium ; Boron ; Technetium ; Niobium ; Potassium permanganate 30 ; Steelmaking ; Manganese nodule ; Arginase ; Dry cell ; Cobalt ; Neodymium ; Enamel ; Promethium ; Rocket ; Tantalus ; Pyrochlore ; 1 E8 s ; 1 E10 s ; 1 E15 s ; 1 E11 s ; Silicate ; Neutron emission ; 31 Heinrich Rose ; Isomeric transition ; Superconducting magnet ; Capacitor ; Charles Hatchett ; Osmium 32 ; Phonograph ; Fountain pen ; Shocked quartz ; Dinosaur ; Fingerprint ; Hassium ; Artificial pacemaker 33 ; Smithson Tennant ; Neon ; Granite ; Neptunium ; Philip Abelson ; Transmutation ; Edwin McMillan ; Protactinium ; Plutonium ; University of California ; X-ray ; Gamma ray ; Scandium ; Electrode 34 ; Sulphur (disambiguation) ; Czochralski process ; Feldspar ; Humphry Davy ; Opal ; Jasper ; Zone 35 melting ; Amethyst ; Diatom ; Period 3 element ; Meteoroid ; Mass number ; Silane ; Trichlorosilane ; Hornblende ; Zinc chloride ; Silicon tetrachloride ; Semiconductor device ; Tektite ; Chemical equation 36 ; Abrasive ; Solar cell ; Turbine ; reaction ; Sodium cyanide ; Bleach ; Chlorite ; Mustard gas ; Water 37 purification ; Synthetic rubber ; Critical temperature ; Critical exponent ; Isotherm ; Miscibility ; Check valve ; John Ambrose Fleming ; Barium oxide ; 38 39 40 The above examples illustrate the relationship between semantic similarity be- 41 tween clusters and the Merge event. The SMI measure defined above can thus 42 be used to evaluate the semantic relevance of merges. High SMI merges provide 43 semantic justification for our event-detection process. On the other hand, SMI also 44 allows one to identify unexpected merges, which can be quite interesting. These 45 are merges with lower SMI, and hence represent the convergence of clusters with 46 lower semantic similarity. 47 For instance, let us consider a cluster merge event for the DBLP dataset, that 48 occurred in the 2005-2006 time interval. Our algorithm identified two groups (one 49 from Germany and one from Italy) who independently published articles in different 50 conferences in 2005. 51 Cluster 1 at Time 9: 52 AAAI 2005: Niels Landwehr, Kristian Kersting, Luc De Raedt: nFOIL: Integrating Nave Bayes and FOIL 53 AAAI 2005: Luc De Raedt, Kristian Kersting, Sunna Torge: Towards Learning Stochastic Logic 54 Programs from Proof-Banks. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 23 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 23 6 7 Cluster 2 at Time 9: 8 ICML 2005 : Sauro Menchetti, Fabrizio Costa, Paolo Frasconi: Weighted Decomposition Kernels. 9 IJCAI 2005 : Andrea Passerini and Paolo Frasconi: P. Kernels on Prolog Ground Terms. 10 Merged Cluster at Time 10: 11 ILP 2006 : Niels Landwehr, Andrea Passerini, Luc De Raedt, Paolo Frasconi: kFOIL: Learning Simple Relational Kernels 12 13 The original clusters had low SMI, due to the fact that the corresponding authors 14 were working on reasonably diverse topics, with Niels Landwehr and Luc De Raedt 15 working on Inductive Logic and Passerini and Frasconi, working on kernels. How- 16 ever, they still collaborated together to combine their ideas. The authors describe 17 their joint work as ‘A novel and simple combination of inductive logic programming 18 with kernel methoFords is prPeeresented. The kF OILReviewalgorithm integrates the well-known 19 inductive logic programming system FOIL with kernel methods.’ 20 21 Also, when we further analyzed the merge events which had lower SMI values, 22 we found that although some of the corresponding clusters contained nodes asso- 23 ciated with suitably disparate categories, they were being clustered together due 24 to their being linked by a common neighbor. This caused the clusters to 25 merge, although their semantic similarity was fairly low. These kind of merges can 26 also be interesting, since they can help one to uncover hidden relationships across 27 nodes. This is related to the notion of Sociability that we examined in the previous 28 subsection. Sociable nodes are nodes that interact with very different nodes. This 29 can sometimes lead to the disparate nodes themselves getting involved together. 30 31 Note that we observed this behavior more with DBLP than Wikipedia. For the 32 Wikipedia dataset, all merges had reasonably high SMI values. This is due to the 33 fact that in Wikipedia, pages are linked due to similar semantic content. Hence, 34 merges are likely to have a high semantic context. For DBLP, our analysis is limited 35 by the hierarchy that we have manually constructed, using only paper titles. It is 36 entirely possible that all semantic relationships are not captured. 37 One relatively straightforward conclusion we could make from our observations 38 in this and the previous subsection is that the propensity of a merger between 39 clusters is dependent on two main factors - the sociability of the nodes and the 40 semantic similarity of the terms involved. We have presented measures to capture 41 both these types of behavior. 42 6.2.2 Group Split:. We believe that an important factor for a Split event is 43 topic divergence. The probability of topic divergence is inversely proportional to 44 the semantic similarity between topics in a cluster. We can define the intra-cluster 45 semantic similarity Intra SS as the average of the semantic similarity scores be- 46 tween the keyword-sets in the cluster. The semantic similarity within a cluster is 47 given by: 48 m=1 n=m+1 49 a a SS(m, n) Intra SS(Ca) = Pk Pk (15) 50 i ka ∗ (ka − 1) 51 a a where k , as before, represents the total number of terms in cluster Ci . If the intra- 52 cluster semantic similarity is small, then it indicates that the cluster is likely to split 53 in the next few timestamps. For instance, in 2001 we found a cluster with rela- 54 tively low intra-cluster semantic similarity. This cluster contained very disparate 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 24 of 42

1 2 3 4 5 24 · Sitaram Asur et al. 6 7 Cluster Split Cluster 1 Split Cluster 2 8 1) Web Site Evaluation: Spatio-temporal Information RoadRunner: automatic data Methodology and Case Study Systems in a extraction from data-intensive 9 Statistical Context. web sites. 10 2) RoadRunner: Towards Automatic Data Extraction from Large Web Sites. 11 3) SIT-IN: a Real-Life Spatio-Temporal 12 Information System. 13 14 Table VI. Group Split Event - Column 1 shows the papers from the original cluster. 15 Columns 2 and 3 show the papers from the split clusters. 16 17 keyword-sets {{web,data extraction} and {spatio-temporal, information system}}. 18 The papers inForthis cluster Peerare shown in theReviewfirst column of Table VI. In 2002, this 19 cluster split into two different clusters, shown in the columns 2 and 3 in Table VI. 20 Thus, the semantic similarity within clusters can be indicative of possible future 21 Split events. 22 23 24 Semantic Information Gain (SIG): We can also define an information theoretic 25 measure to analyze a Split event, using the notion of Information Gain. Information 26 Gain is popular in use in decision trees (C4.5), to identify splitting attributes. It 27 measures the difference in entropy between the original distribution and the current 28 distribution. In the case of a Split event, it can be used to measure the change in 29 entropy in terms of categories of the original cluster and the two split clusters. 30 The key intuition here, is that if a Split occurs due to the divergence of topics 31 or categories, the entropy of the resultant clusters will be much lower than the a b 32 original cluster. Assume a cluster Ci splits into two clusters at time i + 1, Ci+1 c 33 and Ci+1. Let Ka be the number of terms in the original cluster, and let Kb and 34 Kc be the number of these terms that are carried through to the two split clusters. a 35 The entropy for a cluster Ci based on the terms Ka it is associated with, can be 36 given as: 37 K a 1 38 H(Ca) = − IC(m) ∗ p(m) ∗ log( ) (16) i X p(m) 39 m=1 40 where p(m) represents the probability of term m occurring in a cluster. Note that 41 we are weighting the entropy by the Information Content of the terms involved. If 42 43 a cluster is associated with multiple different categories, it can have high entropy 44 within itself. The Semantic Information Gain for a Split event can be computed 45 based on the entropies of the three clusters as : 46 47 a b c a Kb b Kc c SIG(Ci , Ci+1, Ci+1) = H(Ci ) − ( H(Ci+1) + H(Ci+1)) (17) 48 Ka Ka 49 A high value of Semantic Information Gain is indicative of divergence of topics, as 50 the two resultant clusters will be more aligned to certain terms than others. By 51 studying these Split events, one can discern important branches in the evolution 52 of topics over time. Note that, although we have presented the notation above for 53 a 2-way Split event, it can be generalized to multi-way splits. Also note that, in 54 the above formulation, we are evaluating the Split based only on the terms of the 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 25 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 25 6 7 original cluster. Hence we are not considering the new nodes that might belong to 8 these new clusters and their associated terms. 9 We illustrate the efficacy of this measure with a few examples from the Wikipedia 10 dataset. For each Split event that we discovered, we computed the resulting Se- 11 mantic Information Gain. We analyze the ones with the highest and lowest SIG 12 13 values. 14 15 High Semantic Information Gain : Below, we show a 3-way Split event with 16 high SIG obtained between snapshots 4 and 5 of the Wikipedia dataset. The origi- 17 nal cluster at time 4 contained pages on 3 different topics - folk music, Philippines 18 and Mississippi.ForThe resulting Peer3 clusters atReviewtime 5 comprise of pages belonging to 19 each of these categories. 20 Cluster at Time 4: Folk music, , Milonga ; Lambada ; Tsifteteli ; Christine Lavin ; Hula 21 ; Meringue ; Plena ; ; Cante jondo ; Qawwali ; Lam ; Country dancing ; Guajira ; Bachata ; 22 Taarab ; Reel ; Maqam ; Batucada ; Yodeling ; The Dubliners ; Shango ; Parang ; West gallery music 23 ; Chimurenga ; Kalinda ; ; The Weavers ; Mbira ; Fandango ; Sawt ; Choro ; The Chieftains ; Tambu ; Francis James Child ; Bangsawan ; Folk clubs ; ; ; Pentangle ; Beguine ; 24 Forro ; ; Dangdut ; Giddha ; Planxty ; Oro ; ; Honkyoku ; Seamus Ennis ; Rada ; Kheyal 25 ; Farruca ; Dastgah ; Donegal fiddle tradition ; Clannad ; Tarantella ; Waiata ; Carimbo ; Fairport Convention ; Compas ; Thumri ; Jig ; Clare ; Halling ; Shabad ; Tsamiko ; Sid Kipper ; Zydeco ; ; 26 Klezmer ; Timba ; Mississippi ; Hattiesburg, Mississippi ; Vicksburg, Mississippi ; Alternative political 27 spellings ; Pontotoc, Mississippi ; Ackerman, Mississippi ; McComb, Mississippi ; Starkville, Mississippi ; Columbus, Mississippi ; Laurel, Mississippi ; Iuka, Mississippi ; Gulfport, Mississippi ; Blue Mountain 28 College ; Jackson State University ; University of Mississippi Medical Center ; Mississippi College ; 29 Corinth, Mississippi ; Natchez, Mississippi ; Gulf Islands National Seashore ; Belhaven College ; Hurri- cane Camille ; Tombigbee River ; Mississippi University for Women ; Magnolia Bible College ; Tougaloo 30 College ; Yazoo River ; Adams County, Mississippi ; Alcorn State University ; Delta State University ; 31 Brookhaven, Mississippi ; Cat Island ; Kosciusko, Mississippi ; Tishomingo County, Mississippi ; Wayne County, Mississippi ; Yazoo County, Mississippi ; University of Southern Mississippi ; Mississippi Valley 32 State University ; Horn Island ; Natchez Trace Parkway ; William Carey College ; Foreign relations of 33 the Philippines ; Transportation in the Philippines ; Communications in the Philippines ; Economy of the Philippines ; Ilocos Region ; EDSA Revolution ; Flag of the Philippines ; Central Visayas ; Eastern 34 Visayas ; Western Visayas ; Central Luzon ; Bicol Region ; Cagayan Valley 35 36 Cluster 1 at Time 5: Muezzin ; Philippine peso ; Foreign relations of Peru ; Military of Peru ; Metro Manila ; Lupang Hinirang ; Philippine Sea ; Autonomous Region in Muslim Mindanao ; Ge- 37 ography of the Philippines ; Davao Region ; Caraga ; Zamboanga Peninsula ; Northern Mindanao ; 38 SOCCSKSARGEN ; Gloria Macapagal-Arroyo ; Filipino ; Transportation in the Philippines ; Ilocos Region ; Central Visayas ; Flag of the Philippines ; Bicol Region ; Central Luzon ; Cagayan Valley ; 39 Cordillera Administrative Region ; Eastern Visayas ; Western Visayas 40 Cluster 2 at Time 5: Choctaw ; Mississippi ; Flag of Mississippi ; Gulfport, Mississippi ; Hat- 41 tiesburg, Mississippi ; Greenville, Mississippi ; Millsaps College ; Pontotoc, Mississippi ; Ackerman, 42 Mississippi ; McComb, Mississippi ; Starkville, Mississippi ; Tupelo, Mississippi ; Laurel, Mississippi ; 43 Iuka, Mississippi ; Blue Mountain College ; University of Mississippi Medical Center ; Mississippi Col- lege ; Corinth, Mississippi ; Natchez, Mississippi ; Gulf Islands National Seashore ; Hurricane Camille ; 44 Mississippi State University ; Tombigbee River ; Woodall Mountain ; Mississippi University for Women 45 ; Magnolia Bible College ; Yazoo River ; Adams County, Mississippi ; Alcorn State University ; Delta State University ; Brookhaven, Mississippi ; Cat Island ; Pascagoula, Mississippi ; Kosciusko, Mississippi 46 ; Tishomingo County, Mississippi ; Wayne County, Mississippi ; Yazoo County, Mississippi ; University 47 of Southern Mississippi ; University of Mississippi ; Mississippi Valley State University ; Horn Island ; Natchez Trace Parkway ; William Carey College ; Choctaw mythology 48 49 Cluster 3 at Time 5: Folk music ; Dhrupad ; Milonga ; Lambada ; Tsifteteli ; Hula ; Meringue ; Plena ; Kunqu ; Cante jondo ;Qawwali ; Lam ; Country dancing ; Bachata ; Taarab ; Reel ; Batucada 50 ; Yodeling ; The Dubliners ; Shango ; Parang ; West gallery music ; Chimurenga ; Kalinda ; Muqam ; 51 The Weavers ; Berimbau ; Merengue ; Mbira ; Bhajan ; Sawt ; Choro ; Tambu ; Bangsawan ; Folk clubs 52 ; Gwo ka ; Candombe ; Beguine ; Forro ; Dangdut ; Giddha ; Planxty ; Oro ; Honkyoku ; Seamus Ennis ; Rada ; Tarana ; Kheyal ; Sea shanty ; Dastgah ; Clannad ; Tarantella ; Waiata ; Carimbo ; ; 53 Fairport Convention ; Compas ; Thumri ; Halling ; Shabad ; Tsamiko ; Sid Kipper ; Zydeco ; Fado ; 54 Klezmer ; Timba ; Blood on the Tracks ; Macarena 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 26 of 42

1 2 3 4 5 26 · Sitaram Asur et al. 6 7 Low Information Gain : As is evident from the above example, high SIG values 8 can indicate semantically meaningful splits. On the other hand, low SIG values can 9 indicate subtle changes across snapshots, where a cluster splits into two parts due 10 to a small semantic difference among the associated categories. These splits can 11 be interesting as they can reveal differences that may not be obvious. This can be 12 considered akin to drilling down a hierarchy to discover subtle specializations of a 13 category. We demonstrate this below with a couple of examples from the Wikipedia 14 dataset. 15

16 Cluster at Time 1: Algebra ; Linear equation ; Quadratic equation ; Scalar ; System of linear 17 equations ; Superposition ; Linear function ; Cubic equation ; Function (mathematics) ; Quartic equa- 18 tion ; Quintic equationFor; Line (mathematics)Peer; Iden titReviewy 19 Cluster 1 at Time 2:Quadratic equation ; Fundamental theorem of algebra ; Loss of significance 20 ; Complex conjugate ; Cubic equation ; Quartic equation ; Root (mathematics) ; Brahmagupta ; Quin- 21 tic equation ; Completing the square ; Quadratic irrational ; Abraham bar Hiyya Ha-Nasi ; Al-Khwarizmi 22 Cluster 2 at Time 2: Algebra ; Linear equation ; Scalar ; System of linear equations ; Superposi- 23 tion ; Linear function ; Function (mathematics) ; Line (mathematics) 24 We can observe that the cluster on algebra and equations at Time 1, has split into 25 two clusters specializing in information regarding Quadratic and Linear equations 26 respectively. This change has low SIG, since both the new clusters are strongly 27 related to each other and the original cluster. A more ironic example, given below, 28 shows two new clusters on East and West Berlin. 29

30 Cluster at Time 2: August 13 ; Berlin Wall ; East Berlin ; November 9 ; October 3 ; West Berlin 31 ; German reunification ; Boroughs of Berlin ; Pankow ; Lichtenberg ; Prenzlauer Berg ; Friedrichshain ; 32 Treptow 33 Cluster 1 at Time 3: November 9 ; October 3 ; Sovereignty ; West Berlin ; German reunification ; 34 History of Germany since 1945 ; Spandau ; Kreuzberg ; Boroughs of Berlin ;Charlottenburg ; Wilmers- 35 dorf ; Tiergarten ; Berlin S-Bahn ; Judgment in Berlin ; Stunde Null ; Zehlendorf, Berlin

36 Cluster 2 at Time 3: East Berlin ; Pankow ; Lichtenberg ; Prenzlauer Berg ; Friedrichshain ; 37 Treptow 38 39 For the DBLP dataset, we found a Split event with low SIG, involving a cluster 40 consisting of papers on structure extraction from HTML and unstructured docu- 41 42 ments. 43 Cluster at Time 3 44 FODO 1998: Seung Jin Lim, Yiu-Kai Ng: Constructing Hierarchical Information Structures of 45 Sub-Page Level HTML Documents ER 1998: David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle, Yiu-Kai 46 Ng, Dallan Quass, Randy D. Smith: A Conceptual-Modeling Approach to Extracting Data from 47 the Web. IDEAS 1998: Aparna Seetharaman, Yiu-Kai Ng: A Model-Forest Based Horizontal Fragmenta- 48 tion Approach for Disjunctive Deductive Databases 49 CIKM 1998: David W. Embley, Douglas M. Campbell, Randy D. Smith, Stephen W. Liddle: 50 Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents 51 In the next year (1999), this cluster splits into two different clusters. While Seung 52 Jin Lim, Yiu-Kai Ng and David W. Embley continue working on extracting infor- 53 mation from Web Documents, Stephen W. Liddle, Douglas M. Campbell, Chad 54 Crawford specialized on Business Reports. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 27 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 27 6 7 Cluster 1 Cluster 2 8 Object Recognition Using Hierarchical Organization of Appearance-Based Appearance-Based Parts and Relations Parts and Relations for Object Recognition 9 Mining Insurance Data at Swiss Life A Data Mining Support Environment and 10 its Application on Insurance Data M-tree: An Efficient Access Method for Processing Complex Similarity 11 Similarity Search in Metric Spaces Queries with Distance-Based Access Methods 12 Optimizing Queries in Distributed Distributed View Expansion in Composable 13 and Composable Mediators Mediators Scaling up Dynamic Time Warping to Massive Dataset Scaling up dynamic time warping for 14 to Massive Datasets datamining applications 15 16 Table VII. Continue Events - Column 1 shows a paper from a cluster that is part of a 17 18 Continue evenFort. Column Peer2 shows the pap erReviewfrom the cluster in the next timestamp. 19 20 Cluster 1 at Time 4 21 CIKM 1999: Seung Jin Lim, Yiu-Kai Ng: An Automated Approach for Retrieving Hierarchical 22 Data from HTML Tables. DASFAA 1999: Seung Jin Lim, Yiu-Kai Ng: WebView: A Tool for Retrieving Internal Structures 23 and Extracting Information from HTML Documents 24 SIGMOD 1999: David W. Embley, Y. S. Jiang, Yiu-Kai Ng: Record-Boundary Discovery in Web Documents. 25 26 Cluster 2 at Time 4 CIKM 1999: Stephen W. Liddle, Douglas M. Campbell, Chad Crawford: Automatically Ex- 27 tracting Structure and Data from Business Reports. 28 29 6.2.3 Group Continue:. For a Continue event, since the nodes belonging to the 30 cluster do not change, one can glean information about how ideas evolve. This is of 31 more relevance to the DBLP dataset, where one can study the evolution of topics. 32 The clusters that correspond to a continue event will tend to have a reasonably 33 34 high SMI score. Note that, in this case, we are measuring SMI of the same cluster 35 across successive snapshots, rather than two clusters in the same snapshot as in the 36 Merge case. We present some examples of the papers corresponding to continue 37 events for the DBLP dataset in Table VII. The first column in the table represents 38 a paper from the cluster at time stamp i and the second column denotes the most 39 similar paper from the continuing cluster at i + 1. As we can observe, there is a 40 marked progression in the topics of papers over time. 41 42 7. DIFFUSION MODEL FOR 43 EVOLVING NETWORKS 44 We use the behavioral patterns discussed in the previous section to define a diffu- 45 sion model for evolving networks. Diffusion models have been studied for complex 46 networks [Alkemade and Castaldi 2005; Cowan and Jonard 2004] and specifically 47 in the context of influence maximization [Kempe et al. 2003; 2005] where the task 48 is to identify key start nodes that can be used to effectively propagate information 49 through the network. The information can be either an idea or an innovation that 50 propagates through the network over time. In this regard, Kempe et al [Kempe 51 et al. 2003; 2005] discuss two models for the spread of influence through social 52 networks. We examine this scenario from an evolving perspective, where the nodes 53 and edges of the network are transient. 54 Let us consider an idea or innovation that arrives into the network at timestamp 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 28 of 42

1 2 3 4 5 28 · Sitaram Asur et al. 6 7 a. We define four states for nodes in the evolving network - active, inactive, 8 contagious and isolated. These states are not mutually exclusive, as we will see 9 later. At the beginning of the diffusion process, at time a, all nodes in the network 10 are inactive. The diffusion model begins with a set of nodes that are activated 11 (provided the information) at the first timestamp. These active nodes will be 12 contagious briefly, in that, in the next timestamp they can activate other nodes 13 they interact with, passing on the information they received. Subsequently, the 14 newly contagious nodes proceed to attempt to activate their inactive neighbors. 15 The process continues, with the information propagating through the network until 16 17 at time T there are σ(T ) active nodes in the network. In earlier work, the effect 18 of a contagiousFornode has Peerbeen limited toReviewone timestamp, which means that an 19 active node can attempt to activate its neighbors only once. However this does 20 not capture the fact that the network topology can change, with the neighbors 21 of nodes changing over time. After a contagious node has activated some of its 22 neighbors, new nodes might come in contact with it in subsequent time instances. 23 In this regard, we relax this constraint allowing a node to remain contagious when 24 confronted with new neighbors. A node can thus attempt to activate each unique 25 neighbor once. 26 When a node is surrounded by contagious nodes, it’s propensity to get activated 27 is given by an activation function. 28 Definition: The activation function for a node v, Acv() is a non-negative function 29 that maps the weights associated with the neighbors of v, wt(x, v) ∀x = neighbor(v) 30 to either 0 or 1. 31 32 We describe two Activation functions,Max and Sum, for a node v as 33 max Acv (u1, u2, ..., um) = (arg max (wtv (ui)) ≥ θv (18) 34 1≤i≤m

35 sum 36 Acv (u1, u2, ..., um) = X wtv(ui) ≥ θv (19) 37 1≤i≤m 38 Here, θv denotes the activation threshold for node v. The weights on the edges 39 represent the likelihood of that particular interaction leading to an activation. If 40 the edge between two nodes has a high weight, it indicates that if one of the nodes 41 gets activated, the chance of it activating the other is high. In our case, we define 42 the weights for an interaction based on the Sociability Index values of the nodes 43 involved, since Sociability can best capture the aforementioned property. If a node 44 is highly sociable, it has a high propensity of passing on information to other nodes 45 it interacts with. Hence, for each interaction of node x with a neighbor, y, the 46 weight of the interaction is given by 47 48 wtx(y) = SoI(y) (20) 49 Similarly wty(x) = SoI(x). Note that since we are dealing with diffusion over time, 50 the SoI(x) represents the cumulative value defined in (5) until the current time 51 point. The Sociability values thus can change over time. 52 The set of nodes activated in a given time interval i due to the initial node x and 53 the cardinality of this set are given by R (i) and σ (i) respectively. The total set 54 x x and number of nodes activated due to x after T timestamps of the diffusion process 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 29 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 29 6 7 A A 8 B B 9 Time i Time i+1 10 C C 11 12 D E D E 13 14 Fig. 5. Isolation of active nodes. The double circles indicate active nodes. The grey inner 15 circle represents contagious nodes. Nodes D and E are inactive. 16 17 are given as 18 T For PeerRx(T ) = ∪Reviewi=1Rx(i) (21) 19 20 T 21 σx(T ) = Xσx(i) (22) 22 i=1 23 It is also important to consider the effect of deleted nodes and edges. When a 24 node is not participating in any interaction in the current timestamp it is said to 25 be isolated. An isolated node cannot influence any other nodes since it has no 26 interactions. 27 28 Claim 7.1. An active node can be isolated. 29 Proof. As we mentioned earlier, the topology of the network can change at 30 every timestamp. Hence, a node that has just become active can be separated from 31 its neighbors due to the deletion of edges. The node will then remain isolated until 32 a new interaction is formed with it. 33 34 An example of this scenario is shown in Figure 5. Node A begins the diffusion pro- 35 cess activating node B. B is contagious at time i and activates node C. However 36 at the next timestamp, C no longer interacts with B, D and E. Although it is active 37 and contagious, it is isolated at this time instant. In the future, if it interacts with 38 other nodes, it can attempt to activate them once. 39 40 Influence Maximization: Influence Maximization is an important problem for 41 diffusion models and has practical applications in viral marketing and epidemiology. 42 The challenge is to find an initial set of active nodes that can influence the most 43 number of inactive nodes over the duration of the diffusion. 44 45 Problem Definition: Given a graph G that evolves over T timestamps and a 46 diffusion model, the task is to find the set of k initial nodes S to maximize RS(T ) 47 where RS(T ) = ∪x∈SRx(T ) 48 Kempe et al [Kempe et al. 2003; 2005] discuss a greedy algorithm for finding the 49 initial set that maximizes the influence. They find the start nodes that maximize 50 σ(T ), where σ(T ) = Px∈S σx(T ). To find σx(T ) for all nodes x, they simulate the 51 diffusion process over the network. However, in our case, the network is dynamic 52 with edges and nodes getting added or deleted. At a particular timestamp i, it is 53 unclear how the network is going to change at time i + 1. Hence, simulating the 54 diffusion on the static graph will not work. Considering high-degree nodes to start 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 30 of 42

1 2 3 4 5 30 · Sitaram Asur et al. 6 7 Method Activated nodes (%) Activated nodes (%) 8 Max Activation Sum Activation Random 16.67 20.39 9 Accumulated Degree 51.9 65.33 10 Influence 61.12 81 11 Table VIII. Diffusion Results 12 the diffusion process has been examined in social network research [Wasserman and 13 Faust. 1994]. However, using the degree to determine the initial nodes may not be 14 a good option [Kempe et al. 2003], since it is possible for nodes of high degree to be 15 16 clustered, which limits their range. Instead, we advocate the use of the Influence 17 Index we defined in the previous section for this purpose. The Influence Index is an 18 incremental Formeasure whic Peerh considers the Reviewbehavior of the nodes over the previous 19 timestamps and chooses nodes that have the highest degree of influence over other 20 nodes. Also, by pruning followers of influential nodes, we are ensuring that the 21 nodes with high influence index are not likely to be clustered. 22 23 Empirical Evaluation: We conducted an experiment to evaluate the performance 24 of the Influence index-based initialization. To compare, we employed an approach 25 based on accumulated degree, where we picked nodes that had the highest degree, 26 over the preceding timestamps, to be the start nodes. As a baseline, we implemented 27 a random approach where the initial nodes are chosen at random. We constructed a 28 graph using a subset of nodes from the DBLP collaboration network. We considered 29 the interactions from 1997-2001 to compute sociability, degree and influence scores. 30 We then assumed the introduction of a new idea at 2002 and then tracked its 31 diffusion through the network over the next 4 timestamps (till 2006). We used 32 an active set size, k, of 5 and both the Sum and Max activation functions. We 33 performed the experiments 100 times, choosing random activation thresholds for 34 the nodes from [0,1]. The results are shown in Table VIII. Our results suggest that 35 the Influence index can be useful in this regard. It succeeds in activating 61% and 36 81% of the nodes in the network in 4 timestamps for the Max and Sum Activation 37 functions respectively, clearly outperforming the other approaches. 38 39 40 41 8. CONCLUSION AND FUTURE WORK 42 In this paper, we have presented an event-based framework for characterizing the 43 evolution of dynamic interaction graphs. The framework is based on the use of 44 certain critical events that facilitate our ability to compute and reason about novel 45 behavior-oriented measures, which can offer new and interesting insights for the 46 characterization of dynamic behavior of such interaction graphs. We have shown 47 how semantic information can be incorporated to reason about community-based 48 events such as Merges and Splits. We have presented a diffusion model for evolving 49 networks and have shown the use of behavioral patterns for influence maximization. 50 We have demonstrated the efficacy of our framework in characterizing and reason- 51 ing on three different datasets - DBLP, Wikipedia and a clinical trials dataset. 52 The application of the behavioral patterns we obtained to a cluster link prediction 53 scenario provided favorable results, with the Sociability Index producing a large 54 number of accurate predictions. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 31 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 5 An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs · 31 6 7 9. ACKNOWLEDGEMENTS 8 9 This work is supported in part by the DOE Early Career Principal Investigator 10 Award No. DE-FG02-04ER25611, NSF CAREER Grant IIS-0347662 and NSF 11 SGER Grant IIS-0742999. We would like to thank Klaus Berberich and Evgeniy 12 Gabrilovich for providing us with the Wikipedia dataset and the category hierarchy 13 respectively. The authors would also like to thank Sameep Mehta for his useful 14 comments and suggestions, and Xintian Yang for help in processing the Wikipedia 15 dataset. 16 REFERENCES 17 18 Alkemade, F.Forand Castaldi, PeerC. 2005. Strategies Reviewfor the diffusion of innovations on social 19 networks. Computational Economics 25, 1-2. 20 Backstrom, L., Huttenlocher, D. P., and Kleinberg, J. M. 2006. Group formation in large social networks: membership, growth, and evolution. SIGKDD. 21 Barabasi, A.-L. and Bonabeau, E. 2003. Scale-free networks. Scientific American 288, 60–69. 22 Barabasi, A.-L., Jeong, H., Ravasz, R., Nda, Z., Vicsek, T., and Schubert, A. 2002. On the 23 topology of the scientific collaboration networks. Physica A 311, 590–614. 24 Berberich, K., Bedathur, S., Neumann, T., and Weikum, G. 2007. A time machine for text 25 search. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on 26 Research and development in information retrieval. ACM, New York, NY, USA, 519–526. 27 Chakrabarti, D., Kumar, R., and Tomkins, A. 2006. Evolutionary clustering. SIGKDD. 28 Clauset, A., Newman, M. E. J., and Moore, C. 2004. Finding community structure in very 29 large networks. Physical Review E 70. Couto, F. M., Silva, M. J., and Coutinho, P. 2005. Semantic similarity over the gene ontol- 30 ogy: family correlation and selecting disjunctive ancestors. Proc. ACM CIKM Intl. Conf. on 31 Information and Knowledge Management, 343–344. 32 Cowan, R. and Jonard, N. 2004. Network structure and the diffusion of knowledge. Journal of 33 Economic Dynamics and Control 28, 1557–1575. 34 Falkowski, T., Bartelheimer, J., and Spiliopoulou, M. 2006. Mining and visualizing the 35 evolution of subgroups in social networks. IEEE/WIC/ACM International Conference on 36 Web Intelligence 0. 37 Flake, G. W., Lawrence, S. R., Giles, C. L., and Coetzee, F. M. 2002. Self-organization and identification of web communities. IEEE Computer 36, 66–71. 38 Gabrilovich, E. and Markovitch, S. 2007. Computing semantic relatedness using wikipedia- 39 based explicit semantic analysis. International Joint Conference on Artificial Intelligence (IJ- 40 CAI). 41 Ganesan, P., Garcia-Molina, H., and Widom, J. 2003. Exploiting hierarchical domain structure 42 to compute similarity. ACM Trans. Inf. Syst 21, 1. 43 Girvan, M. and Newman, M. E. J. 2002. Community structure in social and biological networks. 44 Proc. National Academy of Sciences of the United States of America 99, 12, 7821–7826. 45 Govindaraju, N. K., Lin, M., and Manocha, D. 2005. General purpose computations using graphics processors. Ninth Annual Workshop on High Performance Embedded Computing. 46 Kalnis, P., Mamoulis, N., and Bakiras, S. 2005. On discovering moving clusters in spatio- 47 temporal data. Advances in Spatial and Temporal Databases 3633, 364–381. 48 Kempe, D., Kleinberg, J., and Tardos, E. 2003. Maximizing the spread of influence through a 49 social network. SIGKDD. 50 Kempe, D., Kleinberg, J., and Tardos, E. 2005. Influential nodes in a diffusion model for social 51 networks. Proc. Intl. Colloquium on Automata, Languages and Programming (ICALP). 52 Kurzak, J. and Dongarra, J. 2006. Implementing linear algebra routines on multicore processors 53 with pipelining and a lookahead,. PARA. 54 Lada, A. A., , and Adar, E. 2003. Friends and neighbors on the web. Social Networks 25, 3 (July), 211–230. 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 ACM Transactions on Knowledge Discovery from Data Page 32 of 42

1 2 3 4 5 32 · Sitaram Asur et al. 6 7 Leskovec, J., Kleinberg, J. M., and Faloutsos, C. 2005. Graphs over time: densification laws, 8 shrinking diameters and possible explanations. SIGKDD. 9 Liben-Nowell, D. and Kleinberg, J. M. 2003. The link prediction problem for social networks. 10 Proc. ACM CIKM Intl. Conf. on Information and Knowledge Management. 11 Lin, D. 1998. An information-theoretic definition of similarity. Proc. 15th Intl. Conf. Machine 12 Learning. 13 Lin, J., Vlachos, M., Keogh, E., and Gunopulos, D. 2004. Iterative incremental clustering of 14 time series. EDBT , 106–122. Newman, M. 2001. Clustering and preferential attachment in growing networks. Phys. Rev. E 64. 15 Newman, M. E. J. 2006. Modularity and community structure in networks. Proc. National 16 Academy of Sciences of the United States of America 103, 23, 8577–8582. 17 Reichardt, J. and Bornholdt, S. 2004. Detecting fuzzy community structures in complex 18 networks withFora potts mo del.PeerPhysical Review ReviewLetters 93. 19 Resnik, P. 1999. Semantic similarity in a taxonomy: An information-based measure and its 20 application to problems of ambiguity in natural language. Journal of Artifical Intelligence 21 Research 11, 95–130. 22 Richardson, R., Smeaton, A. F., and Murphy, J. 1994. Using WordNet as a knowledge base 23 for measuring semantic similarity between words. Tech. Rep. CA-1294, Dublin, Ireland. Samtaney, R., Silver, D., Zabusky, N., and Cao, J. 1994. Visualizing features and tracking 24 their evolution. IEEE Computer 27, 7, 20–27. 25 Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., and Schult/, R. 2006. Monic - modeling and 26 monitoring cluster transitions. SIGKDD. 27 Strehl, A. and Ghosh, J. 2002. Cluster ensembles a knowledge reuse framework for combining 28 multiple partitions. Journal on Machine Learning Research 3, 583–617. 29 van Dongen, S. 2000. A cluster algorithm for graphs. Tech. Rep. INS-R0010, National Research 30 Institute for Mathematics and Computer Science in the Netherlands, Amsterdam. 31 Wasserman, S. and Faust., K. 1994. Social network analysis. Cambridge University Press. 32 Yang, H., Parthasarathy, S., and Mehta, S. 2005. Mining spatial object patterns in scientific data. Proc. 9th Intl. Joint Conf. on Artificial Intelligence. 33 34 35 Received February 1986; November 1993; accepted January 1996 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 ACM Transactions on Computational Logic, Vol. 2, No. 3, 09 2001. 57 58 59 60 Page 33 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 3 4 An Event-based Framework for Characterizing the 5 Evolutionary Behavior of Interaction Graphs ∗ 6 7 8 9 Sitaram Asur Srinivasan Parthasarathy Duygu Ucar 10 Department of Computer Department of Computer Department of Computer 11 Science and Engineering Science and Engineering Science and Engineering 12 Ohio State University Ohio State University Ohio State University Columbus, OH Columbus, OH Columbus, OH 13 [email protected] [email protected] [email protected] 14 15 16 ABSTRACT of interest and edges mimic the interactions among them. 17 Interaction graphs are ubiquitous in many fields such as These interaction networks arise from a wide variety of sci- entific domains like computer science, physics, biology and 18 bioinformatics, sociology and physical sciences. There have For Peer Reviewsociology. Online communities such as Flickr, MySpace and 19 been many studies in the literature targeted at studying and mining these graphs. However, almost all of them have stud- Orkut, e-mail networks, co-authorship networks and WWW 20 networks are examples of interesting interaction networks. 21 ied these graphs from a static point of view. The study of the evolution of these graphs over time can provide tremen- The study of these complex interaction networks can pro- 22 dous insight on the behavior of entities, communities and the vide insight into their structure, properties and behavior. 23 flow of information among them. In this work, we present Early research on these networks [11, 17, 4, 5], has pri- 24 an event-based characterization of critical behavioral pat- marily focused on static properties such as their modular 25 terns for temporally varying interaction graphs. We use nature, neglecting the fact that most real-world interaction 26 non-overlapping snapshots of interaction graphs and develop networks are dynamic in nature. In reality, many of these networks constantly evolve over time, with the addition and 27 a framework for capturing and identifying interesting events from them. We use these events to characterize complex be- deletion of edges and nodes representing changes in the in- 28 teractions among the modeled entities. Recently, there has 29 havioral patterns of individuals and communities over time. We demonstrate the application of behavioral patterns for been some interest in studying dynamic graphs [14, 3, 6, 9]. 30 the purposes of modeling evolution, link prediction and in- Identifying the portions of the network that are changing, 31 fluence maximization. Finally, we present a diffusion model characterizing the type of change, predicting future events 32 for evolving networks, based on our framework. (e.g. link prediction), and developing generic models for 33 evolving networks are challenges that need to be addressed. 34 Categories and Subject Descriptors: H.2.8 Database For instance, the rapid growth of online communities has 35 Management: Database Applications - Data Mining dictated the need for analyzing large amounts of temporal data to reveal community structure, dynamics and evolu- 36 General Terms: Algorithms, Measurement tion. We believe that studying the evolution of clusters of 37 these networks, in particular their formation, transitions and 38 Keywords: Interaction networks, Evolutionary analysis, dissolution, can be extremely useful for effectively character- 39 Diffusion of innovations izing the corresponding changes to the network over time. 40 Another important aspect is the behavior of the nodes of 41 the network. Nodes of an evolving interaction network rep- resent entities whose interaction patterns change over time. 42 1. INTRODUCTION 43 The movement of nodes, their behavior and influence over Many social and biological systems can be represented as other nodes, can help make inferences regarding future in- 44 complex interaction networks where nodes represent entities teractions as well as predicting changes to communities in 45 the network. For instance, in a social network, if a person ∗ 46 This work is supported in part by the DOE Early Career is very sociable, the chances of this person interacting with 47 Principal Investigator Award No. DE-FG02-04ER25611 and new people and joining new groups is very high. The influ- NSF CAREER Grant IIS-0347662. The authors would like 48 to thank Sameep Mehta for his useful comments and sug- ence exerted by a node can be studied in terms of its effect 49 gestions. on other nodes. If several other people join a community 50 when a particular individual does, it indicates a high degree 51 of positive influence for that person. 52 The study of diffusion or flow of information in an evolv- ing network is important for social science research, viral 53 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are marketing applications and epidemiology. For instance, pan- 54 not made or distributed for profit or commercial advantage and that copies demic viruses pose a severe threat to society due to their po- 55 bear this notice and the full citation on the first page. To copy otherwise, to tential to spread rapidly and cause tremendous widespread 56 republish, to post on servers or to redistribute to lists, requires prior specific illnesses and deaths. In viral marketing, the goal is to prop- permission and/or a fee. 57 KDD’07, August 12–15, 2007, San Jose, California, USA. agate an idea or innovation through an interaction network. 58 Copyright 2007 ACM 978-1-59593-609-7/07/0008 ...$5.00. 59 60 ACM Transactions on Knowledge Discovery from Data Page 34 of 42

1 2 Analysis of the evolution of interactions in a social network clusters for a set of objects while also maintaining similarity 3 and the identification of influential nodes can be used to with clusters identified in previous timestamps. Falkowski 4 devise effective containment policies for pandemic disease et al [9] analyze the evolution of communities that are stable word- 5 spread in the case of epidemiological studies, as well as or fluctuating based on subgroups. Although they analyze of-mouth advertising in marketing. interaction graphs, their focus is different from ours. While 6 In this paper, we provide an event-based framework for they apply standard statistical measures to identify persis- 7 characterizing the evolution of interaction networks. We be- tent subgroups, our focus is on identifying key events and 8 gin by converting an evolving graph into static snapshot behavioral patterns that can characterize, model and predict 9 graphs at different time points. We obtain clusters at each future behavioral trends. 10 of these snapshots independently. Next, we characterize the The seminal paper by Samtaney et al [19] described an 11 transformations of these clusters by defining and identifying approach for extracting coherent regions from 2-dimensional 12 certain critical events, which can be efficiently detected us- and 3-dimensional scalar and vector fields for tracking pur- 13 ing bit-matrix computations. We use these critical events poses. To study the evolution of these regions over time, to compute and reason about novel behavior-oriented mea- they present certain evolutionary events for objects. Event- 14 sures, which offer new and interesting insights for the char- based methods have also been applied on spatial data [23] 15 acterization of dynamic behavior of interaction graphs. We and clustered stream data [20]. 16 illustrate our framework on two different evolving networks 17 - the DBLP co-authorship network and a clinical trials pa- 3. PROBLEM DEFINITION 18 tient network. In each case, the behavioral patterns that For Peer ReviewOur focus in this work is to study the evolution of graphs, we discover using our framework help us make useful infer- 19 in particular to understand behavioral patterns for commu- ences about cluster evolution and link prediction. Finally, 20 nities and individuals over time. In this regard, it becomes we use the behavioral measures to detail a diffusion model 21 necessary to study and characterize the transformations un- for evolving networks and demonstrate their application for 22 dergone by the graph at different time instants along the the task of influence maximization. way. For this purpose, we make use of temporal snapshots 23 In short, the key contributions of this work are 24 to examine static versions of the evolving network at differ- 25 • The identification of key critical events that occur in ent time points. 26 evolving interaction networks using efficient incremen- tal algorithms. Definition: An interaction graph G is said to be evolving 27 if its interactions vary over time. Let G = (V, E) denote 28 • Novel behavioral measures for stability, sociability, in- a temporally varying interaction graph where V represents 29 fluence and popularity that can be computed incre- the total unique entities and E the total interactions that 30 mentally over time exist among the entities. We define a temporal snapshot 31 Si = (Vi, Ei) of G to be a graph representing only entities • A diffusion model for evolving networks based on our 32 and interactions active in a particular time interval [Tsi , Tei ], framework 33 called the snapshot interval. 34 • Application of the events and behavioral measures on As the graph evolves, new nodes and edges can appear. two real datasets for modeling evolution, predicting be- 35 Similarly, nodes and edges can also cease to exist. This dy- havior and trends (link prediction) and influence max- 36 namic behavior of a graph over time can thus be represented imization. 37 as a set of S equal, non-overlapping temporal snapshots. 38 Note that, in this work different snapshots are mutually 39 2. RELATED WORK exclusive. This is in contrast to the representation provided 40 There has been enormous interest in mining interaction in some earlier research [6, 16] which define a snapshot con- 41 graphs for interesting patterns in various domains. How- sidering all the interactions upto the current time interval. 42 ever, the majority of these studies [11, 17, 4, 5, 7, 18, 10] Figure 1(a-b) illustrates an example evolving graph over two time intervals. We find that in the first time interval, inter- 43 have focused on mining static graphs to identify commu- nity structures, patterns and novel information. Recently, actions exist between A and C, and between A and D. In 44 the dynamic behavior of clusters and communities have at- the second time interval, these interactions do not continue 45 tracted the interest of several groups. Leskovec et al [14] to exist. Figure 1(c) depicts a cumulative snapshot of the 46 studied the evolution of graphs based on various topological second time interval. We find that the information regarding 47 properties, such as the degree distribution and small-world the loss of interactions AC and AD is lost. Also, the com- 48 properties of large networks. They proposed a graph gen- munity structure depicted in Figure 1(c) does not reflect 49 eration model, called Forest Fire model, to explain their the actual structure. To prevent this loss of information, 50 findings about evolutionary behaviors of graphs. Backstrom we choose short time intervals and generate snapshots rep- 51 et al [3] studied formation of groups and the ways they grow resenting only the information of that specific interval. The collection of all T temporal snapshots is represented by S = 52 and evolve over time. To estimate probability of an indi- vidual joining a community, they proposed using features {S1, S2, . . . , ST }. 53 of communities and individuals, applying decision-tree tech- To study the evolution of the graph, we need a represen- 54 niques. tation of its structure at different snapshots. For this pur- 55 Chakrabarti et al [6] proposed evolutionary settings for pose, we generate clusters for each snapshot graph. Each 56 two widely-used clustering algorithms (k-means and agglom- Si is partitioned into ki communities or clusters denoted by 1 2 ki th j 57 erative hierarchical clustering). They define evolutionary Ci = {Ci , Ci , . . . , Ci }. The j cluster of Si, Ci is also a j j j j 58 clustering as the task of incrementally obtaining high-quality graph denoted by (Vi , Ei ) where Vi are nodes in Si and 59 60 Page 35 of 42 ACM Transactions on Knowledge Discovery from Data

1 2 A B F A B F A B F chose the snapshot interval to be a year, resulting in 10 con- 3 E E E secutive snapshot graphs. These graphs are then clustered 4 C D G C D G C D G and analyzed to identify critical events and patterns. It has 5 been shown that collaboration networks display many of the T1 T2 T1,2 structural features of social networks [12]. Hence, this is a 6 good representative dataset for this study. We believe that 7 Figure 1: Temporal Snapshots a) at Time t=1 b) at studying the evolution of the DBLP dataset can afford in- 8 Time t=2 c) Cumulative snapshot at Time t=2 formation about the nature of collaborations and the factors 9 j j that influence future collaborations between authors. Ei denotes the edges between nodes in Vi . Finally, for each 10 1 2 ki Si = (Vi, Ei), V ∪ V ∪ . . . ∪ V = Vi. 11 i i i Clinical Trials Data: In clinical trials, pharmaceutical To choose a clustering algorithm for this work, we exam- companies test a new drug for efficacy and toxicity - effi- 12 ined the performance of various graph clustering algorithms 13 cacy to evaluate its effectiveness in curing or controlling the on several interaction graphs in terms of modularity of clus- disease in question and toxicity to determine if the drug is 14 ters. We found that the MCL algorithm [21], a fast and scal- safe for consumption and with minimal side effects. In this 15 able unsupervised clustering algorithm, consistently yielded paper we use datasets obtained from a major pharmaceuti- 16 clusters of high modularity. Hence, we use MCL to obtain 1 cal company, consisting of healthy people as well as patients 17 the clusters at different timestamps . The MCL algorithm suffering from certain diseases. As part of the study, they 18 does not require a parameter specifying the number of clus- were given either a placebo (a formulation that includes only ters. Instead it uses a granularitFory parameter andPeerthe cluster Review 19 the inactive ingredients) or the drug under study. Liver tox- structure prevalent in the graph to determine the number 20 icity information can be obtained from eight serum analytes of partitions. Accordingly, for each snapshot, the number (often referred to in the literature as the liver panel). The 21 of clusters may vary depending on the interactions in that 22 initial snapshot of this data is composed of the measure- time interval. We used a granularity parameter of 1.2 for ments of the analytes obtained before patients were treated 23 our experiments, since the graphs were fairly sparse. with the drug or the placebo. The subsequent snapshots 24 Algorithm 1 shows the outline of the framework we pro- correspond to measurements taken every week. The data incremental strategy 25 pose. We design an to mine the clusters thus consisted of 7 snapshots spanning a 6 week period since 26 over time to identify significant changes that occur among the beginning of the treatment. We transformed the data critical events. 27 snapshots, referred to as These events are for each snapshot into a graph, based on the correlations then used to study more complex behavioral patterns. In that exist between the analyte values of patients. If there 28 Section 5, we will describe the critical events and how we 29 exists a high correlation (greater than a threshold (Tcorr) find them. In Section 6, we mine these events further to between two patients, there will be an edge between them 30 find complex behavioral patterns for analysis. 2 31 in the snapshot graph . Note that, if we consider each patient separately, we are limited to only intrinsic informa- 32 Algorithm 1 Mine-Events(G,T ) Input: Interaction graph G = (V, E) and T , the tion, whereas by modeling patients as a graph, we are able 33 number of intervals to utilize intrinsic as well as extrinsic properties. 34 Convert graph G = (V, E) into T temporal 35 snapshots S = {S1, S2, . . . , ST }. for i = 1 to T do 5. CRITICAL EVENTS 36 Cluster Si 1 2 ki In this section, we introduce and afford a formal definition 37 Ci = {Ci , Ci , . . . , Ci } 38 end for to certain critical events that occur in evolving graphs. Some for i = 1 to T − 1 do of the critical events described in this section are inspired 39 Events = Find events(Si,Si+1) [Section 5] Mine Events for complex patterns [Section 6] by a similar notion described by Samtaney et al [19]. in the 40 end for context of tracking and visualizing features. 41 The events that we define are primarily between two con- 42 secutive timestamps but it is possible to coalesce events from 43 4. DATASETS contiguous timestamps by analyzing the meta-data collected 44 We employ two different datasets in this work. from the event mining framework. We distribute the critical 45 events which graphs can undergo into two categories - events 46 DBLP co-authorship network: We used a subset of the involving communities and events involving individuals. Figure 2 displays a set of snapshots of the network which 47 DBLP bibliography (http://dblp.uni-trier.de/) to generate a co-authorship network representing authors publishing in will be used as a running example in this section. At time 48 several important conferences in the field of databases, data t = 1, 2 clusters are discovered (shown in different colors). 49 mining and AI. We chose all papers that appeared in 28 50 key conferences over a 10 year period (1997-2006) spanning Events involving communities: We define 5 basic events 51 mainly these three areas. We converted this data into a which clusters can undergo between any two consecutive 52 co-authorship graph, where each author is represented as time intervals or steps. Let Si and Si+1 be snapshots of 53 a node and an edge between two authors corresponds to a S at two consecutive time intervals with Ci and Ci+1 denot- 54 joint publication by these two authors. The graph span- ing the set of clusters respectively. 55 ning 10 years contained 23136 nodes and 54989 edges. We 56 1Note that, since our proposed framework operates on top 57 of the clusters, it is relatively independent of the clustering 2We examined the distribution of correlations and picked a 58 algorithm used to obtain the snapshot clusters. Tcorr value of 0.7 for our experiments 59 60 ACM Transactions on Knowledge Discovery from Data Page 36 of 42

1

2 C1 C1 Time t=4 in Figure 2 shows a split event when a cluster 1 2 3 gets completely split into three smaller clusters.

4 k 4) Form:A new cluster Ci+1 is said to have been formed if 5 2 2 1 C1 C2 C 3 none of the nodes in the cluster were grouped together at 6 k t= 1 t= 2 t= 3 the previous time interval i.e. no 2 nodes in Vi+1 existed in 7 1 1 1 C the same cluster at time period i. C C 6 4 5 k j k j 8 C2 C3 F orm(C +1) = 1 iff ∃ no C such that V +1 ∩ V > 1 6 6 i i i i 9 Intuitively, a form indicates the creation of a new commu-

2 3 2 3 4 4 5 nity or new collaboration. Figure 2 at time t=5 shows a 10 C C C C C C C 11 4 4 5 5 5 6 6 form event when two new nodes appear and a new cluster 12 t= 4 t= 5 t= 6 is formed. 13 Figure 2: Temporal Snapshots at time t=1 to 6 5) Dissolve: A single cluster Ck is said to have dissolved 14 i The five proposed events are: if none of the vertices in the cluster are in the same cluster 15 j in the next timestamp i.e. no two entities in the original 1) Continue: A cluster Ci+1 is marked as continuation 16 j cluster have an interaction between them in the current time of Ck if V is the same as V k. We do not impose the 17 i i+1 i interval. constraint that the edge sets should be the same. k j k j 18 k j k j Dissolve(Ci ) = 1 iff ∃ no Ci+1 such that Vi ∩ Vi+1 > 1 Continue(Ci , C ) = 1 iff VFori = V Peer Review i+1 i+1 Intuitively, a dissolve indicates the lack of contact or in- 19 The main motivation behind this is that if certain nodes are teractions between a group in a particular time period. This 20 always part of the same cluster, any information supplied to might signify the breakup of a community or a workgroup. 21 one node will eventually reach the others. The addition and Figure 2 at time t=6 shows a dissolve event when there are 22 deletion of edges merely indicates the strength between the no longer interactions between the three nodes in Cluster 23 nodes. An example of a continue event is shown at t=2 in 1 C5 resulting in a breakup of the cluster into 3 clusters. 24 Figure 2. Note that an extra interaction appears between the nodes in Cluster C1 but the clusters do not change. 25 2 Events involving individuals: We wish to analyze not 26 only the evolution of communities but the influence of the 2) κ-Merge: Two different clusters Ck and Cl are marked i i behavior of individuals on communities. In this regard, we 27 as merged if there exists a cluster in the next timestamp introduce four basic transformations involving individuals 28 that contains at least κ% of the nodes belonging to these over snapshots. 29 two clusters. The essential condition for a merge is : k l j 30 Merge(Ci , Ci , κ) = 1 iff ∃Ci+1 such that j 31 1) Appear: A node is said to appear when it occurs in Ci k l j but was not present in any cluster in the earlier timestamp. |(Vi ∪ Vi ) ∩ V | 32 i+1 > κ% (1) k l j Appear(v, i) = 1 iff v ∈/ Vi−1 and v ∈ Vi 33 Max(|Vi ∪ Vi |, |Vi+1|) This simple event indicates the introduction of a person (new 34 k l or returning) to a network. In Figure 2, at time t = 5 two k j |Ci | l j |Ci| 35 and |Vi ∩ Vi+1| > 2 and |Vi ∩ Vi+1| > 2 . This condi- new nodes appear in the network. k l 36 tion will only hold if there exist edges between Vi and Vi 37 in timestamp i + 1. Intuitively, it implies that new inter- 2) Disappear: A node is said to disappear when it was actions have been created between nodes which previously found in a cluster Cj but is not present in any cluster in 38 3 i−1 were part of different clusters. This caused κ% of nodes 39 the timestamp i. in the two original clusters to join the new cluster. Figure 2 Disappear(v, i) = 1 iff v ∈ Vi 1 and v ∈/ Vi 40 − shows an example of a complete merge event (κ = 100) at This indicates the departure of a person from a network. In 41 4 t=3. The dotted lines represent the newly created edges. Figure 2, at time t = 6 two nodes of cluster C5 disappear 1 42 All the nodes now belong to a single cluster (C3 ). from the network. 43 3) κ-Split: A single cluster Cj is marked as split if κ% of j 44 i 3) Join: A node is said to join cluster Ci if it exists in the 45 nodes from this cluster are present in 2 different clusters in cluster at timestamp i. This may be due to an Appear event the next timestamp. The essential condition is that: or due to a leave event from a different cluster. Note that in 46 j k l j Split(C , κ) = 1 iff ∃C +1, C +1 such that i i i case, the cluster Ci must be sufficiently similar to a cluster 47 k k l j Ci−1 48 |(Vi+1 ∪ Vi+1) ∩ Vi | j > κ% (2) 49 k l |Ck | Max(|Vi+1 ∪ Vi+1|, |Vi |) Join(v, Cj ) = 1 iff ∃Ck such that Ck ∩ Cj > i−1 50 i i−1 i−1 i 2 k l k j k j |Ci+1| l j |Ci+1| and v ∈/ Vi−1 and v ∈ Vi 51 and |Vi+1 ∩ Vi | > 2 , |Vi+1 ∩ Vi | > 2 . j 52 The cluster similarity condition ensures that Ci is not a newly formed cluster. This condition differentiates a Join 53 Intuitively, a split signifies that the interactions between certain nodes are broken and not carried over to the current event from a F orm event. Nodes forming a new cluster 54 will not be considered to be Join events since there will be 55 timestamp, causing the nodes to part ways and join different k clusters. no cluster Ci−1 in the previous timestamp with similarity 56 |Ck | > i−1 with the newly formed cluster. 57 2 58 3We used a κ value of 50 in our experiments. 59 60 Page 37 of 42 ACM Transactions on Knowledge Discovery from Data

1 k 2 4) Leave: A node is said to leave cluster Ci−1 if it no of Ti. The construction of the matrices and the operations k 3 longer is present in a cluster with most of the nodes in Vi−1. to find the events are all linear in time complexity(O(n)), 4 A node that leaves a cluster may leave the network as a assuming that ki << n and ki+1 << n. We show timing 5 Disappear event or may join a different cluster. In a col- results in our technical report [2]. laboration network, a Leave event might correspond to a 6 student graduating and leaving a group. 7 j j k k j 6. BEHAVIORAL ANALYSIS 8 Leave(v, Ci ) = 1 iff ∃Ci and Ci−1 such that Ci−1 ∩ Ci > k Most of the research on evolving networks [3, 9] have 9 |Ci−1| k j 2 and v ∈ Vi−1 and v ∈/ Vi focused solely on analyzing community behavior. In this 10 The similarity constraint between the two clusters is used section, we begin by presenting some interesting results on 11 to maintain cluster correspondence. Note that if the original community behavior and then move on to study evolution 12 cluster dissolves, the nodes in the cluster are not said to from a new perspective, by considering the behavior of nodes 13 participate in a Leave event. This is due to the fact that in the network. By analyzing the community-based events k 14 |Ci−1| obtained, we observed several interesting merge and split there will no longer be a cluster with similarity > 2 15 k events in the DBLP dataset, that afforded insight into inter- with the dissolved cluster Ci−1. 16 esting relationships between group collaborations as well as 17 5.0.1 Algorithms for Event Extraction: the evolution of topics. For instance, let us consider a clus- ter merge event that occurred in the 2005-2006 time interval. 18 We leverage the use of efficienFort bit matrix Peeroperations to ReviewOur algorithm identified two groups (one from Germany and 19 compute the events between snapshots. First, for each tem- one from Italy) who independently published articles in dif- 20 poral snapshot, we construct a binary ki ×n matrix Ti where ferent conferences in 2005. 21 ki is the number of clusters at timestamp i and n is the num- ber of nodes. We then compare the matrices of successive 22 4 Cluster 1 in 2005 snapshots to find events between them . Let Ti(x, :) and 23 th th AAAI 2005: Niels Landwehr, Kristian Kersting, Luc Ti(:, y) correspond to the x row and y column vector of 24 De Raedt: nFOIL: Integrating Na¨ıve Bayes and FOIL matrix Ti respectively. To compute all the events between AAAI 2005: Luc De Raedt, Kristian Kersting, Sunna 25 two snapshots, we perform a set of binary operations (AND 26 Torge: Towards Learning Stochastic Logic Programs from and OR) on the corresponding matrices. The linear opera- Proof-Banks. 27 tions performed to identify each event are presented below. 1 Cluster 2 in 2005 28 Let |x|1 represent the L -norm of a binary vector x. ICML 2005 : Sauro Menchetti, Fabrizio Costa, Paolo 29 |x| Frasconi: Weighted Decomposition Kernels. 30 IJCAI 2005 : Andrea Passerini and Paolo Frasconi: |x|1 = X xi (3) 31 i=1 P. Kernels on Prolog Ground Terms. 32 We can compute the events as : Dissolve(T ,T ) = {x|1 ≤ x ≤ k , arg max (|AND(T (x, : 33 i i+1 i 1≤y≤ki+1 i Merged Cluster in 2006 ), T (y, :))| ≤ 1} 34 i+1 1 ILP 2006 : Niels Landwehr, Andrea Passerini, Luc De Raedt, Paolo Frasconi: kFOIL: Learning Simple Re- 35 Form(Ti,Ti+1) = Dissolve(Ti+1,Ti) lational Kernels 36 Merge(Ti,Ti+1,κ) = {< x, y, z > |1 ≤ x ≤ ki, From the merge event, we can hypothesize that Niels Landwehr 37 1 ≤ y ≤ ki, x 6= y, 1 ≤ z ≤ ki+1, and Luc De Raedt, who were working on Inductive Logic in |AND(OR(Ti(x, :), Ti(y, :)), Ti+1(z, :))|1 ≥ κ, 38 |Ti(x,:)|1 2005 are collaborating on Passerini and Frasconi who worked |AND(Ti(x, :), Ti+1(z, :))|1 ≥ 2 , 39 |Ti(y,:)|1 separately on kernels and the resultant paper is a combina- |AND(Ti(y, :), Ti (z, :))| ≥ } 40 +1 1 2 tion of these ideas. 41 Split(Ti,Ti+1,κ)= Merge(Ti+1,Ti,κ) Indeed, in the abstract of the 2006 paper, the authors de- scribe the paper as ”A novel and simple combination of in- 42 Continue(T ,T )= {< x, y > |1 ≤ x ≤ k , 1 ≤ y ≤ k , i i+1 i i+1 ductive logic programming with kernel methods is presented. 43 OR(Ti(x, :), Ti+1(y, :)) == AND(Ti(x, :), Ti+1(y, :))} The kFOIL algorithm integrates the well-known inductive 44 Appear(Ti,Ti+1) = {v|1 ≤ v ≤ |V |, logic programming system FOIL with kernel methods.” |T (:, v)| == 0, |T (:, v)| == 1} 45 i 1 i+1 1 One relatively simple conclusion we could make from our

46 Disappear(Ti,Ti+1) = {v|1 ≤ v ≤ |V |, observations is that the propensity of a merger between clus- 47 |Ti(:, v)|1 == 1, |Ti+1(:, v)|1 == 0} ters seems to be dependent on two main factors - the prox- 48 imity or sociability of the authors and the similarity of the Join(Ti,Ti+1) = {< y, v > |1 ≤ y ≤ ki+1, 1 ≤ v ≤ |V |, Ti+1(y, v) == 49 1, topics of the papers involved. We have obtained several x, x k AND T x, , T y, > |Ti(x,:)|1 , 50 ∃ 1 ≤ ≤ i s.t. | ( i( :) i+1( :))|1 2 other interesting results for community behavior which we T (x, v) == 0} 51 i have detailed in our technical report [2]. Next, we study evolution from the perspective of nodes in the network. Our 52 Leave(Ti,Ti+1)= {< x, v > |1 ≤ x ≤ ki, 1 ≤ v ≤ |V |, Ti(x, v) == 1, |Ti(x,:)|1 goal is to capture the behavioral tendencies of individuals 53 ∃y, 1 ≤ y ≤ ki+1 s.t. |AND(Ti(x, :), Ti+1(y, :))|1 > , 2 that contribute to the evolution of the graph. 54 Ti+1(y, v) == 0} th th We define four behavioral measures that can be incremen- Ti(x, y) represents the value in the x row and y column 55 tally computed at each time interval using the events discov- 56 4If the number of nodes changes between the timestamps, we ered in the current interval. 57 will increase the length of the matrices to reflect the largest 58 of the two 59 60 ACM Transactions on Knowledge Discovery from Data Page 38 of 42

1 2 6.1 Stability Index of intervals that node x is active. Similar to the stability 3 The Stability index measures the tendency of a node to index, this is computed incrementally. The measure gives 4 have interactions with the same nodes over a period of time. high scores to nodes that are involved in interactions with 5 A node is highly stable if it belongs to a very stable clus- different groups. ter, one that does not change much over time. Let cli(x) Note that this measure does not represent the degree, 6 th which is a factor of the number of interactions a node is 7 represent the cluster that node x belongs to in the i time interval. involved in. A case in point was a node that had a degree 8 The Stability Index (SI) for node x over T timestamps is of 80, but a sociability of close to 0. When we examined 9 measured incrementally as: 10 the clusters the node belonged to, we found that the node T 11 |cli(x)| interacted with the same nodes over several timestamps. SI(x, T ) = X 12 Vi i=1 Pj=1(Leave(j, cli(x)) + Join(j, cli(x))) 13 Application of Sociability for Link Prediction: The (4) goal in link prediction [15] is to use past interaction infor- 14 Stability for the Clinical Trials Dataset: In the case mation to predict future links between nodes. Since, in this 15 of the clinical trials data, nodes in a cluster correspond to paper we are analyzing the evolution of clusters, our goal 16 individuals having similar observations. When a node has a is to predict future co-occurrences of nodes in clusters. In 17 low Stability index score, it indicates that the observations the case of the DBLP collaboration network, two nodes are 18 of that particular patient fluctuate appreciably. This causes clustered together if they work on related papers or belong the node to jump from one clusterForto another Peerrepeatedly. Review 19 to the same work-group, as we have seen. We can make use This behavior represents an anomaly and can indicate pos- 20 of the behavioral patterns we have discovered, in particular sible side-effects of the drug being administered. Note that the Sociability Index, to predict the likelihood of authors 21 the dataset contains two groups of people, one group on the 22 being clustered together in the future. placebo and the other group on the drug with a distribution If an author has a high Sociability Index score, the chances 23 of 40:60. If the people with very low Stability index (out- of him/her joining a new cluster in the future is very high. 24 liers in this case) happen to be people on the drug, there We compute the Sociability Index scores for authors using 25 is a reasonable indication that there may be a hepatoxic ef- Equation 5. We use a degree threshold to prune these au- 26 fect from the drug intake, whereas if it is uniform over both thors. 5 We find all authors who have high Sociability index 27 sets of people, it would indicate there are no noticeable side- scores (> 0.75) and degree higher than the threshold and effects. who have not been clustered together in the past. We then 28 Accordingly, we computed the Stability index for all the 29 predict future cluster co-occurrences between them. nodes in the clinical trial data. On examination, we found The seminal paper on link prediction [15] provided an em- 30 19 nodes having very low Stability index scores (below a 31 pirical analysis of several techniques for link prediction. We threshold). This indicates that these nodes move between adopt the same scenario and split our DBLP snapshots into 32 clusters in almost every time interval that they are active. two parts. We use the clusterings for the first 5 years (1997- 33 In this particular application unstable nodes (patients) 2001) to predict new cluster co-occurrences for the next 5 34 are a cause for concern since that may be an indication of years. Note that we are only considering new links between 35 toxicity. Drilling down on the nineteen most unstable nodes authors. Hence we consider only authors that have not been 36 we find that only one of them is on the placebo. In fact clustered together previously. drilling down even further it is observed that out of the top 37 Similar to the evaluation performed by Liben-Nowell and 200 most unstable patients, eighty percent were on the drug. Kleinberg [15], we use as our baseline a random predictor 38 This indeed was very suspicious given the original distribu- 39 that randomly predicts pairs of authors who have not been tion (40:60) and points to potential toxicity. As it turns clustered together before, and report the accuracy of all the 40 out, according to domain experts, this drug was dis- methods relative to the random predictor. To perform com- 41 continued for toxicity two years after this study was parisons, we implement three other approaches that were 42 conducted. We should also note that when we applied the shown to perform well by the authors - Common Neighbor- 43 same procedure on a clinical study where the drug in ques- based, Adamic-Adar and the Jaccard coefficient. For more 44 tion was considered safe and met with FDA approval we details on these approaches please refer to our technical re- did not find such a pattern of behavior. 45 port [2]. 46 6.2 Sociability Index We used all the algorithms to predict cluster links for the last 5 years (2002-2006). We only considered pairs of au- 47 For the DBLP data, we define a related measure, called 48 thors who have not been clustered together in any of the the Sociability Index. The Sociability Index is a measure of 5 earlier snapshot graphs. The accuracy was computed as 49 the number of different interactions that a node participates a factor of the random predictor [15], which was found to 50 in. This behavior can be captured by the number of Join give a correct result with probability 0.14%. The results 51 and Leave events that this node is involved in. Let cli(x) are shown in Table 1. We find that the Sociability Index- 52 be the cluster that node x belongs to at time i. based method performs the best overall, outperforming other 53 approaches appreciably with a large ratio of correct predic- 54 Then, the Sociability Index is defined as: tions (275). This result suggests that behavioral patterns of T evolving graphs can be used to predict future behavior. 55 (Join(x, cli+1(x)) + Leave(x, cli(x))) SoI(x) = Pi=1 (5) 56 |Activity(x)| 57 T 5 58 where Activity(x) = Pi=1(x ∈ Vi) indicates the number The threshold value we used was 50 papers 59 60 Page 39 of 42 ACM Transactions on Knowledge Discovery from Data

1 Predictor Accuracy 2 Random Predictor Probability 0.14% 3 Sociability Index 275 4 Common Neighbors 25 Adamic-Adar 46 5 Jaccard Coefficient 23 6 Table 1: Cluster Link Prediction Accuracy. Accuracy 7 score specifies the factor improvement over the random 8 predictor. This method of evaluation is consistent with 9 the one performed by Liben-Nowell and Kleinberg [15]. 10 11 12 6.3 Popularity Index 13 The Popularity index is a measure defined for a cluster 14 or community at a particular time interval. The Popularity 15 Index of a cluster at time interval [i, i + 1] is a measure of 16 the number of nodes that are attracted to it during that 17 interval. It is defined as: 18 Vi Vi j j For Peerj Review 19 P I(Ci ) = (X Join(x, Ci )) − (X Leave(x, Ci )) (6) 20 x=1 x=1 Figure 3: Illustration of certain authors belonging to a very popular cluster (1999-2000 time period). Original 21 This measure is based on the transformation a cluster un- cluster (3 authors) shown in small box. 22 dergoes over the course of a time interval. If a cluster does 23 not dissolve in [i, i + 1] and a large number of nodes join the 24 cluster and few leave it, then the cluster will have a high nodes that leave or join a cluster when x does. If a large 25 Popularity Index score. Note, that the Popularity index is number of nodes leave or join a cluster with high frequency an influence measure defined for a cluster. 26 when a certain node x does, it suggests that node x has a In the DBLP dataset, the popularity index can be used to 27 certain positive influence on the movement of the others. find topics of interest for a particular year. For instance, if Let Companions(x) represent all nodes over all timestamps 28 a large number of nodes join a cluster at a particular time 29 that join or leave clusters with node x. The Influence for point and a high percentage of them are working on a specific node x is given by: 30 topic, it indicates a buzz around that topic for that year. On the other hand, if a large number of authors leave a cluster, |Companions(x)| 31 Inf(x) = (7) 32 and there are not many new nodes joining it, it indicates a |Moves(x)| loss of interest in a particular topic. 33 Here Moves(x) represents the number of Join and Leave 34 To find hot topics, we computed the popularity index 6 scores for each cluster, and identified the most popular clus- events x participates in. Note that, this definition by it- 35 ters, at each timestamp. We then examined the clusters that self, does not measure influence, since nodes that interact 36 had high popularity scores to see if a large percentage of the and move along with highly influential nodes will have high 37 authors in them were working on a particular topic. Influence score values as well. Hence, to eliminate these fol- 38 We will now present an interesting result we obtained for lower nodes, additional pruning constraints are needed. 39 the time span 1999-2000. In 1999, three authors Stefano Let Max Int(x) denote the node with which node x has the 40 Ceri, Piero Fraternali and Stefano Paraboschi formed a clus- maximum number of interactions. Let Deg(x) denote the number of neighbors of node x. 41 ter. They were involved in a few papers on XML and web applications. In the next year (2000), these three authors Influence Index(x) = Inf(x) unless any of the following 42 hold : 43 were involved in a large number of collaborations, result- ing in around 50 joins to their cluster. When we examined • Inf(Max Int(x)) > Inf(x) 44 the topics of the papers that resulted, we found that 30 of 45 these authors published papers related to XML. Since there • Deg(Max Int(x)) > Deg(x) 46 were no papers on XML before 1999, this was a new and 47 hot topic at that point. Since then there have been large If any of the two conditions hold, Influence Index(x) = 0. 48 number of papers on XML. Figure 3 shows the original 3 The additional constraints are imposed in order to ensure 49 person cluster as well as the authors from the new cluster that we find the most influential nodes in the datasets. 50 who were involved in XML related work in that particular We computed the Influence Index scores for nodes in the DBLP dataset. The top 20 authors are shown in Table 2. 51 time interval. We further illustrate the use of the Influence index in the 52 6.4 Influence Index next section. 53 The influence index of a node is a measure of the influence 54 this node has on others. Note that the influence that we are 55 considering, in this case, is with regard to cluster evolution. 6To compute the Influence index efficiently, we incremen- 56 We would like to find nodes that influence other nodes into tally update Companions() and Deg() for all nodes. The 57 participating in critical events. This behavior is measured number of Join and Leave events (Moves()) are used in the 58 for a node x, over all timestamps, by considering all other Sociability case also and are stored incrementally as well. 59 60 ACM Transactions on Knowledge Discovery from Data Page 40 of 42

1 Author Influence Index We describe two Activation functions,Max and Sum, for a 2 H. V. Jagadish 290.125 3 Hongjun Lu 268.5 node v as Jiawei Han 266.625 max 4 Acv (u1, u2, ..., um) = (arg max (wtv(ui)) ≥ θv (8) Philip S. Yu 251.66 1≤i≤m 5 Rajeev Rastogi 246.85 Beng Chin Ooi 237 sum 6 Acv (u1, u2, ..., um) = X wtv(ui) ≥ θv (9) Tok Wang Ling 220.428 7 1≤i≤m Heikki Mannila 206.5 8 Wenfei Fan 200.142 Here, θv denotes the activation threshold for node v. The 9 Qiang Yang 199 weights on the edges represent the likelihood of that particu- Johannes Gehrke 179.85 10 Christos Faloutsos 167.85 lar interaction leading to an activation. If the edge between 11 Rakesh Agrawal 157.875 two nodes has a high weight, it indicates that if one of the 12 Edward Y. Chang 153 nodes gets activated, the chance of it activating the other is Guy M. Lohman 131.375 high. In our case, we define the weights for an interaction 13 Dennis Shasha 129.29 14 Jennifer Widom 128.375 based on the Sociability Index values of the nodes involved, 15 Hamid Pirahesh 127.625 since Sociability can best capture the aforementioned prop- Michael J. Franklin 121.5 erty. If a node is highly sociable, it has a high propensity 16 Hector Garcia-Molina 118.625 of passing on information to other nodes it interacts with. 17 Table 2: Top 20 Influence Index Values - DBLP Data Hence, for each interaction of node x with a neighbor, y, the 18 For Peer Reviewweight of the interaction is given by 19 7. DIFFUSION MODEL FOR EVOLVING wtx(y) = SoI(y) (10) 20 NETWORKS 21 Similarly wty(x) = SoI(x). Note that since we are dealing 22 Diffusion models have been studied for complex networks [1, with diffusion over time, the SoI(x) represents the cumula- 8] and specifically in the context of influence maximiza- 23 tive value defined in (5) until the current time point. The tion [12, 13] where the task is to identify key start nodes that Sociability values thus can change over time. 24 can be used to effectively propagate information through the 25 The set of nodes activated in a given time interval i due to network. The information can be either an idea or an inno- the initial node x and the cardinality of this set are given by 26 vation that propagates through the network over time. In Rx(i) and σx(i) respectively. The total set and number of 27 this regard, Kempe et al [12, 13] discuss two models for the nodes activated due to x after T timestamps of the diffusion 28 spread of influence through social networks. We examine process are given as this scenario from an evolving perspective, where the nodes 29 T 30 and edges of the network are transient. Rx(T ) = ∪i=1Rx(i) (11) 31 Let us consider an idea or innovation that arrives into the network at timestamp a. We define four states for T 32 nodes in the evolving network - active, inactive, contagious σx(T ) = Xσx(i) (12) 33 and isolated. At the beginning of the diffusion process, at i=1 34 time a, all nodes in the network are inactive. The diffu- It is also important to consider the effect of deleted nodes 35 sion model begins with a set of nodes that are activated and edges. When a node is not participating in any interac- 36 (provided the information) at the first timestamp. These tion in the current timestamp it is said to be isolated. An 37 active nodes will be contagious briefly, in that, in the next isolated node cannot influence any other nodes since it has 38 timestamp they can activate other nodes they interact with, no interactions. 39 passing on the information they received. Subsequently, the 40 newly contagious nodes proceed to attempt to activate their Influence Maximization: Influence Maximization is an inactive neighbors. The process continues, with the infor- important problem for diffusion models and has practical 41 mation propagating through the network until at time T applications in viral marketing and epidemiology. The chal- 42 there are σ(T ) active nodes in the network. In earlier work, lenge is to find an initial set of active nodes that can influence 43 the effect of a contagious node has been limited to one times- the most number of inactive nodes over the duration of the 44 tamp, which means that an active node can attempt to acti- diffusion. 45 vate its neighbors only once. However this does not capture 46 the fact that the network topology can change, with the Problem Definition: Given a graph G that evolves over T timestamps and a diffusion model, the task is to find the 47 neighbors of nodes changing over time. After a contagious set of k initial nodes S to maximize RS(T ) where RS (T ) = 48 node has activated some of its neighbors, new nodes might come in contact with it in subsequent time instances. In this ∪x∈SRx(T ) 49 regard, we relax this constraint allowing a node to remain 50 Kempe et al [12, 13] discuss a greedy algorithm for finding contagious when confronted with new neighbors. A node the initial set that maximizes the influence. They find the 51 can thus attempt to activate each unique neighbor once. start nodes that maximize σ(T ), where σ(T ) = σx(T ). 52 When a node is surrounded by contagious nodes, it’s propen- Px∈S To find σx(T ) for all nodes x, they simulate the diffusion pro- 53 sity to get activated is given by an activation function. cess over the network. However, in our case, the network is 54 dynamic with edges and nodes getting added or deleted. At 55 Definition: The activation function for a node v, Acv() is a particular timestamp i, it is unclear how the network is 56 a non-negative function that maps the weights associated going to change at time i+1. Hence, simulating the diffusion 57 with the neighbors of v, wt(x, v) ∀x = neighbor(v) to either on the static graph will not work. Considering high-degree 58 0 or 1. nodes to start the diffusion process has been examined in 59 60 Page 41 of 42 ACM Transactions on Knowledge Discovery from Data

1 Method Activated nodes (%) Activated nodes (%) 9. REFERENCES 2 Max Activation Sum Activation 3 Random 16.67 20.39 [1] F. Alkemade and C. Castaldi. Strategies for the diffusion of 4 Accumulated Degree 51.9 65.33 innovations on social networks. Computational Economics, Influence 61.12 81 25(1-2), 2005. 5 [2] S. Asur, S. Parthasarathy, and D. Ucar. An event-based 6 Table 3: Diffusion Results framework for characterizing the evolution of interaction graphs. Technical Report Oct 2006, Updated Jun 2007, 7 OSU-CISRC-2/07-TR16., 2007. 8 social network research [22]. However, using the degree to [3] L. Backstrom, D. P. Huttenlocher, and J. M. Kleinberg. Group determine the initial nodes may not be a good option [12], formation in large social networks: membership, growth, and 9 since it is possible for nodes of high degree to be clustered, evolution. SIGKDD, 2006. 10 which limits their range. Instead, we advocate the use of the [4] A.-L. Barabasi and E. Bonabeau. Scale-free networks. 11 Scientific American, 288:60–69, 2003. Influence Index we defined in the previous section for this [5] A.-L. Barabasi, H. Jeong, R. Ravasz, Z. N¨ı£¡da, T. Vicsek, and 12 purpose. The Influence Index is an incremental measure A. Schubert. On the topology of the scientific collaboration 13 which considers the behavior of the nodes over the previous networks. Physica A, 311:590–614, 2002. timestamps and chooses nodes that have the highest degree [6] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary 14 clustering. SIGKDD, 2006. 15 of influence over other nodes. Also, by pruning followers of [7] A. Clauset, M. E. J. Newman, and C. Moore. Finding 16 influential nodes, we are ensuring that the nodes with high community structure in very large networks. Physical Review influence index are not likely to be clustered. E, 70, 2004. 17 [8] R. Cowan and N. Jonard. Network structure and the diffusion 18 of knowledge. Journal of Economic Dynamics and Control, Empirical Evaluation: We conductedForan Peerexperiment to Review28:1557–1575, 2004. 19 evaluate the performance of the Influence index-based ini- [9] T. Falkowski, J. Bartelheimer, and M. Spiliopoulou. Mining 20 tialization. To compare, we employed an approach based and visualizing the evolution of subgroups in social networks. IEEE/WIC/ACM International Conference on Web 21 on accumulated degree, where we picked nodes that had Intelligence, 2006. 22 the highest degree, over the preceding timestamps, to be [10] G. W. Flake, S. R. Lawrence, C. L. Giles, and F. M. Coetzee. 23 the start nodes. As a baseline, we implemented a ran- Self-organization and identification of web communities. IEEE dom approach where the initial nodes are chosen at ran- Computer, 36:66–71, 2002. 24 [11] M. Girvan and M. E. J. Newman. Community structure in 25 dom. We constructed a graph using a subset of nodes from social and biological networks. Proc. National Academy of 26 the DBLP collaboration network. We considered the inter- Sciences of the United States of America, 99(12):7821–7826, actions from 1997-2001 to compute sociability, degree and 2002. 27 influence scores. We then assumed the introduction of a new [12] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread 28 of influence through a social network. SIGKDD, 2003. idea at 2002 and then tracked its diffusion through the net- [13] D. Kempe, J. Kleinberg, and E. Tardos. Influential nodes in a 29 work over the next 4 timestamps (till 2006). We used an diffusion model for social networks. Proc. Intl. Colloquium on 30 active set size, k, of 5 and both the Sum and Max activa- Automata, Languages and Programming (ICALP), 2005. tion functions. We performed the experiments 100 times, [14] J. Leskovec, J. M. Kleinberg, and C. Faloutsos. Graphs over 31 time: densification laws, shrinking diameters and possible 32 choosing random activation thresholds for the nodes from explanations. SIGKDD, 2005. 33 [0,1]. The results are shown in Table 3. Our results suggest [15] D. Liben-Nowell and J. M. Kleinberg. The link prediction problem for social networks. Proc. ACM CIKM Intl. Conf. on 34 that the Influence index can be useful in this regard. It suc- ceeds in activating 61% and 81% of the nodes in the network Information and Knowledge Management, 2003. 35 [16] J. Lin, M. Vlachos, E. Keogh, and D. Gunopulos. Iterative in 4 timestamps for the Max and Sum Activation functions incremental clustering of time series. EDBT, pages 106–122, 36 respectively, clearly outperforming the other approaches. 2004. 37 [17] M. E. J. Newman. Modularity and community structure in 38 networks. Proc. National Academy of Sciences of the United States of America, 103(23):8577–8582, 2006. 39 [18] J. Reichardt and S. Bornholdt. Detecting fuzzy community 40 8. CONCLUSION AND FUTURE WORK structures in complex networks with a potts model. Physical In this paper, we have presented an event-based frame- Review Letters, 93, 2004. 41 [19] R. Samtaney, D. Silver, N. Zabusky, and J. Cao. Visualizing 42 work for characterizing the evolution of dynamic interaction features and tracking their evolution. IEEE Computer, 43 graphs. The framework is based on the use of certain criti- 27(7):20–27, 1994. cal events that facilitate our ability to compute and reason [20] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult/. 44 about novel behavior-oriented measures, which can offer new Monic - modeling and monitoring cluster transitions. 45 SIGKDD, 2006. and interesting insights for the characterization of dynamic [21] S. van Dongen. A cluster algorithm for graphs. Technical 46 behavior of such interaction graphs. We have presented a Report INS-R0010, National Research Institute for 47 diffusion model for evolving networks and have shown the Mathematics and Computer Science in the Netherlands, 48 use of behavioral patterns for influence maximization. We Amsterdam, 2000. [22] S. Wasserman and K. Faust. Social network analysis. 49 have demonstrated the efficacy of our framework in charac- Cambridge University Press, 1994. 50 terizing and reasoning on two different datasets - the DBLP [23] H. Yang, S. Parthasarathy, and S. Mehta. Mining spatial dataset and a clinical trials dataset. The application of the object patterns in scientific data. Proc. 9th Intl. Joint Conf. 51 on Artificial Intelligence, 2005. 52 behavioral patterns we obtained to a cluster link prediction scenario provided favorable results, with the Sociability In- 53 dex producing a large number of accurate predictions. In 54 the future, we would like to extend our framework to in- 55 corporate semantic information to augment our behavioral 56 analysis. Our preliminary work in this direction is detailed 57 in our technical report [2]. We would also like to examine 58 extensions to large graphs that do not fit in memory. 59 60 ACM Transactions on Knowledge Discovery from Data Page 42 of 42

1 2 3 Difference from the original conference paper: 4 5 Apart from extending the text with more explanations and results, we have 6 included 7 a new dataset, the Wikipedia webpage dataset to our study. We also 8 have a new section detailing the use of semantic content for analysis of 9 community 10 behavior. In the conference paper, we have developed behavioral measures for 11 nodes. 12 Here, we also develop measures for community evolution, making use of semantic 13 similarity concepts and information theory. 14 15 16 17 18 For Peer Review 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60